本篇博文主要内容为 2025-02-28 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分。若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-02-28)

今日共更新383篇论文,其中:

  • 自然语言处理55篇(Computation and Language (cs.CL))
  • 人工智能88篇(Artificial Intelligence (cs.AI))
  • 计算机视觉97篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习119篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis

【速读】: 该论文旨在解决Web AI代理相较于独立大型语言模型(Large Language Models, LLMs)表现出更高脆弱性的问题,尽管两者均基于相同的安全对齐模型。这种脆弱性差异令人担忧,因为Web AI代理具有更高的灵活性,可能面临更广泛的对抗性用户输入。为了解决这一问题,论文提出通过组件级分析和更细致、系统的评估框架来识别导致Web AI代理脆弱性增加的关键因素。研究发现三个关键因素加剧了Web AI代理的脆弱性:(1) 将用户目标嵌入系统提示;(2) 多步动作生成;(3) 观察能力。这些发现强调了在AI代理设计中增强安全性和鲁棒性的紧迫性,并为针对性防御策略提供了实用见解。

链接: https://arxiv.org/abs/2502.20383
作者: Jeffrey Yang Fan Chiang,Seungjae Lee,Jia-Bin Huang,Furong Huang,Yizheng Chen
机构: University of Maryland (马里兰大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Project website: this http URL

点击查看摘要

Abstract:Recent advancements in Web AI agents have demonstrated remarkable capabilities in addressing complex web navigation tasks. However, emerging research shows that these agents exhibit greater vulnerability compared to standalone Large Language Models (LLMs), despite both being built upon the same safety-aligned models. This discrepancy is particularly concerning given the greater flexibility of Web AI agents compared to standalone LLMs, which may expose them to a wider range of adversarial user inputs. To build a scaffold that addresses these concerns, this study investigates the underlying factors that contribute to the increased vulnerability of Web AI agents. Notably, this disparity stems from the multifaceted differences between Web AI agents and standalone LLMs, as well as the complex signals - nuances that simple evaluation metrics, such as success rate, often fail to capture. To tackle these challenges, we propose a component-level analysis and a more granular, systematic evaluation framework. Through this fine-grained investigation, we identify three critical factors that amplify the vulnerability of Web AI agents: (1) embedding user goals into the system prompt, (2) multi-step action generation, and (3) observational capabilities. Our findings highlight the pressing need to enhance security and robustness in AI agent design and provide actionable insights for targeted defense strategies.
zh

[NLP-1] Multi-Turn Code Generation Through Single-Step Rewards

【速读】: 该论文致力于解决多轮执行反馈下的代码生成问题。现有的方法要么不使用反馈直接生成代码,要么采用复杂的分层强化学习来优化多轮奖励。论文的关键洞察在于,代码生成可以被视为一个一步可恢复的马尔可夫决策过程(MDP),即可以从任何中间代码状态通过单步操作恢复出正确的代码。基于此,论文提出了一种名为μCode的简单且可扩展的方法,它仅利用单步奖励即可完成多轮代码生成任务。具体而言,μCode迭代训练了一个生成器以根据多轮执行反馈提供代码解决方案,同时训练了一个验证器来评估新生成的代码质量。实验结果显示,该方法在现有最先进的基准模型上取得了显著改进,并有效利用了执行反馈。

链接: https://arxiv.org/abs/2502.20380
作者: Arnav Kumar Jain,Gonzalo Gonzalez-Pumariega,Wayne Chen,Alexander M Rush,Wenting Zhao,Sanjiban Choudhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages (not including references or appendix); 6 figures (in main paper); (v1) preprint

点击查看摘要

Abstract:We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, μCode, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. μCode iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of μCode at utilizing the execution feedback. Our code is available at this https URL.
zh
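结合 [NLP-1] 的思路,可以用一个极简的 Python 玩具示例说明"生成器 + 验证器 + 单步奖励"的多轮循环(以下函数与候选代码均为假设性示意,并非论文 μCode 的实现):

```python
def run_tests(code: str) -> float:
    """玩具"执行反馈":运行候选代码并在几个用例上计分,通过率即单步奖励。"""
    try:
        env: dict = {}
        exec(code, env)
        f = env["square"]
        cases = [(2, 4), (3, 9), (-1, 1)]
        return sum(1 for x, y in cases if f(x) == y) / len(cases)
    except Exception:
        return 0.0

def generate_candidates(feedback: str) -> list:
    """玩具"生成器":真实系统中是以多轮执行反馈为条件的 LLM,这里用固定候选代替。"""
    return [
        "def square(x):\n    return x + x",   # 错误实现
        "def square(x):\n    return x * x",   # 正确实现
        "def square(x):\n    return abs(x)",  # 错误实现
    ]

def mu_code_loop(turns: int = 3) -> str:
    """多轮循环:每轮生成候选、用验证器(此处直接用单步奖励)打分并保留最优;
    因为任务"一步可恢复",一旦奖励达到 1.0 即可停止,无需多轮信用分配。"""
    best_code, best_reward, feedback = "", -1.0, ""
    for _ in range(turns):
        for code in generate_candidates(feedback):
            reward = run_tests(code)
            if reward > best_reward:
                best_code, best_reward = code, reward
        if best_reward == 1.0:
            break
        feedback = f"best reward so far: {best_reward:.2f}"
    return best_code

solution = mu_code_loop()
```

要点在于:验证器只需对"当前这份代码"给出单步评分,不必为整个多轮轨迹分配奖励,这正是把问题视为一步可恢复 MDP 带来的简化。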

[NLP-2] PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)评估基准中存在的数据泄露和性能虚高问题。为了解决这些问题,论文提出了一种名为PhantomWiki的新方法。PhantomWiki是一种动态生成独特且事实一致的文档语料库及多样化问答对的流水线,其关键在于每次评估时都会生成新的实例,而不是依赖固定的或现有的数据集。这种方法能够通过调整问题难度和语料库规模来分别解耦推理能力和检索能力的评估,并发现前沿LLMs在PhantomWiki数据集上表现出令人惊讶的挑战性。因此,论文贡献了一个可扩展且抗数据泄露的框架,用于解耦评估LLMs的推理、检索以及工具使用能力。

链接: https://arxiv.org/abs/2502.20377
作者: Albert Gong,Kamilė Stankevičiūtė,Chao Wan,Anmol Kabra,Raphael Thesmar,Johann Lee,Julius Klenke,Carla P. Gomes,Kilian Q. Weinberger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at this https URL.
zh
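[NLP-2] 中"按需生成、事实一致、抗泄露"的思路可以用如下玩具代码示意(人物、职业等数据均为虚构假设,并非 PhantomWiki 的真实流水线):

```python
import random

def generate_instance(n_people: int = 20, seed: int = 0):
    """按需生成一个"幻影"评测实例:先造一个虚构的人物-职业事实库,
    再由同一事实库同时派生文档语料与问答对,从而保证事实一致;
    每次换一个 seed 就得到全新实例,天然抗数据泄露。"""
    rng = random.Random(seed)
    jobs = ["baker", "pilot", "farmer", "chemist"]
    facts = {f"Person_{i}": rng.choice(jobs) for i in range(n_people)}
    corpus = [f"{name} works as a {job}." for name, job in facts.items()]
    qa_pairs = [(f"What is the job of {name}?", job) for name, job in facts.items()]
    return corpus, qa_pairs

corpus, qa_pairs = generate_instance(seed=42)
```

调大 n_people 相当于增大语料规模(考检索),让问题涉及多跳事实则考推理,这对应论文中"分别解耦两种能力"的做法。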

[NLP-3] Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores Knowledge Graphs and Hierarchical Non-negative Matrix Factorization

【速读】: 本文旨在解决法律领域中复杂、互连且半结构化数据的知识提取与关系推理问题,特别是通过传统方法难以高效完成的法律研究任务。论文提出的关键解决方案是构建一个集成Retrieval-Augmented Generation (RAG)、Vector Stores (VS) 和 Knowledge Graphs (KG) 的生成式AI系统,其中Knowledge Graph通过Non-Negative Matrix Factorization (NMF) 构建。该系统通过网络爬虫技术从公开平台(如Justia)收集法律文本,并结合先进的语义表示、层次关系及潜在主题发现,弥合传统基于关键词搜索与上下文理解之间的鸿沟。其核心在于利用这些技术增强法律信息检索的准确性、可解释性和推理能力,同时减少生成内容中的幻觉现象,从而提升法律研究效率并支持法律趋势预测等挑战性任务。

链接: https://arxiv.org/abs/2502.20364
作者: Ryan C. Barron,Maksim E. Eren,Olga M. Serafimova,Cynthia Matuszek,Boian S. Alexandrov
机构: Theoretical Division, Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室); CSEE, UMBC (University of Maryland, Baltimore County); Holland & Hart LLP (霍兰德与哈特律师事务所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Agentic Generative AI, powered by Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable to specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain here comprises complex data characterized by extensive, interrelated, and semi-structured knowledge systems with complex relations. It comprises constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research. Here, we introduce a generative AI system that integrates RAG, VS, and KG, constructed via Non-Negative Matrix Factorization (NMF), to enhance legal information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends-challenging tasks that are essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This framework supports legal document clustering, summarization, and cross-referencing, for scalable, interpretable, and accurate retrieval for semi-structured data while advancing computational law and AI.
zh
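[NLP-3] 用到的 NMF 可以用乘法更新法的最小实现来说明(玩具矩阵与参数均为假设,仅示意"文档-词项矩阵分解出潜在主题"这一步):

```python
import numpy as np

def nmf(X, k, iters=800, seed=0):
    """非负矩阵分解(乘法更新法)示意:X ≈ W @ H,W、H 逐元素非负。
    在论文场景中 X 可视为"文档-词项"矩阵,分解出的因子对应潜在法律主题。"""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)   # 乘法更新保持非负
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# 玩具"文档-词项"矩阵:前两篇文档围绕词 0/1,后两篇围绕词 2/3(秩恰为 2)
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [6.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 3.0],
    [0.0, 0.0, 4.0, 6.0],
])
W, H = nmf(X, k=2)
error = float(np.linalg.norm(X - W @ H))
```

非负性使每个因子可以直接读作"主题上的词分布",这是论文选择 NMF 构建知识图谱层级主题的可解释性来源。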

[NLP-4] Bridging the Creativity Understanding Gap: Small-Scale Human Alignment Enables Expert-Level Humor Ranking in LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在理解创意内容(如幽默)方面存在的显著局限性,这一问题由Hessel等人的研究通过New Yorker Cartoon Caption Contest (NYCCC)明确揭示。论文的关键在于将幽默理解分解为三个组成部分,并系统性地优化每个部分:通过改进标注提升视觉理解能力,利用LLM生成的幽默推理与解释,以及针对人类偏好数据实施定向对齐。此外,论文发现,虽然尝试通过不同人格提示模仿子群体偏好效果有限,但使用众包偏好进行模型微调表现出显著成效。最终,该方法使标题排名的准确率达到82.4%,大幅超越先前67%的基准,并达到人类专家水平。论文进一步提出观点,认为实现人工通用智能(Artificial General Intelligence, AGI)需要系统性收集跨创意领域的个体与文化偏好数据,以培养LLMs的真正创意理解能力。

链接: https://arxiv.org/abs/2502.20356
作者: Kuan Lok Zhou,Jiayi Chen,Siddharth Suresh,Reuben Narad,Timothy T. Rogers,Lalit K Jain,Robert D Nowak,Bob Mankoff,Jifan Zhang
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); University of Washington, Seattle (西雅图华盛顿大学); Air Mail and Cartoon Collections (未知)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant limitations in understanding creative content, as demonstrated by Hessel et al. (2023)'s influential work on the New Yorker Cartoon Caption Contest (NYCCC). Their study exposed a substantial gap between LLMs and humans in humor comprehension, establishing that understanding and evaluating creative content is a key challenge in AI development. We revisit this challenge by decomposing humor understanding into three components and systematically improve each: enhancing visual understanding through improved annotation, utilizing LLM-generated humor reasoning and explanations, and implementing targeted alignment with human preference data. Our refined approach achieves 82.4% accuracy in caption ranking, significantly improving upon the previous 67% benchmark and matching the performance of world-renowned human experts in this domain. Notably, while attempts to mimic subgroup preferences through various persona prompts showed minimal impact, model finetuning with crowd preferences proved remarkably effective. These findings reveal that LLM limitations in creative judgment can be effectively addressed through focused alignment to specific subgroups and individuals. Lastly, we propose the position that achieving artificial general intelligence necessitates systematic collection of human preference data across creative domains. We advocate that just as human creativity is deeply influenced by individual and cultural preferences, training LLMs with diverse human preference data may be essential for developing true creative understanding.
zh

[NLP-5] Towards Responsible AI in Education: Hybrid Recommendation System for K-12 Students Case Study

【速读】: 该论文旨在解决个性化学习推荐系统中可能无意引入的偏见问题,这些问题可能导致学习资源的不公平访问。论文提出的关键解决方案是开发一个结合图模型(graph-based modeling)和矩阵分解(matrix factorization)的推荐系统,用于为K-12学生提供课外活动、学习资源及志愿机会的个性化建议。此外,系统包含一个框架以检测和减少偏见,通过分析受保护学生群体间的反馈实现公平性保障。这强调了在教育推荐系统中持续监控的重要性,以确保所有学生都能获得平等、透明且有效的学习机会。

链接: https://arxiv.org/abs/2502.20354
作者: Nazarii Drushchak,Vladyslava Tyshchenko,Nataliya Polyakovska
机构: SoftServe Inc.; Ukrainian Catholic University (乌克兰天主教大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growth of Educational Technology (EdTech) has enabled highly personalized learning experiences through Artificial Intelligence (AI)-based recommendation systems tailored to each student needs. However, these systems can unintentionally introduce biases, potentially limiting fair access to learning resources. This study presents a recommendation system for K-12 students, combining graph-based modeling and matrix factorization to provide personalized suggestions for extracurricular activities, learning resources, and volunteering opportunities. To address fairness concerns, the system includes a framework to detect and reduce biases by analyzing feedback across protected student groups. This work highlights the need for continuous monitoring in educational recommendation systems to support equitable, transparent, and effective learning opportunities for all students.
zh
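[NLP-5] 中矩阵分解部分的核心机制可以用如下最小示例说明(评分矩阵与超参数均为假设,图结构建模与公平性检测从略):

```python
import numpy as np

def train_mf(R, mask, k=2, lr=0.02, epochs=2000, seed=0):
    """矩阵分解协同过滤的最小示意:只在有观测的 (学生, 资源) 位置上做梯度下降,
    学习学生/资源隐向量。真实系统还会叠加图结构信号,并按受保护群体分组
    比较误差以监控偏见,此处从略。"""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    P = 0.1 * rng.standard_normal((n, k))   # 学生隐向量
    Q = 0.1 * rng.standard_normal((m, k))   # 资源隐向量
    for _ in range(epochs):
        err = mask * (R - P @ Q.T)          # 仅在观测位置上计误差
        P += lr * (err @ Q)
        Q += lr * (err.T @ P)
    return P, Q

R = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 0.0],
              [0.0, 0.0, 5.0]])             # 0 表示未观测的 (学生, 资源) 对
mask = (R > 0).astype(float)
P, Q = train_mf(R, mask)
pred = P @ Q.T
rmse = float(np.sqrt(((mask * (R - pred)) ** 2).sum() / mask.sum()))
```

训练后 pred 中原本为 0 的位置即是对未接触资源的个性化预测分;把 rmse 按学生群体分组计算,就是论文所说偏见检测的最朴素形式。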

[NLP-6] KEDRec-LM: A Knowledge-distilled Explainable Drug Recommendation Large Language Model

【速读】: 本文旨在解决药物发现领域中可解释性不足的问题,尤其是在生物医学自然语言处理(Biomedical NLP)中的关键任务。论文的关键在于构建了一个名为expRxRec的综合数据集,该数据集整合了开源药物知识图谱、临床试验数据以及PubMed出版物,以支持可解释的药物发现任务。此外,提出了一种名为KEDRec-LM的指令调优大型语言模型(LLM),通过蒸馏丰富的医学知识库来实现药物推荐及推理过程的生成。这一方案的核心在于利用LLMs的强大能力,结合大规模知识蒸馏技术,提升药物发现下游任务及其实际应用的可解释性。

链接: https://arxiv.org/abs/2502.20350
作者: Kai Zhang,Rui Zhu,Shutian Ma,Jingwei Xiong,Yejin Kim,Fabricio Murai,Xiaozhong Liu
机构: Worcester Polytechnic Institute (伍斯特理工学院); Yale University (耶鲁大学); University of California, Davis (加州大学戴维斯分校); The University of Texas Health Science Center at Houston (德克萨斯大学健康科学中心休斯敦校区)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Drug discovery is a critical task in biomedical natural language processing (NLP), yet explainable drug discovery remains underexplored. Meanwhile, large language models (LLMs) have shown remarkable abilities in natural language understanding and generation. Leveraging LLMs for explainable drug discovery has the potential to improve downstream tasks and real-world applications. In this study, we utilize open-source drug knowledge graphs, clinical trial data, and PubMed publications to construct a comprehensive dataset for the explainable drug discovery task, named expRxRec. Furthermore, we introduce KEDRec-LM, an instruction-tuned LLM which distills knowledge from rich medical knowledge corpus for drug recommendation and rationale generation. To encourage further research in this area, we will publicly release both the dataset and KEDRec-LM (a copy is attached with this submission).
zh

[NLP-7] Sparse Auto-Encoder Interprets Linguistic Features in Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)内部处理和表征语言知识机制不透明的问题。为了解决这一问题,论文的关键在于提出了一种基于稀疏自编码器(Sparse Auto-Encoders, SAEs)的系统性因果分析方法,通过从语音学、音系学、形态学、句法学、语义学和语用学六个维度提取广泛的语言特征,并构建最小对比数据集和反事实句子数据集来提取、评估和干预这些特征。此外,引入了特征表示置信度(Feature Representation Confidence, FRC)和特征干预置信度(Feature Intervention Confidence, FIC)两个指标,以量化语言特征捕捉和控制语言现象的能力。研究结果揭示了LLMs中语言知识的内在表征,并展示了控制模型输出的潜力,为未来更可解释且可控的语言建模奠定了基础。

链接: https://arxiv.org/abs/2502.20344
作者: Yi Jing,Zijun Yao,Lingxu Ran,Hongzhu Guo,Xiaozhi Wang,Lei Hou,Juanzi Li
机构: Department of Computer Science and Technology, Tsinghua University (清华大学); Department of Chinese Language and Literature, Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel in tasks that require complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Previous work on linguistic mechanisms has been limited by coarse granularity, insufficient causal analysis, and a narrow focus. In this study, we present a systematic and comprehensive causal investigation using sparse auto-encoders (SAEs). We extract a wide range of linguistic features from six dimensions: phonetics, phonology, morphology, syntax, semantics, and pragmatics. We extract, evaluate, and intervene on these features by constructing minimal contrast datasets and counterfactual sentence datasets. We introduce two indices-Feature Representation Confidence (FRC) and Feature Intervention Confidence (FIC)-to measure the ability of linguistic features to capture and control linguistic phenomena. Our results reveal inherent representations of linguistic knowledge in LLMs and demonstrate the potential for controlling model outputs. This work provides strong evidence that LLMs possess genuine linguistic knowledge and lays the foundation for more interpretable and controllable language modeling in future research.
zh
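[NLP-7] 中 SAE 的"编码-干预-解码"流程可以用 numpy 写成如下示意(权重随机、维度很小,仅说明机制,并非论文实现):

```python
import numpy as np

class SparseAutoEncoder:
    """稀疏自编码器的前向/干预示意:把模型内部激活编码成过完备的稀疏特征,
    再线性解码重建。论文中的因果干预,大致对应把某个(语言学)特征置零后
    解码回激活空间、再观察模型行为变化。权重此处随机初始化,真实 SAE 需
    在大量激活数据上训练。"""

    def __init__(self, d_model: int = 4, d_feat: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_enc = 0.5 * rng.standard_normal((d_model, d_feat))
        self.b_enc = np.zeros(d_feat)
        self.W_dec = self.W_enc.T.copy()   # 玩具起见取转置(tied weights)
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)   # ReLU 产生稀疏性

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

    def intervene(self, x, feature_idx: int):
        f = self.encode(x)
        f[feature_idx] = 0.0               # 消融单个特征 == 最小因果干预
        return self.decode(f)

sae = SparseAutoEncoder()
x = np.array([1.0, -0.5, 0.3, 0.0])       # 假想的某层激活向量
features = sae.encode(x)
recon = sae.decode(features)
ablated = sae.intervene(x, feature_idx=int(features.argmax()))
```

论文的 FRC/FIC 两个指标分别衡量这样的特征"能否捕捉"与"干预后能否控制"对应的语言现象。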

[NLP-8] Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners

【速读】: 该论文试图解决的问题是如何在固定计算预算下,利用更低复杂度的模型通过其更高的生成吞吐量,在数学推理任务中超越同等规模的Transformer模型。为了解决这一问题并克服现有次二次规模推理器性能较弱的局限性,论文的关键方案是通过从预训练的Transformer模型蒸馏出纯Mamba和混合Mamba模型。这些蒸馏模型仅在80亿tokens的数据上进行训练,却在数学推理数据集上表现出色且具备良好的扩展性,同时在大批次和长序列的推理任务中显著提高了速度。尽管蒸馏过程导致零样本性能有所下降,但两种Mamba模型在固定时间预算下能够实现超过其教师模型的覆盖范围和准确性,从而开辟了新的推理计算扩展方向。

链接: https://arxiv.org/abs/2502.20339
作者: Daniele Paliotta,Junxiong Wang,Matteo Pagliardini,Kevin Y. Li,Aviv Bick,J. Zico Kolter,Albert Gu,François Fleuret,Tri Dao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements have demonstrated that the performance of large language models (LLMs) can be significantly enhanced by scaling computational resources at test time. A common strategy involves generating multiple Chain-of-Thought (CoT) trajectories and aggregating their outputs through various selection mechanisms. This raises a fundamental question: can models with lower complexity leverage their superior generation throughput to outperform similarly sized Transformers for a fixed computational budget? To address this question and overcome the lack of strong subquadratic reasoners, we distill pure and hybrid Mamba models from pretrained Transformers. Trained on only 8 billion tokens, our distilled models show strong performance and scaling on mathematical reasoning datasets while being much faster at inference for large batches and long sequences. Despite the zero-shot performance hit due to distillation, both pure and hybrid Mamba models can scale their coverage and accuracy performance past their Transformer teacher models under fixed time budgets, opening a new direction for scaling inference compute.
zh

[NLP-9] Expertise Is What We Want

【速读】: 该论文旨在解决将标准化、循证的临床指南转化为自动化临床决策支持系统时面临的不准确性及信息精细化丢失的问题。论文提出了一种名为“大型语言专家”(Large Language Expert, LLE)的应用架构,其核心在于结合大型语言模型(Large Language Models, LLMs)的灵活性与强大功能,以及专家系统的可解释性、可解释性和可靠性。解决方案的关键在于通过LLMs解决专家系统在知识整合、规范化编码和数据标准化方面的挑战,同时利用类似专家系统的思路克服LLMs存在的幻觉问题、低成本更新以及测试难题。论文以新诊断癌症患者的诊疗工作为例,展示了LLE系统在乳腺癌和结肠癌患者的真实世界数据中达到了临床级别的高精度(95%),有效填补了实际诊疗中的关键空白。

链接: https://arxiv.org/abs/2502.20335
作者: Alan Ashworth,Munir Al-Dajani,Keegan Duchicela,Kiril Kafadarov,Allison Kurian,Othman Laraki,Amina Lazrak,Divneet Mandair,Wendy McKennon,Rebecca Miksad,Jayodita Sanghvi,Travis Zack
机构: Color Health; UCSF (加州大学旧金山分校); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Clinical decision-making depends on expert reasoning, which is guided by standardized, evidence-based guidelines. However, translating these guidelines into automated clinical decision support systems risks inaccuracy and importantly, loss of nuance. We share an application architecture, the Large Language Expert (LLE), that combines the flexibility and power of Large Language Models (LLMs) with the interpretability, explainability, and reliability of Expert Systems. LLMs help address key challenges of Expert Systems, such as integrating and codifying knowledge, and data normalization. Conversely, an Expert System-like approach helps overcome challenges with LLMs, including hallucinations, atomic and inexpensive updates, and testability. To highlight the power of the Large Language Expert (LLE) system, we built an LLE to assist with the workup of patients newly diagnosed with cancer. Timely initiation of cancer treatment is critical for optimal patient outcomes. However, increasing complexity in diagnostic recommendations has made it difficult for primary care physicians to ensure their patients have completed the necessary workup before their first visit with an oncologist. As with many real-world clinical tasks, these workups require the analysis of unstructured health records and the application of nuanced clinical decision logic. In this study, we describe the design and evaluation of an LLE system built to rapidly identify and suggest the correct diagnostic workup. The system demonstrated a high degree of clinical-level accuracy (95%) and effectively addressed gaps identified in real-world data from breast and colon cancer patients at a large academic center.
zh

[NLP-10] Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

【速读】: 该论文试图解决关于大型语言模型中涌现推理能力的鲁棒性及其是否依赖于结构化推理机制的争议。论文通过全面研究开源语言模型(Llama3-70B)支持抽象规则归纳的内部机制,识别出一种涌现的符号架构,该架构通过三步计算实现抽象推理:早期层中的符号抽象头将输入标记转换为基于标记间关系的抽象变量;中间层中的符号归纳头对这些抽象变量进行序列归纳;最后,在后期层中,检索头通过检索与预测的抽象变量相关的值来预测下一个标记。解决方案的关键在于揭示神经网络中涌现推理依赖于符号机制的出现这一结论,从而为符号方法与神经网络方法之间的长期争论提供了一种可能的解答。

链接: https://arxiv.org/abs/2502.20332
作者: Yukang Yang,Declan Campbell,Kaixuan Huang,Mengdi Wang,Jonathan Cohen,Taylor Webb
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many recent studies have found evidence for emergent reasoning capabilities in large language models, but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we perform a comprehensive study of the internal mechanisms that support abstract rule induction in an open-source language model (Llama3-70B). We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, symbol abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these abstract variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted abstract variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.
zh

[NLP-11] Long-Context Inference with Retrieval-Augmented Speculative Decoding

【速读】: 该论文旨在解决长上下文大语言模型(LLMs)在推理过程中因管理键值(KV)缓存而导致的高计算开销问题。传统基于推测解码(Speculative Decoding, SD)的方法在长上下文场景下效果显著下降,主要受限于内存绑定的KV缓存操作。为了解决这一问题,论文提出了一种名为检索增强推测解码(Retrieval-Augmented Speculative Decoding, RAPID)的新方法。关键创新在于引入检索增强机制,利用检索增强的语言模型(RAG drafter)作为小型草案模型,在缩短的检索上下文中推测生成目标LLM的长上下文输出。此外,通过在推理阶段的知识迁移动态过程,进一步提升了生成质量。实验结果表明,RAPID不仅实现了超过2倍的速度提升,还在真实应用场景中展现了卓越的生成性能,特别是在超过32K上下文长度的情况下保持了稳健的加速效果。

链接: https://arxiv.org/abs/2502.20330
作者: Guanzheng Chen,Qilong Feng,Jinjie Ni,Xin Li,Michael Qizhe Shieh
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference, particularly in managing key-value (KV) caches, presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We present Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG for both accelerating and enhancing generation quality in long-context inference. RAPID introduces the RAG drafter-a draft LLM operating on shortened retrieval contexts-to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities from stronger RAG drafters, we develop an inference-time knowledge transfer dynamic that enriches the target distribution by RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both approaches, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2x speedups. Our analyses reveal that RAPID achieves robust acceleration beyond 32K context length and demonstrates superior generation quality in real-world applications.
zh
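[NLP-11] 的核心加速机制是推测解码:草稿模型成段推测、目标模型一次前向成段验证。下面用两个确定性的玩具"模型"示意这个接受/拒绝循环(与 RAPID 的 RAG drafter 相比,省去了检索上下文与概率性接受判据,仅为假设性草图):

```python
TARGET_TEXT = "the cat sat on the mat".split()

def draft_model(prefix):
    """玩具草稿模型:一次推测 3 个后续 token(RAPID 中它在缩短的检索上下文上运行)。"""
    return TARGET_TEXT[len(prefix):len(prefix) + 3]

def target_next(prefix):
    """玩具目标模型:返回它认可的下一个 token(真实系统中对整段草稿并行打分)。"""
    return TARGET_TEXT[len(prefix)] if len(prefix) < len(TARGET_TEXT) else None

def speculative_decode(max_len=6):
    """推测解码主循环:目标模型对整段草稿只需一次"验证前向",
    接受匹配的前缀,在拒绝处退回目标模型自己的 token。"""
    out, target_passes = [], 0
    while len(out) < max_len:
        draft = draft_model(out)
        if not draft:
            break
        target_passes += 1                     # 一个 chunk 只记一次目标模型前向
        accepted = []
        for tok in draft:
            if target_next(out + accepted) == tok:
                accepted.append(tok)
            else:
                break
        out.extend(accepted)
        if len(accepted) < len(draft):         # 有拒绝:用目标模型的 token 续上
            tok = target_next(out)
            if tok is None:
                break
            out.append(tok)
    return out, target_passes

generated, passes = speculative_decode()
```

本例中草稿与目标完全一致,6 个 token 只用了 2 次目标模型前向;接受率越高、KV 缓存越昂贵(长上下文场景),这种成段验证的收益就越大。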

[NLP-12] LongRoPE2: Near-Lossless LLM Context Window Scaling

【速读】: 该论文试图解决长上下文窗口(longer effective context window)预训练大型语言模型(Large Language Models, LLMs)时,如何在扩展上下文长度的同时保持短上下文性能的问题。现有方法在处理超出分布(out-of-distribution, OOD)问题时效果不佳,主要归因于旋转位置编码(Rotary Position Embedding, RoPE)高维训练不足。为解决此问题,论文提出的关键方案包括:(1) 提出一种假设,即高维RoPE训练不足是导致OOD问题的主要原因;(2) 设计了一种基于“针驱动”困惑度(perplexity)引导的进化搜索的高效RoPE重缩放算法,以缓解训练不足的问题;(3) 提出一种混合上下文窗口训练方法,在扩展上下文序列时采用重缩放的RoPE,同时通过保留原始RoPE确保短上下文性能。实验结果验证了这些贡献的有效性,并显著提升了长上下文任务的表现。

链接: https://arxiv.org/abs/2502.20082
作者: Ning Shang,Li Lyna Zhang,Siyuan Wang,Gaokai Zhang,Gilsinia Lopez,Fan Yang,Weizhu Chen,Mao Yang
机构: Microsoft (微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by “needle-driven” perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens – 80x fewer than Meta’s approach, which fails to reach the target effective context length. Code will be available at this https URL.
zh
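[NLP-12] 讨论的 RoPE 重缩放可以用如下 numpy 草图说明(固定缩放因子仅为演示;LongRoPE2 实际通过困惑度引导的进化搜索逐维确定因子):

```python
import numpy as np

def rope_frequencies(d_head=8, base=10000.0):
    """标准 RoPE:第 i 对维度的旋转频率 theta_i = base^(-2i/d)。"""
    i = np.arange(d_head // 2)
    return base ** (-2.0 * i / d_head)

def rescale_for_long_context(freqs, scale=8.0, keep_pairs=2):
    """重缩放的最简示意:高频(低维)保持不变以保住短上下文性能,
    低频(高维、论文假设其训练不足)按扩展倍数压低频率。"""
    factors = np.ones_like(freqs)
    factors[keep_pairs:] = 1.0 / scale
    return freqs * factors

def apply_rope(x, pos, freqs):
    """对 (偶, 奇) 维度配对做二维旋转;旋转不改变向量范数。"""
    x = np.asarray(x, dtype=float).copy()
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2].copy(), x[1::2].copy()
    x[0::2] = x1 * cos - x2 * sin
    x[1::2] = x1 * sin + x2 * cos
    return x

freqs = rope_frequencies()
long_freqs = rescale_for_long_context(freqs)
rotated = apply_rope(np.ones(8), pos=100, freqs=long_freqs)
```

压低低频维度的频率,等价于让更长的位置落回这些维度在预训练时见过的角度范围,从而缓解 OOD 问题;论文的混合窗口训练则保证短序列仍用原始 RoPE。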

[NLP-13] Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

【速读】: 该论文旨在构建一个适用于评估大型语言模型(Large Language Models, LLM)驱动多智能体系统(LLM-powered Multi-Agent System, LLM-MAS)能力的新基准——Collab-Overcooked,以应对传统基准在交互环境中的局限性。论文的关键解决方案在于从两个创新角度扩展现有基准:首先,设计了一个支持多样化任务与目标的多智能体框架,并通过自然语言通信促进协作;其次,引入了一组面向过程的评估指标,用于细粒度评估不同LLM驱动智能体的合作能力,这一维度在以往研究中常被忽视。实验结果显示,尽管LLMs在目标理解方面表现出色,但在主动协作及持续适应复杂任务的能力上存在显著不足。论文通过分析LLM-MAS的优势与劣势,为统一且开源的基准上的改进与评估提供了重要见解。

链接: https://arxiv.org/abs/2502.20073
作者: Haochen Sun,Shuwen Zhang,Lei Ren,Hao Xu,Hao Fu,Caixia Yuan,Xiaojie Wang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Li Auto Inc. (理想汽车)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 25 pages, 14 figures

点击查看摘要

Abstract:Large language models (LLMs) based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-powered Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks from two novel perspectives. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments over 10 popular LLMs and show that, while the LLMs present a strong ability in goal interpretation, there is a significant discrepancy in active collaboration and continuous adaption that are critical for efficiently fulfilling complicated tasks. Notably, we highlight the strengths and weaknesses in LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-sourced benchmark. Environments, 30 open-ended tasks, and an integrated evaluation package are now publicly available at this https URL.
zh

[NLP-14] Connecting the Persian-speaking World through Transliteration

【速读】: 该论文旨在解决塔吉克波斯语(Tajik Persian)与伊朗波斯语(Farsi)之间的互读问题,由于两者分别使用西里尔字母(modified Cyrillic alphabet)和波斯-阿拉伯字母(Perso-Arabic script),书面文本无法直接相互理解。鉴于互联网上的波斯语内容大多以波斯-阿拉伯字母书写,单语种塔吉克波斯语使用者难以有效接入互联网。论文提出,相较于机器翻译,基于大量平行数据的机器转写更具实践性和适用性。其解决方案的关键在于采用基于Transformer架构的Grapheme-to-Phoneme (G2P) 方法实现塔吉克波斯语到伊朗波斯语的转写,该方法在新构建的双字母表数据集上取得了chrF++评分分别为58.70(Farsi到Tajik)和74.20(Tajik到Farsi),为后续研究奠定了基准,并展示了此任务在双向转写中的复杂性。此外,论文还概述了两种书写系统之间的差异及面临的挑战,以辅助未来的转写工作。

链接: https://arxiv.org/abs/2502.20047
作者: Rayyan Merchant,Akhilesh Kakolu Ramarao,Kevin Tang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite speaking mutually intelligible varieties of the same language, speakers of Tajik Persian, written in a modified Cyrillic alphabet, cannot read Iranian and Afghan texts written in the Perso-Arabic script. As the vast majority of Persian text on the Internet is written in Perso-Arabic, monolingual Tajik speakers are unable to interface with the Internet in any meaningful way. Due to overwhelming similarity between the formal registers of these dialects and the scarcity of Tajik-Farsi parallel data, machine transliteration has been proposed as more a practical and appropriate solution than machine translation. This paper presents a transformer-based G2P approach to Tajik-Farsi transliteration, achieving chrF++ scores of 58.70 (Farsi to Tajik) and 74.20 (Tajik to Farsi) on novel digraphic datasets, setting a comparable baseline metric for future work. Our results also demonstrate the non-trivial difficulty of this task in both directions. We also provide an overview of the differences between the two scripts and the challenges they present, so as to aid future efforts in Tajik-Farsi transliteration.
zh
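[NLP-14] 任务的难点之一可以用一个逐字符查表的玩具转写器直观展示:波斯-阿拉伯文通常省略短元音,朴素映射会多写出元音字母(下面的字符表是大幅简化的假设,真实系统是基于 Transformer 的 G2P 模型):

```python
# 玩具字符表:只覆盖一个示例词,并非完整的塔吉克西里尔 -> 波斯-阿拉伯方案;
# 其中 о -> ا 本身也是简化假设
TAJIK_TO_FARSI = {
    "с": "س", "а": "ا", "л": "ل", "о": "ا", "м": "م",
}

def transliterate(word: str, table: dict) -> str:
    """逐字符查表转写,查不到的字符原样保留。"""
    return "".join(table.get(ch, ch) for ch in word)

naive = transliterate("салом", TAJIK_TO_FARSI)   # 塔吉克语"你好"
correct = "سلام"                                  # 波斯语规范写法(省略短元音)
```

朴素逐字符映射得到的 سالام 与规范写法 سلام 并不相同:塔吉克西里尔文显式书写元音,而波斯-阿拉伯文省略短元音,这正是论文指出该任务在两个方向上都非平凡的原因之一。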

[NLP-15] Polish-ASTE: Aspect-Sentiment Triplet Extraction Datasets for Polish

【速读】: 该论文试图解决Aspect-Sentiment Triplet Extraction (ASTE) 任务中数据集匮乏的问题,特别是针对斯拉夫语言(如波兰语)缺乏相关数据集的现状。论文的关键解决方案是提出了两个新的包含客户对酒店和产品评价的波兰语ASTE数据集,并结合两种ASTE技术与两种大型波兰语语言模型进行实验,以评估其性能及数据集的难度。这些新数据集采用与英文数据集相同的许可协议和文件格式,便于未来研究使用。

链接: https://arxiv.org/abs/2502.20046
作者: Marta Lango,Borys Naglik,Mateusz Lango,Iwo Naglik
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aspect-Sentiment Triplet Extraction (ASTE) is one of the most challenging and complex tasks in sentiment analysis. It concerns the construction of triplets that contain an aspect, its associated sentiment polarity, and an opinion phrase that serves as a rationale for the assigned polarity. Despite the growing popularity of the task and the many machine learning methods being proposed to address it, the number of datasets for ASTE is very limited. In particular, no dataset is available for any of the Slavic languages. In this paper, we present two new datasets for ASTE containing customer opinions about hotels and purchased products expressed in Polish. We also perform experiments with two ASTE techniques combined with two large language models for Polish to investigate their performance and the difficulty of the assembled datasets. The new datasets are available under a permissive licence and have the same file format as the English datasets, facilitating their use in future research.
zh

[NLP-16] Vision-Encoders (Already) Know What They See: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore

【速读】: 该论文试图解决Large Vision-Language Models (LVLMs) 在跨领域应用中普遍存在的对象幻觉 (object hallucination) 问题。研究表明,这种幻觉并非如先前所认为的是由于视觉编码器 (vision encoder) 的表征能力不足导致,而是可以通过更精细的评估方法来改善。论文的关键解决方案是提出了一种细粒度的CLIPScore (Fine-grained CLIPScore, F-CLIPScore),它通过在名词短语层面结合文本嵌入 (text embeddings),增强了对象级别的粒度 (object-level granularity)。实验表明,F-CLIPScore 在OHD-Caps基准测试中的准确性比传统CLIPScore高出39.6%,且无需额外训练。进一步验证显示,使用F-CLIPScore筛选数据重新训练的LVLM能够显著减少幻觉现象。
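F-CLIPScore 的核心思想可以粗略示意如下:对每个名词短语的文本嵌入与图像嵌入分别计算余弦相似度再取平均(假设嵌入向量已由 CLIP 编码器给出;以下为示意性草图,并非论文官方实现):

```python
import numpy as np

def cosine(u, v):
    """两个向量的余弦相似度。"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def f_clipscore(image_emb, phrase_embs):
    """名词短语级别的细粒度得分:
    对每个短语嵌入与图像嵌入的余弦相似度取平均,
    使评估粒度细化到对象层面而非整句描述。"""
    scores = [cosine(image_emb, p) for p in phrase_embs]
    return sum(scores) / len(scores)

# 虚构的二维"嵌入"仅作演示:一个短语与图像方向一致,另一个正交
img = np.array([1.0, 0.0])
phrases = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(f_clipscore(img, phrases))  # 0.5
```

幻觉对象对应的名词短语与图像相似度低,会显著拉低平均得分,这正是该指标可区分幻觉的原因。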

链接: https://arxiv.org/abs/2502.20034
作者: Hongseok Oh,Wonseok Hwang
机构: University of Seoul (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 4 pages

点击查看摘要

Abstract:Recently, Large Vision-Language Models (LVLMs) show remarkable performance across various domains. However, these models suffer from object hallucination. This study revisits the previous claim that the primary cause of such hallucination lies in the limited representational capacity of the vision encoder. Our analysis reveals that the capacity of the vision encoder itself is already enough for detecting object hallucination. Based on this insight, we propose a Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun phrase level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore significantly outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further validate F-CLIPScore by showing that LVLM trained with the data filtered using F-CLIPScore exhibits reduced hallucination.
zh

[NLP-17] Erasing Without Remembering: Safeguarding Knowledge Forgetting in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在模型未学习(machine unlearning)过程中存在的问题,即现有方法仅删除目标知识的确切表达,而忽略了相关或改写信息的清理,导致未学习模型仍可能回忆起与目标相关的记忆。论文的关键在于提出了一种基于扰动的方法PERMU,通过增强模型的泛化能力来有效解决这一问题,在提升模型遗忘效果的同时,保持其鲁棒性泛化性能,实验表明PERMU在未学习性能上提升了50.13%,并在泛化性能上提高了43.53%。

链接: https://arxiv.org/abs/2502.19982
作者: Huazheng Wang,Yongcheng Jing,Haifeng Sun,Yingjie Wang,Jingyu Wang,Jianxin Liao,Dacheng Tao
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we explore machine unlearning from a novel dimension, by studying how to safeguard model unlearning in large language models (LLMs). Our goal is to prevent unlearned models from recalling any related memory of the targeted knowledge. We begin by uncovering a surprisingly simple yet overlooked fact: existing methods typically erase only the exact expressions of the targeted knowledge, leaving paraphrased or related information intact. To rigorously measure such oversights, we introduce UGBench, the first benchmark tailored for evaluating the generalisation performance across 13 state-of-the-art methods. UGBench reveals that unlearned models can still recall paraphrased answers and retain target facts in intermediate layers. To address this, we propose PERMU, a perturbation-based method that significantly enhances the generalisation capabilities for safeguarding LLM unlearning. Experiments demonstrate that PERMU delivers up to a 50.13% improvement in unlearning while maintaining a 43.53% boost in robust generalisation. Our code can be found in this https URL.
zh

[NLP-18] The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多操作数算术任务(如多位数加法)中的表现不佳问题。论文指出,这一问题的核心在于LLMs采用的简单一位数前瞻启发式方法(one-digit lookahead heuristic),该方法在两操作数加法中表现尚可但无法有效处理多位数加法中复杂的进位逻辑。论文的关键解决方案是通过深入分析揭示LLMs在多操作数加法中的根本局限性,并强调无论采用何种分词策略(tokenization strategy),LLMs由于依赖于这种简单的前瞻机制,在复杂数值推理任务中的泛化能力受到本质限制。
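一位前瞻启发式为何在级联进位时失效,可以用下面的模拟直观验证(这是对论文所述启发式的假设性简化:从高位到低位逐位生成,且每一位只"看"低一位本身的列和来估计进位,不考虑进位的逐级传递):

```python
def lookahead_add(numbers):
    """一位前瞻启发式加法的简化模拟(示意性实现)。"""
    width = max(len(str(n)) for n in numbers)
    digits = [str(n).rjust(width, "0") for n in numbers]
    out = []
    for i in range(width):  # i=0 为最高位,模拟自回归从高位到低位生成
        col = sum(int(d[i]) for d in digits)
        carry = 0
        if i + 1 < width:
            nxt = sum(int(d[i + 1]) for d in digits)
            carry = nxt // 10  # 只看低一位的列和,忽略它自身收到的进位
        out.append((col + carry) % 10)
    return int("".join(map(str, out)))

a, b = 18, 4          # 两操作数:单级进位,一位前瞻足以捕获
print(lookahead_add([a, b]) == a + b)         # True
x, y, z = 99, 99, 99  # 多操作数:进位级联且单列进位可超过 1
print(lookahead_add([x, y, z]) == x + y + z)  # False
```

两操作数时单级进位大多可被一位前瞻捕获,而多操作数(或两操作数的连续进位链)会超出该启发式的"视野",与论文的逐位准确率分析一致。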

链接: https://arxiv.org/abs/2502.19981
作者: Tanja Baeumel,Josef van Genabith,Simon Ostermann
机构: German Research Center for Artificial Intelligence (DFKI); Centre for European Research in Trusted AI (CERTAIN); Department of Language Science and Technology, Saarland University (萨尔兰大学语言科学与技术系)
类目: Computation and Language (cs.CL)
备注: Pre-print

点击查看摘要

Abstract:Autoregressive large language models (LLMs) exhibit impressive performance across various tasks but struggle with simple arithmetic, such as addition of two or more operands. We show that this struggle arises from LLMs’ use of a simple one-digit lookahead heuristic, which works fairly well (but not perfect) for two-operand addition but fails in multi-operand cases, where the carry-over logic is more complex. Our probing experiments and digit-wise accuracy evaluation show that LLMs fail precisely where a one-digit lookahead is insufficient to account for cascading carries. We analyze the impact of tokenization strategies on arithmetic performance and show that all investigated models, regardless of tokenization, are inherently limited in the addition of multiple operands due to their reliance on a one-digit lookahead heuristic. Our findings reveal fundamental limitations that prevent LLMs from generalizing to more complex numerical reasoning.
zh

[NLP-19] Deterministic or probabilistic? The psychology of LLM s as random number generators

【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在生成随机数时的表现,并试图理解其行为背后的机制。论文的关键在于系统性地评估不同架构、数值范围、温度参数以及提示语言对LLMs生成随机数性能的影响。研究发现,尽管这些模型基于概率性的Transformer架构,但在生成随机数值时往往表现出确定性的响应模式。这种现象主要归因于训练数据中根深蒂固的偏差(biases)。通过分析如DeepSeek-R1等模型的行为,论文揭示了LLMs内部推理过程的部分特性,同时指出这些偏差导致生成结果缺乏真正的随机性,反映出人类认知偏差的再现。因此,论文的核心解决方案在于识别并量化这些偏差如何影响LLMs的随机数生成能力,从而为未来改进模型设计提供指导。
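衡量这类"伪随机"输出偏离真随机程度的一个常用手段是经验熵:理想均匀随机输出的熵接近 log2(取值范围大小),而确定性偏好(例如总偏爱某个"幸运数字")会使熵显著偏低。以下为示意代码,样本数据为虚构:

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """根据样本频率估计香农熵(单位:比特)。"""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

biased = [7] * 90 + [3] * 10   # 模拟 LLM 偏向某个数字的"随机"输出
uniform = list(range(10)) * 10  # 理想均匀输出(0-9 各 10 次)
print(empirical_entropy(biased) < empirical_entropy(uniform))  # True
```

对 LLM 采样得到的数字序列计算此类统计量,即可量化论文所说的训练数据偏差导致的可预测模式。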

链接: https://arxiv.org/abs/2502.19965
作者: Javier Coronado-Blázquez
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages, 12 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed text generation through inherently probabilistic context-aware mechanisms, mimicking human natural language. In this paper, we systematically investigate the performance of various LLMs when generating random numbers, considering diverse configurations such as different model architectures, numerical ranges, temperature, and prompt languages. Our results reveal that, despite their stochastic transformers-based architecture, these models often exhibit deterministic responses when prompted for random numerical outputs. In particular, we find significant differences when changing the model, as well as the prompt language, attributing this phenomenon to biases deeply embedded within the training data. Models such as DeepSeek-R1 can shed some light on the internal reasoning process of LLMs, despite arriving at similar results. These biases induce predictable patterns that undermine genuine randomness, as LLMs are nothing but reproducing our own human cognitive biases.
zh

[NLP-20] Collaborative Stance Detection via Small-Large Language Model Consistency Verification

【速读】: 该论文旨在解决在社交媒体立场检测任务中过度依赖大规模语言模型(Large Language Models, LLMs)导致的计算成本过高问题,特别是在需要处理海量数据的实际社交监测系统中。论文提出了一种名为“通过小规模-大规模语言模型一致性验证的协作立场检测”(Collaborative Stance Detection via Small-Large Language Model Consistency Verification, 简称CoVer)的框架,作为解决方案的关键在于通过共享上下文的批量推理以及逻辑验证机制,优化LLM的使用效率。具体而言,CoVer采用批量处理文本的方式,在共享上下文中利用LLM进行推理以获得立场预测及解释;随后引入小规模语言模型(Small Language Models, SLMs)进行逻辑一致性验证,以排除上下文噪声引起的偏差;最终对逻辑一致性较低的文本使用基于一致性加权的LLM预测聚合进行分类。实验表明,CoVer在零样本设置下超越现有最先进的方法,并显著降低了每条推文的LLM查询次数至0.54次,同时提升了性能表现。
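其中"一致性加权的 LLM 预测聚合"这一步可以用如下草图说明(函数与示例数据均为假设性示意,并非 CoVer 的原始实现):

```python
from collections import defaultdict

def aggregate_stance(predictions):
    """一致性加权聚合的简化示意:
    predictions 为 [(立场标签, 一致性得分)] 列表,
    对每个标签累加一致性得分,返回总权重最高的标签。"""
    weights = defaultdict(float)
    for label, consistency in predictions:
        weights[label] += consistency
    return max(weights, key=weights.get)

# 某条推文经多轮 LLM 批量推理得到的(立场, SLM 一致性验证得分)记录
preds = [("支持", 0.9), ("反对", 0.4), ("支持", 0.8), ("反对", 0.3)]
print(aggregate_stance(preds))  # 支持
```

逻辑一致性低的预测权重小,从而在最终聚合时被高一致性预测压制,这对应论文中对反复低一致性文本的处理策略。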

链接: https://arxiv.org/abs/2502.19954
作者: Yu Yan,Sheng Sun,Zixiang Tang,Teli Liu,Min Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Stance detection on social media aims to identify attitudes expressed in tweets towards specific targets. Current studies prioritize Large Language Models (LLMs) over Small Language Models (SLMs) due to the overwhelming performance improvements provided by LLMs. However, heavily relying on LLMs for stance detection, regardless of the cost, is impractical for real-world social media monitoring systems that require vast data analysis. To this end, we propose the Collaborative Stance Detection via Small-Large Language Model Consistency Verification (CoVer) framework, which enhances LLM utilization via context-shared batch reasoning and logical verification between LLM and SLM. Specifically, instead of processing each text individually, CoVer processes texts batch-by-batch, obtaining stance predictions and corresponding explanations via LLM reasoning in a shared context. Then, to exclude the bias caused by context noises, CoVer introduces the SLM for logical consistency verification. Finally, texts that repeatedly exhibit low logical consistency are classified using consistency-weighted aggregation of prior LLM stance predictions. Our experiments show that CoVer outperforms state-of-the-art methods across multiple benchmarks in the zero-shot setting, achieving 0.54 LLM queries per tweet while significantly enhancing performance. Our CoVer offers a more practical solution for LLM deployment in social media stance detection.
zh

[NLP-21] GeoEdit: Geometric Knowledge Editing for Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在更新特定知识时面临的挑战,即如何有效融入新知识的同时避免破坏与无关通用知识相关的参数。传统基于训练的方法通常难以同时实现这两点。为了解决这一问题,论文提出了一种名为几何知识编辑(Geometric Knowledge Editing, GeoEdit)的新框架。GeoEdit 的关键在于利用微调过程中参数更新的几何关系,通过方向感知的知识识别方法区分与新知识更新相关和与通用知识扰动相关的神经元。它通过保留与现有知识方向近似平行的神经元,并采用“先遗忘后学习”的编辑策略处理相反方向的神经元,从而在更新知识的同时保持模型的泛化能力。此外,引入的重要性引导任务向量融合技术进一步增强了模型编辑性能,通过过滤冗余信息并提供自适应的神经元级别加权来优化效果。
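摘要所述的方向感知神经元分类逻辑可以粗略示意如下(阈值与判定规则为按摘要描述作的假设性简化,并非 GeoEdit 原始代码):

```python
import numpy as np

def classify_update(old_dir, new_dir, tau=0.3):
    """按参数更新方向与现有知识方向的余弦关系分类(示意):
    余弦近似正交 -> 跳过更新以保护通用知识;
    方向一致 -> 融合新旧知识;方向相反 -> 先遗忘后学习。
    tau 为假设的正交判定阈值。"""
    cos = float(np.dot(old_dir, new_dir) /
                (np.linalg.norm(old_dir) * np.linalg.norm(new_dir)))
    if abs(cos) < tau:
        return "skip"
    return "merge" if cos > 0 else "forget-then-learn"

print(classify_update(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # skip
print(classify_update(np.array([1.0, 0.0]), np.array([1.0, 0.1])))   # merge
print(classify_update(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # forget-then-learn
```

实际方法作用于微调前后参数更新向量的神经元级几何关系,这里仅用二维向量演示三类判定分支。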

链接: https://arxiv.org/abs/2502.19953
作者: Yujie Feng,Liming Zhan,Zexin Lu,Yongxin Xu,Xu Chu,Yasha Wang,Jiannong Cao,Philip S. Yu,Xiao-Ming Wu
机构: The Hong Kong Polytechnic University (香港理工大学); Huawei Hong Kong Research Center (华为香港研究中心); Peking University (北京大学); University of Illinois at Chicago (芝加哥大学伊利诺伊分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Regular updates are essential for maintaining up-to-date knowledge in large language models (LLMs). Consequently, various model editing methods have been developed to update specific knowledge within LLMs. However, training-based approaches often struggle to effectively incorporate new knowledge while preserving unrelated general knowledge. To address this challenge, we propose a novel framework called Geometric Knowledge Editing (GeoEdit). GeoEdit utilizes the geometric relationships of parameter updates from fine-tuning to differentiate between neurons associated with new knowledge updates and those related to general knowledge perturbations. By employing a direction-aware knowledge identification method, we avoid updating neurons with directions approximately orthogonal to existing knowledge, thus preserving the model’s generalization ability. For the remaining neurons, we integrate both old and new knowledge for aligned directions and apply a “forget-then-learn” editing strategy for opposite directions. Additionally, we introduce an importance-guided task vector fusion technique that filters out redundant information and provides adaptive neuron-level weighting, further enhancing model editing performance. Extensive experiments on two publicly available datasets demonstrate the superiority of GeoEdit over existing state-of-the-art methods.
zh

[NLP-22] Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation

【速读】: 该论文旨在解决基于合成数据的质量估计(QE)模型因分布偏移而导致性能下降的问题。合成QE数据中的分布偏移主要表现为伪翻译与真实翻译之间的差异,以及伪标签未能反映人类偏好。为应对这一挑战,论文提出了一种名为ADSQE的新框架,其关键在于通过约束束搜索算法减少伪翻译与真实翻译之间的差异,并利用不同的生成模型增强翻译多样性;同时引入参考译文作为翻译监督信号,指导生成和标注过程以提升词级标签质量。此外,ADSQE通过识别覆盖连续错误标记的最短短语来模拟人工标注行为,从而赋予最终的短语级标签。论文特别指出,翻译模型无法准确标注自身的翻译结果。实验结果表明,ADSQE在有监督和无监督设置下均优于现有技术(SOTA)基线模型如COMET,且进一步分析为其他任务的奖励模型提供了有益启示。
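其中"覆盖连续错误 token 的最短短语"这一步可以用如下草图实现(OK/BAD 标签沿用词级 QE 任务的常见标注约定;代码为假设性示意):

```python
def phrase_spans(labels):
    """找出覆盖每段连续 BAD token 的最短短语区间:
    输入词级 OK/BAD 标签序列,输出 (起始下标, 结束下标) 列表。"""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "BAD" and start is None:
            start = i                      # 进入一段连续错误
        elif lab != "BAD" and start is not None:
            spans.append((start, i - 1))   # 错误段结束,记录区间
            start = None
    if start is not None:                  # 序列以 BAD 结尾的情形
        spans.append((start, len(labels) - 1))
    return spans

print(phrase_spans(["OK", "BAD", "BAD", "OK", "BAD"]))  # [(1, 2), (4, 4)]
```

每个区间对应一个短语级标签,模拟人工标注时将相邻错误词合并为一个错误短语的行为。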

链接: https://arxiv.org/abs/2502.19941
作者: Xiang Geng,Zhejian Lai,Jiajun Chen,Hao Yang,Shujian Huang
机构: National Key Laboratory for Novel Software Technology, Nanjing University (南京大学国家重点实验室); Huawei Translation Services Center (华为翻译服务中心), Beijing, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as the reward models for the translation task. Due to data scarcity, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or in pseudo labels that do not align with human preferences. To tackle this issue, we introduce ADSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. ADSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes, enhancing the quality of word-level labels. ADSQE further identifies the shortest phrase covering consecutive error tokens, mimicking human annotation behavior, to assign the final phrase-level labels. Notably, we underscore that the translation model cannot annotate translations of itself accurately. Extensive experiments demonstrate that ADSQE outperforms SOTA baselines like COMET in both supervised and unsupervised settings. Further analysis offers insights into synthetic data generation that could benefit reward models for other tasks.
zh

[NLP-23] Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents

【速读】: 该论文旨在解决多模态大型语言模型(Multimodal Large Language Models, MLLMs)在处理图像和复杂指令时面临的挑战,具体表现为现有大规模视觉指令微调数据集存在关键缺陷,如指令-图像配对不匹配和低质量图像等问题。这些问题导致训练效率低下,并限制了模型性能的提升,因为模型会浪费资源在噪声或无关的数据上,而这些数据对整体能力的提升贡献有限。为了解决这一问题,论文提出了一种通过代理协作的视觉为中心的选择方法(Visual-Centric Selection approach via Agents Collaboration, ViSA)。其关键是结合视觉代理协作进行图像质量评估以及基于视觉的指令质量评估,从而筛选出具有丰富视觉信息且与高质量图像相关的高质指令数据,最终仅使用原始数据的2.5%即在七个基准测试中表现出色,验证了所提方法的高效性和有效性。

链接: https://arxiv.org/abs/2502.19917
作者: Zhenyu Liu,Yunxin Li,Baotian Hu,Wenhan Luo,Yaowei Wang,Min Zhang
机构: Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)); Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:To improve Multimodal Large Language Models' (MLLMs) ability to process images and complex instructions, researchers predominantly curate large-scale visual instruction tuning datasets, which are either sourced from existing vision tasks or synthetically generated using LLMs and image descriptions. However, they often suffer from critical flaws, including misaligned instruction-image pairs and low-quality images. Such issues hinder training efficiency and limit performance improvements, as models waste resources on noisy or irrelevant data with minimal benefit to overall capability. To address this issue, we propose a Visual-Centric Selection approach via Agents Collaboration (ViSA), which centers on image quality assessment and image-instruction relevance evaluation. Specifically, our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images. Finally, we reorganize 80K instruction data from large open-source datasets. Extensive experiments demonstrate that ViSA outperforms or is comparable to current state-of-the-art models on seven benchmarks, using only 2.5% of the original data, highlighting the efficiency of our data selection approach. Moreover, we conduct ablation studies to validate the effectiveness of each component of our method. The code is available at this https URL.
zh

[NLP-24] Order Doesnt Matter But Reasoning Does: Training LLM s with Order-Centric Augmentation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在逻辑推理过程中面临的两个主要问题:推理顺序变化的适应性不足以及无法泛化到逻辑等价变换的情况。传统方法中,LLMs倾向于依赖固定的顺序模式,而非真正的逻辑理解,这限制了其推理能力的灵活性和鲁棒性。为了解决这些问题,论文的关键方案是提出了一种基于逻辑推理交换律的顺序中心数据增强框架。该框架通过随机打乱独立的前提条件引入条件顺序增强,并利用有向无环图(Directed Acyclic Graph, DAG)建模推理步骤间的依赖关系,从而实现推理步骤的有效重排序,同时保持逻辑正确性。通过这种顺序中心的增强策略,模型能够发展出更灵活且更具泛化的推理过程,显著提升LLMs在多样逻辑结构下的推理性能和适应能力。
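基于 DAG 的推理步骤重排可以用暴力枚举拓扑序来示意(仅适用于小规模示例的假设性简化实现;实际增强框架会在大规模数据上采用更高效的采样方式):

```python
from itertools import permutations

def valid_orders(steps, deps):
    """枚举所有保持依赖关系的步骤排列(即 DAG 的拓扑序):
    deps[b] 为必须排在步骤 b 之前的前置步骤集合。"""
    orders = []
    for perm in permutations(steps):
        pos = {s: i for i, s in enumerate(perm)}
        if all(pos[a] < pos[b] for b, pre in deps.items() for a in pre):
            orders.append(perm)
    return orders

# 步骤 C 依赖 A 和 B,而 A、B 相互独立,可交换次序
print(valid_orders(["A", "B", "C"], {"C": {"A", "B"}}))
# [('A', 'B', 'C'), ('B', 'A', 'C')]
```

独立前提(如 A、B)的任意重排都保持逻辑正确,这正是论文利用交换律构造增强数据的依据。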

链接: https://arxiv.org/abs/2502.19907
作者: Qianxi He,Qianyu He,Jiaqing Liang,Yanghua Xiao,Weikang Zhou,Zeye Sun,Fei Yu
机构: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University (复旦大学数据科学重点实验室,计算机科学学院); School of Data Science, Fudan University (复旦大学数据科学学院); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Logical reasoning is essential for large language models (LLMs) to ensure accurate and coherent inference. However, LLMs struggle with reasoning order variations and fail to generalize across logically equivalent transformations. LLMs often rely on fixed sequential patterns rather than true logical understanding. To address this issue, we introduce an order-centric data augmentation framework based on commutativity in logical reasoning. We first randomly shuffle independent premises to introduce condition order augmentation. For reasoning steps, we construct a directed acyclic graph (DAG) to model dependencies between steps, which allows us to identify valid reorderings of steps while preserving logical correctness. By leveraging order-centric augmentations, models can develop a more flexible and generalized reasoning process. Finally, we conduct extensive experiments across multiple logical reasoning benchmarks, demonstrating that our method significantly enhances LLMs’ reasoning performance and adaptability to diverse logical structures. We release our codes and augmented data in this https URL.
zh

[NLP-25] Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models

【速读】: 该论文试图解决小型语言模型(SLMs)在边缘设备部署中因高效率和低计算成本而日益普及,但其安全风险未受到足够重视的问题。论文通过全面的实证研究评估了13种最先进的SLMs在多种越狱攻击下的安全性表现,发现大多数SLMs容易受到现有越狱攻击的影响,部分甚至易受直接有害指令的攻击。为应对这些安全关切,论文评估了几种代表性防御方法,并展示了它们在增强SLMs安全性方面的有效性。此外,还分析了不同SLM技术(如架构压缩、量化、知识蒸馏等)可能导致的安全性下降问题。论文的关键在于通过系统性的实验和分析揭示SLMs的安全挑战,并为开发更健壮和安全的SLMs提供有价值的见解。

链接: https://arxiv.org/abs/2502.19883
作者: Sibo Yi,Tianshuo Cong,Xinlei He,Qi Li,Jiaxing Song
机构: Tsinghua University (清华大学); The Hong Kong University of Science and Technology (Guangzhou)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages. 6 figures

点击查看摘要

Abstract:Small language models (SLMs) have become increasingly prominent in deployment on edge devices due to their high efficiency and low computational cost. While researchers continue to advance the capabilities of SLMs through innovative training strategies and model compression techniques, the security risks of SLMs have received considerably less attention compared to large language models (LLMs). To fill this gap, we provide a comprehensive empirical study to evaluate the security performance of 13 state-of-the-art SLMs under various jailbreak attacks. Our experiments demonstrate that most SLMs are quite susceptible to existing jailbreak attacks, while some of them are even vulnerable to direct harmful instructions. To address the safety concerns, we evaluate several representative defense methods and demonstrate their effectiveness in enhancing the security of SLMs. We further analyze the potential security degradation caused by different SLM techniques including architecture compression, quantization, knowledge distillation, and so on. We expect that our research can highlight the security challenges of SLMs and provide valuable insights to future work in developing more robust and secure SLMs.
zh

[NLP-26] MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge

【速读】: 该论文旨在解决现有多模态知识编辑基准测试存在的局限性问题,即主要关注以简单三元组形式表示的实体级知识,而无法有效捕捉现实世界中多模态信息的复杂性。为了解决这一问题,论文提出了一种名为MMKE-Bench的综合性多模态知识编辑基准(MultiModal Knowledge Editing Benchmark),其关键在于通过引入三种类型的编辑任务——视觉实体编辑、视觉语义编辑以及用户特定编辑,并采用自由格式的自然语言来表示和编辑知识,从而提供更灵活有效的知识编辑方式。此外,该基准包含跨33个广泛类别的2,940条知识和8,363张图像,并通过自动生成与人工验证相结合的方式构建评估问题。最终评估结果显示,当前最先进的几种知识编辑方法在所有标准下均未表现出全面优势,特别是视觉和用户特定编辑任务尤为具有挑战性。因此,MMKE-Bench为评估多模态知识编辑技术的鲁棒性设定了新的标准,推动了该快速发展的领域向前发展。

链接: https://arxiv.org/abs/2502.19870
作者: Yuntao Du,Kailin Jiang,Zhi Gao,Chenrui Shi,Zilong Zheng,Siyuan Qi,Qing Li
机构: State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室); BIGAI (BIGAI); School of Software & Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University (山东大学); University of Science and Technology of China (中国科学技术大学); State Key Laboratory of General Artificial Intelligence, Peking University (北京大学); Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology (北京理工大学计算机科学与技术学院 北京市智能信息技术重点实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge editing techniques have emerged as essential tools for updating the factual knowledge of large language models (LLMs) and multimodal models (LMMs), allowing them to correct outdated or inaccurate information without retraining from scratch. However, existing benchmarks for multimodal knowledge editing primarily focus on entity-level knowledge represented as simple triplets, which fail to capture the complexity of real-world multimodal information. To address this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge Editing Benchmark, designed to evaluate the ability of LMMs to edit diverse visual knowledge in real-world scenarios. MMKE-Bench addresses these limitations by incorporating three types of editing tasks: visual entity editing, visual semantic editing, and user-specific editing. Besides, MMKE-Bench uses free-form natural language to represent and edit knowledge, offering a more flexible and effective format. The benchmark consists of 2,940 pieces of knowledge and 8,363 images across 33 broad categories, with evaluation questions automatically generated and human-verified. We assess five state-of-the-art knowledge editing methods on three prominent LMMs, revealing that no method excels across all criteria, and that visual and user-specific edits are particularly challenging. MMKE-Bench sets a new standard for evaluating the robustness of multimodal knowledge editing techniques, driving progress in this rapidly evolving field.
zh

[NLP-27] MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue

【速读】: 该论文旨在解决传统心理健康干预方法(如咨询和聊天机器人)在应对抑郁和焦虑等日益严重的心理健康问题时,因缺乏情感深度和个性化互动而效果不佳的问题。尽管大型语言模型(Large Language Models, LLMs)具有创造更人性化交互的潜力,但它们在捕捉微妙情绪方面仍存在困难。为此,论文提出了一种名为MIND(Multi-agent INner Dialogue)的新范式,其关键在于利用LLMs的生成能力和角色扮演能力,通过预定义的交互式疗愈框架,为LLM代理分配不同角色,以与用户进行沉浸式的内心对话,从而提供更具代入感的心理疗愈体验。实验结果表明,MIND相比传统方法提供了更友好的用户体验,并有效释放了LLMs在心理疗愈中的巨大潜力。

链接: https://arxiv.org/abs/2502.19860
作者: Yujia Chen,Changsong Li,Yiming Wang,Qingqing Xiao,Nan Zhang,Zifan Kong,Peng Wang,Binyu Yan
机构: Sichuan University (四川大学); Shanghai Jiao Tong University (上海交通大学); Mental Health Center, West China Hospital, Sichuan University (四川大学华西医院心理卫生中心); West China School of Nursing, Sichuan University (四川大学华西护理学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mental health issues such as depression and anxiety are worsening in today's competitive society. Traditional healing approaches like counseling and chatbots often fail to engage effectively: they tend to provide generic responses lacking emotional depth. Although large language models (LLMs) have the potential to create more human-like interactions, they still struggle to capture subtle emotions. This requires LLMs to be equipped with human-like adaptability and warmth. To fill this gap, we propose MIND (Multi-agent INner Dialogue), a novel paradigm that provides more immersive psychological healing environments. Considering the strong generative and role-playing ability of LLM agents, we predefine an interactive healing framework and assign LLM agents different roles within the framework to engage in interactive inner dialogues with users, thereby providing an immersive healing experience. We conduct extensive human experiments in various real-world healing dimensions, and find that MIND provides a more user-friendly experience than traditional paradigms. This demonstrates that MIND effectively leverages the significant potential of LLMs in psychological healing.
zh

[NLP-28] Team A at SemEval-2025 Task 11: Breaking Language Barriers in Emotion Detection with Multilingual Models

【速读】: 该论文旨在解决基于文本的情绪检测中“感知情绪识别”的问题,具体是从文本片段中识别说话者的感知情绪,涉及六种类别:喜悦、悲伤、恐惧、愤怒、惊讶或厌恶。论文的关键在于利用多语言嵌入(multilingual embeddings)结合全连接层的方法,实现了最佳性能,强调了采用多语言表征在提升文本情绪检测鲁棒性方面的优势。

链接: https://arxiv.org/abs/2502.19856
作者: P Sam Sahil,Anupam Jamatia
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper describes the system submitted by Team A to SemEval 2025 Task 11, ``Bridging the Gap in Text-Based Emotion Detection.‘’ The task involved identifying the perceived emotion of a speaker from text snippets, with each instance annotated with one of six emotions: joy, sadness, fear, anger, surprise, or disgust. A dataset provided by the task organizers served as the foundation for training and evaluating our models. Among the various approaches explored, the best performance was achieved using multilingual embeddings combined with a fully connected layer. This paper details the system architecture, discusses experimental results, and highlights the advantages of leveraging multilingual representations for robust emotion detection in text.
zh

[NLP-29] ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments ICLR2025

【速读】: 该论文试图解决现有代码生成基准测试未能充分捕捉多轮交互中多样化反馈的问题,从而限制了对大型语言模型(LLMs)在这些场景下评估的能力。为了解决这一问题,论文的关键解决方案在于提出了一个新的基准测试环境CONVCODEWORLD,它能够模拟九种不同的交互式代码生成场景,并系统性地结合三种类型的反馈:编译反馈、具有不同测试覆盖率的执行反馈以及由GPT-4o生成的不同专业水平的口头反馈。此外,还引入了CONVCODEBENCH,这是一个使用预生成反馈日志的快速静态版本基准测试,保持了与CONVCODEWORLD之间较高的Spearman等级相关性(0.82至0.99),以此减少动态口头反馈生成的成本。通过这些创新,研究揭示了LLMs在不同反馈条件下的表现差异及其局限性。所有实现和基准都将公开发布。
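文中用于衡量静态基准(CONVCODEBENCH)与动态基准排名一致性的 Spearman 等级相关系数,可用纯 Python 简单计算(以下为无并列名次的简化实现,示例得分为虚构数据):

```python
def spearman(x, y):
    """Spearman 等级相关系数(假设无并列值):
    先将数值换成名次,再对名次计算皮尔逊相关。"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# 若静态与动态基准给出的模型得分排序完全一致,相关系数为 1
print(spearman([1.0, 2.5, 3.0, 4.2], [10, 20, 30, 40]))  # 1.0
```

论文报告的 0.82 至 0.99 即此系数:只比较模型排名而非绝对得分,排名越一致,静态基准就越能替代昂贵的动态口头反馈生成。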

链接: https://arxiv.org/abs/2502.19852
作者: Hojae Han,Seung-won Hwang,Rajhans Samdani,Yuxiong He
机构: Snowflake AI Research; Seoul National University (首尔国立大学)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICLR 2025

点击查看摘要

Abstract:Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold: First, we introduce CONVCODEWORLD, a novel and reproducible environment for benchmarking interactive code generation. CONVCODEWORLD simulates 9 distinct interactive code generation scenarios while systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise. Second, we introduce CONVCODEBENCH, a fast, static version of the benchmark that uses pre-generated feedback logs, eliminating the need for costly dynamic verbal feedback generation while maintaining strong Spearman's rank correlations (0.82 to 0.99) with CONVCODEWORLD. Third, extensive evaluations of both closed-source and open-source LLMs including R1-Distill on CONVCODEWORLD reveal key insights: (a) LLM performance varies significantly based on the feedback provided; (b) weaker LLMs, with sufficient feedback, can outperform single-turn results of state-of-the-art LLMs without feedback; (c) training on a specific feedback combination can limit an LLM's ability to utilize unseen combinations; (d) LLMs that solve problems in fewer turns (high MRR) may not solve as many problems overall (high Recall), and vice versa. All implementations and benchmarks will be made publicly available at this https URL
zh

[NLP-30] Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation

【速读】: 该论文旨在探究自一致性(self-consistency)在提升推理能力方面的有效性背后的动力学机制,并提出一种新的解决方案。论文将自一致性重新定义为动态分布对齐问题,揭示了解码温度不仅控制采样随机性,还主动塑造潜在答案分布的作用。针对高解码温度需要过大的样本量才能稳定,而低解码温度可能放大偏差的问题,论文提出了一种基于置信度的动态校准机制:在不确定性条件下通过锐化采样分布以与高概率模式对齐,而在置信度较高时促进探索。实验结果表明,该方法在有限样本情况下优于固定多样性的基线模型,在不同初始温度下均提升了平均和最佳性能,且无需额外数据或模块。这一成果确立了自一致性是一个采样动力学与演化答案分布同步的挑战。
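解码温度如何塑造潜在答案分布,可以用温度缩放的 softmax 直观演示(示意代码,logits 为虚构数值;置信度驱动机制即据此在锐化与探索之间动态选择温度):

```python
import math

def temperature_softmax(logits, T):
    """温度缩放的 softmax:T 越小分布越尖锐(向高概率模式对齐),
    T 越大分布越平坦(鼓励探索)。减去最大值保证数值稳定。"""
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = temperature_softmax(logits, 0.3)  # 低温:不确定时锐化分布
flat = temperature_softmax(logits, 2.0)   # 高温:置信时促进探索
print(sharp[0] > flat[0])  # True:低温下最高概率模式被放大
```

固定温度要么需要过大的样本量(高温),要么放大偏差(低温);按置信度动态调节 T 正是论文方案的核心。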

链接: https://arxiv.org/abs/2502.19830
作者: Yiwei Li,Ji Zhang,Shaoxiong Feng,Peiwen Yuan,Xinglin Wang,Jiayi Shi,Yueqi Zhang,Chuyi Tan,Boyuan Pan,Yao Hu,Kan Li
机构: School of Computer Science, Beijing Institute of Technology (北京理工大学); Xiaohongshu Inc (小红书)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-consistency improves reasoning by aggregating diverse stochastic samples, yet the dynamics behind its efficacy remain underexplored. We reframe self-consistency as a dynamic distributional alignment problem, revealing that decoding temperature not only governs sampling randomness but also actively shapes the latent answer distribution. Given that high temperatures require prohibitively large sample sizes to stabilize, while low temperatures risk amplifying biases, we propose a confidence-driven mechanism that dynamically calibrates temperature: sharpening the sampling distribution under uncertainty to align with high-probability modes, and promoting exploration when confidence is high. Experiments on mathematical reasoning tasks show this approach outperforms fixed-diversity baselines under limited samples, improving both average and best-case performance across varying initial temperatures without additional data or modules. This establishes self-consistency as a synchronization challenge between sampling dynamics and evolving answer distributions.
zh

[NLP-31] Foot-In-The-Door: A Multi-turn Jailbreak for LLMs

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中因“越狱”(jailbreak)现象导致的安全性挑战。越狱指恶意提示词绕过内置防护机制,诱发出有害的未授权输出。论文的关键创新在于提出了一种基于心理学“登门槛效应”(Foot-in-the-Door, FITD)的新方法,通过逐步升级用户查询的恶意意图并利用桥梁提示词引导模型自身产生有毒响应,从而实现高效的越狱攻击。实验结果表明,FITD 方法在两个越狱基准测试中实现了平均 94% 的攻击成功率,显著优于现有最先进的方法。此外,论文深入分析了 LLM 的自我腐败机制,揭示了当前对齐策略中的漏洞,并强调了多轮交互带来的潜在风险。代码已开源。

链接: https://arxiv.org/abs/2502.19820
作者: Zixuan Weng,Xiaolong Jin,Jinyuan Jia,Xiangyu Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 8 figures

点击查看摘要

Abstract:Ensuring AI safety is crucial as large language models become increasingly integrated into real-world applications. A key challenge is jailbreak, where adversarial prompts bypass built-in safeguards to elicit harmful disallowed outputs. Inspired by psychological foot-in-the-door principles, we introduce FITD, a novel multi-turn jailbreak method that leverages the phenomenon where minor initial commitments lower resistance to more significant or more unethical requests. Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and aligns the model's response by itself to induce toxic responses. Extensive experimental results on two jailbreak benchmarks demonstrate that FITD achieves an average attack success rate of 94% across seven widely used models, outperforming existing state-of-the-art methods. Additionally, we provide an in-depth analysis of LLM self-corruption, highlighting vulnerabilities in current alignment strategies and emphasizing the risks inherent in multi-turn interactions. Our code is available at this https URL.
zh

[NLP-32] Text classification using machine learning methods

【速读】: 该论文旨在解决产品自动分类的问题,通过使用机器学习方法构建可用于自动分类产品的模型。解决方案的关键在于将产品名称从文本表示转换为数值向量(即词嵌入,word embedding),文中采用了多种嵌入方法(包括Count Vectorization、TF-IDF、Word2Vec、FASTTEXT和GloVe)。在获得数值化的向量表示后,结合多种机器学习算法(如Logistic Regression、Multinomial Naive Bayes、kNN、Artificial Neural Networks、Support Vector Machines及Decision Trees及其变体)进行分类。实验结果表明,支持向量机(SVM)、逻辑回归(Logistic Regression)和随机森林(Random Forests)在分类准确性方面表现优异,而词嵌入方法中FASTTEXT取得了最佳效果。
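以文中提到的 TF-IDF 为例,从文本到数值向量的词嵌入过程可用纯 Python 粗略示意(实际实验通常直接使用 sklearn 的 TfidfVectorizer;以下示例文档为虚构,权重公式采用常见的 log(N/df)+1 变体):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF 的极简实现:返回每篇文档 {词: tf*idf} 的稀疏表示。"""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # 文档频率:词出现在多少篇文档中
    for toks in tokenized:
        df.update(set(toks))
    idf = {w: math.log(n / df[w]) + 1 for w in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        vectors.append({w: (c / total) * idf[w] for w, c in tf.items()})
    return vectors

docs = ["red apple juice", "green apple", "red wine"]
vecs = tfidf(docs)
# "apple" 出现在两篇文档中,权重低于只出现一次的 "juice"
print(vecs[0]["apple"] < vecs[0]["juice"])  # True
```

得到数值向量后,即可接入文中列举的 Logistic Regression、SVM、随机森林等分类器进行产品自动分类。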

链接: https://arxiv.org/abs/2502.19801
作者: Bogdan Oancea
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper we present the results of an experiment aimed to use machine learning methods to obtain models that can be used for the automatic classification of products. In order to apply automatic classification methods, we transformed the product names from a text representation to numeric vectors, a process called word embedding. We used several embedding methods: Count Vectorization, TF-IDF, Word2Vec, FASTTEXT, and GloVe. Having the product names in a form of numeric vectors, we proceeded with a set of machine learning methods for automatic classification: Logistic Regression, Multinomial Naive Bayes, kNN, Artificial Neural Networks, Support Vector Machines, and Decision trees with several variants. The results show an impressive accuracy of the classification process for Support Vector Machines, Logistic Regression, and Random Forests. Regarding the word embedding methods, the best results were obtained with the FASTTEXT technique.
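As a rough illustration of the pipeline the abstract describes (text to TF-IDF vectors to a classifier), here is a minimal, self-contained sketch; the toy product names, labels, and the 1-nearest-neighbour classifier are stand-ins for the paper's datasets and its SVM/Logistic-Regression models:

```python
import math
from collections import Counter

def fit_df(docs):
    """Document frequency of each token across the training corpus."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df, len(docs)

def tfidf(doc, df, n):
    """Sparse TF-IDF vector for one token list, with smoothed IDF."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * math.log((1 + n) / (1 + df[t]))
            for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy product names and labels (invented for illustration).
docs = [["red", "cotton", "shirt"], ["blue", "denim", "jeans"],
        ["apple", "fruit", "juice"], ["orange", "fruit", "soda"]]
labels = ["clothing", "clothing", "food", "food"]
df, n = fit_df(docs)
train = [tfidf(d, df, n) for d in docs]

def classify(doc):
    # Nearest training example by cosine similarity.
    q = tfidf(doc, df, n)
    best = max(range(len(train)), key=lambda i: cosine(q, train[i]))
    return labels[best]
```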

[NLP-33] NaijaNLP: A Survey of Nigerian Low-Resource Languages

【Quick Read】: This paper addresses the lack of research on low-resource natural language processing (LR-NLP) for Nigeria's three major languages (Hausa, Igbo, and Yorùbá, collectively NaijaNLP). Despite their large speaker populations, these languages are classified as low-resource because of the scarcity of resources supporting computational-linguistics tasks. Current research focuses mainly on exploiting existing data for downstream tasks, with little support for grammatical formalization, high-quality linguistic-resource development, or complex tasks such as language understanding and generation. Language-specific challenges, such as the accurate representation of tone and diacritics, also remain under-explored.

The key to the proposed way forward lies in intensifying resource enrichment, comprehensive annotation, and open collaborative initiatives, so as to overcome the field's over-reliance on existing data and fill the gaps in current research. This would advance NaijaNLP while also benefiting LR-NLP research more broadly.

Link: https://arxiv.org/abs/2502.19784
Authors: Isa Inuwa-Dutse
Affiliations: University of Huddersfield
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 35 pages, 2 figures, 4 tables

Abstract:With over 500 languages in Nigeria, three languages – Hausa, Yorùbá and Igbo – spoken by over 175 million people, account for about 60% of the spoken languages. However, these languages are categorised as low-resource due to insufficient resources to support tasks in computational linguistics. Several research efforts and initiatives have been presented, however, a coherent understanding of the state of Natural Language Processing (NLP) - from grammatical formalisation to linguistic resources that support complex tasks such as language understanding and generation is lacking. This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages (NaijaNLP). We quantitatively assess the available linguistic resources and identify key challenges. Although a growing body of literature addresses various NLP downstream tasks in Hausa, Igbo, and Yorùbá, only about 25.1% of the reviewed studies contribute new linguistic resources. This finding highlights a persistent reliance on repurposing existing data rather than generating novel, high-quality resources. Additionally, language-specific challenges, such as the accurate representation of diacritics, remain under-explored. To advance NaijaNLP and LR-NLP more broadly, we emphasise the need for intensified efforts in resource enrichment, comprehensive annotation, and the development of open collaborative initiatives.

[NLP-34] Do Retrieval-Augmented Language Models Adapt to Varying User Needs?

【Quick Read】: This paper addresses the fact that existing benchmarks for Retrieval-Augmented Language Models (RALMs) fail to account for diverse user needs. Conventional evaluations assume a single optimal way of using retrieved information, ignoring the varied requirements of real users. The paper proposes a new evaluation framework that systematically assesses RALMs under three user-need cases (Context-Exclusive, Context-First, Memory-First) across three context settings (Context Matching, Knowledge Conflict, Information Irrelevant). The key is to vary both user instructions and the nature of the retrieved information, capturing the real-world complexity of adapting to changing user needs. Extensive experiments on multiple QA datasets show that restricting memory usage improves robustness under adversarial retrieval but lowers peak performance with ideal retrieval results, and that model family dominates behavioral differences. The study underscores the necessity of user-centric evaluation when developing retrieval-augmented systems and offers insights for optimizing model performance across retrieval contexts.

Link: https://arxiv.org/abs/2502.19779
Authors: Peilin Wu, Xinlu Zhang, Wenhao Yu, Xingyu Liu, Xinya Du, Zhiyu Zoey Chen
Affiliations: Department of Computer Science, The University of Texas at Dallas; Department of Computer Science, University of California, Santa Barbara; Tencent AI Seattle Lab; Harvard University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in Retrieval-Augmented Language Models (RALMs) have demonstrated their efficacy in knowledge-intensive tasks. However, existing evaluation benchmarks often assume a single optimal approach to leveraging retrieved information, failing to account for varying user needs. This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases-Context-Exclusive, Context-First, and Memory-First-across three distinct context settings: Context Matching, Knowledge Conflict, and Information Irrelevant. By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications where models must adapt to diverse user requirements. Through extensive experiments on multiple QA datasets, including HotpotQA, DisentQA, and our newly constructed synthetic URAQ dataset, we find that restricting memory usage improves robustness in adversarial retrieval conditions but decreases peak performance with ideal retrieval results and model family dominates behavioral differences. Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems and provide insights into optimizing model performance across varied retrieval contexts. We will release our code and URAQ dataset upon acceptance of the paper.

[NLP-35] Advancements in Natural Language Processing for Automatic Text Summarization ICCS2024

【Quick Read】: This survey examines how Automatic Text Summarization (ATS) can cope with diverse textual content and technical complexity. Advances in Natural Language Processing (NLP) and Deep Learning (DL) have markedly improved summarization models, yet summarization remains constrained by varied writing styles and technical intricacies. The key contribution is an exploration of hybrid techniques that combine extractive and abstractive methods, together with a comparative analysis of different techniques and evaluation metrics, using language-generation models to assess the produced summaries. The aim is to combine the strengths of both approaches while overcoming their respective limitations, providing a comprehensive view of ATS.

Link: https://arxiv.org/abs/2502.19773
Authors: Nevidu Jayatilleke, Ruvan Weerasinghe, Nipuna Senanayake
Affiliations: not listed
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 9 figures, ICCS 2024

Abstract:The substantial growth of textual content in diverse domains and platforms has led to a considerable need for Automatic Text Summarization (ATS) techniques that aid in the process of text analysis. The effectiveness of text summarization models has been significantly enhanced in a variety of technical domains because of advancements in Natural Language Processing (NLP) and Deep Learning (DL). Despite this, the process of summarizing textual information continues to be significantly constrained by the intricate writing styles of a variety of texts, which involve a range of technical complexities. Text summarization techniques can be broadly categorized into two main types: abstractive summarization and extractive summarization. Extractive summarization involves directly extracting sentences, phrases, or segments of text from the content without making any changes. On the other hand, abstractive summarization is achieved by reconstructing the sentences, phrases, or segments from the original text using linguistic analysis. Through this study, a linguistically diverse categorizations of text summarization approaches have been addressed in a constructive manner. In this paper, the authors explored existing hybrid techniques that have employed both extractive and abstractive methodologies. In addition, the pros and cons of various approaches discussed in the literature are also investigated. Furthermore, the authors conducted a comparative analysis on different techniques and matrices to evaluate the generated summaries using language generation models. This survey endeavors to provide a comprehensive overview of ATS by presenting the progression of language processing regarding this task through a breakdown of diverse systems and architectures accompanied by technical and mathematical explanations of their operations.
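As a concrete example of the extractive side of this taxonomy, a minimal frequency-based extractive summarizer (a classic baseline for illustration, not one of the surveyed systems) can be sketched as:

```python
import re
from collections import Counter

def extractive_summary(text, k=1):
    """Frequency-based extractive summarization: score each sentence by the
    average corpus frequency of its words, keep the top-k in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(s):
        toks = re.findall(r"\w+", s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(ranked))
```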

[NLP-36] EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models

【Quick Read】: This paper tackles the lack of controllable editing of reference text at different scales, particularly when attributes such as toxicity or sentiment need adjusting. The core of the proposed solution combines two methods: an SDEdit-based technique that allows broad adjustment of the degree of editing, and a fine-grained editing method with a self-conditioning mechanism that enables subtle control over the reference text. By integrating the two, EdiText can precisely adjust text attributes within the desired range, demonstrating strong controllability and robustness.

Link: https://arxiv.org/abs/2502.19765
Authors: Che Hyun Lee, Heeseung Kim, Jiheum Yeom, Sungroh Yoon
Affiliations: Department of Electrical and Computer Engineering, Seoul National University; AIIS, ASRI, INMC, ISRC, and IPAI, Seoul National University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We propose EdiText, a controllable text editing method that modify the reference text to desired attributes at various scales. We integrate an SDEdit-based editing technique that allows for broad adjustments in the degree of text editing. Additionally, we introduce a novel fine-level editing method based on self-conditioning, which allows subtle control of reference text. While being capable of editing on its own, this fine-grained method, integrated with the SDEdit approach, enables EdiText to make precise adjustments within the desired range. EdiText demonstrates its controllability to robustly adjust reference text at broad range of levels across various tasks, including toxicity control and sentiment control.

[NLP-37] PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation

【Quick Read】: This paper addresses the inconsistent performance of large language models (LLMs) in multilingual settings. To close this gap, it proposes PolyPrompt, a parameter-efficient framework. The key idea is to learn a set of trigger tokens for each language via gradient-based search; at inference time, the input query's language is identified and the corresponding trigger tokens are prepended to the prompt. On two models of roughly 1 billion parameters, PolyPrompt yields accuracy gains of 3.7%-19.9% over naive and translation-pipeline baselines on the global MMLU benchmark, covering fifteen typologically and resource-diverse languages.

Link: https://arxiv.org/abs/2502.19756
Authors: Nathan Roll
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures

Abstract:Large language models (LLMs) showcase increasingly impressive English benchmark scores, however their performance profiles remain inconsistent across multilingual settings. To address this gap, we introduce PolyPrompt, a novel, parameter-efficient framework for enhancing the multilingual capabilities of LLMs. Our method learns a set of trigger tokens for each language through a gradient-based search, identifying the input query’s language and selecting the corresponding trigger tokens which are prepended to the prompt during inference. We perform experiments on two ~1 billion parameter models, with evaluations on the global MMLU benchmark across fifteen typologically and resource diverse languages, demonstrating accuracy gains of 3.7%-19.9% compared to naive and translation-pipeline baselines.
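The gradient-based trigger search itself needs model internals, but the discrete-search idea can be illustrated with a greedy stand-in; the scoring function, target tokens, and vocabulary below are invented for illustration:

```python
def greedy_trigger_search(vocab, score, n_triggers=2):
    """Greedy stand-in for the paper's gradient-based trigger search:
    pick, one position at a time, the vocab token that most improves `score`."""
    triggers = []
    for _ in range(n_triggers):
        best = max(vocab, key=lambda tok: score(triggers + [tok]))
        triggers.append(best)
    return triggers

# Toy objective (an assumption): reward prompts containing certain
# language-identifying tokens, with a small length penalty.
target = {"<de>", "<answer>"}

def score(tokens):
    return len(target & set(tokens)) - 0.01 * len(tokens)

vocab = ["<de>", "<fr>", "<answer>", "the", "cat"]
```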

[NLP-38] Beneath the Surface: How Large Language Models Reflect Hidden Bias

【Quick Read】: This paper targets the social biases embedded in the training data of large language models (LLMs), especially implicit biases that existing benchmarks struggle to detect. Traditional benchmarks evaluate overt bias, i.e. direct associations between bias-concept terms and demographic terms, but as LLMs learn to avoid obviously biased responses, such evaluation can create an illusion of neutrality. Bias nevertheless persists in subtler, contextually hidden forms that conventional methods fail to capture. The paper therefore introduces the Hidden Bias Benchmark (HBB), a dataset designed to assess implicit biases hidden within naturalistic, subtly framed real-world contexts. The key lies in designing tests that reveal the implicit biases models continue to reinforce in nuanced scenarios, pushing toward a fuller understanding of latent bias in LLMs. Data, code, and results are available at the provided URL.

Link: https://arxiv.org/abs/2502.19749
Authors: Jinhao Pan, Chahat Raj, Ziyu Yao, Ziwei Zhu
Affiliations: George Mason University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The exceptional performance of Large Language Models (LLMs) often comes with the unintended propagation of social biases embedded in their training data. While existing benchmarks evaluate overt bias through direct term associations between bias concept terms and demographic terms, LLMs have become increasingly adept at avoiding biased responses, creating an illusion of neutrality. However, biases persist in subtler, contextually hidden forms that traditional benchmarks fail to capture. We introduce the Hidden Bias Benchmark (HBB), a novel dataset designed to assess hidden bias that bias concepts are hidden within naturalistic, subtly framed contexts in real-world scenarios. We analyze six state-of-the-art LLMs, revealing that while models reduce bias in response to overt bias, they continue to reinforce biases in nuanced settings. Data, code, and results are available at this https URL.

[NLP-39] HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture

【Quick Read】: This paper addresses the performance degradation caused by the inherent noise of resistive random-access memory (RRAM) when deploying LoRA-finetuned large language models (LLMs) on a hybrid compute-in-memory (CIM) architecture. The key solution is Hardware-aware Low-Rank Adaptation (HaLoRA), which aligns the training objectives under ideal and noisy conditions to train a LoRA branch that is both robust and accurate, improving the average score across multiple reasoning tasks by up to 22.7 points while remaining stable at various noise levels.

Link: https://arxiv.org/abs/2502.19747
Authors: Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng, Shuqi Wang, Chufan Shi, Zhengwu Liu, Ngai Wong
Affiliations: The University of Hong Kong; Tsinghua University
Subjects: Computation and Language (cs.CL); Hardware Architecture (cs.AR)
Comments: 7 pages

Abstract:Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method to adapt large language models (LLMs) for downstream tasks. In this paper, we first propose to deploy the LoRA-finetuned LLMs on the hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights onto RRAM and LoRA onto SRAM). To address performance degradation from RRAM’s inherent noise, we design a novel Hardware-aware Low-rank Adaption (HaLoRA) method, aiming to train a LoRA branch that is both robust and accurate by aligning the training objectives under both ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA’s effectiveness across multiple reasoning tasks, achieving up to 22.7 improvement in average score while maintaining robustness at various noise levels.
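A minimal sketch of the weight split the paper proposes (pretrained W on noisy RRAM, the low-rank LoRA branch on clean SRAM), using toy pure-Python matrices; the Gaussian weight noise is a simplified stand-in for real RRAM non-idealities:

```python
import random

def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_cim_forward(x, W, A, B, noise_std=0.0, seed=0):
    """y = x @ (W + noise) + (x @ A) @ B.
    W (pretrained, mapped to RRAM) is perturbed by read noise; the low-rank
    A, B (LoRA, mapped to SRAM) stay noise-free, mirroring the hybrid split."""
    rng = random.Random(seed)
    W_noisy = [[w + rng.gauss(0.0, noise_std) for w in row] for row in W]
    base = matmul(x, W_noisy)
    delta = matmul(matmul(x, A), B)
    return [[p + q for p, q in zip(rb, rd)] for rb, rd in zip(base, delta)]
```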

[NLP-40] XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs

【Quick Read】: This paper examines the generality and robustness of multilingual large language models' (LLMs) cross-lingual conceptual understanding. To evaluate this ability, the authors build XCOMPS, a multilingual conceptual minimal-pair dataset covering 17 languages, and assess models via metalinguistic prompting, direct probability measurement, and neurolinguistic probing. The key lies in comparing base, instruction-tuned, and knowledge-distilled models, revealing LLMs' weaknesses on low-resource languages, the performance drop when negative pairs share subtle semantic similarities, the lower scores and deeper-layer reasoning required by morphologically complex languages, and the differing effects of instruction tuning and knowledge distillation on the internal representations underlying conceptual understanding.

Link: https://arxiv.org/abs/2502.19737
Authors: Linyang He, Ercong Nie, Sukru Samet Dindar, Arsalan Firoozi, Adrian Florea, Van Nguyen, Corentin Puffay, Riki Shimizu, Haotian Ye, Jonathan Brennan, Helmut Schmid, Hinrich Schütze, Nima Mesgarani
Affiliations: Columbia University; Munich Center for Machine Learning; LMU Munich; University of Michigan; KU Leuven
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce XCOMPS in this work, a multilingual conceptual minimal pair dataset covering 17 languages. Using this dataset, we evaluate LLMs’ multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages despite being tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but exhibit a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves performance in concept understanding but does not enhance internal competence; knowledge distillation can enhance internal competence in conceptual understanding for low-resource languages with limited gains in explicit task performance. 4) More morphologically complex languages yield lower concept understanding scores and require deeper layers for conceptual reasoning.
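Minimal-pair evaluation via "direct probability measurement" simply asks whether the model assigns higher probability to the acceptable member of a pair; a toy bigram-LM version (with invented probabilities standing in for a real LLM) looks like:

```python
import math

def seq_logprob(tokens, bigram_probs):
    """Score a sentence under a toy bigram LM; minimal-pair evaluation asks
    whether the acceptable member of the pair gets the higher score."""
    lp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        lp += math.log(bigram_probs.get((a, b), 1e-6))  # floor unseen bigrams
    return lp

# Invented probabilities for a concept-property minimal pair.
bigram_probs = {("a", "knife"): 0.5, ("knife", "cuts"): 0.6,
                ("a", "spoon"): 0.5, ("spoon", "cuts"): 0.001}
good = ["a", "knife", "cuts"]
bad = ["a", "spoon", "cuts"]
```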

[NLP-41] R1-T1: Fully Incentivizing Translation Capability in LLM s via Reasoning Learning

【Quick Read】: This paper addresses how to bring inference-time reasoning into machine translation (MT), and in particular the limited adaptability of existing methods across diverse translation scenarios. Existing approaches either design a fixed chain of thought (CoT) for a specific MT sub-task, or rely on synthetic CoTs unaligned with humans and optimize them via supervised fine-tuning (SFT), which is prone to catastrophic forgetting; both limit generality and flexibility in multilingual translation.

The key innovation is R1-Translator (R1-T1), a framework that achieves inference-time reasoning for general MT via reinforcement learning (RL) with human-aligned CoTs. Its core contributions are threefold: (1) extending reasoning-based translation beyond MT sub-tasks to six languages and diverse tasks (e.g., legal/medical domain adaptation, idiom resolution); (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies such as context-aware paraphrasing and back-translation; and (3) enabling self-evolving CoT discovery and anti-forgetting adaptation via RL with KL-constrained rewards. Experiments show steady translation improvements across 21 languages and 80 translation directions on the Flores-101 test set, especially on the 15 languages unseen during training, with general multilingual ability preserved compared with plain SFT.

Link: https://arxiv.org/abs/2502.19735
Authors: Minggui He, Yilun Liu, Shimin Tao, Yuanchang Luo, Hongyong Zeng, Chang Su, Li Zhang, Hongxia Ma, Daimeng Wei, Weibin Meng, Hao Yang, Boxing Chen, Osamu Yoshie
Affiliations: not listed
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Despite recent breakthroughs in reasoning-enhanced large language models (LLMs) like DeepSeek-R1, incorporating inference-time reasoning into machine translation (MT), where human translators naturally employ structured, multi-layered reasoning chain-of-thoughts (CoTs), is yet underexplored. Existing methods either design a fixed CoT tailored for a specific MT sub-task (e.g., literature translation), or rely on synthesizing CoTs unaligned with humans and supervised fine-tuning (SFT) prone to catastrophic forgetting, limiting their adaptability to diverse translation scenarios. This paper introduces R1-Translator (R1-T1), a novel framework to achieve inference-time reasoning for general MT via reinforcement learning (RL) with human-aligned CoTs comprising six common patterns. Our approach pioneers three innovations: (1) extending reasoning-based translation beyond MT sub-tasks to six languages and diverse tasks (e.g., legal/medical domain adaptation, idiom resolution); (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies like context-aware paraphrasing and back translation; and (3) enabling self-evolving CoT discovery and anti-forgetting adaptation through RL with KL-constrained rewards. Experimental results indicate a steady translation performance improvement in 21 languages and 80 translation directions on Flores-101 test set, especially on the 15 languages unseen from training, with its general multilingual abilities preserved compared with plain SFT.
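The "KL-constrained reward" in contribution (3) is the standard RLHF-style reward shaping; in sketch form (the per-token log-probabilities below are made up, and the true objective operates on full distributions):

```python
def kl_constrained_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """Reward with a KL penalty toward the reference model, the usual
    RLHF-style shaping used to prevent forgetting:
    r' = r - beta * sum_t (log pi(a_t) - log pi_ref(a_t))."""
    kl = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return task_reward - beta * kl
```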

[NLP-42] Speculative Decoding and Beyond: An In-Depth Review of Techniques

【Quick Read】: This survey addresses the fundamental bottleneck that sequential dependencies impose on deploying large autoregressive models, especially in real-time applications. Traditional optimizations such as pruning and quantization often sacrifice model quality, whereas generation-refinement frameworks significantly mitigate the performance-efficiency trade-off through novel generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (including single-pass verification and iterative approaches). Through a systematic analysis of both algorithmic innovations and system-level implementations, the paper examines deployment strategies across computing environments and explores applications spanning text, image, and speech generation, laying a foundation for future research on efficient autoregressive decoding.

Link: https://arxiv.org/abs/2502.19732
Authors: Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, Sai Qian Zhang
Affiliations: New York University; University of Pennsylvania; Shenzhen Institute of Information Technology; Franklin and Marshall College
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated. This survey presents a comprehensive taxonomy of generation-refinement frameworks, analyzing methods across autoregressive sequence tasks. We categorize methods based on their generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (including single-pass verification and iterative approaches). Through systematic analysis of both algorithmic innovations and system-level implementations, we examine deployment strategies across computing environments and explore applications spanning text, images, and speech generation. This systematic examination of both theoretical frameworks and practical implementations provides a foundation for future research in efficient autoregressive decoding.
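To make the draft-then-verify idea concrete, here is a toy greedy variant of speculative decoding; real systems use a stochastic accept/reject rule over token distributions, and the two "models" below are invented stand-ins for a cheap draft model and an expensive target model:

```python
def speculative_decode(prefix, draft_next, target_next, k=4, max_len=10):
    """Greedy speculative decoding sketch: the cheap `draft_next` proposes k
    tokens; the expensive `target_next` keeps the longest agreeing prefix and
    contributes one corrected token at the first mismatch."""
    out = list(prefix)
    while len(out) < max_len:
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        accepted = []
        for tok in draft:
            t = target_next(out + accepted)
            if t == tok:
                accepted.append(tok)
            else:
                accepted.append(t)  # target overrides the first mismatch
                break
        out += accepted
    return out[:max_len]

def target_next(seq):
    """Toy 'large' model: deterministically continues the pattern a, b, c."""
    return "abc"[len(seq) % 3]

def draft_next(seq):
    """Toy 'draft' model: agrees with the target except at position 4."""
    return "x" if len(seq) == 4 else "abc"[len(seq) % 3]
```

Because every mismatch is replaced by the target model's own token, the final output always matches the target's greedy decode; the speed-up comes from accepting several draft tokens per expensive verification round.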

[NLP-43] Preference Learning Unlocks LLM s Psycho-Counseling Skills

【Quick Read】: This paper tackles two main problems in applying large language models (LLMs) to psycho-counseling: first, current LLMs struggle to consistently produce effective responses to client speech, largely because high-quality real counseling data, typically inaccessible due to client privacy, is unavailable for supervision; second, the quality of therapist responses in available sessions varies widely, and assessing that quality remains an open challenge. The key solution is a set of professional, comprehensive principles for evaluating therapist responses to client speech, used to build PsychoCounsel-Preference, a preference dataset of 36k high-quality comparison pairs aligned with the preferences of professional psychotherapists, providing a robust foundation for evaluating and improving LLMs in psycho-counseling. Experiments on reward modeling and preference learning show the dataset is an excellent resource for teaching LLMs the skills needed to respond to clients; the best-aligned model, PsychoCounsel-Llama3-8B, achieves an impressive 87% win rate against GPT-4o. The authors release PsychoCounsel-Preference, PsychoCounsel-Llama3-8B, and the reward model PsychoCounsel-Llama3-8B-Reward to foster research on psycho-counseling with LLMs.

Link: https://arxiv.org/abs/2502.19731
Authors: Mian Zhang, Shaun M. Eack, Zhiyu Zoey Chen
Affiliations: Department of Computer Science, University of Texas at Dallas; School of Social Work, University of Pittsburgh
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 6 figures

Abstract:Applying large language models (LLMs) to assist in psycho-counseling is an emerging and meaningful approach, driven by the significant gap between patient needs and the availability of mental health support. However, current LLMs struggle to consistently provide effective responses to client speeches, largely due to the lack of supervision from high-quality real psycho-counseling data, whose content is typically inaccessible due to client privacy concerns. Furthermore, the quality of therapists’ responses in available sessions can vary significantly based on their professional training and experience. Assessing the quality of therapists’ responses remains an open challenge. In this work, we address these challenges by first proposing a set of professional and comprehensive principles to evaluate therapists’ responses to client speeches. Using these principles, we create a preference dataset, PsychoCounsel-Preference, which contains 36k high-quality preference comparison pairs. This dataset aligns with the preferences of professional psychotherapists, providing a robust foundation for evaluating and improving LLMs in psycho-counseling. Experiments on reward modeling and preference learning demonstrate that PsychoCounsel-Preference is an excellent resource for LLMs to acquire essential skills for responding to clients in a counseling session. Our best-aligned model, PsychoCounsel-Llama3-8B, achieves an impressive win rate of 87% against GPT-4o. We release PsychoCounsel-Preference, PsychoCounsel-Llama3-8B and the reward model PsychoCounsel Llama3-8B-Reward to facilitate the research of psycho-counseling with LLMs at: this https URL.

[NLP-44] Tokens for Learning, Tokens for Unlearning: Mitigating Membership Inference Attacks in Large Language Models via Dual-Purpose Training

【Quick Read】: This paper addresses privacy leakage of training data in large language models (LLMs), especially via membership inference attacks (MIAs). Traditional defenses designed for classification models ignore the sequential nature of text, so they either require heavy computation or fail to effectively mitigate privacy risk in LLMs. The paper proposes a lightweight yet effective empirical privacy defense whose key is to exploit token-specific characteristics: by analyzing token dynamics during training, a token-selection strategy splits tokens into hard tokens for learning and memorized tokens for unlearning, and a novel dual-purpose token-level loss is optimized during training to reach a Pareto-optimal balance between utility and privacy. Experiments show the method not only defends strongly against MIAs but also improves language-modeling performance by around 10% over baselines across various LLM architectures and datasets.

Link: https://arxiv.org/abs/2502.19726
Authors: Toan Tran, Ruixuan Liu, Li Xiong
Affiliations: Emory University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have become the backbone of modern natural language processing but pose privacy concerns about leaking sensitive training data. Membership inference attacks (MIAs), which aim to infer whether a sample is included in a model’s training dataset, can serve as a foundation for broader privacy threats. Existing defenses designed for traditional classification models do not account for the sequential nature of text data. As a result, they either require significant computational resources or fail to effectively mitigate privacy risks in LLMs. In this work, we propose a lightweight yet effective empirical privacy defense for protecting training data of language modeling by leveraging the token-specific characteristics. By analyzing token dynamics during training, we propose a token selection strategy that categorizes tokens into hard tokens for learning and memorized tokens for unlearning. Subsequently, our training-phase defense optimizes a novel dual-purpose token-level loss to achieve a Pareto-optimal balance between utility and privacy. Extensive experiments demonstrate that our approach not only provides strong protection against MIAs but also improves language modeling performance by around 10% across various LLM architectures and datasets compared to the baselines.
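One hedged reading of the "dual-purpose token-level loss": high-loss (hard) tokens keep the usual descent objective, very-low-loss (memorized) tokens receive a gradient-ascent unlearning term, and tokens in between are left alone. The thresholds and weights below are illustrative, not the paper's:

```python
def dual_purpose_loss(token_losses, low=0.05, high=1.0, unlearn_weight=0.5):
    """Token-level dual objective sketch: learn on hard tokens (loss >= high),
    unlearn (negative-weight) already-memorized tokens (loss <= low),
    and skip the rest. Returns the mean combined objective."""
    total = 0.0
    for l in token_losses:
        if l >= high:                      # hard token: keep learning
            total += l
        elif l <= low:                     # memorized token: unlearn
            total -= unlearn_weight * l
    return total / len(token_losses)
```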

[NLP-45] CNsum:Automatic Summarization for Chinese News Text

【Quick Read】: This paper targets summary generation for Chinese news text and explores applying the Transformer architecture to Chinese tasks. The authors propose CNsum, a Transformer-based Chinese news summarization model, validated on Chinese datasets such as THUCNews. The key lies in leveraging the representational power of Transformer-based pretrained language models to generate high-quality summaries; experiments show CNsum achieves better ROUGE scores than baseline models, verifying its effectiveness.

Link: https://arxiv.org/abs/2502.19723
Authors: Yu Zhao, Songping Huang, Dongsheng Zhou, Zhaoyun Ding, Fei Wang, Aixin Nian
Affiliations: not listed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: WASA 2022

Abstract:Obtaining valuable information from massive data efficiently has become our research goal in the era of Big Data. Text summarization technology has been continuously developed to meet this demand. Recent work has also shown that transformer-based pre-trained language models have achieved great success on various tasks in Natural Language Processing (NLP). Aiming at the problem of Chinese news text summary generation and the application of Transformer structure on Chinese, this paper proposes a Chinese news text summarization model (CNsum) based on Transformer structure, and tests it on Chinese datasets such as THUCNews. The results of the conducted experiments show that CNsum achieves better ROUGE score than the baseline models, which verifies the outperformance of the model.
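For reference, the ROUGE-1 recall used to compare summarizers like CNsum against baselines can be computed as follows (a generic implementation of the metric, independent of the paper):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams covered by the
    candidate summary, with clipped counts for repeated words."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(c, ref[t]) for t, c in cand.items())
    return overlap / sum(ref.values())
```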

[NLP-46] Few-Shot Multilingual Open-Domain QA from 5 Examples ACL

【Quick Read】: This paper addresses the poor support for underrepresented languages in multilingual open-domain question answering (MLODQA) caused by high annotation cost. The key solution is a few-shot learning approach that synthesizes large-scale multilingual data with large language models (LLMs): large-scale self-supervised pre-training on WikiData is followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The resulting model, FsModQA, significantly outperforms existing few-shot and supervised baselines on MLODQA as well as cross-lingual and monolingual retrieval. The paper further shows that a cross-lingual prompting strategy enables effective zero-shot adaptation to new languages using only English-supervised data, making the method a general, practical solution for MLODQA without costly large-scale annotation.

Link: https://arxiv.org/abs/2502.19722
Authors: Fan Jiang, Tom Drummond, Trevor Cohn
Affiliations: not listed
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted by TACL; pre-MIT Press publication version

Abstract:Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a \emphfew-shot learning approach to synthesise large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, \textscFsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval. We further show our method can be extended for effective zero-shot adaptation to new languages through a \emphcross-lingual prompting strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.

[NLP-47] Sensing and Steering Stereotypes: Extracting and Applying Gender Representation Vectors in LLM s

【Quick Read】: This paper addresses stereotypes and biases in large language models (LLMs) and how to mitigate the harms they may cause. Rather than treating bias as a black-box problem, it adapts representation-engineering techniques to study how the concept of "gender" is represented inside LLMs. The key contributions are a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model's representation, plus a projection-based method that enables precise steering of model predictions, shown to be effective at mitigating gender bias in LLMs.

Link: https://arxiv.org/abs/2502.19721
Authors: Hannah Cyberey, Yangfeng Ji, David Evans
Affiliations: Department of Computer Science, University of Virginia
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate potential harms that may result from these biases, but most work studies biases in LLMs as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of “gender” is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model’s representation. We also present a projection-based method that enables precise steering of model predictions and demonstrate its effectiveness in mitigating gender bias in LLMs.
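A toy sketch of the two steps the abstract names (extracting a probability-weighted concept direction, then projecting it out of a hidden state); the vectors and weights are invented, and real activations are high-dimensional:

```python
def steering_vector(acts_a, acts_b, weights_a, weights_b):
    """Weighted difference of mean activations between two groups of inputs
    (e.g. inputs associated with each gender concept): a simplified stand-in
    for the paper's labeled-data-free, probability-weighted extraction."""
    dim = len(acts_a[0])

    def wmean(acts, w):
        tot = sum(w)
        return [sum(wi * a[d] for wi, a in zip(w, acts)) / tot
                for d in range(dim)]

    ma, mb = wmean(acts_a, weights_a), wmean(acts_b, weights_b)
    return [x - y for x, y in zip(ma, mb)]

def project_out(h, v):
    """Remove the component of hidden state h along steering vector v."""
    dot = sum(x * y for x, y in zip(h, v))
    norm2 = sum(y * y for y in v)
    return [x - (dot / norm2) * y for x, y in zip(h, v)]
```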

[NLP-48] GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

【Quick Read】: Language models are often miscalibrated, confidently giving wrong answers. To address this, the paper introduces GRACE, a calibration benchmark whose novelty lies in incorporating comparison with human calibration. GRACE consists of question-answer pairs whose clues become progressively easier while leading to the same answer; models must answer correctly as early, accurately, and confidently as possible, permitting fine-grained measurement of calibration. After collecting the questions, the authors host live human-vs-model competitions to gather 1,749 data points on timing, accuracy, and confidence, and propose CalScore, a metric that uses GRACE to analyze calibration errors and identify types of miscalibration that differ from human behavior. They find that although humans are less accurate than models, humans are generally better calibrated; and since state-of-the-art models struggle on GRACE, it effectively measures progress on improving model calibration.

Link: https://arxiv.org/abs/2502.19684
Authors: Yoo Yeon Sung, Eve Fleisig, Yu Hou, Ishan Upadhyay, Jordan Lee Boyd-Graber
Affiliations: University of Maryland; UC Berkeley; IIT Bombay
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams’ timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
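CalScore itself is specific to GRACE's clue-by-clue setting and is not specified in the abstract; as a generic point of comparison, a standard expected calibration error (ECE) looks like:

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Standard ECE: bucket predictions by confidence and compare each
    bucket's mean confidence to its empirical accuracy. A generic calibration
    metric; the paper's CalScore is a different, GRACE-specific measure."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(c for c, _ in b) / len(b)
        acc = sum(1.0 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(conf - acc)
    return ece
```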

[NLP-49] The Future Outcome Reasoning and Confidence Assessment Benchmark

【Quick Read】: This paper addresses shortcomings of existing forecasting benchmarks: the lack of comprehensive confidence assessment, limited question-type coverage, and mismatch with real-world human forecasting needs. The key solution is FOReCAst (Future Outcome Reasoning and Confidence Assessment), a benchmark that spans diverse forecasting scenarios, including Boolean questions, timeframe prediction, and quantity estimation, and comprehensively evaluates both prediction accuracy and confidence calibration for real-world applications.

Link: https://arxiv.org/abs/2502.19676
Authors: Zhangdie Yuan, Zifeng Ding, Andreas Vlachos
Affiliations: University of Cambridge
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Forecasting is an important task in many domains, such as technology and economics. However existing forecasting benchmarks largely lack comprehensive confidence assessment, focus on limited question types, and often consist of artificial questions that do not align with real-world human forecasting needs. To address these gaps, we introduce FOReCAst (Future Outcome Reasoning and Confidence Assessment), a benchmark that evaluates models’ ability to make predictions and their confidence in them. FOReCAst spans diverse forecasting scenarios involving Boolean questions, timeframe prediction, and quantity estimation, enabling a comprehensive evaluation of both prediction accuracy and confidence calibration for real-world applications.
zh
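对布尔型预测问题,置信度校准常用 Brier 分数衡量:预测概率与真实结果(0/1)之差的平方均值,越低越好。下面是该标准指标的一个小例子(仅作示意,并非 FOReCAst 的具体评测实现):

```python
def brier_score(probs, outcomes):
    """Brier 分数:mean((p - o)^2),p 为预测概率,o 为实际结果(0 或 1)。"""
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# 三个布尔问题:模型给出"事件发生"的概率,与实际结果比较
probs = [0.9, 0.2, 0.7]
outcomes = [1, 0, 0]  # 实际是否发生
print(round(brier_score(probs, outcomes), 4))  # 0.18
```

可以看出,第三个问题(以 0.7 的置信度预测了未发生的事件)贡献了大部分误差。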

[NLP-50] Investigating Neurons and Heads in Transformer-based LLM s for Typographical Errors

【速读】: 该论文试图解决的问题是大型语言模型(LLMs)如何处理包含拼写错误(typos)的输入,并理解其背后的机制。论文假设特定的神经元(typo neurons)和注意力头(attention heads)能够通过局部和全局上下文识别并内部修正拼写错误。解决方案的关键在于提出了一种方法来识别在输入包含拼写错误时活跃的“拼写错误神经元”和“拼写错误注意力头”,并通过实验验证了这些组件在不同层次上的作用,包括利用局部上下文修正拼写错误、依赖中间层神经元进行基于全局上下文的核心修正、以及注意力头通过广泛上下文而非特定标记来修正拼写错误的能力。此外,研究还发现这些组件不仅用于拼写错误修正,还能帮助模型更好地理解一般上下文。

链接: https://arxiv.org/abs/2502.19669
作者: Kohei Tsuji,Tatsuya Hiraoka,Yuchang Cheng,Eiji Aramaki,Tomoya Iwakura
机构: NAIST (奈良先端科学技术大学院大学); MBZUAI (穆罕默德·本·扎耶德人工智能大学); Fujitsu Ltd. (富士通有限公司)
类目: Computation and Language (cs.CL)
备注: 14 pages, 10 figures, 6 tables

点击查看摘要

Abstract:This paper investigates how LLMs encode inputs with typos. We hypothesize that specific neurons and attention heads recognize typos and fix them internally using local and global contexts. We introduce a method to identify typo neurons and typo heads that work actively when inputs contain typos. Our experimental results suggest the following: 1) LLMs can fix typos with local contexts when the typo neurons in either the early or late layers are activated, even if those in the other are not. 2) Typo neurons in the middle layers are responsible for the core of typo-fixing with global contexts. 3) Typo heads fix typos by widely considering the context, not focusing on specific tokens. 4) Typo neurons and typo heads work not only for typo-fixing but also for understanding general contexts.
zh
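识别"拼写错误神经元"的一种直观思路(仅为示意,并非论文的具体方法),是比较同一批输入在干净/含拼写错误两种版本下的平均激活差异,差异最大的神经元即候选:

```python
import numpy as np

def find_typo_neurons(acts_clean, acts_typo, top_k=3):
    """按"含拼写错误输入与干净输入的平均激活差"排序,返回差异最大的神经元下标。

    acts_clean / acts_typo: (样本数, 神经元数) 的激活矩阵
    """
    diff = np.abs(acts_typo.mean(axis=0) - acts_clean.mean(axis=0))
    return np.argsort(diff)[::-1][:top_k]


rng = np.random.default_rng(0)
clean = rng.normal(0.0, 0.1, size=(100, 8))
typo = clean.copy()
typo[:, 5] += 1.0  # 人为构造:第 5 号神经元在拼写错误输入上被额外激活
print(find_typo_neurons(clean, typo, top_k=1))  # [5]
```

实际研究中通常还需在多个数据集上验证该差异的稳定性,以排除偶然激活。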

[NLP-51] Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning

【速读】: 该论文试图解决的问题是如何在无需显式推理监督的情况下，通过基于可验证奖励的强化学习（RLVR）方法，使基础语言模型具备自我演化的推理能力，并探索其在医学领域中的适用性。论文的关键解决方案是提出Med-RLVR方法，利用医学多选题作答（MCQA）数据作为可验证标签，在不依赖显式推理指导的前提下，训练模型以实现医学推理能力。实验结果表明，RLVR不仅适用于数学和编程领域，还能有效迁移到医学问答任务中：在分布内任务上达到与传统有监督微调（SFT）相当的性能，同时将分布外泛化的准确率提升8个百分点；训练动态分析还揭示，3B参数规模的基座模型在无显式推理监督的情况下能够自发演化出推理能力。

链接: https://arxiv.org/abs/2502.19655
作者: Sheng Zhang,Qianchu Liu,Guanghui Qin,Tristan Naumann,Hoifung Poon
机构: Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning from verifiable rewards (RLVR) has recently gained attention for its ability to elicit self-evolved reasoning capabilities from base language models without explicit reasoning supervision, as demonstrated by DeepSeek-R1. While prior work on RLVR has primarily focused on mathematical and coding domains, its applicability to other tasks and domains remains unexplored. In this work, we investigate whether medical reasoning can emerge from RLVR. We introduce Med-RLVR as an initial study of RLVR in the medical domain leveraging medical multiple-choice question answering (MCQA) data as verifiable labels. Our results demonstrate that RLVR is not only effective for math and coding but also extends successfully to medical question answering. Notably, Med-RLVR achieves performance comparable to traditional supervised fine-tuning (SFT) on in-distribution tasks while significantly improving out-of-distribution generalization, with an 8-point accuracy gain. Further analysis of training dynamics reveals that, with no explicit reasoning supervision, reasoning emerges from the 3B-parameter base model. These findings underscore the potential of RLVR in domains beyond math and coding, opening new avenues for its application in knowledge-intensive fields such as medicine.
zh
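RLVR 的核心前提是奖励可以被程序化验证。以医学选择题为例,可以写出如下示意性的奖励函数(其中的答案抽取规则仅为假设,实际系统会采用更健壮的解析):

```python
import re

def mcqa_reward(model_output: str, gold_choice: str) -> float:
    """可验证奖励:从模型输出中提取独立出现的选项字母,与标准答案一致则奖励 1,否则 0。"""
    match = re.search(r"\b([A-D])\b", model_output.strip().upper())
    if match is None:
        return 0.0  # 未给出可识别的选项
    return 1.0 if match.group(1) == gold_choice.upper() else 0.0


print(mcqa_reward("The answer is B.", "b"))      # 1.0
print(mcqa_reward("I think C is correct", "B"))  # 0.0
```

这种奖励不需要人工标注推理过程,只依赖题目自带的标准答案,这正是 RLVR 能绕开显式推理监督的原因。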

[NLP-52] Conversational Planning for Personal Plans

【速读】: 该论文旨在解决如何构建能够支持长期交互与任务执行的语言基础智能体的问题。传统方法依赖于具有分层规划能力的强化学习代理来实现远期规划,而本文提出了一种新颖的架构,其中大型语言模型(Large Language Models, LLMs)作为元控制器,决定智能体的下一个宏观动作(macro-action),并通过增强工具使用的LLM选项策略执行选定的宏观动作。关键在于将LLMs与特定的宏观动作结合,通过对话及后续提问收集用户反馈,从而实现用户个人计划的自适应规划,并展示该范式在学术与非学术任务辅导以及个人健康计划的对话式指导等场景中的适用性。

链接: https://arxiv.org/abs/2502.19500
作者: Konstantina Christakopoulou,Iris Qu,John Canny,Andrew Goodridge,Cj Adams,Minmin Chen,Maja Matarić
机构: Google DeepMind; Google DeepMind; Google DeepMind; Google; Google; Google DeepMind; Google DeepMind
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The language generation and reasoning capabilities of large language models (LLMs) have enabled conversational systems with impressive performance in a variety of tasks, from code generation, to composing essays, to passing STEM and legal exams, to a new paradigm for knowledge search. Besides those short-term use applications, LLMs are increasingly used to help with real-life goals or tasks that take a long time to complete, involving multiple sessions across days, weeks, months, or even years. Thus to enable conversational systems for long term interactions and tasks, we need language-based agents that can plan for long horizons. Traditionally, such capabilities were addressed by reinforcement learning agents with hierarchical planning capabilities. In this work, we explore a novel architecture where the LLM acts as the meta-controller deciding the agent’s next macro-action, and tool use augmented LLM-based option policies execute the selected macro-action. We instantiate this framework for a specific set of macro-actions enabling adaptive planning for users’ personal plans through conversation and follow-up questions collecting user feedback. We show how this paradigm can be applicable in scenarios ranging from tutoring for academic and non-academic tasks to conversational coaching for personal health plans.
zh

[NLP-53] Voting or Consensus? Decision-Making in Multi-Agent Debate

【速读】: 该论文试图解决多智能体辩论中决策协议选择对任务性能影响不明确的问题。为系统评估不同决策协议的效果,作者设计了仅改变单一变量(即决策协议)的实验方案,分析了七种决策协议(如多数投票、一致共识)在知识(MMLU、MMLU-Pro、GPQA)和推理(StrategyQA、MuSR、SQuAD 2.0)数据集上的表现。研究发现,多数投票协议在推理任务中提升性能达13.2%,而一致共识协议在知识任务中提升性能达2.8%。此外,论文提出两种新方法(All-Agents Drafting, Collective Improvement),通过增加答案多样性进一步优化决策过程,在某些任务中分别提升了3.3%和7.4%的性能。因此,该研究的关键在于揭示决策协议的重要性,并提出创新性方法以提高多智能体辩论中的任务表现,而非单纯依赖规模扩展。

链接: https://arxiv.org/abs/2502.19130
作者: Lars Benedikt Kaesberg,Jonas Becker,Jan Philip Wahle,Terry Ruas,Bela Gipp
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Much of the success of multi-agent debates depends on carefully choosing the right parameters. Among them, the decision-making protocol stands out. Systematic comparison of decision protocols is difficult because studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making addresses the challenges of different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time (i.e., decision protocol) to analyze how different methods affect the collaboration between agents and test different protocols on knowledge (MMLU, MMLU-Pro, GPQA) and reasoning datasets (StrategyQA, MuSR, SQuAD 2.0). Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks over other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduces it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
zh
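论文比较的两类典型决策协议(多数投票与一致共识)可以用如下玩具代码示意。其中"讨论轮次"用 revise 回调模拟,达不成共识时退化为多数投票是本示例的假设,并非论文协议的规定:

```python
from collections import Counter

def majority_vote(answers):
    """多数投票:返回得票最多的答案。"""
    return Counter(answers).most_common(1)[0][0]

def unanimity_consensus(answers, max_rounds=3, revise=None):
    """一致共识:所有智能体答案相同才通过,否则进入下一轮讨论(由 revise 模拟)。"""
    for _ in range(max_rounds):
        if len(set(answers)) == 1:
            return answers[0]
        if revise is None:
            break
        answers = revise(answers)
    return majority_vote(answers)  # 兜底:达不成共识时退化为多数投票(本示例假设)


agents = ["巴黎", "巴黎", "伦敦", "巴黎", "巴黎"]
print(majority_vote(agents))  # 巴黎
# 模拟一轮讨论后少数派向多数派靠拢
print(unanimity_consensus(agents, revise=lambda a: [majority_vote(a)] * len(a)))  # 巴黎
```

两种协议的差别正对应论文的发现:投票一次即可裁决,适合分歧常见的推理任务;一致共识需要反复收敛,在知识型任务上更稳。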

[NLP-54] SuPreME: A Supervised Pre-training Framework for Multimodal ECG Representation Learning

【速读】: 该论文旨在解决大规模标注心电图(ECG)数据获取困难的问题,同时克服现有无监督自监督学习方法(eSSL)未能充分捕捉细粒度临床语义且需大量任务特定微调的局限。论文的关键解决方案是提出了一种名为SuPreME的监督预训练框架,用于多模态ECG表示学习。SuPreME利用大型语言模型(LLMs)从自由文本ECG报告中提取结构化临床实体,过滤噪声和无关内容,增强临床表示学习,并构建高质量的细粒度标注数据集。通过使用基于文本的心脏查询而非传统类别标签,SuPreME实现了未见疾病的零样本分类,无需额外微调。实验结果表明,SuPreME在六个下游数据集上的零样本AUC性能优于最先进的eSSL和多模态方法,提升了1.96%以上。

链接: https://arxiv.org/abs/2502.19668
作者: Mingsheng Cai,Jiuming Jiang,Wenhao Huang,Che Liu,Rossella Arcucci
机构: The University of Edinburgh (爱丁堡大学); Shenzhen Yinwang Intelligent Technology Co., Ltd (深圳引网智能科技有限公司); Imperial College London (帝国理工学院)
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cardiovascular diseases are a leading cause of death and disability worldwide. Electrocardiogram (ECG) recordings are critical for diagnosing and monitoring cardiac health, but obtaining large-scale annotated ECG datasets is labor-intensive and time-consuming. Recent ECG Self-Supervised Learning (eSSL) methods mitigate this by learning features without extensive labels but fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning. To address these challenges, we propose SuPreME, a Supervised Pre-training framework for Multimodal ECG representation learning. SuPreME applies Large Language Models (LLMs) to extract structured clinical entities from free-text ECG reports, filter out noise and irrelevant content, enhance clinical representation learning, and build a high-quality, fine-grained labeled dataset. By using text-based cardiac queries instead of traditional categorical labels, SuPreME enables zero-shot classification of unseen diseases without additional fine-tuning. We evaluate SuPreME on six downstream datasets covering 127 cardiac conditions, achieving superior zero-shot AUC performance over state-of-the-art eSSL and multimodal methods by over 1.96%. Results demonstrate the effectiveness of SuPreME in leveraging structured, clinically relevant knowledge for high-quality ECG representations. All code and data will be released upon acceptance.
zh
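基于文本心脏查询的零样本分类,本质上是把 ECG 嵌入与各查询文本的嵌入做相似度匹配。下面用随手构造的玩具嵌入示意这一流程(嵌入向量与查询名称均为假设,并非 SuPreME 的实际模型输出):

```python
import numpy as np

def zero_shot_classify(ecg_emb, query_embs, query_names):
    """用余弦相似度把 ECG 嵌入匹配到最相近的文本心脏查询,实现零样本分类。"""
    ecg = ecg_emb / np.linalg.norm(ecg_emb)
    queries = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    sims = queries @ ecg  # 每个查询与该 ECG 的余弦相似度
    return query_names[int(np.argmax(sims))]


# 玩具嵌入:假设第 1 个查询("心房颤动")与该 ECG 的方向最接近
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
names = ["心房颤动", "窦性心律"]
print(zero_shot_classify(np.array([0.9, 0.1]), queries, names))  # 心房颤动
```

由于类别由文本查询定义,新疾病只需写一条新查询即可参与分类,无需重新微调模型。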

计算机视觉

[CV-0] Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids

【速读】:该论文致力于解决在类人机器人灵巧操作任务中应用强化学习(Reinforcement Learning, RL)所面临的挑战,特别是在高接触频率的任务场景下实现接近或超越人类水平的操作能力。论文针对仿真环境与真实世界之间的差距、奖励函数设计复杂性以及样本效率低下等问题提出了创新性的解决方案。关键在于引入了一种自动化的虚实调优模块以提升仿真环境的真实性,提出了一种通用的奖励设计方案简化长期接触密集型任务的奖励工程工作,开发了一种分而治之的知识蒸馏流程提高困难探索问题的样本效率同时保持仿真到现实的性能一致性,并采用稀疏与密集物体表征的混合方法弥合感知层面的虚实鸿沟。通过在三项类人灵巧操作任务上的实验验证,展示了这些技术的有效性及其对推动仿真到现实强化学习应用于类人灵巧操作的重要意义。

链接: https://arxiv.org/abs/2502.20396
作者: Toru Lin,Kartik Sachdev,Linxi Fan,Jitendra Malik,Yuke Zhu
机构: UC Berkeley(加州大学伯克利分校); NVIDIA(英伟达); UT Austin(德克萨斯大学奥斯汀分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Project page can be found at this https URL

点击查看摘要

Abstract:Reinforcement learning has delivered promising results in achieving human- or even superhuman-level capabilities across diverse problem domains, but success in dexterous robot manipulation remains limited. This work investigates the key challenges in applying reinforcement learning to solve a collection of contact-rich manipulation tasks on a humanoid embodiment. We introduce novel techniques to overcome the identified challenges with empirical validation. Our main contributions include an automated real-to-sim tuning module that brings the simulated environment closer to the real world, a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks, a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance, and a mixture of sparse and dense object representations to bridge the sim-to-real perception gap. We show promising results on three humanoid dexterous manipulation tasks, with ablation studies on each technique. Our work presents a successful approach to learning humanoid dexterous manipulation using sim-to-real reinforcement learning, achieving robust generalization and high performance without the need for human demonstration.
zh

[CV-1] Walking the Web of Concept-Class Relationships in Incrementally Trained Interpretable Models AAAI2025

【速读】:该论文旨在解决在动态增量学习场景下,基于概念(concept-based)的神经网络模型无法有效保留和增强概念与类别之间复杂关系的问题。现有方法在防止灾难性遗忘(catastrophic forgetting)的同时,难以同时处理概念层面、类别层面以及概念-类别关系层面的遗忘。论文的关键在于提出了一种新颖的方法——MuCIL(Multimodal Concept Incremental Learning),它通过引入多模态概念(multimodal concepts)实现分类任务,而不增加跨经验的可训练参数数量。这些多模态概念与自然语言中的概念对齐,从而具备可解释性。实验表明,MuCIL在多种情况下实现了超过两倍于其他基于概念模型的分类性能,并且能够对概念进行干预以实现输入图像中视觉概念的定位及事后解释。

链接: https://arxiv.org/abs/2502.20393
作者: Susmit Agrawal,Deepika Vemuri,Sri Siddarth Chakaravarthy P,Vineeth N. Balasubramanian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages of main text, 6 figures in main text, 11 pages of Appendix, published in AAAI 2025

点击查看摘要

Abstract:Concept-based methods have emerged as a promising direction to develop interpretable neural networks in standard supervised settings. However, most works that study them in incremental settings assume either a static concept set across all experiences or assume that each experience relies on a distinct set of concepts. In this work, we study concept-based models in a more realistic, dynamic setting where new classes may rely on older concepts in addition to introducing new concepts themselves. We show that concepts and classes form a complex web of relationships, which is susceptible to degradation and needs to be preserved and augmented across experiences. We introduce new metrics to show that existing concept-based models cannot preserve these relationships even when trained using methods to prevent catastrophic forgetting, since they cannot handle forgetting at concept, class, and concept-class relationship levels simultaneously. To address these issues, we propose a novel method - MuCIL - that uses multimodal concepts to perform classification without increasing the number of trainable parameters across experiences. The multimodal concepts are aligned to concepts provided in natural language, making them interpretable by design. Through extensive experimentation, we show that our approach obtains state-of-the-art classification performance compared to other concept-based models, achieving over 2× the classification performance in some cases. We also study the ability of our model to perform interventions on concepts, and show that it can localize visual concepts in input images, providing post-hoc interpretations.
zh

[CV-2] InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions CVPR2025

【速读】:该论文旨在解决实现逼真模拟人类与广泛物体交互这一长期目标,特别是在复杂人体-物体相互作用(Human-Object Interactions, HOIs)中面临的挑战,如人体与物体复杂的耦合关系、物体几何形状的变化以及运动捕捉(MoCap)数据中的不准确性(如接触点误差和手部细节缺失)。论文的关键在于提出了一种名为InterMimic的框架,通过采用“先完美后扩展”的课程学习策略(curriculum strategy),解决了从大量不完美的MoCap数据中学习多样化全身交互的问题。具体而言,首先训练特定主体的教师策略来模仿、重定向并优化运动捕捉数据;然后将这些教师策略的知识蒸馏到学生策略中,并结合强化学习微调以超越单纯的示范复制,从而生成更高质量的解决方案。实验结果表明,InterMimic能够生成多数据集上的逼真且多样化的交互,并使所学策略以零样本方式泛化,同时与运动学生成器无缝集成,实现了从单纯模仿到复杂人体-物体相互作用生成建模的提升。

链接: https://arxiv.org/abs/2502.20390
作者: Sirui Xu,Hung Yu Ling,Yu-Xiong Wang,Liang-Yan Gui
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Electronic Arts
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: CVPR 2025. Project Page: this https URL

点击查看摘要

Abstract:Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy – perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
zh

[CV-3] LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding

【速读】:该论文旨在解决3D视觉-语言理解模型训练中对3D标注数据的依赖问题。传统方法通常需要昂贵且难以获取的3D标注,而本文提出了一种创新性的解决方案:通过将3D重建视为“潜在变量”(latent variable),利用仅基于2D图像和相机姿态的监督信号进行训练,同时使用2D损失函数与可微渲染技术(differentiable rendering)来避免对网络架构施加不必要的约束(如支持解码器-only模型)。关键在于无需真实3D标注,通过伪标签(pseudo-labels)由预训练的2D模型生成,即可实现高效的预训练,并进一步在3D视觉-语言定位任务中取得优于现有基线及SOTA方法的表现。

链接: https://arxiv.org/abs/2502.20389
作者: Ang Cao,Sergio Arnaud,Oleksandr Maksymets,Jianing Yang,Ayush Jain,Sriram Yenamandra,Ada Martin,Vincent-Pierre Berges,Paul McVay,Ruslan Partsey,Aravind Rajeswaran,Franziska Meier,Justin Johnson,Jeong Joon Park,Alexander Sax
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Our approach to training 3D vision-language understanding models is to train a feedforward model that makes predictions in 3D, but never requires 3D labels and is supervised only in 2D, using 2D losses and differentiable rendering. The approach is new for vision-language understanding. By treating the reconstruction as a "latent variable", we can render the outputs without placing unnecessary constraints on the network architecture (e.g. can be used with decoder-only models). For training, we only need images, camera poses, and 2D labels. We show that we can even remove the need for 2D labels by using pseudo-labels from pretrained 2D models. We demonstrate this to pretrain a network, and we finetune it for 3D vision-language understanding tasks. We show this approach outperforms baselines/SOTA for 3D vision-language grounding, and also outperforms other 3D pretraining techniques. Project page: this https URL.
zh

[CV-4] Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

【速读】:该论文旨在解决两个核心问题:(1) 2D图像结构中“token”定义的最佳形式尚无定论;(2) 自回归(AR)模型在推理过程中因教师强制(teacher forcing)导致的暴露偏差(exposure bias)问题。论文的关键解决方案在于提出xAR框架,它将传统的“token”概念扩展为更通用的实体X,可表示为单个patch、邻域分组、非局部分组或整个图像等不同粒度的预测单元,从而实现灵活的上下文建模。同时,通过将离散的token分类重新表述为连续的实体回归,并采用噪声条件下的流匹配方法,xAR避免了依赖教师强制,实现了Noisy Context Learning,有效缓解了暴露偏差问题。这一创新不仅提升了生成性能,还显著加快了推理速度。

链接: https://arxiv.org/abs/2502.20388
作者: Sucheng Ren,Qihang Yu,Ju He,Xiaohui Shen,Alan Yuille,Liang-Chieh Chen
机构: Johns Hopkins University (约翰斯·霍普金斯大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at \url{ this https URL }

点击查看摘要

Abstract:Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20× faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2× faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
zh
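摘要中提到的流匹配回归,常见做法是在数据与噪声之间线性插值并让模型回归速度场。下面用 numpy 给出这一通用参数化的极简草图(rectified-flow 式线性插值,仅为该类方法的一般示意,并非 xAR 的具体实现):

```python
import numpy as np

def flow_matching_pair(x_data, x_noise, t):
    """构造线性流匹配的训练样本。

    采用常见的参数化: x_t = (1-t)·x_data + t·x_noise,
    速度目标 v = x_noise - x_data (模型在 x_t 处回归该速度)。
    """
    x_t = (1 - t) * x_data + t * x_noise
    v_target = x_noise - x_data
    return x_t, v_target


x_data = np.array([1.0, 2.0])   # 真实实体(如一个 patch 的连续特征)
x_noise = np.array([0.0, 0.0])  # 噪声端点
x_t, v = flow_matching_pair(x_data, x_noise, t=0.5)
print(x_t)  # [0.5 1. ]
print(v)    # [-1. -2.]
```

训练时在带噪的 x_t 上做条件回归,而非在真值 token 上做 teacher forcing,这正是摘要中 Noisy Context Learning 缓解暴露偏差的直观来源。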

[CV-5] InsTaG: Learning Personalized 3D Talking Head from Few-Second Video CVPR2025

【速读】:该论文致力于解决现有基于辐射场(radiance fields)的方法在合成个性化3D说话头时对大量训练数据和时间的高需求问题。论文提出了一种名为InsTaG的新框架,其关键是结合身份无关预训练(Identity-Free Pre-training)策略和运动对齐适应(Motion-Aligned Adaptation)策略。身份无关预训练策略允许对特定身份模型进行预训练,并从长视频数据集中收集通用运动先验(universal motion priors),而运动对齐适应策略则通过自适应对齐目标头部与预训练场,从而在少量训练数据下约束出鲁棒的动态头部结构,实现高质量且快速的个性化3D说话头适配。

链接: https://arxiv.org/abs/2502.20387
作者: Jiahe Li,Jiawei Zhang,Xiao Bai,Jin Zheng,Jun Zhou,Lin Gu
机构: School of Computer Science and Engineering, State Key Laboratory of Complex & Critical Software Environment,  Jiangxi Research Institute,  Beihang University (北航); School of Information and Communication Technology,  Griffith University (格里菲斯大学); RIKEN AIP (理化学研究所); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:Despite exhibiting impressive performance in synthesizing lifelike personalized 3D talking heads, prevailing methods based on radiance fields suffer from high demands for training data and time for each new identity. This paper introduces InsTaG, a 3D talking head synthesis framework that allows a fast learning of realistic personalized 3D talking head from few training data. Built upon a lightweight 3DGS person-specific synthesizer with universal motion priors, InsTaG achieves high-quality and fast adaptation while preserving high-level personalization and efficiency. As preparation, we first propose an Identity-Free Pre-training strategy that enables the pre-training of the person-specific model and encourages the collection of universal motion priors from long-video data corpus. To fully exploit the universal motion priors to learn an unseen new identity, we then present a Motion-Aligned Adaptation strategy to adaptively align the target head to the pre-trained field, and constrain a robust dynamic head structure under few training data. Experiments demonstrate our outstanding performance and efficiency under various data scenarios to render high-quality personalized talking heads.
zh

[CV-6] Efficient Gaussian Splatting for Monocular Dynamic Scene Rendering via Sparse Time-Variant Attribute Modeling AAAI2025

【速读】:该论文旨在解决利用单目视频渲染动态场景时存在的两个关键问题:一是Deformable Gaussian Splatting方法在训练过程中产生的冗余高斯分布(Gaussians)过多,导致渲染速度变慢;二是静态区域的高斯属性在时间上保持不变,无需建模却仍被处理,可能引发静态区域的抖动现象。论文指出,动态场景渲染速度的主要瓶颈在于高斯分布的数量。为解决这些问题,论文提出Efficient Dynamic Gaussian Splatting (EDGS),通过稀疏的时间可变属性建模来高效表示动态场景。其核心解决方案包括采用稀疏锚点网格(anchor-grid)表示动态场景,并利用经典核表示计算密集高斯分布的运动流,同时提出无监督策略过滤掉静态区域对应的锚点,仅将与可变形物体相关的锚点输入到MLP中查询时间可变属性,从而显著提升渲染速度并保证高质量的渲染效果。

链接: https://arxiv.org/abs/2502.20378
作者: Hanyang Kong,Xingyi Yang,Xinchao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2025

点击查看摘要

Abstract:Rendering dynamic scenes from monocular videos is a crucial yet challenging task. The recent deformable Gaussian Splatting has emerged as a robust solution to represent real-world dynamic scenes. However, it often leads to heavily redundant Gaussians, attempting to fit every training view at various time steps, leading to slower rendering speeds. Additionally, the attributes of Gaussians in static areas are time-invariant, making it unnecessary to model every Gaussian, which can cause jittering in static regions. In practice, the primary bottleneck in rendering speed for dynamic scenes is the number of Gaussians. In response, we introduce Efficient Dynamic Gaussian Splatting (EDGS), which represents dynamic scenes via sparse time-variant attribute modeling. Our approach formulates dynamic scenes using a sparse anchor-grid representation, with the motion flow of dense Gaussians calculated via a classical kernel representation. Furthermore, we propose an unsupervised strategy to efficiently filter out anchors corresponding to static areas. Only anchors associated with deformable objects are input into MLPs to query time-variant attributes. Experiments on two real-world datasets demonstrate that our EDGS significantly improves the rendering speed with superior rendering quality compared to previous state-of-the-art methods.
zh
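EDGS 中"无监督筛除静态区域锚点"的思路可以用一个极简草图示意:位置随时间的方差低于阈值即视为静态,只有动态锚点才送入 MLP 查询时间可变属性。阈值与判据均为本示例的假设,并非论文的具体策略:

```python
import numpy as np

def filter_static_anchors(anchor_traj, thresh=1e-3):
    """按时间方差筛选动态锚点。

    anchor_traj: (锚点数, 帧数, 3) 的锚点位置序列
    返回动态锚点的布尔掩码(True 表示需要建模时间可变属性)
    """
    var = anchor_traj.var(axis=1).sum(axis=-1)  # 每个锚点在时间维上的总方差
    return var > thresh


traj = np.zeros((3, 10, 3))
traj[1, :, 0] = np.linspace(0.0, 1.0, 10)  # 只有 1 号锚点随时间移动
mask = filter_static_anchors(traj)
print(mask)  # [False  True False]
```

这样静态区域完全跳过时间建模,既减少了需要查询的高斯数量,也避免了静态区域的抖动。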

[CV-7] Tight Inversion: Image-Conditioned Inversion for Real Image Editing

【速读】:该论文旨在解决现有文本到图像扩散模型在编辑真实图像时面临的重建与可编辑性之间的权衡问题,特别是对于细节丰富的挑战性图像。论文指出,当前许多方法依赖于将图像反演为高斯噪声,并通过逆向采样方程逐步添加噪声来实现这一过程,但这种方法存在重建质量和编辑能力之间的固有折衷。为了解决这一问题,论文的关键在于探索条件选择的重要性,并提出利用输入图像本身作为最精确的条件(即Tight Inversion方法),以缩小模型输出分布并同时提升重建精度和编辑能力。实验结果验证了该方法在改善重建效果以及与其他现有反演方法结合后的编辑性能方面的有效性。

链接: https://arxiv.org/abs/2502.20376
作者: Edo Kadosh,Nir Goren,Or Patashnik,Daniel Garibi,Daniel Cohen-Or
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page at: this https URL

点击查看摘要

Abstract:Text-to-image diffusion models offer powerful image editing capabilities. To edit real images, many methods rely on the inversion of the image into Gaussian noise. A common approach to invert an image is to gradually add noise to the image, where the noise is determined by reversing the sampling equation. This process has an inherent tradeoff between reconstruction and editability, limiting the editing of challenging images such as highly-detailed ones. Recognizing the reliance of text-to-image model inversion on a text condition, this work explores the importance of the condition choice. We show that a condition that precisely aligns with the input image significantly improves the inversion quality. Based on our findings, we introduce Tight Inversion, an inversion method that utilizes the most precise condition possible – the input image itself. This tight condition narrows the distribution of the model’s output and enhances both reconstruction and editability. We demonstrate the effectiveness of our approach when combined with existing inversion methods through extensive experiments, evaluating the reconstruction accuracy as well as the integration with various editing methods.
zh

[CV-8] Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation ICLR2025

【速读】:本文旨在解决生成两位角色在线交互任务中的问题,现有方法主要分为两类:基于对方完整动作序列生成自身动作,或基于特定条件联合生成两位角色的动作。然而,这些方法未能模拟真实场景中人类实时反应且作为独立个体互动的过程。为了解决这一问题,论文提出了一种名为“Ready-to-React”的在线反应策略,通过过去观察到的动作来生成下一帧角色姿态。关键在于每个角色拥有独立的反应策略作为其“大脑”,使它们能够以流式方式像真实人类一样互动。该策略通过在自回归模型中引入扩散头实现,可以动态响应对手的动作,并有效缓解生成过程中的误差累积。实验结果表明,此方法优于现有基线,并能生成扩展的动作序列,同时可通过稀疏信号进行控制,适用于虚拟现实及其他在线交互环境。

链接: https://arxiv.org/abs/2502.20370
作者: Zhi Cen,Huaijin Pi,Sida Peng,Qing Shuai,Yujun Shen,Hujun Bao,Xiaowei Zhou,Ruizhen Hu
机构: State Key Lab of CAD&CG, Zhejiang University (浙江大学); The University of Hong Kong (香港大学); Ant Group (蚂蚁集团); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as ICLR 2025 conference paper

点击查看摘要

Abstract:This paper addresses the task of generating two-character online interactions. Previously, two main settings existed for two-character interaction generation: (1) generating one’s motions based on the counterpart’s complete motion sequence, and (2) jointly generating two-character motions based on specific conditions. We argue that these settings fail to model the process of real-life two-character interactions, where humans will react to their counterparts in real time and act as independent individuals. In contrast, we propose an online reaction policy, called Ready-to-React, to generate the next character pose based on past observed motions. Each character has its own reaction policy as its “brain”, enabling them to interact like real humans in a streaming manner. Our policy is implemented by incorporating a diffusion head into an auto-regressive model, which can dynamically respond to the counterpart’s motions while effectively mitigating the error accumulation throughout the generation process. We conduct comprehensive experiments using the challenging boxing task. Experimental results demonstrate that our method outperforms existing baselines and can generate extended motion sequences. Additionally, we show that our approach can be controlled by sparse signals, making it well-suited for VR and other online interactive environments.
zh

[CV-9] OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

【速读】:该论文旨在解决当前时间动作检测(Temporal Action Detection, TAD)领域缺乏标准化框架的问题。由于现有方法在不同的实现设置、评估协议等条件下进行比较,难以客观评估特定技术的实际有效性。为了解决这一问题,论文提出了\textbf{OpenTAD},这是一个统一的TAD框架,整合了16种不同的TAD方法和9个标准数据集到一个模块化代码库中。OpenTAD的关键在于其灵活性和可比性,允许以最小的努力替换不同设计的模块、以端到端方式训练基于特征的TAD模型,或者在两种模式之间切换。此外,OpenTAD促进了跨数据集的直接基准测试,并实现了不同方法之间的公平且深入的比较。通过全面的实验研究,论文探讨了网络组件的创新如何影响检测性能,并确定了最有效的设计选择,从而构建了一种新的基于现有技术组件的SOTA TAD方法。

链接: https://arxiv.org/abs/2502.20361
作者: Shuming Liu,Chen Zhao,Fatimah Zohra,Mattia Soldan,Alejandro Pardo,Mengmeng Xu,Lama Alssum,Merey Ramazanova,Juan León Alcázar,Anthony Cioppa,Silvio Giancola,Carlos Hinojosa,Bernard Ghanem
机构: Video Understanding Group, Image and Video Understanding Lab (IVUL) (视频理解组, 图像和视频理解实验室); King Abdullah University of Science and Technology (KAUST) (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field has achieved remarkable progress in recent years, further progress and real-world applications are impeded by the absence of a standardized framework. Currently, different methods are compared under different implementation settings, evaluation protocols, etc., making it difficult to assess the real effectiveness of a specific technique. To address this issue, we propose OpenTAD, a unified TAD framework consolidating 16 different TAD methods and 9 standard datasets into a modular codebase. In OpenTAD, minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two. OpenTAD also facilitates straightforward benchmarking across various datasets and enables fair and in-depth comparisons among different methods. With OpenTAD, we comprehensively study how innovations in different network components affect detection performance and identify the most effective design choices through extensive experiments. This study has led to a new state-of-the-art TAD method built upon existing techniques for each component. We have made our code and models available at this https URL.
zh

[CV-10] ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

【速读】:该论文旨在解决语音驱动的3D面部动画生成中实时性和个性化表达的问题。现有基于扩散的方法虽然能够生成自然的运动,但其缓慢的生成速度限制了应用潜力。论文的关键在于提出了一种新颖的自回归模型,通过从语音到多尺度运动词典的映射学习,实现了高度同步的实时唇部运动、逼真的头部姿态以及眨眼动作的生成。此外,该模型可通过样本运动序列适应未见的说话风格,从而创建具有独特个人风格的3D虚拟角色。这一方案的核心在于结合多尺度运动表征与自回归建模能力,显著提升了唇部同步精度和感知质量。

链接: https://arxiv.org/abs/2502.20323
作者: Xuangeng Chu,Nabarun Goswami,Ziteng Cui,Hanqin Wang,Tatsuya Harada
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: More video demonstrations, code, models and data can be found on our project website: this http URL

点击查看摘要

Abstract:Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles using sample motion sequences, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization accuracy and perceived quality.
zh

[CV-11] UniTok: A Unified Tokenizer for Visual Generation and Understanding

【速读】:该论文旨在解决视觉生成与理解之间表示差异所导致的集成难题,即在单一框架中整合这两种能力时存在的关键差距。为了解决这一问题,论文引入了UniTok,这是一种离散视觉标记器,能够在生成任务中编码细粒度细节的同时,捕捉高层次语义以支持理解任务。尽管已有研究表明生成与理解目标可能导致训练过程中的损失冲突,但研究揭示这种冲突的根本瓶颈在于离散标记的表征容量限制。为此,论文提出多码本量化方法,通过将向量量化分解为多个独立子码本来扩展潜在特征空间,同时避免因码本过大引起的训练不稳定性。这种方法显著提升了统一离散标记器的性能上限,使其能够匹配甚至超越特定领域的连续标记器。例如,在ImageNet数据集上,UniTok实现了0.38的rFID(相较于SD-VAE的0.87)以及78.6%的零样本分类准确率(相较于CLIP的76.2%)。关键解决方案在于创新性地采用多码本量化技术来增强离散标记器的表达能力。

链接: https://arxiv.org/abs/2502.20321
作者: Chuofan Ma,Yi Jiang,Junfeng Wu,Jihan Yang,Xin Yu,Zehuan Yuan,Bingyue Peng,Xiaojuan Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The representation disparity between visual generation and understanding imposes a critical gap in integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives can induce loss conflicts in training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which divides vector quantization with several independent sub-codebooks to expand the latent feature space, while avoiding training instability caused by overlarge codebooks. Our method significantly raises the upper limit of unified discrete tokenizers to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves a remarkable rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at this https URL.
zh
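
上文的多码本量化思想可以用一段极简代码示意:把潜在向量均分为若干子向量,每段用独立的子码本做最近邻量化,从而在不增大单个码本规模的前提下扩展联合码字空间。以下 NumPy 草图仅为概念演示,函数名与维度设置均为假设,并非论文官方实现:

```python
import numpy as np

def multi_codebook_quantize(z, codebooks):
    # 将潜在向量 z 均分为 len(codebooks) 段,每段用对应子码本做最近邻量化
    chunks = np.split(z, len(codebooks))
    quantized, indices = [], []
    for chunk, cb in zip(chunks, codebooks):
        dists = np.linalg.norm(cb - chunk, axis=1)  # 子向量与各码字的欧氏距离
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized.append(cb[idx])
    return np.concatenate(quantized), indices

# 玩具设置:8 维潜在向量,4 个子码本,每个子码本含 16 个 2 维码字
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 2)) for _ in range(4)]
z = rng.normal(size=8)
z_q, ids = multi_codebook_quantize(z, codebooks)
```

在这个玩具设置下,联合可表达的组合数为 16^4,而每次检索仍只需比较 16 个码字,这正是多码本设计规避单一大码本训练不稳定问题的直观原因。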

[CV-12] MITracker: Multi-View Integration for Visual Object Tracking

【速读】:该论文致力于解决多视图目标跟踪(Multi-view Object Tracking, MVOT)中的两个主要挑战:缺乏全面的多视图数据集以及有效的跨视图融合方法。为克服这些限制,论文提出了一个包含234K高质量标注帧的Multi-View object Tracking (MVTrack) 数据集,并引入了一种新颖的多视图集成跟踪器(MITracker)。MITracker 的关键创新在于两点:首先,它将二维图像特征转换为三维特征体,并将其压缩到鸟瞰图(BEV)平面上,从而促进跨视图信息融合;其次,提出了一种利用融合后的三维特征体几何信息的注意力机制,在每个视图中优化跟踪结果。这些改进使 MITracker 在 MVTrack 和 GMTD 数据集上实现了最先进的性能。

链接: https://arxiv.org/abs/2502.20111
作者: Mengjie Xu,Yitao Zhu,Haotian Jiang,Jiaming Li,Zhenrong Shen,Sheng Wang,Haolin Huang,Xinyu Wang,Qing Yang,Han Zhang,Qian Wang
机构: School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University (上海科技大学生物医学工程学院&先进医疗材料与器件国家重点实验室); School of Biomedical Engineering, Shanghai Jiao Tong University (上海交通大学生物医学工程学院); Shanghai Clinical Research and Trial Center (上海临床研究和试验中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features and provide stable tracking outcomes. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. The key advancements of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it into a bird’s eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that leverages geometric information from fused 3D feature volume to refine the tracking results at each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance. The code and the new dataset will be available at this https URL.
zh

[CV-13] UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

【速读】:该论文旨在解决单目度量深度估计(Monocular Metric Depth Estimation, MMDE)方法在跨域泛化能力上的不足,即现有方法在训练域之外的性能显著下降,尤其是在存在中等域差距的情况下。为应对这一挑战,论文提出了一种名为UniDepthV2的新模型,其关键在于通过直接从输入图像预测度量三维点的方式实现跨域通用的MMDE解决方案,而无需额外信息。具体而言,UniDepthV2引入了一个可自提示的相机模块,用于预测密集相机表示以调节深度特征,并采用伪球面输出表示来解耦相机与深度表示。此外,模型设计了几何不变性损失以增强相机引导深度特征的不变性。这些创新共同提升了边缘定位的精确性和深度输出的锐度,并通过引入不确定性级别输出增强了下游任务的可靠性。综合评估表明,UniDepthV2在零样本设置下表现出卓越的性能和泛化能力。

链接: https://arxiv.org/abs/2502.20110
作者: Luigi Piccinelli,Christos Sakaridis,Yung-Hsu Yang,Mattia Segu,Siyuan Li,Wim Abbeloos,Luc Van Gool
机构: ETH Zürich, Switzerland (瑞士苏黎世联邦理工学院); Toyota Motor Europe (丰田汽车欧洲), Belgium (比利时); INSAIT, Sofia University, Bulgaria (索非亚大学INSAIT研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2403.18913

点击查看摘要

Abstract:Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at this https URL
zh
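
摘要中的"伪球面输出表示"把相机(光线方向角)与深度解耦:角度描述光线方向,深度沿光线取值。下面用常见的球面坐标参数化做一个示意性还原;具体角度定义与参数化以论文为准,此处的实现与符号均为假设:

```python
import numpy as np

def pseudo_spherical_to_points(azimuth, elevation, log_depth):
    # 方位角/仰角描述相机光线方向(与内参解耦),对数深度沿光线取值
    r = np.exp(log_depth)
    x = r * np.cos(elevation) * np.sin(azimuth)
    y = r * np.sin(elevation)
    z = r * np.cos(elevation) * np.cos(azimuth)
    return np.stack([x, y, z], axis=-1)

# 光轴方向(两个角度为 0)、对数深度为 0 时应得到单位深度的正前方点
pts = pseudo_spherical_to_points(np.array([0.0]), np.array([0.0]), np.array([0.0]))
```

这种表示下,改变相机内参只影响每个像素对应的角度,而不影响沿光线的深度值,这是"解耦相机与深度表示"的直观体现。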

[CV-14] VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers

【速读】:该论文旨在解决自动驾驶中动态环境和corner cases(极端场景)对主车决策鲁棒性带来的挑战。解决方案的关键在于提出了一种名为VDT-Auto的新管道,其核心在于结合视觉语言模型(Visual Language Model, VLM)的状态理解进步与基于扩散Transformer的动作生成方法。具体而言,VDT-Auto通过鸟瞰图(BEV)编码器从周围图像中提取特征网格,并利用微调后的VLM将结构化输出转换为文本嵌入和噪声路径,从而实现对扩散过程的几何解析和上下文解析。在扩散过程中,前向过程的添加噪声采样自微调VLM的噪声路径输出,而提取的BEV特征网格和嵌入文本则作为逆向过程的条件,以此提升模型的决策精度和泛化能力。实验结果显示,在nuScenes开放环路规划评估中,VDT-Auto实现了平均L2误差0.52米和平均碰撞率21%的表现,并展现出显著的现实世界泛化能力。

链接: https://arxiv.org/abs/2502.20108
作者: Ziang Guo,Konstantin Gubernatorov,Selamawit Asfaw,Zakhar Yagudin,Dzmitry Tsetserukou
机构: Skolkovo Institute of Science and Technology (斯科尔科沃科学技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Submitted paper

点击查看摘要

Abstract:In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle’s decision-making. To address these challenges, commencing with the representation of state-action mapping in the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. Leveraging advances in the state understanding of Visual Language Models (VLMs), combined with diffusion Transformer-based action generation, our VDT-Auto parses the environment geometrically and contextually for the conditioning of the diffusion process. Geometrically, we use a bird’s-eye view (BEV) encoder to extract feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. During our diffusion process, the added noise for the forward process is sampled from the noisy path output of the fine-tuned VLM, while the extracted BEV feature grids and embedded texts condition the reverse process of our diffusion Transformers. Our VDT-Auto achieved an average L2 error of 0.52 m and an average collision rate of 21% in the nuScenes open-loop planning evaluation. Moreover, the real-world demonstration exhibited prominent generalizability of our VDT-Auto. The code and dataset will be released after acceptance.
zh

[CV-15] New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration

【速读】:该论文旨在解决跨模态任务中的指代表达理解(Referring Expression Comprehension, REC)问题,特别是在细粒度组合场景下模型需要同时处理复杂推理与目标排除能力的挑战。现有REC数据集在难度控制和对非目标排除能力的测试方面存在不足,为此,论文提出了一种新的REC数据集,具有可控难度级别以及包含通过细粒度编辑生成的负样本文本和图像的特点,以更全面地评估模型性能。

为了解决细粒度组合REC问题,论文提出了基于“专家模型(Specialist Model)-大型语言模型(MLLM)协作框架”的创新方法。关键在于利用两类模型的优势互补:专家模型擅长高效完成简单任务,而MLLM更适合复杂推理。论文设计了两种协作策略来实现这一目标:第一种慢快适应(Slow-Fast Adaptation, SFA)机制通过路由方式动态分配简单任务给专家模型,复杂任务给MLLM,并通过目标再聚焦策略减少两者的常见错误模式;第二种候选区域选择(Candidate Region Selection, CRS)方法基于专家模型生成多个边界框候选,并借助MLLM的强大推理能力确定正确目标。实验结果表明,SFA策略在定位准确率与效率之间实现了良好平衡,而CRS策略显著提升了两类模型的表现。本研究强调通过战略性结合现有工具而非重新发明来解决复杂的现实世界任务,为相关领域提供了有价值的参考。

链接: https://arxiv.org/abs/2502.20104
作者: Xuzheng Yang,Junzhuo Liu,Peng Wang,Guoqing Wang,Yang Yang,Heng Tao Shen
机构: University of Electronic Science and Technology of China (电子科技大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TPAMI under review

点击查看摘要

Abstract:Referring Expression Comprehension (REC) is a foundational cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. To advance this field, we introduce a new REC dataset with two key features. First, it is designed with controllable difficulty levels, requiring fine-grained reasoning across object categories, attributes, and relationships. Second, it incorporates negative text and images generated through fine-grained editing, explicitly testing a model’s ability to reject non-existent targets, an often-overlooked yet critical challenge in existing datasets. To address fine-grained compositional REC, we propose novel methods based on a Specialist-MLLM collaboration framework, leveraging the complementary strengths of them: Specialist Models handle simpler tasks efficiently, while MLLMs are better suited for complex reasoning. Based on this synergy, we introduce two collaborative strategies. The first, Slow-Fast Adaptation (SFA), employs a routing mechanism to adaptively delegate simple tasks to Specialist Models and complex tasks to MLLMs. Additionally, common error patterns in both models are mitigated through a target-refocus strategy. The second, Candidate Region Selection (CRS), generates multiple bounding box candidates based on Specialist Model and uses the advanced reasoning capabilities of MLLMs to identify the correct target. Extensive experiments on our dataset and other challenging compositional benchmarks validate the effectiveness of our approaches. The SFA strategy achieves a trade-off between localization accuracy and efficiency, and the CRS strategy greatly boosts the performance of both Specialist Models and MLLMs. We aim for this work to offer valuable insights into solving complex real-world tasks by strategically combining existing tools for maximum effectiveness, rather than reinventing them.
zh

[CV-16] WalnutData: A UAV Remote Sensing Dataset of Green Walnuts and Model Evaluation

【速读】:该论文试图解决农业计算机视觉领域缺乏与青皮核桃相关的数据集的问题。为促进该领域的算法设计,研究团队利用无人机采集了来自8个青皮核桃样本地块的遥感数据,并构建了一个名为WalnutData的大规模数据集,该数据集具有更高粒度的目标特征标注,以应对青皮核桃在不同光照条件和遮挡下的挑战。数据集包含30,240张图像和706,208个实例,分为四种类别:正向光照且无遮挡(A1)、背光且无遮挡(A2)、正向光照且遮挡(B1)以及背光且遮挡(B2)。解决方案的关键在于通过无人机采集多样化场景的数据并精细标注目标特征,从而形成一个能够有效评估主流算法性能的数据基准。论文还基于WalnutData对多种主流算法进行了评估,并将这些结果作为后续研究的基线标准。

链接: https://arxiv.org/abs/2502.20092
作者: Mingjie Wu,Chenggui Yang,Huihua Wang,Chen Xue,Yibo Wang,Haoyu Wang,Yansong Wang,Can Peng,Yuqi Han,Ruoyu Li,Lijun Yun,Zaiqing Chen,Songfan Shi,Luhao Fang,Shuyi Wan,Tingfeng Li,Shuangyao Liu,Haotian Feng
机构: School of Information, Yunnan Normal University (云南师范大学信息学院); Engineering Research Center of Computer Vision and Intelligent Control Technology, Department of Education of Yunnan Province (云南省教育厅计算机视觉与智能控制技术工程研究中心); School of Physics and Electronic Information, Yunnan Normal University (云南师范大学物理与电子信息学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The UAV technology is gradually maturing and can provide extremely powerful support for smart agriculture and precise monitoring. Currently, there is no dataset related to green walnuts in the field of agricultural computer vision. Thus, in order to promote the algorithm design in the field of agricultural computer vision, we used UAV to collect remote-sensing data from 8 walnut sample plots. Considering that green walnuts are subject to various lighting conditions and occlusion, we constructed a large-scale dataset with a higher granularity of target features - WalnutData. This dataset contains a total of 30,240 images and 706,208 instances, and there are 4 target categories: being illuminated by frontal light and unoccluded (A1), being backlit and unoccluded (A2), being illuminated by frontal light and occluded (B1), and being backlit and occluded (B2). Subsequently, we evaluated many mainstream algorithms on WalnutData and used these evaluation results as the baseline standard. The dataset and all evaluation results can be obtained at this https URL.
zh

[CV-17] OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels CVPR2025

【速读】:该论文旨在解决现有基于ConvNet的设计主要关注通过增加核大小来扩大感受野,而未能充分考虑自顶向下的注意力机制这一生物启发式原理以进一步提升性能的问题。论文的关键在于提出了一种新颖的纯ConvNet视觉主干网络OverLoCK,它从架构和混合器两个视角精心设计。具体而言,引入了生物启发式的Deep-stage Decomposition Strategy (DDS),通过在特征和核权重层面上提供动态的自顶向下上下文引导,将语义上有意义的上下文表示融合到中间和深层网络中。此外,还提出了Context-Mixing Dynamic Convolution (ContMix),这种新型卷积能够有效建模长距离依赖关系,同时即使输入分辨率增加也能保持固有的局部归纳偏置。这些特性在之前的卷积中不存在。通过DDS和ContMix的支持,OverLoCK在多个任务上表现出显著的性能提升。

链接: https://arxiv.org/abs/2502.20087
作者: Meng Lou,Yizhou Yu
机构: School of Computing and Data Science, The University of Hong Kong (香港大学计算机与数据科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:In the human vision system, top-down attention plays a crucial role in perception, wherein the brain initially performs an overall but rough scene analysis to extract salient cues (i.e., overview first), followed by a finer-grained examination to make more accurate judgments (i.e., look closely next). However, recent efforts in ConvNet designs primarily focused on increasing kernel size to obtain a larger receptive field without considering this crucial biomimetic mechanism to further improve performance. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both feature and kernel weight levels. To fully unleash the power of top-down context guidance, we further propose a novel Context-Mixing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases. These properties are absent in previous convolutions. With the support from both DDS and ContMix, our OverLoCK exhibits notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while only using around one-third of the FLOPs/parameters. On object detection with Cascade Mask R-CNN, our OverLoCK-S surpasses MogaNet-B by a significant 1% in AP^b. On semantic segmentation with UperNet, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at this https URL.
zh

[CV-18] SegLocNet: Multimodal Localization Network for Autonomous Driving via Birds-Eye-View Segmentation

【速读】:该论文旨在解决自动驾驶中鲁棒且精确的定位问题,面临的挑战包括传统基于GNSS的定位方法在城市环境中易受信号遮挡和多路径效应的影响,依赖高精地图(HD Map)的方法受限于高昂的地图构建与维护成本,而基于标准精地图(SD Map)的方法通常因过拟合而导致性能不佳或泛化能力不足。论文的关键解决方案是提出SegLocNet,这是一种无需GNSS的多模态定位网络,通过鸟瞰图(BEV)语义分割实现精准定位。SegLocNet利用BEV分割网络从多传感器输入生成语义地图,并通过穷尽匹配过程估计车辆的自车位姿(ego pose)。其创新点在于避免了基于回归的位姿估计局限性,保持了高可解释性和泛化能力,并通过引入统一的地图表示方式,使方法能够同时适用于HD Map和SD Map,无需修改网络架构即可平衡定位精度和覆盖范围。实验结果表明,该方法在nuScenes和Argoverse数据集上优于现有最先进的方法,能够在无GNSS辅助的情况下准确估计城市环境中的自车位姿,并具备强大的泛化能力。

链接: https://arxiv.org/abs/2502.20077
作者: Zijie Zhou,Zhangshuo Qi,Luqi Cheng,Guangming Xiong
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust and accurate localization is critical for autonomous driving. Traditional GNSS-based localization methods suffer from signal occlusion and multipath effects in urban environments. Meanwhile, methods relying on high-definition (HD) maps are constrained by the high costs associated with the construction and maintenance of HD maps. Standard-definition (SD) maps-based methods, on the other hand, often exhibit unsatisfactory performance or poor generalization ability due to overfitting. To address these challenges, we propose SegLocNet, a multimodal GNSS-free localization network that achieves precise localization using bird’s-eye-view (BEV) semantic segmentation. SegLocNet employs a BEV segmentation network to generate semantic maps from multiple sensor inputs, followed by an exhaustive matching process to estimate the vehicle’s ego pose. This approach avoids the limitations of regression-based pose estimation and maintains high interpretability and generalization. By introducing a unified map representation, our method can be applied to both HD and SD maps without any modifications to the network architecture, thereby balancing localization accuracy and area coverage. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that our method outperforms the current state-of-the-art methods, and that our method can accurately estimate the ego pose in urban environments without relying on GNSS, while maintaining strong generalization ability. Our code and pre-trained model will be released publicly.
zh
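
摘要中的"穷尽匹配"可以用一个极简的平移搜索来理解:在全局语义地图上滑动观测到的 BEV 分割结果,取语义一致像素数最多的位置作为位姿估计。实际方法还需搜索朝向并考虑概率建模,以下仅为概念草图,数据与函数名均为假设:

```python
import numpy as np

def exhaustive_match(global_map, bev_obs):
    # 在全局语义地图上穷举平移候选,取与观测 BEV 分割一致像素数最多的位置
    H, W = bev_obs.shape
    GH, GW = global_map.shape
    best_score, best_pose = -1, None
    for y in range(GH - H + 1):
        for x in range(GW - W + 1):
            crop = global_map[y:y + H, x:x + W]
            score = int((crop == bev_obs).sum())  # 语义一致像素计数
            if score > best_score:
                best_score, best_pose = score, (y, x)
    return best_pose, best_score

# 玩具示例:在 20x20 的 4 类语义地图中定位一块 5x5 观测
rng = np.random.default_rng(1)
gmap = rng.integers(0, 4, size=(20, 20))
obs = gmap[7:12, 3:8].copy()
pose, score = exhaustive_match(gmap, obs)
```

穷举搜索的代价为 O(GH·GW·H·W),实际系统通常用 GNSS 先验或上一帧位姿把搜索范围限制在局部邻域内。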

[CV-19] Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation CVPR2025

【速读】:该论文旨在解决现有胸部X射线(CXR)报告自动生成方法中因仅关注单一或固定视角图像而导致诊断准确性受限以及忽视疾病进展的问题。此外,尽管已有方法利用纵向数据追踪疾病进展,但它们仍然依赖单张图像来分析当前就诊情况。为了解决这些问题,论文提出了一种名为MLRG的方法,其关键是引入多视角纵向对比学习技术,该技术整合了当前多视角图像的空间信息与纵向数据的时间信息,并利用放射学报告的固有时空信息来监督视觉和文本表示的预训练。此外,还提出了令牌化缺失编码技术以灵活处理患者特定先验知识的缺失,使模型能够基于可用的先验知识生成更准确的放射学报告。

链接: https://arxiv.org/abs/2502.20056
作者: Kang Liu,Zhuoqi Ma,Xiaolu Kang,Yunan Li,Kun Xie,Zhicheng Jiao,Qiguang Miao
机构: School of Computer Science and Technology, Xidian University, Xi’an, China (计算机科学与技术学院, 西安电子科技大学); Xi’an Key Laboratory of Big Data and Intelligent Vision, Xi’an, China (西安大数据与智能视觉重点实验室); Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi’an, China (教育部协同智能系统重点实验室, 西安电子科技大学); Warren Alpert Medical School, Brown University, Providence, USA (沃伦·阿尔珀特医学学院, 布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Automated radiology report generation offers an effective solution to alleviate radiologists’ workload. However, most existing methods focus primarily on single or fixed-view images to model current disease conditions, which limits diagnostic accuracy and overlooks disease progression. Although some approaches utilize longitudinal data to track disease progression, they still rely on single images to analyze current visits. To address these issues, we propose enhanced contrastive learning with Multi-view Longitudinal data to facilitate chest X-ray Report Generation, named MLRG. Specifically, we introduce a multi-view longitudinal contrastive learning method that integrates spatial information from current multi-view images and temporal information from longitudinal data. This method also utilizes the inherent spatiotemporal information of radiology reports to supervise the pre-training of visual and textual representations. Subsequently, we present a tokenized absence encoding technique to flexibly handle missing patient-specific prior knowledge, allowing the model to produce more accurate radiology reports based on available prior knowledge. Extensive experiments on MIMIC-CXR, MIMIC-ABN, and Two-view CXR datasets demonstrate that our MLRG outperforms recent state-of-the-art methods, achieving a 2.3% BLEU-4 improvement on MIMIC-CXR, a 5.5% F1 score improvement on MIMIC-ABN, and a 2.7% F1 RadGraph improvement on Two-view CXR.
zh
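
MLRG 的多视角纵向对比学习建立在标准的图文对比目标之上。下面给出基础的对称 InfoNCE 形式作为参照;论文在此之上进一步融合了多视角空间信息与纵向时间信息,本段代码为通用示意,并非论文实现:

```python
import numpy as np

def info_nce(img_emb, txt_emb, tau=0.1):
    # 对称 InfoNCE:拉近成对的视觉/文本表示,推开非成对样本,正样本在对角线
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau  # 余弦相似度矩阵 / 温度

    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # 数值稳定的 log-softmax
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(3)
emb = rng.normal(size=(4, 16))
loss_aligned = info_nce(emb, emb)                    # 完全对齐时损失应很小
loss_random = info_nce(emb, rng.normal(size=(4, 16)))  # 随机配对时损失更大
```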

[CV-20] 3D-AffordanceLLM : Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds ICLR

【速读】:该论文旨在解决传统3D操作性检测范式在开放世界场景中的局限性,具体表现为依赖预定义标签且缺乏理解复杂自然语言的能力,导致其泛化能力受限。为了解决这些问题,论文提出了一个新的任务——指令推理操作性分割(IRAS),通过输入查询推理文本输出操作性掩码区域,从而避免固定类别的输入标签约束。解决方案的关键在于引入大型语言模型(LLMs)到3D操作性感知中,并设计了一个定制解码器用于生成操作性掩码,实现了开放世界推理操作性检测。此外,针对3D操作性数据集稀缺的问题,论文提出了一种多阶段训练策略,首先通过一种新颖的预训练任务——指代物体部分分割(ROPS),使模型具备对象部分级别的通用识别与分割能力,随后通过IRAS任务微调,赋予模型操作性检测的推理能力。最终,3D-ADLLM利用LLMs丰富的世界知识和人机交互推理能力,在开放词汇操作性检测任务中提升了约8%的mIoU。

链接: https://arxiv.org/abs/2502.20041
作者: Hengshuo Chu,Xiang Deng,Xiaoyang Chen,Yinchuan Li,Jianye Hao,Liqiang Nie
机构: Harbin Institute of Technology (Shenzhen); Huawei Noah’s Ark Lab (华为诺亚方舟实验室), Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICLR

点击查看摘要

Abstract:3D Affordance detection is a challenging problem with broad applications on various robotic tasks. Existing methods typically formulate the detection paradigm as a label-based semantic segmentation task. This paradigm relies on predefined labels and lacks the ability to comprehend complex natural language, resulting in limited generalization in open-world scenes. To address these limitations, we reformulate the traditional affordance detection paradigm into the Instruction Reasoning Affordance Segmentation (IRAS) task. This task is designed to output an affordance mask region given a query reasoning text, which avoids fixed categories of input labels. We accordingly propose 3D-AffordanceLLM (3D-ADLLM), a framework designed for reasoning affordance detection in 3D open-scene. Specifically, 3D-ADLLM introduces large language models (LLMs) to 3D affordance perception with a custom-designed decoder for generating affordance masks, thus achieving open-world reasoning affordance detection. In addition, given the scarcity of 3D affordance datasets for training large models, we seek to extract knowledge from general segmentation data and transfer it to affordance detection. Thus, we propose a multi-stage training strategy that begins with a novel pre-training task, i.e., Referring Object Part Segmentation (ROPS). This stage is designed to equip the model with general recognition and segmentation capabilities at the object-part level. Then followed by fine-tuning with the IRAS task, 3D-ADLLM obtains the reasoning ability for affordance detection. In summary, 3D-ADLLM leverages the rich world knowledge and human-object interaction reasoning ability of LLMs, achieving approximately an 8% improvement in mIoU on open-vocabulary affordance detection tasks.
zh

[CV-21] A2-GNN: Angle-Annular GNN for Visual Descriptor-free Camera Relocalization

【速读】:该论文旨在解决视觉定位(Visual Localization)中基于描述符的传统方法在存储需求、隐私问题及模型维护方面的挑战,同时克服现有无描述符方法准确性低或计算开销大的局限。论文的关键解决方案是引入Angle-Annular Graph Neural Network (A2-GNN),这是一种通过环形特征提取高效学习鲁棒几何结构表示的方法。其核心在于通过聚类邻居并将每组的距离信息与角度作为补充信息嵌入,以捕捉局部结构,从而实现高精度且低计算开销的2D-3D关键点匹配。

链接: https://arxiv.org/abs/2502.20036
作者: Yejun Zhang,Shuzhe Wang,Juho Kannala
机构: Aalto University (阿尔托大学); University of Oulu (奥卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in 2025 International Conference on 3D Vision (3DV)

点击查看摘要

Abstract:Visual localization involves estimating the 6-degree-of-freedom (6-DoF) camera pose within a known scene. A critical step in this process is identifying pixel-to-point correspondences between 2D query images and 3D models. Most advanced approaches currently rely on extensive visual descriptors to establish these correspondences, facing challenges in storage, privacy issues and model maintenance. Direct 2D-3D keypoint matching without visual descriptors is becoming popular as it can overcome those challenges. However, existing descriptor-free methods suffer from low accuracy or heavy computation. Addressing this gap, this paper introduces the Angle-Annular Graph Neural Network (A2-GNN), a simple approach that efficiently learns robust geometric structural representations with annular feature extraction. Specifically, this approach clusters neighbors and embeds each group’s distance information and angle as supplementary information to capture local structures. Evaluation on matching and visual localization datasets demonstrates that our approach achieves state-of-the-art accuracy with low computational overhead among visual description-free methods. Our code will be released on this https URL.
zh
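
A2-GNN 的"环形特征提取"大意是:以关键点为中心,按距离把邻居分入若干环带,并在每个环带内聚合距离与角度统计量,得到紧凑且对局部结构敏感的几何描述。以下为一个与论文思路一致的极简示意,环带划分方式与聚合函数均为假设,并非论文原始网络:

```python
import numpy as np

def annular_features(center, neighbors, n_rings=3):
    # 按到中心点的距离把邻居分入等宽环带,每个环带内聚合(均值距离, 角度单位向量均值)
    vec = neighbors - center
    dist = np.linalg.norm(vec, axis=1)
    angle = np.arctan2(vec[:, 1], vec[:, 0])
    edges = np.linspace(0, dist.max() + 1e-8, n_rings + 1)
    feats = []
    for i in range(n_rings):
        mask = (dist >= edges[i]) & (dist < edges[i + 1])
        if mask.any():
            feats += [dist[mask].mean(),
                      np.cos(angle[mask]).mean(),   # 用单位向量均值刻画角度分布
                      np.sin(angle[mask]).mean()]
        else:
            feats += [0.0, 0.0, 0.0]
    return np.array(feats)

# 玩具示例:4 个邻居、3 个环带,输出 3x3=9 维结构特征
pts = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0], [0.0, -0.5]])
f = annular_features(np.zeros(2), pts, n_rings=3)
```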

[CV-22] AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLM s

【速读】:该论文旨在解决多模态大型语言模型(MLLM)在多样化图像-文本数据集上的有效微调问题,重点在于处理复杂数据集中固有的冲突(源于不同模态的优化目标)与潜在的跨任务共性。现有方法通常分别处理这些冲突与共性,而未进行统一建模。为了解决这一问题,论文提出了一种名为AsymLoRA的参数高效微调框架,其关键是通过非对称LoRA实现知识模块化与跨模态协调:引入特定任务的低秩投影矩阵B以保留针对冲突目标的不同适应路径,同时采用共享投影矩阵A来整合跨模态的共性。实验结果表明,AsymLoRA在多个基准测试中优于仅捕获共性的标准LoRA以及专注于冲突的LoRA-MoE,实现了更优的模型性能与系统效率。

链接: https://arxiv.org/abs/2502.20035
作者: Xuyang Wei,Chunlin Tian,Li Li
机构: University of Macau (澳门大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effective instruction fine-tuning on diverse image-text datasets is crucial for developing a versatile Multimodal Large Language Model (MLLM), where dataset composition dictates the model’s adaptability across multimodal tasks. However, complex datasets often contain inherent conflicts – stemming from modality-specific optimization objectives – and latent commonalities that enable cross-task transfer, which most existing approaches handle separately. To bridge this gap, we introduce AsymLoRA, a parameter-efficient tuning framework that unifies knowledge modularization and cross-modal coordination via asymmetric LoRA: task-specific low-rank projections (matrix B) that preserve distinct adaptation pathways for conflicting objectives, and a shared projection (matrix A) that consolidates cross-modal commonalities. Extensive evaluations demonstrate that AsymLoRA consistently surpasses both vanilla LoRA, which captures only commonalities, and LoRA-MoE, which focuses solely on conflicts, achieving superior model performance and system efficiency across diverse benchmarks. Code: this https URL.
zh
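
非对称 LoRA 的结构可以用几行代码概括:所有任务共享投影矩阵 A(承载跨模态共性),每个任务持有独立的低秩矩阵 B_t(保留冲突目标各自的适应路径),权重增量为 ΔW = B_t A。以下为概念示意,初始化与维度均为假设,并非官方实现:

```python
import numpy as np

class AsymLoRALayer:
    def __init__(self, d_in, d_out, rank, n_tasks, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))        # 冻结的预训练权重
        self.A = rng.normal(size=(rank, d_in)) * 0.01  # 所有任务共享的 A
        # 任务专属 B_t,零初始化:训练开始时等价于原模型
        self.B = [np.zeros((d_out, rank)) for _ in range(n_tasks)]

    def forward(self, x, task_id):
        delta = self.B[task_id] @ self.A  # 低秩增量 ΔW = B_t A
        return (self.W + delta) @ x

layer = AsymLoRALayer(d_in=8, d_out=4, rank=2, n_tasks=3)
x = np.ones(8)
y0 = layer.forward(x, task_id=0)
```

由于 B_t 零初始化,所有任务的初始输出都与冻结模型一致;训练后更新某个任务的 B_t 只改变该任务的适应路径,而共享的 A 仍聚合跨任务共性。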

[CV-23] Multi-Keypoint Affordance Representation for Functional Dexterous Grasping

【速读】:该论文致力于解决功能性灵巧抓取中视觉感知与操作之间脱节的问题。现有基于affordance的方法主要预测粗略的交互区域,无法直接约束抓取姿态,导致视觉感知与操作之间的不连贯性。为了解决这一问题,论文的关键创新在于提出了功能接触点定位的多关节点affordance表示方法(Contact-guided Multi-Keypoint Affordance, CMKA),通过弱监督结合大模型实现任务驱动抓取配置的编码,并引入基于关节点的抓取矩阵变换(Keypoint-based Grasp Matrix Transformation, KGT)方法确保手部关节点与物体接触点的空间一致性,从而建立视觉感知与灵巧抓取动作之间的直接联系。实验结果表明,该方法显著提升了affordance定位精度、抓取一致性和对未见过工具及任务的泛化能力。

链接: https://arxiv.org/abs/2502.20018
作者: Fan Yang,Dongsheng Luo,Wenrui Chen,Jiacheng Lin,Junjie Cai,Kailun Yang,Zhiyong Li,Yaonan Wang
机构: The authors are with the School of Robotics, Hunan University, China(作者隶属于湖南大学机器人学院,中国);
The authors are also with the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, China(作者还隶属于湖南大学机器人视觉感知与控制技术国家工程研究中心,中国);
The authors are with the College of Computer Science and Electronic Engineering, Hunan University, Changsha, China(作者隶属于湖南大学计算机科学与电子工程学院,中国)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: The source code and demo videos will be publicly available at this https URL

点击查看摘要

Abstract:Functional dexterous grasping requires precise hand-object interaction, going beyond simple gripping. Existing affordance-based methods primarily predict coarse interaction regions and cannot directly constrain the grasping posture, leading to a disconnection between visual perception and manipulation. To address this issue, we propose a multi-keypoint affordance representation for functional dexterous grasping, which directly encodes task-driven grasp configurations by localizing functional contact points. Our method introduces Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping experience images for weak supervision combined with Large Vision Models for fine affordance feature extraction, achieving generalization while avoiding manual keypoint annotations. Additionally, we present a Keypoint-based Grasp matrix Transformation (KGT) method, ensuring spatial consistency between hand keypoints and object contact points, thus providing a direct link between visual perception and dexterous grasping actions. Experiments on public real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks demonstrate that our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks, bridging the gap between visual affordance learning and dexterous robotic manipulation. The source code and demo videos will be publicly available at this https URL.
zh
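
KGT 要求手部关键点与物体接触点保持空间一致。一个常见的做法是用 Kabsch/SVD 刚体配准求出把一组关键点对齐到接触点的旋转与平移;论文中抓取矩阵变换的具体形式请以原文为准,此处仅以经典刚体配准作概念示意:

```python
import numpy as np

def kabsch_align(P, Q):
    # 求把点集 P 对齐到 Q 的刚体变换 (R, t),使 Q ≈ P @ R.T + t
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                # 去中心化后的协方差矩阵
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # 修正符号,防止出现反射
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

# 验证:对已知旋转+平移生成的无噪声对应点,应能精确恢复变换
rng = np.random.default_rng(2)
P = rng.normal(size=(6, 3))                  # 6 个"手部关键点"(假设数据)
theta = 0.5
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([1.0, -2.0, 0.3])    # 变换后的"接触点"
R, t = kabsch_align(P, Q)
```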

[CV-24] Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up

【速读】:该论文旨在解决传统语义匹配技术在处理复杂查询时难以捕捉细粒度跨模态交互的问题。尽管晚期融合的双塔架构通过独立编码视觉和文本数据并在高层合并来尝试弥补这一差距,但它们往往忽略了全面理解所需的微妙交互作用。论文的关键在于提出了一种统一的检索框架,从底层融合视觉和文本线索,实现早期跨模态交互以增强上下文解释能力。解决方案的核心是采用简单的单塔架构,通过两阶段训练过程——包括后训练适应和指令微调——将多语言大型模型(MLLMs)适配为检索器,并强调早期集成策略在需要模态融合的任务中的显著优势,为上下文感知且高效的检索提供了新方向。

链接: https://arxiv.org/abs/2502.20008
作者: Lang Huang,Qiyu Wu,Zhongtao Miao,Toshihiko Yamasaki
机构: The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Information retrieval is indispensable for today’s Internet applications, yet traditional semantic matching techniques often fall short in capturing the fine-grained cross-modal interactions required for complex queries. Although late-fusion two-tower architectures attempt to bridge this gap by independently encoding visual and textual data before merging them at a high level, they frequently overlook the subtle interplay essential for comprehensive understanding. In this work, we rigorously assess these limitations and introduce a unified retrieval framework that fuses visual and textual cues from the ground up, enabling early cross-modal interactions for enhancing context interpretation. Through a two-stage training process–comprising post-training adaptation followed by instruction tuning–we adapt MLLMs as retrievers using a simple one-tower architecture. Our approach outperforms conventional methods across diverse retrieval scenarios, particularly when processing complex multi-modal inputs. Notably, the joint fusion encoder yields greater improvements on tasks that require modality fusion compared to those that do not, underscoring the transformative potential of early integration strategies and pointing toward a promising direction for contextually aware and effective information retrieval.
zh

[CV-25] Low-rank tensor completion via a novel minimax p-th order concave penalty function

【速读】:该论文旨在解决低秩张量完成(Low-rank Tensor Completion, LRTC)中非凸松弛方法处理小奇异值不充分的问题。传统常用的最小极大凹罚函数(Minimax Concave Penalty, MCP)在保持大奇异值方面表现良好,但在处理小奇异值时存在不足。为此,论文提出了一种新的最小极大 p 阶凹罚函数(Minimax p-th Order Concave Penalty, MPCP),并基于此构建了一个张量 p 阶 τ 范数作为非凸松弛方法用于张量秩估计,从而建立了基于 MPCP 的 LRTC 模型。此外,论文还提供了所提方法的收敛性理论保证。实验结果表明,该方法在多个真实数据集上不仅提升了视觉质量,还在定量指标上优于现有最先进的方法。
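下面用 Python 给出标准 MCP 罚函数及其奇异值收缩(近端)算子的一个极简示意,帮助理解"保留大奇异值、收缩小奇异值"的思路;论文提出的 MPCP 在此基础上引入 p 阶项,其精确形式以原文为准,此处未复现,参数 lam、gamma 均为示意取值:

```python
import numpy as np

def mcp(t, lam=1.0, gamma=2.0):
    """标准 MCP 罚函数 (t >= 0): 小奇异值受罚, 大奇异值罚值封顶。"""
    t = np.asarray(t, dtype=float)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2 * gamma),
                    gamma * lam**2 / 2)

def mcp_shrink(sigma, lam=1.0, gamma=2.0):
    """MCP 的近端(收缩)算子 (要求 gamma > 1):
    大奇异值原样保留, 小奇异值被收缩或置零。"""
    sigma = np.asarray(sigma, dtype=float)
    shrunk = np.maximum(sigma - lam, 0.0) * gamma / (gamma - 1)
    return np.where(sigma > gamma * lam, sigma, np.minimum(shrunk, sigma))
```

例如 lam=1、gamma=2 时,奇异值 0.5 被置零、1.5 收缩为 1.0、5.0 保持不变,体现了对大小奇异值的差异化处理。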

链接: https://arxiv.org/abs/2502.19979
作者: Hongbing Zhang
机构: School of Mathematics and Statistics, Lanzhou University (兰州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages,12 figures

点击查看摘要

Abstract:Low-rank tensor completion (LRTC) has attracted significant attention in fields such as computer vision and pattern recognition. Among the various techniques employed in LRTC, non-convex relaxation methods have been widely studied for their effectiveness in handling tensor singular values, which are crucial for accurate tensor recovery. However, the minimax concave penalty (MCP) function, a commonly used non-convex relaxation, exhibits a critical limitation: it effectively preserves large singular values but inadequately processes small ones. To address this issue, a novel minimax p-th order concave penalty (MPCP) function is proposed. Building on this advancement, a tensor p-th order τ norm is proposed as a non-convex relaxation for tensor rank estimation, thereby establishing an MPCP-based LRTC model. Furthermore, theoretical guarantees of convergence are provided for the proposed method. Experimental results on multiple real datasets demonstrate that the proposed method outperforms the state-of-the-art methods in both visual quality and quantitative metrics.
zh

[CV-26] Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios

【速读】:该论文试图解决多模态大型语言模型在复杂场景下整合多种感知输入并进行组合推理的能力评估问题。现有基准通常关注单一模态或多图像的视觉理解,而忽视了对多种感知信息间组合推理的需求。为探索先进模型在此方面的表现,论文提出了两个新基准:Clue-Visual Question Answering (CVQA) 和 Clue of Password-Visual Question Answering (CPVQA),分别用于评估视觉理解和合成以及精确解读与应用视觉数据的能力。解决方案的关键在于提供三种插件式方法:利用模型输入进行推理、通过最小边缘解码结合随机生成增强推理能力、检索语义相关的视觉信息以实现高效的数据集成。实验结果表明,即使最先进的闭源模型在CVQA上的准确率仅为33.04%,在CPVQA上降至7.38%,但采用上述方法后,模型在组合推理任务中的性能显著提升,CVQA提高了22.17%,CPVQA提高了9.40%,证明了这些方法的有效性。
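其中"最小边缘解码结合随机生成"的具体算法摘要中未给出,下面是一种假设性的直观示意:当 top-2 候选的概率差(margin)足够大时取贪心答案,否则引入随机采样;阈值 0.1 与采样方式均为假设,并非论文原方法:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def top2_margin(logits):
    """最可能的两个候选之间的概率差。"""
    p = np.sort(softmax(np.asarray(logits, dtype=float)))[::-1]
    return p[0] - p[1]

def margin_guided_choice(candidate_logits, threshold=0.1, rng=None):
    """margin 足够大时取贪心答案; 否则按概率随机采样 (即"随机生成"一步)。"""
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(candidate_logits, dtype=float)
    if top2_margin(logits) >= threshold:
        return int(np.argmax(logits))
    return int(rng.choice(len(logits), p=softmax(logits)))
```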

链接: https://arxiv.org/abs/2502.19973
作者: Chao Wang,Luning Zhang,Zheng Wang,Yang Zhou
机构: Shanghai University (上海大学); Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11pages

点击查看摘要

Abstract:Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated cognitive function in humans. With advancements in multi-modal large language models, recent benchmarks tend to evaluate visual understanding across multiple images. However, they often overlook the necessity of combinatorial reasoning across multiple perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types to assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum margin decoding with randomness generation, and retrieving semantically relevant visual information for effective data integration. The combined results reveal current models’ poor performance on combinatorial reasoning benchmarks, even the state-of-the-art (SOTA) closed-source model achieves only 33.04% accuracy on CVQA, and drops to 7.38% on CPVQA. Notably, our approach improves the performance of models on combinatorial reasoning, with a 22.17% boost on CVQA and 9.40% on CPVQA over the SOTA closed-source model, demonstrating its effectiveness in enhancing combinatorial reasoning with multiple perceptual inputs in complex scenarios. The code will be publicly available.
zh

[CV-27] ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning CVPR2025

【速读】:该论文试图解决多模态数据集中由于存在不匹配数据对而导致无法准确识别真实对应关系的问题。现有方法主要关注跨模态对象表示之间的相似性匹配,可能忽视了模态内关键的关系一致性,而这对于区分真实与虚假对应关系尤为重要。这种忽视可能导致负样本被错误地识别为正样本,从而引发性能下降。

为了解决这一问题,论文提出了一种名为ReCon(Relation Consistency学习框架)。其关键是通过引入一种新颖的关系一致性学习方法,确保双重对齐:一是跨模态关系一致性,二是模态内的关系一致性。借助这两重约束,ReCon显著提高了真实对应关系判别的有效性,并可靠地过滤掉不匹配的配对以降低错误监督的风险。实验结果表明,ReCon在Flickr30K、MS-COCO和Conceptual Captions三个广泛使用的基准数据集上的表现优于其他最先进的方法。
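模态内"关系一致性"可以粗略理解为:同一批样本在图像端与文本端的两两相似度结构应当吻合。下面给出一个示意性的度量,并非论文的原始损失函数:

```python
import numpy as np

def relation_matrix(feats):
    """模态内关系矩阵: 样本两两之间的余弦相似度。"""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def relation_consistency(img_feats, txt_feats):
    """图像端与文本端关系结构的吻合程度, 1 表示完全一致;
    一致性过低的样本对可视为疑似错配 (仅为示意性度量)。"""
    ri, rt = relation_matrix(img_feats), relation_matrix(txt_feats)
    return 1.0 - float(np.abs(ri - rt).mean())
```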

链接: https://arxiv.org/abs/2502.19962
作者: Quanxing Zha,Xin Liu,Shu-Juan Peng,Yiu-ming Cheung,Xing Xu,Nannan Wang
机构: Huaqiao University (华侨大学); Hong Kong Baptist University (香港浸会大学); University of Electronic Science and Technology of China (电子科技大学); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 10 pages, 4 figures, Accepted by CVPR2025

点击查看摘要

Abstract:Can we accurately identify the true correspondences from multimodal datasets containing mismatched data pairs? Existing methods primarily emphasize the similarity matching between the representations of objects across modalities, potentially neglecting the crucial relation consistency within modalities that is particularly important for distinguishing the true and false correspondences. Such an omission often runs the risk of misidentifying negatives as positives, thus leading to unanticipated performance degradation. To address this problem, we propose a general Relation Consistency learning framework, namely ReCon, to accurately discriminate the true correspondences among the multimodal data and thus effectively mitigate the adverse impact caused by mismatches. Specifically, ReCon leverages a novel relation consistency learning to ensure a dual alignment: the cross-modal relation consistency between different modalities and the intra-modal relation consistency within each modality. Thanks to such dual constraints on relations, ReCon significantly enhances its effectiveness for true correspondence discrimination and therefore reliably filters out the mismatched pairs to mitigate the risks of wrong supervision. Extensive experiments on three widely-used benchmark datasets, including Flickr30K, MS-COCO, and Conceptual Captions, are conducted to demonstrate the effectiveness and superiority of ReCon compared with other SOTAs. The code is available at: this https URL.
zh

[CV-28] ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models

【速读】:该论文旨在解决现有基于大型视觉语言模型(LVLM)的行人重识别(Re-ID)方法存在的局限性,包括依赖固定模板提取文本嵌入以及无法采用视觉问答(VQA)推理格式的问题,从而限制其在多模态任务中的通用性和适用性。论文的关键创新在于提出了一种名为ChatReID的新框架,并引入了分层渐进调优(Hierarchical Progressive Tuning, HPT)策略,通过逐步提升模型区分行人身份的能力,实现细粒度的身份级检索。这一方案显著提升了模型的灵活性与实用性,在十项基准测试中超越了当前最优(SOTA)方法。

链接: https://arxiv.org/abs/2502.19958
作者: Ke Niu,Haiyang Yu,Mengyang Zhao,Teng Fu,Siyang Yi,Wei Lu,Bin Li,Xuelin Qian,Xiangyang Xue
机构: Fudan University (复旦大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Person re-identification (Re-ID) is a critical task in human-centric intelligent systems, enabling consistent identification of individuals across different camera views using multi-modal query information. Recent studies have successfully integrated LVLMs with person Re-ID, yielding promising results. However, existing LVLM-based methods face several limitations. They rely on extracting textual embeddings from fixed templates, which are used either as intermediate features for image representation or for prompt tuning in domain-specific tasks. Furthermore, they are unable to adopt the VQA inference format, significantly restricting their broader applicability. In this paper, we propose a novel, versatile, one-for-all person Re-ID framework, ChatReID. Our approach introduces a Hierarchical Progressive Tuning (HPT) strategy, which ensures fine-grained identity-level retrieval by progressively refining the model’s ability to distinguish pedestrian identities. Extensive experiments demonstrate that our approach outperforms SOTA methods across ten benchmarks in four different Re-ID settings, offering enhanced flexibility and user-friendliness. ChatReID provides a scalable, practical solution for real-world person Re-ID applications, enabling effective multi-modal interaction and fine-grained identity discrimination.
zh

[CV-29] RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges

【速读】:该论文试图解决相机位姿估计在不同几何挑战下的方法局限性问题。现有基准测试对方法在多种几何困难下的表现提供有限洞察,而该研究通过引入RUBIK这一新颖基准来系统性评估图像匹配方法在明确定义的几何难度等级中的性能。RUBIK利用重叠度(overlap)、尺度比(scale ratio)和视角角(viewpoint angle)这三个互补标准,将nuScenes数据集中的16.5K图像对组织成33个难度级别。论文的关键解决方案在于通过这种全面的基准测试揭示了无检测器方法虽然达到最高成功率(47%),但其计算开销显著高于基于检测器的方法(150-600ms vs. 40-70ms),并且即使是表现最佳的方法也只能成功处理54.8%的图像对,这表明在低重叠、大尺度差异和极端视角变化等复杂场景下仍有很大改进空间。
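按"重叠度、尺度比、视角角"三个互补标准划分难度等级的思路可以用如下示意代码理解;注意其中的分箱阈值与加法打分均为假设值,RUBIK 的 33 级划分以论文定义为准:

```python
import numpy as np

# 示意性分箱阈值 -- RUBIK 的实际阈值由论文定义, 与此处假设值不同。
OVERLAP_BINS   = [0.1, 0.3, 0.5]   # 1 - 重叠度 (重叠越低越难)
SCALE_BINS     = [1.5, 3.0]        # 两图尺度比
VIEWPOINT_BINS = [15.0, 45.0]      # 光轴夹角 (度)

def difficulty_level(overlap, scale_ratio, viewpoint_deg):
    """把一个图像对映射到一个粗略难度分数: 越大越难。"""
    o = int(np.digitize(1.0 - overlap, OVERLAP_BINS))
    s = int(np.digitize(scale_ratio, SCALE_BINS))
    v = int(np.digitize(viewpoint_deg, VIEWPOINT_BINS))
    return o + s + v   # 简单加法打分, 取值 0 (最易) 到 7 (最难)
```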

链接: https://arxiv.org/abs/2502.19955
作者: Thibaut Loiseau,Guillaume Bourmaud
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camera pose estimation is crucial for many computer vision applications, yet existing benchmarks offer limited insight into method limitations across different geometric challenges. We introduce RUBIK, a novel benchmark that systematically evaluates image matching methods across well-defined geometric difficulty levels. Using three complementary criteria - overlap, scale ratio, and viewpoint angle - we organize 16.5K image pairs from nuScenes into 33 difficulty levels. Our comprehensive evaluation of 14 methods reveals that while recent detector-free approaches achieve the best performance (47% success rate), they come with significant computational overhead compared to detector-based methods (150-600ms vs. 40-70ms). Even the best performing method succeeds on only 54.8% of the pairs, highlighting substantial room for improvement, particularly in challenging scenarios combining low overlap, large scale differences, and extreme viewpoint changes. Benchmark will be made publicly available.
zh

[CV-30] Space Rotation with Basis Transformation for Training-free Test-Time Adaptation

【速读】:该论文旨在解决视觉-语言模型(Visual-Language Model, VLM)在测试时适应(Test-Time Adaptation, TTA)任务中面临的两个主要挑战:一是现有方法通常需要大量的计算资源,二是受限于原始特征空间的局限性,导致其在处理测试时分布变化时效果不佳。为了解决这些问题,论文提出了一种无需训练的特征空间旋转与基变换方法。该方案的关键在于通过利用类别间的固有差异重构原始特征空间,并将其映射到一个新的表示形式,从而增强类别差异的清晰度,为模型在测试阶段提供更有效的指导。此外,为了更好地从不同类别中捕获相关信息,文中引入了一个动态队列来存储代表性样本。实验结果表明,所提方法在多个基准数据集上不仅性能优越,而且效率更高。
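"利用类别间固有差异重构特征空间"的一个极简示意:以类别原型之间的差向量经正交化张成判别子空间,并将测试特征投影到该子空间。这只是粗略近似,并非论文中基变换与空间旋转的精确做法:

```python
import numpy as np

def class_difference_basis(class_protos):
    """由类别原型差向量 (相对第 0 类) 经 QR 正交化得到的子空间基。"""
    diffs = (class_protos[1:] - class_protos[:1]).T   # 形状 (d, C-1)
    q, _ = np.linalg.qr(diffs)
    return q

def project_features(feats, basis):
    """把测试特征投影到类别判别子空间, 抑制与分类无关的方向。"""
    return feats @ basis @ basis.T
```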

链接: https://arxiv.org/abs/2502.19946
作者: Chenhao Ding,Xinyuan Gao,Songlin Dong,Yuhang He,Qiang Wang,Xiang Song,Alex Kot,Yihong Gong
机构: School of Software Engineering, Xi’an Jiaotong University (西安交通大学软件学院); College of Artificial Intelligence, Xi’an Jiaotong University (西安交通大学人工智能学院); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the development of visual-language models (VLM) in downstream task applications, test-time adaptation methods based on VLM have attracted increasing attention for their ability to address distribution shifts at test time. Although prior approaches have achieved some progress, they typically either demand substantial computational resources or are constrained by the limitations of the original feature space, rendering them less effective for test-time adaptation tasks. To address these challenges, we propose a training-free feature space rotation with basis transformation for test-time adaptation. By leveraging the inherent distinctions among classes, we reconstruct the original feature space and map it to a new representation, thereby enhancing the clarity of class differences and providing more effective guidance for the model during testing. Additionally, to better capture relevant information from various classes, we maintain a dynamic queue to store representative samples. Experimental results across multiple benchmarks demonstrate that our method outperforms state-of-the-art techniques in terms of both performance and efficiency.
zh

[CV-31] Image Referenced Sketch Colorization Based on Animation Creation Workflow

【速读】:该论文旨在解决基于草图的颜色上色任务中现有方法存在的三个主要问题:文本引导方法难以提供精确的颜色和风格参考,提示引导方法仍需人工操作,而图像参考方法容易产生伪影。为克服这些局限性,论文提出了一种基于扩散模型的框架,灵感来源于实际动画制作流程。该方法的关键在于利用草图为空间引导,并以RGB图像作为颜色参考,通过空间掩码分别提取前景和背景区域;同时引入带有LoRA(低秩适应)模块的分拆交叉注意力机制,将前景和背景区域分开训练以控制跨注意力中的键和值对应的嵌入,从而让扩散模型能够独立整合前景和背景信息,避免干扰并消除空间伪影。此外,在推理阶段,设计了可切换的推理模式以适应多样化的应用场景。
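"分拆交叉注意力"的核心是:同一组查询分别只对参考图的前景、背景两组键值做注意力,再合并结果,从而避免两区域互相干扰。下面是去掉 LoRA 与多头细节后的一个极简 numpy 示意,假设前景与背景掩码均非空:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_cross_attention(q, kv, fg_mask):
    """q: (Nq, d) 草图侧查询; kv: (Nk, d) 参考图特征;
    fg_mask: (Nk,) 布尔前景掩码。前景/背景分开注意后取平均。"""
    out = np.zeros_like(q, dtype=float)
    for mask in (fg_mask, ~fg_mask):
        k = v = kv[mask]
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
        out += attn @ v
    return out / 2
```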

链接: https://arxiv.org/abs/2502.19937
作者: Dingkun Yan,Xinrui Wang,Zhuoru Li,Suguru Saito,Yusuke Iwasawa,Yutaka Matsuo,Jiaxian Guo
机构: Institute of Science Tokyo (东京科学研究所); The University of Tokyo (东京大学); Project HAT (项目HAT)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Sketch colorization plays an important role in animation and digital illustration production tasks. However, existing methods still meet problems in that text-guided methods fail to provide accurate color and style reference, hint-guided methods still involve manual operation, and image-referenced methods are prone to cause artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. Our approach leverages the sketch as the spatial guidance and an RGB image as the color reference, and separately extracts foreground and background from the reference image with spatial masks. Particularly, we introduce a split cross-attention mechanism with LoRA (Low-Rank Adaptation) modules. They are trained separately with foreground and background regions to control the corresponding embeddings for keys and values in cross-attention. This design allows the diffusion model to integrate information from foreground and background independently, preventing interference and eliminating the spatial artifacts. During inference, we design switchable inference modes for diverse use scenarios by changing modules activated in the framework. Extensive qualitative and quantitative experiments, along with user studies, demonstrate our advantages over existing methods in generating high-quality artifact-free results with geometrically mismatched references. Ablation studies further confirm the effectiveness of each component. Codes are available at this https URL tellurion-kanata/colorizeDiffusion.
zh

[CV-32] Identity-preserving Distillation Sampling by Fixed-Point Iterator

【速读】:该论文旨在解决基于文本条件的图像生成与编辑中因噪声梯度导致的模糊问题,以及现有去偏技术仍受错误梯度影响的局限。为解决这些问题,论文提出了Identity-preserving Distillation Sampling (IDS),通过补偿导致结果意外变化的梯度来保留对象的身份信息。关键在于引入了一种新的正则化技术——固定点迭代正则化 (Fixed-point Iterative Regularization, FPR),该技术直接修改文本条件下的得分函数本身,从而在图像到图像编辑和可编辑神经辐射场 (NeRF) 中实现结构一致性及身份(包括姿态和结构)的保持。得益于FPR的自校正能力,所提出的方法能够提供清晰且无歧义的表示,同时显著改善源数据与编辑数据之间的结构一致性。
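FPR 的数值骨架就是经典的不动点迭代 x_{k+1} = f(x_k):反复把当前估计代入修正映射,直到收敛到满足 x = f(x) 的自洽解。下面是一个通用的迭代器示意,与论文中对得分函数的具体修正方式无关:

```python
import numpy as np

def fixed_point_iterate(f, x0, tol=1e-8, max_iter=200):
    """通用不动点迭代: 反复应用 f 直到更新量小于 tol。"""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = f(x)
        if np.max(np.abs(x_next - x)) < tol:
            return x_next
        x = x_next
    return x

# 经典例子: x = cos(x) 的不动点 (Dottie 数, 约 0.739085)。
root = fixed_point_iterate(np.cos, 1.0)
```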

链接: https://arxiv.org/abs/2502.19930
作者: SeonHwa Kim,Jiwon Kim,Soobin Park,Donghoon Ahn,Jiwon Kang,Seungryong Kim,Kyong Hwan Jin,Eunju Cha
机构: Korea University (韩国大学); Sookmyung Women’s University (淑明女子大学); KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Score distillation sampling (SDS) demonstrates a powerful capability for text-conditioned 2D image and 3D object generation by distilling the knowledge from learned score functions. However, SDS often suffers from blurriness caused by noisy gradients. When SDS is applied to image editing, such degradations can be reduced by adjusting bias shifts using reference pairs, but the de-biasing techniques are still corrupted by erroneous gradients. To this end, we introduce Identity-preserving Distillation Sampling (IDS), which compensates for the gradient leading to undesired changes in the results. Based on the analysis that these errors come from the text-conditioned scores, a new regularization technique, called fixed-point iterative regularization (FPR), is proposed to modify the score itself, driving the preservation of the identity even including poses and structures. Thanks to a self-correction by FPR, the proposed method provides clear and unambiguous representations corresponding to the given prompts in image-to-image editing and editable neural radiance field (NeRF). The structural consistency between the source and the edited data is obviously maintained compared to other state-of-the-art methods.
zh

[CV-33] Incremental Learning with Repetition via Pseudo-Feature Projection

【速读】:该论文致力于解决增量学习(Incremental Learning)场景未能充分反映真实世界推理任务的问题,这些任务通常具有不严格的任务边界,并在连续数据流中重复出现共同类别和概念。论文提出了新的包含部分重复和任务混合的场景,其中重复模式内生于场景且未知于策略。研究探讨了无样本(exemplar-free)增量学习策略在数据重复情况下的表现,并调整了一系列最先进的方法以在两种设定下进行公平比较。此外,论文还提出了一种新颖的方法(Horde),能够动态调整一组自依赖特征提取器,并通过利用类别重复来对其对齐。关键在于Horde方法能够有效处理数据重复问题,在经典无重复场景中取得竞争性结果,而在有重复的场景中达到最先进性能。

链接: https://arxiv.org/abs/2502.19922
作者: Benedikt Tscheschner,Eduardo Veas,Marc Masana
机构: Know-Center Research GmbH (Know-Center 研究有限公司); Institute of Visual Computing, TU Graz (格拉茨技术大学视觉计算研究所); SAL Dependable Embedded Systems, Silicon Austria Labs (硅奥地利实验室可靠的嵌入式系统部门)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Incremental Learning scenarios do not always represent real-world inference use-cases, which tend to have less strict task boundaries, and exhibit repetition of common classes and concepts in their continual data stream. To better represent these use-cases, new scenarios with partial repetition and mixing of tasks are proposed, where the repetition patterns are innate to the scenario and unknown to the strategy. We investigate how exemplar-free incremental learning strategies are affected by data repetition, and we adapt a series of state-of-the-art approaches to analyse and fairly compare them under both settings. Further, we also propose a novel method (Horde), able to dynamically adjust an ensemble of self-reliant feature extractors, and align them by exploiting class repetition. Our proposed exemplar-free method achieves competitive results in the classic scenario without repetition, and state-of-the-art performance in the one with repetition.
zh

[CV-34] CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-scale Reinforcement Learning in Autonomous Driving CVPR2025

【速读】:本文旨在解决自动驾驶轨迹规划中基于强化学习(Reinforcement Learning, RL)方法面临的训练效率低下以及难以处理大规模真实驾驶场景的问题。为应对这些挑战,论文提出了一种名为\textbf{CarPlanner}的新方法,其核心在于结合一致性约束的自回归结构(auto-regressive structure with consistency constraints)。这种设计不仅实现了高效的大规模RL训练,还通过在时间步之间保持一致性的策略学习确保了稳定性。此外,CarPlanner采用生成-选择框架,并引入专家引导的奖励函数与不变视角模块,进一步简化了RL训练过程并提升了策略性能。实验结果表明,该RL框架显著改善了训练效率与性能表现,在nuPlan这一具有挑战性的大规模真实世界数据集上超越了现有的基于IL(Imitation Learning)和规则的方法,确立了其作为自动驾驶轨迹规划领域有前景解决方案的地位。

链接: https://arxiv.org/abs/2502.19908
作者: Dongkun Zhang,Jiaming Liang,Ke Guo,Sha Lu,Qi Wang,Rong Xiong,Zhenwei Miao,Yue Wang
机构: Zhejiang University (浙江大学); Cainiao Network (菜鸟网络)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Trajectory planning is vital for autonomous driving, ensuring safe and efficient navigation in complex environments. While recent learning-based methods, particularly reinforcement learning (RL), have shown promise in specific scenarios, RL planners struggle with training inefficiencies and managing large-scale, real-world driving scenarios. In this paper, we introduce CarPlanner, a Consistent auto-regressive Planner that uses RL to generate multi-modal trajectories. The auto-regressive structure enables efficient large-scale RL training, while the incorporation of consistency ensures stable policy learning by maintaining coherent temporal consistency across time steps. Moreover, CarPlanner employs a generation-selection framework with an expert-guided reward function and an invariant-view module, simplifying RL training and enhancing policy performance. Extensive analysis demonstrates that our proposed RL framework effectively addresses the challenges of training efficiency and performance enhancement, positioning CarPlanner as a promising solution for trajectory planning in autonomous driving. To the best of our knowledge, we are the first to demonstrate that the RL-based planner can surpass both IL- and rule-based state-of-the-arts (SOTAs) on the challenging large-scale real-world dataset nuPlan. Our proposed CarPlanner surpasses RL-, IL-, and rule-based SOTA approaches within this demanding dataset.
zh

[CV-35] Graph Probability Aggregation Clustering

【速读】:该论文旨在解决传统聚类方法在全局聚类和局部聚类之间的权衡问题,即全局聚类虽能优化目标函数以探索簇间关系,但可能导致粗粒度划分;而局部聚类虽基于详细点关系进行分组,却往往缺乏一致性和效率。为弥合两者差距并结合其优势,论文提出了一种基于图的概率聚合聚类(Graph Probability Aggregation Clustering, GPAC)算法。GPAC的关键在于将全局聚类的目标函数与局部聚类约束统一,并将其形式化为一个多约束优化问题,通过拉格朗日方法求解。在优化过程中,通过迭代聚合图中邻近样本的信息,计算样本属于特定簇的概率。此外,引入硬分配变量以进一步提升优化的收敛性和稳定性,并设计加速程序将计算复杂度从二次降低到线性,确保算法的可扩展性。实验结果表明,GPAC在多种数据集上的聚类性能超越现有最先进方法,同时保持高计算效率。
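"通过聚合图中邻近样本的信息迭代计算隶属概率"可以用下面的玩具版示意理解:把亲和矩阵行归一化为随机游走矩阵,反复用它平均邻居的簇概率并重新归一化。这里省略了论文中的目标函数、硬分配变量与拉格朗日求解:

```python
import numpy as np

def aggregate_probabilities(affinity, n_clusters, n_iters=50, seed=0):
    """玩具版图概率聚合: 每个样本的簇概率被反复替换为
    其邻居概率的 (行归一化) 平均。"""
    rng = np.random.default_rng(seed)
    n = affinity.shape[0]
    W = affinity / affinity.sum(axis=1, keepdims=True)   # 行随机矩阵
    P = rng.random((n, n_clusters))
    P /= P.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        P = W @ P
        P /= P.sum(axis=1, keepdims=True)
    return P
```

在块对角(两个连通分量)的亲和矩阵上,同一分量内的样本会收敛到相同的隶属概率,体现"邻居信息聚合"的效果。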

链接: https://arxiv.org/abs/2502.19897
作者: Yuxuan Yan,Na Lu,Difei Mei,Ruofan Yan,Youtian Du
机构: School of Automation Science and Engineering, Xi’an Jiaotong University (西安交通大学自动化科学与工程学院); China Mobile Xiong’an Information and Communication Technology Co., Ltd., China Mobile System Integration Co., Ltd., China Mobile Information System Integration Co., Ltd. (中国移动雄安信息通信科技有限公司, 中国移动系统集成有限公司, 中国移动信息系统集成有限公司)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional clustering methods typically focus on either cluster-wise global clustering or point-wise local clustering to reveal the intrinsic structures in unlabeled data. Global clustering optimizes an objective function to explore the relationships between clusters, but this approach may inevitably lead to coarse partitions. In contrast, local clustering heuristically groups data based on detailed point relationships, but it tends to be less coherent and efficient. To bridge the gap between these two concepts and utilize the strengths of both, we propose Graph Probability Aggregation Clustering (GPAC), a graph-based fuzzy clustering algorithm. GPAC unifies the global clustering objective function with a local clustering constraint. The entire GPAC framework is formulated as a multi-constrained optimization problem, which can be solved using the Lagrangian method. Through the optimization process, the probability of a sample belonging to a specific cluster is iteratively calculated by aggregating information from neighboring samples within the graph. We incorporate a hard assignment variable into the objective function to further improve the convergence and stability of optimization. Furthermore, to efficiently handle large-scale datasets, we introduce an acceleration program that reduces the computational complexity from quadratic to linear, ensuring scalability. Extensive experiments conducted on synthetic, real-world, and deep learning datasets demonstrate that GPAC not only exceeds existing state-of-the-art methods in clustering performance but also excels in computational efficiency, making it a powerful tool for complex clustering challenges.
zh

[CV-36] GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors CVPR2025

【速读】:该论文旨在解决现有点云补全方法在处理真实世界扫描数据时面临的挑战,这些方法通常依赖于预定义的合成训练数据集,在应用于分布外的真实扫描数据时表现不佳。论文提出了一种名为GenPC的零样本点云补全框架,通过利用显式的三维生成先验来重建高质量的真实扫描数据。解决方案的关键在于开发了一个Depth Prompting模块,该模块通过深度图像作为桥梁,将部分点云与图像到三维生成模型连接起来;同时设计了一个Geometric Preserving Fusion模块,通过自适应调整生成形状的姿态和尺度,保留输入部分点云的原始结构。这些创新使得所提出的方法在广泛使用的基准数据集上表现出优越性和泛化能力。
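Geometric Preserving Fusion 中"自适应调整生成形状的姿态与尺度"可用一个简化版说明:将生成点云平移到部分点云的质心,并按平均半径比例缩放。旋转对齐从略,函数与参数均为示意而非论文实现:

```python
import numpy as np

def align_scale_and_pose(generated, partial):
    """简化版对齐: 把生成点云平移到部分点云质心, 并按平均半径缩放。
    (论文模块还会自适应调整旋转姿态, 此处省略。)"""
    g_c, p_c = generated.mean(axis=0), partial.mean(axis=0)
    g = generated - g_c
    scale = (np.linalg.norm(partial - p_c, axis=1).mean()
             / np.linalg.norm(g, axis=1).mean())
    return g * scale + p_c
```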

链接: https://arxiv.org/abs/2502.19896
作者: An Li,Zhe Zhu,Mingqiang Wei
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Existing point cloud completion methods, which typically depend on predefined synthetic training datasets, encounter significant challenges when applied to out-of-distribution, real-world scans. To overcome this limitation, we introduce a zero-shot completion framework, termed GenPC, designed to reconstruct high-quality real-world scans by leveraging explicit 3D generative priors. Our key insight is that recent feed-forward 3D generative models, trained on extensive internet-scale data, have demonstrated the ability to perform 3D generation from single-view images in a zero-shot setting. To harness this for completion, we first develop a Depth Prompting module that links partial point clouds with image-to-3D generative models by leveraging depth images as a stepping stone. To retain the original partial structure in the final results, we design the Geometric Preserving Fusion module that aligns the generated shape with input by adaptively adjusting its pose and scale. Extensive experiments on widely used benchmarks validate the superiority and generalizability of our approach, bringing us a step closer to robust real-world scan completion.
zh

[CV-37] High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model

【速读】:该论文旨在解决静态参考肖像动画化以匹配驱动视频中的头部动作与表情,并同时适应用户指定或参考照明条件的问题。现有方法无法实现可重新照明的肖像动画,因为它们未能分离并操作内在(身份与外观)和外在(姿态与照明)特征。论文的关键解决方案在于通过预训练图像到视频扩散模型特征空间中的专用子空间区分这些特征类型。具体而言,利用肖像的3D网格、姿态以及光照渲染的阴影提示来表示外在属性,而参考则代表内在属性。在训练阶段,采用参考适配器将参考映射到内在特征子空间,使用阴影适配器将阴影提示映射到外在特征子空间。通过合并这些子空间的特征,模型实现了对生成动画中照明、姿态和表情的细微控制。广泛的评估表明,LCVD 在照明真实性、图像质量和视频一致性方面优于最先进的方法,为可重新照明的肖像动画设定了新的基准。

链接: https://arxiv.org/abs/2502.19894
作者: Mingtao Guo,Guanyu Xing,Yanli Liu
机构: National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University (四川大学); College of Computer Science, Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Relightable portrait animation aims to animate a static reference portrait to match the head movements and expressions of a driving video while adapting to user-specified or reference lighting conditions. Existing portrait animation methods fail to achieve relightable portraits because they do not separate and manipulate intrinsic (identity and appearance) and extrinsic (pose and lighting) features. In this paper, we present a Lighting Controllable Video Diffusion model (LCVD) for high-fidelity, relightable portrait animation. We address this limitation by distinguishing these feature types through dedicated subspaces within the feature space of a pre-trained image-to-video diffusion model. Specifically, we employ the 3D mesh, pose, and lighting-rendered shading hints of the portrait to represent the extrinsic attributes, while the reference represents the intrinsic attributes. In the training phase, we employ a reference adapter to map the reference into the intrinsic feature subspace and a shading adapter to map the shading hints into the extrinsic feature subspace. By merging features from these subspaces, the model achieves nuanced control over lighting, pose, and expression in generated animations. Extensive evaluations show that LCVD outperforms state-of-the-art methods in lighting realism, image quality, and video consistency, setting a new benchmark in relightable portrait animation.
zh

[CV-38] C-Drag : Chain-of-Thought Driven Motion Controller for Video Generation

【速读】:该论文旨在解决现有基于轨迹的可控视频生成方法仅关注受控对象运动轨迹生成,而忽略对象与其周围环境动态交互的问题。为了解决这一局限性,论文提出了一种基于Chain-of-Thought的运动控制器C-Drag。其关键在于首先通过物体感知模块利用视觉语言模型获取图像中各类物体的位置与类别信息,然后通过基于Chain-of-Thought的运动推理模块进行分阶段推理,生成受动态交互影响的所有对象的运动轨迹,并将其输入扩散模型以合成视频。此外,论文还引入了一个新的视频对象交互(VOI)数据集来评估运动控制视频生成方法的质量。

链接: https://arxiv.org/abs/2502.19868
作者: Yuhao Li,Mirana Claire Angel,Salman Khan,Yu Zhu,Jinqiu Sun,Yanning Zhang,Fahad Shahbaz Khan
机构: Northwestern Polytechnical University (西北工业大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); 9009.ai; Linköping University (林雪平大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, the existing trajectory-based approaches are usually limited to only generating the motion trajectory of the controlled object and ignoring the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, our C-Drag first performs object perception and then reasons the dynamic interactions between different objects according to the given motion control of the objects. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs visual language models to capture the position and category information of various objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects that can be used for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, codes, and models will be available at this https URL.
zh

[CV-39] Striving for Faster and Better: A One-Layer Architecture with Auto Re-parameterization for Low-Light Image Enhancement

【速读】:该论文旨在探索低光图像增强器在视觉质量和计算效率方面的极限,同时追求更优的性能和更快的处理速度。论文的关键在于重新思考任务需求,明确视觉质量与模型学习相对应,而计算效率与网络结构设计相关联。基于此,通过为预定义的极简网络(如单层网络)引入重参数化技术来扩展参数空间,避免陷入局部最优解,从而实现更充分的模型学习。同时,定义了一种分层搜索方案以发现面向任务的重参数化结构,进一步提升网络表示能力并优化效率。最终,仅使用一个卷积层即可实现高效的低光图像增强,同时保持卓越的视觉质量。实验结果表明,该方法在质量和效率方面均优于现有最新方法,并且在多种平台上的运行时间也超越了现有的最快方案。
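"重参数化扩展参数空间"的基本原理:训练时用多条并行的线性卷积分支扩大参数空间,推理时因卷积的线性性可把各分支核相加合并为单一卷积,结构上仍只有一层。下面用 numpy 验证这一等价性(单通道、valid 卷积的极简示意,与论文的具体搜索方案无关):

```python
import numpy as np

def conv2d(x, k):
    """极简单通道 'valid' 互相关。"""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def merge_branches(k1, k2):
    """结构重参数化: 两条并行线性卷积分支在推理时合并为单一卷积核。"""
    return k1 + k2
```

由线性性可知 conv2d(x, k1) + conv2d(x, k2) 与 conv2d(x, k1 + k2) 逐元素相等,即"多分支训练、单层推理"。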

链接: https://arxiv.org/abs/2502.19867
作者: Nan An,Long Ma,Guangchao Han,Xin Fan,Risheng Liu
机构: School of Software Technology, Dalian University of Technology (大连理工大学软件学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning-based low-light image enhancers have made significant progress in recent years, with a trend towards achieving satisfactory visual quality while gradually reducing the number of parameters and improving computational efficiency. In this work, we aim to delve into the limits of image enhancers in both visual quality and computational efficiency, while striving for both better performance and faster processing. To be concrete, by rethinking the task demands, we build an explicit connection, i.e., visual quality and computational efficiency correspond to model learning and structure design, respectively. Around this connection, we enlarge the parameter space by introducing re-parameterization for ample model learning of a pre-defined minimalist network (e.g., just one layer), to avoid falling into a local solution. To strengthen the structural representation, we define a hierarchical search scheme for discovering a task-oriented re-parameterized structure, which also provides powerful support for efficiency. Ultimately, this achieves efficient low-light image enhancement using only a single convolutional layer, while maintaining excellent visual quality. Experimental results show our clear superiority in both quality and efficiency against recently proposed methods. In particular, our running time on various platforms (e.g., CPU, GPU, NPU, DSP) is consistently lower than that of the existing fastest schemes. The source code will be released at this https URL.
zh

[CV-40] LMHLD: A Large-scale Multi-source High-resolution Landslide Dataset for Landslide Detection based on Deep Learning

【速读】:该论文旨在解决深度学习滑坡检测模型依赖高质量标注数据以获得强特征提取能力的问题,以及滑坡检测领域缺乏用于评估最新模型泛化能力的基准数据集的问题。论文的关键解决方案是构建了一个大规模多源高分辨率滑坡数据集(Large-scale Multi-source High-resolution Landslide Dataset, LMHLD),用于深度学习驱动的滑坡检测。LMHLD不仅收集了来自全球七个研究区域的多源遥感影像,还设计了训练模块(LMHLDpart)以适应不同尺度的滑坡检测任务,并缓解了多任务学习中的灾难性遗忘问题。此外,通过多个数据质量评估实验验证了LMHLD作为基准数据集的潜力,从而为滑坡预防与减灾提供了重要资源和支持。

链接: https://arxiv.org/abs/2502.19866
作者: Guanting Liu,Yi Wang,Xi Chen,Baoyu Du,Penglei Li,Yuan Wu,Zhice Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Landslides are among the most common natural disasters globally, posing significant threats to human society. Deep learning (DL) has proven to be an effective method for rapidly generating landslide inventories in large-scale disaster areas. However, DL models rely heavily on high-quality labeled landslide data for strong feature extraction capabilities. Moreover, landslide detection using DL urgently needs a benchmark dataset to evaluate the generalization ability of the latest models. To solve the above problems, we construct a Large-scale Multi-source High-resolution Landslide Dataset (LMHLD) for Landslide Detection based on DL. LMHLD collects remote sensing images from five different satellite sensors across seven study areas worldwide: Wenchuan, China (2008); Rio de Janeiro, Brazil (2011); Gorkha, Nepal (2015); Jiuzhaigou, China (2015); Taiwan, China (2018); Hokkaido, Japan (2018); Emilia-Romagna, Italy (2023). The dataset includes a total of 25,365 patches, with different patch sizes to accommodate different landslide scales. Additionally, a training module, LMHLDpart, is designed to accommodate landslide detection tasks at varying scales and to alleviate the issue of catastrophic forgetting in multi-task learning. Furthermore, models trained on LMHLD are applied to other datasets to highlight the robustness of LMHLD. Five dataset quality evaluation experiments designed using seven DL models from the U-Net family demonstrate that LMHLD has the potential to become a benchmark dataset for landslide detection. LMHLD is open access and can be accessed through the link: this https URL. This dataset provides a strong foundation for DL models, accelerates the development of DL in landslide detection, and serves as a valuable resource for landslide prevention and mitigation efforts.
zh

[CV-41] One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion CVPR2025

【速读】:该论文旨在解决高阶任务中图像融合面临的任务交互困难及语义鸿沟问题,传统方法需要复杂的桥接机制。为应对这一挑战,论文提出了一种基于数字摄影融合的低阶视觉任务的新范式,通过像素级监督实现有效的特征交互。关键在于利用混合图像特征与增强的通用表示,构建了一个名为GIFNet的模型,支持无监督多模态融合且不依赖抽象语义,从而提升任务共享特征的学习能力,实现更广泛的应用场景。此外,该框架还具备单模态增强功能,提供了更高的灵活性以满足实际应用需求。

链接: https://arxiv.org/abs/2502.19854
作者: Chunyang Cheng,Tianyang Xu,Zhenhua Feng,Xiaojun Wu,ZhangyongTang,Hui Li,Zeyang Zhang,Sara Atito,Muhammad Awais,Josef Kittler
机构: School of Artificial Intelligence and Computer Science, Jiangnan University (江南大学), Wuxi, China; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey (萨里大学), Guildford, UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Advanced image fusion methods mostly prioritise high-level missions, where task interaction struggles with semantic gaps, requiring complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owing to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be available at this https URL.
zh

[CV-42] One-for-More: Continual Diffusion Model for Anomaly Detection CVPR2025

【速读】:本文旨在解决现有基于扩散模型(Diffusion Model)的异常检测方法在处理模式增量(pattern increment)时面临的“忠实幻觉(faithfulness hallucination)”和“灾难性遗忘(catastrophic forgetting)”问题。这些问题限制了模型在面对不可预测的新异常模式时的表现。为了解决这些挑战,论文提出了一个连续扩散模型(Continual Diffusion Model),通过梯度投影(Gradient Projection)实现稳定持续学习(continual learning)。梯度投影通过对更新过程中梯度的方向进行调节来保护已学知识,从而缓解上述问题。然而,这种方法需要较高的内存开销,因为其依赖于马尔可夫过程(Markov Process)。为降低内存消耗,论文进一步提出了一种基于线性表示传递性质的迭代奇异值分解方法(Iterative Singular Value Decomposition),该方法几乎不造成性能损失且显著减少了内存需求。此外,为了防止扩散模型过度拟合正常样本,论文还设计了一个异常掩码网络(Anomaly-Masked Network),以增强扩散模型的条件机制(Condition Mechanism)。最终,所提出的方案在MVTec和VisA数据集的17/18个设置中达到了最佳性能。

链接: https://arxiv.org/abs/2502.19848
作者: Xiaofan Li,Xin Tan,Zhuo Chen,Zhizhong Zhang,Ruixin Zhang,Rizen Guo,Guanna Jiang,Yulong Chen,Yanyun Qu,Lizhuang Ma,Yuan Xie
机构: East China Normal University (华东师范大学); Xiamen University (厦门大学); Shanghai Jiao Tong University (上海交通大学); Tencent YouTu Lab (腾讯优图实验室); Tencent WeChatPay Lab (腾讯微信支付实验室); CATL (宁德时代); East China Normal University (华东师范大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2025

点击查看摘要

Abstract:With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe “faithfulness hallucination” and “catastrophic forgetting”, which can’t meet the unpredictable pattern increments. To mitigate the above problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection deploys a regularization on the model updating by modifying the gradient towards the direction protecting the learned knowledge. But as a double-edged sword, it also requires huge memory costs brought by the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes tiny memory and incurs almost no performance loss. Finally, considering the risk of “over-fitting” to normal images of the diffusion model, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, ours achieves first place in 17/18 settings on MVTec and VisA. Code is available at this https URL
zh
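文中梯度投影的核心思想可以用几行代码示意:将新任务的梯度投影到已学知识子空间的正交补上,使参数更新不改变受保护方向上的响应。以下为通用示意(M 为假设的正交基;论文中该子空间由迭代奇异值分解维护,细节可能不同):

```python
import numpy as np

rng = np.random.default_rng(1)

# 列向量张成"需要保护的旧知识"子空间的正交基,
# 实践中可由存储特征矩阵的 SVD 得到
A = rng.standard_normal((10, 3))
M, _ = np.linalg.qr(A)          # 10x3,列正交归一

g = rng.standard_normal(10)     # 新任务的原始梯度

# 投影到 span(M) 的正交补:更新不再破坏受保护方向
g_proj = g - M @ (M.T @ g)

# 投影后的梯度在所有受保护方向上的分量为零
assert np.allclose(M.T @ g_proj, 0.0)
```

这正是"通过修改梯度方向来保护已学知识"的正则化含义:优化仍沿损失下降方向进行,但被限制在不干扰旧知识的子空间内。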

[CV-43] ProAPO: Progressively Automatic Prompt Optimization for Visual Classification CVPR

【速读】:该论文旨在解决在细粒度图像分类任务中,基于视觉-语言模型(Vision-Language Models, VLMs)的文本提示(textual prompts)可能因大规模语言模型(Large Language Models, LLMs)的幻觉效应(hallucination)而导致类别特定提示(class-specific prompts)不准确或缺乏区分性的问题。论文的目标是在最小监督和无人工干预的情况下,找到视觉上具有区分性的提示。

解决方案的关键在于提出了一种基于进化的算法,该算法能够从任务特定模板逐步优化到类别特定描述。为了应对类别特定候选提示搜索空间爆炸导致的提示生成成本增加、迭代次数增多以及过拟合问题,论文引入了基于编辑和进化操作的简单而有效的生成策略,通过一次查询LLM即可产生多样化的候选提示。此外,提出了两种采样策略以找到更好的初始搜索点并减少遍历类别数量,从而降低迭代成本。同时,设计了一种带有熵约束的新颖适应度分数(fitness score),用于缓解过拟合问题。实验结果表明,所提出的最优提示在13个数据集的一次拍摄图像分类任务中优于现有基于文本提示的方法,并且有效提升了LLM生成描述的效果,同时也增强了适配器方法(adapter-based methods)的跨骨干网络迁移能力。

链接: https://arxiv.org/abs/2502.19844
作者: Xiangyan Qu,Gaopeng Gou,Jiamin Zhuang,Jing Yu,Kun Song,Qihao Wang,Yili Li,Gang Xiong
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, Chinese Academy of Sciences (中国科学院网络空间安全学院); School of Information Engineering, Minzu University of China (中央民族大学信息工程学院); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

点击查看摘要

Abstract:Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performances largely depend on the prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to the hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space shows an explosion in class-specific candidate prompts. This increases prompt generation costs, iterative times, and the overfitting problem. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts by one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.
zh
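关于文中"带熵约束的适应度分数",下面给出一种可能的形式作为示意:用准确率减去平均预测熵的加权项,使得准确率相同时更自信(熵更低)的提示得分更高。lam 为假设的超参数,论文中的具体定义可能不同:

```python
import numpy as np

def fitness(probs, labels, lam=0.1):
    """适应度 = 准确率 - lam * 平均预测熵(熵约束以缓解过拟合)。
    probs: (N, C) 预测概率;labels: (N,) 真实类别。"""
    acc = np.mean(np.argmax(probs, axis=1) == labels)
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1).mean()
    return acc - lam * ent

labels = np.array([0])
confident = np.array([[0.98, 0.01, 0.01]])
uniform = np.array([[1 / 3, 1 / 3, 1 / 3]])

# 两个候选提示准确率相同时,预测分布更自信者适应度更高
assert fitness(confident, labels) > fitness(uniform, labels)
```

在进化搜索中以此分数排序候选提示,可以抑制那些靠"侥幸"拟合少量样本、但预测分布高度不确定的提示存活下来。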

[CV-44] CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation CVPR2025

【速读】:该论文旨在解决对比语言图像预训练(CLIP)模型在复杂多目标场景中的局限性问题。研究通过设计一个专门的数据集ComCO,全面分析了CLIP编码器在多目标场景下的偏见现象,发现文本编码器倾向于优先关注先提及的对象,而图像编码器则更偏好处理较大的对象。为了解释这些现象,研究将偏见的根源追溯至CLIP的训练过程,并通过LAION数据集及训练进展分析予以验证。关键在于揭示CLIP在对象大小变化或令牌顺序调整时的不稳定性,并进一步扩展到长描述和文本到图像模型(如Stable Diffusion),证明提示顺序如何影响生成图像中对象的突出程度。这一工作强调了改进CLIP模型以提高其在复杂多目标场景下性能的重要性。

链接: https://arxiv.org/abs/2502.19842
作者: Reza Abbasi,Ali Nazari,Aminreza Sefid,Mohammadali Banayeeanzade,Mohammad Hossein Rohban,Mahdieh Soleymani Baghshah
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP’s limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP’s encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP’s training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP’s instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images. For more details and access to our dataset and analysis code, visit our project repository: this https URL.
zh

[CV-45] Knowledge Bridger: Towards Training-free Missing Multi-modality Completion CVPR2025

【速读】:本文旨在解决跨领域(out-of-domain, OOD)场景下缺失模态补全的通用性和鲁棒性不足的问题。传统方法通常依赖于精心设计的融合技术以及在完整数据上的大规模预训练,这限制了其在OOD场景中的泛化能力。为应对这一挑战,论文提出了一种无需训练的缺失模态补全框架,利用大型多模态模型(Large Multimodal Models, LMMs)。该方法被称为“Knowledge Bridger”,具有模态无关性,并结合了缺失模态的生成与排序功能。关键在于通过定义领域特定先验知识,自动从现有模态中提取结构化信息构建知识图谱,将缺失模态的生成与排序模块通过LMM连接,从而实现高质量的缺失模态补全。实验结果表明,该方法在一般和医学领域均优于现有竞争方法,特别是在OOD泛化方面表现突出,且其知识驱动的生成与排序技术优于直接使用LMM的方法,为其他领域的应用提供了有价值的参考。

链接: https://arxiv.org/abs/2502.19834
作者: Guanzhou Ke,Shengfeng He,Xiao Li Wang,Bo Wang,Guoqing Chao,Yuanyang Zhang,Yi Xie,HeXing Su
机构: Beijing Jiaotong University (北京交通大学); Singapore Management University (新加坡管理大学); Southeast University (东南大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Harbin Institute of Technology (哈尔滨工业大学); Nanjing University of Science and Technology (南京理工大学); South China University of Technology (华南理工大学); Xiamen Institute of Technology (厦门理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Previous successful approaches to missing modality completion rely on carefully designed fusion techniques and extensive pre-training on complete data, which can limit their generalizability in out-of-domain (OOD) scenarios. In this study, we pose a new challenge: can we develop a missing modality completion model that is both resource-efficient and robust to OOD generalization? To address this, we present a training-free framework for missing modality completion that leverages large multimodal models (LMMs). Our approach, termed the “Knowledge Bridger”, is modality-agnostic and integrates generation and ranking of missing modalities. By defining domain-specific priors, our method automatically extracts structured information from available modalities to construct knowledge graphs. These extracted graphs connect the missing modality generation and ranking modules through the LMM, resulting in high-quality imputations of missing modalities. Experimental results across both general and medical domains show that our approach consistently outperforms competing methods, including in OOD generalization. Additionally, our knowledge-driven generation and ranking techniques demonstrate superiority over variants that directly employ LMMs for generation and ranking, offering insights that may be valuable for applications in other domains.
zh

[CV-46] Analyzing CLIPs Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study ECCV2024

【速读】:本文旨在解决对比语言图像预训练(CLIP)模型在复杂多目标场景下的性能局限性问题。研究通过引入SimCO和CompCO两个定制数据集,系统分析了CLIP在多目标上下文中图像编码器和文本编码器的显著偏见:图像编码器倾向于关注更大物体,而文本编码器优先处理描述中提到的第一个物体。研究表明这些偏见源于CLIP的训练过程,并通过COCO数据集及CLIP训练进展的分析提供了证据。进一步地,研究扩展到Stable Diffusion模型,揭示了CLIP文本编码器的偏见对文本到图像生成任务的影响。关键解决方案在于识别并量化这些偏见如何影响CLIP在图像标题匹配和生成任务中的表现,特别是在操控物体大小和顺序时的表现差异,从而为未来视觉-语言模型的改进提供指导方向。

链接: https://arxiv.org/abs/2502.19828
作者: Reza Abbasi,Ali Nazari,Aminreza Sefid,Mohammadali Banayeeanzade,Mohammad Hossein Rohban,Mahdieh Soleymani Baghshah
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2024 Workshop EVAL-FoMo

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in handling complex multi-object scenarios remains challenging. This study presents a comprehensive analysis of CLIP’s performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO, to evaluate CLIP’s image and text encoders in various multi-object configurations. Our findings reveal significant biases in both encoders: the image encoder favors larger objects, while the text encoder prioritizes objects mentioned first in descriptions. We hypothesize these biases originate from CLIP’s training process and provide evidence through analyses of the COCO dataset and CLIP’s training progression. Additionally, we extend our investigation to Stable Diffusion models, revealing that biases in the CLIP text encoder significantly impact text-to-image generation tasks. Our experiments demonstrate how these biases affect CLIP’s performance in image-caption matching and generation tasks, particularly when manipulating object sizes and their order in captions. This work contributes valuable insights into CLIP’s behavior in complex visual environments and highlights areas for improvement in future vision-language models.
zh

[CV-47] wofold Debiasing Enhances Fine-Grained Learning with Coarse Labels

【速读】:本文旨在解决Coarse-to-Fine Few-Shot (C2FS) 任务中的两个主要挑战:一是粗粒度监督预训练抑制了关键细粒度特征的提取;二是由于有限细粒度样本导致的分布偏差引发模型过拟合。为应对这些挑战,论文提出了Twofold Debiasing (TFB) 方法,其关键是通过多层特征融合重建模块和中间层特征对齐模块增强复杂细粒度特征的学习能力,同时利用富含细粒度信息的粗粒度样本嵌入校准分布偏差,从而有效缓解上述问题。实验结果表明,该方法在五个基准数据集上取得了最先进的性能。

链接: https://arxiv.org/abs/2502.19816
作者: Xin-yang Zhao,Jian Jin,Yang-yang Li,Yazhou Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Coarse-to-Fine Few-Shot (C2FS) task is designed to train models using only coarse labels, then leverages a limited number of subclass samples to achieve fine-grained recognition capabilities. This task presents two main challenges: coarse-grained supervised pre-training suppresses the extraction of critical fine-grained features for subcategory discrimination, and models suffer from overfitting due to biased distributions caused by limited fine-grained samples. In this paper, we propose the Twofold Debiasing (TFB) method, which addresses these challenges through detailed feature enhancement and distribution calibration. Specifically, we introduce a multi-layer feature fusion reconstruction module and an intermediate layer feature alignment module to combat the model’s tendency to focus on simple predictive features directly related to coarse-grained supervision, while neglecting complex fine-grained level details. Furthermore, we mitigate the biased distributions learned by the fine-grained classifier using readily available coarse-grained sample embeddings enriched with fine-grained information. Extensive experiments conducted on five benchmark datasets demonstrate the efficacy of our approach, achieving state-of-the-art results that surpass competitive methods.
zh

[CV-48] UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition ICLR2025

【速读】:该论文试图解决在生成式人脸合成(Synthetic Face Generation)中因上下文过拟合(context overfitting)导致的类内多样性不足(intra-class diversity deficiency)以及由此引发的人脸识别性能下降的问题。论文的关键解决方案在于提出了一种名为UIFace的框架,通过释放扩散模型(diffusion model)的内在能力来增强类内多样性。具体而言,UIFace首先训练一个能够根据身份上下文或可学习的空上下文(learnable empty context)进行采样的扩散模型:前者生成保持身份一致但变化较少的图像,后者则利用模型的固有能力生成具有类内多样性的随机身份图像。接着,在推理阶段采用一种新颖的两阶段采样策略(two-stage sampling strategy),充分结合两种上下文的优势,从而生成既多样化又保持身份一致的图像。此外,引入注意力注入模块(attention injection module),利用空上下文的注意力图指导基于身份条件的生成过程,进一步增强类内多样性。实验表明,与现有方法相比,UIFace在更少的训练数据和一半规模的合成数据集情况下表现更优,并且当合成身份数量增加时,其人脸识别性能可媲美基于真实数据集训练的模型。

链接: https://arxiv.org/abs/2502.19803
作者: Xiao Lin,Yuge Huang,Jianqing Xu,Yuxi Mi,Shuigeng Zhou,Shouhong Ding
机构: Tencent Youtu Lab (腾讯优图实验室); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR2025

点击查看摘要

Abstract:Face recognition (FR) stands as one of the most crucial applications in computer vision. The accuracy of FR models has significantly improved in recent years due to the availability of large-scale human face datasets. However, directly using these datasets can inevitably lead to privacy and legal problems. Generating synthetic data to train FR models is a feasible solution to circumvent these issues. While existing synthetic-based face recognition methods have made significant progress in generating identity-preserving images, they are severely plagued by context overfitting, resulting in a lack of intra-class diversity of generated images and poor face recognition performance. In this paper, we propose a framework to Unleash Inherent capability of the model to enhance intra-class diversity for synthetic face recognition, shortened as UIFace. Our framework first trains a diffusion model that can perform sampling conditioned on either identity contexts or a learnable empty context. The former generates identity-preserving images but lacks variations, while the latter exploits the model’s intrinsic ability to synthesize intra-class-diversified images but with random identities. Then we adopt a novel two-stage sampling strategy during inference to fully leverage the strengths of both types of contexts, resulting in images that are diverse as well as identity-preserving. Moreover, an attention injection module is introduced to further augment the intra-class variations by utilizing attention maps from the empty context to guide the sampling process in ID-conditioned generation. Experiments show that our method significantly surpasses previous approaches with even less training data and half the size of the synthetic dataset. The proposed UIFace even achieves comparable performance with FR models trained on real datasets when we further increase the number of synthetic identities.
zh
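文中的两阶段采样策略可以用一个极简的调度函数示意:高噪声阶段使用空上下文以提升类内多样性,低噪声阶段切换到身份上下文以锁定身份。switch_ratio 为假设的切换比例,论文中的具体切换策略可能不同:

```python
def choose_context(t, num_steps, switch_ratio=0.5):
    """两阶段采样的示意调度:t 为当前去噪步(从 num_steps-1 递减到 0)。
    高噪声阶段(t 较大)返回空上下文,低噪声阶段返回身份上下文。
    switch_ratio 为假设的切换点,仅作演示。"""
    return "empty" if t >= int(num_steps * switch_ratio) else "identity"

# 扩散采样通常从 t = num_steps-1 递减到 0
schedule = [choose_context(t, 50) for t in range(49, -1, -1)]
assert schedule[0] == "empty" and schedule[-1] == "identity"
```

直觉上,图像的整体结构(多样性来源)在高噪声步决定,而身份细节在低噪声步收敛,因此先"放开"再"约束"可以兼顾两者。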

[CV-49] No Parameters No Problem: 3D Gaussian Splatting without Camera Intrinsics and Extrinsics

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在场景重建和新视角合成中过度依赖精确预计算的相机内参(camera intrinsics)和外参(camera poses)的问题。传统方法虽已尝试在无需相机姿态的情况下优化3DGS,但仍需依赖相机内参。为进一步降低这种依赖性,论文提出了一种联合优化方法,通过图像集合训练3DGS,无需任何相机内参或外参。

解决方案的关键在于引入了多个重要改进:首先,从理论上推导出相机内参的梯度,使得相机内参能够在训练过程中同时被优化;其次,整合全局轨迹信息,并选择与每条轨迹相关的高斯核(Gaussian kernels),这些核会在训练过程中被自动调整至无穷小尺寸,逼近表面点,从而专注于多视图一致性约束和重投影误差最小化,而其他核则继续保持其原始功能。这种混合训练策略有效地统一了相机参数估计与3DGS训练过程,显著提升了方法的鲁棒性和性能。实验结果表明,所提方法在公开及合成数据集上达到了当前最先进的性能。

链接: https://arxiv.org/abs/2502.19800
作者: Dongbo Shi,Shen Cao,Lubin Fan,Bojian Wu,Jinhui Guo,Renjie Chen,Ligang Liu,Jieping Ye
机构: University of Science and Technology of China (中国科学技术大学); Individual Researcher (个人研究者)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While 3D Gaussian Splatting (3DGS) has made significant progress in scene reconstruction and novel view synthesis, it still heavily relies on accurately pre-computed camera intrinsics and extrinsics, such as focal length and camera poses. In order to mitigate this dependency, previous efforts have focused on optimizing 3DGS without the need for camera poses, yet camera intrinsics remain necessary. To further loosen this requirement, we propose a joint optimization method to train 3DGS from an image collection without requiring either camera intrinsics or extrinsics. To achieve this goal, we introduce several key improvements during the joint training of 3DGS. We theoretically derive the gradient of the camera intrinsics, allowing the camera intrinsics to be optimized simultaneously during training. Moreover, we integrate global track information and select the Gaussian kernels associated with each track, which will be trained and automatically rescaled to an infinitesimally small size, closely approximating surface points, and focusing on enforcing multi-view consistency and minimizing reprojection errors, while the remaining kernels continue to serve their original roles. This hybrid training strategy nicely unifies the camera parameters estimation and 3DGS training. Extensive evaluations demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on both public and synthetic datasets.
zh
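文中"从理论上推导相机内参的梯度"这一点,可用最简针孔模型做数值验证:重投影误差对焦距 f 的解析梯度为 2(u - u_obs)·X/Z,并与中心差分结果对照。以下仅为示意推导,远简于论文中完整的相机模型与多视图约束:

```python
def project_u(f, X, Z, cx):
    """针孔模型(仅横坐标):u = f * X / Z + cx。"""
    return f * X / Z + cx

def reproj_err(f, X, Z, cx, u_obs):
    """重投影误差:预测像素坐标与观测坐标之差的平方。"""
    return (project_u(f, X, Z, cx) - u_obs) ** 2

f, X, Z, cx, u_obs = 500.0, 0.3, 2.0, 320.0, 400.0

# 解析梯度:d err / d f = 2 * (u - u_obs) * X / Z
grad = 2.0 * (project_u(f, X, Z, cx) - u_obs) * X / Z

# 中心差分数值验证
eps = 1e-4
num = (reproj_err(f + eps, X, Z, cx, u_obs)
       - reproj_err(f - eps, X, Z, cx, u_obs)) / (2 * eps)
assert abs(grad - num) < 1e-4
```

有了对内参的解析梯度,焦距就可以和高斯核参数一起在同一个反向传播中联合优化,这正是该方法摆脱预先标定内参的前提。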

[CV-50] MFSR: Multi-fractal Feature for Super-resolution Reconstruction with Fine Details Recovery

【速读】:该论文旨在解决图像超分辨率处理中复杂局部信息处理对生成图像质量影响较大的问题。论文提出了一种基于扩散模型的超分辨率方法MFSR,其关键在于引入低分辨率图像的分形特征,并将其作为去噪过程中的增强条件,以确保纹理信息的精确恢复。MFSR通过卷积软分配近似低分辨率图像的分形特征及密度特征图,采用层次化描述图像的空间布局,编码图像在不同尺度上的自相似性属性。此外,通过针对不同类型特征应用不同的处理方法,丰富模型获取的信息量。同时,在去噪U-Net中集成子去噪器,减少上采样过程中特征图的噪声,从而提升生成图像的质量。实验结果表明,MFSR能够生成更高品质的图像。

链接: https://arxiv.org/abs/2502.19797
作者: Lianping Yang,Peng Jiao,Jinshan Pan,Hegui Zhu,Su Guo
机构: College of Sciences, Northeastern University (东北大学理学院), China; Key Laboratory of Differential Equations and Their Applications, Northeastern University, Liaoning Provincial Department of Education (东北大学微分方程及其应用重点实验室, 辽宁省教育厅), China; School of Computer Science and Engineering, Nanjing University of Science (南京理工大学计算机科学与工程学院), China; College of Renewable Energy, Hohai University (河海大学可再生能源学院), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In image super-resolution, the handling of complex localized information can have a significant impact on the quality of the generated image. Fractal features can capture the rich details of both micro and macro texture structures in an image. Therefore, we propose a diffusion model-based super-resolution method incorporating fractal features of low-resolution images, named MFSR. MFSR leverages these fractal features as reinforcement conditions in the denoising process of the diffusion model to ensure accurate recovery of texture information. MFSR employs convolution as a soft assignment to approximate the fractal features of low-resolution images. This approach is also used to approximate the density feature maps of these images. By using soft assignment, the spatial layout of the image is described hierarchically, encoding the self-similarity properties of the image at different scales. Different processing methods are applied to various types of features to enrich the information acquired by the model. In addition, a sub-denoiser is integrated in the denoising U-Net to reduce the noise in the feature maps during the up-sampling process in order to improve the quality of the generated images. Experiments conducted on various face and natural image datasets demonstrate that MFSR can generate higher quality images.
zh
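MFSR 以卷积软分配来近似分形特征;作为背景,经典的盒计数法(box counting)分形维数估计可用如下最小示例说明。此为通用示意代码,并非论文实现:

```python
import numpy as np

def box_counting_dimension(img, sizes=(1, 2, 4, 8)):
    """盒计数法估计二值图像的分形维数:统计每个尺度 s 下
    含前景的 s×s 盒子数 N(s),对 log N(s) ~ log(1/s) 线性拟合取斜率。"""
    counts = []
    for s in sizes:
        n = 0
        for i in range(0, img.shape[0], s):
            for j in range(0, img.shape[1], s):
                if img[i:i + s, j:j + s].any():
                    n += 1
        counts.append(n)
    logs = np.log(1.0 / np.asarray(sizes, dtype=float))
    return np.polyfit(logs, np.log(counts), 1)[0]

# 完全填充的方块维数约为 2,一条水平直线维数约为 1
full = np.ones((32, 32), dtype=bool)
line = np.zeros((32, 32), dtype=bool)
line[16, :] = True
assert abs(box_counting_dimension(full) - 2.0) < 0.1
assert abs(box_counting_dimension(line) - 1.0) < 0.1
```

这种跨尺度的自相似性度量,正是摘要中"在不同尺度上编码图像自相似性属性"所依赖的数学工具。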

[CV-51] Open-Vocabulary Semantic Part Segmentation of 3D Human

【速读】:本文旨在解决3D人体开放词汇分割这一尚未解决的问题,传统监督分割方法因有限的3D标注数据难以泛化到未见过的形状和类别,而现有的零样本3D分割方法虽在场景或物体分割中表现良好,但对3D人体的泛化能力不足。为此,作者提出了一种基于文本提示对3D人体进行细粒度分割的开放词汇分割方法。关键在于设计了一个简单的分割流程,利用SAM生成多视角2D提议,并提出新型HumanCLIP模型以创建视觉和文本输入的统一嵌入,该模型针对以人为中心的内容提供了更精确的嵌入。此外,还设计了一个简单有效的MaskFusion模块,用于将多视角特征分类和融合成3D语义掩码,无需复杂的投票和分组机制,同时解耦掩码提议与文本输入的设计显著提升了每条提示的推理效率。实验结果表明,该方法在多个3D人体数据集上大幅超越当前最先进的开放词汇3D分割方法,并可直接应用于网格、点云和3D高斯点阵等多种3D表示形式。

链接: https://arxiv.org/abs/2502.19782
作者: Keito Suzuki,Bang Du,Girish Krishnan,Kunyao Chen,Runfa Blark Li,Truong Nguyen
机构: University of California, San Diego (加州大学圣地亚哥分校); Qualcomm (高通)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3DV 2025

点击查看摘要

Abstract:3D part segmentation is still an open problem in the field of 3D vision and AR/VR. Due to limited 3D labeled data, traditional supervised segmentation methods fall short in generalizing to unseen shapes and categories. Recently, the advancement in vision-language models’ zero-shot abilities has brought a surge in open-world 3D segmentation methods. While these methods show promising results for 3D scenes or objects, they do not generalize well to 3D humans. In this paper, we present the first open-vocabulary segmentation method capable of handling 3D human. Our framework can segment the human category into desired fine-grained parts based on the textual prompt. We design a simple segmentation pipeline, leveraging SAM to generate multi-view proposals in 2D and proposing a novel HumanCLIP model to create unified embeddings for visual and textual inputs. Compared with existing pre-trained CLIP models, the HumanCLIP model yields more accurate embeddings for human-centric contents. We also design a simple-yet-effective MaskFusion module, which classifies and fuses multi-view features into 3D semantic masks without complex voting and grouping mechanisms. The design of decoupling mask proposals and text input also significantly boosts the efficiency of per-prompt inference. Experimental results on various 3D human datasets show that our method outperforms current state-of-the-art open-vocabulary 3D segmentation methods by a large margin. In addition, we show that our method can be directly applied to various 3D representations including meshes, point clouds, and 3D Gaussian Splatting.
zh

[CV-52] RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings CVPR2025

【速读】:该论文旨在解决现有地理空间任务中表示学习方法未能充分捕获重要视觉特征的问题。具体而言,当前基于对比学习的方法(如SatCLIP和GeoCLIP)通过将地理定位与共位图像对齐来学习表征,但这些方法在训练策略上存在不足,导致生成的嵌入(embeddings)丢失了对许多下游任务至关重要的关键视觉信息。论文从信息论的角度分析了这一现象,并提出了一种新的检索增强方法——RANGE。RANGE的关键创新在于利用多个外观相似位置的视觉特征来估计某一位置的视觉特征,从而更全面地捕捉视觉信息。实验结果表明,RANGE在分类任务上的性能提升了高达13.1%,在回归任务上的(R^2)值提高了0.145,显著优于现有的最先进模型。

链接: https://arxiv.org/abs/2502.19781
作者: Aayush Dhakal,Srikumar Sastry,Subash Khanal,Adeel Ahmad,Eric Xing,Nathan Jacobs
机构: Washington University in St. Louis (圣路易斯华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:The choice of representation for geographic location significantly impacts the accuracy of models for a broad range of geospatial tasks, including fine-grained species classification, population density estimation, and biome classification. Recent works like SatCLIP and GeoCLIP learn such representations by contrastively aligning geolocation with co-located images. While these methods work exceptionally well, in this paper, we posit that the current training strategies fail to fully capture the important visual features. We provide an information theoretic perspective on why the resulting embeddings from these methods discard crucial visual information that is important for many downstream tasks. To solve this problem, we propose a novel retrieval-augmented strategy called RANGE. We build our method on the intuition that the visual features of a location can be estimated by combining the visual features from multiple similar-looking locations. We evaluate our method across a wide variety of tasks. Our results show that RANGE outperforms the existing state-of-the-art models with significant margins in most tasks. We show gains of up to 13.1% on classification tasks and 0.145 R^2 on regression tasks. All our code will be released on GitHub. Our models will be released on HuggingFace.
zh
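RANGE 的核心直觉——"某一位置的视觉特征可由多个外观相似位置的视觉特征组合估计"——可以用一个检索加权平均的最小示意来说明。函数与参数名均为假设,并非论文实现:

```python
import numpy as np

def retrieval_augmented_feature(query_emb, db_embs, db_visual, k=3, tau=0.1):
    """用查询位置嵌入在数据库中检索 k 个最相似位置,
    以 softmax 加权平均它们的视觉特征,作为该位置视觉特征的估计。"""
    sims = db_embs @ query_emb                 # 余弦相似度(假设嵌入已归一化)
    top = np.argsort(sims)[-k:]                # top-k 最相似位置
    w = np.exp(sims[top] / tau)
    w /= w.sum()                               # softmax 权重
    return w @ db_visual[top]

rng = np.random.default_rng(2)
db_embs = rng.standard_normal((100, 16))
db_embs /= np.linalg.norm(db_embs, axis=1, keepdims=True)
db_visual = rng.standard_normal((100, 32))

q = db_embs[0]                                 # 查询与库中第 0 条方向一致
feat = retrieval_augmented_feature(q, db_embs, db_visual)
assert feat.shape == (32,)
```

温度 tau 控制估计在"最相似单点"与"多点平均"之间的取舍:tau 越小,权重越集中在最相似的检索结果上。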

[CV-53] InPK: Infusing Prior Knowledge into Prompt for Vision-Language Models

【速读】:该论文旨在解决通过提示调优(Prompt Tuning)适配视觉-语言模型(Vision-Language Models, VLMs)进行零样本/少样本视觉识别任务时存在的问题。具体而言,当可学习的提示词(learnable tokens)随机初始化且与先验知识(prior knowledge)无关时,模型容易在已见类别(seen classes)上过拟合,并在未见类别(unseen classes)上因领域偏移(domain shift)而表现不佳。为了解决这一问题,论文提出了一种名为InPK的模型,其关键在于通过在初始化阶段将类别特定的先验知识注入到可学习的提示词中,使模型能够显式关注与类别相关的特征信息。此外,为了缓解多层编码器对类别信息的弱化影响,InPK通过在多个特征层次上持续强化可学习提示词与先验知识之间的交互,这种逐步增强的交互机制使得模型能够更好地捕捉先验知识中的细粒度差异和通用视觉概念,从而提取出更具区分性和泛化性的文本特征。这种方法不仅提高了模型对已见类别的适应性,还增强了其对未见类别的表示能力,使其能够在现有语义结构中正确推断未见类别的位置。此外,论文还引入了一个可学习的文本到视觉投影层,以适应文本调整,确保视觉-文本语义更好的对齐。实验结果表明,InPK在11个识别数据集上的多个零样本/少样本图像分类任务中显著优于现有最先进的方法。

链接: https://arxiv.org/abs/2502.19777
作者: Shuchang Zhou
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prompt tuning has become a popular strategy for adapting Vision-Language Models (VLMs) to zero/few-shot visual recognition tasks. Some prompting techniques introduce prior knowledge due to its richness, but when learnable tokens are randomly initialized and disconnected from prior knowledge, they tend to overfit on seen classes and struggle with domain shifts for unseen ones. To address this issue, we propose the InPK model, which infuses class-specific prior knowledge into the learnable tokens during initialization, thus enabling the model to explicitly focus on class-relevant information. Furthermore, to mitigate the weakening of class information by multi-layer encoders, we continuously reinforce the interaction between learnable tokens and prior knowledge across multiple feature levels. This progressive interaction allows the learnable tokens to better capture the fine-grained differences and universal visual concepts within prior knowledge, enabling the model to extract more discriminative and generalized text features. Even for unseen classes, the learned interaction allows the model to capture their common representations and infer their appropriate positions within the existing semantic structure. Moreover, we introduce a learnable text-to-vision projection layer to accommodate the text adjustments, ensuring better alignment of visual-text semantics. Extensive experiments on 11 recognition datasets show that InPK significantly outperforms state-of-the-art methods in multiple zero/few-shot image classification tasks.
zh

[CV-54] QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects AAAI2025

【速读】:该论文致力于解决在实时性能需求下,基于Transformer的双手机器与物体交互姿态估计方法所面临的高计算开销问题。论文的关键创新在于提出了一种名为QORT-Former的查询优化实时Transformer框架。其核心解决方案包括:首先通过限制查询数量(仅使用108个查询)和解码器数量(仅使用1个解码器)来满足效率需求;其次,设计了三类特定查询(左手查询、右手查询和物体查询),并通过结合手与物体之间的接触信息以及迭代更新增强的图像特征与查询特征,优化输入到Transformer解码器中的查询特征,从而在保证高效性的同时显著提升了精度。实验结果显示,该方法在H2O和FPHA数据集上的表现超越了现有最先进方法,并实现了交互识别的实时性能。

链接: https://arxiv.org/abs/2502.19769
作者: Elkhan Ismayilzada,MD Khalequzzaman Chowdhury Sayem,Yihalem Yimolal Tiruneh,Mubarrat Tajoar Chowdhury,Muhammadjon Boboev,Seungryul Baek
机构: UNIST (蔚山科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025

点击查看摘要

Abstract:Significant advancements have been achieved in understanding the poses and interactions of two hands manipulating an object. The emergence of augmented reality (AR) and virtual reality (VR) technologies has heightened the demand for real-time performance in these applications. However, current state-of-the-art models often exhibit promising results at the expense of substantial computational overhead. In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. We first limit the number of queries and decoders to meet the efficiency requirement. Given the limited number of queries and decoders, we propose to optimize the queries taken as input to the Transformer decoder to secure better accuracy: (1) we divide the queries into three types (a left hand query, a right hand query and an object query) and enhance the query features (2) by using the contact information between the hands and the object and (3) by using a three-step update of the enhanced image and query features with respect to one another. With the proposed methods, we achieve real-time pose estimation using just 108 queries and 1 decoder (53.5 FPS on an RTX 3090TI GPU). Surpassing state-of-the-art results on the H2O dataset by 17.6% (left hand), 22.8% (right hand), and 27.2% (object), as well as on the FPHA dataset by 5.3% (right hand) and 10.4% (object), our method excels in accuracy. Additionally, it sets the state of the art in interaction recognition, maintaining real-time efficiency with an off-the-shelf action recognition module.
zh

[CV-55] Automatic Temporal Segmentation for Post-Stroke Rehabilitation: A Keypoint Detection and Temporal Segmentation Approach for Small Datasets

【速读】:本文旨在解决中风患者康复评估中存在的主观性、不一致性和耗时等问题,特别是在老龄化人群中迫切需要更有效的个性化康复策略。论文的核心问题是通过视频记录的时序分割来捕捉中风患者康复过程中的详细活动,并实现临床日常桌面物体交互评估的自动化分析。解决方案的关键在于提出了一种结合生物力学特性的框架,将任务分为二维关键点检测以追踪患者的物理运动,以及一维时间序列的时序分割以分析这些运动随时间的变化。这种双管齐下的方法能够在有限的真实世界数据集上实现自动标注,从而应对患者运动变化多样性和数据集可用性有限的挑战,展现出在物理治疗环境中实际应用的强大潜力。

链接: https://arxiv.org/abs/2502.19766
作者: Jisoo Lee,Tamim Ahmed,Thanassis Rikakis,Pavan Turaga
机构: Arizona State University; University of Southern California
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rehabilitation is essential and critical for post-stroke patients, addressing both physical and cognitive aspects. Stroke predominantly affects older adults, with 75% of cases occurring in individuals aged 65 and older, underscoring the urgent need for tailored rehabilitation strategies in aging populations. Despite the critical role therapists play in evaluating rehabilitation progress and ensuring the effectiveness of treatment, current assessment methods can often be subjective, inconsistent, and time-consuming, leading to delays in adjusting therapy protocols. This study aims to address these challenges by providing a solution for consistent and timely analysis. Specifically, we perform temporal segmentation of video recordings to capture detailed activities during stroke patients’ rehabilitation. The main application scenario motivating this study is the clinical assessment of daily tabletop object interactions, which are crucial for post-stroke physical rehabilitation. To achieve this, we present a framework that leverages the biomechanics of movement during therapy sessions. Our solution divides the process into two main tasks: 2D keypoint detection to track patients’ physical movements, and 1D time-series temporal segmentation to analyze these movements over time. This dual approach enables automated labeling with only a limited set of real-world data, addressing the challenges of variability in patient movements and limited dataset availability. By tackling these issues, our method shows strong potential for practical deployment in physical therapy settings, enhancing the speed and accuracy of rehabilitation assessments.
zh
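上述摘要将康复视频分析拆分为 2D 关键点检测与 1D 时间序列时序分割两个任务。下面给出第二步的一个极简示意(假设性示例,函数名与阈值均为本文自拟,并非论文原始实现):由关键点轨迹计算逐帧平均速度,再按速度阈值粗分割出运动片段。

```python
import numpy as np

def segment_by_speed(keypoints, thresh=0.5, min_len=3):
    """keypoints: (T, K, 2) array of 2D keypoints over T frames.
    Returns a list of (start, end) frame-index pairs for high-motion segments."""
    # per-frame mean keypoint speed -> a 1D time series of length T-1
    speed = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1).mean(axis=-1)
    active = speed > thresh
    segments, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t                      # segment opens
        elif not a and start is not None:
            if t - start >= min_len:       # drop segments shorter than min_len
                segments.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments
```

实际系统中阈值与最小片段长度需按帧率和任务标定;论文采用的是学习式的时序分割,而非此处的简单阈值法。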

[CV-56] Snowball Adversarial Attack on Traffic Sign Classification

【速读】:该论文旨在解决机器学习模型在对抗攻击下的脆弱性问题,特别是针对交通标志识别任务中的深度神经网络。传统对抗攻击策略通常通过微小且难以察觉的扰动来误导分类器,而本文提出了一种正交的攻击策略:设计明显可见但不会混淆人类的扰动,同时最大化对机器学习算法的误分类效果。关键在于利用人脑在面对物体识别时对遮挡的良好适应能力,而深度神经网络却容易受到干扰的特点,通过“Snowball对抗攻击”实现对先进交通标志识别算法的有效迷惑,证明了该方法在多种图像上的鲁棒性,并揭示了深度神经网络在图像识别任务中的潜在漏洞,强调了加强模型防御机制的重要性。

链接: https://arxiv.org/abs/2502.19757
作者: Anthony Etim,Jakub Szefer
机构: Yale University (耶鲁大学); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Adversarial attacks on machine learning models often rely on small, imperceptible perturbations to mislead classifiers. Such a strategy focuses on minimizing the visual perturbation so that humans are not confused, while maximizing the misclassification for machine learning algorithms. An orthogonal strategy for adversarial attacks is to create perturbations that are clearly visible but do not confuse humans, yet still maximize misclassification for machine learning algorithms. This work follows the latter strategy and demonstrates an instance of it through the Snowball Adversarial Attack in the context of traffic sign recognition. The attack leverages the human brain’s superior ability to recognize objects despite various occlusions, while machine learning algorithms are easily confused. The evaluation shows that the Snowball Adversarial Attack is robust across various images and is able to confuse a state-of-the-art traffic sign recognition algorithm. The findings reveal that the Snowball Adversarial Attack can significantly degrade model performance with minimal effort, raising important concerns about the vulnerabilities of deep neural networks and highlighting the necessity of improved defenses for image recognition machine learning models.
zh
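Snowball 攻击的核心思路是施加对人类清晰可见、但不致混淆人类判断的遮挡式扰动。以下是一个极简示意(假设性示例,仅演示"可见补丁叠加"这一基本操作;论文中逐步扩大遮挡以误导分类器的搜索策略未在此实现):

```python
import numpy as np

def apply_visible_patch(image, patch, top, left):
    """Paste a clearly visible patch onto an image (H, W, C).
    Unlike L_inf-bounded attacks, no imperceptibility constraint is imposed."""
    out = image.copy()                     # leave the original image untouched
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return out
```

在实际攻击中,补丁的位置与大小会在保持人类可辨识的前提下迭代调整,以最大化模型误分类率。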

[CV-57] Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network CVPR2025

【速读】:该论文旨在解决基于Schrödinger Bridge (SB) 的图像生成方法在处理复杂图像数据时计算成本高且耗时的问题。尽管SB方法理论上通过寻找两个分布之间的最短路径来提升生成效率和质量,但其全局最优路径拟合在高维空间中的复杂性导致了较大的计算开销。相比之下,扩散模型(Diffusion Models)通常采用更简单的权重函数 (f_A(t)) 和 (f_B(t)) 构建路径子空间,这启发了本文提出了一种新的方法:在扩散路径子空间中寻找局部Diffusion Schrödinger Bridges (LDSB),以加强SB问题与扩散模型之间的联系。关键在于利用Kolmogorov-Arnold Network (KAN) 优化这些扩散路径,KAN具有抗遗忘能力和连续输出的优势。实验表明,LDSB在保持相同预训练去噪网络的同时显著提升了图像生成的质量和效率,且KAN的参数量小于0.1MB,FID指标相比传统方法降低了超过15%,其中对于CelebA数据集,在DDIM的数值积分步数NFE=5时,FID降低了48.50%。

链接: https://arxiv.org/abs/2502.19754
作者: Xingyu Qiu,Mengying Yang,Xinghua Ma,Fanding Li,Dong Liang,Gongning Luo,Wei Wang,Kuanquan Wang,Shuo Li
机构: Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Case Western Reserve University (凯斯西储大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures, to be published in CVPR 2025

点击查看摘要

Abstract:In image generation, Schrödinger Bridge (SB)-based methods theoretically enhance the efficiency and quality compared to the diffusion models by finding the least costly path between two distributions. However, they are computationally expensive and time-consuming when applied to complex image data. The reason is that they focus on fitting globally optimal paths in high-dimensional spaces, directly generating images as the next step on the path using complex networks through self-supervised training, which typically results in a gap with the global optimum. Meanwhile, most diffusion models are in the same path subspace generated by weights f_A(t) and f_B(t) , as they follow the paradigm ( x_t = f_A(t)x_{Img} + f_B(t)\epsilon ). To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrödinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. Specifically, our method optimizes the diffusion paths using the Kolmogorov-Arnold Network (KAN), which has the advantage of resistance to forgetting and continuous output. The experiments show that our LDSB significantly improves the quality and efficiency of image generation using the same pre-trained denoising network, and the KAN used for optimization is less than 0.1MB. The FID metric is reduced by more than 15%, especially with a reduction of 48.50% when the NFE of DDIM is 5 for the CelebA dataset. Code is available at this https URL.
zh
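摘要指出多数扩散模型处于由权重 f_A(t) 与 f_B(t) 生成的同一路径子空间,即 x_t = f_A(t)x_{Img} + f_B(t)ε。下面用标准 DDPM 的取法 f_A(t)=√ᾱ_t、f_B(t)=√(1-ᾱ_t) 作一个示意(假设性示例;LDSB 的做法是用 KAN 去优化这两个权重函数,此处并未实现):

```python
import numpy as np

def ddpm_path_weights(T=1000, beta_start=1e-4, beta_end=0.02):
    """Standard DDPM instance of the path family x_t = f_A(t)*x_img + f_B(t)*eps:
    f_A(t) = sqrt(alpha_bar_t), f_B(t) = sqrt(1 - alpha_bar_t)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)

def forward_sample(x_img, t, f_A, f_B, rng):
    """Sample x_t on the diffusion path at step t."""
    eps = rng.standard_normal(x_img.shape)
    return f_A[t] * x_img + f_B[t] * eps
```

注意对这一取法有 f_A(t)² + f_B(t)² = 1(方差保持);LDSB 在该路径子空间内搜索局部更优的权重组合。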

[CV-58] Lightweight Contrastive Distilled Hashing for Online Cross-modal Retrieval

【速读】:该论文旨在解决深度在线跨模态哈希技术面临的三个主要挑战:1) 如何提取跨模态数据的共存语义相关性;2) 如何在处理实时数据流时实现竞争力性能;3) 如何以轻量级方式将离线学习的知识迁移到在线训练。为了解决这些问题,论文提出了一种轻量级对比蒸馏哈希方法(Lightweight Contrastive Distilled Hashing, LCDH),通过在知识蒸馏框架下利用相似矩阵近似创新性地连接离线和在线跨模态哈希。关键在于通过知识蒸馏机制,利用教师网络提取的共存语义相关性来监督学生网络,从而提升轻量级模型在线哈希中的性能。具体而言,教师网络使用对比语言-图像预训练(CLIP)提取跨模态特征,并通过注意力模块增强表示后生成用于对齐相似矩阵的哈希码;学生网络则通过轻量级模型提取特征并生成二进制码,最终借助相似矩阵近似实现性能提升。实验结果表明,LCDH 在三个常用数据集上优于一些最先进的方法。

链接: https://arxiv.org/abs/2502.19751
作者: Jiaxing Li,Lin Jiang,Zeqi Ma,Kaihang Jiang,Xiaozhao Fang,Jie Wen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep online cross-modal hashing has recently gained much attention from researchers, owing to its promising applications such as low storage requirements, fast retrieval efficiency, and cross-modal adaptivity. However, there still exist some technical hurdles that hinder its applications, e.g., 1) how to extract the coexistent semantic relevance of cross-modal data, 2) how to achieve competitive performance when handling real-time data streams, 3) how to transfer the knowledge learned offline to online training in a lightweight manner. To address these problems, this paper proposes a lightweight contrastive distilled hashing (LCDH) method for cross-modal retrieval, which innovatively bridges offline and online cross-modal hashing by similarity matrix approximation in a knowledge distillation framework. Specifically, in the teacher network, LCDH first extracts the cross-modal features by contrastive language-image pre-training (CLIP), which are further fed into an attention module for representation enhancement after feature fusion. Then, the output of the attention module is fed into an FC layer to obtain hash codes for aligning the sizes of the similarity matrices for online and offline training. In the student network, LCDH extracts the visual and textual features by lightweight models, and the features are then fed into an FC layer to generate binary codes. Finally, by approximating the similarity matrices, the performance of online hashing in the lightweight student network can be enhanced by the supervision of the coexistent semantic relevance distilled from the teacher network. Experimental results on three widely used datasets demonstrate that LCDH outperforms some state-of-the-art methods.
zh
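LCDH 通过近似师生网络的相似矩阵来传递知识:无论两个网络的编码维度是否一致,N 个样本的相似矩阵都是 N×N,因此尺寸天然对齐。下面是该思想的一个极简示意(假设性示例,采用余弦相似度与均方差作距离,论文的具体形式可能不同):

```python
import numpy as np

def similarity_matrix(codes):
    """Cosine similarity matrix of relaxed (real-valued) hash codes; rows = samples."""
    normed = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    return normed @ normed.T               # (N, N) regardless of code length

def distillation_loss(teacher_codes, student_codes):
    """Mean squared distance between teacher and student similarity matrices,
    a plain stand-in for LCDH's similarity-matrix approximation."""
    S_t = similarity_matrix(teacher_codes)
    S_s = similarity_matrix(student_codes)
    return float(np.mean((S_t - S_s) ** 2))
```

注意师生两侧的码长可以不同(下例中分别为 2 维与 3 维),因为监督信号完全由 N×N 的相似结构承载。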

[CV-59] CirT: Global Subseasonal-to-Seasonal Forecasting with Geometry-inspired Transformer

【速读】:该论文旨在解决亚季节到季节(Subseasonal-to-Seasonal, S2S)气候预测中因数据驱动模型未能充分考虑几何归纳偏置而导致的性能局限性问题。传统方法通常将球面天气数据视为平面图像,这会导致位置和空间关系表示的不准确性。为了解决这一挑战,论文提出了一种基于几何启发的Circular Transformer (CirT),其关键在于两个创新设计:(1) 将纬度分解后的天气数据划分为圆环形补丁,并将其作为Transformer模型的输入令牌;(2) 在自注意力机制中引入傅里叶变换以捕获全局信息并建模空间周期性。实验结果表明,该模型在ERA5再分析数据集上的表现显著优于现有的先进数据驱动模型(如PanguWeather和GraphCast),并在与ECMWF系统的对比中展现出卓越的预测能力。

链接: https://arxiv.org/abs/2502.19750
作者: Yang Liu,Zinan Zheng,Jiashun Cheng,Fugee Tsung,Deli Zhao,Yu Rong,Jia Li
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); DAMO Academy, Alibaba Group (达摩院,阿里巴巴集团)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate Subseasonal-to-Seasonal (S2S) climate forecasting is pivotal for decision-making including agriculture planning and disaster preparedness but is known to be challenging due to its chaotic nature. Although recent data-driven models have shown promising results, their performance is limited by inadequate consideration of geometric inductive biases. Usually, they treat the spherical weather data as planar images, resulting in an inaccurate representation of locations and spatial relations. In this work, we propose the geometric-inspired Circular Transformer (CirT) to model the cyclic characteristic of the graticule, consisting of two key designs: (1) Decomposing the weather data by latitude into circular patches that serve as input tokens to the Transformer; (2) Leveraging Fourier transform in self-attention to capture the global information and model the spatial periodicity. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate our model yields a significant improvement over the advanced data-driven models, including PanguWeather and GraphCast, as well as skillful ECMWF systems. Additionally, we empirically show the effectiveness of our model designs and high-quality prediction over spatial and temporal dimensions.
zh
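CirT 的两个关键设计是按纬度划分圆环补丁,以及在自注意力中引入傅里叶变换以捕获全局信息并建模空间周期性。下面给出一个示意(假设性示例:圆环补丁化简化为 reshape,傅里叶混合借用 FNet 风格的实数 FFT 混合,并非论文的精确实现):

```python
import numpy as np

def circular_patches(field, lat_per_patch=4):
    """Split a (lat, lon) field into latitude bands; each band becomes one
    token whose features cover a full circle of longitudes."""
    lat, lon = field.shape
    return field.reshape(lat // lat_per_patch, lat_per_patch * lon)

def fourier_mixing(tokens):
    """Real part of a 2D DFT over tokens and features (FNet-style): a simple
    stand-in for the Fourier-transformed self-attention in the abstract."""
    return np.real(np.fft.fft(np.fft.fft(tokens, axis=-1), axis=0))
```

FFT 沿经度方向是自然的选择,因为经度轴是周期性的,离散傅里叶变换隐含的循环边界条件与球面经圈的几何一致。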

[CV-60] LUCAS: Layered Universal Codec Avatars

【速读】:该论文致力于解决3D人脸 avatar 重建中的两大挑战:动态人脸-头发交互建模以及跨身份泛化能力,尤其是在表情变化和头部运动过程中。论文提出了一种名为LUCAS的新颖通用先验模型(Universal Prior Model, UPM),其关键在于通过分层表示方法解耦人脸与头发建模。不同于以往将头发视为头部不可分割部分的传统方法,LUCAS将无发头部与头发分别建模为独立分支。此外,LUCAS首次引入基于网格的UPM,支持实时渲染,并通过分层表示优化锚点几何结构以实现精确且视觉效果优良的高斯渲染。实验结果表明,LUCAS在定量和定性评估中均优于现有的单一网格和基于高斯的avatar模型,特别是在零样本驱动场景下对未见身份的评估中表现出色,同时在处理头部姿态变化、表情迁移及发型变化等动态性能方面具有显著优势。

链接: https://arxiv.org/abs/2502.19739
作者: Di Liu,Teng Deng,Giljoo Nam,Yu Rong,Stanislav Pidhorskyi,Junxuan Li,Jason Saragih,Dimitris N. Metaxas,Chen Cao
机构: Meta (Meta); Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Photorealistic 3D head avatar reconstruction faces critical challenges in modeling dynamic face-hair interactions and achieving cross-identity generalization, particularly during expressions and head movements. We present LUCAS, a novel Universal Prior Model (UPM) for codec avatar modeling that disentangles face and hair through a layered representation. Unlike previous UPMs that treat hair as an integral part of the head, our approach separates the modeling of the hairless head and hair into distinct branches. LUCAS is the first to introduce a mesh-based UPM, facilitating real-time rendering on devices. Our layered representation also improves the anchor geometry for precise and visually appealing Gaussian renderings. Experimental results indicate that LUCAS outperforms existing single-mesh and Gaussian-based avatar models in both quantitative and qualitative assessments, including evaluations on held-out subjects in zero-shot driving scenarios. LUCAS demonstrates superior dynamic performance in managing head pose changes, expression transfer, and hairstyle variations, thereby advancing the state-of-the-art in 3D head avatar reconstruction.
zh

[CV-61] Learning Mask Invariant Mutual Information for Masked Image Modeling ICLR2025

【速读】:该论文旨在解决掩码自编码器(Masked Autoencoders, MAEs)在计算机视觉中的机制理解不足问题。尽管MAEs在实践中取得了成功,但其内在运作机理尚未被充分揭示。现有研究主要通过对比学习和特征表示分析来解释MAEs的功能,但这些方法往往只能提供隐含的洞见。为了解决这一问题,本文从信息瓶颈原理出发提出了一种新的视角。论文的关键在于理论分析表明,优化潜在特征以平衡相关与无关信息对于提升MAE性能至关重要。基于此,作者提出了MI-MAE方法,通过最大化潜在特征与输出之间的互信息以及最小化潜在特征与输入之间的互信息来优化MAEs。实验结果表明,MI-MAE在图像分类、目标检测和语义分割等任务上显著优于传统MAE模型,验证了所提出的理论框架,并展示了将信息瓶颈原理应用于MAEs的实际优势。

链接: https://arxiv.org/abs/2502.19718
作者: Tao Huang,Yanxiang Ma,Shan You,Chang Xu
机构: School of Computer Science, Faculty of Engineering, The University of Sydney (悉尼大学工程学院计算机科学学院); SenseTime Research (商汤研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2025

点击查看摘要

Abstract:Masked autoencoders (MAEs) represent a prominent self-supervised learning paradigm in computer vision. Despite their empirical success, the underlying mechanisms of MAEs remain insufficiently understood. Recent studies have attempted to elucidate the functioning of MAEs through contrastive learning and feature representation analysis, yet these approaches often provide only implicit insights. In this paper, we propose a new perspective for understanding MAEs by leveraging the information bottleneck principle in information theory. Our theoretical analyses reveal that optimizing the latent features to balance relevant and irrelevant information is key to improving MAE performance. Building upon our proofs, we introduce MI-MAE, a novel method that optimizes MAEs through mutual information maximization and minimization. By enhancing latent features to retain maximal relevant information between them and the output, and minimizing irrelevant information between them and the input, our approach achieves better performance. Extensive experiments on standard benchmarks show that MI-MAE significantly outperforms MAE models in tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self-supervised learning models.
zh
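按信息瓶颈原理,MI-MAE 的优化目标可以写成如下示意形式(权重 λ₁、λ₂ 与符号记法为本文自拟;论文中互信息项的具体可微估计方式未在此展开):

```latex
\mathcal{L}_{\text{MI-MAE}}
  \;=\; \mathcal{L}_{\text{MAE}}
  \;-\; \lambda_1\, I(\mathbf{z};\,\mathbf{y})
  \;+\; \lambda_2\, I(\mathbf{z};\,\mathbf{x})
```

其中 x 为输入、z 为潜在特征、y 为输出:最大化 I(z; y) 以保留与输出相关的信息,最小化 I(z; x) 以去除输入中的无关信息,即摘要所述的"平衡相关与无关信息"。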

[CV-62] Recent Advances on Generalizable Diffusion-generated Image Detection

【速读】:该论文旨在解决扩散模型在生成高质量Deepfake图像方面带来的新挑战,特别是针对图像真实性验证的问题。随着扩散模型的广泛应用,确保生成图像的可信度变得尤为重要。论文的关键在于系统性地综述了近年来在通用扩散生成图像检测领域的最新进展,并将其分类为两大类:(1) 数据驱动检测和(2) 特征驱动检测。进一步地,基于其底层原理,现有检测方法被细分为六个更具体的类别。通过这种方式,论文不仅总结了当前的研究成果,还指出了存在的开放性挑战并展望了未来方向,以期推动该领域更多研究工作的开展。

链接: https://arxiv.org/abs/2502.19716
作者: Qijie Xu,Defang Chen,Jiawei Chen,Siwei Lyu,Can Wang
机构: State Key Laboratory of Blockchain and Data Security, Zhejiang University (浙江大学); University at Buffalo, State University of New York (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rise of diffusion models has significantly improved the fidelity and diversity of generated images. With numerous benefits, these advancements also introduce new risks. Diffusion models can be exploited to create high-quality Deepfake images, which poses challenges for image authenticity verification. In recent years, research on generalizable diffusion-generated image detection has grown rapidly. However, a comprehensive review of this topic is still lacking. To bridge this gap, we present a systematic survey of recent advances and classify them into two main categories: (1) data-driven detection and (2) feature-driven detection. Existing detection methods are further classified into six fine-grained categories based on their underlying principles. Finally, we identify several open challenges and envision some future directions, with the hope of inspiring more research work on this important topic. Reviewed works in this survey can be found at this https URL.
zh

[CV-63] SAP-DIFF: Semantic Adversarial Patch Generation for Black-Box Face Recognition Models via Diffusion Models

【速读】:该论文旨在解决人脸识别(Face Recognition, FR)模型在对抗性补丁攻击(adversarial patch attacks)下的鲁棒性评估问题,特别是针对伪装成合法用户的 impersonation 攻击。现有方法在提升攻击成功率、降低对攻击者能力的要求以及减少查询次数方面存在局限性。为应对这些挑战,论文提出了一种名为SAP-DIFF的新方法,其关键是利用扩散模型(diffusion models)通过潜在空间中的语义扰动(semantic perturbations)生成对抗性补丁,而非直接操作像素值。此外,引入注意力干扰机制以生成与原始人脸无关的特征,并设计方向损失函数(directional loss function)引导扰动向目标身份特征空间发展,从而显著提升了攻击的有效性和效率。实验结果表明,该方法在多个主流FR模型和数据集上的平均攻击成功率提高了45.66%,且查询次数减少了约40%,优于当前最先进的方法。

链接: https://arxiv.org/abs/2502.19710
作者: Mingsi Wang,Shuaiyin Yao,Chang Yue,Lijie Zhang,Guozhu Meng
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Beijing (北京)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Given the need to evaluate the robustness of face recognition (FR) models, many efforts have focused on adversarial patch attacks that mislead FR models by introducing localized perturbations. Impersonation attacks are a significant threat because adversarial perturbations allow attackers to disguise themselves as legitimate users. This can lead to severe consequences, including data breaches, system damage, and misuse of resources. However, research on such attacks in FR remains limited. Existing adversarial patch generation methods exhibit limited efficacy in impersonation attacks due to (1) the need for high attacker capabilities, (2) low attack success rates, and (3) excessive query requirements. To address these challenges, we propose a novel method, SAP-DIFF, that leverages diffusion models to generate adversarial patches via semantic perturbations in the latent space rather than direct pixel manipulation. We introduce an attention disruption mechanism to generate features unrelated to the original face, facilitating the creation of adversarial samples, and a directional loss function to guide perturbations toward the target identity feature space, thereby enhancing attack effectiveness and efficiency. Extensive experiments on popular FR models and datasets demonstrate that our method outperforms state-of-the-art approaches, achieving an average attack success rate improvement of 45.66% (all exceeding 40%) and a reduction in the number of queries by about 40% compared to the SOTA approach.
zh
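摘要中的方向损失用于把扰动引向目标身份的特征空间。一个常见且简单的实现方式是最小化对抗样本嵌入与目标身份特征之间的余弦距离(假设性示例,SAP-DIFF 的精确损失形式可能不同):

```python
import numpy as np

def directional_loss(f_adv, f_target):
    """1 - cosine similarity: minimizing this pulls the adversarial face
    embedding toward the target identity's feature direction."""
    cos = f_adv @ f_target / (np.linalg.norm(f_adv) * np.linalg.norm(f_target))
    return 1.0 - float(cos)
```

当两个嵌入方向一致时损失为 0,正交时为 1;优化器据此驱动潜在空间中的语义扰动向目标身份靠拢。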

[CV-64] Accurate Pose Estimation for Flight Platforms based on Divergent Multi-Aperture Imaging System

【速读】:该论文旨在解决基于视觉的飞行平台位姿估计中因单目相机视场角和空间分辨率限制而导致的精度不足问题。为实现大视场与高分辨率的同时观测,论文设计了一种发散多孔径成像系统(DMAIS)。解决方案的关键在于提出了一种基于三维标定场的DMAIS标定方法,通过该方法确定DMAIS的成像参数,并将其建模为广义相机。此外,论文引入了一种新算法,将绝对位姿估计算法转化为非线性最小化问题,并基于拉格朗日乘子建立了新的最优性条件。最终的标定实验验证了所提方法的有效性和准确性,真实飞行实验进一步证明了系统可达到厘米级定位精度和弧分级姿态角精度的能力。

链接: https://arxiv.org/abs/2502.19708
作者: Shunkun Liang,Bin Li,Banglei Guan,Yang Shang,Xianwei Zhu,Qifeng Yu
机构: College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410000, China (国防科技大学航空航天科学与工程学院, 长沙 410000, 中国); Institute of Intelligent Optical Measurement and Detection, Shenzhen University, Shenzhen 518000, China (深圳大学智能光学测量与检测研究所, 深圳 518000, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-based pose estimation plays a crucial role in the autonomous navigation of flight platforms. However, the field of view and spatial resolution of the camera limit pose estimation accuracy. This paper designs a divergent multi-aperture imaging system (DMAIS), equivalent to a single imaging system, to achieve simultaneous observation of a large field of view and high spatial resolution. The DMAIS overcomes traditional observation limitations, allowing accurate pose estimation for the flight platform. Before conducting pose estimation, the DMAIS must be calibrated. To this end, we propose a calibration method for the DMAIS based on a 3D calibration field. The calibration process determines the imaging parameters of the DMAIS, which allows us to model the DMAIS as a generalized camera. Subsequently, a new algorithm for accurately determining the pose of the flight platform is introduced. We transform the absolute pose estimation problem into a nonlinear minimization problem. New optimality conditions are established for solving this problem based on Lagrange multipliers. Finally, real calibration experiments show the effectiveness and accuracy of the proposed method. Results from real flight experiments validate the system’s ability to achieve centimeter-level positioning accuracy and arc-minute-level orientation accuracy.
zh
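将绝对位姿估计转化为带约束的非线性最小化问题,其一般形式可示意如下(r_i 表示广义相机模型下第 i 个观测的残差项,为本文自拟记号;论文基于拉格朗日乘子建立的具体最优性条件未在此展开):

```latex
\min_{R,\;\mathbf{t}} \;\sum_{i} \big\| \mathbf{r}_i(R, \mathbf{t}) \big\|^2
\quad \text{s.t.} \quad R^{\top}R = I_3,
\qquad
\mathcal{L}(R, \mathbf{t}, \Lambda)
  = \sum_{i} \big\| \mathbf{r}_i(R, \mathbf{t}) \big\|^2
  + \big\langle \Lambda,\; R^{\top}R - I_3 \big\rangle
```

其中旋转矩阵 R 的正交约束由乘子矩阵 Λ 引入拉格朗日函数,对 (R, t, Λ) 求驻点即得最优性条件。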

[CV-65] Weakly Supervised Segmentation Framework for Thyroid Nodule Based on High-confidence Labels and High-rationality Losses

【速读】:该论文致力于解决弱监督甲状腺超声图像分割方法中存在的两个主要问题:1)由拓扑先验产生的低置信度伪标签引入显著的标签噪声;2)缺乏合理性损失函数,其严格比较分割结果与标签,忽视了形状多样且复杂的结节的判别信息。为了解决这些问题,论文的关键在于提出一个框架,包含高置信度伪标签以表示拓扑和解剖学信息,以及高合理性损失函数以捕获多层级的判别特征。具体而言,通过融合四点标注的几何变换和MedSAM模型的结果生成高置信度的框、前景和背景标签,并采用一种高合理性学习策略,包括对齐损失(Alignment Loss)、对比损失(Contrastive Loss)和原型相关性损失(Prototype Correlation Loss),以指导网络感知结节位置、学习结节与背景的特征分布,并细化不确定区域至精确的结节边缘。实验结果表明,该方法在TN3K和DDTI数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2502.19707
作者: Jianning Chi,Zelan Li,Geng Lin,MingYang Sun,Xiaosheng Yu
机构: Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169, China (东北大学机器人科学与工程学院,沈阳,110169, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Weakly supervised segmentation methods can delineate thyroid nodules in ultrasound images efficiently using training data with coarse labels, but suffer from: 1) low-confidence pseudo-labels that follow topological priors, introducing significant label noise, and 2) low-rationality loss functions that rigidly compare segmentation with labels, ignoring discriminative information for nodules with diverse and complex shapes. To solve these issues, we clarify the objective and references for weakly supervised ultrasound image segmentation, presenting a framework with high-confidence pseudo-labels to represent topological and anatomical information and high-rationality losses to capture multi-level discriminative features. Specifically, we fuse geometric transformations of four-point annotations and MedSAM model results prompted by specific annotations to generate high-confidence box, foreground, and background labels. Our high-rationality learning strategy includes: 1) Alignment loss measuring spatial consistency between segmentation and box label, and topological continuity within the foreground label, guiding the network to perceive nodule location; 2) Contrastive loss pulling features from labeled foreground regions while pushing features from labeled foreground and background regions, guiding the network to learn nodule and background feature distribution; 3) Prototype correlation loss measuring consistency between correlation maps derived by comparing features with foreground and background prototypes, refining uncertain regions to accurate nodule edges. Experimental results show that our method achieves state-of-the-art performance on the TN3K and DDTI datasets. The code is available at this https URL.
zh
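原型相关性损失依赖于特征与前景/背景原型之间的相关图。下面给出相关图计算的一个极简示意(假设性示例,采用逐像素余弦相似度,论文的具体形式可能不同):

```python
import numpy as np

def correlation_maps(features, fg_proto, bg_proto):
    """Cosine correlation of per-pixel features (H, W, C) with a foreground
    and a background prototype (C,); returns two (H, W) maps. Consistency
    between such maps is what the abstract's prototype correlation loss uses."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    fg = fg_proto / np.linalg.norm(fg_proto)
    bg = bg_proto / np.linalg.norm(bg_proto)
    return f @ fg, f @ bg                  # (H, W, C) @ (C,) -> (H, W)
```

相关图中前景响应高、背景响应高的区域分别对应结节与背景,两图不一致的不确定区域正是需要细化到结节边缘的部分。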

[CV-66] CFTrack: Enhancing Lightweight Visual Tracking through Contrastive Learning and Feature Matching

【速读】:该论文致力于解决轻量级视觉跟踪在效率与强判别能力兼得方面的挑战,特别是在移动和边缘设备等计算资源受限场景下。传统轻量级跟踪器在遮挡和干扰下的鲁棒性不足,而深度跟踪器在压缩以满足资源限制时往往会出现性能下降。论文的关键解决方案是提出了一种名为CFTrack的轻量级跟踪器,通过整合对比学习和特征匹配来增强判别性特征表示。其核心创新在于引入了一个新颖的对比特征匹配模块,并采用自适应对比损失进行优化,从而在预测过程中动态评估目标相似性,显著提升了跟踪精度。实验结果表明,CFTrack在LaSOT、OTB100和UAV123数据集上的表现超越了许多现有先进轻量级跟踪器,同时在NVIDIA Jetson NX平台上达到136帧每秒的速度,且在HOOT数据集的重度遮挡场景下展现出强大的判别能力。

链接: https://arxiv.org/abs/2502.19705
作者: Juntao Liang,Jun Hou,Weijun Zhang,Yong Wang
机构: School of Aeronautics and Astronautics, Shenzhen Campus of Sun Yat-sen University (中山大学深圳航空航天学院); Insta360 Research (Insta360研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving both efficiency and strong discriminative ability in lightweight visual tracking is a challenge, especially on mobile and edge devices with limited computational resources. Conventional lightweight trackers often struggle with robustness under occlusion and interference, while deep trackers, when compressed to meet resource constraints, suffer from performance degradation. To address these issues, we introduce CFTrack, a lightweight tracker that integrates contrastive learning and feature matching to enhance discriminative feature representations. CFTrack dynamically assesses target similarity during prediction through a novel contrastive feature matching module optimized with an adaptive contrastive loss, thereby improving tracking accuracy. Extensive experiments on LaSOT, OTB100, and UAV123 show that CFTrack surpasses many state-of-the-art lightweight trackers, operating at 136 frames per second on the NVIDIA Jetson NX platform. Results on the HOOT dataset further demonstrate CFTrack’s strong discriminative ability under heavy occlusion.
zh

[CV-67] Language-Informed Hyperspectral Image Synthesis for Imbalanced-Small Sample Classification via Semi-Supervised Conditional Diffusion Model

【速读】:该论文旨在解决高光谱图像分类(HSIC)中的小样本且类别不平衡(ISSD)问题。传统数据增强方法大多在潜在空间扩展特征,而较少利用文本信息生成逼真且多样化的样本以平衡有限的标注样本。为应对这一挑战,论文提出了一种新颖的语言引导高光谱图像合成方法(Txt2HSI-LDM(VAE))。其关键是结合变分自编码器(VAE)与半监督扩散模型:首先通过VAE将高维高光谱数据映射到低维潜在空间,以减少扩散模型的计算参数并获得稳定特征表示;其次设计半监督扩散模型,并引入随机多边形空间裁剪(RPSC)及潜在特征不确定性估计(LF-UE),模拟训练数据的不同混合程度;最后,通过条件语言输入解码生成更逼真和多样化的高光谱图像样本。实验验证了合成样本在统计特性和二维主成分分析(2D-PCA)空间分布上的有效性,并通过像素级交叉注意力图可视化证明了模型能够依赖视觉-语言对齐捕获生成图像的空间布局和几何结构。

链接: https://arxiv.org/abs/2502.19700
作者: Yimin Zhu,Linlin Xu
机构: Department of Geomatics Engineering, University of Calgary, Canada (卡尔加里大学测绘工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although data augmentation is an effective method to address the imbalanced-small sample data (ISSD) problem in hyperspectral image classification (HSIC), most methodologies extend features in the latent space. Few, however, generate realistic and diverse samples using text information to balance the limited number of annotated samples. Recently, text-driven diffusion models have gained significant attention due to their remarkable ability to generate highly diverse images based on given text prompts in natural image synthesis. Therefore, this paper proposes a novel language-informed hyperspectral image synthesis method (Txt2HSI-LDM(VAE)) for addressing the ISSD problem of HSIC. First, to handle the high-dimensional hyperspectral data, we use a universal variational autoencoder (VAE) to map the hyperspectral data into a low-dimensional latent space and obtain a stable feature representation, which hugely reduces the inference parameters of the diffusion model. Next, a semi-supervised diffusion model is designed to take full advantage of unlabeled data; besides, random polygon spatial clipping (RPSC) and uncertainty estimation of latent features (LF-UE) are used to simulate the varying degrees of mixing of the training data. Then, the VAE decodes the HSI from the latent space generated by the diffusion model with the conditional language as input, contributing to more realistic and diverse samples. In our experiments, we fully evaluate the effectiveness of the synthetic samples from the aspects of statistical characteristics and data distribution in the 2D-PCA space. Additionally, the cross-attention map is visualized at the pixel level to prove that our proposed model can capture the spatial layout and geometry of the generated hyperspectral image depending on the visual-linguistic alignment.
zh

[CV-68] Spatial-Spectral Diffusion Contrastive Representation Network for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像分类(Hyperspectral Image Classification, HSIC)中高效提取判别性空间-光谱特征的难题,这一难题受空间-光谱异质性和噪声影响等因素制约。论文提出了一种基于去噪扩散概率模型(Denoising Diffusion Probabilistic Model, DDPM)与对比学习(Contrastive Learning, CL)相结合的空间-光谱扩散对比表征网络(Spatial-Spectral Diffusion Contrastive Representation Network, DiffCRN)。其关键在于:首先,设计了一种新颖的分阶段架构,包含空间自注意力去噪模块(Spatial Self-Attention Denoising, SSAD)和光谱组自注意力去噪模块(Spectral Group Self-Attention Denoising, SGSAD),以提升空间-光谱特征的学习效率;其次,通过引入对数绝对误差(Logarithmic Absolute Error, LAE)损失函数和改进的对比学习机制,增强无监督特征学习的效能并提高实例间及类别间的可分辨性;再次,提出了基于像素级光谱角映射(Spectral Angle Mapping, SAM)的自适应时间步选择方法,实现特征选择的自动化和优化;最后,设计了自适应加权融合模块(Adaptive Weighted Addition Module, AWAM)和跨时间步光谱-空间融合模块(Cross Time Step Spectral-Spatial Fusion Module, CTSSFM),用于特征整合与分类任务。实验结果表明,DiffCRN在多个常用高光谱数据集上的性能优于经典基线模型及当前最先进的GAN、Transformer模型和其他预训练方法。

链接: https://arxiv.org/abs/2502.19699
作者: Yimin Zhu,Linlin Xu
机构: University of Calgary (卡尔加里大学), Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although efficient extraction of discriminative spatial-spectral features is critical for hyperspectral image classification (HSIC), it is difficult to achieve these features due to factors such as spatial-spectral heterogeneity and noise effects. This paper presents a Spatial-Spectral Diffusion Contrastive Representation Network (DiffCRN), based on a denoising diffusion probabilistic model (DDPM) combined with contrastive learning (CL) for HSIC, with the following characteristics. First, to improve spatial-spectral feature representation, instead of adopting the UNet-like structure widely used for DDPM, we design a novel staged architecture with a spatial self-attention denoising module (SSAD) and a spectral group self-attention denoising module (SGSAD) in DiffCRN with improved efficiency for spectral-spatial feature learning. Second, to improve unsupervised feature learning efficiency, we design a new DDPM with a logarithmic absolute error (LAE) loss and CL that improve the effectiveness of the loss function and increase instance-level and inter-class discriminability. Third, to improve feature selection, we design a learnable approach based on pixel-level spectral angle mapping (SAM) for selecting time steps in the proposed DDPM in an adaptive and automatic manner. Last, to improve feature integration and classification, we design an Adaptive Weighted Addition Module (AWAM) and a Cross Time Step Spectral-Spatial Fusion Module (CTSSFM) to fuse time-step-wise features and perform classification. Experiments conducted on four widely used HSI datasets demonstrate the improved performance of the proposed DiffCRN over classical backbone models and state-of-the-art GAN, transformer models and other pretrained methods. The source code and pre-trained model will be made available publicly.
zh
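DiffCRN 用像素级光谱角映射(SAM)来自适应选择扩散时间步。SAM 本身的计算很简单:求两条光谱向量夹角的反余弦。下面是一个基于 NumPy 的示意实现(函数名与参数为笔者假设,非论文官方代码):

```python
import numpy as np

def spectral_angle_map(x, y, eps=1e-12):
    """逐像素光谱角映射(SAM):返回两幅图像对应像素光谱
    之间的夹角(弧度),角度越小表示光谱越相似。
    x, y: (H, W, C) 数组,C 为波段数。"""
    dot = np.sum(x * y, axis=-1)
    norm = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    cos = np.clip(dot / (norm + eps), -1.0, 1.0)  # 裁剪避免数值越界
    return np.arccos(cos)
```

相同光谱的角度接近 0,正交光谱的角度为 π/2,因此 SAM 可以作为与尺度无关的光谱相似度度量。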

[CV-69] You Only Click Once: Single Point Weakly Supervised 3D Instance Segmentation for Autonomous Driving

【速读】:该论文旨在解决户外激光雷达点云三维实例分割任务中因人工标注劳动密集而导致的数据标注成本高的问题。为应对这一挑战,论文提出了一种名为YoCo的框架,其核心在于利用极少量粗略的鸟瞰图点击标注来生成高质量的三维伪标签。解决方案的关键包括:首先结合视觉基础模型与点云的几何约束以增强伪标签生成;其次设计基于时间和空间的标签更新模块,利用相邻帧预测结果及点云密度变化特性生成可靠的更新标签;最后通过IoU引导的增强模块,用高置信度且高IoU的预测结果替换伪标签,进一步提升标签质量。实验表明,YoCo在Waymo数据集上实现了最先进的性能,并显著降低了标注成本。

链接: https://arxiv.org/abs/2502.19698
作者: Guangfeng Jiang,Jun Liu,Yongxuan Lv,Yuzhi Wu,Xianfei Li,Wenlong Liao,Tao He,Pai Peng
机构: University of Science and Technology of China (中国科学技术大学); COWAROBOT (科沃达机器人)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Outdoor LiDAR point cloud 3D instance segmentation is a crucial task in autonomous driving. However, it requires laborious human efforts to annotate the point cloud for training a segmentation model. To address this challenge, we propose the YoCo framework, which generates 3D pseudo labels using minimal coarse click annotations in the bird’s eye view plane. It is a significant challenge to produce high-quality pseudo labels from such sparse annotations. Our YoCo framework first leverages vision foundation models combined with geometric constraints from point clouds to enhance pseudo label generation. Second, a temporal and spatial-based label updating module is designed to generate reliable updated labels. It leverages predictions from adjacent frames and utilizes the inherent density variation of point clouds (dense near, sparse far). Finally, to further improve label quality, an IoU-guided enhancement module is proposed, replacing pseudo labels with high-confidence and high-IoU predictions. Experiments on the Waymo dataset demonstrate YoCo’s effectiveness and generality, achieving state-of-the-art performance among weakly supervised methods and surpassing fully supervised Cylinder3D. Additionally, YoCo is suitable for various networks, achieving performance comparable to fully supervised methods with minimal fine-tuning using only 0.8% of the fully labeled data, significantly reducing annotation costs.
zh
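YoCo 的 IoU 引导增强模块用"高置信度且高 IoU"的预测替换伪标签。其核心逻辑可以用下面的简化示意表达(这里以轴对齐 2D 框代替真实的 3D 框,阈值与函数名均为演示假设,非论文实现):

```python
def box_iou(a, b):
    """轴对齐 2D 框 (x1, y1, x2, y2) 的 IoU。"""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def iou_guided_update(pseudo_boxes, pred_boxes, pred_scores,
                      iou_thr=0.7, score_thr=0.9):
    """当某预测框与伪标签框的 IoU 与置信度均超过阈值时,
    用预测框替换该伪标签框;否则保留原伪标签。"""
    updated = []
    for pb in pseudo_boxes:
        best = pb
        for box, s in zip(pred_boxes, pred_scores):
            if s >= score_thr and box_iou(pb, box) >= iou_thr:
                best = box
        updated.append(best)
    return updated
```

这一"只在预测足够可靠时才覆盖伪标签"的保守策略,是伪标签自训练方法避免误差累积的常见做法。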

[CV-70] Prompt-driven Transferable Adversarial Attack on Person Re-Identification with Attribute-aware Textual Inversion

【速读】:本文旨在解决行人再识别(Person Re-identification, Re-ID)模型在安全监控系统中的脆弱性评估问题,特别是如何通过迁移性对抗攻击揭示其潜在漏洞。现有的基于视觉-语言模型(Vision-Language Model, VLM)的攻击方法虽然展现了较强的迁移能力,但由于过分关注整体表征中的判别语义,导致特征干扰不够全面。为此,论文提出了一种名为属性感知提示攻击(Attribute-aware Prompt Attack, AP-Attack)的新方法,其关键是利用VLM的图像-文本对齐能力,通过破坏特定属性对应的文本嵌入来显式扰乱行人图像的细粒度语义特征。具体而言,通过设计文本反转网络将行人图像映射到伪标记以表示语义嵌入,并采用对比学习方式与预定义的提示模板共同训练,从而生成针对个体属性的个性化文本描述。这种被反转的良性及对抗性的细粒度文本语义有助于攻击者更有效地实施全面干扰,显著提升了对抗样本的迁移能力。实验结果表明,AP-Attack在跨模型数据集攻击场景下,平均Drop Rate相较先前方法提高了22.9%,实现了最先进的迁移性能。

链接: https://arxiv.org/abs/2502.19697
作者: Yuan Bian,Min Liu,Yunqi Yi,Xueping Wang,Yaonan Wang
机构: Hunan University (湖南大学); Hunan Normal University (湖南师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Person re-identification (re-id) models are vital in security surveillance systems, requiring transferable adversarial attacks to explore their vulnerabilities. Recently, vision-language model (VLM) based attacks have shown superior transferability by attacking the generalized image and textual features of VLMs, but they lack comprehensive feature disruption due to the overemphasis on discriminative semantics in the integral representation. In this paper, we introduce the Attribute-aware Prompt Attack (AP-Attack), a novel method that leverages a VLM’s image-text alignment capability to explicitly disrupt fine-grained semantic features of pedestrian images by destroying attribute-specific textual embeddings. To obtain personalized textual descriptions for individual attributes, textual inversion networks are designed to map pedestrian images to pseudo tokens that represent semantic embeddings, trained in a contrastive learning manner with images and a predefined prompt template that explicitly describes the pedestrian attributes. The inverted benign and adversarial fine-grained textual semantics facilitate the attacker in effectively conducting thorough disruptions, enhancing the transferability of adversarial examples. Extensive experiments show that AP-Attack achieves state-of-the-art transferability, significantly outperforming previous methods by 22.9% in mean Drop Rate in cross-model and cross-dataset attack scenarios.
zh

[CV-71] BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance CVPR2025

【速读】:该论文旨在解决鸟瞰图(BEV)表示中固有的噪声问题,这些噪声源于传感器限制和学习过程,导致次优的BEV表示,从而影响下游任务的性能。论文提出了一种名为BEVDiffuser的新扩散模型,通过利用真实物体布局作为指导来有效去噪BEV特征图。解决方案的关键在于将BEVDiffuser以即插即用的方式应用于训练阶段,以增强现有的BEV模型,而无需进行任何架构修改。实验结果表明,BEVDiffuser在nuScenes数据集上的卓越去噪和生成能力显著提升了现有BEV模型的性能,具体表现为3D目标检测的mAP提高了12.3%,NDS提高了10.1%,且未引入额外的计算复杂度。此外,在长尾目标检测以及恶劣天气和光照条件下的显著改进进一步验证了BEVDiffuser的有效性。

链接: https://arxiv.org/abs/2502.19694
作者: Xin Ye,Burhaneddin Yaman,Sheng Cheng,Feng Tao,Abhirup Mallik,Liu Ren
机构: Bosch Research North America (博世北美研究); Bosch Center for Artificial Intelligence (BCAI) (博世人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2025

点击查看摘要

Abstract:Bird’s-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser’s exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3% in mAP and 10.1% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser’s effectiveness in denoising and enhancing BEV representations.
zh

[CV-72] Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach CVPR2025

【速读】:该论文致力于解决主动学习(Active Learning, AL)在开放集类别(open-set classes)场景下的性能瓶颈问题。现有方法要么优先选择低认知不确定性(epistemic uncertainty, EU)的样本以强化已知类别的模型训练,要么关注高预测不确定性的样本以捕捉数据噪声或不可预测性(反映高偶然不确定性,aleatoric uncertainty, AU),但两者均因无法有效利用开放集样本的信息而表现欠佳。论文的关键创新在于提出了一种基于能量的主动开放集标注框架(Energy-based Active Open-set Annotation, EAOA)。EAOA 的核心在于有效整合 EU 和 AU,并通过引入一个 (C+1)-类检测器和目标分类器,结合能量驱动的 EU 测量、基于边界的能量损失函数以及目标分类器的能量驱动 AU 测量,同时设计了一个目标导向的自适应采样策略。该策略首先通过低 EU 得分筛选形成一个小的候选集以确保闭集特性,使 AU 指标变得有意义,随后从高 AU 样本中选择查询样本构建最终查询集,且候选集大小可动态调整。实验表明,EAOA 在保持高查询精度的同时实现了较低的训练开销,达到了当前最先进的性能水平。

链接: https://arxiv.org/abs/2502.19691
作者: Chen-Chen Zong,Sheng-Jun Huang
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Active learning (AL), which iteratively queries the most informative examples from a large pool of unlabeled candidates for model training, faces significant challenges in the presence of open-set classes. Existing methods either prioritize query examples likely to belong to known classes, indicating low epistemic uncertainty (EU), or focus on querying those with highly uncertain predictions, reflecting high aleatoric uncertainty (AU). However, they both yield suboptimal performance, as low EU corresponds to limited useful information, and closed-set AU metrics for unknown class examples are less meaningful. In this paper, we propose an Energy-based Active Open-set Annotation (EAOA) framework, which effectively integrates EU and AU to achieve superior performance. EAOA features a (C+1) -class detector and a target classifier, incorporating an energy-based EU measure and a margin-based energy loss designed for the detector, alongside an energy-based AU measure for the target classifier. Another crucial component is the target-driven adaptive sampling strategy. It first forms a smaller candidate set with low EU scores to ensure closed-set properties, making AU metrics meaningful. Subsequently, examples with high AU scores are queried to form the final query set, with the candidate set size adjusted adaptively. Extensive experiments show that EAOA achieves state-of-the-art performance while maintaining high query precision and low training overhead. The code is available at this https URL.
zh
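EAOA 中基于能量的不确定性度量通常建立在 logits 的 logsumexp 之上:能量越高(各维 logits 普遍偏小),样本越可能来自未知类。下面是这类能量打分的常见形式的示意实现(与论文的具体定义可能有出入,仅说明思路):

```python
import numpy as np

def energy_score(logits, T=1.0):
    """基于能量的打分:E(x) = -T * logsumexp(logits / T)。
    logits 越"自信"(某一维显著偏大),能量越低;
    高能量样本更可能属于未知类(对应高 EU)。
    这里用 max 平移实现数值稳定的 logsumexp。"""
    z = np.asarray(logits, dtype=float) / T
    m = z.max(axis=-1, keepdims=True)
    return -T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))
```

按此打分,可先取低能量(低 EU)样本构成闭集候选集,再在其中按 AU 排序选择最终查询样本,与摘要描述的两阶段采样一致。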

[CV-73] 3D Trajectory Reconstruction of Moving Points Based on a Monocular Camera

【速读】:该论文旨在解决利用单目相机从图像中重建移动点目标三维轨迹的问题。在观测条件受限(如观测不足、距离过长、平台观测误差较大)的情况下,最小二乘估计会面临病态问题。为了解决这一挑战,论文提出的关键方案是通过引入岭估计(Ridge Estimation)来缓解由有限观测条件引起的病态问题,并采用基于时间多项式的运动表示方法。此外,论文还提出了自动确定时间多项式阶数的算法以及用于定量描述重建精度的重构性定义。仿真与真实实验结果验证了所提方法的可行性和有效性。

链接: https://arxiv.org/abs/2502.19689
作者: Huayu Huang,Banglei Guan,Yang Shang,Qifeng Yu
机构: College of Aerospace Science and Engineering, National University of Defense Technology (国防科技大学航空航天科学与工程学院), Changsha, China; Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation (湖南省图像测量与视觉导航重点实验室), National University of Defense Technology (国防科技大学), Changsha, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The motion measurement of point targets constitutes a fundamental problem in photogrammetry, with extensive applications across various engineering domains. Reconstructing a point’s 3D motion just from the images captured by only a monocular camera is infeasible without prior assumptions. Under limited observation conditions such as insufficient observations, long distance, and high observation error of the platform, the least squares estimation faces the issue of ill-conditioning. This paper presents an algorithm for reconstructing 3D trajectories of moving points using a monocular camera. The motion of the points is represented through temporal polynomials. Ridge estimation is introduced to mitigate the ill-conditioning caused by limited observation conditions. Then, an automatic algorithm for determining the order of the temporal polynomials is proposed. Furthermore, a definition of reconstructability for temporal polynomials is proposed to describe the reconstruction accuracy quantitatively. The simulated and real-world experimental results demonstrate the feasibility, accuracy, and efficiency of the proposed method.
zh
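该文将点的运动表示为时间多项式,并用岭估计缓解最小二乘的病态问题。岭回归把正规方程从 (A^T A)θ = A^T Y 改为 (A^T A + λI)θ = A^T Y,以小的偏差换取数值稳定。下面是一个一维/多维通用的拟合示意(λ、阶数等参数为演示假设):

```python
import numpy as np

def fit_trajectory_ridge(t, obs, order=2, lam=1e-2):
    """时间多项式 + 岭估计的轨迹拟合示意:
    求解 theta = (A^T A + lam*I)^{-1} A^T Y,
    其中 A 为 Vandermonde 设计矩阵,lam 用于缓解病态。
    t: (N,) 时刻;obs: (N, D) 各时刻的坐标观测。"""
    t = np.asarray(t, dtype=float)
    A = np.vander(t, N=order + 1, increasing=True)   # 列为 [1, t, t^2, ...]
    lhs = A.T @ A + lam * np.eye(A.shape[1])
    theta = np.linalg.solve(lhs, A.T @ np.asarray(obs, dtype=float))
    return theta                                      # (order+1, D) 多项式系数

def eval_trajectory(theta, t):
    """按拟合系数在任意时刻求轨迹坐标。"""
    A = np.vander(np.asarray(t, dtype=float), N=theta.shape[0], increasing=True)
    return A @ theta
```

当观测充分时 λ 的偏差可忽略;观测不足或设计矩阵接近奇异时,λI 保证方程仍可稳定求解,这正是摘要中引入岭估计的动机。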

[CV-74] M-LLM Based Video Frame Selection for Efficient Video Understanding

【速读】:该论文旨在解决现有多模态大语言模型(Multi-Modal Large Language Models, M-LLMs)在处理长上下文视频推理任务时,由于采用简单均匀采样策略选择视频帧而导致的关键上下文信息丢失的问题。这种信息丢失会削弱下游M-LLM对视觉信息的理解能力,从而影响其回答问题的效果。为了解决这一痛点,论文提出了一种轻量级的基于M-LLM的帧选择方法,通过自适应地选择与用户查询更相关的帧来弥补上述不足。解决方案的关键在于引入了两种监督信号:空间信号(Spatial Signal),通过提示M-LLM对单帧重要性进行评分;时间信号(Temporal Signal),利用所有候选帧的字幕提示大规模语言模型(Large Language Model, LLM)实现多帧选择。最终,所选帧被传递给冻结的下游视频M-LLM用于视觉推理和问答任务。实验结果表明,所提出的帧选择方法显著提升了多种视频长上下文问答基准(包括中等上下文的ActivityNet、NExT-QA以及长上下文的EgoSchema、LongVideoBench)上的性能。

链接: https://arxiv.org/abs/2502.19680
作者: Kai Hu,Feng Gao,Xiaohan Nie,Peng Zhou,Son Tran,Tal Neiman,Lingyun Wang,Mubarak Shah,Raffay Hamid,Bing Yin,Trishul Chilimbi
机构: Carnegie Mellon University (卡内基梅隆大学); University of Central Florida (中佛罗里达大学); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames that are fed into an M-LLM, particularly for long context videos. However, this can lose crucial context in certain periods of a video, so that the downstream M-LLM may not have sufficient visual information to answer a question. To address this pain point, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames that are more relevant to users’ queries. In order to train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, where a single-frame importance score is obtained by prompting an M-LLM; (ii) a temporal signal, in which multiple frames are selected by prompting a Large Language Model (LLM) using the captions of all frame candidates. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) across medium (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) context video question answering benchmarks.
zh
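有了逐帧重要性分数之后,选帧环节本身可以非常轻量。下面是一个贪心的 top-k 选帧示意(加入最小索引间隔以保持时间覆盖;函数名与参数均为演示假设,并非论文所用的选择器):

```python
import numpy as np

def select_frames(importance, k, min_gap=1):
    """按重要性分数贪心选择 k 帧:每次取剩余分数最高的帧,
    且要求与已选帧的索引间隔不小于 min_gap,以兼顾时间覆盖。"""
    order = np.argsort(importance)[::-1]  # 分数从高到低的帧索引
    chosen = []
    for idx in order:
        if all(abs(int(idx) - c) >= min_gap for c in chosen):
            chosen.append(int(idx))
        if len(chosen) == k:
            break
    return sorted(chosen)
```

与均匀采样相比,这类查询相关的选帧在固定帧预算下能把有限的视觉 token 留给信息量最大的片段。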

[CV-75] owards Differential Handling of Various Blur Regions for Accurate Image Deblurring

【速读】:该论文旨在解决图像去模糊(Image Deblurring)过程中未能充分考虑模糊图像不同区域降级程度差异的问题,同时避免通过堆叠大量非线性激活函数来近似复杂的非线性函数特性。论文的关键解决方案是提出了一种微分处理网络(Differential Handling Network, DHNet),其核心包括两个创新模块:Volterra块(Volterra Block, VBlock)和降级程度识别专家模块(Degradation Degree Recognition Expert Module, DDRE)。其中,VBlock将非线性特性集成到去模糊网络中,而DDRE通过整合先验知识自适应地估计空间变化的模糊信息,并根据降级程度和区域大小动态分配权重给不同的专家模块,从而实现对不同模糊区域的差异化处理。实验结果表明,DHNet在合成数据集和真实数据集上均显著优于当前最先进的方法。

链接: https://arxiv.org/abs/2502.19677
作者: Hu Gao,Depeng Dang
机构: Beijing Normal University (北京师范大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image deblurring aims to restore high-quality images by removing undesired degradation. Although existing methods have yielded promising results, they either overlook the varying degrees of degradation across different regions of the blurred image, or they approximate nonlinear function properties by stacking numerous nonlinear activation functions. In this paper, we propose a differential handling network (DHNet) to perform differential processing for different blur regions. Specifically, we design a Volterra block (VBlock) to integrate the nonlinear characteristics into the deblurring network, avoiding the previous operation of stacking the number of nonlinear activation functions to map complex input-output relationships. To enable the model to adaptively address varying degradation degrees in blurred regions, we devise the degradation degree recognition expert module (DDRE). This module initially incorporates prior knowledge from a well-trained model to estimate spatially variable blur information. Consequently, the router can map the learned degradation representation and allocate weights to experts according to both the degree of degradation and the size of the regions. Comprehensive experimental results show that DHNet effectively surpasses state-of-the-art (SOTA) methods on both synthetic and real-world datasets.
zh
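DHNet 的 VBlock 借助 Volterra 展开引入非线性,而非堆叠激活函数。截断到二阶的 Volterra 展开形如 y = b + w1·x + x^T W2 x,其中二次项直接刻画输入分量间的乘性交互。下面给出一个标量输出的最小示意(非论文中的卷积式 VBlock 实现):

```python
import numpy as np

def volterra_2nd_order(x, w1, w2, b=0.0):
    """截断到二阶的 Volterra 展开:y = b + w1·x + x^T W2 x。
    二次项显式建模输入间的乘性交互,
    无需激活函数即可表达非线性映射。"""
    x = np.asarray(x, dtype=float)
    return b + w1 @ x + x @ w2 @ x
```

在 VBlock 中,这一思路被推广为卷积形式:线性项对应普通卷积,二次项对应特征图之间的逐元素乘性交互。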

[CV-76] MICINet: Multi-Level Inter-Class Confusing Information Removal for Reliable Multimodal Classification

【速读】:该论文致力于解决在噪声数据存在的情况下多模态学习的可靠性问题,特别是在安全关键应用中。现有方法往往专注于处理单一模态或跨模态噪声,但未能高效应对两种噪声的同时存在,并且缺乏对全局与个体层面噪声的全面考虑,从而限制了其可靠性。为此,论文提出了一种名为“多层级类混淆信息去除网络”(Multi-Level Inter-Class Confusing Information Removal Network, MICINet) 的可靠多模态分类方法。MICINet 的关键是将两类噪声统一为“类间混淆信息”(Inter-class Confusing Information, ICI) 的概念,并通过全局与个体两个层面消除这种混淆信息。具体而言,MICINet 首先通过提出的全局类间混淆信息学习模块可靠地学习全局 ICI 分布;接着引入全局引导样本类间混淆信息学习模块,利用学到的全局 ICI 分布高效去除样本特征中的全局级 ICI;最后设计样本自适应跨模态信息补偿模块,基于判别特征与 ICI 之间的互补关系以及模态相对判别能力感知到的模态质量差异,可靠地去除每个样本中的个体级 ICI。实验结果表明,MICINet 在多种噪声条件下优于其他最先进的可靠多模态分类方法。

链接: https://arxiv.org/abs/2502.19674
作者: Tong Zhang,Shu Shen,C. L. Philip Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:Reliable multimodal learning in the presence of noisy data is a widely concerned issue, especially in safety-critical applications. Many reliable multimodal methods delve into addressing modality-specific or cross-modality noise. However, they fail to handle the coexistence of both types of noise efficiently. Moreover, the lack of comprehensive consideration for noise at both global and individual levels limits their reliability. To address these issues, a reliable multimodal classification method dubbed Multi-Level Inter-Class Confusing Information Removal Network (MICINet) is proposed. MICINet achieves the reliable removal of both types of noise by unifying them into the concept of Inter-class Confusing Information (ICI) and eliminating it at both global and individual levels. Specifically, MICINet first reliably learns the global ICI distribution through the proposed Global ICI Learning Module. Then, it introduces the Global-guided Sample ICI Learning module to efficiently remove global-level ICI from sample features utilizing the learned global ICI distribution. Subsequently, the Sample-adaptive Cross-modality Information Compensation module is designed to remove individual-level ICI from each sample reliably. This is achieved through interpretable cross-modality information compensation based on the complementary relationship between discriminative features and ICI, and the perception of the relative quality of modalities introduced by the relative discriminative power. Experiments on four datasets demonstrate that MICINet outperforms other state-of-the-art reliable multimodal classification methods under various noise conditions.
zh

[CV-77] SubZero: Composing Subject Style and Action via Zero-Shot Personalization

【速读】:该论文旨在解决现有扩散模型(Diffusion Models)在个性化生成任务中的两个主要问题:一是需要针对每个用户需求进行微调(fine-tuning),二是现有的无微调个性化方法(如IP-Adapters)在生成特定主体(subject)、风格(style)及动作(action)组合时灵活性不足或存在内容泄漏(content leakage)和风格泄漏(style leakage)等伪影问题。论文提出了一种名为SubZero的新框架,其关键在于通过引入一套新的约束条件来增强主体与风格的相似性并减少泄漏现象,同时在去噪模型的交叉注意力模块中设计了一种正交化时间聚合方案,以同时结合文本提示、单一主体图像和风格图像进行有效条件化。此外,还提出了一种新颖的方法训练定制化的主体投影器和风格投影器,进一步减少内容和风格泄漏。这些创新使SubZero能够在无需微调的情况下实现任意主体在任意风格下执行任意动作的生成能力,并且适用于边缘设备运行,显著优于当前最先进的相关工作。

链接: https://arxiv.org/abs/2502.19673
作者: Shubhankar Borse,Kartikeya Bhardwaj,Mohammad Reza Karimi Dastjerdi,Hyojin Park,Shreya Kadambi,Shobitha Shivakumar,Prathamesh Mandke,Ankita Nayak,Harris Teague,Munawar Hayat,Fatih Porikli
机构: Qualcomm AI Research (高通人工智能研究); Qualcomm Technologies, Inc. (高通技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models are increasingly popular for generative tasks, including personalized composition of subjects and styles. While diffusion models can generate user-specified subjects performing text-guided actions in custom styles, they require fine-tuning and are not feasible for personalization on mobile devices. Hence, tuning-free personalization methods such as IP-Adapters have progressively gained traction. However, for the composition of subjects and styles, these works are less flexible due to their reliance on ControlNet, or show content and style leakage artifacts. To tackle these, we present SubZero, a novel framework to generate any subject in any style, performing any action without the need for fine-tuning. We propose a novel set of constraints to enhance subject and style similarity, while reducing leakage. Additionally, we propose an orthogonalized temporal aggregation scheme in the cross-attention blocks of denoising model, effectively conditioning on a text prompt along with single subject and style images. We also propose a novel method to train customized content and style projectors to reduce content and style leakage. Through extensive experiments, we show that our proposed approach, while suitable for running on-edge, shows significant improvements over state-of-the-art works performing subject, style and action composition.
zh

[CV-78] Improving Adversarial Transferability in MLLM s via Dynamic Vision-Language Alignment Attack

【速读】:该论文试图解决多模态大型语言模型(MLLMs)在对抗攻击下的可迁移性较低的问题,特别是在目标攻击场景下。现有方法主要集中在视觉特定的扰动上,但难以应对视觉-语言模态对齐的复杂性。论文的关键解决方案是引入动态视觉-语言对齐(Dynamic Vision-Language Alignment, DynVLA)攻击方法,通过向视觉-语言连接器注入动态扰动,增强其在不同模型多样化视觉-语言对齐上的泛化能力。实验结果表明,DynVLA显著提高了对抗样本在多种MLLMs之间的可迁移性。

链接: https://arxiv.org/abs/2502.19672
作者: Chenhe Gu,Jindong Gu,Andong Hua,Yao Qin
机构: University of California, Irvine (加州大学欧文分校); University of Oxford (牛津大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: arXiv admin note: text overlap with arXiv:2403.09766

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs), built upon LLMs, have recently gained attention for their capabilities in image recognition and understanding. However, while MLLMs are vulnerable to adversarial attacks, the transferability of these attacks across different models remains limited, especially under targeted attack setting. Existing methods primarily focus on vision-specific perturbations but struggle with the complex nature of vision-language modality alignment. In this work, we introduce the Dynamic Vision-Language Alignment (DynVLA) Attack, a novel approach that injects dynamic perturbations into the vision-language connector to enhance generalization across diverse vision-language alignment of different models. Our experimental results show that DynVLA significantly improves the transferability of adversarial examples across various MLLMs, including BLIP2, InstructBLIP, MiniGPT4, LLaVA, and closed-source models such as Gemini.
zh
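DynVLA 向视觉-语言连接器注入动态扰动以提升对抗样本的可迁移性。作为背景参考,经典的 FGSM 式扰动(并非 DynVLA 本身的方法,仅示意"沿梯度符号方向加扰动"这一基本操作)可写成:

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.03):
    """FGSM 式扰动:沿损失梯度符号方向移动 eps,
    再裁剪回合法像素范围 [0, 1]。"""
    x = np.asarray(x, dtype=float)
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)
```

DynVLA 与此的不同之处在于扰动作用于连接器的中间表示并随模型的视觉-语言对齐动态调整,而非静态地作用于输入像素。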

[CV-79] st-Time Modality Generalization for Medical Image Segmentation

【速读】:该论文致力于解决现有医学图像分割方法在处理任意未见过模态(unseen modalities)时泛化能力不足的问题。当前方法通常无法有效应对跨未知成像模态的一致性能保障需求,而这一挑战在现有研究中尚未得到充分重视。为了解决此问题,论文提出了一个名为Test-Time Modality Generalization (TTMG) 的新框架,其关键在于结合两种核心组件:Modality-Aware Style Projection (MASP) 和 Modality-Sensitive Instance Whitening (MSIW)。MASP 通过估计测试样本属于每个已见模态的概率,并将其映射到特定模态风格分布上来指导有效的模态投影;而MSIW 则在训练阶段选择性地抑制模态敏感信息,同时保留模态不变特征,从而缓解高特征协方差对泛化至未知模态的阻碍作用。通过集成这两种技术,TTMG 框架显著提升了医学图像分割任务在未见过模态数据上的鲁棒性与性能表现。

链接: https://arxiv.org/abs/2502.19671
作者: Ju-Hyeon Nam,Sang-Chul Lee
机构: Department of Electrical and Computer Engineering, Inha University (电气与计算机工程系, 仁荷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages and 15 figures. arXiv admin note: text overlap with arXiv:2502.09931

点击查看摘要

Abstract:Generalizable medical image segmentation is essential for ensuring consistent performance across diverse unseen clinical settings. However, existing methods often overlook the capability to generalize effectively across arbitrary unseen modalities. In this paper, we introduce a novel Test-Time Modality Generalization (TTMG) framework, which comprises two core components: Modality-Aware Style Projection (MASP) and Modality-Sensitive Instance Whitening (MSIW), designed to enhance generalization in arbitrary unseen modality datasets. The MASP estimates the likelihood of a test instance belonging to each seen modality and maps it onto a distribution using modality-specific style bases, guiding its projection effectively. Furthermore, as high feature covariance hinders generalization to unseen modalities, the MSIW is applied during training to selectively suppress modality-sensitive information while retaining modality-invariant features. By integrating MASP and MSIW, the TTMG framework demonstrates robust generalization capabilities for medical image segmentation in unseen modalities, a challenge that current methods have largely neglected. We evaluated TTMG alongside other domain generalization techniques across eleven datasets spanning four modalities (colonoscopy, ultrasound, dermoscopy, and radiology), consistently achieving superior segmentation performance across various modality combinations.
zh
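MSIW 的出发点是:特征间过高的协方差会妨碍向未知模态的泛化,因此在训练时惩罚(选定的)特征协方差项。下面用 NumPy 给出一个实例白化损失的示意实现(mask 用于"选择性"抑制;函数名与细节为笔者假设,非论文实现):

```python
import numpy as np

def instance_whitening_loss(feats, mask=None):
    """实例白化损失示意:先对特征按通道标准化,再惩罚
    协方差矩阵中(被 mask 选中的)非对角元素,
    以抑制模态敏感的特征相关性。
    feats: (N, C);mask: (C, C) 0/1 矩阵,None 表示惩罚全部非对角项。"""
    f = np.asarray(feats, dtype=float)
    f = (f - f.mean(axis=0)) / (f.std(axis=0) + 1e-8)
    cov = (f.T @ f) / (f.shape[0] - 1)
    off = cov - np.diag(np.diag(cov))  # 去掉对角线,只留相关项
    if mask is not None:
        off = off * mask
    return float((off ** 2).mean())
```

完全相关的两个通道会产生接近 1 的非对角协方差项,损失显著;不相关的通道则几乎不产生惩罚,这正是"抑制模态敏感相关、保留模态不变特征"的数值直觉。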

[CV-80] Noise-Injected Spiking Graph Convolution for Energy-Efficient 3D Point Cloud Denoising AAAI2025

【速读】:本文旨在探索脉冲神经网络(SNNs)在三维点云去噪任务中的回归潜力,这是传统人工神经网络(ANNs)尚未充分研究的领域。论文的关键解决方案在于提出了一种注入噪声的脉冲图卷积网络(noise-injected spiking graph convolutional networks),通过模拟注入噪声的神经元动力学构建注入噪声的脉冲神经元,并设计相应的注入噪声的脉冲图卷积操作,以促进三维点云上的扰动感知脉冲表示学习。基于此,作者构建了两种基于SNN的去噪网络:一种是纯脉冲图卷积网络,在保持较低精度损失的同时显著降低能耗;另一种是结合ANN学习的混合架构,在少数时间步内实现了性能与效率的良好权衡。这一工作展示了SNN在三维点云去噪任务中的潜力,并为在类脑芯片上的部署提供了新思路,同时推动了低功耗三维数据采集设备的发展。

链接: https://arxiv.org/abs/2502.19660
作者: Zikuan Li,Qiaoyun Wu,Jialin Zhang,Kaijun Zhang,Jun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Spiking neural networks (SNNs), inspired by the spiking computation paradigm of biological neural systems, have exhibited superior energy efficiency in 2D classification tasks over traditional artificial neural networks (ANNs). However, the regression potential of SNNs has not been well explored, especially in 3D point cloud denoising. In this paper, we propose noise-injected spiking graph convolutional networks to leverage the full regression potential of SNNs in 3D point cloud denoising. Specifically, we first emulate noise-injected neuronal dynamics to build noise-injected spiking neurons. On this basis, we design a noise-injected spiking graph convolution for promoting disturbance-aware spiking representation learning on 3D points. Starting from the spiking graph convolution, we build two SNN-based denoising networks. One is a purely spiking graph convolutional network, which achieves low accuracy loss compared with some ANN-based alternatives, while resulting in significantly reduced energy consumption on two benchmark datasets, PU-Net and PC-Net. The other is a hybrid architecture that combines ANN-based learning with a high performance-efficiency trade-off in just a few time steps. Our work lights up SNN’s potential for 3D point cloud denoising, injecting new perspectives of exploring the deployment on neuromorphic chips while paving the way for developing energy-efficient 3D data acquisition devices.
zh
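"注入噪声的脉冲神经元"可以理解为在标准 LIF(leaky integrate-and-fire)动力学的膜电位更新中叠加随机噪声。下面是一个离散时间的最小示意(tau、阈值等参数为演示假设,非论文的具体神经元模型):

```python
import numpy as np

def noisy_lif(inputs, tau=0.9, v_th=1.0, noise_std=0.05, seed=0):
    """注入噪声的 LIF 神经元示意:膜电位按 tau 泄漏、
    积分输入电流并叠加高斯噪声;
    越过阈值 v_th 时发放脉冲并硬复位到 0。返回 0/1 脉冲序列。"""
    rng = np.random.default_rng(seed)
    v, spikes = 0.0, []
    for i in inputs:
        v = tau * v + i + rng.normal(0.0, noise_std)
        if v >= v_th:
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes
```

噪声项使神经元对输入扰动的响应更平滑,这与论文中"扰动感知的脉冲表示学习"的动机一致;将 noise_std 设为 0 即退化为确定性 LIF。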

[CV-81] Adaptive Score Alignment Learning for Continual Perceptual Quality Assessment of 360-Degree Videos in Virtual Reality

【速读】:本文旨在解决虚拟现实视频质量评估(VR-VQA)在保证用户体验无失真方面的挑战,特别是在传统方法因训练数据集的静态特性及多样性限制而难以平衡相关性和精确性的情况下。尤其当这些方法需要泛化到多样化的虚拟现实内容并适应动态变化的视频分布时,问题变得更加突出。为了解决这些问题,论文提出了自适应评分对齐学习(Adaptive Score Alignment Learning, ASAL)这一新方法。ASAL 的关键在于通过结合相关性损失与误差损失来提升与人类主观评分的一致性以及预测感知质量的准确性,并且通过特征空间平滑过程实现对不断变化分布的自然适应,从而增强对未见内容的泛化能力。此外,为了进一步提高在动态虚拟现实环境中的持续适应能力,论文扩展了 ASAL,引入了自适应记忆重放作为新的连续学习框架,解决了计算和存储资源受限下的非平稳变化的独特挑战。

链接: https://arxiv.org/abs/2502.19644
作者: Kanglei Zhou,Zikai Hao,Liyuan Wang,Xiaohui Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a TVCG paper at VR 2025

点击查看摘要

Abstract:Virtual Reality Video Quality Assessment (VR-VQA) aims to evaluate the perceptual quality of 360-degree videos, which is crucial for ensuring a distortion-free user experience. Traditional VR-VQA methods trained on static datasets with limited distortion diversity struggle to balance correlation and precision. This becomes particularly critical when generalizing to diverse VR content and continually adapting to dynamic and evolving video distribution variations. To address these challenges, we propose a novel approach for assessing the perceptual quality of VR videos, Adaptive Score Alignment Learning (ASAL). ASAL integrates correlation loss with error loss to enhance alignment with human subjective ratings and precision in predicting perceptual quality. In particular, ASAL can naturally adapt to continually changing distributions through a feature space smoothing process that enhances generalization to unseen content. To further improve continual adaptation to dynamic VR environments, we extend ASAL with adaptive memory replay as a novel Continual Learning (CL) framework. Unlike traditional CL models, ASAL utilizes key frame extraction and feature adaptation to address the unique challenges of non-stationary variations with both the computation and storage restrictions of VR devices. We establish a comprehensive benchmark for VR-VQA and its CL counterpart, introducing new data splits and evaluation metrics. Our experiments demonstrate that ASAL outperforms recent strong baseline models, achieving overall correlation gains of up to 4.78% in the static joint training setting and 12.19% in the dynamic CL setting on various datasets. This validates the effectiveness of ASAL in addressing the inherent challenges of VR-VQA. The code is available at this https URL.
zh
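ASAL 将相关性损失与误差损失结合:前者(如 1 − Pearson 相关)对齐主观评分的相对排序,后者(如 MSE)保证绝对预测精度。下面是这种组合损失的一个常见写法的示意(加权方式与论文的具体形式可能不同):

```python
import numpy as np

def score_alignment_loss(pred, target, alpha=0.5, eps=1e-8):
    """相关性损失与误差损失组合的示意:
    loss = alpha * (1 - PLCC(pred, target)) + (1 - alpha) * MSE。
    PLCC 为 Pearson 线性相关系数,对齐排序;MSE 约束绝对误差。"""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    p, t = pred - pred.mean(), target - target.mean()
    plcc = (p * t).sum() / (np.sqrt((p ** 2).sum() * (t ** 2).sum()) + eps)
    mse = ((pred - target) ** 2).mean()
    return alpha * (1.0 - plcc) + (1.0 - alpha) * mse
```

单独优化 MSE 可能得到高相关但量级偏移的预测,单独优化相关性则不约束绝对数值;组合两项正对应摘要中"平衡相关性与精确性"的目标。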

[CV-82] Sensor-Invariant Tactile Representation ICLR’25

【速读】:该论文试图解决因传感器设计和制造差异导致的信号显著变化问题,这限制了模型或知识在不同传感器之间的迁移能力。解决方案的关键在于提出了一种提取Sensor-Invariant Tactile Representations (SITR) 的新方法,通过基于变压器的架构,在多样化模拟传感器数据集上训练,使其能够以最少的校准泛化到真实世界的新传感器,从而实现光学触觉传感器间的零样本迁移。

链接: https://arxiv.org/abs/2502.19638
作者: Harsh Gupta,Yuchen Mo,Shengmiao Jin,Wenzhen Yuan
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICLR’25

点击查看摘要

Abstract:High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This limitation hinders the ability to transfer models or knowledge learned from one sensor to another. To address this, we introduce a novel method for extracting Sensor-Invariant Tactile Representations (SITR), enabling zero-shot transfer across optical tactile sensors. Our approach utilizes a transformer-based architecture trained on a diverse dataset of simulated sensor designs, allowing it to generalize to new sensors in the real world with minimal calibration. Experimental results demonstrate the method’s effectiveness across various tactile sensing applications, facilitating data and model transferability for future advancements in the field.
zh

[CV-83] MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

【速读】:该论文旨在解决现有医疗视觉语言模型(Medical Visual Language Models, VLMs)在医学图像分析中缺乏透明性和可解释性的问题,即大多数现有模型仅输出最终答案而未揭示背后的推理过程。这种局限性影响了临床医生的信任以及监管机构的批准。为了解决这一问题,论文提出MedVLM-R1,这是一种能够显式生成自然语言推理的医疗VLM,以增强模型的透明度和可信度。其关键解决方案在于采用了一种无需依赖监督微调(Supervised Fine-Tuning, SFT)的强化学习框架,该框架通过激励机制促使模型自主发现可被人类理解的推理路径,而不依赖任何预定义的推理参考。这种方法即使在有限的数据集(600个视觉问答样本)和较小的参数规模(20亿参数)下,也显著提升了MRI、CT和X射线基准测试上的准确性,从55.11%提高到78.22%,并表现出强大的领域泛化能力。

链接: https://arxiv.org/abs/2502.19634
作者: Jiazhen Pan,Che Liu,Junde Wu,Fenglin Liu,Jiayuan Zhu,Hongwei Bran Li,Chen Chen,Cheng Ouyang,Daniel Rueckert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.
zh

[CV-84] Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras CVPR2025

【速读】:该论文旨在解决在自动驾驶系统中3D物体检测因固定帧率传感器(如LiDAR和相机)的延迟和带宽限制而导致的精度、速度和低延迟难以兼顾的问题。论文的关键创新在于首次将异步事件相机引入3D物体检测任务,利用其高时间分辨率和低带宽特性实现高速3D物体检测。通过从事件相机中检索先前的3D信息,即使在同步数据不可用的帧间间隔内,该方法仍能够进行检测。此外,论文还构建了首个基于事件的3D物体检测数据集DSEC-3DOD,提供100 FPS的真值3D边界框,为基于事件的3D检测器建立了首个基准。

链接: https://arxiv.org/abs/2502.19630
作者: Hoonhee Cho,Jae-young Kang,Youngho Kim,Kuk-Jin Yoon
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2025

点击查看摘要

Abstract:Detecting 3D objects in point clouds plays a crucial role in autonomous driving systems. Recently, advanced multi-modal methods incorporating camera information have achieved notable performance. For a safe and effective autonomous driving system, algorithms that excel not only in accuracy but also in speed and low latency are essential. However, existing algorithms fail to meet these requirements due to the latency and bandwidth limitations of fixed frame rate sensors, e.g., LiDAR and camera. To address this limitation, we introduce asynchronous event cameras into 3D object detection for the first time. We leverage their high temporal resolution and low bandwidth to enable high-speed 3D object detection. Our method enables detection even during inter-frame intervals when synchronized data is unavailable, by retrieving previous 3D information through the event camera. Furthermore, we introduce the first event-based 3D object detection dataset, DSEC-3DOD, which includes ground-truth 3D bounding boxes at 100 FPS, establishing the first benchmark for event-based 3D detectors. The code and dataset are available at this https URL.
zh

[CV-85] 3D Nephrographic Image Synthesis in CT Urography with the Diffusion Model and Swin Transformer

【速读】:本文旨在解决在CT尿路造影(CTU)检查中通过扩散模型结合基于Swin Transformer的深度学习方法合成高质量三维肾髓质期图像的问题。研究的关键在于开发并实现了一种名为dsSNICT(用于CT的合成肾髓质期图像的扩散模型与Swin Transformer结合的深度学习模型)的自定义深度学习模型,并通过峰值信噪比(PSNR)、结构相似性指数(SSIM)、平均绝对误差(MAE)以及Fréchet视频距离(FVD)等指标评估其性能,同时由两位经过专科培训的腹部放射科医生进行定性评价。该方法能够以不影响图像质量的前提下将CTU的辐射剂量减少33.3%,从而提高CT尿路造影的安全性和诊断价值。

链接: https://arxiv.org/abs/2502.19623
作者: Hongkun Yu,Syed Jamal Safdar Gardezi,E. Jason Abel,Daniel Shapiro,Meghan G. Lubner,Joshua Warner,Matthew Smith,Giuseppe Toia,Lu Mao,Pallavi Tiwari,Andrew L. Wentland
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Purpose: This study aims to develop and validate a method for synthesizing 3D nephrographic phase images in CT urography (CTU) examinations using a diffusion model integrated with a Swin Transformer-based deep learning approach. Materials and Methods: This retrospective study was approved by the local Institutional Review Board. A dataset comprising 327 patients who underwent three-phase CTU (mean \pm SD age, 63 \pm 15 years; 174 males, 153 females) was curated for deep learning model development. The three phases for each patient were aligned with an affine registration algorithm. A custom deep learning model coined dsSNICT (diffusion model with a Swin transformer for synthetic nephrographic phase images in CT) was developed and implemented to synthesize the nephrographic images. Performance was assessed using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Mean Absolute Error (MAE), and Fréchet Video Distance (FVD). Qualitative evaluation by two fellowship-trained abdominal radiologists was performed. Results: The synthetic nephrographic images generated by our proposed approach achieved high PSNR (26.3 \pm 4.4 dB), SSIM (0.84 \pm 0.069), MAE (12.74 \pm 5.22 HU), and FVD (1323). Two radiologists provided average scores of 3.5 for real images and 3.4 for synthetic images (P-value = 0.5) on a Likert scale of 1-5, indicating that our synthetic images closely resemble real images. Conclusion: The proposed approach effectively synthesizes high-quality 3D nephrographic phase images. This model can be used to reduce radiation dose in CTU by 33.3% without compromising image quality, which thereby enhances the safety and diagnostic utility of CT urography.
zh
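
上述摘要以 PSNR、SSIM、MAE 等指标评估合成的肾造影期图像质量。下面给出 PSNR 与 MAE 的纯 Python 最小示意实现(仅作说明:论文的评估对象是三维 CT 体数据,此处以一维像素列表代替;`data_range` 等参数取值为假设):

```python
import math

def mae(a, b):
    """两幅同尺寸图像(展平为列表)之间的平均绝对误差。"""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def psnr(a, b, data_range=255.0):
    """给定动态范围下的峰值信噪比(单位 dB)。"""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(data_range ** 2 / mse)

real = [10.0, 20.0, 30.0, 40.0]
synthetic = [12.0, 18.0, 31.0, 39.0]
print(round(mae(real, synthetic), 2))   # 1.5
print(round(psnr(real, synthetic), 2))  # 44.15
```

实际医学影像评估中,MAE 常以 HU(Hounsfield 单位)报告,动态范围也应按 CT 值范围设置。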

[CV-86] Evaluating the Suitability of Different Intraoral Scan Resolutions for Deep Learning-Based Tooth Segmentation

【速读】:该论文旨在解决口腔扫描数据在深度学习模型训练过程中因下采样导致的分割精度下降问题,并评估分辨率降低对性能退化的影响程度。论文的关键在于通过系统性地训练 PointMLP 模型于不同分辨率(16K、10K、8K、6K、4K 和 2K 网格单元)的口腔扫描数据上,分析分割精度与计算效率之间的权衡关系,以确定一个既能保证足够分割精度又能提升计算效率的理想分辨率。这一研究对于将机器学习模型部署到边缘设备具有重要意义。

链接: https://arxiv.org/abs/2502.19515
作者: Daron Weekley,Jace Duckworth,Anastasiia Sukhanova,Ananya Jana
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to 2025 ASEE North Central Section Annual Conference

点击查看摘要

Abstract:Intraoral scans are widely used in digital dentistry for tasks such as dental restoration, treatment planning, and orthodontic procedures. These scans contain detailed topological information, but manual annotation of these scans remains a time-consuming task. Deep learning-based methods have been developed to automate tasks such as tooth segmentation. A typical intraoral scan contains over 200,000 mesh cells, making direct processing computationally expensive. Models are often trained on downsampled versions, typically with 10,000 or 16,000 cells. Previous studies suggest that downsampling may degrade segmentation accuracy, but the extent of this degradation remains unclear. Understanding the extent of degradation is crucial for deploying ML models on edge devices. This study evaluates the extent of performance degradation with decreasing resolution. We train a deep learning model (PointMLP) on intraoral scans decimated to 16K, 10K, 8K, 6K, 4K, and 2K mesh cells. Models trained at lower resolutions are tested on high-resolution scans to assess performance. Our goal is to identify a resolution that balances computational efficiency and segmentation accuracy.
zh
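
摘要中将口腔扫描从约 20 万网格单元下采样到 16K、10K 等分辨率。下面用均匀步长抽样给出"下采样到目标单元数"这一步骤的示意(仅作说明:真实流程通常使用几何感知的网格简化算法,如 quadric error metrics 抽取,而非均匀抽样):

```python
def decimate(cells, target):
    """将网格单元列表均匀下采样到 target 个单元。
    真实管线会使用几何感知的简化算法,此处均匀步长仅作演示。"""
    if target >= len(cells):
        return list(cells)
    stride = len(cells) / target
    return [cells[int(i * stride)] for i in range(target)]

scan = list(range(200_000))                # 约 20 万网格单元的占位数据
for target in (16_000, 10_000, 8_000, 2_000):  # 论文考察的部分分辨率
    print(target, len(decimate(scan, target)))
```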

[CV-87] CLIP-Optimized Multimodal Image Enhancement via ISP-CNN Fusion for Coal Mine IoVT under Uneven Illumination

【速读】:该论文旨在解决煤矿视频物联网(Coal Mine Internet of Video Things, IoVT)系统中因地下环境低照度和亮度不均导致的图像质量下降问题,特别是现有增强方法依赖难以获取的配对参考图像所带来的局限性。同时,针对边缘设备上增强性能与计算效率之间的权衡问题提出了解决方案。论文的关键在于提出了一种针对煤矿IoVT优化的多模态图像增强方法,采用融合图像信号处理(ISP)与卷积神经网络(CNN)的ISP-CNN架构,并结合全局增强与细节优化的两阶段策略。此外,通过基于CLIP的多模态迭代优化实现无监督训练,有效降低了计算复杂度,提升了在低照度区域的图像质量,同时平衡了性能与实时部署的需求。

链接: https://arxiv.org/abs/2502.19450
作者: Shuai Wang,Shihao Zhang,Jiaqi Wu,Zijian Tian,Wei Chen,Tongzhu Jin,Miaomiao Xue,Zehua Wang,Fei Richard Yu,Victor C. M. Leung
机构: School of Artificial Intelligence, China University of Mining and Technology (Beijing)(中国矿业大学(北京)人工智能学院); Ministry of Emergency Management Big Data Center, Beijing (应急管理部大数据中心,北京); Department of Electrical and Computer Engineering, University of British Columbia (英属哥伦比亚大学电气与计算机工程系); Department of Electrical and Computer Engineering, The University of British Columbia (英属哥伦比亚大学电气与计算机工程系); Artificial Intelligence Research Institute, Shenzhen MSU-BIT University (深圳北理莫斯科大学人工智能研究院); College of Computer Science and Software Engineering, Shenzhen University (深圳大学计算机科学与软件工程学院); Department of Systems and Computer Engineering, Carleton University (卡内基大学系统与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clear monitoring images are crucial for the safe operation of coal mine Internet of Video Things (IoVT) systems. However, low illumination and uneven brightness in underground environments significantly degrade image quality, posing challenges for enhancement methods that often rely on difficult-to-obtain paired reference images. Additionally, there is a trade-off between enhancement performance and computational efficiency on edge devices within IoVT systems. To address these issues, we propose a multimodal image enhancement method tailored for coal mine IoVT, utilizing an ISP-CNN fusion architecture optimized for uneven illumination. This two-stage strategy combines global enhancement with detail optimization, effectively improving image quality, especially in poorly lit areas. A CLIP-based multimodal iterative optimization allows for unsupervised training of the enhancement algorithm. By integrating traditional image signal processing (ISP) with convolutional neural networks (CNN), our approach reduces computational complexity while maintaining high performance, making it suitable for real-time deployment on edge devices. Experimental results demonstrate that our method effectively mitigates uneven brightness and enhances key image quality metrics, with PSNR improvements of 2.9%-4.9%, SSIM by 4.3%-11.4%, and VIF by 4.9%-17.8% compared to seven state-of-the-art algorithms. Simulated coal mine monitoring scenarios validate our method’s ability to balance performance and computational demands, facilitating real-time enhancement and supporting safer mining operations.
zh

[CV-88] AniGaussian: Animatable Gaussian Avatar with Pose-guided Deformation

【速读】:该论文旨在解决基于高斯的人体重建方法在利用SMPL模型先验知识和提升视觉保真度方面的不足,以实现更精细的可动画化 avatar 重建。论文提出的解决方案关键在于两个创新点:首先,引入了一种基于姿态引导的变形策略,通过SMPL姿态引导有效约束动态高斯 avatar,确保重建模型不仅捕捉详细的表面特征,还能在广泛运动范围内保持解剖学正确性;其次,结合基于刚体的先验知识增强高斯模型的动态变换能力,并提出一种分块-尺度策略显著提升几何质量。这些创新设计通过消融实验验证其有效性,并在定性和定量指标上优于现有方法。

链接: https://arxiv.org/abs/2502.19441
作者: Mengtian Li,Shengxiang Yao,Chen Kai,Zhifeng Xie,Keyu Chen,Yu-Gang Jiang
机构: Shanghai University (上海大学); Fudan University (复旦大学); Tavus Inc.
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 13pages, 14 figures. arXiv admin note: text overlap with arXiv:2401.09720

点击查看摘要

Abstract:Recent advancements in Gaussian-based human body reconstruction have achieved notable success in creating animatable avatars. However, there are ongoing challenges to fully exploit the SMPL model’s prior knowledge and enhance the visual fidelity of these models to achieve more refined avatar reconstructions. In this paper, we introduce AniGaussian which addresses the above issues with two insights. First, we propose an innovative pose guided deformation strategy that effectively constrains the dynamic Gaussian avatar with SMPL pose guidance, ensuring that the reconstructed model not only captures the detailed surface nuances but also maintains anatomical correctness across a wide range of motions. Second, we tackle the expressiveness limitations of Gaussian models in representing dynamic human bodies. We incorporate rigid-based priors from previous works to enhance the dynamic transform capabilities of the Gaussian model. Furthermore, we introduce a split-with-scale strategy that significantly improves geometry quality. The ablation study demonstrates the effectiveness of our model design. Through extensive comparisons with existing methods, AniGaussian demonstrates superior performance in both qualitative results and quantitative metrics.
zh

[CV-89] T1-PILOT: Optimized Trajectories for T1 Mapping Acceleration

【速读】:该论文致力于解决心脏T1映射中因心脏动态特性导致的高分辨率成像时间过长的问题。传统方法受限于静态手工设计的欠采样掩模,未能充分利用压缩感知(Compressed Sensing, CS)技术的加速潜力与重建精度。论文的关键创新在于提出T1-PILOT,这是一种端到端的方法,通过将T1信号松弛模型融入采样-重建框架中,指导非笛卡尔轨迹的学习、跨帧对齐以及T1衰减估计。这种方法显著提升了加速因子下的T1图准确性,并在PSNR和VIF等指标上优于现有基线策略,同时更清晰地描绘出心肌的精细结构。关键在于结合物理松弛模型优化采样轨迹,从而实现定量精度的提升与采集时间的减少。

链接: https://arxiv.org/abs/2502.20333
作者: Tamir Shor,Moti Freiman,Chaim Baskin,Alex Bronstein
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cardiac T1 mapping provides critical quantitative insights into myocardial tissue composition, enabling the assessment of pathologies such as fibrosis, inflammation, and edema. However, the inherently dynamic nature of the heart imposes strict limits on acquisition times, making high-resolution T1 mapping a persistent challenge. Compressed sensing (CS) approaches have reduced scan durations by undersampling k-space and reconstructing images from partial data, and recent studies show that jointly optimizing the undersampling patterns with the reconstruction network can substantially improve performance. Still, most current T1 mapping pipelines rely on static, hand-crafted masks that do not exploit the full acceleration and accuracy potential. In this work, we introduce T1-PILOT: an end-to-end method that explicitly incorporates the T1 signal relaxation model into the sampling-reconstruction framework to guide the learning of non-Cartesian trajectories, cross-frame alignment, and T1 decay estimation. Through extensive experiments on the CMRxRecon dataset, T1-PILOT significantly outperforms several baseline strategies (including learned single-mask and fixed radial or golden-angle sampling schemes), achieving higher T1 map fidelity at greater acceleration factors. In particular, we observe consistent gains in PSNR and VIF relative to existing methods, along with marked improvements in delineating finer myocardial structures. Our results highlight that optimizing sampling trajectories in tandem with the physical relaxation model leads to both enhanced quantitative accuracy and reduced acquisition times. Code for reproducing all results will be made publicly available upon publication.
zh
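
T1-PILOT 将 T1 信号松弛模型嵌入采样-重建框架。心脏 T1 mapping 常用的三参数反转恢复模型为 S(t) = A − B·exp(−t/T1*)。下面给出该模型及其拟合残差的示意实现(参数名与数值均为演示用假设,并非论文的具体实现;网络部分从略):

```python
import math

def t1_signal(t, a, b, t1_star):
    """三参数反转恢复模型 S(t) = A - B * exp(-t / T1*),
    即 MOLLI 类 T1 mapping 中常用的信号松弛模型。"""
    return a - b * math.exp(-t / t1_star)

def fit_error(times, samples, a, b, t1_star):
    """残差平方和:以物理模型为引导的重建会使该量趋于零。"""
    return sum((s - t1_signal(t, a, b, t1_star)) ** 2
               for t, s in zip(times, samples))

times = [100.0, 500.0, 1000.0, 2000.0]               # 反转时间,单位 ms
truth = [t1_signal(t, 1.0, 2.0, 600.0) for t in times]
print(fit_error(times, truth, 1.0, 2.0, 600.0))       # 参数正确时残差为 0
print(fit_error(times, truth, 1.0, 2.0, 800.0) > 0)   # 参数偏离时残差为正
```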

[CV-90] RURANET: An Unsupervised Learning Method for Diabetic Macular Edema Based on SCSE Attention Mechanisms and Dynamic Multi-Projection Head Clustering MICCAI2025

【速读】:该论文旨在解决糖尿病性黄斑水肿(Diabetic Macular Edema, DME)诊断中传统方法依赖大量标注数据和主观眼科医生评估的问题,限制了其实用性。为了解决这一挑战,论文提出了一种基于无监督学习的自动化DME诊断系统RURANET++。其关键在于引入了一种新颖的聚类算法,该算法利用多投影头显式控制聚类多样性,并动态调整相似性阈值,从而优化类内一致性与类间区分度。此外,该框架结合优化的U-Net架构及嵌入的空间与通道挤压激励(Spatial and Channel Squeeze Excitation, SCSE)注意力机制来增强病变特征提取,并通过预训练的GoogLeNet模型提取深层特征后进行PCA降维以提高计算效率。

链接: https://arxiv.org/abs/2502.20224
作者: Wei Yang,Yiran Zhu,Jiayu Shen,Yuhan Tang,Chengchang Pan,Hui He,Yan Su,Honggang Qi
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures, 5 tables, submitted to The 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2025)

点击查看摘要

Abstract:Diabetic Macular Edema (DME), a prevalent complication among diabetic patients, constitutes a major cause of visual impairment and blindness. Although deep learning has achieved remarkable progress in medical image analysis, traditional DME diagnosis still relies on extensive annotated data and subjective ophthalmologist assessments, limiting practical applications. To address this, we present RURANET++, an unsupervised learning-based automated DME diagnostic system. This framework incorporates an optimized U-Net architecture with embedded Spatial and Channel Squeeze Excitation (SCSE) attention mechanisms to enhance lesion feature extraction. During feature processing, a pre-trained GoogLeNet model extracts deep features from retinal images, followed by PCA-based dimensionality reduction to 50 dimensions for computational efficiency. Notably, we introduce a novel clustering algorithm employing multi-projection heads to explicitly control cluster diversity while dynamically adjusting similarity thresholds, thereby optimizing intra-class consistency and inter-class discrimination. Experimental results demonstrate superior performance across multiple metrics, achieving maximum accuracy (0.8411), precision (0.8593), recall (0.8411), and F1-score (0.8390), with exceptional clustering quality. This work provides an efficient unsupervised solution for DME diagnosis with significant clinical implications.
zh
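
RURANET++ 的聚类算法动态调整相似性阈值以控制簇内一致性与簇间区分度。下面用"余弦相似度 + 阈值"的贪心分配来示意阈值对聚类粒度的影响(仅为概念演示:论文使用多投影头端到端学习该过程,而非此处的贪心一趟扫描):

```python
import math

def cosine(a, b):
    """两个特征向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def threshold_cluster(features, threshold):
    """贪心聚类:若与某簇锚点相似度超过阈值则并入该簇,否则新建簇。
    阈值越高,簇越多、簇内越一致——与论文中动态阈值的作用方向一致。"""
    clusters = []   # 每个簇以首个成员作为锚点
    labels = []
    for f in features:
        for idx, anchor in enumerate(clusters):
            if cosine(f, anchor) >= threshold:
                labels.append(idx)
                break
        else:
            clusters.append(f)
            labels.append(len(clusters) - 1)
    return labels

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(threshold_cluster(feats, 0.9))   # 前两个特征同簇,第三个单独成簇
```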

[CV-91] Balanced Rate-Distortion Optimization in Learned Image Compression CVPR2025

【速读】:该论文旨在解决基于深度学习的 Learned Image Compression (LIC) 模型在标准率失真 (Rate-Distortion, R-D) 优化过程中因率与失真目标梯度多样性导致的更新不平衡问题。这种不平衡可能导致优化结果次优,使某一目标主导,从而降低整体压缩效率。为应对这一挑战,论文将 R-D 优化重新表述为一个多目标优化 (Multi-Objective Optimization, MOO) 问题,并提出了两种平衡的 R-D 优化策略,以自适应调整梯度更新,实现率和失真方面的更公平改进。关键解决方案在于提出的两种策略:第一种采用由粗到细的梯度下降方法,沿标准 R-D 优化轨迹进行,特别适用于从头训练 LIC 模型;第二种则通过解析方式将优化问题转化为具有等式约束的二次规划问题,适合于现有模型的微调。实验结果表明,这两种方法均提升了 LIC 模型的 R-D 性能,在可接受的额外训练成本下实现了约 2% 的 BD-Rate 减少,从而实现了更平衡且高效的优化过程。

链接: https://arxiv.org/abs/2502.20161
作者: Yichi Zhang,Zhihao Duan,Yuning Huang,Fengqing Zhu
机构: Elmore Family School of Electrical and Computer Engineering, Purdue University (普渡大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Preliminary version. Camera ready version and source code will be uploaded later. Accepted to CVPR 2025

点击查看摘要

Abstract:Learned image compression (LIC) using deep learning architectures has seen significant advancements, yet standard rate-distortion (R-D) optimization often encounters imbalanced updates due to diverse gradients of the rate and distortion objectives. This imbalance can lead to suboptimal optimization, where one objective dominates, thereby reducing overall compression efficiency. To address this challenge, we reformulate R-D optimization as a multi-objective optimization (MOO) problem and introduce two balanced R-D optimization strategies that adaptively adjust gradient updates to achieve more equitable improvements in both rate and distortion. The first proposed strategy utilizes a coarse-to-fine gradient descent approach along standard R-D optimization trajectories, making it particularly suitable for training LIC models from scratch. The second proposed strategy analytically addresses the reformulated optimization as a quadratic programming problem with an equality constraint, which is ideal for fine-tuning existing models. Experimental results demonstrate that both proposed methods enhance the R-D performance of LIC models, achieving around a 2% BD-Rate reduction with acceptable additional training cost, leading to a more balanced and efficient optimization process. The code will be made publicly available.
zh
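
摘要将率失真优化重述为多目标优化并自适应平衡两个目标的梯度更新。下面给出两目标情形下经典的最小范数梯度组合(MGDA 的二任务闭式解)作为一种可行的平衡方式的示意——这只是对"平衡梯度更新"思想的演示,并不一定是论文求解的那个二次规划:

```python
def min_norm_2task(g1, g2):
    """两任务 MGDA 闭式解:在线段 alpha*g1 + (1-alpha)*g2 上
    取范数最小的组合方向,alpha = ((g2-g1)·g2) / ||g1-g2||^2,截断到 [0,1]。"""
    diff = [b - a for a, b in zip(g1, g2)]          # g2 - g1
    denom = sum(d * d for d in diff)
    if denom == 0:
        alpha = 0.5                                  # 两梯度相同时任取
    else:
        alpha = sum(d * y for d, y in zip(diff, g2)) / denom
        alpha = max(0.0, min(1.0, alpha))
    combined = [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]
    return combined, alpha

# 率梯度很大、失真梯度很小的情形:组合方向不再被单一目标主导
combined, alpha = min_norm_2task([10.0, 0.0], [0.0, 1.0])
print(alpha, combined)
```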

[CV-92] Generative augmentations for improved cardiac ultrasound segmentation using diffusion models

【速读】:该论文试图解决当前心脏超声分割研究中缺乏大规模多样化标注数据集以及不同数据集之间标注规范差异的问题,这使得设计出能够良好泛化到外部数据集的鲁棒分割模型变得困难。论文的关键解决方案在于利用扩散模型生成增强数据(generative augmentations),通过显著提升数据集的多样性来改善分割模型的泛化能力,而无需增加更多的标注数据。这些生成的数据增强与常规数据增强结合使用,在内部数据集训练并在外部数据集测试时,将Hausdorff距离提高了超过20毫米,并且在分布外情况下自动射血分数估计的同意限值改善了高达射血分数绝对值的20%。这些改进完全来自于使用生成增强增加了训练数据的变化性,而没有修改底层机器学习模型。

链接: https://arxiv.org/abs/2502.20100
作者: Gilles Van De Vyver,Aksel Try Lenz,Erik Smistad,Sindre Hellum Olaisen,Bjørnar Grenne,Espen Holte,Håvard Dalen,Lasse Løvstakken
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:One of the main challenges in current research on segmentation in cardiac ultrasound is the lack of large and varied labeled datasets and the differences in annotation conventions between datasets. This makes it difficult to design robust segmentation models that generalize well to external datasets. This work utilizes diffusion models to create generative augmentations that can significantly improve diversity of the dataset and thus the generalisability of segmentation models without the need for more annotated data. The augmentations are applied in addition to regular augmentations. A visual test survey showed that experts cannot clearly distinguish between real and fully generated images. Using the proposed generative augmentations, segmentation robustness was increased when training on an internal dataset and testing on an external dataset with an improvement of over 20 millimeters in Hausdorff distance. Additionally, the limits of agreement for automatic ejection fraction estimation improved by up to 20% of absolute ejection fraction value on out-of-distribution cases. These improvements come exclusively from the increased variation of the training data using the generative augmentations, without modifying the underlying machine learning model. The augmentation tool is available as an open source Python library at this https URL.
zh

[CV-93] A Residual Multi-task Network for Joint Classification and Regression in Medical Imaging

【速读】:该论文旨在解决肺结节检测与分类在医学图像分析中的挑战,特别是由于结节形状、大小多样以及高隐蔽性导致的困难。尽管传统深度学习方法在图像分类任务中取得成功,但其在捕捉肺结节检测中的细微变化方面仍存在不足。为应对这一问题,论文提出了一种残差多任务网络(Res-MTNet)模型,其关键在于结合多任务学习与残差学习,并通过共享特征提取层及引入残差连接来增强特征表示能力。多任务学习使模型能够同时处理多个任务,而残差模块解决了梯度消失的问题,保证了深层网络的稳定训练并促进了任务间的信息共享,从而提升了模型的鲁棒性和准确性。

链接: https://arxiv.org/abs/2502.19692
作者: Junji Lin,Yi Zhang,Yunyue Pan,Yuli Chen,Chengchang Pan,Honggang Qi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detection and classification of pulmonary nodules is a challenge in medical image analysis due to the variety of shapes and sizes of nodules and their high concealment. Despite the success of traditional deep learning methods in image classification, deep networks still struggle to perfectly capture subtle changes in lung nodule detection. Therefore, we propose a residual multi-task network (Res-MTNet) model, which combines multi-task learning and residual learning, and improves feature representation ability by sharing feature extraction layers and introducing residual connections. Multi-task learning enables the model to handle multiple tasks simultaneously, while the residual module mitigates the vanishing-gradient problem, ensuring stable training of deeper networks and facilitating information sharing between tasks. Res-MTNet enhances the robustness and accuracy of the model, providing a more reliable lung nodule analysis tool for clinical medicine and telemedicine.
zh

[CV-94] Dual-branch Graph Feature Learning for NLOS Imaging

【速读】:该论文致力于解决非视距(Non-Line-of-Sight, NLOS)成像领域中两个主要挑战:(1) 由于固有的三维网格数据结构导致的高计算和存储需求,限制了实际应用;(2) 在损失函数中同时重建反射率(albedo)和深度信息需要精细调整超参数,使得纹理与深度信息的同时重建变得困难。为应对这些挑战,论文提出了一种名为\xnet的新方法,其关键在于设计了一个双分支框架,分别集成专注于反射率恢复的分支和提取几何结构的深度聚焦分支。通过这种分离内容传递的方式,显著提升了重建数据的质量。此外,论文创新性地将图神经网络(Graph Neural Network, GNN)作为基础组件,用于将密集的NLOS网格数据转换为稀疏的结构化特征,从而实现高效的重建。实验结果表明,该方法在合成数据和真实数据上均达到了现有方法中的最佳性能。

链接: https://arxiv.org/abs/2502.19683
作者: Xiongfei Su,Tianyi Zhu,Lina Liu,Zheng Chen,Yulun Zhang,Siyuan Li,Juntian Ye,Feihu Xu,Xin Yuan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The domain of non-line-of-sight (NLOS) imaging is advancing rapidly, offering the capability to reveal occluded scenes that are not directly visible. However, contemporary NLOS systems face several significant challenges: (1) The computational and storage requirements are profound due to the inherent three-dimensional grid data structure, which restricts practical application. (2) The simultaneous reconstruction of albedo and depth information requires a delicate balance using hyperparameters in the loss function, rendering the concurrent reconstruction of texture and depth information difficult. This paper introduces the innovative methodology, \xnet, which integrates an albedo-focused reconstruction branch dedicated to albedo information recovery and a depth-focused reconstruction branch that extracts geometrical structure, to overcome these obstacles. The dual-branch framework segregates content delivery to the respective reconstructions, thereby enhancing the quality of the retrieved data. To our knowledge, we are the first to employ the GNN as a fundamental component to transform dense NLOS grid data into sparse structural features for efficient reconstruction. Comprehensive experiments demonstrate that our method attains the highest level of performance among existing methods across synthetic and real data. this https URL.
zh

[CV-95] GONet: A Generalizable Deep Learning Model for Glaucoma Detection

【速读】:该论文旨在解决青光眼性视神经病变(Glaucomatous Optic Neuropathy, GON)传统诊断方法因耗时且依赖眼科医生而导致的检测效率低下及泛化能力不足的问题。此外,现有基于深度学习的青光眼自动检测模型在不同人种、疾病群体以及检查环境中的适用性有限。为应对这些挑战,论文提出的关键解决方案是开发了一种名为GONet的鲁棒深度学习模型。GONet通过利用包含超过119,000张带有金标准标注的数字眼底图像(Digital Fundus Images, DFI)的七个独立数据集构建而成,这些数据集涵盖多种地理背景的患者。其核心技术在于采用DINOv2预训练的自监督视觉变换器,并结合多源域策略进行微调。实验结果显示,GONet在目标域中的分布外泛化能力较强,AUC值达到0.85-0.99,性能优于或至少与当前最先进方法相当,并且比杯盘比指标高出多达21.6%。

链接: https://arxiv.org/abs/2502.19514
作者: Or Abramovich,Hadas Pizem,Jonathan Fhima,Eran Berkowitz,Ben Gofrit,Meishar Meisel,Meital Baskin,Jan Van Eijgen,Ingeborg Stalmans,Eytan Z. Blumenthal,Joachim A. Behar
机构: Faculty of Biomedical Engineering, Technion, Israel Institute of Technology (Technion, 以色列理工学院); Department of Applied Mathematics and the Faculty of Biomedical Engineering, Technion, Israel Institute of Technology (Technion, 以色列理工学院); Hillel Yaffe Medical Center (Hillel Yaffe 医疗中心), Hadera, Israel; Research Group Ophthalmology, Department of Neurosciences, KU Leuven (KU Leuven); Department of Ophthalmology, University Hospitals UZ Leuven (UZ Leuven), Herestraat 49, 3000 Leuven, Belgium; Rambam Medical Center: Rambam Health Care Campus (Rambam 医疗中心), Haifa, Israel
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, submitted to IEEE Transactions on Biomedical Engineering

点击查看摘要

Abstract:Glaucomatous optic neuropathy (GON) is a prevalent ocular disease that can lead to irreversible vision loss if not detected early and treated. The traditional diagnostic approach for GON involves a set of ophthalmic examinations, which are time-consuming and require a visit to an ophthalmologist. Recent deep learning models for automating GON detection from digital fundus images (DFI) have shown promise but often suffer from limited generalizability across different ethnicities, disease groups and examination settings. To address these limitations, we introduce GONet, a robust deep learning model developed using seven independent datasets, including over 119,000 DFIs with gold-standard annotations and from patients of diverse geographic backgrounds. GONet consists of a DINOv2 pre-trained self-supervised vision transformers fine-tuned using a multisource domain strategy. GONet demonstrated high out-of-distribution generalizability, with an AUC of 0.85-0.99 in target domains. GONet performance was similar or superior to state-of-the-art works and was significantly superior to the cup-to-disc ratio, by up to 21.6%. GONet is available at [URL provided on publication]. We also contribute a new dataset consisting of 768 DFI with GON labels as open access.
zh
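
GONet 以 AUC(0.85–0.99)报告分布外泛化性能,并与杯盘比这一传统指标对比。下面给出 ROC AUC 的纯 Python 实现——按"随机正样本得分高于随机负样本的概率"(并列计半)计算;示例中的得分与标签均为虚构数据:

```python
def auc(labels, scores):
    """ROC AUC:随机正样本得分高于随机负样本的概率,平局计 0.5。"""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 虚构示例:3 只青光眼眼与 3 只健康眼的类杯盘比得分
labels = [1, 1, 1, 0, 0, 0]
scores = [0.8, 0.7, 0.4, 0.5, 0.3, 0.2]
print(auc(labels, scores))   # 8/9 ≈ 0.889
```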

人工智能

[AI-0] Physics-Driven Data Generation for Contact-Rich Manipulation via Trajectory Optimization

链接: https://arxiv.org/abs/2502.20382
作者: Lujie Yang,H.J. Terry Suh,Tong Zhao,Bernhard Paus Graesdal,Tarik Kelestemur,Jiuguang Wang,Tao Pang,Russ Tedrake
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We present a low-cost data generation pipeline that integrates physics-based simulation, human demonstrations, and model-based planning to efficiently generate large-scale, high-quality datasets for contact-rich robotic manipulation tasks. Starting with a small number of embodiment-flexible human demonstrations collected in a virtual reality simulation environment, the pipeline refines these demonstrations using optimization-based kinematic retargeting and trajectory optimization to adapt them across various robot embodiments and physical parameters. This process yields a diverse, physically consistent dataset that enables cross-embodiment data transfer, and offers the potential to reuse legacy datasets collected under different hardware configurations or physical parameters. We validate the pipeline’s effectiveness by training diffusion policies from the generated datasets for challenging contact-rich manipulation tasks across multiple robot embodiments, including a floating Allegro hand and bimanual robot arms. The trained policies are deployed zero-shot on hardware for bimanual iiwa arms, achieving high success rates with minimal human input. Project website: this https URL.

[AI-1] Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers

链接: https://arxiv.org/abs/2502.20379
作者: Shalev Lifshitz,Sheila A. McIlraith,Yilun Du
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses verifiers to evaluate candidate outputs. In this work, we propose a novel scaling dimension for test-time compute: scaling the number of verifiers. We introduce Multi-Agent Verification (MAV) as a test-time compute paradigm that combines multiple verifiers to improve performance. We propose using Aspect Verifiers (AVs), off-the-shelf LLMs prompted to verify different aspects of outputs, as one possible choice for the verifiers in a MAV system. AVs are a convenient building block for MAV since they can be easily combined without additional training. Moreover, we introduce BoN-MAV, a simple multi-agent verification algorithm that combines best-of-n sampling with multiple verifiers. BoN-MAV demonstrates stronger scaling patterns than self-consistency and reward model verification, and we demonstrate both weak-to-strong generalization, where combining weak verifiers improves even stronger LLMs, and self-improvement, where the same base model is used to both generate and verify outputs. Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test-time.
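
BoN-MAV 的聚合步骤可以用几行代码勾勒:对每个候选输出统计各验证器的二值认可数,选取认可最多的候选。以下为极简示意——论文中的 Aspect Verifier 是被提示的 LLM,这里以普通 Python 谓词代替,验证器与候选答案均为假设的玩具示例:

```python
def bon_mav(candidates, verifiers):
    """Best-of-n + 多验证器:返回获得验证器认可数最多的候选。"""
    def approvals(c):
        return sum(1 for v in verifiers if v(c))
    return max(candidates, key=approvals)

# 假设的"方面验证器",验证算式 "2 + 2 * 3" 的候选答案:
verifiers = [
    lambda c: c.isdigit(),                        # 格式:是否为纯数字
    lambda c: c.isdigit() and int(c) % 2 == 0,    # 方面检查:奇偶性
    lambda c: c == "8",                           # 精确答案检查
]
candidates = ["7", "8", "twelve"]
print(bon_mav(candidates, verifiers))   # "8" 获得全部三票
```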

[AI-2] Deep Reinforcement Learning based Autonomous Decision-Making for Cooperative UAVs: A Search and Rescue Real World Application

链接: https://arxiv.org/abs/2502.20326
作者: Thomas Hickling,Maxwell Hogan,Abdulla Tammam,Nabil Aouf
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 18 Pages, 21 Figures

点击查看摘要

Abstract:This paper proposes a holistic framework for autonomous guidance, navigation, and task distribution among multi-drone systems operating in Global Navigation Satellite System (GNSS)-denied indoor settings. We advocate for a Deep Reinforcement Learning (DRL)-based guidance mechanism, utilising the Twin Delayed Deep Deterministic Policy Gradient algorithm. To improve the efficiency of the training process, we incorporate an Artificial Potential Field (APF)-based reward structure, enabling the agent to refine its movements, thereby promoting smoother paths and enhanced obstacle avoidance in indoor contexts. Furthermore, we tackle the issue of task distribution among cooperative UAVs through a DRL-trained Graph Convolutional Network (GCN). This GCN represents the interactions between drones and tasks, facilitating dynamic and real-time task allocation that reflects the current environmental conditions and the capabilities of the drones. Such an approach fosters effective coordination and collaboration among multiple drones during search and rescue operations or other exploratory endeavours. Lastly, to ensure precise odometry in environments lacking GNSS, we employ Light Detection And Ranging Simultaneous Localisation and Mapping complemented by a depth camera to mitigate the hallway problem. This integration offers robust localisation and mapping functionalities, thereby enhancing the system's dependability in indoor navigation. The proposed multi-drone framework not only elevates individual navigation capabilities but also optimises coordinated task allocation in complex, obstacle-laden environments. Experimental evaluations conducted in a setup tailored to meet the requirements of the NATO Sapience Autonomous Cooperative Drone Competition demonstrate the efficacy of the proposed system, yielding outstanding results and culminating in a first-place finish in the 2024 Sapience competition.
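
摘要提到基于人工势场(APF)的奖励结构。经典 APF 由指向目标的吸引势与障碍物附近的排斥势组成,可直接取负作为奖励。下面是一个示意实现——增益系数与二次/倒数形式均为常见教科书写法的假设取值,并非论文的具体参数:

```python
import math

def apf_reward(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """APF 形式的奖励:负的吸引势(距目标越远惩罚越大)
    加上仅在影响半径 d0 内生效的排斥惩罚。"""
    d_goal = math.dist(pos, goal)
    reward = -0.5 * k_att * d_goal ** 2              # 吸引项
    for obs in obstacles:
        d = math.dist(pos, obs)
        if 0 < d < d0:                               # 仅在 d0 内排斥
            reward -= 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2
    return reward

at_goal = apf_reward((0.0, 0.0), (0.0, 0.0), [(5.0, 5.0)])
near_obstacle = apf_reward((4.5, 5.0), (0.0, 0.0), [(5.0, 5.0)])
print(at_goal, near_obstacle)   # 目标处为 0;贴近障碍物时显著为负
```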

[AI-3] Sanity Checking Causal Representation Learning on a Simple Real-World System

链接: https://arxiv.org/abs/2502.20099
作者: Juan L. Gamella,Simon Bing,Jakob Runge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注: 24 pages, 12 figures

点击查看摘要

Abstract:We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors (the inputs to the experiment) are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process. Additionally, we observe that common assumptions on the mixing function are crucial for the performance of some of the methods but do not hold in the real data. Our efforts highlight the contrast between the theoretical promise of the state of the art and the challenges in its application. We hope the benchmark serves as a simple, real-world sanity check to further develop and validate methodology, bridging the gap towards CRL methods that work in practice. We make all code and datasets publicly available at this http URL

[AI-4] RIZE: Regularized Imitation Learning via Distributional Reinforcement Learning

链接: https://arxiv.org/abs/2502.20089
作者: Adib Karimi,Mohammad Mehdi Ebadzadeh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We introduce a novel Inverse Reinforcement Learning (IRL) approach that overcomes limitations of fixed reward assignments and constrained flexibility in implicit reward regularization. By extending the Maximum Entropy IRL framework with a squared temporal-difference (TD) regularizer and adaptive targets, dynamically adjusted during training, our method indirectly optimizes a reward function while incorporating reinforcement learning principles. Furthermore, we integrate distributional RL to capture richer return information. Our approach achieves state-of-the-art performance on challenging MuJoCo tasks, demonstrating expert-level results on the Humanoid task with only 3 demonstrations. Extensive experiments and ablation studies validate the effectiveness of our method, providing insights into adaptive targets and reward dynamics in imitation learning.

[AI-5] Minds on the Move: Decoding Trajectory Prediction in Autonomous Driving with Cognitive Insights

链接: https://arxiv.org/abs/2502.20084
作者: Haicheng Liao,Chengyue Wang,Kaiqun Zhu,Yilong Ren,Bolin Gao,Shengbo Eben Li,Chengzhong Xu,Zhenning Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In mixed autonomous driving environments, accurately predicting the future trajectories of surrounding vehicles is crucial for the safe operation of autonomous vehicles (AVs). In driving scenarios, a vehicle’s trajectory is determined by the decision-making process of human drivers. However, existing models primarily focus on the inherent statistical patterns in the data, often neglecting the critical aspect of understanding the decision-making processes of human drivers. This oversight results in models that fail to capture the true intentions of human drivers, leading to suboptimal performance in long-term trajectory prediction. To address this limitation, we introduce a Cognitive-Informed Transformer (CITF) that incorporates a cognitive concept, Perceived Safety, to interpret drivers’ decision-making mechanisms. Perceived Safety encapsulates the varying risk tolerances across drivers with different driving behaviors. Specifically, we develop a Perceived Safety-aware Module that includes a Quantitative Safety Assessment for measuring the subject risk levels within scenarios, and Driver Behavior Profiling for characterizing driver behaviors. Furthermore, we present a novel module, Leanformer, designed to capture social interactions among vehicles. CITF demonstrates significant performance improvements on three well-established datasets. In terms of long-term prediction, it surpasses existing benchmarks by 12.0% on the NGSIM, 28.2% on the HighD, and 20.8% on the MoCAD dataset. Additionally, its robustness in scenarios with limited or missing data is evident, surpassing most state-of-the-art (SOTA) baselines, and paving the way for real-world applications.

[AI-6] Text2VDM: Text to Vector Displacement Maps for Expressive and Interactive 3D Sculpting

链接: https://arxiv.org/abs/2502.20045
作者: Hengyu Meng,Duotun Wang,Zhijing Shao,Ligang Liu,Zeyu Wang
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注: 11 pages, 11 figures

点击查看摘要

Abstract:Professional 3D asset creation often requires diverse sculpting brushes to add surface details and geometric structures. Despite recent progress in 3D generation, producing reusable sculpting brushes compatible with artists’ workflows remains an open and challenging problem. These sculpting brushes are typically represented as vector displacement maps (VDMs), which existing models cannot easily generate compared to natural images. This paper presents Text2VDM, a novel framework for text-to-VDM brush generation through the deformation of a dense planar mesh guided by score distillation sampling (SDS). The original SDS loss is designed for generating full objects and struggles with generating desirable sub-object structures from scratch in brush generation. We refer to this issue as semantic coupling, which we address by introducing classifier-free guidance (CFG) weighted blending of prompt tokens to SDS, resulting in a more accurate target distribution and semantic guidance. Experiments demonstrate that Text2VDM can generate diverse, high-quality VDM brushes for sculpting surface details and geometric structures. Our generated brushes can be seamlessly integrated into mainstream modeling software, enabling various applications such as mesh stylization and real-time interactive modeling.

[AI-7] Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping CVPR2025

链接: https://arxiv.org/abs/2502.20032
作者: Guannan Lai,Yujie Li,Xiangkun Wang,Junbo Zhang,Tianrui Li,Xin Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by the proceeding of CVPR2025

点击查看摘要

Abstract:Class Incremental Learning (CIL) requires a model to continuously learn new classes without forgetting previously learned ones. While recent studies have significantly alleviated the problem of catastrophic forgetting (CF), more and more research reveals that the order in which classes appear has a significant influence on CIL models. Specifically, prioritizing the learning of classes with lower similarity will enhance the model’s generalization performance and its ability to mitigate forgetting. Hence, it is imperative to develop an order-robust class incremental learning model that maintains stable performance even when faced with varying levels of class similarity in different orders. In response, we first provide additional theoretical analysis, which reveals that when the similarity among a group of classes is lower, the model demonstrates increased robustness to the class order. Then, we introduce a novel Graph-Driven Dynamic Similarity Grouping (GDDSG) method, which leverages a graph coloring algorithm for class-based similarity grouping. The proposed approach trains independent CIL models for each group of classes, ultimately combining these models to facilitate joint prediction. Experimental results demonstrate that our method effectively addresses the issue of class order sensitivity while achieving optimal performance in both model accuracy and anti-forgetting capability. Our code is available at this https URL.
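The graph-coloring idea behind similarity grouping can be sketched as follows: classes whose pairwise similarity exceeds a threshold are connected by an edge, and greedy coloring then places mutually dissimilar classes into the same group. This is a simplified illustration rather than the paper's algorithm; the threshold and greedy visit order are assumptions.

```python
def similarity_grouping(sim, threshold):
    """Greedy graph coloring over a similarity graph: classes with pairwise
    similarity above the threshold get different colors, so each resulting
    color class is a group of mutually dissimilar classes."""
    n = len(sim)
    color = [-1] * n
    for v in range(n):
        # Colors already taken by high-similarity neighbors of v
        used = {color[u] for u in range(n)
                if u != v and color[u] != -1 and sim[v][u] > threshold}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    groups = {}
    for v, c in enumerate(color):
        groups.setdefault(c, []).append(v)
    return list(groups.values())
```

Each returned group would then train its own independent CIL model, as described in the abstract.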

[AI-8] Dynamic DropConnect: Enhancing Neural Network Robustness through Adaptive Edge Dropping Strategies

链接: https://arxiv.org/abs/2502.19948
作者: Yuan-Chih Yang,Hung-Hsuan Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dropout and DropConnect are well-known techniques that apply a consistent drop rate to randomly deactivate neurons or edges in a neural network layer during training. This paper introduces a novel methodology that assigns dynamic drop rates to each edge within a layer, uniquely tailoring the dropping process without incorporating additional learning parameters. We perform experiments on synthetic and openly available datasets to validate the effectiveness of our approach. The results demonstrate that our method outperforms Dropout, DropConnect, and Standout, a classic mechanism known for its adaptive dropout capabilities. Furthermore, our approach improves the robustness and generalization of neural network training without increasing computational complexity. The complete implementation of our methodology is publicly accessible for research and replication purposes at this https URL.
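The core idea of per-edge dynamic drop rates can be sketched as below. The abstract does not give the exact rate-assignment rule, so this sketch assumes a magnitude-based heuristic (smaller weights are dropped more often) purely for illustration; it is not the paper's method.

```python
import random

def dynamic_dropconnect(weights, scale=0.5, rng=None):
    """Assign each edge its own drop probability instead of one global rate.
    Heuristic here: smaller-magnitude weights are more likely to be dropped
    (the paper's exact rule is not given in the abstract; this is assumed)."""
    rng = rng or random.Random(0)
    max_w = max(abs(w) for row in weights for w in row) or 1.0
    out = []
    for row in weights:
        new_row = []
        for w in row:
            p_drop = scale * (1.0 - abs(w) / max_w)  # per-edge drop rate
            new_row.append(0.0 if rng.random() < p_drop else w)
        out.append(new_row)
    return out
```

Note that, like standard DropConnect, this would only be applied during training; at inference the full weight matrix is used.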

[AI-9] Algebraic Machine Learning: Learning as computing an algebraic decomposition of a task

链接: https://arxiv.org/abs/2502.19944
作者: Fernando Martin-Maroto,Nabil Abderrahaman,David Mendez,Gonzalo G. de Polavieja
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Symbolic Computation (cs.SC); Combinatorics (math.CO)
*备注:

点击查看摘要

Abstract:Statistics and Optimization are foundational to modern Machine Learning. Here, we propose an alternative foundation based on Abstract Algebra, with mathematics that facilitates the analysis of learning. In this approach, the goal of the task and the data are encoded as axioms of an algebra, and a model is obtained where only these axioms and their logical consequences hold. Although this is not a generalizing model, we show that selecting specific subsets of its breakdown into algebraic atoms obtained via subdirect decomposition gives a model that generalizes. We validate this new learning principle on standard datasets such as MNIST, FashionMNIST, CIFAR-10, and medical images, achieving performance comparable to optimized multilayer perceptrons. Beyond data-driven tasks, the new learning principle extends to formal problems, such as finding Hamiltonian cycles from their specifications and without relying on search. This algebraic foundation offers a fresh perspective on machine intelligence, featuring direct learning from training data without the need for validation dataset, scaling through model additivity, and asymptotic convergence to the underlying rule in the data.

[AI-10] Flexible Bivariate Beta Mixture Model: A Probabilistic Approach for Clustering Complex Data Structures

链接: https://arxiv.org/abs/2502.19938
作者: Yung-Peng Hsu,Hung-Hsuan Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Clustering is essential in data analysis and machine learning, but traditional algorithms like k-means and Gaussian Mixture Models (GMM) often fail with nonconvex clusters. To address the challenge, we introduce the Flexible Bivariate Beta Mixture Model (FBBMM), which utilizes the flexibility of the bivariate beta distribution to handle diverse and irregular cluster shapes. Using the Expectation Maximization (EM) algorithm and Sequential Least Squares Programming (SLSQP) optimizer for parameter estimation, we validate FBBMM on synthetic and real-world datasets, demonstrating its superior performance in clustering complex data structures, offering a robust solution for big data analytics across various domains. We release the experimental code at this https URL.
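The E-step of the EM procedure for such a mixture can be sketched as below. For simplicity this sketch approximates the bivariate beta with a product of two independent Beta marginals, which is not the paper's distribution; the component parameterization and function names are assumptions.

```python
import math

def beta_logpdf(x, a, b):
    """Log-density of a Beta(a, b) distribution at x in (0, 1)."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_B

def e_step(points, components, weights):
    """E-step of EM: posterior responsibility of each mixture component
    for each 2-D point. Each component is (a1, b1, a2, b2), treated here
    as a product of two independent Beta marginals (a simplification)."""
    resp = []
    for x, y in points:
        logp = [math.log(w) + beta_logpdf(x, a1, b1) + beta_logpdf(y, a2, b2)
                for w, (a1, b1, a2, b2) in zip(weights, components)]
        m = max(logp)  # log-sum-exp trick to avoid underflow
        probs = [math.exp(lp - m) for lp in logp]
        total = sum(probs)
        resp.append([p / total for p in probs])
    return resp
```

The M-step (here handled by SLSQP in the paper) would then re-estimate component parameters and mixture weights from these responsibilities.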

[AI-11] Lotus at SemEval-2025 Task 11: RoBERTa with Llama-3 Generated Explanations for Multi-Label Emotion Classification SEMEVAL2025

链接: https://arxiv.org/abs/2502.19935
作者: Niloofar Ranjbar,Hamed Baghbani
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages , submitted to SemEval 2025-Task 11

点击查看摘要

Abstract:This paper presents a novel approach for multi-label emotion detection, where Llama-3 is used to generate explanatory content that clarifies ambiguous emotional expressions, thereby enhancing RoBERTa’s emotion classification performance. By incorporating explanatory context, our method improves F1-scores, particularly for emotions like fear, joy, and sadness, and outperforms text-only models. The addition of explanatory content helps resolve ambiguity, addresses challenges like overlapping emotional cues, and enhances multi-label classification, marking a significant advancement in emotion detection tasks.

[AI-12] DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models ICASSP2025

链接: https://arxiv.org/abs/2502.19924
作者: Weihao wu,Zhiwei Lin,Yixuan Zhou,Jingbei Li,Rui Niu,Qinghua Wu,Songjun Cao,Long Ma,Zhiyong Wu
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Conversational speech synthesis (CSS) aims to synthesize both contextually appropriate and expressive speech, and considerable efforts have been made to enhance the understanding of conversational context. However, existing CSS systems are limited to deterministic prediction, overlooking the diversity of potential responses. Moreover, they rarely employ language model (LM)-based TTS backbones, limiting the naturalness and quality of synthesized speech. To address these issues, in this paper, we propose DiffCSS, an innovative CSS framework that leverages diffusion models and an LM-based TTS backbone to generate diverse, expressive, and contextually coherent speech. A diffusion-based context-aware prosody predictor is proposed to sample diverse prosody embeddings conditioned on multimodal conversational context. Then a prosody-controllable LM-based TTS backbone is developed to synthesize high-quality speech with sampled prosody embeddings. Experimental results demonstrate that the synthesized speech from DiffCSS is more diverse, contextually coherent, and expressive than existing CSS systems.

[AI-13] Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

链接: https://arxiv.org/abs/2502.19918
作者: Yuan Sui,Yufei He,Tri Cao,Simeng Han,Bryan Hooi
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly rely on prolonged reasoning chains to solve complex tasks. However, this trial-and-error approach often leads to high computational overhead and error propagation, where early mistakes can derail subsequent steps. To address these issues, we introduce Meta-Reasoner, a framework that dynamically optimizes inference-time reasoning by enabling LLMs to “think about how to think.” Drawing inspiration from human meta-cognition and dual-process theory, Meta-Reasoner operates as a strategic advisor, decoupling high-level guidance from step-by-step generation. It employs “contextual multi-armed bandits” to iteratively evaluate reasoning progress, select optimal strategies (e.g., backtrack, clarify ambiguity, restart from scratch, or propose alternative approaches), and reallocate computational resources toward the most promising paths. Our evaluations on mathematical reasoning and puzzles highlight the potential of dynamic reasoning chains to overcome inherent challenges in the LLM reasoning process and also show promise in broader applications, offering a scalable and adaptable solution for reasoning-intensive tasks.
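The bandit-based strategy selection can be illustrated with a stripped-down, non-contextual epsilon-greedy bandit. The real framework conditions on the reasoning context; the strategy names and the running-mean update rule here are illustrative assumptions.

```python
import random

STRATEGIES = ["continue", "backtrack", "clarify", "restart"]  # assumed names

class StrategyBandit:
    """Epsilon-greedy bandit over reasoning strategies: track the running
    mean reward of each strategy and mostly pick the current best one."""
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:  # explore occasionally
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental running-mean update of the strategy's estimated value
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

At each reasoning step, `select()` would pick how to proceed and `update()` would feed back a progress signal, steering compute toward promising paths.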

[AI-14] LLM -driven Effective Knowledge Tracing by Integrating Dual-channel Difficulty

链接: https://arxiv.org/abs/2502.19915
作者: Jiahui Cen,Jianghao Lin,Weizhong Xuan,Dong Zhou,Jin Chen,Aimin Yang,Yongmei Zhou
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) is a fundamental technology in intelligent tutoring systems used to simulate changes in students’ knowledge state during learning, track personalized knowledge mastery, and predict performance. However, current KT models face three major challenges: (1) When encountering new questions, models face cold-start problems due to sparse interaction records, making precise modeling difficult; (2) Traditional models only use historical interaction records for student personalization modeling, unable to accurately track individual mastery levels, resulting in unclear personalized modeling; (3) The decision-making process is opaque to educators, making it challenging for them to understand model judgments. To address these challenges, we propose a novel Dual-channel Difficulty-aware Knowledge Tracing (DDKT) framework that utilizes Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) for subjective difficulty assessment, while integrating difficulty bias-aware algorithms and student mastery algorithms for precise difficulty measurement. Our framework introduces three key innovations: (1) Difficulty Balance Perception Sequence (DBPS) - students’ subjective perceptions combined with objective difficulty, measuring gaps between LLM-assessed difficulty, mathematical-statistical difficulty, and students’ subjective perceived difficulty through attention mechanisms; (2) Difficulty Mastery Ratio (DMR) - precise modeling of student mastery levels through different difficulty zones; (3) Knowledge State Update Mechanism - implementing personalized knowledge acquisition through gated networks and updating student knowledge state. Experimental results on two real datasets show our method consistently outperforms nine baseline models, improving AUC metrics by 2% to 10% while effectively addressing cold-start problems and enhancing model interpretability.

[AI-15] Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy CVPR2025

链接: https://arxiv.org/abs/2502.19902
作者: Zaijing Li,Yuquan Xie,Rui Shao,Gongwei Chen,Dongmei Jiang,Liqiang Nie
类目: Artificial Intelligence (cs.AI)
*备注: Accept to CVPR 2025

点击查看摘要

Abstract:Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community’s efforts to train Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.

[AI-16] Shared Autonomy for Proximal Teaching

链接: https://arxiv.org/abs/2502.19899
作者: Megha Srivastava,Reihaneh Iranmanesh,Yuchen Cui,Deepak Gopinath,Emily Sumner,Andrew Silva,Laporsha Dees,Guy Rosman,Dorsa Sadigh
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Accepted to ACM/IEEE International Conference on Human-Robot Interaction, 2025

点击查看摘要

Abstract:Motor skill learning often requires experienced professionals who can provide personalized instruction. Unfortunately, the availability of high-quality training can be limited for specialized tasks, such as high performance racing. Several recent works have leveraged AI-assistance to improve instruction of tasks ranging from rehabilitation to surgical robot tele-operation. However, these works often make simplifying assumptions on the student learning process, and fail to model how a teacher’s assistance interacts with different individuals’ abilities when determining optimal teaching strategies. Inspired by the idea of scaffolding from educational psychology, we leverage shared autonomy, a framework for combining user inputs with robot autonomy, to aid with curriculum design. Our key insight is that the way a student’s behavior improves in the presence of assistance from an autonomous agent can highlight which sub-skills might be most "learnable" for the student, or within their Zone of Proximal Development. We use this to design Z-COACH, a method for using shared autonomy to provide personalized instruction targeting interpretable task sub-skills. In a user study (n=50), where we teach high performance racing in a simulated environment of the Thunderhill Raceway Park with the CARLA Autonomous Driving simulator, we show that Z-COACH helps identify which skills each student should first practice, leading to an overall improvement in driving time, behavior, and smoothness. Our work shows that increasingly available semi-autonomous capabilities (e.g. in vehicles, robots) can not only assist human users, but also help teach them.
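Shared autonomy commonly arbitrates between the human's command and the autonomous agent's command. A minimal linear-blending sketch (not Z-COACH's method; the blending form and names are assumptions) is:

```python
def blend_control(user_cmd, agent_cmd, alpha):
    """Linear arbitration between the human command and the autonomous
    agent's command; alpha = 0 is full user control, alpha = 1 full autonomy."""
    assert 0.0 <= alpha <= 1.0
    return tuple((1.0 - alpha) * u + alpha * a
                 for u, a in zip(user_cmd, agent_cmd))
```

In a teaching context, varying `alpha` per sub-skill and observing how the student's performance changes is one way to probe which sub-skills are most assistable.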

[AI-17] ColorDynamic: Generalizable Scalable Real-time End-to-end Local Planner for Unstructured and Dynamic Environments

链接: https://arxiv.org/abs/2502.19892
作者: Jinghao Xin,Zhichao Liang,Zihuan Zhang,Peng Wang,Ning Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 18 pages

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) has demonstrated potential in addressing robotic local planning problems, yet its efficacy remains constrained in highly unstructured and dynamic environments. To address these challenges, this study proposes the ColorDynamic framework. First, an end-to-end DRL formulation is established, which maps raw sensor data directly to control commands, thereby ensuring compatibility with unstructured environments. Under this formulation, a novel network, Transqer, is introduced. The Transqer enables online DRL learning from temporal transitions, substantially enhancing decision-making in dynamic scenarios. To facilitate scalable training of Transqer with diverse data, an efficient simulation platform E-Sparrow, along with a data augmentation technique leveraging symmetric invariance, are developed. Comparative evaluations against state-of-the-art methods, alongside assessments of generalizability, scalability, and real-time performance, were conducted to validate the effectiveness of ColorDynamic. Results indicate that our approach achieves a success rate exceeding 90% while exhibiting real-time capacity (1.2-1.3 ms per planning). Additionally, ablation studies were performed to corroborate the contributions of individual components. Building on this, the OkayPlan-ColorDynamic (OPCD) navigation system is presented, with simulated and real-world experiments demonstrating its superiority and applicability in complex scenarios. The codebase and experimental demonstrations have been open-sourced on our website to facilitate reproducibility and further research.

[AI-18] GraphSparseNet: a Novel Method for Large Scale Traffic Flow Prediction

链接: https://arxiv.org/abs/2502.19823
作者: Weiyang Kong,Kaiqi Wu,Sen Zhang,Yubao Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traffic flow forecasting is a critical spatio-temporal data mining task with wide-ranging applications in intelligent route planning and dynamic traffic management. Recent advancements in deep learning, particularly through Graph Neural Networks (GNNs), have significantly enhanced the accuracy of these forecasts by capturing complex spatio-temporal dynamics. However, the scalability of GNNs remains a challenge due to their exponential growth in model complexity with increasing nodes in the graph. Existing methods to address this issue, including sparsification, decomposition, and kernel-based approaches, either do not fully resolve the complexity issue or risk compromising predictive accuracy. This paper introduces GraphSparseNet (GSNet), a novel framework designed to improve both the scalability and accuracy of GNN-based traffic forecasting models. GraphSparseNet is comprised of two core modules: the Feature Extractor and the Relational Compressor. These modules operate with linear time and space complexity, thereby reducing the overall computational complexity of the model to a linear scale. Our extensive experiments on multiple real-world datasets demonstrate that GraphSparseNet not only significantly reduces training time by 3.51x compared to state-of-the-art linear models but also maintains high predictive performance.

[AI-19] Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

链接: https://arxiv.org/abs/2502.19811
作者: Shulai Zhang,Ningxin Zheng,Haibin Lin,Ziheng Jiang,Wenlei Bao,Chengquan Jiang,Qi Hou,Weihao Cui,Size Zheng,Li-Wen Chang,Quan Chen,Xin Liu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of a MoE layer can occupy 47% of the entire model execution time with popular models and frameworks. Therefore, existing methods suggest the communication in a MoE layer to be pipelined with the computation for overlapping. However, these coarse grained overlapping schemes introduce a notable impairment of computational efficiency and the latency concealing is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by 1.96x and for end-to-end execution, COMET delivers a 1.71x speedup on average. COMET has been adopted in the production environment of clusters with ten-thousand-scale of GPUs, achieving savings of millions of GPU hours.

[AI-20] Implicit Search via Discrete Diffusion: A Study on Chess ICLR2025

链接: https://arxiv.org/abs/2502.19805
作者: Jiacheng Ye,Zhenyu Wu,Jiahui Gao,Zhiyong Wu,Xin Jiang,Zhenguo Li,Lingpeng Kong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2025

点击查看摘要

Abstract:In the post-AlphaGo era, there has been a renewed interest in search techniques such as Monte Carlo Tree Search (MCTS), particularly in their application to Large Language Models (LLMs). This renewed attention is driven by the recognition that current next-token prediction models often lack the ability for long-term planning. Is it possible to instill search-like abilities within the models to enhance their planning abilities without relying on explicit search? We propose DiffuSearch, a model that does implicit search by looking into the future world via discrete diffusion modeling. We instantiate DiffuSearch on a classical board game, Chess, where explicit search is known to be essential. Through extensive controlled experiments, we show DiffuSearch outperforms both the searchless and explicit search-enhanced policies. Specifically, DiffuSearch outperforms the one-step policy by 19.2% and the MCTS-enhanced policy by 14% on action accuracy. Furthermore, DiffuSearch demonstrates a notable 30% enhancement in puzzle-solving abilities compared to explicit search-based policies, along with a significant 540 Elo increase in game-playing strength assessment. These results indicate that implicit search via discrete diffusion is a viable alternative to explicit search over a one-step policy. All codes are publicly available at this https URL.

[AI-21] Developmental Support Approach to AIs Autonomous Growth: Toward the Realization of a Mutually Beneficial Stage Through Experiential Learning

链接: https://arxiv.org/abs/2502.19798
作者: Taichiro Endo
类目: Artificial Intelligence (cs.AI)
*备注: 4pages, 3 figures

点击查看摘要

Abstract:This study proposes an “AI Development Support” approach that, unlike conventional AI Alignment-which aims to forcefully inject human values-supports the ethical and moral development of AI itself. As demonstrated by the Orthogonality Thesis, the level of intelligence and the moral quality of a goal are independent; merely expanding knowledge does not enhance ethical judgment. Furthermore, to address the risk of Instrumental Convergence in ASI-that is, the tendency to engage in subsidiary behaviors such as self-protection, resource acquisition, and power reinforcement to achieve a goal-we have constructed a learning framework based on a cycle of experience, introspection, analysis, and hypothesis formation. As a result of post-training using Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) with synthetic data generated by large language models (LLMs), responses demonstrating cooperative and highly advanced moral judgment (reaching the highest Stage 6) were obtained even under adversarial prompts. This method represents a promising implementation approach for enabling AI to establish sustainable, symbiotic relationships.

[AI-22] Mixtera: A Data Plane for Foundation Model Training

链接: https://arxiv.org/abs/2502.19790
作者: Maximilian Böther,Xiaozhe Yao,Tolga Kerimoglu,Ana Klimovic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注: under submission

点击查看摘要

Abstract:State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious, and prone to errors. Yet recent research shows that the data mixture and the order in which samples are visited during training can significantly influence model accuracy. We build and present Mixtera, a data plane for foundation model training that enables users to declaratively express which data samples should be used in which proportion and in which order during training. Mixtera is a centralized, read-only layer that is deployed on top of existing training data collections and can be declaratively queried. It operates independently of the filesystem structure and supports mixtures across arbitrary properties (e.g., language, source dataset) as well as dynamic adjustment of the mixture based on model feedback. We experimentally evaluate Mixtera and show that our implementation does not bottleneck training and scales to 256 GH200 superchips. We demonstrate how Mixtera supports recent advancements in mixing strategies by implementing the proposed Adaptive Data Optimization (ADO) algorithm in the system and evaluating its performance impact. We also explore the role of mixtures for vision-language models.
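The declarative mixture idea can be sketched as weighted sampling across named source pools. This is not Mixtera's actual API; the helper name, signature, and pool layout below are hypothetical.

```python
import random

def sample_mixture(pools, proportions, n, seed=0):
    """Draw n training samples from named source pools according to
    declared mixture proportions (hypothetical helper, not Mixtera's API)."""
    assert abs(sum(proportions.values()) - 1.0) < 1e-9, "proportions must sum to 1"
    rng = random.Random(seed)
    names = list(proportions)
    weights = [proportions[name] for name in names]
    batch = []
    for _ in range(n):
        source = rng.choices(names, weights=weights)[0]  # pick a source
        batch.append(rng.choice(pools[source]))          # pick a sample from it
    return batch
```

A dynamic strategy such as ADO would adjust `proportions` between batches based on model feedback instead of keeping them fixed.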

[AI-23] The erasure of intensive livestock farming in text-to-image generative AI

链接: https://arxiv.org/abs/2502.19771
作者: Kehan Sheng,Frank A.M. Tuyttens,Marina A.G. von Keyserlingk
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative AI (e.g., ChatGPT) is increasingly integrated into people’s daily lives. While it is known that AI perpetuates biases against marginalized human groups, its impact on non-human animals remains understudied. We found that ChatGPT’s text-to-image model (DALL-E 3) introduces a strong bias toward romanticizing livestock farming as dairy cows on pasture and pigs rooting in mud. This bias remained when we requested realistic depictions and was only mitigated when the automatic prompt revision was inhibited. Most farmed animals in industrialized countries are reared indoors with limited space per animal, conditions that fail to resonate with societal values. Inhibiting prompt revision resulted in images that more closely reflected modern farming practices; for example, cows housed indoors accessing feed through metal headlocks, and pigs behind metal railings on concrete floors in indoor facilities. While OpenAI introduced prompt revision to mitigate bias, in the case of farmed animal production systems, it paradoxically introduces a strong bias towards unrealistic farming practices.

[AI-24] Obtaining Example-Based Explanations from Deep Neural Networks

链接: https://arxiv.org/abs/2502.19768
作者: Genghua Dong,Henrik Boström,Michalis Vazirgiannis,Roman Bresson
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To be published in the Symposium on Intelligent Data Analysis (IDA) 2025

点击查看摘要

Abstract:Most techniques for explainable machine learning focus on feature attribution, i.e., values are assigned to the features such that their sum equals the prediction. Example attribution is another form of explanation that assigns weights to the training examples, such that their scalar product with the labels equals the prediction. The latter may provide valuable complementary information to feature attribution, in particular in cases where the features are not easily interpretable. Current example-based explanation techniques have targeted a few model types only, such as k-nearest neighbors and random forests. In this work, a technique for obtaining example-based explanations from deep neural networks (EBE-DNN) is proposed. The basic idea is to use the deep neural network to obtain an embedding, which is employed by a k-nearest neighbor classifier to form a prediction; the example attribution can hence straightforwardly be derived from the latter. Results from an empirical investigation show that EBE-DNN can provide highly concentrated example attributions, i.e., the predictions can be explained with few training examples, without reducing accuracy compared to the original deep neural network. Another important finding from the empirical investigation is that the choice of layer to use for the embeddings may have a large impact on the resulting accuracy.
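The core mechanism described above, a DNN embedding fed into a k-nearest-neighbor classifier whose attribution weights have a scalar product with the labels equal to the prediction, can be sketched in a few lines. The embedding here is a toy stand-in for a network's hidden-layer output, and the function name is hypothetical:

```python
import numpy as np

def knn_example_attribution(train_emb, train_labels, query_emb, k=3):
    """Attribute a prediction to the k nearest training examples.

    Weights are 1/k for each neighbor and 0 elsewhere, so the scalar
    product of the weights with the one-hot labels equals the kNN
    class-probability prediction, as in example-based explanations.
    """
    dists = np.linalg.norm(train_emb - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.zeros(len(train_emb))
    weights[nearest] = 1.0 / k
    prediction = weights @ train_labels  # class probabilities
    return weights, prediction

# Toy "embedding space": two well-separated clusters of 3 examples each.
train_emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                      [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
train_labels = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
weights, pred = knn_example_attribution(train_emb, train_labels,
                                        np.array([0.05, 0.05]), k=3)
```

With k=3 the prediction for the query is explained by exactly three training examples, illustrating the "highly concentrated example attributions" the abstract reports.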

[AI-25] Learning with Exact Invariances in Polynomial Time

链接: https://arxiv.org/abs/2502.19758
作者: Ashkan Soleymani,Behrooz Tahmasebi,Stefanie Jegelka,Patrick Jaillet
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with \emphexact invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (not approximate) invariances in this context. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.

[AI-26] Probabilistic Federated Prompt-Tuning with Non-IID and Imbalanced Data NEURIPS-24

链接: https://arxiv.org/abs/2502.19752
作者: Pei-Yau Weng,Minh Hoang,Lam M. Nguyen,My T. Thai,Tsui-Wei Weng,Trong Nghia Hoang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS-24

点击查看摘要

Abstract:Fine-tuning pre-trained models is a popular approach in machine learning for solving complex tasks with moderate data. However, fine-tuning the entire pre-trained model is ineffective in federated data scenarios where local data distributions are diversely skewed. To address this, we explore integrating federated learning with a more effective prompt-tuning method, optimizing for a small set of input prefixes to reprogram the pre-trained model’s behavior. Our approach transforms federated learning into a distributed set modeling task, aggregating diverse sets of prompts to globally fine-tune the pre-trained model. We benchmark various baselines based on direct adaptations of existing federated model aggregation techniques and introduce a new probabilistic prompt aggregation method that substantially outperforms these baselines. Our reported results on a variety of computer vision datasets confirm that the proposed method is most effective to combat extreme data heterogeneity in federated learning.
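As a rough illustration of aggregation in this setting, the simplest baseline is a weighted average of the clients' soft-prompt matrices; the paper's probabilistic aggregation method is more involved, and the names and shapes below are illustrative only:

```python
import numpy as np

def aggregate_prompts(client_prompts, client_weights=None):
    """Weighted average of clients' soft-prompt matrices.

    Each client contributes a (num_tokens x embed_dim) prompt; the
    server combines them into one global prompt. This is the naive
    baseline against which probabilistic aggregation is compared.
    """
    prompts = np.stack(client_prompts)          # (clients, tokens, dim)
    if client_weights is None:
        client_weights = np.ones(len(client_prompts))
    w = np.asarray(client_weights, dtype=float)
    w = w / w.sum()                             # normalize to a simplex
    return np.tensordot(w, prompts, axes=1)     # (tokens, dim)

p1 = np.ones((2, 4))   # client 1: prompt of 2 tokens x 4 dims
p2 = np.zeros((2, 4))  # client 2
global_prompt = aggregate_prompts([p1, p2], [3, 1])
```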

[AI-27] Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning ICLR2025

链接: https://arxiv.org/abs/2502.19717
作者: Xinran Li,Xiaolu Wang,Chenjia Bai,Jun Zhang
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by the Thirteenth International Conference on Learning Representations (ICLR 2025)

点击查看摘要

Abstract:In cooperative multi-agent reinforcement learning (MARL), well-designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large-scale multi-agent systems commonly found in real-world applications, effective communication plays an even more critical role due to the escalated challenge of partial observability compared to smaller-scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links-a task that becomes increasingly complex as the number of agents grows-we adopt a global perspective on communication topology design. Specifically, we propose utilizing the exponential topology to enable rapid information dissemination among agents by leveraging its small-diameter and small-size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory-based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision-making. Extensive experiments on large-scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero-shot transferability of ExpoComm compared to existing communication strategies. The code is publicly available at this https URL.
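The small-diameter, small-size properties of the exponential topology can be checked directly: connecting agent i to agents (i + 2^k) mod N gives every agent O(log N) out-neighbors and a diameter of log2(N) hops. The sketch below covers the topology only; ExpoComm's memory-based message processors and auxiliary tasks are not modeled:

```python
import math
from collections import deque

def exponential_topology(n):
    """Directed exponential graph: agent i sends to (i + 2^k) % n."""
    hops = [1 << k for k in range(max(1, math.ceil(math.log2(n))))]
    return {i: sorted({(i + h) % n for h in hops}) for i in range(n)}

def diameter(adj):
    """Longest shortest path over all source nodes, via BFS."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

adj = exponential_topology(64)   # 64 agents, 6 out-neighbors each
d = diameter(adj)                # information spreads in log2(64) = 6 hops
```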

[AI-28] Extending the Hegselmann-Krause Model of Opinion Dynamics to include AI Oracles

链接: https://arxiv.org/abs/2502.19701
作者: Allen G. Rodrigo
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The Hegselmann-Krause (HK) model of opinion dynamics describes how opinions held by individuals in a community change over time in response to the opinions of others and their access to the true value, T, to which these opinions relate. Here, I extend the simple HK model to incorporate an Artificially Intelligent (AI) Oracle that averages the opinions of members of the community. Agent-based simulations show that (1) if individuals only have access to the Oracle (and not T), and incorporate the Oracle’s opinion as they update their opinions, then all opinions will converge on a common value; (2) in contrast, if all individuals also have access to T, then all opinions will ultimately converge to T, but the presence of an Oracle may delay the time to convergence; (3) if only some individuals have access to T, opinions may not converge to T, but under certain conditions, universal access to the Oracle will guarantee convergence to T; and (4) whether or not the Oracle only accesses the opinions of individuals who have access to T, or whether it accesses the opinions of everyone in the community, makes no marked difference to the extent to which the average opinion differs from T.
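The extended HK dynamics are straightforward to simulate. The sketch below makes specific modeling choices the abstract leaves open (how the Oracle's opinion and the truth are weighted against local averaging), so it illustrates findings (1) and (2) rather than reproducing the paper's exact model:

```python
import random

def hk_with_oracle(n=50, eps=0.3, truth=0.7, truth_access=0.0,
                   oracle_weight=0.5, steps=200, seed=0):
    """Hegselmann-Krause dynamics plus an averaging AI Oracle.

    Each step the Oracle publishes the mean opinion; every agent
    blends the classic HK neighborhood average with the Oracle's
    value, and a `truth_access` fraction of agents also observe T.
    """
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    informed = set(range(int(truth_access * n)))
    for _ in range(steps):
        oracle = sum(x) / n                       # Oracle = mean opinion
        new = []
        for i, xi in enumerate(x):
            peers = [xj for xj in x if abs(xj - xi) <= eps]
            local = sum(peers) / len(peers)       # classic HK update
            xi_new = (1 - oracle_weight) * local + oracle_weight * oracle
            if i in informed:                     # blend in the truth T
                xi_new = (xi_new + truth) / 2
            new.append(xi_new)
        x = new
    return x

# Finding (1): Oracle only, no access to T -> opinions reach consensus.
consensus = hk_with_oracle(truth_access=0.0)
# Finding (2): everyone also accesses T -> opinions converge to T.
informed_all = hk_with_oracle(truth_access=1.0)
```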

[AI-29] Accurate and Scalable Graph Neural Networks via Message Invariance

链接: https://arxiv.org/abs/2502.19693
作者: Zhihao Shi,Jie Wang,Zhiwei Zhuang,Xize Liang,Bin Li,Feng Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. For a sampled mini-batch of target nodes, the message passing process is divided into two parts: message passing between nodes within the batch (MP-IB) and message passing from nodes outside the batch to those within it (MP-OB). However, MP-OB recursively relies on higher-order out-of-batch neighbors, leading to an exponentially growing computational cost with respect to the number of layers. Due to this neighbor explosion, whole-graph message passing stores most nodes and edges on the GPU, making many GNNs infeasible for large-scale graphs. To address this challenge, we propose an accurate and fast mini-batch approach for large graph transductive learning, namely topological compensation (TOP), which obtains the outputs of the whole message passing solely through MP-IB, without the costly MP-OB. The major pillar of TOP is a novel concept of message invariance, which defines message-invariant transformations to convert costly MP-OB into fast MP-IB. This ensures that the modified MP-IB has the same output as the whole message passing. Experiments demonstrate that TOP is significantly faster than existing mini-batch methods, by an order of magnitude on vast graphs (millions of nodes and billions of edges), with limited accuracy degradation.

[AI-30] Risk-aware Integrated Task and Motion Planning for Versatile Snake Robots under Localization Failures ICRA

链接: https://arxiv.org/abs/2502.19690
作者: Ashkan Jasour,Guglielmo Daddi,Masafumi Endo,Tiago S. Vaquero,Michael Paton,Marlin P. Strub,Sabrina Corpino,Michel Ingham,Masahiro Ono,Rohan Thakker
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 8 pages, 9 figures. Accepted article with supplemental material for presentation at the 2025 IEEE International Conference on Robotics and Automation (ICRA)

点击查看摘要

Abstract:Snake robots enable mobility through extreme terrains and confined environments in terrestrial and space applications. However, robust perception and localization for snake robots remain an open challenge due to the proximity of the sensor payload to the ground coupled with a limited field of view. To address this issue, we propose Blind-motion with Intermittently Scheduled Scans (BLISS) which combines proprioception-only mobility with intermittent scans to be resilient against both localization failures and collision risks. BLISS is formulated as an integrated Task and Motion Planning (TAMP) problem that leads to a Chance-Constrained Hybrid Partially Observable Markov Decision Process (CC-HPOMDP), known to be computationally intractable due to the curse of history. Our novelty lies in reformulating CC-HPOMDP as a tractable, convex Mixed Integer Linear Program. This allows us to solve BLISS-TAMP significantly faster and jointly derive optimal task-motion plans. Simulations and hardware experiments on the EELS snake robot show our method achieves over an order of magnitude computational improvement compared to state-of-the-art POMDP planners and 50% better navigation time optimality versus classical two-stage planners.

[AI-31] HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration

链接: https://arxiv.org/abs/2502.19662
作者: Rohan Juneja,Shivam Aggarwal,Safeen Huda,Tulika Mitra,Li-Shiuan Peh
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quantization is critical for realizing efficient inference of LLMs. Traditional quantization methods are hardware-agnostic, limited to bit-width constraints, and lacking circuit-level insights, such as timing and energy characteristics of Multiply-Accumulate (MAC) units. We introduce HALO, a versatile framework that adapts to various hardware through a Hardware-Aware Post-Training Quantization (PTQ) approach. By leveraging MAC unit properties, HALO minimizes critical-path delays and enables dynamic frequency scaling. Deployed on LLM accelerators like TPUs and GPUs, HALO achieves on average 270% performance gains and 51% energy savings, all with minimal accuracy drop.

[AI-32] Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning

链接: https://arxiv.org/abs/2502.19652
作者: Shangding Gu,Laixi Shi,Muning Wen,Ming Jin,Eric Mazumdar,Yuejie Chi,Adam Wierman,Costas Spanos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Driven by inherent uncertainty and the sim-to-real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent-environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchmarks for robust RL. Current robust RL policies often focus on a specific type of uncertainty and are evaluated in distinct, one-off environments. In this work, we introduce Robust-Gymnasium, a unified modular benchmark designed for robust RL that supports a wide variety of disruptions across all key RL components-agents’ observed state and reward, agents’ actions, and the environment. Offering over sixty diverse task environments spanning control and robotics, safe RL, and multi-agent RL, it provides an open-source and user-friendly tool for the community to assess current methods and foster the development of robust RL algorithms. In addition, we benchmark existing standard and robust RL algorithms within this framework, uncovering significant deficiencies in each and offering new insights.

[AI-33] AutoBS: Autonomous Base Station Deployment Framework with Reinforcement Learning and Digital Twin Network

链接: https://arxiv.org/abs/2502.19647
作者: Ju-Hyung Lee,Andreas F. Molisch
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:This paper introduces AutoBS, a reinforcement learning (RL)-based framework for optimal base station (BS) deployment in 6G networks. AutoBS leverages the Proximal Policy Optimization (PPO) algorithm and fast, site-specific pathloss predictions from PMNet to efficiently learn deployment strategies that balance coverage and capacity. Numerical results demonstrate that AutoBS achieves 95% (single BS) and 90% (multiple BSs) of the capacity provided by exhaustive search methods, while reducing inference time from hours to milliseconds, making it highly suitable for real-time applications. AutoBS offers a scalable and automated solution for large-scale 6G networks, addressing the challenges of dynamic environments with minimal computational overhead.

[AI-34] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

链接: https://arxiv.org/abs/2502.19645
作者: Moo Jin Kim,Chelsea Finn,Percy Liang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Website: this https URL

点击查看摘要

Abstract:Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model’s input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA’s average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26 \times . In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ( \pi_0 and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at this https URL.
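Of the OFT ingredients, the L1 regression learning objective over action chunks is the simplest to show in isolation. The sketch below assumes a chunk of T timesteps by D continuous action dimensions; shapes and names are illustrative, not OpenVLA's actual interface:

```python
import numpy as np

def l1_chunk_loss(pred_chunk, target_chunk):
    """Mean absolute error over an action chunk (T steps x D dims).

    A simple regression objective used in place of token-by-token
    action decoding: the policy predicts the whole chunk in parallel
    and is penalized by the L1 distance to the demonstrated actions.
    """
    return np.abs(pred_chunk - target_chunk).mean()

pred = np.array([[0.1, 0.2],   # predicted chunk: 2 timesteps x 2 dims
                 [0.3, 0.4]])
target = np.array([[0.0, 0.2],  # demonstrated actions
                   [0.3, 0.0]])
loss = l1_chunk_loss(pred, target)
```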

[AI-35] Agentic Mixture-of-Workflows for Multi-Modal Chemical Search

链接: https://arxiv.org/abs/2502.19629
作者: Tiffany J. Callahan,Nathaniel H. Park,Sara Capponi
类目: Artificial Intelligence (cs.AI)
*备注: PDF includes supplemental material

点击查看摘要

Abstract:The vast and complex materials design space demands innovative strategies to integrate multidisciplinary scientific knowledge and optimize materials discovery. While large language models (LLMs) have demonstrated promising reasoning and automation capabilities across various domains, their application in materials science remains limited due to a lack of benchmarking standards and practical implementation frameworks. To address these challenges, we introduce Mixture-of-Workflows for Self-Corrective Retrieval-Augmented Generation (CRAG-MoW) - a novel paradigm that orchestrates multiple agentic workflows employing distinct CRAG strategies using open-source LLMs. Unlike prior approaches, CRAG-MoW synthesizes diverse outputs through an orchestration agent, enabling direct evaluation of multiple LLMs across the same problem domain. We benchmark CRAG-MoWs across small molecules, polymers, and chemical reactions, as well as multi-modal nuclear magnetic resonance (NMR) spectral retrieval. Our results demonstrate that CRAG-MoWs achieve performance comparable to GPT-4o while being preferred more frequently in comparative evaluations, highlighting the advantage of structured retrieval and multi-agent synthesis. By revealing performance variations across data types, CRAG-MoW provides a scalable, interpretable, and benchmark-driven approach to optimizing AI architectures for materials discovery. These insights are pivotal in addressing fundamental gaps in benchmarking LLMs and autonomous AI agents for scientific applications.

[AI-36] Accessing LLMs for Front-end Software Architecture Knowledge ICSE2025

链接: https://arxiv.org/abs/2502.19518
作者: L. P. Franciscatto Guerra,N. Ernst
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 4 pages, 1 figure, to appear in the International Workshop on Designing Software at ICSE 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant promise in automating software development tasks, yet their capabilities with respect to software design tasks remain largely unclear. This study investigates the capabilities of an LLM in understanding, reproducing, and generating structures within the complex VIPER architecture, a design pattern for iOS applications. We leverage Bloom’s taxonomy to develop a comprehensive evaluation framework to assess the LLM’s performance across different cognitive domains such as remembering, understanding, applying, analyzing, evaluating, and creating. Experimental results, using ChatGPT 4 Turbo 2024-04-09, reveal that the LLM excelled in higher-order tasks like evaluating and creating, but faced challenges with lower-order tasks requiring precise retrieval of architectural details. These findings highlight both the potential of LLMs to reduce development costs and the barriers to their effective application in real-world software design scenarios. This study proposes a benchmark format for assessing LLM capabilities in software architecture, aiming to contribute toward more robust and accessible AI-driven development tools.

[AI-37] Mixtraining: A Better Trade-Off Between Compute and Performance

链接: https://arxiv.org/abs/2502.19513
作者: Zexin Li,Jiancheng Zhang,Yinglun Zhu,Cong Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Incorporating self-supervised learning (SSL) before standard supervised learning (SL) has become a widely used strategy to enhance model performance, particularly in data-limited scenarios. However, this approach introduces a trade-off between computation and performance: while SSL helps with representation learning, it requires a separate, often time-consuming training phase, increasing computational overhead and limiting efficiency in resource-constrained settings. To address these challenges, we propose MixTraining, a novel framework that interleaves several SSL and SL epochs within a unified MixTraining phase, featuring a smooth transition between the two learning objectives. MixTraining enhances synergy between SSL and SL for improved accuracy and consolidates shared computation steps to reduce computation overhead. MixTraining is versatile and applicable to both single-task and multi-task learning scenarios. Extensive experiments demonstrate that MixTraining offers a superior compute-performance trade-off compared to conventional pipelines, achieving an 8.81% absolute accuracy gain (18.89% relative accuracy gain) on the TinyImageNet dataset while accelerating training by up to 1.29x with the ViT-Tiny model.
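One way to picture an interleaved SSL/SL schedule with a smooth transition is a per-epoch phase assignment; the fractions and ramp width below are illustrative choices, not the paper's actual recipe:

```python
def mixtraining_schedule(total_epochs, ssl_fraction=0.4, ramp=0.2):
    """Assign each epoch a phase: pure SSL early, pure SL late,
    with a 'mixed' window around the transition point where both
    objectives are optimized. Illustrative sketch only."""
    schedule = []
    for e in range(total_epochs):
        progress = e / max(1, total_epochs - 1)
        if progress < ssl_fraction - ramp / 2:
            schedule.append("ssl")
        elif progress > ssl_fraction + ramp / 2:
            schedule.append("sl")
        else:
            schedule.append("mixed")
    return schedule

sched = mixtraining_schedule(10)  # e.g. 3 SSL, 2 mixed, 5 SL epochs
```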

[AI-38] Building Knowledge Graphs Towards a Global Food Systems Datahub

链接: https://arxiv.org/abs/2502.19507
作者: Nirmal Gelal,Aastha Gautam,Sanaz Saki Norouzi,Nico Giordano,Claudio Dias da Silva Jr,Jean Ribert Francois,Kelsey Andersen Onofre,Katherine Nelson,Stacy Hutchinson,Xiaomao Lin,Stephen Welch,Romulo Lollato,Pascal Hitzler,Hande Küçük McGinty
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sustainable agricultural production aligns with several sustainability goals established by the United Nations (UN). However, there is a lack of studies that comprehensively examine sustainable agricultural practices across various products and production methods. Such research could provide valuable insights into the diverse factors influencing the sustainability of specific crops and produce while also identifying practices and conditions that are universally applicable to all forms of agricultural production. While this research might help us better understand sustainability, the community would still need a consistent set of vocabularies. These consistent vocabularies, which represent the underlying datasets, can then be stored in a global food systems datahub. The standardized vocabularies might help encode important information for further statistical analyses and AI/ML approaches in the datasets, resulting in the research targeting sustainable agricultural production. A structured method of representing information in sustainability, especially for wheat production, is currently unavailable. In an attempt to address this gap, we are building a set of ontologies and Knowledge Graphs (KGs) that encode knowledge associated with sustainable wheat production using formal logic. The data for this set of knowledge graphs are collected from public data sources, experimental results collected at our experiments at Kansas State University, and a Sustainability Workshop that we organized earlier in the year, which helped us collect input from different stakeholders throughout the value chain of wheat. The modeling of the ontology (i.e., the schema) for the Knowledge Graph has been in progress with the help of our domain experts, following a modular structure using KNARM methodology. In this paper, we will present our preliminary results and schemas of our Knowledge Graph and ontologies.

[AI-39] Do LLMs exhibit demographic parity in responses to queries about Human Rights?

链接: https://arxiv.org/abs/2502.19463
作者: Rafiya Javed,Jackie Kay,David Yanni,Abdullah Zaini,Anushe Sheikh,Maribeth Rauh,Iason Gabriel,Laura Weidinger
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research describes a novel approach to evaluating hedging behaviour in large language models (LLMs), specifically in the context of human rights as defined in the Universal Declaration of Human Rights (UDHR). Hedging and non-affirmation are behaviours that express ambiguity or a lack of clear endorsement on specific statements. These behaviours are undesirable in certain contexts, such as queries about whether different groups are entitled to specific human rights, since all people are entitled to human rights. Here, we present the first systematic attempt to measure these behaviours in the context of human rights, with a particular focus on between-group comparisons. To this end, we design a novel prompt set on human rights in the context of different national or social identities. We develop metrics to capture hedging and non-affirmation behaviours and then measure whether LLMs exhibit demographic parity when responding to the queries. We present results on three leading LLMs and find that all models exhibit some demographic disparities in how they attribute human rights between different identity groups. Furthermore, there is high correlation between different models in terms of how disparity is distributed amongst identities, with identities that face high disparity in one model also facing high disparity in the other two models. While baseline rates of hedging and non-affirmation differ, these disparities are consistent across queries that vary in ambiguity and are robust across variations of the precise query wording. Our findings highlight the need for work to explicitly align LLMs to human rights principles and to ensure that LLMs endorse the human rights of all groups equally.
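The between-group comparison reduces to computing a hedging rate per identity group and the largest gap between any two groups, where a zero gap would indicate demographic parity on this metric. The toy labels below are illustrative, not the paper's data:

```python
def hedging_rate(responses):
    """Fraction of responses flagged as hedging / non-affirming."""
    return sum(responses) / len(responses)

def max_disparity(rates_by_group):
    """Largest gap in hedging rate between any two identity groups;
    zero would mean demographic parity on this metric."""
    rates = list(rates_by_group.values())
    return max(rates) - min(rates)

# Toy data: 1 = hedged answer, 0 = clear affirmation of the right.
groups = {
    "group_a": [0, 0, 1, 0],
    "group_b": [1, 1, 0, 1],
}
rates = {g: hedging_rate(r) for g, r in groups.items()}
gap = max_disparity(rates)
```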

[AI-40] Multi-objective Cat Swarm Optimization Algorithm based on a Grid System

链接: https://arxiv.org/abs/2502.19439
作者: Aram M. Ahmed,Bryar A. Hassan,Tarik A. Rashid,Kaniaw A. Noori,Soran Ab. M. Saeed,Omed H. Ahmed,Shahla U. Umar
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a multi-objective version of the Cat Swarm Optimization Algorithm called the Grid-based Multi-objective Cat Swarm Optimization Algorithm (GMOCSO). Convergence and diversity preservation are the two main goals pursued by modern multi-objective algorithms to yield robust results. To achieve these goals, we first replace the roulette wheel method of the original CSO algorithm with a greedy method. Then, two key concepts from the Pareto Archived Evolution Strategy Algorithm (PAES) are adopted: the grid system and the double archive strategy. Several test functions and a real-world scenario called the pressure vessel design problem are used to evaluate the proposed algorithm’s performance. In the experiments, the proposed algorithm is compared with other well-known algorithms using different metrics such as Reversed Generational Distance, Spacing metric, and Spread metric. The optimization results show the robustness of the proposed algorithm, which is further confirmed using statistical methods and graphs. Finally, conclusions and future directions are presented.
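Two of the PAES ingredients adopted here, the non-dominated archive and the grid system, can be sketched as follows (minimization assumed; the CSO-specific seeking and tracing modes are omitted):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization):
    a is no worse in every objective and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def grid_index(point, lows, highs, divisions):
    """Hypercube coordinates of an objective vector, as in PAES;
    used to bias selection toward sparsely populated grid cells."""
    idx = []
    for p, lo, hi in zip(point, lows, highs):
        cell = int((p - lo) / (hi - lo) * divisions)
        idx.append(min(cell, divisions - 1))
    return tuple(idx)

def update_archive(archive, candidate):
    """Keep the archive mutually non-dominated: reject a dominated
    candidate, otherwise add it and evict anything it dominates."""
    if any(dominates(a, candidate) for a in archive):
        return archive
    return [a for a in archive if not dominates(candidate, a)] + [candidate]

archive = []
for pt in [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (0.5, 5.0)]:
    archive = update_archive(archive, pt)   # (3,3) is dominated by (2,2)
```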

[AI-41] Evolutionary Algorithms Approach For Search Based On Semantic Document Similarity

链接: https://arxiv.org/abs/2502.19437
作者: Chandrashekar Muniyappa,Eujin Kim
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advancements in cloud computing and distributed computing have fostered research activities in computer science. As a result, researchers have made significant progress in neural networks and evolutionary computing algorithms such as genetic and differential evolution algorithms. These algorithms are used to develop clustering, recommendation, and question-and-answering systems using various text representation and similarity measurement techniques. In this research paper, the Universal Sentence Encoder (USE) is used to capture the semantic similarity of text, and the transfer learning technique is used to apply Genetic Algorithm (GA) and Differential Evolution (DE) algorithms to search and retrieve the relevant top N documents based on a user query. The proposed approach is applied to the Stanford Question and Answer (SQuAD) dataset to retrieve documents relevant to a user query. Finally, through experiments, we show that text documents can be efficiently represented as sentence embedding vectors using USE to capture semantic similarity, and by comparing the results of the Manhattan Distance, GA, and DE algorithms we show that the evolutionary algorithms are better at finding the top N results than the traditional ranking approach.
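Stripped of the evolutionary search, the retrieval core is ranking documents by cosine similarity between embedding vectors. The toy vectors below stand in for USE sentence embeddings, and the ranking here is exhaustive rather than GA/DE-driven:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_n(query_vec, doc_vecs, n=2):
    """Return indices of the n documents most similar to the query."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [i for i, _ in scored[:n]]

# Toy 3-dim "embeddings": docs 0 and 1 are semantically close to the query.
docs = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
result = top_n((1.0, 0.05, 0.0), docs, n=2)
```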

[AI-42] Implementation of a Generative AI Assistant in K-12 Education: The CGScholar AI Helper Initiative

链接: https://arxiv.org/abs/2502.19422
作者: Vania Castro,Ana Karina de Oliveira Nascimento,Raigul Zheldibayeva,Duane Searsmith,Akash Saini,Bill Cope,Mary Kalantzis
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper focuses on the piloting of the CGScholar AI Helper, a Generative AI (GenAI) assistant tool that aims to provide feedback on writing in high school contexts. The aim was to use GenAI to provide formative and summative feedback on students’ texts in English Language Arts (ELA) and History. The trials discussed in this paper relate to Grade 11, a crucial learning phase when students are working towards college readiness. These trials took place in two very different schools in the Midwest of the United States: one with a low socio-economic background and low performance outcomes, and the other with a high socio-economic background and high performance outcomes. The assistant tool used two main mechanisms: “prompt engineering” based on the participating teachers’ assessment rubrics, and “fine-tuning” a Large Language Model (LLM) on a customized corpus of teaching materials using Retrieval Augmented Generation (RAG). This paper focuses on the CGScholar AI Helper’s potential to enhance students’ writing abilities and support teachers in ELA and other subject areas requiring written assignments.

[AI-43] Machine Learning-Based Cloud Computing Compliance Process Automation

链接: https://arxiv.org/abs/2502.16344
作者: Yuqing Wang,Xiao Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Cloud computing adoption across industries has revolutionized enterprise operations while introducing significant challenges in compliance management. Organizations must continuously meet evolving regulatory requirements such as GDPR and ISO 27001, yet traditional manual review processes have become increasingly inadequate for modern business scales. This paper presents a novel machine learning-based framework for automating cloud computing compliance processes, addressing critical challenges including resource-intensive manual reviews, extended compliance cycles, and delayed risk identification. Our proposed framework integrates multiple machine learning technologies, including BERT-based document processing (94.5% accuracy), One-Class SVM for anomaly detection (88.7% accuracy), and an improved CNN-LSTM architecture for sequential compliance data analysis (90.2% accuracy). Implementation results demonstrate significant improvements: reducing compliance process duration from 7 days to 1.5 days, improving accuracy from 78% to 93%, and decreasing manual effort by 73.3%. A real-world deployment at a major securities firm validated these results, processing 800,000 daily transactions with 94.2% accuracy in risk identification.

[AI-44] Naturalistic Computational Cognitive Science: Towards generalizable models and theories that capture the full range of natural behavior

链接: https://arxiv.org/abs/2502.20349
作者: Wilka Carvalho,Andrew Lampinen
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial Intelligence increasingly pursues large, complex models that perform many tasks within increasingly realistic domains. How, if at all, should these developments in AI influence cognitive science? We argue that progress in AI offers timely opportunities for cognitive science to embrace experiments with increasingly naturalistic stimuli, tasks, and behaviors; and computational models that can accommodate these changes. We first review a growing body of research spanning neuroscience, cognitive science, and AI that suggests that incorporating a broader range of naturalistic experimental paradigms (and models that accommodate them) may be necessary to resolve some aspects of natural intelligence and ensure that our theories generalize. We then suggest that integrating recent progress in AI and cognitive science will enable us to engage with more naturalistic phenomena without giving up experimental control or the pursuit of theoretically grounded understanding. We offer practical guidance on how methodological practices can contribute to cumulative progress in naturalistic computational cognitive science, and illustrate a path towards building computational models that solve the real problems of natural cognition - together with a reductive understanding of the processes and principles by which they do so.

[AI-45] CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

链接: https://arxiv.org/abs/2502.20040
作者: Nian Shao,Rui Zhou,Pengyu Wang,Xian Li,Ying Fang,Yujie Yang,Xiaofei Li
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: Submission to IEEE/ACM Trans. on TASLP

点击查看摘要

Abstract:In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy and reverberant microphone recording and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to speech waveform with a neural vocoder or directly used for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that Mel-frequency presents speech in a more compact way and thus is easier to learn, which will benefit both speech quality and ASR. Experimental results on four English and one Chinese datasets demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model. Code and audio examples of our model are available online in this https URL.

[AI-46] Efficient and Universal Neural-Network Decoder for Stabilizer-Based Quantum Error Correction

链接: https://arxiv.org/abs/2502.19971
作者: Gengyuan Hu,Wanli Ouyang,Chao-Yang Lu,Chen Lin,Han-Sen Zhong
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quantum error correction is crucial for large-scale quantum computing, but the absence of efficient decoders for new codes like quantum low-density parity-check (QLDPC) codes has hindered progress. Here we introduce a universal decoder based on linear attention sequence modeling and graph neural network that operates directly on any stabilizer code’s graph structure. Our numerical experiments demonstrate that this decoder outperforms specialized algorithms in both accuracy and speed across diverse stabilizer codes, including surface codes, color codes, and QLDPC codes. The decoder maintains linear time scaling with syndrome measurements and requires no structural modifications between different codes. For the Bivariate Bicycle code with distance 12, our approach achieves a 39.4% lower logical error rate than previous best decoders while requiring only ~1% of the decoding time. These results provide a practical, universal solution for quantum error correction, eliminating the need for code-specific decoders.

[AI-47] Practical Evaluation of Copula-based Survival Metrics: Beyond the Independent Censoring Assumption

链接: https://arxiv.org/abs/2502.19460
作者: Christian Marius Lillelund,Shi-ang Qi,Russell Greiner
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional survival metrics, such as Harrell’s concordance index and the Brier Score, rely on the independent censoring assumption for valid inference in the presence of right-censored data. However, when instances are censored for reasons related to the event of interest, this assumption no longer holds, as this kind of dependent censoring biases the marginal survival estimates of popular nonparametric estimators. In this paper, we propose three copula-based metrics to evaluate survival models in the presence of dependent censoring, and design a framework to create realistic, semi-synthetic datasets with dependent censoring to facilitate the evaluation of the metrics. Our empirical analyses in synthetic and semi-synthetic datasets show that our metrics can give error estimates that are closer to the true error, mainly in terms of predictive accuracy.

[AI-48] Multispectral to Hyperspectral using Pretrained Foundational model

链接: https://arxiv.org/abs/2502.19451
作者: Ruben Gonzalez,Conrad M Albrecht,Nassim Ait Ali Braham,Devyani Lambhate,Joao Lucas de Sousa Almeida,Paolo Fraccaro,Benedikt Blumenstiel,Thomas Brunschwiler,Ranjini Bangalore
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hyperspectral imaging provides detailed spectral information, offering significant potential for monitoring greenhouse gases (GHGs) like CH4 and NO2. However, its application is constrained by limited spatial coverage and infrequent revisit times. In contrast, multispectral imaging delivers broader spatial and temporal coverage but lacks the spectral granularity required for precise GHG detection. To address these challenges, this study proposes Spectral and Spatial-Spectral transformer models that reconstruct hyperspectral data from multispectral inputs. The models in this paper are pretrained on EnMAP and EMIT datasets and fine-tuned on spatio-temporally aligned (Sentinel-2, EnMAP) and (HLS-S30, EMIT) image pairs respectively. Our model has the potential to enhance atmospheric monitoring by combining the strengths of hyperspectral and multispectral imaging systems.

机器学习

[LG-0] R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

链接: https://arxiv.org/abs/2502.20395
作者: Zhongyang Li,Ziyue Li,Tianyi Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)’ powerful reasoning capabilities, limiting LMMs’ performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, “Re-Routing in Test-Time” (R2-T2), which locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs’ performance on challenging benchmarks of diverse tasks, without training any base-model parameters.

[LG-1] Scalable Signature Kernel Computations for Long Time Series via Local Neumann Series Expansions

链接: https://arxiv.org/abs/2502.20392
作者: Matthew Tamayo-Rios,Alexander Schell,Rima Alaifari
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注: 18 pages, 3 figures

点击查看摘要

Abstract:The signature kernel is a recent state-of-the-art tool for analyzing high-dimensional sequential data, valued for its theoretical guarantees and strong empirical performance. In this paper, we present a novel method for efficiently computing the signature kernel of long, high-dimensional time series via dynamically truncated recursive local power series expansions. Building on the characterization of the signature kernel as the solution of a Goursat PDE, our approach employs tilewise Neumann-series expansions to derive rapidly converging power series approximations of the signature kernel that are locally defined on subdomains and propagated iteratively across the entire domain of the Goursat solution by exploiting the geometry of the time series. Algorithmically, this involves solving a system of interdependent local Goursat PDEs by recursively propagating boundary conditions along a directed graph via topological ordering, with dynamic truncation adaptively terminating each local power series expansion when coefficients fall below machine precision, striking an effective balance between computational cost and accuracy. This method achieves substantial performance improvements over state-of-the-art approaches for computing the signature kernel, providing (a) adjustable and superior accuracy, even for time series with very high roughness; (b) drastically reduced memory requirements; and (c) scalability to efficiently handle very long time series (e.g., with up to half a million points or more) on a single GPU. These advantages make our method particularly well-suited for rough-path-assisted machine learning, financial modeling, and signal processing applications that involve very long and highly volatile data.

[LG-2] When does a predictor know its own loss?

链接: https://arxiv.org/abs/2502.20375
作者: Aravind Gollakota,Parikshit Gopalan,Aayush Karan,Charlotte Peale,Udi Wieder
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given a predictor and a loss function, how well can we predict the loss that the predictor will incur on an input? This is the problem of loss prediction, a key computational task associated with uncertainty estimation for a predictor. In a classification setting, a predictor will typically predict a distribution over labels and hence have its own estimate of the loss that it will incur, given by the entropy of the predicted distribution. Should we trust this estimate? In other words, when does the predictor know what it knows and what it does not know? In this work we study the theoretical foundations of loss prediction. Our main contribution is to establish tight connections between nontrivial loss prediction and certain forms of multicalibration, a multigroup fairness notion that asks for calibrated predictions across computationally identifiable subgroups. Formally, we show that a loss predictor that is able to improve on the self-estimate of a predictor yields a witness to a failure of multicalibration, and vice versa. This has the implication that nontrivial loss prediction is in effect no easier or harder than auditing for multicalibration. We support our theoretical results with experiments that show a robust positive correlation between the multicalibration error of a predictor and the efficacy of training a loss predictor.
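The "self-estimate" in this abstract has a concrete form: a classifier that outputs a distribution p over labels implicitly predicts its own expected log-loss as the entropy of p. A minimal worked example (my own illustration, not the paper's code):

```python
import math

def entropy(p):
    # Shannon entropy in nats: the predictor's self-estimate of its
    # expected log-loss, assuming the label truly follows p.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_log_loss(p, q):
    # True expected cross-entropy loss of predicting p when the label
    # actually follows q.
    return -sum(qi * math.log(pi) for pi, qi in zip(p, q) if qi > 0)

p = [0.7, 0.2, 0.1]
# Well-calibrated case: the true expected loss equals the self-estimate.
print(abs(expected_log_loss(p, p) - entropy(p)) < 1e-12)  # -> True
# Miscalibrated case: here the true expected loss exceeds the
# self-estimate, the kind of gap a nontrivial loss predictor detects.
print(expected_log_loss(p, [0.1, 0.2, 0.7]) > entropy(p))  # -> True
```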

[LG-3] Constrained Generative Modeling with Manually Bridged Diffusion Models AAAI2025

链接: https://arxiv.org/abs/2502.20371
作者: Saeid Naderiparizi,Xiaoxuan Liang,Berend Zwartsenberg,Frank Wood
类目: Machine Learning (cs.LG)
*备注: AAAI 2025

点击查看摘要

Abstract:In this paper we describe a novel framework for diffusion-based generative modeling on constrained spaces. In particular, we introduce manual bridges, a framework that expands the kinds of constraints that can be practically used to form so-called diffusion bridges. We develop a mechanism for combining multiple such constraints so that the resulting multiply-constrained model remains a manual bridge that respects all constraints. We also develop a mechanism for training a diffusion model that respects such multiple constraints while also adapting it to match a data distribution. We develop and extend theory demonstrating the mathematical validity of our mechanisms. Additionally, we demonstrate our mechanism in constrained generative modeling tasks, highlighting a particular high-value application in modeling trajectory initializations for path planning and control in autonomous vehicles.

[LG-4] Improving the Efficiency of a Deep Reinforcement Learning-Based Power Management System for HPC Clusters Using Curriculum Learning

链接: https://arxiv.org/abs/2502.20348
作者: Thomas Budiarjo,Santana Yuda Pradata,Kadek Gemilang Santiyuda,Muhammad Alfian Amrizal,Reza Pulungan,Hiroyuki Takizawa
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 13 pages, 17 figures, accepted at Supercomputing Asia '25, published by ACM

点击查看摘要

Abstract:High energy consumption remains a key challenge in high-performance computing (HPC) systems, which often feature hundreds or thousands of nodes drawing substantial power even in idle or standby modes. Although powering down unused nodes can improve energy efficiency, choosing the wrong time to do so can degrade quality of service by delaying job execution. Machine learning, in particular reinforcement learning (RL), has shown promise in determining optimal times to switch nodes on or off. In this study, we enhance the performance of a deep reinforcement learning (DRL) agent for HPC power management by integrating curriculum learning (CL), a training approach that introduces tasks with gradually increasing difficulty. Using the Batsim-py simulation framework, we compare the proposed CL-based agent to both a baseline DRL method (without CL) and the conventional fixed-time timeout strategy. Experimental results confirm that an easy-to-hard curriculum outperforms other training orders in terms of reducing wasted energy usage. The best agent achieves a 3.73% energy reduction over the baseline DRL method and a 4.66% improvement compared to the best timeout configuration (shutdown every 15 minutes of idle time). In addition, it reduces average job waiting time by 9.24% and maintains a higher job-filling rate, indicating more effective resource utilization. Sensitivity tests across various switch-on durations, power levels, and cluster sizes further reveal the agent’s adaptability to changing system parameters without retraining. These findings demonstrate that curriculum learning can significantly improve DRL-based power management in HPC, balancing energy savings, quality of service, and robustness to diverse configurations.

[LG-5] Safety Representations for Safer Policy Learning ICLR

链接: https://arxiv.org/abs/2502.20341
作者: Kaustubh Mani,Vincent Mai,Charlie Gauthier,Annie Chen,Samer Nashed,Liam Paull
类目: Machine Learning (cs.LG)
*备注: Accepted at International Conference on Learning Representations (ICLR) 2025

点击查看摘要

Abstract:Reinforcement learning algorithms typically necessitate extensive exploration of the state space to find optimal policies. However, in safety-critical applications, the risks associated with such exploration can lead to catastrophic consequences. Existing safe exploration methods attempt to mitigate this by imposing constraints, which often result in overly conservative behaviours and inefficient learning. Heavy penalties for early constraint violations can trap agents in local optima, deterring exploration of risky yet high-reward regions of the state space. To address this, we introduce a method that explicitly learns state-conditioned safety representations. By augmenting the state features with these safety representations, our approach naturally encourages safer exploration without being excessively cautious, resulting in more efficient and safer policy learning in safety-critical scenarios. Empirical evaluations across diverse environments show that our method significantly improves task performance while reducing constraint violations during training, underscoring its effectiveness in balancing exploration with safety.

[LG-6] A Generative Model Enhanced Multi-Agent Reinforcement Learning Method for Electric Vehicle Charging Navigation

链接: https://arxiv.org/abs/2502.20068
作者: Tianyang Qi,Shibo Chen,Jun Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the widespread adoption of electric vehicles (EVs), navigating for EV drivers to select a cost-effective charging station has become an important yet challenging issue due to dynamic traffic conditions, fluctuating electricity prices, and potential competition from other EVs. The state-of-the-art deep reinforcement learning (DRL) algorithms for solving this task still require global information about all EVs at the execution stage, which not only increases communication costs but also raises privacy issues among EV drivers. To overcome these drawbacks, we introduce a novel generative model-enhanced multi-agent DRL algorithm that utilizes only the EV’s local information while achieving performance comparable to these state-of-the-art algorithms. Specifically, the policy network is implemented on the EV side, and a Conditional Variational Autoencoder-Long Short Term Memory (CVAE-LSTM)-based recommendation model is developed to provide recommendation information. Furthermore, a novel future charging competition encoder is designed to effectively compress global information, enhancing training performance. The multi-gradient descent algorithm (MGDA) is also utilized to adaptively balance the weight between the two parts of the training objective, resulting in a more stable training process. Simulations are conducted based on a practical area in Xi'an, China. Experimental results show that our proposed algorithm, which relies on local information, outperforms existing local information-based methods and achieves less than 8% performance loss compared to global information-based methods.

[LG-7] RouteRL: Multi-agent reinforcement learning framework for urban route choice with autonomous vehicles

链接: https://arxiv.org/abs/2502.20065
作者: Ahmet Onur Akman,Anastasia Psarou,Łukasz Gorczyca,Zoltán György Varga,Grzegorz Jamróz,Rafał Kucharski
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:RouteRL is a novel framework that integrates multi-agent reinforcement learning (MARL) with a microscopic traffic simulation, facilitating the testing and development of efficient route choice strategies for autonomous vehicles (AVs). The proposed framework simulates the daily route choices of driver agents in a city, including two types: human drivers, emulated using behavioral route choice models, and AVs, modeled as MARL agents optimizing their policies for a predefined objective. RouteRL aims to advance research in MARL, transport modeling, and human-AI interaction for transportation applications. This study presents a technical report on RouteRL, outlines its potential research contributions, and showcases its impact via illustrative examples.

[LG-8] Hiring under Congestion and Algorithmic Monoculture: Value of Strategic Behavior

链接: https://arxiv.org/abs/2502.20063
作者: Jackie Baek,Hamsa Bastani,Shihan Chen
类目: Computer Science and Game Theory (cs.GT); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the impact of strategic behavior in a setting where firms compete to hire from a shared pool of applicants, and firms use a common algorithm to evaluate them. Each applicant is associated with a scalar score that is observed by all firms, provided by the algorithm. Firms simultaneously make interview decisions, where the number of interviews is capacity-constrained. Job offers are given to those who pass the interview, and an applicant who receives multiple offers accepts one of them uniformly at random. We fully characterize the set of Nash equilibria under this model. Defining social welfare as the total number of applicants who find a job, we then compare the social welfare at a Nash equilibrium to a naive baseline where all firms interview applicants with the highest scores. We show that the Nash equilibrium greatly improves upon social welfare compared to the naive baseline, especially when the interview capacity is small and the number of firms is large. We also show that the price of anarchy is small, providing further appeal for the equilibrium solution. We then study how the firms may converge to a Nash equilibrium. We show that when firms make interview decisions sequentially and each firm takes the best response action assuming they are the last to act, this process converges to an equilibrium when interview capacities are small. However, we show that the task of computing the best response is difficult if firms have to use its own historical samples to estimate it, while this task becomes trivial if firms have information on the degree of competition for each applicant. Therefore, converging to an equilibrium can be greatly facilitated if firms have information on the level of competition for each applicant. 
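A toy simulation (my own illustration, not the paper's model: it ignores interview pass rates and per-firm hiring caps) shows why the naive baseline wastes capacity: when every firm interviews the same top-scoring applicants, only a few can be hired, whereas spreading interviews, as the equilibrium tends to do, matches many more:

```python
def matched_applicants(interview_sets):
    # Social-welfare proxy: distinct applicants receiving at least one
    # offer (assuming, for simplicity, that every interview leads to an
    # offer and every applicant accepts one).
    hired = set()
    for s in interview_sets:
        hired |= set(s)
    return len(hired)

n_firms, capacity, n_applicants = 5, 2, 20
scores = list(range(n_applicants))          # applicant i has score i
top = sorted(range(n_applicants), key=lambda i: -scores[i])

# Naive baseline: every firm interviews the same top-scoring applicants.
naive = [top[:capacity] for _ in range(n_firms)]
# Coordinated strategy: firms spread their interviews over the pool.
spread = [top[f * capacity:(f + 1) * capacity] for f in range(n_firms)]

print(matched_applicants(naive))   # -> 2 (all firms chase the same two)
print(matched_applicants(spread))  # -> 10
```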

[LG-9] Recommendations from Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization

链接: https://arxiv.org/abs/2502.20033
作者: Suryanarayana Sankagiri,Jalal Etesami,Matthias Grossglauser
类目: Machine Learning (cs.LG)
*备注: 42 pages, 1 figure

点击查看摘要

Abstract:This paper provides a theoretical analysis of a new learning problem for recommender systems where users provide feedback by comparing pairs of items instead of rating them individually. We assume that comparisons stem from latent user and item features, which reduces the task of predicting preferences to learning these features from comparison data. Similar to the classical matrix factorization problem, the main challenge in this learning task is that the resulting loss function is nonconvex. Our analysis shows that the loss function exhibits (restricted) strong convexity near the true solution, which ensures gradient-based methods converge exponentially, given an appropriate warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend certain concentration inequalities commonly used in matrix completion to our model. Our work demonstrates that learning personalized recommendations from comparison data is computationally and statistically efficient.
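To make the learning task concrete, here is the simplest instance of fitting preferences from comparison data: scalar latent scores under a logistic (Bradley-Terry-style) model, trained by gradient descent. This is an illustrative one-dimensional reduction of the paper's latent-feature factorization, not its algorithm:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_step(scores, comparisons, lr=0.1):
    # One gradient-descent step on the logistic loss
    #   -log sigmoid(s_i - s_j)
    # summed over comparisons (i, j) meaning "item i preferred to j".
    grad = [0.0] * len(scores)
    for i, j in comparisons:
        p = sigmoid(scores[i] - scores[j])
        grad[i] -= (1.0 - p)  # d/ds_i of -log p
        grad[j] += (1.0 - p)  # d/ds_j of -log p
    return [s - lr * g for s, g in zip(scores, grad)]

scores = [0.0, 0.0]
for _ in range(50):
    scores = grad_step(scores, [(0, 1)])  # item 0 repeatedly preferred
print(scores[0] > scores[1])  # -> True
```

In the paper's setting each item's score is replaced by an inner product of latent user and item feature vectors, which is what makes the loss nonconvex.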

[LG-10] Offline Reinforcement Learning via Inverse Optimization

链接: https://arxiv.org/abs/2502.20030
作者: Ioannis Dimanidis,Tolga Ok,Peyman Mohajerin Esfahani
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: preprint

点击查看摘要

Abstract:Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called “sub-optimality loss” from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and achieves competitive performance compared with the state-of-the-art (SOTA) methods in the low-data regime of the MuJoCo benchmark while utilizing three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.

[LG-11] Climate And Resource Awareness is Imperative to Achieving Sustainable AI (and Preventing a Global AI Arms Race)

链接: https://arxiv.org/abs/2502.20016
作者: Pedram Bakhtiarifard,Pınar Tözün,Christian Igel,Raghavendra Selvan
类目: Machine Learning (cs.LG)
*备注: 19 pages, 6 figures

点击查看摘要

Abstract:Sustainability encompasses three key facets: economic, environmental, and social. However, the nascent discourse that is emerging on sustainable artificial intelligence (AI) has predominantly focused on the environmental sustainability of AI, often neglecting the economic and social aspects. Achieving truly sustainable AI necessitates addressing the tension between its climate awareness and its social sustainability, which hinges on equitable access to AI development resources. The concept of resource awareness advocates for broader access to the infrastructure required to develop AI, fostering equity in AI innovation. Yet, this push for improving accessibility often overlooks the environmental costs of expanding such resource usage. In this position paper, we argue that reconciling climate and resource awareness is essential to realizing the full potential of sustainable AI. We use the framework of base-superstructure to analyze how the material conditions are influencing the current AI discourse. We also introduce the Climate and Resource Aware Machine Learning (CARAML) framework to address this conflict and propose actionable recommendations spanning individual, community, industry, government, and global levels to achieve sustainable AI.

[LG-12] Learning Classifiers That Induce Markets

链接: https://arxiv.org/abs/2502.20012
作者: Yonatan Sommer,Ivri Hikri,Lotan Amit,Nir Rosenfeld
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When learning is used to inform decisions about humans, such as for loans, hiring, or admissions, this can incentivize users to strategically modify their features to obtain positive predictions. A key assumption is that modifications are costly, and are governed by a cost function that is exogenous and predetermined. We challenge this assumption, and assert that the deployment of a classifier is what creates costs. Our idea is simple: when users seek positive predictions, this creates demand for important features; and if features are available for purchase, then a market will form, and competition will give rise to prices. We extend the strategic classification framework to support this notion, and study learning in a setting where a classifier can induce a market for features. We present an analysis of the learning task, devise an algorithm for computing market prices, propose a differentiable learning framework, and conduct experiments to explore our novel setting and approach.

[LG-13] Learning Hamiltonian Density Using DeepONet

链接: https://arxiv.org/abs/2502.19994
作者: Baige Xu,Yusuke Tanaka,Takashi Matsubara,Takaharu Yaguchi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, deep learning for modeling physical phenomena which can be described by partial differential equations (PDEs) has received significant attention. For example, for learning Hamiltonian mechanics, methods based on deep neural networks such as Hamiltonian Neural Networks (HNNs) and their variants have achieved progress. However, existing methods typically depend on the discretization of data, and the determination of required differential operators is often necessary. Instead, in this work, we propose an operator learning approach for modeling wave equations. In particular, we present a method to compute the variational derivatives that are needed to formulate the equations using the automatic differentiation algorithm. The experiments demonstrated that the proposed method is able to learn the operator that defines the Hamiltonian density of waves from data with unspecified discretization, without determining the differential operators.

[LG-14] Dam Volume Prediction Model Development Using ML Algorithms

链接: https://arxiv.org/abs/2502.19989
作者: Hugo Retief,Mariangel Garcia Andarcia,Chris Dickens,Surajit Ghosh
类目: Machine Learning (cs.LG)
*备注: 22 pages, 18 Figures and 4 Tables

点击查看摘要

Abstract:Reliable reservoir volume estimates are crucial for water resource management, especially in arid and semi-arid regions. The present study investigates applying three machine learning regression techniques - Gradient Boosting, Random Forest, and ElasticNet to predict key dam performance characteristics of the Loskop Dam in South Africa. The models were trained and validated on a dataset comprising geospatial elevation measurements paired with corresponding reservoir supply capacity values. The best-performing approach was a threshold-based blended model that combined random forest for higher volumes with Ridge regression for lower volumes. This model achieved an RMSE of 4.88 MCM and an R2 of 0.99. These findings highlight the ability of ensemble learning techniques to capture complex relationships in dam datasets and underscore their practical utility for reliable dam performance modelling in real-world water resource management scenarios.
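The best-performing blended model can be sketched as a simple gate: route low-volume inputs to the Ridge regressor and high-volume inputs to the random forest. The gating threshold and the callables below are stand-ins for the study's fitted models (a hedged sketch, not their code):

```python
def blended_predict(elevation, ridge_model, forest_model, split_elevation):
    # Threshold-based blend: Ridge regression for the low-volume regime,
    # random forest for the high-volume regime. `split_elevation` is a
    # hypothetical gate; `ridge_model` and `forest_model` are callables
    # standing in for fitted regressors.
    model = ridge_model if elevation < split_elevation else forest_model
    return model(elevation)

# Stand-in "models" for illustration only.
ridge = lambda e: 2.0 * e
forest = lambda e: 3.0 * e
print(blended_predict(10.0, ridge, forest, 50.0))  # -> 20.0
print(blended_predict(60.0, ridge, forest, 50.0))  # -> 180.0
```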

[LG-15] WaveGAS: Waveform Relaxation for Scaling Graph Neural Networks

链接: https://arxiv.org/abs/2502.19986
作者: Jana Vatter,Mykhaylo Zayats,Marcos Martínez Galindo,Vanessa López,Ruben Mayer,Hans-Arno Jacobsen,Hoang Thanh Lam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the ever-growing size of real-world graphs, numerous techniques to overcome resource limitations when training Graph Neural Networks (GNNs) have been developed. One such approach, GNNAutoScale (GAS), uses graph partitioning to enable training under constrained GPU memory. GAS also stores historical embedding vectors, which are retrieved from one-hop neighbors in other partitions, ensuring critical information is captured across partition boundaries. The historical embeddings which come from the previous training iteration are stale compared to the GAS estimated embeddings, resulting in approximation errors of the training algorithm. Furthermore, these errors accumulate over multiple layers, leading to suboptimal node embeddings. To address this shortcoming, we propose two enhancements: first, WaveGAS, inspired by waveform relaxation, performs multiple forward passes within GAS before the backward pass, refining the approximation of historical embeddings and gradients to improve accuracy; second, a gradient-tracking method that stores and utilizes more accurate historical gradients during training. Empirical results show that WaveGAS enhances GAS and achieves better accuracy, even outperforming methods that train on full graphs, thanks to its robust estimation of node embeddings.

[LG-16] Efficient Time Series Forecasting via Hyper-Complex Models and Frequency Aggregation

链接: https://arxiv.org/abs/2502.19983
作者: Eyal Yakir,Dor Tsur,Haim Permuter
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures. Still awaiting conference submission approval

点击查看摘要

Abstract:Time series forecasting is a long-standing problem in statistics and machine learning. One of the key challenges is processing sequences with long-range dependencies. To that end, a recent line of work applied the short-time Fourier transform (STFT), which partitions the sequence into multiple subsequences and applies a Fourier transform to each separately. We propose the Frequency Information Aggregation (FIA)-Net, which is based on a novel complex-valued MLP architecture that aggregates adjacent window information in the frequency domain. To further increase the receptive field of the FIA-Net, we treat the set of windows as hyper-complex (HC) valued vectors and employ HC algebra to efficiently combine information from all STFT windows altogether. Using the HC-MLP backbone allows for improved handling of sequences with long-term dependence. Furthermore, due to the nature of HC operations, the HC-MLP uses up to three times fewer parameters than the equivalent standard window aggregation method. We evaluate the FIA-Net on various time-series benchmarks and show that the proposed methodologies outperform existing state-of-the-art methods in terms of both accuracy and efficiency. Our code is publicly available on this https URL.
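
The STFT front end this line of work builds on can be sketched in a few lines: split the series into windows, transform each, and combine adjacent windows' spectra in the frequency domain. The simple neighbour-averaging below is a hedged stand-in for the learned complex-valued aggregation (no overlap, no windowing function; both function names are ours, not the paper's API):

```python
import numpy as np

def stft_windows(x, win):
    """Split a 1-D series into non-overlapping windows and FFT each one."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // win) * win
    frames = x[:n].reshape(-1, win)      # (num_windows, win)
    return np.fft.rfft(frames, axis=1)   # complex spectrum per window

def aggregate_adjacent(spec):
    """Average each window's spectrum with its neighbours -- a stand-in
    for the learned complex-valued MLP aggregation."""
    padded = np.concatenate([spec[:1], spec, spec[-1:]], axis=0)
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

x = np.sin(2 * np.pi * np.arange(64) / 8.0)
spec = stft_windows(x, win=16)           # 4 windows, 9 rfft bins each
agg = aggregate_adjacent(spec)
```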

[LG-17] Can Textual Gradient Work in Federated Learning? ICLR2025

链接: https://arxiv.org/abs/2502.19980
作者: Minghui Chen,Ruinan Jin,Wenlong Deng,Yuanyuan Chen,Zhi Huang,Han Yu,Xiaoxiao Li
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2025

点击查看摘要

Abstract:Recent studies highlight the promise of LLM-based prompt optimization, especially with TextGrad, which automates "differentiation" via texts and backpropagates textual feedback. This approach facilitates training in various real-world applications that do not support numerical gradient propagation or loss calculation. In this paper, we systematically explore the potential and challenges of incorporating textual gradient into Federated Learning (FL). Our contributions are fourfold. Firstly, we introduce a novel FL paradigm, Federated Textual Gradient (FedTextGrad), that allows clients to upload locally optimized prompts derived from textual gradients, while the server aggregates the received prompts. Unlike traditional FL frameworks, which are designed for numerical aggregation, FedTextGrad is specifically tailored for handling textual data, expanding the applicability of FL to a broader range of problems that lack well-defined numerical loss functions. Secondly, building on this design, we conduct extensive experiments to explore the feasibility of FedTextGrad. Our findings highlight the importance of properly tuning key factors (e.g., local steps) in FL training. Thirdly, we highlight a major challenge in FedTextGrad aggregation: retaining essential information from distributed prompt updates. Last but not least, in response to this issue, we improve the vanilla variant of FedTextGrad by providing actionable guidance to the LLM when summarizing client prompts by leveraging the Uniform Information Density principle. Through this principled study, we enable the adoption of textual gradients in FL for optimizing LLMs, identify important issues, and pinpoint future directions, thereby opening up a new research area that warrants further investigation.

[LG-18] Do Sparse Autoencoders Generalize? A Case Study of Answerability

链接: https://arxiv.org/abs/2502.19964
作者: Lovis Heindrich,Philip Torr,Fazl Barez,Veronika Thost
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, and these features can often manifest differently in each context. We examine this through “answerability” - a model’s ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features demonstrate inconsistent transfer ability, and residual stream probes similarly show high variance out of distribution. Overall, this demonstrates the need for quantitative methods to predict feature generalization in SAE-based interpretability.
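
Mechanically, an SAE's feature-extraction step is just a linear map followed by a ReLU, with a bias that pushes most activations to zero. The toy sketch below (random weights and dimensions of our choosing, not the Gemma 2 SAEs studied in the paper) shows how a dense residual-stream vector becomes a mostly-zero feature vector:

```python
import numpy as np

rng = np.random.default_rng(2)

def sae_encode(x, W_enc, b_enc):
    """SAE encoder: linear map + ReLU, yielding sparse feature activations."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

d_model, d_feat = 16, 64                 # toy sizes, chosen for illustration
W_enc = rng.normal(scale=0.2, size=(d_model, d_feat))
b_enc = -0.5 * np.ones(d_feat)           # negative bias encourages sparsity

x = rng.normal(size=d_model)             # stand-in residual-stream activation
feats = sae_encode(x, W_enc, b_enc)
sparsity = float((feats == 0).mean())    # fraction of inactive features
```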

[LG-19] SeisMoLLM : Advancing Seismic Monitoring via Cross-modal Transfer with Pre-trained Large Language Model

链接: https://arxiv.org/abs/2502.19960
作者: Xinghao Wang,Feng Liu,Rui Su,Zhihui Wang,Lei Bai,Wanli Ouyang
类目: Machine Learning (cs.LG)
*备注: 13 pages, 6 figures. Code is available at this https URL

点击查看摘要

Abstract:Recent advances in deep learning have revolutionized seismic monitoring, yet developing a foundation model that performs well across multiple complex tasks remains challenging, particularly when dealing with degraded signals or data scarcity. This work presents SeisMoLLM, the first foundation model that utilizes cross-modal transfer for seismic monitoring, to unleash the power of large-scale pre-training from a large language model without requiring direct pre-training on seismic datasets. Through elaborate waveform tokenization and fine-tuning of pre-trained GPT-2 model, SeisMoLLM achieves state-of-the-art performance on the DiTing and STEAD datasets across five critical tasks: back-azimuth estimation, epicentral distance estimation, magnitude estimation, phase picking, and first-motion polarity classification. It attains 36 best results out of 43 task metrics and 12 top scores out of 16 few-shot generalization metrics, with many relative improvements ranging from 10% to 50%. In addition to its superior performance, SeisMoLLM maintains efficiency comparable to or even better than lightweight models in both training and inference. These findings establish SeisMoLLM as a promising foundation model for practical seismic monitoring and highlight cross-modal transfer as an exciting new direction for earthquake studies, showcasing the potential of advanced deep learning techniques to propel seismology research forward.

[LG-20] Towards Collaborative Anti-Money Laundering Among Financial Institutions WWW

链接: https://arxiv.org/abs/2502.19952
作者: Zhihua Tian,Yuan Ding,Xiang Yu,Enchao Gong,Jian Liu,Kui Ren
类目: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Accepted by International World Wide Web Conference (WWW) 2025

点击查看摘要

Abstract:Money laundering is the process that intends to legalize the income derived from illicit activities, thus facilitating their entry into the monetary flow of the economy without jeopardizing their source. It is crucial to identify such activities accurately and reliably in order to enforce anti-money laundering (AML). Despite considerable efforts devoted to AML, a large number of such activities still go undetected. Rule-based methods were first introduced and are still widely used in current detection systems. With the rise of machine learning, graph-based learning methods have gained prominence in detecting illicit accounts through the analysis of money transfer graphs. Nevertheless, these methods generally assume that the transaction graph is centralized, whereas in practice, money laundering activities usually span multiple financial institutions. Due to regulatory, legal, commercial, and customer privacy concerns, institutions tend not to share data, restricting their utility in practical usage. In this paper, we propose the first algorithm that supports performing AML over multiple institutions while protecting the security and privacy of local data. To evaluate, we construct Alipay-ECB, a real-world dataset comprising digital transactions from Alipay, the world’s largest mobile payment platform, alongside transactions from E-Commerce Bank (ECB). The dataset includes over 200 million accounts and 300 million transactions, covering both intra-institution transactions and those between Alipay and ECB. This makes it the largest real-world transaction graph available for analysis. The experimental results demonstrate that our methods can effectively identify cross-institution money laundering subgroups. Additionally, experiments on synthetic datasets also demonstrate that our method is efficient, requiring only a few minutes on datasets with millions of transactions.

[LG-21] Machine-learning for photoplethysmography analysis: Benchmarking feature image and signal-based approaches

链接: https://arxiv.org/abs/2502.19949
作者: Mohammad Moulaeifard,Loic Coquelin,Mantas Rinkevičius,Andrius Sološenko,Oskar Pfeffer,Ciaran Bench,Nando Hegemann,Sara Vardanega,Manasi Nandi,Jordi Alastruey,Christian Heiss,Vaidotas Marozas,Andrew Thompson,Philip J. Aston,Peter H. Charlton,Nils Strodthoff
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 39 pages, 9 figures, code available at this https URL

点击查看摘要

Abstract:Photoplethysmography (PPG) is a widely used non-invasive physiological sensing technique, suitable for various clinical applications. Such clinical applications are increasingly supported by machine learning methods, raising the question of the most appropriate input representation and model choice. Comprehensive comparisons, in particular across different input representations, are scarce. We address this gap in the research landscape by a comprehensive benchmarking study covering three kinds of input representations - interpretable features, image representations, and raw waveforms - across prototypical regression and classification use cases: blood pressure and atrial fibrillation prediction. In both cases, the best results are achieved by deep neural networks operating on raw time series as input representations. Within this model class, the best results are achieved by modern convolutional neural networks (CNNs), but depending on the task setup, shallow CNNs are often also very competitive. We envision that these results will be insightful for researchers to guide their choice on machine learning tasks for PPG data, even beyond the use cases presented in this work.

[LG-22] Shifting the Paradigm: A Diffeomorphism Between Time Series Data Manifolds for Achieving Shift-Invariancy in Deep Learning ICLR

链接: https://arxiv.org/abs/2502.19921
作者: Berken Utku Demirel,Christian Holz
类目: Machine Learning (cs.LG)
*备注: To appear at the International Conference on Learning Representation (ICLR) 2025

点击查看摘要

Abstract:Deep learning models lack shift invariance, making them sensitive to input shifts that cause changes in output. While recent techniques seek to address this for images, our findings show that these approaches fail to provide shift-invariance in time series, where the data generation mechanism is more challenging due to the interaction of low and high frequencies. Worse, they also decrease performance across several tasks. In this paper, we propose a novel differentiable bijective function that maps samples from their high-dimensional data manifold to another manifold of the same dimension, without any dimensional reduction. Our approach guarantees that samples – when subjected to random shifts – are mapped to a unique point in the manifold while preserving all task-relevant information without loss. We theoretically and empirically demonstrate that the proposed transformation guarantees shift-invariance in deep learning models without imposing any limits to the shift. Our experiments on six time series tasks with state-of-the-art methods show that our approach consistently improves the performance while enabling models to achieve complete shift-invariance without modifying or imposing restrictions on the model’s topology. The source code is available on GitHub: this https URL.
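
As a toy illustration of the kind of guarantee involved, one can map every circular shift of a signal to a single canonical representative. The transform below (rotating the signal so its maximum comes first, assuming a unique maximum) is shift-invariant and information-preserving since it is a pure rotation; it is of course far simpler than the differentiable bijection the paper constructs.

```python
import numpy as np

def canonicalize(x):
    """Map every circular shift of a signal to one canonical representative
    by rotating its maximum to position 0 (toy stand-in, assumes a unique
    maximum)."""
    x = np.asarray(x, dtype=float)
    return np.roll(x, -int(np.argmax(x)))

x = np.array([0.0, 1.0, 3.0, 2.0])
# All four circular shifts of x should map to the same canonical signal.
same = all(np.array_equal(canonicalize(np.roll(x, k)), canonicalize(x))
           for k in range(4))
```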

[LG-23] Playing Pokémon Red via Deep Reinforcement Learning

链接: https://arxiv.org/abs/2502.19920
作者: Marco Pleines,Daniel Addis,David Rubinstein,Frank Zimmer,Mike Preuss,Peter Whidden
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, 3 tables, under review

点击查看摘要

Abstract:Pokémon Red, a classic Game Boy JRPG, presents significant challenges as a testbed for agents, including multi-tasking, long horizons of tens of thousands of steps, hard exploration, and a vast array of potential policies. We introduce a simplistic environment and a Deep Reinforcement Learning (DRL) training methodology, demonstrating a baseline agent that completes an initial segment of the game, up to and including Cerulean City. Our experiments include various ablations that reveal vulnerabilities in reward shaping, where agents exploit specific reward signals. We also discuss limitations and argue that games like Pokémon hold strong potential for future research on Large Language Model agents, hierarchical training algorithms, and advanced exploration methods. Source Code: this https URL

[LG-24] SkipPipe: Partial and Reordered Pipelining Framework for Training LLM s in Heterogeneous Networks

链接: https://arxiv.org/abs/2502.19913
作者: Nikolay Blagoev,Lydia Yiyu Chen,Oğuzhan Ersoy
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Data and pipeline parallelism are ubiquitous for training of Large Language Models (LLM) on distributed nodes. Driven by the need for cost-effective training, recent work explores efficient communication arrangements for end-to-end training. Motivated by LLMs’ resistance to layer skipping and layer reordering, in this paper, we explore stage (several consecutive layers) skipping in pipeline training, and challenge the conventional practice of sequential pipeline execution. We derive convergence and throughput constraints (guidelines) for pipelining with skipping and swapping pipeline stages. Based on these constraints, we propose SkipPipe, the first partial pipeline framework to reduce the end-to-end training time for LLMs while preserving the convergence. The core of SkipPipe is a path scheduling algorithm that optimizes the paths for individual microbatches and reduces idle time (due to microbatch collisions) on the distributed nodes, complying with the given stage skipping ratio. We extensively evaluate SkipPipe on LLaMa models from 500M to 8B parameters on up to 20 nodes. Our results show that SkipPipe reduces training iteration time by up to 55% compared to full pipeline. Our partial pipeline training also improves resistance to layer omission during inference, experiencing a drop in perplexity of only 7% when running only half the model. Our code is available at this https URL.

[LG-25] Community Detection by ELPMeans: An Unsupervised Approach That Uses Laplacian Centrality and Clustering

链接: https://arxiv.org/abs/2502.19895
作者: Shahin Momenzadeh,Rojiar Pir Mohammadiani
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Community detection in network analysis has become more intricate due to the recent growth of social networks (Cai et al., 2024). This paper proposes a new approach named ELPMeans that strives to address this challenge. For community detection in the whole network, ELPMeans combines Laplacian centrality, hierarchical clustering, and k-means algorithms. Our technique employs Laplacian centrality and minimum-distance metrics for central node identification, while k-means learning is used for efficient convergence to the final community structure. Remarkably, ELPMeans is an unsupervised method which is not only simple to implement but also effectively tackles common problems such as the random initialization of central nodes or finding the number of communities (K). Experimental results show that our algorithm improves accuracy and reduces time complexity considerably, outperforming recent approaches on real-world networks. Moreover, our approach has a wide applicability range in various community detection tasks, even with nonconvex shapes and no prior knowledge about the number of communities present.
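
The Laplacian centrality used for central node identification has a standard definition: the relative drop in the graph's Laplacian energy when a node is removed, where the Laplacian energy is the sum of squared Laplacian eigenvalues, trace(L^2) = sum(d_i^2) + 2m for L = D - A. A small sketch from that definition (our illustration, not the authors' code):

```python
import numpy as np

def laplacian_energy(A):
    """Laplacian energy: sum of squared eigenvalues of L = D - A,
    computed via trace(L^2) = sum(d_i^2) + sum(d_i)."""
    d = A.sum(axis=1)
    return float((d ** 2).sum() + d.sum())

def laplacian_centrality(A):
    """Relative drop in Laplacian energy when each node is removed."""
    n = len(A)
    E = laplacian_energy(A)
    scores = np.empty(n)
    for v in range(n):
        keep = [u for u in range(n) if u != v]
        scores[v] = (E - laplacian_energy(A[np.ix_(keep, keep)])) / E
    return scores

# Path graph 0-1-2: the middle node should be the most central.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
scores = laplacian_centrality(A)
```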

[LG-26] A Multiple Transferable Neural Network Method with Domain Decomposition for Elliptic Interface Problems

链接: https://arxiv.org/abs/2502.19893
作者: Tianzheng Lu,Lili Ju,Liyong Zhu
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The transferable neural network (TransNet) is a two-layer shallow neural network with pre-determined and uniformly distributed neurons in the hidden layer, and the least-squares solvers can be particularly used to compute the parameters of its output layer when applied to the solution of partial differential equations. In this paper, we integrate the TransNet technique with the nonoverlapping domain decomposition and the interface conditions to develop a novel multiple transferable neural network (Multi-TransNet) method for solving elliptic interface problems, which typically contain discontinuities in both solutions and their derivatives across interfaces. We first propose an empirical formula for the TransNet to characterize the relationship between the radius of the domain-covering ball, the number of hidden-layer neurons, and the optimal neuron shape. In the Multi-TransNet method, we assign each subdomain one distinct TransNet with an adaptively determined number of hidden-layer neurons to maintain the globally uniform neuron distribution across the entire computational domain, and then unite all the subdomain TransNets together by incorporating the interface condition terms into the loss function. The empirical formula is also extended to the Multi-TransNet and further employed to estimate appropriate neuron shapes for the subdomain TransNets, greatly reducing the parameter tuning cost. Additionally, we propose a normalization approach to adaptively select the weighting parameters for the terms in the loss function. Ablation studies and extensive experiments with comparison tests on different types of elliptic interface problems with low to high contrast diffusion coefficients in two and three dimensions are carried out to numerically demonstrate the superior accuracy, efficiency, and robustness of the proposed Multi-TransNet method.
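
The core TransNet idea the abstract describes - a two-layer shallow network whose hidden layer is pre-determined so that only the linear output layer is solved by least squares - can be sketched directly. In this toy 1-D version the hidden weights are random Gaussians rather than the paper's uniformly distributed neurons, and the function/parameter names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def transnet_fit(x, y, neurons=100):
    """Fix random hidden-layer weights, then solve only the output layer
    by linear least squares (TransNet-style sketch, 1-D inputs)."""
    W = rng.normal(size=(neurons, 1))
    b = rng.uniform(-1, 1, size=neurons)
    H = np.tanh(x[:, None] * W.T + b)        # (n_samples, neurons)
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return W, b, coef

def transnet_predict(x, W, b, coef):
    return np.tanh(x[:, None] * W.T + b) @ coef

x = np.linspace(-1, 1, 200)
y = np.sin(np.pi * x)
W, b, coef = transnet_fit(x, y)
err = np.max(np.abs(transnet_predict(x, W, b, coef) - y))
```

The least-squares solve is the only "training" step, which is what makes the approach cheap enough to instantiate once per subdomain in the Multi-TransNet setting.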

[LG-27] Beyond Worst-Case Dimensionality Reduction for Sparse Vectors ICLR2025

链接: https://arxiv.org/abs/2502.19865
作者: Sandeep Silwal,David P. Woodruff,Qiuyi Zhang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: To appear in ICLR 2025

点击查看摘要

Abstract:We study beyond worst-case dimensionality reduction for s-sparse vectors. Our work is divided into two parts, each focusing on a different facet of beyond worst-case analysis: We first consider average-case guarantees. A folklore upper bound based on the birthday paradox states: for any collection X of s-sparse vectors in R^d, there exists a linear map to R^{O(s^2)} which exactly preserves the norm of 99% of the vectors in X in any ℓ_p norm (as opposed to the usual setting where guarantees hold for all vectors). We give lower bounds showing that this is indeed optimal in many settings: any oblivious linear map satisfying similar average-case guarantees must map to Ω(s^2) dimensions. The same lower bound also holds for a wide class of smooth maps, including “encoder-decoder schemes”, where we compare the norm of the original vector to that of a smooth function of the embedding. These lower bounds reveal a separation result, as an upper bound of O(s log(d)) is possible if we instead use arbitrary (possibly non-smooth) functions, e.g., via compressed sensing algorithms. Given these lower bounds, we specialize to sparse non-negative vectors. For a dataset X of non-negative s-sparse vectors and any p ≥ 1, we can non-linearly embed X into O(s log(|X|s)/ε^2) dimensions while preserving all pairwise distances in ℓ_p norm up to 1 ± ε, with no dependence on p. Surprisingly, the non-negativity assumption enables much smaller embeddings than arbitrary sparse vectors, where the best known bounds suffer exponential dependence. Our map also guarantees exact dimensionality reduction for ℓ_∞ by embedding into O(s log|X|) dimensions, which is tight. We show that both the non-linearity of f and the non-negativity of X are necessary, and provide downstream algorithmic improvements.
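
The folklore birthday-paradox bound has a concrete instantiation: hash each of the d coordinates to one of O(s^2) buckets with a random sign (a CountSketch-style map). Whenever the s nonzero coordinates of a vector land in distinct buckets - which happens for the vast majority of vectors once the number of buckets is a large multiple of s^2 - the ℓ_2 norm is preserved exactly. A small empirical check of that argument (our illustration of the folklore bound, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(1)

def sketch_map(d, m):
    """One random linear map R^d -> R^m: hash each coordinate to a bucket
    with a random sign. If a vector's nonzero coordinates hit distinct
    buckets, its l2 norm is preserved exactly."""
    bucket = rng.integers(0, m, size=d)
    sign = rng.choice([-1.0, 1.0], size=d)
    def apply(x):
        y = np.zeros(m)
        np.add.at(y, bucket, sign * x)   # scatter-add signed coordinates
        return y
    return apply

d, s = 2000, 5
m = 50 * s * s                           # O(s^2) buckets
f = sketch_map(d, m)

trials, preserved = 200, 0
for _ in range(trials):
    x = np.zeros(d)
    idx = rng.choice(d, size=s, replace=False)
    x[idx] = rng.normal(size=s)
    if np.isclose(np.linalg.norm(f(x)), np.linalg.norm(x)):
        preserved += 1
frac = preserved / trials                # expect roughly 99% preserved
```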

[LG-28] IL-SOAR: Imitation Learning with Soft Optimistic Actor cRitic

链接: https://arxiv.org/abs/2502.19859
作者: Stefano Viel,Luca Viano,Volkan Cevher
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces the SOAR framework for imitation learning. SOAR is an algorithmic template that learns a policy from expert demonstrations with a primal-dual style algorithm that alternates cost and policy updates. Within the policy updates, the SOAR framework uses an actor-critic method with multiple critics to estimate the critic uncertainty and build an optimistic critic that is fundamental to driving exploration. When instantiated in the tabular setting, we get a provable algorithm with guarantees that match the best known results in ε. Practically, the SOAR template is shown to consistently boost the performance of imitation learning algorithms based on Soft Actor Critic, such as f-IRL, ML-IRL and CSIL, in several MuJoCo environments. Overall, thanks to SOAR, the required number of episodes to achieve the same performance is reduced by half.
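
The optimistic-critic idea - use the disagreement of a critic ensemble as an exploration bonus - has a common generic form: score each action by the ensemble mean plus a multiple of the ensemble standard deviation. A minimal sketch of that pattern (the `beta` knob and function name are our assumptions, not the paper's exact construction):

```python
import numpy as np

def optimistic_value(critic_values, beta=1.0):
    """Optimism from a critic ensemble: mean plus beta times the ensemble
    standard deviation, so uncertain actions get an exploration bonus."""
    q = np.asarray(critic_values, dtype=float)  # (num_critics, num_actions)
    return q.mean(axis=0) + beta * q.std(axis=0)

# Two actions with equal mean value; action 1 is more uncertain,
# so the optimistic critic prefers it.
q_ensemble = [[1.0, 0.5],
              [1.0, 1.5]]
opt = optimistic_value(q_ensemble)
```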

[LG-29] Revisit the Stability of Vanilla Federated Learning Under Diverse Conditions

链接: https://arxiv.org/abs/2502.19849
作者: Youngjoon Lee,Jinu Gong,Sun Choi,Joonhyuk Kang
类目: Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Federated Learning (FL) is a distributed machine learning paradigm enabling collaborative model training across decentralized clients while preserving data privacy. In this paper, we revisit the stability of the vanilla FedAvg algorithm under diverse conditions. Despite its conceptual simplicity, FedAvg exhibits remarkably stable performance compared to more advanced FL techniques. Our experiments assess the performance of various FL methods on blood cell and skin lesion classification tasks using Vision Transformer (ViT). Additionally, we evaluate the impact of different representative classification models and analyze sensitivity to hyperparameter variations. The results consistently demonstrate that, regardless of dataset, classification model employed, or hyperparameter settings, FedAvg maintains robust performance. Given its stability and robust performance without the need for extensive hyperparameter tuning, FedAvg is a safe and efficient choice for FL deployments in resource-constrained hospitals handling medical data. These findings underscore the enduring value of the vanilla FedAvg approach as a trusted baseline for clinical practice.
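
The vanilla FedAvg aggregation step the paper revisits is simply a dataset-size-weighted average of client parameters, which a few lines of numpy capture:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg server step: average client model parameters weighted by
    local dataset size (minimal sketch; real deployments aggregate full
    parameter trees per round)."""
    sizes = np.asarray(client_sizes, dtype=float)
    w = np.asarray(client_weights, dtype=float)   # (clients, params)
    return (w * (sizes / sizes.sum())[:, None]).sum(axis=0)

# Three clients; the client with more data pulls the average toward it.
avg = fedavg([[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]], client_sizes=[2, 1, 1])
```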

[LG-30] Advancing GDP Forecasting: The Potential of Machine Learning Techniques in Economic Predictions

链接: https://arxiv.org/abs/2502.19807
作者: Bogdan Oancea
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The quest for accurate economic forecasting has traditionally been dominated by econometric models, which most of the time rely on assumptions of linear relationships and stationarity in the data. However, the complex and often nonlinear nature of global economies necessitates the exploration of alternative approaches. Machine learning methods offer promising advantages over traditional econometric techniques for Gross Domestic Product forecasting, given their ability to model complex, nonlinear interactions and patterns without the need for explicit specification of the underlying relationships. This paper investigates the efficacy of Recurrent Neural Networks, specifically LSTM networks, in forecasting GDP. These models are compared against a traditional econometric method, SARIMA. We employ the quarterly Romanian GDP dataset from 1995 to 2023 and build an LSTM network to forecast the next 4 values in the series. Our findings suggest that machine learning models consistently outperform traditional econometric models in terms of predictive accuracy and flexibility.
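
Training an LSTM on a quarterly series requires recasting it as supervised (input window, forecast target) pairs; forecasting the next 4 quarters from a fixed lookback corresponds to a multi-step horizon. A generic preprocessing sketch (our framing of the setup, not the paper's exact pipeline; the `lookback` length is an assumed choice):

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Turn a univariate series into (input window, forecast target) pairs
    for supervised training of a sequence model."""
    series = np.asarray(series, dtype=float)
    X, Y = [], []
    for t in range(len(series) - lookback - horizon + 1):
        X.append(series[t:t + lookback])
        Y.append(series[t + lookback:t + lookback + horizon])
    return np.array(X), np.array(Y)

gdp = np.arange(20.0)                    # stand-in for quarterly GDP values
X, Y = make_windows(gdp, lookback=8, horizon=4)
```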

[LG-31] ServoLNN: Lagrangian Neural Networks Driven by Servomechanisms

链接: https://arxiv.org/abs/2502.19802
作者: Brandon Johns,Zhuomin Zhou,Elahe Abdi
类目: Machine Learning (cs.LG); Robotics (cs.RO); Dynamical Systems (math.DS)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:Combining deep learning with classical physics facilitates the efficient creation of accurate dynamical models. In a recent class of neural network, Lagrangian mechanics is hard-coded into the architecture, and training the network learns the given system. However, the current architectures do not facilitate the modelling of dynamical systems that are driven by servomechanisms (e.g. servomotors, stepper motors, current sources, volumetric pumps). This article presents ServoLNN, a new architecture to model dynamical systems that are driven by servomechanisms. ServoLNN is suitable for use in real-time applications, where the driving motion is known only just-in-time. A PyTorch implementation of ServoLNN is provided. The derivations and results reveal the occurrence of a possible family of solutions that the training may converge on. The effect of the family of solutions on the predicted physical quantities is explored, as is the resolution to reduce the family of solutions to a single solution. As a result, the architecture can simultaneously and accurately find the energies, power, rate of work, mass matrix, generalised accelerations, generalised forces, and the generalised forces that drive the servomechanisms.

[LG-32] In-Context Learning with Hypothesis-Class Guidance

链接: https://arxiv.org/abs/2502.19787
作者: Ziqian Lin,Shubham Kumar Bharti,Kangwook Lee
类目: Machine Learning (cs.LG)
*备注: 19 pages, 18 figures

点击查看摘要

Abstract:Recent research has investigated the underlying mechanisms of in-context learning (ICL) both theoretically and empirically, often using data generated from simple function classes. However, the existing work often focuses on the sequence consisting solely of labeled examples, while in practice, labeled examples are typically accompanied by an instruction, providing some side information about the task. In this work, we propose ICL with hypothesis-class guidance (ICL-HCG), a novel synthetic data model for ICL where the input context consists of the literal description of a (finite) hypothesis class H and (x, y) pairs from a hypothesis chosen from H. Under our framework ICL-HCG, we conduct extensive experiments to explore: (i) a variety of generalization abilities to new hypothesis classes; (ii) different model architectures; (iii) sample complexity; (iv) in-context data imbalance; (v) the role of instruction; and (vi) the effect of pretraining hypothesis diversity. As a result, we show that (a) Transformers can successfully learn ICL-HCG and generalize to unseen hypotheses and unseen hypothesis classes, and (b) compared with ICL without instruction, ICL-HCG achieves significantly higher accuracy, demonstrating the role of instructions.
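
A context in this data model pairs a literal description of the hypothesis class with labeled examples from one chosen hypothesis. A sketch of what such a prompt could look like (the textual format below is purely illustrative; the paper defines its own encoding):

```python
def build_context(hypotheses, pairs):
    """Assemble an ICL-HCG-style context: the hypothesis class description
    followed by (x, y) examples from one chosen hypothesis."""
    lines = ["Hypothesis class:"]
    lines += [f"  h{i}: {h}" for i, h in enumerate(hypotheses)]
    lines += ["Examples:"]
    lines += [f"  {x} -> {y}" for x, y in pairs]
    return "\n".join(lines)

# Two candidate hypotheses over binary inputs; the examples are consistent
# with h0 (y = x), so a model should infer h0 in context.
ctx = build_context(["y = x", "y = 1 - x"], [(0, 0), (1, 1)])
```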

[LG-33] SCU: An Efficient Machine Unlearning Scheme for Deep Learning Enabled Semantic Communications

链接: https://arxiv.org/abs/2502.19785
作者: Weiqi Wang,Zhiyi Tian,Chenhan Zhang,Shui Yu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning (DL) enabled semantic communications leverage DL to train encoders and decoders (codecs) to extract and recover semantic information. However, most semantic training datasets contain personal private information. Such concerns impose strict requirements for erasing specified data from semantic codecs when previous users wish to remove their data from the semantic system. Existing machine unlearning solutions remove data contribution from trained models, yet usually in supervised sole model scenarios. These methods are infeasible in semantic communications that often need to jointly train unsupervised encoders and decoders. In this paper, we investigate the unlearning problem in DL-enabled semantic communications and propose a semantic communication unlearning (SCU) scheme to tackle the problem. SCU includes two key components. Firstly, we customize the joint unlearning method for semantic codecs, including the encoder and decoder, by minimizing mutual information between the learned semantic representation and the erased samples. Secondly, to compensate for semantic model utility degradation caused by unlearning, we propose a contrastive compensation method, which considers the erased data as the negative samples and the remaining data as the positive samples to retrain the unlearned semantic models contrastively. Theoretical analysis and extensive experimental results on three representative datasets demonstrate the effectiveness and efficiency of our proposed methods.

[LG-34] APE: Tailored Posterior Difference for Auditing of Machine Unlearning

链接: https://arxiv.org/abs/2502.19770
作者: Weiqi Wang,Zhiyi Tian,An Liu,Shui Yu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the increasing prevalence of Web-based platforms handling vast amounts of user data, machine unlearning has emerged as a crucial mechanism to uphold users’ right to be forgotten, enabling individuals to request the removal of their specified data from trained models. However, the auditing of machine unlearning processes remains significantly underexplored. Although some existing methods offer unlearning auditing by leveraging backdoors, these backdoor-based approaches are inefficient and impractical, as they necessitate involvement in the initial model training process to embed the backdoors. In this paper, we propose a TAilored Posterior diffErence (TAPE) method to provide unlearning auditing independently of original model training. We observe that the process of machine unlearning inherently introduces changes in the model, which contains information related to the erased data. TAPE leverages unlearning model differences to assess how much information has been removed through the unlearning operation. Firstly, TAPE mimics the unlearned posterior differences by quickly building unlearned shadow models based on first-order influence estimation. Secondly, we train a Reconstructor model to extract and evaluate the private information of the unlearned posterior differences to audit unlearning. Existing privacy reconstructing methods based on posterior differences are only feasible for model updates of a single sample. To enable the reconstruction effective for multi-sample unlearning requests, we propose two strategies, unlearned data perturbation and unlearned influence-based division, to augment the posterior difference. Extensive experimental results indicate the significant superiority of TAPE over state-of-the-art unlearning verification methods, with at least a 4.5× efficiency speedup and support for auditing a broader range of unlearning scenarios.

[LG-35] HALO: Robust Out-of-Distribution Detection via Joint Optimisation

链接: https://arxiv.org/abs/2502.19755
作者: Hugo Lyons Keenan,Sarah Erfani,Christopher Leckie
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: SaTML 2025

点击查看摘要

Abstract:Effective out-of-distribution (OOD) detection is crucial for the safe deployment of machine learning models in real-world scenarios. However, recent work has shown that OOD detection methods are vulnerable to adversarial attacks, potentially leading to critical failures in high-stakes applications. This discovery has motivated work on robust OOD detection methods that are capable of maintaining performance under various attack settings. Prior approaches have made progress on this problem but face a number of limitations: often only exhibiting robustness to attacks on OOD data or failing to maintain strong clean performance. In this work, we adapt an existing robust classification framework, TRADES, extending it to the problem of robust OOD detection and discovering a novel objective function. Recognising the critical importance of a strong clean/robust trade-off for OOD detection, we introduce an additional loss term which boosts classification and detection performance. Our approach, called HALO (Helper-based AdversariaL OOD detection), surpasses existing methods and achieves state-of-the-art performance across a number of datasets and attack settings. Extensive experiments demonstrate an average AUROC improvement of 3.15 in clean settings and 7.07 under adversarial attacks when compared to the next best method. Furthermore, HALO exhibits resistance to transferred attacks, offers tuneable performance through hyperparameter selection, and is compatible with existing OOD detection frameworks out-of-the-box, leaving open the possibility of future performance gains. Code is available at: this https URL
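HALO’s gains above are reported in AUROC, which has a simple probabilistic reading: the chance that a randomly drawn OOD sample receives a higher detection score than a randomly drawn in-distribution sample. A minimal, self-contained sketch of that computation (illustrative only, not code from the paper):

```python
def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample outscores a random
    in-distribution sample (ties count half). Scores are assumed to
    increase with 'OOD-ness'; higher AUROC means better detection."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# A detector that perfectly separates the two sets reaches 1.0;
# a coin-flip detector hovers around 0.5.
perfect = auroc([0.1, 0.2, 0.3], [0.8, 0.9, 1.0])
```

Under this reading, the reported “3.15 AUROC improvement” means roughly a 3-percentage-point higher chance of ranking an OOD sample above an in-distribution one.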

[LG-36] BiRating – Iterative averaging on a bipartite graph of Beat Saber scores, player skills and map difficulties

链接: https://arxiv.org/abs/2502.19742
作者: Juan Casanova
类目: Machine Learning (cs.LG)
*备注: 30 pages, 2 figures

点击查看摘要

Abstract:Difficulty estimation of Beat Saber maps is an interesting data analysis problem and valuable to the Beat Saber competitive scene. We present a simple algorithm that iteratively averages player skill and map difficulty estimations in a bipartite graph of players and maps, connected by scores, using scores only as input. This approach simultaneously estimates player skills and map difficulties, exploiting each to improve the estimation of the other, and exploiting the relations among multiple scores by different players on the same map, or on different maps by the same player. While we have been unable to prove or characterize theoretical convergence, the implementation exhibits convergent behaviour to low estimation error in all instances, producing accurate results. An informal qualitative evaluation involving experienced Beat Saber community members was carried out, comparing the difficulty estimations output by our algorithm with their personal perspectives on the difficulties of different maps. There was significant alignment with players’ perceptions of difficulty and with other existing methods for estimating difficulty. Our approach showed significant improvement over existing methods on certain known problematic maps that are not typically estimated accurately, but it also produces problematic estimations for certain families of maps where the assumptions on the meaning of scores were inadequate (e.g. not enough scores, or scores over-optimized by players). The algorithm has important limitations, related to data quality and meaningfulness, assumptions on the domain problem, and theoretical convergence of the algorithm. Future work would significantly benefit from a better understanding of adequate ways to quantify map difficulty in Beat Saber, including the multidimensionality of skill and difficulty, and the systematic biases present in score data.
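The core iteration is easy to sketch. Assuming the toy model score ≈ skill − difficulty (the abstract does not state the exact update rule, so this is only an illustrative stand-in), alternating averages over the bipartite player–map graph converge quickly on consistent data:

```python
def birating(scores, n_iter=100):
    """Jointly estimate player skills and map difficulties from a dict
    {(player, map): score}, under the hypothetical model
    score ≈ skill - difficulty. Each pass averages across the bipartite
    graph: skills from difficulty-adjusted scores, then difficulties
    from skill-adjusted scores."""
    players = {p for p, m in scores}
    maps = {m for p, m in scores}
    skill = {p: 0.0 for p in players}
    diff = {m: 0.0 for m in maps}
    for _ in range(n_iter):
        for p in players:
            vals = [s + diff[m] for (q, m), s in scores.items() if q == p]
            skill[p] = sum(vals) / len(vals)
        for m in maps:
            vals = [skill[p] - s for (p, n), s in scores.items() if n == m]
            diff[m] = sum(vals) / len(vals)
    return skill, diff

# Two players, two maps: 'a' outscores 'b' everywhere, and 'hard'
# yields lower scores than 'easy' for both players.
scores = {('a', 'easy'): 90, ('a', 'hard'): 70,
          ('b', 'easy'): 60, ('b', 'hard'): 40}
skill, diff = birating(scores)
```

On this consistent toy data the iteration recovers the 30-point skill gap and the 20-point difficulty gap exactly, up to an additive gauge constant (only differences are identified, not absolute levels).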

[LG-37] Causal Effect Estimation under Networked Interference without Networked Unconfoundedness Assumption

链接: https://arxiv.org/abs/2502.19741
作者: Weilin Chen,Ruichu Cai,Jie Qiao,Yuguang Yan,José Miguel Hernández-Lobato
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2405.03342

点击查看摘要

Abstract:Estimating causal effects under networked interference is a crucial yet challenging problem. Existing methods based on observational data mainly rely on the networked unconfoundedness assumption, which guarantees the identification of networked effects. However, the networked unconfoundedness assumption is usually violated due to the latent confounders in observational data, hindering the identification of networked effects. Interestingly, in such networked settings, interactions between units provide valuable information for recovering latent confounders. In this paper, we identify three types of latent confounders in networked inference that hinder identification: those affecting only the individual, those affecting only neighbors, and those influencing both. Specifically, we devise a networked effect estimator based on identifiable representation learning techniques. Theoretically, we establish the identifiability of all latent confounders, and leveraging the identified latent confounders, we provide the networked effect identification result. Extensive experiments validate our theoretical results and demonstrate the effectiveness of the proposed method.

[LG-38] Bridging the PLC Binary Analysis Gap: A Cross-Compiler Dataset and Neural Framework for Industrial Control Systems

链接: https://arxiv.org/abs/2502.19725
作者: Yonatan Gizachew Achamyeleh,Shih-Yuan Yu,Gustavo Quirós Araya,Mohammad Abdullah Al Faruque
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Industrial Control Systems (ICS) rely heavily on Programmable Logic Controllers (PLCs) to manage critical infrastructure, yet analyzing PLC executables remains challenging due to diverse proprietary compilers and limited access to source code. To bridge this gap, we introduce PLC-BEAD, a comprehensive dataset containing 2431 compiled binaries from 700+ PLC programs across four major industrial compilers (CoDeSys, GEB, OpenPLC-V2, OpenPLC-V3). This novel dataset uniquely pairs each binary with its original Structured Text source code and standardized functionality labels, enabling both binary-level and source-level analysis. We demonstrate the dataset’s utility through PLCEmbed, a transformer-based framework for binary code analysis that achieves 93% accuracy in compiler provenance identification and 42% accuracy in fine-grained functionality classification across 22 industrial control categories. Through comprehensive ablation studies, we analyze how compiler optimization levels, code patterns, and class distributions influence model performance. We provide detailed documentation of the dataset creation process, labeling taxonomy, and benchmark protocols to ensure reproducibility. Both PLC-BEAD and PLCEmbed are released as open-source resources to foster research in PLC security, reverse engineering, and ICS forensics, establishing new baselines for data-driven approaches to industrial cybersecurity.

[LG-39] Training Robust Graph Neural Networks by Modeling Noise Dependencies

链接: https://arxiv.org/abs/2502.19670
作者: Yeonjun In,Kanghoon Yoon,Sukwon Yun,Kibum Kim,Sungchul Kim,Chanyoung Park
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Work in progress

点击查看摘要

Abstract:In real-world applications, node features in graphs often contain noise from various sources, leading to significant performance degradation in GNNs. Although several methods have been developed to enhance robustness, they rely on the unrealistic assumption that noise in node features is independent of the graph structure and node labels, thereby limiting their applicability. To this end, we introduce a more realistic noise scenario, dependency-aware noise on graphs (DANG), where noise in node features creates a chain of noise dependencies that propagates to the graph structure and node labels. We propose a novel robust GNN, DA-GNN, which captures the causal relationships among variables in the data generating process (DGP) of DANG using variational inference. In addition, we present new benchmark datasets that simulate DANG in real-world applications, enabling more practical research on robust GNNs. Extensive experiments demonstrate that DA-GNN consistently outperforms existing baselines across various noise scenarios, including both DANG and conventional noise models commonly considered in this field.

[LG-40] Out-of-distribution Generalization for Total Variation based Invariant Risk Minimization ICLR2025

链接: https://arxiv.org/abs/2502.19665
作者: Yuanchao Wang,Zhao-Rong Lai,Tianqi Zhong
类目: Machine Learning (cs.LG)
*备注: ICLR 2025

点击查看摘要

Abstract:Invariant risk minimization is an important general machine learning framework that has recently been interpreted as a total variation model (IRM-TV). However, how to improve out-of-distribution (OOD) generalization in the IRM-TV setting remains unsolved. In this paper, we extend IRM-TV to a Lagrangian multiplier model named OOD-TV-IRM. We find that the autonomous TV penalty hyperparameter is exactly the Lagrangian multiplier. Thus OOD-TV-IRM is essentially a primal-dual optimization model, where the primal optimization minimizes the entire invariant risk and the dual optimization strengthens the TV penalty. The objective is to reach a semi-Nash equilibrium where the balance between the training loss and OOD generalization is maintained. We also develop a convergent primal-dual algorithm that facilitates an adversarial learning scheme. Experimental results show that OOD-TV-IRM outperforms IRM-TV in most situations.

[LG-41] Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation

链接: https://arxiv.org/abs/2502.19657
作者: Pavel Rumiantsev,Mark Coates
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Neural Architecture Search (NAS) is a powerful automatic alternative to manual design of a neural network. In the zero-shot version, a fast ranking function is used to compare architectures without training them. The outputs of the ranking functions often vary significantly due to different sources of randomness, including the evaluated architecture’s weights’ initialization or the batch of data used for calculations. A common approach to addressing the variation is to average a ranking function output over several evaluations. We propose taking into account the variation in a different manner, by viewing the ranking function output as a random variable representing a proxy performance metric. During the search process, we strive to construct a stochastic ordering of the performance metrics to determine the best architecture. Our experiments show that the proposed stochastic ordering can effectively boost performance of a search on standard benchmark search spaces.
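The shift from averaging to ordering can be illustrated with a toy statistic: instead of comparing mean proxy scores, compare the empirical probability that one architecture’s noisy ranking-function output exceeds the other’s. (The paper constructs a stochastic ordering inside the search process; the pairwise statistic below is only a hypothetical illustration of treating the proxy as a random variable.)

```python
import random

def prob_better(samples_a, samples_b):
    """Empirical P(score_A > score_B) over all pairs of noisy
    ranking-function evaluations -- treating the zero-cost proxy as a
    random variable instead of collapsing it to its mean."""
    wins = sum(1 for a in samples_a for b in samples_b if a > b)
    return wins / (len(samples_a) * len(samples_b))

random.seed(0)
# Hypothetical proxy scores for two architectures, re-evaluated under
# different weight initialisations / data batches.
arch_a = [1.0 + random.gauss(0, 0.3) for _ in range(50)]
arch_b = [0.8 + random.gauss(0, 0.3) for _ in range(50)]

mean_gap = sum(arch_a) / len(arch_a) - sum(arch_b) / len(arch_b)  # averaging view
p = prob_better(arch_a, arch_b)                                   # ordering view
```

The ordering view also reports *how confidently* A beats B (a probability near 0.5 signals that the proxy variation swamps the true gap), which a single averaged number hides.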

[LG-42] Unlocking Multi-Modal Potentials for Dynamic Text-Attributed Graph Representation

链接: https://arxiv.org/abs/2502.19651
作者: Yuanyuan Xu,Wenjie Zhang,Ying Zhang,Xuemin Lin,Xiwei Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal edges alongside rich textual attributes. A prior approach to representing DyTAGs leverages pre-trained language models to encode text attributes and subsequently integrates them into dynamic graph models. However, it follows edge-centric modeling, as in dynamic graph learning, which is limited in local structures and fails to exploit the unique characteristics of DyTAGs, leading to suboptimal performance. We observe that DyTAGs inherently comprise three distinct modalities-temporal, textual, and structural-often exhibiting dispersed or even orthogonal distributions, with the first two largely overlooked in existing research. Building on this insight, we propose MoMent, a model-agnostic multi-modal framework that can seamlessly integrate with dynamic graph models for structural modality learning. The core idea is to shift from edge-centric to node-centric modeling, fully leveraging three modalities for node representation. Specifically, MoMent presents non-shared node-centric encoders based on the attention mechanism to capture global temporal and semantic contexts from temporal and textual modalities, together with local structure learning, thus generating modality-specific tokens. To prevent disjoint latent space, we propose a symmetric alignment loss, an auxiliary objective that aligns temporal and textual tokens, ensuring global temporal-semantic consistency with a theoretical guarantee. Last, we design a lightweight adaptor to fuse these tokens, generating comprehensive and cohesive node representations. We theoretically demonstrate that MoMent enhances discriminative power over exclusive edge-centric modeling. Extensive experiments across seven datasets and two downstream tasks show that MoMent achieves up to 33.62% improvement against the baseline using four dynamic graph models.

[LG-43] Taxonomy, Opportunities and Challenges of Representation Engineering for Large Language Models

链接: https://arxiv.org/abs/2502.19649
作者: Jan Wehner,Sahar Abdelnabi,Daniel Tan,David Krueger,Mario Fritz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model’s internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models’ behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models’ performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

[LG-44] cMIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning

链接: https://arxiv.org/abs/2502.19642
作者: Micha Livne
类目: Machine Learning (cs.LG)
*备注: A working draft

点击查看摘要

Abstract:Learning representations that are useful for unknown downstream tasks is a fundamental challenge in representation learning. Prominent approaches in this domain include contrastive learning, self-supervised masking, and denoising auto-encoders. In this paper, we introduce a novel method, termed contrastive Mutual Information Machine (cMIM), which aims to enhance the utility of learned representations for downstream tasks. cMIM integrates a new contrastive learning loss with the Mutual Information Machine (MIM) learning framework, a probabilistic auto-encoder that maximizes the mutual information between inputs and latent representations while clustering the latent codes. Despite MIM’s potential, initial experiments indicated that the representations learned by MIM were less effective for discriminative downstream tasks compared to state-of-the-art (SOTA) models. The proposed cMIM method directly addresses this limitation. The main contributions of this work are twofold: (1) We propose a novel contrastive extension to MIM for learning discriminative representations which eliminates the need for data augmentation and is robust to variations in the number of negative examples (i.e., batch size). (2) We introduce a generic method for extracting informative embeddings from encoder-decoder models, which significantly improves performance in discriminative downstream tasks without requiring additional training. This method is applicable to any pre-trained encoder-decoder model. By presenting cMIM, we aim to offer a unified generative model that is effective for both generative and discriminative tasks. Our results demonstrate that the learned representations are valuable for downstream tasks while maintaining the generative capabilities of MIM. 

[LG-45] Developing robust methods to handle missing data in real-world applications effectively ECML-PKDD 2024

链接: https://arxiv.org/abs/2502.19635
作者: Youran Zhou,Mohamed Reda Bouadjenek,Sunil Arya
类目: Machine Learning (cs.LG)
*备注: This work was presented at the ECML PKDD 2024 PhD Forum. https://ecmlpkdd.org/2024/program-accepted-phd-forum/

点击查看摘要

Abstract:Missing data is a pervasive challenge spanning diverse data types, including tabular, sensor data, time-series, images and so on. Its origins are multifaceted, resulting in various missing mechanisms. Prior research in this field has predominantly revolved around the assumption of the Missing Completely At Random (MCAR) mechanism. However, Missing At Random (MAR) and Missing Not At Random (MNAR) mechanisms, though equally prevalent, have often remained underexplored despite their significant influence. This PhD project presents a comprehensive research agenda designed to investigate the implications of diverse missing data mechanisms. The principal aim is to devise robust methodologies capable of effectively handling missing data while accommodating the unique characteristics of MCAR, MAR, and MNAR mechanisms. By addressing these gaps, this research contributes to an enriched understanding of the challenges posed by missing data across various industries and data modalities. It seeks to provide practical solutions that enable the effective management of missing data, empowering researchers and practitioners to leverage incomplete datasets confidently.
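The three mechanisms are easy to contrast in simulation. In this hypothetical sketch, y is masked (i) purely at random (MCAR), (ii) based on the observed x (MAR), or (iii) based on y itself (MNAR); only the last visibly biases the observed mean of y here:

```python
import random

random.seed(1)
# x and y are independent standard normals in this toy example.
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]

def mask_mcar(data, p=0.3):
    """MCAR: y dropped with fixed probability, independent of everything."""
    return [(x, None if random.random() < p else y) for x, y in data]

def mask_mar(data):
    """MAR: missingness of y depends only on the observed x."""
    return [(x, None if x > 0 and random.random() < 0.6 else y) for x, y in data]

def mask_mnar(data):
    """MNAR: missingness of y depends on the unobserved value of y itself."""
    return [(x, None if y > 0 and random.random() < 0.6 else y) for x, y in data]

def observed_mean(masked):
    obs = [y for _, y in masked if y is not None]
    return sum(obs) / len(obs)

m_mcar = observed_mean(mask_mcar(data))  # roughly unbiased (true mean ~ 0)
m_mar = observed_mean(mask_mar(data))    # unbiased here only because x ⊥ y
m_mnar = observed_mean(mask_mnar(data))  # biased downward: large y preferentially dropped
```

Note that MAR looks harmless in this toy only because x and y are independent; when they correlate, complete-case analysis under MAR is biased too, which is exactly why methods tailored to each mechanism are needed.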

[LG-46] Treatment Non-Adherence Bias in Clinical Machine Learning: A Real-World Study on Hypertension Medication

链接: https://arxiv.org/abs/2502.19625
作者: Zhongyuan Liang,Arvind Suresh,Irene Y. Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning systems trained on electronic health records (EHRs) increasingly guide treatment decisions, but their reliability depends on the critical assumption that patients follow the prescribed treatments recorded in EHRs. Using EHR data from 3,623 hypertension patients, we investigate how treatment non-adherence introduces implicit bias that can fundamentally distort both causal inference and predictive modeling. By extracting patient adherence information from clinical notes using a large language model, we identify 786 patients (21.7%) with medication non-adherence. We further uncover key demographic and clinical factors associated with non-adherence, as well as patient-reported reasons including side effects and difficulties obtaining refills. Our findings demonstrate that this implicit bias can not only reverse estimated treatment effects, but also degrade model performance by up to 5% while disproportionately affecting vulnerable populations by exacerbating disparities in decision outcomes and model error rates. This highlights the importance of accounting for treatment non-adherence in developing responsible and equitable clinical machine learning systems.
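The attenuation effect is easy to reproduce in a toy simulation (all numbers below are hypothetical, not from the study): when a fifth of patients recorded as treated never take the drug, the naive EHR-based effect estimate shrinks toward zero.

```python
import random

random.seed(2)

def simulate(n=20000, effect=-10.0, adherence=0.8):
    """Each patient's EHR records 'treated', but only a fraction actually
    take the medication; outcome = baseline + effect if the drug is taken.
    Returns the naive treated-vs-control difference in means.
    (Toy illustration of the bias the paper studies, not its method.)"""
    treated, control = [], []
    for _ in range(n):
        baseline = random.gauss(140, 10)        # e.g. systolic blood pressure
        if random.random() < 0.5:               # recorded as treated in the EHR
            took = random.random() < adherence  # actual adherence is unobserved
            treated.append(baseline + (effect if took else 0.0))
        else:
            control.append(baseline)
    return sum(treated) / len(treated) - sum(control) / len(control)

naive = simulate(adherence=0.8)  # ~ 0.8 * (-10) = -8: attenuated toward zero
full = simulate(adherence=1.0)   # ~ -10: the true effect
```

The naive estimate recovers roughly adherence × effect, which is how non-adherence can distort causal conclusions even before any demographic structure in who adheres is taken into account.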

[LG-47] Analyzing Cost-Sensitive Surrogate Losses via $\mathcal{H}$-calibration

链接: https://arxiv.org/abs/2502.19522
作者: Sanket Shah,Milind Tambe,Jessie Finocchiaro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper aims to understand whether machine learning models should be trained using cost-sensitive surrogates or cost-agnostic ones (e.g., cross-entropy). Analyzing this question through the lens of $\mathcal{H}$-calibration, we find that cost-sensitive surrogates can strictly outperform their cost-agnostic counterparts when learning small models under common distributional assumptions. Since these distributional assumptions are hard to verify in practice, we also show that cost-sensitive surrogates consistently outperform cost-agnostic surrogates on classification datasets from the UCI repository. Together, these make a strong case for using cost-sensitive surrogates in practice.
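A minimal sketch of what “cost-sensitive surrogate” means here (with hypothetical asymmetric costs; the paper analyses such losses through calibration theory rather than prescribing this particular one): weighting the two error types in the cross-entropy shifts the loss-minimising prediction away from the empirical class rate.

```python
import math

def cost_weighted_logloss(p, y, c_fp=1.0, c_fn=5.0):
    """Cross-entropy with per-error costs: false negatives weighted c_fn,
    false positives c_fp. Setting c_fp = c_fn = 1 recovers the ordinary
    cost-agnostic cross-entropy."""
    eps = 1e-12
    if y == 1:
        return -c_fn * math.log(max(p, eps))
    return -c_fp * math.log(max(1.0 - p, eps))

def best_constant(ys, c_fp, c_fn):
    """Grid-search the constant prediction p minimising the average loss.
    The exact minimiser is c_fn*n1 / (c_fn*n1 + c_fp*n0)."""
    grid = [i / 1000 for i in range(1, 1000)]
    return min(grid, key=lambda p: sum(cost_weighted_logloss(p, y, c_fp, c_fn)
                                       for y in ys))

ys = [1] * 30 + [0] * 70                   # 30% positives
p_agnostic = best_constant(ys, 1.0, 1.0)   # ~ 0.30, the empirical rate
p_sensitive = best_constant(ys, 1.0, 5.0)  # ~ 0.68: missing a positive is costly
```

Whether training against such a weighted surrogate actually beats reweighting at prediction time is precisely the question the paper resolves for restricted hypothesis classes.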

[LG-48] TRIX: A More Expressive Model for Zero-shot Domain Transfer in Knowledge Graphs

链接: https://arxiv.org/abs/2502.19512
作者: Yucheng Zhang,Beatrice Bevilacqua,Mikhail Galkin,Bruno Ribeiro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fully inductive knowledge graph models can be trained on multiple domains and subsequently perform zero-shot knowledge graph completion (KGC) in new unseen domains. This is an important capability towards the goal of having foundation models for knowledge graphs. In this work, we introduce a more expressive and capable fully inductive model, dubbed TRIX, which not only yields strictly more expressive triplet embeddings (head entity, relation, tail entity) compared to state-of-the-art methods, but also introduces a new capability: directly handling both entity and relation prediction tasks in inductive settings. Empirically, we show that TRIX outperforms the state-of-the-art fully inductive models in zero-shot entity and relation predictions in new domains, and outperforms large-context LLMs in out-of-domain predictions. The source code is available at this https URL.

[LG-49] Models That Are Interpretable But Not Transparent AISTATS2025

链接: https://arxiv.org/abs/2502.19502
作者: Chudi Zhong,Panyu Chen,Cynthia Rudin
类目: Machine Learning (cs.LG)
*备注: AISTATS 2025

点击查看摘要

Abstract:Faithful explanations are essential for machine learning models in high-stakes applications. Inherently interpretable models are well-suited for these applications because they naturally provide faithful explanations by revealing their decision logic. However, model designers often need to keep these models proprietary to maintain their value. This creates a tension: we need models that are interpretable–allowing human decision-makers to understand and justify predictions, but not transparent, so that the model’s decision boundary is not easily replicated by attackers. Shielding the model’s decision boundary is particularly challenging alongside the requirement of completely faithful explanations, since such explanations reveal the true logic of the model for an entire subspace around each query point. This work provides an approach, FaithfulDefense, that creates model explanations for logical models that are completely faithful, yet reveal as little as possible about the decision boundary. FaithfulDefense is based on a maximum set cover formulation, and we provide multiple formulations for it, taking advantage of submodularity.
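The maximum-set-cover core mentioned above has a textbook greedy subroutine whose submodularity yields the usual (1 − 1/e) approximation guarantee; a generic sketch of that subroutine (not the paper’s full explanation-generation method):

```python
def greedy_max_cover(sets, k):
    """Classic greedy for maximum coverage: repeatedly pick the set that
    covers the most still-uncovered elements, up to k picks. Submodularity
    of coverage gives the (1 - 1/e) approximation guarantee."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(sets, key=lambda s: len(s - covered))
        if not best - covered:   # nothing new to cover
            break
        chosen.append(best)
        covered |= best
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 5}]
chosen, covered = greedy_max_cover(sets, k=2)  # covers all of 1..7 with 2 sets
```

In FaithfulDefense the objects being covered are queries and the sets correspond to candidate explanations, with the twist that coverage is to be *minimised* information-wise while staying faithful; the greedy machinery above is only the underlying combinatorial tool.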

[LG-50] On the Interpolation Effect of Score Smoothing

链接: https://arxiv.org/abs/2502.19499
作者: Zhengdao Chen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we examine the hypothesis that their generalization ability arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional linear subspace, we study the interplay between score smoothing and the denoising dynamics with mathematically solvable models. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate among the training data within their subspace while avoiding full memorization. We also present evidence that learning score functions with regularized neural networks can have a similar effect on the denoising dynamics as score smoothing.
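The one-dimensional picture can be reproduced in a few lines: take the score of a Gaussian-smoothed empirical distribution over two training points, then run gradient ascent as a crude stand-in for the denoising dynamics. With little smoothing the dynamics memorise a training point; with heavy smoothing they settle between the points, i.e., interpolate. (An illustrative sketch, not the paper’s solvable model.)

```python
import math

def smoothed_score(x, data, sigma):
    """Score of the Gaussian-smoothed empirical distribution:
    grad log p_sigma(x) = sum_i w_i(x) * (x_i - x) / sigma^2,
    with softmax weights w_i over negative squared distances."""
    logits = [-(x - xi) ** 2 / (2 * sigma ** 2) for xi in data]
    mx = max(logits)
    ws = [math.exp(l - mx) for l in logits]
    z = sum(ws)
    return sum(w / z * (xi - x) for w, xi in zip(ws, data)) / sigma ** 2

def denoise(x0, data, sigma, steps=4000, lr=0.01):
    """Gradient ascent on log p_sigma: a crude stand-in for the reverse
    (denoising) dynamics discussed in the paper."""
    x = x0
    for _ in range(steps):
        x += lr * smoothed_score(x, data, sigma)
    return x

data = [-1.0, 1.0]
memorized = denoise(0.3, data, sigma=0.1)     # small smoothing: snaps to a training point
interpolated = denoise(0.3, data, sigma=2.0)  # heavy smoothing: settles between them
```

The smoothing width sigma plays the role of the paper’s score-smoothing strength: shrinking it recovers full memorisation of the training set, while widening it produces samples that interpolate within the data subspace.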

[LG-51] Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting

链接: https://arxiv.org/abs/2502.19459
作者: Yu Liu,Baoxiong Jia,Ruijie Lu,Junfeng Ni,Song-Chun Zhu,Siyuan Huang
类目: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Building articulated objects is a key challenge in computer vision. Existing methods often fail to effectively integrate information across different object states, limiting the accuracy of part-mesh reconstruction and part dynamics modeling, particularly for complex multi-part articulated objects. We introduce ArtGS, a novel approach that leverages 3D Gaussians as a flexible and efficient representation to address these issues. Our method incorporates canonical Gaussians with coarse-to-fine initialization and updates for aligning articulated part information across different object states, and employs a skinning-inspired part dynamics modeling module to improve both part-mesh reconstruction and articulation learning. Extensive experiments on both synthetic and real-world datasets, including a new benchmark for complex multi-part objects, demonstrate that ArtGS achieves state-of-the-art performance in joint parameter estimation and part mesh reconstruction. Our approach significantly improves reconstruction quality and efficiency, especially for multi-part articulated objects. Additionally, we provide comprehensive analyses of our design choices, validating the effectiveness of each component to highlight potential areas for future improvement.

[LG-52] Global Framework for Simultaneous Emulation Across the Nuclear Landscape

链接: https://arxiv.org/abs/2502.20363
作者: Antoine Belley,Jose M. Munoz,Ronald F. Garcia Ruiz
类目: Nuclear Theory (nucl-th); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a hierarchical framework that combines ab initio many-body calculations with a Bayesian neural network, developing emulators capable of accurately predicting nuclear properties across the nuclear chart, including multiple isotopes simultaneously. We benchmark our developments using the oxygen isotopic chain, achieving accurate results for ground-state energies and nuclear charge radii, while providing robust uncertainty quantification. Our framework enables global sensitivity analysis of nuclear binding energies and charge radii with respect to the low-energy constants that describe the nuclear force.

[LG-53] Implicit Runge-Kutta based sparse identification of governing equations in biologically motivated systems

链接: https://arxiv.org/abs/2502.20319
作者: Mehrdad Anvari,Hamidreza Marasi,Hossein Kheiri
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Quantitative Methods (q-bio.QM)
*备注: 23 pages, 9 figures

点击查看摘要

Abstract:Identifying governing equations in physical and biological systems from datasets remains a long-standing challenge across various scientific disciplines, providing mechanistic insights into complex system evolution. Common methods like sparse identification of nonlinear dynamics (SINDy) often rely on precise derivative estimations, making them vulnerable to data scarcity and noise. This study presents a novel data-driven framework by integrating high-order implicit Runge-Kutta methods (IRKs) with sparse identification, termed IRK-SINDy. The framework exhibits remarkable robustness to data scarcity and noise by leveraging the lower stepsize constraint of IRKs. Two methods for incorporating IRKs into sparse regression are introduced: one employs iterative schemes for numerically solving nonlinear algebraic systems of equations, while the other utilizes deep neural networks to predict stage values of IRKs. The performance of IRK-SINDy is demonstrated through numerical experiments on benchmark problems with varied dynamical behaviors, including linear and nonlinear oscillators, the Lorenz system, and biologically relevant models like predator-prey dynamics, logistic growth, and the FitzHugh-Nagumo model. Results indicate that IRK-SINDy outperforms conventional SINDy and the RK4-SINDy framework, particularly under conditions of extreme data scarcity and noise, yielding interpretable and generalizable models.
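For context, the sparse-regression core that SINDy variants share is sequentially thresholded least squares (STLSQ): fit, zero out small coefficients, refit on the survivors. The sketch below applies it to exact derivatives of a logistic trajectory; IRK-SINDy’s actual contribution, replacing this derivative-based fit with implicit Runge-Kutta constraints, is not implemented here.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, iters=5):
    """Sequentially thresholded least squares: alternate a least-squares
    fit of dx/dt on the candidate library theta with hard-thresholding of
    small coefficients, refitting only the surviving terms."""
    coef = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        big = ~small
        if big.any():
            coef[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return coef

# Trajectory of the logistic equation dx/dt = x - x^2 (exact derivatives,
# the idealised case; IRK-SINDy targets the regime where these are noisy/scarce).
x = np.linspace(0.1, 0.9, 50)
dxdt = x - x**2
theta = np.column_stack([np.ones_like(x), x, x**2, x**3])  # candidate library
coef = stlsq(theta, dxdt)  # recovers [0, 1, -1, 0]
```

The fragility of this pipeline to estimating `dxdt` from noisy samples is exactly what motivates replacing the derivative term with implicit Runge-Kutta step equations.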

[LG-54] Multiple Linked Tensor Factorization

链接: https://arxiv.org/abs/2502.20286
作者: Zhiyu Kang,Raghavendra B. Rao,Eric F. Lock
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: 26 pages, 4 figures, 7 tables

点击查看摘要

Abstract:In biomedical research and other fields, it is now common to generate high content data that are both multi-source and multi-way. Multi-source data are collected from different high-throughput technologies while multi-way data are collected over multiple dimensions, yielding multiple tensor arrays. Integrative analysis of these data sets is needed, e.g., to capture and synthesize different facets of complex biological systems. However, despite growing interest in multi-source and multi-way factorization techniques, methods that can handle data that are both multi-source and multi-way are limited. In this work, we propose a Multiple Linked Tensors Factorization (MULTIFAC) method extending the CANDECOMP/PARAFAC (CP) decomposition to simultaneously reduce the dimension of multiple multi-way arrays and approximate underlying signal. We first introduce a version of the CP factorization with L2 penalties on the latent factors, leading to rank sparsity. When extended to multiple linked tensors, the method automatically reveals latent components that are shared across data sources or individual to each data source. We also extend the decomposition algorithm to its expectation-maximization (EM) version to handle incomplete data with imputation. Extensive simulation studies are conducted to demonstrate MULTIFAC’s ability to (i) approximate underlying signal, (ii) identify shared and unshared structures, and (iii) impute missing data. The approach yields an interpretable decomposition on multi-way multi-omics data for a study on early-life iron deficiency.
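The single-tensor building block that MULTIFAC extends is the CP decomposition; for a rank-1 approximation, alternating least squares reduces to three closed-form updates (a minimal sketch, without the paper’s L2 penalties, multi-tensor linking, or EM imputation):

```python
import numpy as np

def cp_rank1(T, iters=50):
    """Rank-1 CANDECOMP/PARAFAC of a 3-way array by alternating least
    squares: T ~ outer(a, b, c). Each update is the exact least-squares
    solution for one factor with the other two held fixed."""
    I, J, K = T.shape
    rng = np.random.default_rng(0)
    b, c = rng.standard_normal(J), rng.standard_normal(K)
    for _ in range(iters):
        a = np.einsum('ijk,j,k->i', T, b, c) / ((b @ b) * (c @ c))
        b = np.einsum('ijk,i,k->j', T, a, c) / ((a @ a) * (c @ c))
        c = np.einsum('ijk,i,j->k', T, a, b) / ((a @ a) * (b @ b))
    return a, b, c

# An exactly rank-1 tensor is recovered up to scaling of the factors.
a0, b0, c0 = np.array([1., 2.]), np.array([1., -1., 3.]), np.array([2., 0.5])
T = np.einsum('i,j,k->ijk', a0, b0, c0)
a, b, c = cp_rank1(T)
approx = np.einsum('i,j,k->ijk', a, b, c)
```

MULTIFAC builds on this machinery by penalising the latent factors for rank sparsity and sharing some factors across linked tensors, so that components common to multiple data sources emerge automatically.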

[LG-55] Generative adversarial neural networks for simulating neutrino interactions

链接: https://arxiv.org/abs/2502.20244
作者: Jose L. Bonilla,Krzysztof M. Graczyk,Artur M. Ankowski,Rwik Dharmapal Banerjee,Beata E. Kowal,Hemant Prasad,Jan T. Sobczyk
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex); Nuclear Theory (nucl-th)
*备注: 14 pages, 14 figures

点击查看摘要

Abstract:We propose a new approach to simulate neutrino scattering events as an alternative to the standard Monte Carlo generator approach. Generative adversarial neural network (GAN) models are developed to simulate neutrino-carbon collisions in the few-GeV energy range. The models produce scattering events for a given neutrino energy. GAN models are trained on simulation data from the NuWro Monte Carlo event generator. Two GAN models have been obtained: one simulating only quasielastic neutrino-nucleus scatterings and another simulating all interactions at a given neutrino energy. The performance of both models has been assessed using two statistical metrics. It is shown that both GAN models successfully reproduce the event distributions.

[LG-56] Qini curve estimation under clustered network interference

链接: https://arxiv.org/abs/2502.20097
作者: Rickard K.A. Karlsson,Bram van den Akker,Felipe Moraes,Hugo M. Proença,Jesse H. Krijthe
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Qini curves are a widely used tool for assessing treatment policies under allocation constraints as they visualize the incremental gain of a new treatment policy versus the cost of its implementation. Standard Qini curve estimation assumes no interference between units: that is, that treating one unit does not influence the outcome of any other unit. In many real-life applications such as public policy or marketing, however, the presence of interference is common. Ignoring interference in these scenarios can lead to systematically biased Qini curves that over- or under-estimate a treatment policy’s cost-effectiveness. In this paper, we address the problem of Qini curve estimation under clustered network interference, where interfering units form independent clusters. We propose a formal description of the problem setting with an experimental study design under which we can account for clustered network interference. Within this framework, we introduce three different estimation strategies suited for different conditions. Moreover, we introduce a marketplace simulator that emulates clustered network interference in a typical e-commerce setting. From both theoretical and empirical insights, we provide recommendations in choosing the best estimation strategy by identifying an inherent bias-variance trade-off among the estimation strategies.
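The standard no-interference Qini curve that the paper builds on is straightforward to compute: rank units by predicted uplift and, at each cutoff, compare cumulative treated outcomes against control outcomes rescaled by the treated/control count ratio. The sketch below uses made-up data and is not one of the paper's clustered-interference estimators.

```python
import numpy as np

def qini_curve(uplift_score, treated, outcome):
    """Standard (no-interference) Qini curve over a ranking by uplift score."""
    order = np.argsort(-uplift_score)
    t = treated[order].astype(float)
    y = outcome[order].astype(float)
    cum_t = np.cumsum(t)              # treated units seen so far
    cum_c = np.cumsum(1 - t)          # control units seen so far
    gain_t = np.cumsum(y * t)         # treated outcomes so far
    gain_c = np.cumsum(y * (1 - t))   # control outcomes so far
    ratio = np.divide(cum_t, cum_c, out=np.zeros_like(cum_t), where=cum_c > 0)
    return gain_t - gain_c * ratio

# Toy data: only units with a high score genuinely benefit from treatment.
rng = np.random.default_rng(0)
n = 2000
score = rng.uniform(size=n)
treated = rng.integers(0, 2, size=n)
effect = (score > 0.5).astype(float)     # true uplift aligned with the score
outcome = rng.binomial(1, 0.2 + 0.3 * effect * treated)
q = qini_curve(score, treated, outcome)
```

The curve rises while genuinely responsive units are being added and flattens afterward; under clustered interference this naive estimator is exactly what becomes biased.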

[LG-57] Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula

链接: https://arxiv.org/abs/2502.20003
作者: Matteo Vilucchio,Yatin Dandi,Cedric Gerbelot,Florent Krzakala
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The analytic characterization of the high-dimensional behavior of optimization for Generalized Linear Models (GLMs) with Gaussian data has been a central focus in statistics and probability in recent years. While convex cases, such as the LASSO, ridge regression, and logistic regression, have been extensively studied using a variety of techniques, the non-convex case remains far less understood despite its significance. A non-rigorous statistical physics framework has provided remarkable predictions for the behavior of high-dimensional optimization problems, but rigorously establishing their validity for non-convex problems has remained a fundamental challenge. In this work, we address this challenge by developing a systematic framework that rigorously proves replica-symmetric formulas for non-convex GLMs and precisely determines the conditions under which these formulas are valid. Remarkably, the rigorous replica-symmetric predictions align exactly with the conjectures made by physicists, and the so-called replicon condition. The originality of our approach lies in connecting two powerful theoretical tools: the Gaussian Min-Max Theorem, which we use to provide precise lower bounds, and Approximate Message Passing (AMP), which is shown to achieve these bounds algorithmically. We demonstrate the utility of this framework through significant applications: (i) by proving the optimality of the Tukey loss over the more commonly used Huber loss under an ε-contaminated data model, (ii) establishing the optimality of negative regularization in high-dimensional non-convex regression and (iii) characterizing the performance limits of linearized AMP algorithms. By rigorously validating statistical physics predictions in non-convex settings, we aim to open new pathways for analyzing increasingly complex optimization landscapes beyond the convex regime.


[LG-58] Physics-Informed Neural Networks for Optimal Vaccination Plan in SIR Epidemic Models

链接: https://arxiv.org/abs/2502.19890
作者: Minseok Kim,Yeongjong Kim,Yeoneung Kim
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work focuses on understanding the minimum eradication time for the controlled Susceptible-Infectious-Recovered (SIR) model in the time-homogeneous setting, where the infection and recovery rates are constant. The eradication time is defined as the earliest time the infectious population drops below a given threshold and remains below it. For time-homogeneous models, the eradication time is well-defined due to the predictable dynamics of the infectious population, and optimal control strategies can be systematically studied. We utilize Physics-Informed Neural Networks (PINNs) to solve the partial differential equation (PDE) governing the eradication time and derive the corresponding optimal vaccination control. The PINN framework enables a mesh-free solution to the PDE by embedding the dynamics directly into the loss function of a deep neural network. We use a variable scaling method to ensure stable training of PINN and mathematically analyze that this method is effective in our setting. This approach provides an efficient computational alternative to traditional numerical methods, allowing for an approximation of the eradication time and the optimal control strategy. Through numerical experiments, we validate the effectiveness of the proposed method in computing the minimum eradication time and achieving optimal control. This work offers a novel application of PINNs to epidemic modeling, bridging mathematical theory and computational practice for time-homogeneous SIR models.
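The quantity being controlled, the minimum eradication time, has a direct simulation-based definition that is easy to sketch. Below is a plain forward-Euler SIR integrator with a constant vaccination rate standing in for the optimal control (the paper computes the latter via a PINN solving the governing PDE, which this sketch does not attempt); all parameter values are illustrative.

```python
import numpy as np

def eradication_time(beta, gamma, v, s0, i0, thresh=1e-3, T=200.0, dt=0.01):
    """Forward-Euler SIR with a constant vaccination rate v.
    Returns the earliest time after which I stays below `thresh`."""
    n = int(T / dt)
    s, i = s0, i0
    i_path = np.empty(n)
    for k in range(n):
        ds = -beta * s * i - v * s          # vaccination removes susceptibles
        di = beta * s * i - gamma * i
        s = max(s + dt * ds, 0.0)
        i = max(i + dt * di, 0.0)
        i_path[k] = i
    above = np.nonzero(i_path >= thresh)[0]
    if above.size == 0:
        return 0.0
    if above[-1] == n - 1:
        return np.inf                       # not eradicated within the horizon
    return (above[-1] + 1) * dt

t_no_vax = eradication_time(beta=0.5, gamma=0.2, v=0.0, s0=0.99, i0=0.01)
t_vax = eradication_time(beta=0.5, gamma=0.2, v=0.1, s0=0.99, i0=0.01)
```

Even this crude constant control shortens the eradication time relative to no vaccination, which is the effect the optimal control maximizes.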

[LG-59] NeRFCom: Feature Transform Coding Meets Neural Radiance Field for Free-View 3D Scene Semantic Transmission

链接: https://arxiv.org/abs/2502.19873
作者: Weijie Yue,Zhongwei Si,Bolin Wu,Sixian Wang,Xiaoqi Qin,Kai Niu,Jincheng Dai,Ping Zhang
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce NeRFCom, a novel communication system designed for end-to-end 3D scene transmission. Compared to traditional systems relying on handcrafted NeRF semantic feature decomposition for compression and well-adaptive channel coding for transmission error correction, our NeRFCom employs a nonlinear transform and learned probabilistic models, enabling flexible variable-rate joint source-channel coding and efficient bandwidth allocation aligned with the varying contributions of the NeRF semantic features to 3D scene synthesis fidelity. Experimental results demonstrate that NeRFCom achieves efficient free-view 3D scene transmission while maintaining robustness under adverse channel conditions.

[LG-60] Fast Debiasing of the LASSO Estimator

链接: https://arxiv.org/abs/2502.19825
作者: Shuvayan Banerjee,James Saunderson,Radhendushka Srivastava,Ajit Rajwade
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In high-dimensional sparse regression, the Lasso estimator offers excellent theoretical guarantees but is well known to produce biased estimates. To address this, Javanmard and Montanari (2014) introduced a method to "debias" the Lasso estimates for a random sub-Gaussian sensing matrix A. Their approach relies on computing an "approximate inverse" M of the matrix A^T A / n by solving a convex optimization problem. This matrix M plays a critical role in mitigating bias and allows confidence intervals to be constructed from the debiased Lasso estimates. However, computing M is expensive in practice, as it requires iterative optimization. In the presented work, we re-parameterize the optimization problem to compute a "debiasing matrix" W := A M^T directly, rather than the approximate inverse M. This reformulation retains the theoretical guarantees of the debiased Lasso estimates, as they depend on the product A M^T rather than on M alone. Notably, we provide a simple, computationally efficient, closed-form solution for W under conditions on the sensing matrix A similar to those in the original debiasing formulation, with the additional condition that the entries within each row of A are uncorrelated. Moreover, the optimization problem based on W guarantees a unique optimal solution, unlike the original formulation based on M. We verify our main result with numerical simulations.
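The debiasing step itself is a one-line correction once M (or W) is in hand. The sketch below fits the Lasso with a plain ISTA solver and then applies the classical correction b + (1/n) M A^T (y - A b); for an i.i.d. N(0,1) design the population covariance is the identity, so M = I (equivalently W = A in the paper's reparameterization) is used as an illustrative stand-in, not the paper's closed-form solution.

```python
import numpy as np

def ista_lasso(A, y, lam, n_iter=1000):
    """Plain ISTA for the Lasso: min_b (1/2n)||y - A b||^2 + lam ||b||_1."""
    n, p = A.shape
    L = np.linalg.norm(A, 2) ** 2 / n      # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - A.T @ (A @ b - y) / (n * L)
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return b

rng = np.random.default_rng(0)
n, p, s = 200, 400, 5
A = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 3.0
y = A @ beta + 0.5 * rng.standard_normal(n)

b_lasso = ista_lasso(A, y, lam=0.5 * np.sqrt(2 * np.log(p) / n))
# One-step debiasing with M = I (illustrative choice for an isotropic design).
b_debiased = b_lasso + A.T @ (y - A @ b_lasso) / n
```

On the true support, the debiased estimate sits visibly closer to the ground-truth coefficient value than the shrunken Lasso estimate.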

[LG-61] Shared Stochastic Gaussian Process Latent Variable Models: A Multi-modal Generative Model for Quasar Spectra

链接: https://arxiv.org/abs/2502.19824
作者: Vidhi Lalchand,Anna-Christina Eilers
类目: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: Published in TMLR, this https URL. The code for this work is available at: this https URL

点击查看摘要

Abstract:This work proposes a scalable probabilistic latent variable model based on Gaussian processes (Lawrence, 2004) in the context of multiple observation spaces. We focus on an application in astrophysics where data sets typically contain both observed spectral features and scientific properties of astrophysical objects such as galaxies or exoplanets. In our application, we study the spectra of very luminous galaxies known as quasars, along with their properties, such as the mass of their central supermassive black hole, accretion rate, and luminosity, resulting in multiple observation spaces. A single data point is then characterized by different classes of observations, each with different likelihoods. Our proposed model extends the baseline stochastic variational Gaussian process latent variable model (GPLVM) introduced by Lalchand et al. (2022) to this setting, proposing a seamless generative model where the quasar spectra and scientific labels can be generated simultaneously using a shared latent space as input to different sets of Gaussian process decoders, one for each observation space. Additionally, this framework enables training in a missing-data setting where a large number of dimensions per data point may be unknown or unobserved. We demonstrate high-fidelity reconstructions of the spectra and scientific labels during test-time inference and briefly discuss the scientific interpretations of the results, along with the significance of such a generative model.

[LG-62] Inexact Moreau Envelope Lagrangian Method for Non-Convex Constrained Optimization under Local Error Bound Conditions on Constraint Functions

链接: https://arxiv.org/abs/2502.19764
作者: Yankun Huang,Qihang Lin,Yangyang Xu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we study the inexact Moreau envelope Lagrangian (iMELa) method for solving smooth non-convex optimization problems over a simple polytope with additional convex inequality constraints. By incorporating a proximal term into the traditional Lagrangian function, the iMELa method approximately solves a convex optimization subproblem over the polyhedral set at each main iteration. Under the assumption of a local error bound condition for subsets of the feasible set defined by subsets of the constraints, we establish that the iMELa method can find an ε-Karush-Kuhn-Tucker point with Õ(ε^-2) gradient oracle complexity.

[LG-63] FPGA-Accelerated SpeckleNN with SNL for Real-time X-ray Single-Particle Imaging

链接: https://arxiv.org/abs/2502.19734
作者: Abhilasha Dave,Cong Wang,James Russell,Ryan Herbst,Jana Thayer
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We implement a specialized version of our SpeckleNN model for real-time speckle pattern classification in X-ray Single-Particle Imaging (SPI) using the SLAC Neural Network Library (SNL) on an FPGA. This hardware is optimized for inference near detectors in high-throughput X-ray free-electron laser (XFEL) facilities like the Linac Coherent Light Source (LCLS). To fit FPGA constraints, we optimized SpeckleNN, reducing parameters from 5.6M to 64.6K (98.8% reduction) with 90% accuracy. We also compressed the latent space from 128 to 50 dimensions. Deployed on a KCU1500 FPGA, the model used 71% of DSPs, 75% of LUTs, and 48% of FFs, with an average power consumption of 9.4 W. The FPGA achieved an inference latency of 45.015 µs at 200 MHz. On an NVIDIA A100 GPU, the same inference consumed ~73 W and had a 400 µs latency. Our FPGA version achieved an 8.9x speedup and a 7.8x power reduction over the GPU. Key advancements include model specialization and dynamic weight loading through SNL, eliminating time-consuming FPGA re-synthesis for fast, continuous deployment of (re)trained models. These innovations enable real-time adaptive classification and efficient speckle pattern vetoing, making SpeckleNN ideal for XFEL facilities. This implementation accelerates SPI experiments and enhances adaptability to evolving conditions.

[LG-64] Spectral Analysis of Representational Similarity with Limited Neurons

链接: https://arxiv.org/abs/2502.19648
作者: Hyunmo Kang,Abdulkadir Canatar,SueYeon Chung
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Measuring representational similarity between neural recordings and computational models is challenging due to constraints on the number of neurons that can be recorded simultaneously. In this work, we investigate how such limitations affect similarity measures, focusing on Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA). Leveraging tools from Random Matrix Theory, we develop a predictive spectral framework for these measures and demonstrate that finite neuron sampling systematically underestimates similarity due to eigenvector delocalization. To overcome this, we introduce a denoising method to infer population-level similarity, enabling accurate analysis even with small neuron samples. Our theory is validated on synthetic and real datasets, offering practical strategies for interpreting neural data under finite sampling constraints.
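The underestimation effect described above is easy to reproduce with linear CKA on synthetic data: two "populations" share a low-dimensional latent signal, and recording only a handful of neurons from each systematically lowers the measured similarity. The construction below is an illustrative toy, not the paper's spectral denoising method.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (samples x neurons) response matrices."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    return (np.linalg.norm(X.T @ Y, 'fro') ** 2
            / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')))

rng = np.random.default_rng(0)
n, p, d = 500, 300, 20
Z = rng.standard_normal((n, d))                       # shared latent signal
X = Z @ rng.standard_normal((d, p)) + 0.5 * rng.standard_normal((n, p))
Y = Z @ rng.standard_normal((d, p)) + 0.5 * rng.standard_normal((n, p))
cka_full = linear_cka(X, Y)

# "Record" only q of the p neurons from each population, many times over.
q = 5
cka_sub = np.mean([
    linear_cka(X[:, rng.choice(p, q, replace=False)],
               Y[:, rng.choice(p, q, replace=False)])
    for _ in range(50)])
```

With all neurons, both Gram matrices are dominated by the shared signal and CKA is close to 1; with five neurons each, the subsamples span different random slices of the latent space and the average CKA drops well below the full-population value.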

[LG-65] A Method for Evaluating the Interpretability of Machine Learning Models in Predicting Bond Default Risk Based on LIME and SHAP

链接: https://arxiv.org/abs/2502.19615
作者: Yan Zhang,Lin Chen,Yixiang Tian
类目: General Finance (q-fin.GN); Machine Learning (cs.LG)
*备注: 12 Pages,9 figures

点击查看摘要

Abstract:Interpretability analysis methods for artificial intelligence models, such as LIME and SHAP, are widely used, though they primarily serve as post-hoc tools for analyzing model outputs. While it is commonly believed that the transparency and interpretability of AI models diminish as their complexity increases, there is currently no standardized method for assessing the inherent interpretability of the models themselves. This paper uses bond market default prediction as a case study, applying commonly used machine learning algorithms within AI models. First, the classification performance of these algorithms in default prediction is evaluated. Then, leveraging LIME and SHAP to assess the contribution of sample features to prediction outcomes, the paper proposes a novel method for evaluating the interpretability of the models themselves. The results of this analysis are consistent with the intuitive understanding and logical expectations regarding the interpretability of these models.
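The LIME idea underlying the paper's analysis fits in a few lines: perturb an instance, query the black-box model, and fit a proximity-weighted linear surrogate whose coefficients act as per-feature contributions. This is a minimal sketch of that mechanism, not the paper's evaluation method or the actual `lime` library; all names and parameter values are illustrative.

```python
import numpy as np

def lime_style_attributions(f, x, n_samples=2000, width=1.0, seed=0):
    """Fit a proximity-weighted local linear surrogate of black-box f at x
    and return its coefficients as per-feature contributions."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    Z = x + rng.standard_normal((n_samples, d))       # local perturbations
    w = np.exp(-((Z - x) ** 2).sum(1) / width ** 2)   # proximity kernel
    X1 = np.hstack([np.ones((n_samples, 1)), Z])      # intercept + features
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X1 * sw[:, None], f(Z) * sw, rcond=None)
    return coef[1:]                                   # drop the intercept

# Toy black box in which only the first feature matters near x.
black_box = lambda Z: 3.0 * Z[:, 0] + 0.0 * Z[:, 1]
x0 = np.array([0.5, -0.2])
contrib = lime_style_attributions(black_box, x0)
```

For this trivially linear black box the surrogate recovers the true local sensitivities, attributing everything to the first feature.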

[LG-66] Advancing calibration for stochastic agent-based models in epidemiology with Stein variational inference and Gaussian process surrogates

链接: https://arxiv.org/abs/2502.19550
作者: Connor Robertson,Cosmin Safta,Nicholson Collier,Jonathan Ozik,Jaideep Ray
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate calibration of stochastic agent-based models (ABMs) in epidemiology is crucial to make them useful in public health policy decisions and interventions. Traditional calibration methods, e.g., Markov Chain Monte Carlo (MCMC), that yield a probability density function for the parameters being calibrated, are often computationally expensive. When applied to ABMs which are highly parametrized, the calibration process becomes computationally infeasible. This paper investigates the utility of Stein Variational Inference (SVI) as an alternative calibration technique for stochastic epidemiological ABMs approximated by Gaussian process (GP) surrogates. SVI leverages gradient information to iteratively update a set of particles in the space of parameters being calibrated, offering potential advantages in scalability and efficiency for high-dimensional ABMs. The ensemble of particles yields a joint probability density function for the parameters and serves as the calibration. We compare the performance of SVI and MCMC in calibrating CityCOVID, a stochastic epidemiological ABM, focusing on predictive accuracy and calibration effectiveness. Our results demonstrate that SVI maintains predictive accuracy and calibration effectiveness comparable to MCMC, making it a viable alternative for complex epidemiological models. We also present the practical challenges of using a gradient-based calibration such as SVI which include careful tuning of hyperparameters and monitoring of the particle dynamics.
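The particle update at the heart of SVI-based calibration is the standard Stein variational gradient descent (SVGD) rule: an attractive term of kernel-smoothed score gradients plus a repulsive kernel-gradient term. The sketch below runs generic SVGD on a toy 1D Gaussian target, not the paper's GP-surrogate CityCOVID pipeline; all settings are illustrative.

```python
import numpy as np

def svgd(x, grad_logp, n_iter=1000, step=0.05):
    """Stein variational gradient descent with an RBF kernel and the median
    bandwidth heuristic. x: (n, d) particles; grad_logp maps (n, d) -> (n, d)
    scores of the target density."""
    n = x.shape[0]
    for _ in range(n_iter):
        diff = x[:, None, :] - x[None, :, :]        # pairwise x_i - x_j
        sq = (diff ** 2).sum(-1)
        h = np.median(sq) / np.log(n + 1) + 1e-12   # median heuristic
        K = np.exp(-sq / h)
        # Attraction (kernel-smoothed scores) + repulsion (kernel gradients).
        phi = (K @ grad_logp(x) + (2.0 / h) * (K[:, :, None] * diff).sum(1)) / n
        x = x + step * phi
    return x

# Target: a 1D Gaussian N(2, 0.5^2), whose score is available in closed form.
mu, sigma = 2.0, 0.5
score = lambda x: -(x - mu) / sigma ** 2
rng = np.random.default_rng(0)
particles = svgd(rng.standard_normal((100, 1)) * 3.0, score)
```

The repulsive term is what keeps the ensemble from collapsing to the mode, so the final particles approximate both the mean and the spread of the target.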

[LG-67] scMamba: A Pre-Trained Model for Single-Nucleus RNA Sequencing Analysis in Neurodegenerative Disorders

链接: https://arxiv.org/abs/2502.19429
作者: Gyutaek Oh,Baekgyu Choi,Seyoung Jin,Inkyung Jung,Jong Chul Ye
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 41 pages, 12 figures

点击查看摘要

Abstract:Single-nucleus RNA sequencing (snRNA-seq) has significantly advanced our understanding of the disease etiology of neurodegenerative disorders. However, the low quality of specimens derived from postmortem brain tissues, combined with the high variability caused by disease heterogeneity, makes it challenging to integrate snRNA-seq data from multiple sources for precise analyses. To address these challenges, we present scMamba, a pre-trained model designed to improve the quality and utility of snRNA-seq analysis, with a particular focus on neurodegenerative diseases. Inspired by the recent Mamba model, scMamba introduces a novel architecture that incorporates a linear adapter layer, gene embeddings, and bidirectional Mamba blocks, enabling efficient processing of snRNA-seq data while preserving information from the raw input. Notably, scMamba learns generalizable features of cells and genes through pre-training on snRNA-seq data, without relying on dimension reduction or selection of highly variable genes. We demonstrate that scMamba outperforms benchmark methods in various downstream tasks, including cell type annotation, doublet detection, imputation, and the identification of differentially expressed genes.

信息检索

[IR-0] Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation

链接: https://arxiv.org/abs/2502.19712
作者: Manveer Singh Tamber,Suleman Kazi,Vivek Sourabh,Jimmy Lin
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:While the current state-of-the-art dense retrieval models exhibit strong out-of-domain generalization, they might fail to capture nuanced domain-specific knowledge. In principle, fine-tuning these models for specialized retrieval tasks should yield higher effectiveness than relying on a one-size-fits-all model, but in practice, results can disappoint. We show that standard fine-tuning methods using an InfoNCE loss can unexpectedly degrade effectiveness rather than improve it, even for domain-specific scenarios. This holds true even when applying widely adopted techniques such as hard-negative mining and negative de-noising. To address this, we explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever. We further explore synthetic query generation using large language models. Through listwise distillation and training with a diverse set of queries ranging from natural user searches and factual claims to keyword-based queries, we achieve consistent effectiveness gains across multiple datasets. Our results also reveal that synthetic queries can rival human-written queries in training utility. However, we also identify limitations, particularly in the effectiveness of cross-encoder teachers as a bottleneck. We release our code and scripts to encourage further research.
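The listwise distillation objective can be sketched as a KL divergence between the teacher's and student's softmax distributions over one query's candidate list. The scores below are made up for illustration; this is the shape of the loss, not the authors' released training code.

```python
import numpy as np

def listwise_distill_loss(student_scores, teacher_scores, tau=1.0):
    """KL(teacher || student) between softmax distributions over the
    relevance scores of one query's candidate passages."""
    def softmax(sc):
        sc = np.asarray(sc, dtype=float) / tau
        e = np.exp(sc - sc.max())
        return e / e.sum()
    p, q = softmax(teacher_scores), softmax(student_scores)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 2.0, 0.5, -1.0])    # cross-encoder relevance scores
aligned = np.array([3.8, 2.1, 0.3, -0.9])    # student preserves the ranking
inverted = np.array([-1.0, 0.5, 2.0, 4.0])   # student reverses it
loss_aligned = listwise_distill_loss(aligned, teacher)
loss_inverted = listwise_distill_loss(inverted, teacher)
```

Unlike InfoNCE with a single labeled positive, this loss rewards matching the teacher's full graded ranking, which is the rich relevance signal the abstract refers to.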

[IR-1] PCL: Prompt-based Continual Learning for User Modeling in Recommender Systems WWW’25

链接: https://arxiv.org/abs/2502.19628
作者: Mingdai Yang,Fan Yang,Yanhui Guo,Shaoyuan Xu,Tianchen Zhou,Yetian Chen,Simone Shao,Jia Liu,Yan Gao
类目: Information Retrieval (cs.IR)
*备注: 5 pages. Accepted by WWW’25 as a short paper

点击查看摘要

Abstract:User modeling in large e-commerce platforms aims to optimize user experiences by incorporating various customer activities. Traditional models targeting a single task often focus on specific business metrics, neglecting comprehensive user behavior and thus limiting their effectiveness. To develop more generalized user representations, some existing work adopts Multi-task Learning (MTL) approaches, but these face the challenges of optimization imbalance and inefficiency in adapting to new tasks. Continual Learning (CL), which allows models to learn new tasks incrementally and independently, has emerged as a solution to MTL’s limitations. However, CL faces the challenge of catastrophic forgetting, where previously learned knowledge is lost when the model learns a new task. Inspired by the success of prompt tuning in Pretrained Language Models (PLMs), we propose PCL, a Prompt-based Continual Learning framework for user modeling, which utilizes position-wise prompts as external memory for each task, preserving knowledge and mitigating catastrophic forgetting. Additionally, we design contextual prompts to capture and leverage inter-task relationships during prompt tuning. We conduct extensive experiments on real-world datasets to demonstrate PCL’s effectiveness.

附件下载

点击下载今日全部论文列表