This post contains the latest paper listings retrieved from Arxiv.org on 2025-02-11, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-02-11)
943 new papers today, including:
- Natural Language Processing: 144 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 248 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 185 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 316 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
【Quick Read】: This paper tackles the sparse-reward problem that reinforcement learning (RL) faces in mathematical reasoning, which is aggravated when long chains of thought are only partially correct. Its key solution is a new RL framework, OREAL (Outcome REwArd-based reinforcement Learning), which learns via behavior cloning on positive trajectories from best-of-N (BoN) sampling and targets the KL-regularized optimal policy. To keep gradients consistent between positive and negative samples, the rewards of negative samples are reshaped. OREAL further mitigates reward sparsity with a token-level reward model that samples important tokens in reasoning trajectories for learning. Experiments show that a 7B model trained with OREAL reaches 94.0% pass@1 accuracy on MATH-500, on par with 32B models, while OREAL-32B pushes accuracy to 95.0%.
Link: https://arxiv.org/abs/2502.06781
Authors: Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen
Affiliations: Shanghai AI Laboratory; Shanghai Jiao Tong University; MMLab, The Chinese University of Hong Kong; HKGAI under InnoHK
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: We released our code, data, and model on this https URL
Abstract:Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the techniques that are believed certainly to be adopted are only reinforcement learning (RL) and the long chain of thoughts. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure the gradient consistency between positive and negative samples. To alleviate the long-existing difficulties brought by sparse rewards in RL, which are even exacerbated by the partial correctness of the long chain of thought for reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, being on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research at this https URL.
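To make the two core ideas concrete, the sketch below shows best-of-N sampling filtered by a binary outcome verifier, plus a simplified per-trajectory objective with behavior cloning on positives and reshaped (down-weighted) rewards on negatives. The `policy`/`verifier` interfaces and the loss form are illustrative assumptions, not the authors' released code.

```python
import torch

def best_of_n_positives(policy, verifier, prompt, n=16):
    """Sample n candidate solutions; keep those the binary outcome reward accepts."""
    candidates = [policy.generate(prompt) for _ in range(n)]
    return [c for c in candidates if verifier(prompt, c)]

def oreal_style_loss(token_logprobs, is_positive, neg_weight=0.5):
    """Simplified objective: behavior cloning on positive trajectories;
    negatives get a reshaped, down-weighted reward so their gradient scale
    stays consistent with the positive BC term (neg_weight is an assumption)."""
    nll = -token_logprobs.sum()          # negative log-likelihood of the trajectory
    return nll if is_positive else -neg_weight * nll
```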
[NLP-1] On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
【Quick Read】: This paper explores algorithmic frameworks for training large reasoning models (LRMs), focusing on how to enable efficient, scalable search in large language models (LLMs). Its key contribution is a post-training framework called Reinforcement Learning via Self-Play (RLSP) with three steps: (1) supervised fine-tuning on human or synthetic demonstrations of the reasoning process; (2) an exploration reward signal that encourages diverse and efficient reasoning behaviors; and (3) RL training with an outcome verifier to ensure correctness while preventing reward hacking. The crucial insight is to decouple the exploration and correctness signals during PPO training and balance them carefully, improving both performance and efficiency. Experiments show that RLSP markedly improves mathematical reasoning, suggesting it can foster the emergence of complex reasoning abilities in LLMs.
Link: https://arxiv.org/abs/2502.06773
Authors: Guanghao Ye, Khiem Duc Pham, Xinzhi Zhang, Sivakanth Gopi, Baolin Peng, Beibin Li, Janardhan Kulkarni, Huseyin A. Inan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Abstract shortened for arXiv
Abstract:Recent AI advancements, such as OpenAI’s new models, are transforming LLMs into LRMs (Large Reasoning Models) that perform reasoning during inference, taking extra time and compute for higher-quality outputs. We aim to uncover the algorithmic framework for training LRMs. Methods like self-consistency, PRM, and AlphaZero suggest reasoning as guided search. We ask: what is the simplest, most scalable way to enable search in LLMs? We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: (1) supervised fine-tuning with human or synthetic demonstrations of the reasoning process, (2) using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and (3) RL training with an outcome verifier to ensure correctness while preventing reward hacking. Our key innovation is to decouple exploration and correctness signals during PPO training, carefully balancing them to improve performance and efficiency. Empirical studies in the math domain show that RLSP improves reasoning. On the Llama-3.1-8B-Instruct model, RLSP can boost performance by 23% on the MATH-500 test set; on AIME 2024 math problems, Qwen2.5-32B-Instruct improved by 10% due to RLSP. However, a more important finding of this work is that the models trained using RLSP, even with the simplest exploration reward that encourages the model to take more intermediate steps, showed several emergent behaviors such as backtracking, exploration of ideas, and verification. These findings demonstrate that the RLSP framework might be enough to enable emergence of complex reasoning abilities in LLMs when scaled. Lastly, we propose a theory as to why the RLSP search strategy is more suitable for LLMs, inspired by a remarkable result that says CoT provably increases the computational power of LLMs, which grows with the number of steps in CoT (Li et al., 2024; Merrill and Sabharwal, 2023).
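A minimal sketch of the decoupled reward described above: correctness comes from the outcome verifier, while a separate exploration bonus (here the simplest variant, rewarding more intermediate steps) is added with its own coefficient for PPO. The `trajectory` fields and the coefficient value are illustrative assumptions.

```python
def rlsp_reward(trajectory, verifier, alpha=0.05):
    """Decoupled RLSP-style reward: verifier-based correctness plus an
    exploration bonus, balanced by alpha (names/values are assumptions)."""
    correctness = 1.0 if verifier(trajectory.final_answer) else 0.0
    exploration = alpha * len(trajectory.intermediate_steps)  # simplest bonus
    return correctness + exploration
```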
[NLP-2] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
【Quick Read】: This paper targets optimizing large language models (LLMs) for mathematical reasoning. The key is to effectively shrink the reasoning search space and improve complex problem solving through a structured, generic thought-template library, hierarchical reinforcement learning over template sequences, and a new inference scaling system. The resulting ReasonFlux-32B model achieves 91.2% accuracy on the MATH benchmark and solves 56.7% of AIME problems, clearly surpassing strong LLMs such as OpenAI o1-preview and DeepSeek-V3.
Link: https://arxiv.org/abs/2502.06772
Authors: Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Code: this https URL
Abstract:We present that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a brand new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: this https URL
[NLP-3] Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
【Quick Read】: This paper addresses the heavy computational cost of transformer inference over very large numbers of input tokens, which blocks their use on commodity hardware. The key solution is a tunable mechanism that cuts the cost of the forward pass by attending only to the most relevant tokens at each generation step via top-k selection. Experiments show the method runs inference over context windows of up to 1 million tokens within roughly 16GB of GPU memory, that models tolerate the sparsity induced by the reduced keys and values, and that over 95% of model performance is retained on common long-context benchmarks.
Link: https://arxiv.org/abs/2502.06766
Authors: Ryan Synk, Monte Hoover, John Kirchenbauer, Neel Jain, Alex Stein, Manli Shu, Josue Melendez Sanchez, Ramani Duraiswami, Tom Goldstein
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 8 figures, 2 tables in main body
Abstract:There is growing demand for performing inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale demands significant computational resources, hindering the application of transformers at long contexts on commodity (i.e., not data center scale) hardware. To address the inference time costs associated with running self-attention based transformer language models on long contexts and enable their adoption on widely available hardware, we propose a tunable mechanism that reduces the cost of the forward pass by attending to only the most relevant tokens at every generation step using a top-k selection mechanism. We showcase the efficiency gains afforded by our method by performing inference on context windows up to 1M tokens using approximately 16GB of GPU RAM. Our experiments reveal that models are capable of handling the sparsity induced by the reduced number of keys and values. By attending to less than 2% of input tokens, we achieve over 95% of model performance on common long context benchmarks (LM-Eval, AlpacaEval, and RULER).
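The core mechanism is simple to state: at each decoding step, score all cached keys against the current query, keep only the top-k, and run softmax attention over that subset. Below is a minimal single-head PyTorch sketch; the layout and defaults are assumptions, not the paper's optimized implementation.

```python
import torch

def topk_attention(q, K, V, k=256):
    """One decoding step that attends only to the k most relevant tokens.

    q: (d,) current query; K: (n, d) cached keys; V: (n, d) cached values.
    """
    scores = (K @ q) / (K.shape[-1] ** 0.5)      # (n,) attention logits
    k = min(k, scores.shape[0])                  # guard for short contexts
    top_scores, idx = torch.topk(scores, k)      # keep only the top-k tokens
    weights = torch.softmax(top_scores, dim=-1)  # softmax over the sparse subset
    return weights @ V[idx]                      # (d,) attention output
```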
[NLP-4] Rationalization Models for Text-to-SQL
【Quick Read】: This paper targets execution accuracy and explainability of text-to-SQL models on complex queries. The key is a framework that augments fine-tuning with generated Chain-of-Thought (CoT) rationales, consisting of intermediate SQL statements and explanations that serve as incremental steps toward the final SQL query. Through an iterative, dynamic few-shot knowledge-distillation procedure from a teacher model, a rationalization model is trained on validated decomposed queries, enabling large-scale synthetic CoT annotation. Results show that step-by-step query generation particularly improves execution accuracy on moderately and highly complex queries, while also enhancing explainability.
Link: https://arxiv.org/abs/2502.06759
Authors: Gaetano Rossiello, Nhan Pham, Michael Glass, Junkyu Lee, Shankar Subramanian
Affiliations: IBM T.J. Watson Research Center
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:
Abstract:We introduce a framework for generating Chain-of-Thought (CoT) rationales to enhance text-to-SQL model fine-tuning. These rationales consist of intermediate SQL statements and explanations, serving as incremental steps toward constructing the final SQL query. The process begins with manually annotating a small set of examples, which are then used to prompt a large language model in an iterative, dynamic few-shot knowledge distillation procedure from a teacher model. A rationalization model is subsequently trained on the validated decomposed queries, enabling extensive synthetic CoT annotations for text-to-SQL datasets. To evaluate the approach, we fine-tune small language models with and without these rationales on the BIRD dataset. Results indicate that step-by-step query generation improves execution accuracy, especially for moderately and highly complex queries, while also enhancing explainability.
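To picture the training data, here is a hypothetical shape for one CoT rationale: intermediate SQL statements, each paired with an explanation, building up to the final query. The table, columns, and wording are invented for illustration and do not come from the paper or the BIRD dataset.

```python
# One hypothetical text-to-SQL rationale as (intermediate SQL, explanation) steps.
rationale = [
    ("SELECT * FROM orders",
     "Start from the table that holds the order records."),
    ("SELECT customer_id, total FROM orders WHERE total > 100",
     "Keep only the rows and columns the question asks about."),
    ("SELECT customer_id, SUM(total) AS spend FROM orders "
     "WHERE total > 100 GROUP BY customer_id",
     "Aggregate per customer to produce the final answer."),
]
final_sql = rationale[-1][0]  # the last step is the complete query
```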
[NLP-5] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
【Quick Read】: This paper systematically analyzes how Test-Time Scaling (TTS) should be optimized across policy models, process reward models (PRMs), and task difficulties, and how far added inference-time compute can lift large language models (LLMs) on hard tasks. The key is identifying compute-optimal TTS strategies adapted to the specific task and model, which lets much smaller language models beat far larger ones. Experiments show that with a compute-optimal TTS strategy, a 1B-parameter model can surpass a 405B-parameter model on MATH-500.
Link: https://arxiv.org/abs/2502.06703
Authors: Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
Affiliations: Shanghai AI Laboratory; Tsinghua University; Harbin Institute of Technology; BUPT
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
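One common TTS strategy in this design space is best-of-N selection scored by a PRM; the sketch below uses the minimum step score as a solution's score, one conventional aggregation. The `policy.generate` and `prm.score_steps` interfaces are assumptions.

```python
def prm_best_of_n(policy, prm, problem, n=8):
    """Best-of-N test-time scaling with a process reward model (PRM)."""
    candidates = [policy.generate(problem) for _ in range(n)]
    # Score each solution by its weakest step (min aggregation is one convention).
    scores = [min(prm.score_steps(problem, c)) for c in candidates]
    return candidates[scores.index(max(scores))]
```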
[NLP-6] Multi-label Scandinavian Language Identification (SLIDE)
【Quick Read】: This paper addresses sentence-level multi-label Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. Its key contribution is showing that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, together with a new approach for training such multi-label LID models.
Link: https://arxiv.org/abs/2502.06692
Authors: Mariia Fedorova, Jonas Sebulon Frydenberg, Victoria Handford, Victoria Ovedie Chruickshank Langø, Solveig Helene Willoch, Marthe Løken Midtgaard, Yves Scherrer, Petter Mæhlum, David Samuel
Affiliations: Department of Informatics, University of Oslo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
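Mechanically, multi-label LID replaces a single softmax over languages with an independent probability per language, so one sentence can legitimately receive several labels. A sketch with assumed label codes and threshold:

```python
import torch

def predict_languages(logits, labels=("dan", "nob", "nno", "swe"), threshold=0.5):
    """Multi-label LID head: sigmoid per language instead of one softmax."""
    probs = torch.sigmoid(logits)                # independent per-language probs
    return [lang for lang, p in zip(labels, probs.tolist()) if p >= threshold]
```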
[NLP-7] Boosting Self-Efficacy and Performance of Large Language Models via Verbal Efficacy Stimulations ICONIP2024
【Quick Read】: This paper aims to improve the zero-shot performance of large language models (LLMs) through simple prompt engineering rather than intricate domain adaptation. The key is Verbal Efficacy Stimulations (VES), three types of prompts (encouraging, provocative, and critical) that target the models' sense of efficacy and task performance from multiple angles, with task difficulty further categorized to study effects at different levels. Experiments show all three VES types improve LLM performance on most tasks, and the most effective type varies across models.
Link: https://arxiv.org/abs/2502.06669
Authors: Rui Chen, Tailai Peng, Xinran Xie, Dekun Lin, Zhe Cui, Zheng Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in ICONIP 2024
Abstract:Significant improvements have been observed in the zero-shot capabilities of the Large Language Models (LLMs). Due to their high sensitivity to input, research has increasingly focused on enhancing LLMs’ performance via direct and simple prompt engineering rather than intricate domain adaptation. Studies suggest that LLMs exhibit emotional intelligence, and both positive and negative emotions can potentially enhance task performances. However, prior interaction prompts have predominantly concentrated on a single stimulus type, neglecting to compare different stimulus effects, examine the influence of varying task difficulties, or explore underlying mechanisms. This paper, inspired by the positive correlation between self-efficacy and task performance within the social cognitive theory, introduces Verbal Efficacy Stimulations (VES). Our VES comprises three types of verbal prompts: encouraging, provocative, and critical, addressing six aspects such as helpfulness and competence. And we further categorize task difficulty, aiming to extensively investigate how distinct VES influence the self-efficacy and task achievements of language models at varied levels of difficulty. The experimental results show that the three types of VES improve the performance of LLMs on most tasks, and the most effective VES varies for different models. In extensive experiments, we have obtained some findings consistent with psychological theories, providing novel insights for future research.
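For intuition, here are invented examples of the three VES prompt types; the paper's actual prompt wordings are not reproduced here.

```python
# Hypothetical examples of the three Verbal Efficacy Stimulation types.
VES_PROMPTS = {
    "encouraging": "You are a highly capable assistant. I am confident you can solve this.",
    "provocative": "Most models get this wrong. Show that you can do better.",
    "critical": "Your earlier answers were careless. Be precise and rigorous this time.",
}

def apply_ves(task_prompt: str, style: str = "encouraging") -> str:
    """Prepend a verbal efficacy stimulation to the task prompt."""
    return f"{VES_PROMPTS[style]}\n\n{task_prompt}"
```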
[NLP-8] Automatic Evaluation of Healthcare LLMs Beyond Question-Answering
【Quick Read】: This paper addresses the balance between factuality and expressiveness in evaluating large language models (LLMs) for healthcare. Current benchmarks rely on open-ended or close-ended question answering (QA), which respectively emphasize expressiveness and factual accuracy, but the relationship between the two remains poorly understood. The key solution is a comprehensive multi-axis evaluation suite for healthcare LLMs, together with a new medical benchmark, CareQA, and a novel open-ended metric, Relaxed Perplexity, to expose and mitigate the blind spots and limitations of existing methodologies.
Link: https://arxiv.org/abs/2502.06666
Authors: Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla
Affiliations: Barcelona Supercomputing Center (BSC); Universitat Politècnica de Catalunya (UPC)–BarcelonaTech; Independent Researcher (formerly affiliated with BSC)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current Large Language Models (LLMs) benchmarks are often based on open-ended or close-ended QA evaluations, avoiding the requirement of human labor. Close-ended measurements evaluate the factuality of responses but lack expressiveness. Open-ended capture the model’s capacity to produce discourse responses but are harder to assess for correctness. These two approaches are commonly used, either independently or together, though their relationship remains poorly understood. This work is focused on the healthcare domain, where both factuality and discourse matter greatly. It introduces a comprehensive, multi-axis suite for healthcare LLM evaluation, exploring correlations between open and close benchmarks and metrics. Findings include blind spots and overlaps in current methodologies. As an updated sanity check, we release a new medical benchmark–CareQA–, with both open and closed variants. Finally, we propose a novel metric for open-ended evaluations --Relaxed Perplexity-- to mitigate the identified limitations.
[NLP-9] Who Taught You That? Tracing Teachers in Model Distillation
【Quick Read】: This paper studies whether a student model's teacher can be identified from the student's outputs. The key finding, from discriminative models built on lexical features, is that n-gram similarity alone is unreliable for identifying teachers, whereas the part-of-speech (PoS) templates preferred by student models mimic those of their teachers and enable teacher identification.
Link: https://arxiv.org/abs/2502.06659
Authors: Somin Wadhwa, Chantal Shaib, Silvio Amir, Byron C. Wallace
Affiliations: Northeastern University
Subjects: Computation and Language (cs.CL)
Comments: Preprint; under review
Abstract:Model distillation – using outputs from a large teacher model to teach a small student model – is a practical means of creating efficient models for a particular task. We ask: Can we identify a student's teacher based on its outputs? Such “footprints” left by teacher LLMs would be interesting artifacts. Beyond this, reliable teacher inference may have practical implications as actors seek to distill specific capabilities of massive proprietary LLMs into deployed smaller LMs, potentially violating terms of service. We consider practical task distillation targets including summarization, question answering, and instruction-following. We assume a finite set of candidate teacher models, which we treat as blackboxes. We design discriminative models that operate over lexical features. We find that n-gram similarity alone is unreliable for identifying teachers, but part-of-speech (PoS) templates preferred by student models mimic those of their teachers.
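A sketch of the lexical-feature side of this analysis: extract part-of-speech n-gram "templates" from a model output, whose frequency profile can then be compared across candidate teachers. Using NLTK with trigram features is an assumption, not the paper's exact setup.

```python
from collections import Counter
import nltk  # first run: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def pos_templates(text: str, n: int = 3) -> Counter:
    """Count part-of-speech n-gram templates in a model output."""
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]  # keep tags, drop words
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
```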
[NLP-10] In-Context Learning (and Unlearning) of Length Biases NAACL2025
【Quick Read】: This paper investigates how length biases affect large language models during in-context learning. The key finding is that in-context learning can counteract length biases already encoded in models (e.g., via fine-tuning) without costly parameter updates. By analyzing how models pick up length information from the context window and which factors modulate the degree of bias exhibited, the paper demonstrates the power of in-context learning for debiasing model prediction behavior.
Link: https://arxiv.org/abs/2502.06653
Authors: Stephanie Schoch, Yangfeng Ji
Affiliations: University of Virginia
Subjects: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025
Abstract:Large language models have demonstrated strong capabilities to learn in-context, where exemplar input-output pairings are appended to the prompt for demonstration. However, existing work has demonstrated the ability of models to learn lexical and label biases in-context, which negatively impacts both performance and robustness of models. The impact of other statistical data biases remains under-explored, which this work aims to address. We specifically investigate the impact of length biases on in-context learning. We demonstrate that models do learn length biases in the context window for their predictions, and further empirically analyze the factors that modulate the level of bias exhibited by the model. In addition, we show that learning length information in-context can be used to counter the length bias that has been encoded in models (e.g., via fine-tuning). This reveals the power of in-context learning in debiasing model prediction behaviors without the need for costly parameter updates.
[NLP-11] Transparent NLP: Using RAG and LLM Alignment for Privacy QA
【Quick Read】: This paper addresses the truthfulness and comprehensibility problems that the probabilistic nature of language models creates when meeting the transparency principle of the General Data Protection Regulation (GDPR). The key solution is a state-of-the-art Retrieval Augmented Generation (RAG) system enhanced with alignment techniques, integrating the Rewindable Auto-regressive Inference (RAIN) module and its proposed multidimensional extension, MultiRAIN, to optimize responses for preciseness and comprehensibility. Evaluations show that RAG systems with an alignment module outperform baseline RAG systems on most metrics, though none fully match human answers.
Link: https://arxiv.org/abs/2502.06652
Authors: Anna Leschanowsky, Zahra Kolagar, Erion Çano, Ivan Habernal, Dara Hallinan, Emanuël A. P. Habets, Birgit Popp
Affiliations: Fraunhofer Institute for Integrated Circuits (IIS); Ruhr University Bochum & Research Center Trustworthy Data Science and Security; FIZ Karlsruhe - Leibniz Institute for Information Infrastructure; International Audio Laboratories Erlangen
Subjects: Computation and Language (cs.CL)
Comments: Submitted to ARR
Abstract:The transparency principle of the General Data Protection Regulation (GDPR) requires data processing information to be clear, precise, and accessible. While language models show promise in this context, their probabilistic nature complicates truthfulness and comprehensibility. This paper examines state-of-the-art Retrieval Augmented Generation (RAG) systems enhanced with alignment techniques to fulfill GDPR obligations. We evaluate RAG systems incorporating an alignment module like Rewindable Auto-regressive Inference (RAIN) and our proposed multidimensional extension, MultiRAIN, using a Privacy QA dataset. Responses are optimized for preciseness and comprehensibility and are assessed through 21 metrics, including deterministic and large language model-based evaluations. Our results show that RAG systems with an alignment module outperform baseline RAG systems on most metrics, though none fully match human answers. Principal component analysis of the results reveals complex interactions between metrics, highlighting the need to refine metrics. This study provides a foundation for integrating advanced natural language processing systems into legal compliance frameworks.
[NLP-12] The 2021 Tokyo Olympics Multilingual News Article Dataset
【Quick Read】: This paper addresses the shortage of resources for evaluating multilingual news clustering algorithms. The key contribution is a dataset of 10,940 multilingual news articles from 1,918 publishers covering 1,350 sub-events of the 2021 Tokyo Olympics; articles are grouped with an online clustering algorithm so that each group reports on the same sub-event, then manually annotated and evaluated to ensure quality. Released in CSV format, the dataset serves both as a benchmark for multilingual news clustering and as a resource for analyzing the dynamics and events of the 2021 Tokyo Olympics from different perspectives.
Link: https://arxiv.org/abs/2502.06648
Authors: Erik Novak, Erik Calcina, Dunja Mladenić, Marko Grobelnik
Affiliations: Jožef Stefan Institute
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:In this paper, we introduce a dataset of multilingual news articles covering the 2021 Tokyo Olympics. A total of 10,940 news articles were gathered from 1,918 different publishers, covering 1,350 sub-events of the 2021 Olympics, and published between July 1, 2021, and August 14, 2021. These articles are written in nine languages from different language families and in different scripts. To create the dataset, the raw news articles were first retrieved via a service that collects and analyzes news articles. Then, the articles were grouped using an online clustering algorithm, with each group containing articles reporting on the same sub-event. Finally, the groups were manually annotated and evaluated. The development of this dataset aims to provide a resource for evaluating the performance of multilingual news clustering algorithms, for which limited datasets are available. It can also be used to analyze the dynamics and events of the 2021 Tokyo Olympics from different perspectives. The dataset is available in CSV format and can be accessed from the repository at this http URL.
[NLP-13] Steel-LLM: From Scratch to Open Source – A Personal Journey in Building a Chinese-Centric LLM
【Quick Read】: This paper addresses the challenge of building a high-quality, open-source Chinese-centric language model under limited computational resources. The key is training a 1-billion-parameter model from scratch, focused on collecting and training on large-scale Chinese data with a small proportion of English data. Through its detailed account of data collection, model design, training methodology, and the challenges encountered, the work offers a valuable reference for researchers and practitioners developing their own large language models (LLMs).
Link: https://arxiv.org/abs/2502.06635
Authors: Qingshui Gu, Shu Li, Tianyu Zheng, Zhaoxiang Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project’s key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at this https URL.
[NLP-14] Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches NAACL2025
【Quick Read】: This paper tackles a central challenge of large-scale multi-document summarization (MDS): producing high-quality, information-complete summaries. It contrasts two classes of systems: compression-based methods, which use multi-stage pipelines and tend to produce lossy summaries, and full-text methods, which exploit recent advances in long-context reasoning and promise lossless summaries. The key finding is that full-text and retrieval-augmented methods perform best in most settings; compression-based methods show strong promise at intermediate stages, sometimes even beating full-context locally, but lose information through their multi-stage pipelines and lack of global context. The paper therefore argues for hybrid approaches combining compression and full-text strategies for optimal large-scale MDS.
Link: https://arxiv.org/abs/2502.06617
Authors: Adithya Pratapa, Teruko Mitamura
Affiliations: Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments: NAACL 2025 camera-ready version
Abstract:Automatically summarizing large text collections is a valuable tool for document research, with applications in journalism, academic research, legal work, and many other fields. In this work, we contrast two classes of systems for large-scale multi-document summarization (MDS): compression and full-text. Compression-based methods use a multi-stage pipeline and often lead to lossy summaries. Full-text methods promise a lossless summary by relying on recent advances in long-context reasoning. To understand their utility on large-scale MDS, we evaluated them on three datasets, each containing approximately one hundred documents per summary. Our experiments cover a diverse set of long-context transformers (Llama-3.1, Command-R, Jamba-1.5-Mini) and compression methods (retrieval-augmented, hierarchical, incremental). Overall, we find that full-text and retrieval methods perform the best in most settings. With further analysis into the salient information retention patterns, we show that compression-based methods show strong promise at intermediate stages, even outperforming full-context. However, they suffer information loss due to their multi-stage pipeline and lack of global context. Our results highlight the need to develop hybrid approaches that combine compression and full-text approaches for optimal performance on large-scale multi-document summarization.
[NLP-15] Do we really have to filter out random noise in pre-training data for language models?
【Quick Read】: This paper studies how random noise in web-scale pre-training data affects large language models (LLMs). The key solution is a novel plug-and-play Local Gradient Matching loss that aligns the gradients of normal and perturbed features, strengthening the denoising ability of the downstream task head without requiring knowledge of the model's parameters.
Link: https://arxiv.org/abs/2502.06604
Authors: Jinghan Ru, Yuxin Xie, Xianwei Zhuang, Yuguo Yin, Yuexian Zou
Affiliations: School of Electronic and Computer Engineering, Peking University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Web-scale pre-training datasets are the cornerstone of LLMs’ success. However, text data curated from the internet inevitably contains random noise caused by decoding errors or unregulated web content. In contrast to previous works that focus on low quality or synthetic data, our study provides the first systematic investigation into such random noise through a cohesive “What-Why-How” framework. Surprisingly, we observed that the resulting increase in next-token prediction (NTP) loss was significantly lower than the proportion of random noise. We provide a theoretical justification for this phenomenon, which also elucidates the success of multilingual models. On the other hand, experiments show that the model’s performance in downstream tasks is not based solely on the NTP loss, which means that random noise may result in degraded downstream performance. To address the potential adverse effects, we introduce a novel plug-and-play Local Gradient Matching loss, which explicitly enhances the denoising capability of the downstream task head by aligning the gradient of normal and perturbed features without requiring knowledge of the model’s parameters. Additional experiments on 8 language and 14 vision benchmarks further validate its effectiveness.
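A minimal PyTorch sketch of the idea: compare the task head's gradient with respect to clean features against its gradient for a perturbed copy, and penalize the mismatch. The Gaussian perturbation and squared distance are assumptions about details the abstract does not specify.

```python
import torch

def local_gradient_matching_loss(head, feats, noise_std=0.01):
    """Align the head's input-gradients for clean vs. perturbed features."""
    feats = feats.detach().requires_grad_(True)
    noisy = (feats + noise_std * torch.randn_like(feats)).detach().requires_grad_(True)
    g_clean = torch.autograd.grad(head(feats).sum(), feats, create_graph=True)[0]
    g_noisy = torch.autograd.grad(head(noisy).sum(), noisy, create_graph=True)[0]
    return (g_clean - g_noisy).pow(2).mean()  # differentiable w.r.t. head params
```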
[NLP-16] Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? NAACL2025
【Quick Read】: This paper addresses multilingual image-caption evaluation and the limitations of existing evaluation methods in multilingual settings. The key is proposing and validating several strategies for evaluating multilingual variants of the CLIPScore metric, including quality-aware machine-translated datasets with human judgements and re-purposed multilingual datasets originally designed for semantic inference and reasoning. Results show that finetuned multilingual models generalize across languages and handle complex linguistic challenges, enabling high-quality cross-lingual evaluation.
Link: https://arxiv.org/abs/2502.06600
Authors: Gonçalo Gomes, Chrysoula Zerva, Bruno Martins
Affiliations: Instituto Superior Técnico, University of Lisbon; INESC-ID; Instituto de Telecomunicações
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL 2025
Abstract:The evaluation of image captions, looking at both linguistic fluency and semantic correspondence to visual contents, has witnessed a significant effort. Still, despite advancements such as the CLIPScore metric, multilingual captioning evaluation has remained relatively unexplored. This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages, and additional tests with natively multilingual and multicultural data further attest to the high-quality assessments.
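For reference, the CLIPScore that the multilingual variants build on is simple to state; the sketch follows the standard definition (weight 2.5, from the original CLIPScore paper), with embeddings assumed to come from a (multilingual) CLIP-style encoder.

```python
import torch
import torch.nn.functional as F

def clipscore(image_emb: torch.Tensor, caption_emb: torch.Tensor, w: float = 2.5) -> torch.Tensor:
    """Reference-free CLIPScore: w * max(cos(image, caption), 0)."""
    cos = F.cosine_similarity(image_emb, caption_emb, dim=-1)
    return w * cos.clamp(min=0.0)
```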
[NLP-17] Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training NAACL2025
【Quick Read】: This paper addresses the scarcity of agent-oriented pre-training data, which forces LLM-based autonomous agents to rely on complex prompting or extensive fine-tuning that struggles to add new capabilities while preserving generalization. The key solution is Hephaestus-Forge, a large-scale pre-training corpus designed to strengthen the fundamental abilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Continually pre-trained on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of the corpus.
Link: https://arxiv.org/abs/2502.06589
Authors: Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang
Affiliations: Amazon; Georgia Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to NAACL 2025 main conference
Abstract:Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe in data mixing ratios. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and generalization of LLMs to new tasks or environments.
[NLP-18] LawGPT: Knowledge-Guided Data Generation and Its Application to Legal LLM
【Quick Read】: This paper addresses the notable limitations of large language models (LLMs) on legal reasoning: proprietary models carry data-privacy risks and high inference costs, while open-source models underperform for lack of legal-domain training data. The key solution is KgDG, a knowledge-guided data generation framework that leverages legal knowledge to increase generation diversity and adds a refinement and verification process to ensure the quality of generated data, thereby improving the legal reasoning performance of open-source LLMs.
Link: https://arxiv.org/abs/2502.06572
Authors: Zhi Zhou, Kun-Yang Yu, Shi-Yu Tian, Jiang-Xin Shi, Xiao-Wen Yang, Pengxiao Song, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li
Affiliations: Nanjing University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Large language models (LLMs), both proprietary and open-source, have demonstrated remarkable capabilities across various natural language processing tasks. However, they face significant limitations in legal reasoning tasks. Proprietary models introduce data privacy risks and high inference costs, while open-source models underperform due to insufficient legal domain training data. To address these limitations, we study data generation for legal reasoning to improve the legal reasoning performance of open-source LLMs with the help of proprietary LLMs. This is challenging due to the lack of legal knowledge in proprietary LLMs and the difficulty in verifying the generated data. We propose KgDG, a knowledge-guided data generation framework for legal reasoning. Our framework enables leveraging legal knowledge to enhance generation diversity and introduces a refinement and verification process to ensure the quality of generated data. Moreover, we expand the generated dataset to further enhance the LLM reasoning capabilities. Using KgDG, we create a synthetic legal reasoning dataset containing 50K high-quality examples. Our trained model LawGPT outperforms existing legal-specific LLMs and achieves performance comparable to proprietary LLMs, demonstrating the effectiveness of KgDG and LawGPT. Our code and resources are publicly available at this https URL.
[NLP-19] Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation ICLR2025
【Quick Read】: This paper addresses the complexity, scalability, and diversity limitations of existing benchmarks for evaluating first-order logic (FOL) reasoning. The key solution is ProverGen, a framework that combines the generative strengths of large language models (LLMs) with the rigor and precision of symbolic provers to create ProverQA, a scalable, diverse, high-quality FOL reasoning dataset. ProverQA also provides accessible, logically coherent intermediate reasoning steps for every problem. The key lies in ProverGen's novel combination of LLM generation with prover-based verification, which makes the resulting dataset effective for both evaluating and training reasoning abilities.
Link: https://arxiv.org/abs/2502.06563
Authors: Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, Conghui He
Affiliations: Beihang University; Shanghai Artificial Intelligence Laboratory; Fudan University; Zhongguancun Laboratory; State Key Laboratory of Intelligent Manufacturing Systems Technology, Beijing
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ICLR 2025
Abstract:First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset’s challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. Code available at: this https URL
[NLP-20] Position: It's Time to Act on the Risk of Efficient Personalized Text Generation
【Quick Read】: This position paper examines the new safety risks created by high-quality open-source generative text models (LLMs) and efficient fine-tuning techniques, which make it possible to build personalized models that generate high-quality text attuned to a specific individual and credibly imitate their writing style. The key point is that the accessibility and low cost of this technology let malicious actors impersonate specific individuals at scale, for example in phishing emails. These risks differ from the much-discussed image, voice, or video deepfake attacks and are not yet adequately addressed by the research community or by current open- and closed-source models.
Link: https://arxiv.org/abs/2502.06560
Authors: Eugenia Iofinova, Andrej Jovanovic, Dan Alistarh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:The recent surge in high-quality open-sourced Generative AI text models (colloquially: LLMs), as well as efficient finetuning techniques, has opened the possibility of creating high-quality personalized models, i.e., models generating text attuned to a specific individual’s needs and capable of credibly imitating their writing style by leveraging that person’s own data to refine an open-source model. The technology to create such models is accessible to private individuals, and training and running such models can be done cheaply on consumer-grade hardware. These advancements are a huge gain for usability and privacy. This position paper argues, however, that these advancements also introduce new safety risks by making it practically feasible for malicious actors to impersonate specific individuals at scale, for instance for the purpose of phishing emails, based on small amounts of publicly available text. We further argue that these risks are complementary to - and distinct from - the much-discussed risks of other impersonation attacks such as image, voice, or video deepfakes, and are not adequately addressed by the larger research community, or the current generation of open - and closed-source models.
[NLP-21] ProjectTest: A Project-level Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
【Quick Read】: This paper addresses the limitation that existing benchmarks evaluate unit test generation on function- or class-level code rather than on more practical and challenging project-level codebases. The key solution is ProjectTest, a project-level unit test generation benchmark covering Python, Java, and JavaScript, with 20 moderate-sized, high-quality projects per language. Through detailed error analysis and evaluations under manual and self error-fixing scenarios, the paper exposes the difficulties frontier LLMs face on complex project-level code and their potential improvement when equipped with error-fixing mechanisms.
Link: https://arxiv.org/abs/2502.06556
Authors: Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing
Affiliations: University of Illinois Chicago; Salesforce Research; Scale AI
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:
Abstract:Unit test generation has become a promising and important use case of LLMs. However, existing evaluation benchmarks for assessing LLM unit test generation capabilities focus on function- or class-level code rather than more practical and challenging project-level codebases. To address such limitation, we propose ProjectTest, a project-level benchmark for unit test generation covering Python, Java, and JavaScript. ProjectTest features 20 moderate-sized and high-quality projects per language. We evaluate nine frontier LLMs on ProjectTest and the results show that all frontier LLMs tested exhibit moderate performance on ProjectTest on Python and Java, highlighting the difficulty of ProjectTest. We also conduct a thorough error analysis, which shows that even frontier LLMs, such as Claude-3.5-Sonnet, have significant simple errors, including compilation and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms.
[NLP-22] Efficient Scientific Full Text Classification: The Case of EICAT Impact Assessments
【Quick Read】: This paper studies efficient classification of scientific full texts, asking how to improve efficiency on large text inputs while maintaining or improving classification performance. The key is a method that shrinks the input by selecting subsets of input sentences. Sentence selection models are trained on signals from several sources, such as human evidence annotations, LLM-generated annotations, or explainability scores, improving both encoder- and decoder-based language models while cutting input length. Repeatedly sampling shorter inputs also proves highly effective: at slightly increased cost it further improves classification performance.
Link: https://arxiv.org/abs/2502.06551
Authors: Marc Felix Brinner, Sina Zarrieß
Affiliations: Bielefeld University, Germany
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This study explores strategies for efficiently classifying scientific full texts using both small, BERT-based models and local large language models like Llama-3.1 8B. We focus on developing methods for selecting subsets of input sentences to reduce input size while simultaneously enhancing classification performance. To this end, we compile a novel dataset consisting of full-text scientific papers from the field of invasion biology, specifically addressing the impacts of invasive species. These papers are aligned with publicly available impact assessments created by researchers for the International Union for Conservation of Nature (IUCN). Through extensive experimentation, we demonstrate that various sources like human evidence annotations, LLM-generated annotations or explainability scores can be used to train sentence selection models that improve the performance of both encoder- and decoder-based language models while optimizing efficiency through the reduction in input length, leading to improved results even if compared to models like ModernBERT that are able to handle the complete text as input. Additionally, we find that repeated sampling of shorter inputs proves to be a very effective strategy that, at a slightly increased cost, can further improve classification performance.
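The selection mechanism reduces to scoring sentences and keeping a top subset in document order before classification; the sketch below assumes an arbitrary `scorer` (e.g., a trained selection model) and a fixed sentence budget.

```python
def select_sentences(sentences, scorer, budget=20):
    """Keep the top-scoring sentences, restored to document order,
    to shrink the classifier input."""
    ranked = sorted(range(len(sentences)), key=lambda i: scorer(sentences[i]), reverse=True)
    kept = sorted(ranked[:budget])                  # restore original order
    return " ".join(sentences[i] for i in kept)
```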
[NLP-23] Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning NAACL
【Quick Read】: This paper addresses the challenge of getting large language models (LLMs) to achieve long-term goals by fine-tuning them with reinforcement learning (RL) to explore solutions that optimize a given objective. Exploration with LLMs is hard: one must balance discovering new solutions against staying close to the pre-trained model so basic capabilities do not degrade, a balance typically controlled with a Kullback-Leibler (KL) penalty. The key solution is a simple modification of the KL penalty that favors exploration on "critical tokens", improving the efficiency of the RL fine-tuning stage. The study also shows that varying degrees of pre-training influence exploration and that these critical tokens have a dramatic impact on the final outcome.
Link: https://arxiv.org/abs/2502.06533
Authors: Jean Vassoyan, Nathanaël Beau, Roman Plaud
Affiliations: Université Paris-Saclay, CNRS, ENS Paris-Saclay, Centre Borelli; onepoint; Université de Paris, LLF, CNRS; Institut Polytechnique de Paris
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, 5 tables. Accepted for publication in the Findings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2025
Abstract:The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of “critical tokens” which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.
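A sketch of the modification: keep the usual per-token KL penalty against the frozen reference model but shrink its coefficient on tokens flagged as critical, making exploration cheaper exactly there. The critical-token mask and relaxation factor are assumptions; the paper defines its own criterion.

```python
import torch

def critical_token_kl(logp_policy, logp_ref, critical_mask, beta=0.1, relax=0.1):
    """Per-token KL penalty with a reduced coefficient on critical tokens.

    logp_policy, logp_ref: (T,) log-probs of the sampled tokens under the
    fine-tuned policy and the reference model; critical_mask: (T,) bool.
    """
    coeff = torch.full_like(logp_policy, beta)
    coeff[critical_mask] = relax * beta          # cheaper exploration where it matters
    per_token_kl = logp_policy - logp_ref        # sampled-token KL estimate
    return (coeff * per_token_kl).sum()
```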
[NLP-24] GuideLLM: Exploring LLM-Guided Conversation with Applications in Autobiography Interviewing
【Quick Read】: This paper explores the potential of large language models (LLMs) in LLM-guided conversation, characterized by three fundamental components: goal navigation, context management, and empathetic engagement. The key solution is the GuideLLM system, evaluated in an interviewing environment spanning diverse topics that yields around 1.4k turns of utterances, over 184k tokens, and more than 200 mentioned events per chatbot evaluation. GuideLLM is compared with six state-of-the-art LLMs, such as GPT-4o and Llama-3-70b-Instruct, on interviewing quality and autobiography generation quality, with further feedback collected from human participants. Results show GuideLLM significantly outperforms baseline LLMs in automatic evaluation and consistently leads in human ratings.
Link: https://arxiv.org/abs/2502.06494
Authors: Jinhao Duan, Xinyu Zhao, Zhuoxuan Zhang, Eunhye Ko, Lily Boddy, Chenan Wang, Tianhao Li, Alexander Rasgon, Junyuan Hong, Min Kyung Lee, Chenxi Yuan, Qi Long, Ying Ding, Tianlong Chen, Kaidi Xu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 31 pages; the first three authors contributed equally
Abstract:Although Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering, the potential of LLM-guided conversations-where LLMs direct the discourse and steer the conversation’s objectives-remains under-explored. In this study, we first characterize LLM-guided conversation into three fundamental components: (i) Goal Navigation; (ii) Context Management; (iii) Empathetic Engagement, and propose GuideLLM as an installation. We then implement an interviewing environment for the evaluation of LLM-guided conversation. Specifically, various topics are involved in this environment for comprehensive interviewing evaluation, resulting in around 1.4k turns of utterances, 184k tokens, and over 200 events mentioned during the interviewing for each chatbot evaluation. We compare GuideLLM with 6 state-of-the-art LLMs such as GPT-4o and Llama-3-70b-Instruct, from the perspective of interviewing quality, and autobiography generation quality. For automatic evaluation, we derive user proxies from multiple autobiographies and employ LLM-as-a-judge to score LLM behaviors. We further conduct a human-involved experiment by employing 45 human participants to chat with GuideLLM and baselines. We then collect human feedback, preferences, and ratings regarding the qualities of conversation and autobiography. Experimental results indicate that GuideLLM significantly outperforms baseline LLMs in automatic evaluation and achieves consistent leading performances in human ratings.
[NLP-25] Adaptive Prompting: Ad-hoc Prompt Composition for Social Bias Detection NAACL2025
【Quick Read】: This paper addresses the input-dependent inefficiency of individual prompting techniques and their compositions in automatic prompt optimization. The key is an adaptive prompting approach that predicts the optimal prompt composition ad-hoc for a given input, improving large language model performance on highly context-dependent tasks. Applied to social bias detection and evaluated with three large language models on three datasets, the approach robustly ensures high detection performance and is best in several settings; first experiments on other tasks support its generalizability.
Link: https://arxiv.org/abs/2502.06487
Authors: Maximilian Spliethöver, Tim Knebler, Fabian Fumagalli, Maximilian Muschalik, Barbara Hammer, Eyke Hüllermeier, Henning Wachsmuth
Affiliations: Leibniz University Hannover, Institute of Artificial Intelligence; Bielefeld University, CITEC; LMU Munich, MCML
Subjects: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025
Abstract:Recent advances on instruction fine-tuning have led to the development of various prompting techniques for large language models, such as explicit reasoning steps. However, the success of techniques depends on various parameters, such as the task, language model, and context provided. Finding an effective prompt is, therefore, often a trial-and-error process. Most existing approaches to automatic prompting aim to optimize individual techniques instead of compositions of techniques and their dependence on the input. To fill this gap, we propose an adaptive prompting approach that predicts the optimal prompt composition ad-hoc for a given input. We apply our approach to social bias detection, a highly context-dependent task that requires semantic understanding. We evaluate it with three large language models on three datasets, comparing compositions to individual techniques and other baselines. The results underline the importance of finding an effective prompt composition. Our approach robustly ensures high detection performance, and is best in several settings. Moreover, first experiments on other tasks support its generalizability.
[NLP-26] KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment
【Quick Read】: This paper addresses the problem that the knowledge graphs (KGs) modern AI systems depend on cannot keep pace with the rapid growth of scientific literature through manual updating. The key solution is KARMA, a framework that uses multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Nine collaborative agents, covering entity discovery, relation extraction, schema alignment, and conflict resolution, iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schemas.
Link: https://arxiv.org/abs/2502.06472
Authors: Yuxing Lu, Jinzhuo Wang
Affiliations: Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Digital Libraries (cs.DL)
Comments: 24 pages, 3 figures, 2 tables
Abstract:Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution that iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1% LLM-verified correctness and reducing conflict edges by 18.6% through multi-layer assessments.
[NLP-27] A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks AAAI2025
【Quick Read】: This paper surveys evaluations of behavioural and representational Theory of Mind (ToM) in large language models (LLMs) and identifies important safety risks arising from advanced LLM ToM capabilities. Its key contribution is proposing research directions for effectively evaluating and mitigating these risks.
Link: https://arxiv.org/abs/2502.06470
Authors: Hieu Minh “Jord” Nguyen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Advancing Artificial Intelligence through Theory of Mind Workshop, AAAI 2025
Abstract:Theory of Mind (ToM), the ability to attribute mental states to others and predict their behaviour, is fundamental to social intelligence. In this paper, we survey studies evaluating behavioural and representational ToM in Large Language Models (LLMs), identify important safety risks from advanced LLM ToM capabilities, and suggest several research directions for effective evaluation and mitigation of these risks.
[NLP-28] Beyond Literal Token Overlap: Token Alignability for Multilinguality NAACL2025
【Quick Read】: This paper addresses the problem that existing predictors of cross-lingual knowledge transfer rely too literally on token overlap or similarity of token distributions, which fail for language pairs with different scripts or orthographic conventions. The key solution is subword token alignability, a new metric for understanding and assessing cross-linguality between languages that use distinct scripts or follow different orthographic rules; it is especially useful for language pairs with little literal overlap but good actual cross-lingual behavior.
Link: https://arxiv.org/abs/2502.06468
Authors: Katharina Hämmerl, Tomasz Limisiewicz, Jindřich Libovický, Alexander Fraser
Affiliations: Centre for Information and Language Processing, LMU Munich; Munich Center for Machine Learning; Faculty of Mathematics and Physics, Charles University, Czech Republic; Technical University of Munich, Germany
Subjects: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025
Abstract:Previous work has considered token overlap, or even similarity of token distributions, as predictors for multilinguality and cross-lingual knowledge transfer in language models. However, these very literal metrics assign large distances to language pairs with different scripts, which can nevertheless show good cross-linguality. This limits the explanatory strength of token overlap for knowledge transfer between language pairs that use distinct scripts or follow different orthographic conventions. In this paper, we propose subword token alignability as a new way to understand the impact and quality of multilingual tokenisation. In particular, this metric predicts multilinguality much better when scripts are disparate and the overlap of literal tokens is low. We analyse this metric in the context of both encoder and decoder models, look at data size as a potential distractor, and discuss how this insight may be applied to multilingual tokenisation in future work. We recommend our subword token alignability metric for identifying optimal language pairs for cross-lingual transfer, as well as to guide the construction of better multilingual tokenisers in the future. We publish our code and reproducibility details.
[NLP-29] MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
【Quick Read】: This paper asks whether large language models' performance on mathematical reasoning reflects true reasoning ability or memorization. It constructs MATH-P-Simple and MATH-P-Hard, which modify original math problems with simple and hard perturbations, respectively. The key observation is that performance drops significantly across models when hard perturbations fundamentally change the nature of a problem, suggesting reliance on memorization rather than genuine reasoning. The paper also flags a novel form of memorization in which models blindly apply learned problem-solving skills without assessing their applicability to the modified context, an issue amplified when the original problems are used for in-context learning.
Link: https://arxiv.org/abs/2502.06453
Authors: Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, Mengdi Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, prior work has constructed mathematical benchmarks when questions undergo simple perturbations – modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.
zh
[NLP-30] Content-Driven Local Response: Supporting Sentence-Level and Message-Level Mobile Email Replies With and Without AI
【速读】: 该论文旨在解决移动电子邮件场景下,生成式人工智能(AI)所生成的文本不能始终反映用户真实意图的问题。这一问题使得用户在使用带有AI参与的界面时面临权衡取舍。论文的关键解决方案是提出了一种名为内容驱动局部响应(CDLR)的新界面概念,该概念受到微任务处理的启发。CDLR允许用户通过选择句子插入响应,这同时引导AI提供更为贴切的建议。该方法支持局部AI建议与消息级改进的结合,从而实现灵活的工作流程,并保持减少输入和错误的优势。
链接: https://arxiv.org/abs/2502.06430
作者: Tim Zindulka,Sven Goller,Florian Lehmann,Daniel Buschek
机构: University of Bayreuth (拜罗伊特大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 23 pages, 14 figures, 2 tables, ACM CHI 2025
点击查看摘要
Abstract:Mobile emailing demands efficiency in diverse situations, which motivates the use of AI. However, generated text does not always reflect how people want to respond. This challenges users with AI involvement tradeoffs not yet considered in email UIs. We address this with a new UI concept called Content-Driven Local Response (CDLR), inspired by microtasking. This allows users to insert responses into the email by selecting sentences, which additionally serves to guide AI suggestions. The concept supports combining AI for local suggestions and message-level improvements. Our user study (N=126) compared CDLR with manual typing and full reply generation. We found that CDLR supports flexible workflows with varying degrees of AI involvement, while retaining the benefits of reduced typing and errors. This work contributes a new approach to integrating AI capabilities: By redesigning the UI for workflows with and without AI, we can empower users to dynamically adjust AI involvement.
zh
[NLP-31] Systematic Outliers in Large Language Models ICLR2025
【速读】: 该论文旨在解决大型语言模型(LLMs)中异常值(outliers)的影响及其形成机制的问题。这些异常值显著影响模型性能,并且在模型压缩过程中带来挑战。论文的关键在于定义并分类了三种类型的异常值:激活异常值(activation outliers)、权重异常值(weight outliers)和注意力异常值(attention outliers),并通过分析它们在不同维度上的分布,揭示了它们出现的内在联系及其对注意力机制的最终影响。基于这些观察,论文通过理论推导和实验验证了这些异常值源于自注意力机制中的softmax运算,并提出它们作为注意力机制内的隐式上下文感知缩放因子。研究还表明,通过结构化消除这些系统性异常值可以加速收敛并提高模型压缩效果。
链接: https://arxiv.org/abs/2502.06415
作者: Yongqi An,Xu Zhao,Tao Yu,Ming Tang,Jinqiao Wang
机构: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences(自动化研究所,中国科学院); School of artificial intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院); Wuhan AI Research(武汉人工智能研究院); Objecteye Inc.(Objecteye Inc.); Institute of Automation, Chinese Academy of Sciences(自动化研究所,中国科学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICLR 2025. Project Page: this https URL
点击查看摘要
Abstract:Outliers have been widely observed in Large Language Models (LLMs), significantly impacting model performance and posing challenges for model compression. Understanding the functionality and formation mechanisms of these outliers is critically important. Existing works, however, largely focus on reducing the impact of outliers from an algorithmic perspective, lacking an in-depth investigation into their causes and roles. In this work, we provide a detailed analysis of the formation process, underlying causes, and functions of outliers in LLMs. We define and categorize three types of outliers-activation outliers, weight outliers, and attention outliers-and analyze their distributions across different dimensions, uncovering inherent connections between their occurrences and their ultimate influence on the attention mechanism. Based on these observations, we hypothesize and explore the mechanisms by which these outliers arise and function, demonstrating through theoretical derivations and experiments that they emerge due to the self-attention mechanism’s softmax operation. These outliers act as implicit context-aware scaling factors within the attention mechanism. As these outliers stem from systematic influences, we term them systematic outliers. Our study not only enhances the understanding of Transformer-based LLMs but also shows that structurally eliminating outliers can accelerate convergence and improve model compression. The code is available at this https URL.
zh
[NLP-32] SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators NAACL2025
【速读】: 该论文旨在解决多语言文本去毒化过程中平行数据稀缺的问题。关键解决方案在于引入了一个用于生成多语言平行去毒化数据的流程,并构建了名为SynthDetoxM的新数据集,该数据集包含16,000个高质量的去毒化句对,涵盖德语、法语、西班牙语和俄语。这些数据取自不同的毒性评估数据集,再由九种现代开源LLM在少样本(few-shot)设置下重写生成。实验表明,在所生成的合成数据上训练的模型即使在数据受限的情况下,也优于在人工标注的MultiParaDetox数据集上训练的模型;在SynthDetoxM上训练的模型在少样本设置下更是优于所有被评估的LLM。
链接: https://arxiv.org/abs/2502.06394
作者: Daniil Moskovskiy,Nikita Sushko,Sergey Pletenev,Elena Tutubalina,Alexander Panchenko
机构: AIRI; Skoltech; Sber AI; ISP RAS Research Center for Trusted AI
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025 Main Conference
点击查看摘要
Abstract:Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets have superior performance to those trained on the human-annotated MultiParaDetox dataset even in data limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in few-shot setting. We release our dataset and code to help further research in multilingual text detoxification.
zh
[NLP-33] The exponential distribution of the orders of demonstrative, numeral, adjective and noun
【速读】: 该论文旨在探讨名词短语中指示代词、数词、形容词和名词构成的优选语序的频率分布,并试图确定这种分布更符合指数分布还是幂律分布。论文的关键发现是,允许全部24种可能语序均具有非零概率的指数模型能够更好地拟合数据,这挑战了幂律分布(如词频的齐夫定律)不可避免的观点,并表明词序变化不存在硬性约束,未被观察到的语序可能仅仅源于采样不足。
链接: https://arxiv.org/abs/2502.06342
作者: Ramon Ferrer-i-Cancho
机构: 未知
类目: Computation and Language (cs.CL); Physics and Society (physics.soc-ph)
备注:
点击查看摘要
Abstract:The frequency of the preferred order for a noun phrase formed by demonstrative, numeral, adjective and noun has received significant attention over the last two decades. We investigate the actual distribution of the preferred 24 possible orders. There is no consensus on whether it can be well-fitted by an exponential or a power law distribution. We find that an exponential distribution is a much better model. This finding and other circumstances where an exponential-like distribution is found challenge the view that power-law distributions, e.g., Zipf’s law for word frequencies, are inevitable. We also investigate which of two exponential distributions gives a better fit: an exponential model where the 24 orders have non-zero probability or an exponential model where the number of orders that can have non-zero probability is variable. When parsimony and generalizability are prioritized, we find strong support for the exponential model where all 24 orders have non-zero probability. This finding suggests that there is no hard constraint on word order variation and then unattested orders merely result from undersampling, consistently with Cysouw’s view.
zh
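作为上文拟合方法的直观说明,下面给出一个最小的 Python 示意:在 24 种语序的频数上用极大似然分别拟合指数分布与幂律分布,并用 AIC 比较两个单参数模型。其中的频数数组为虚构的演示数据,并非论文的原始统计。

```python
import numpy as np
from scipy.optimize import minimize_scalar

# 24 种语序按频数降序排列(示意数据,非论文原始统计)
counts = np.array([120, 80, 52, 34, 22, 15, 10, 7, 5, 3, 2, 2,
                   1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
ranks = np.arange(1, 25)

def nll_exponential(lam):
    # p(r) ∝ exp(-lam * r),在 r = 1..24 上归一化,所有语序概率均非零
    p = np.exp(-lam * ranks)
    p /= p.sum()
    return -(counts * np.log(p)).sum()

def nll_powerlaw(alpha):
    # p(r) ∝ r^(-alpha)
    p = ranks ** (-alpha)
    p /= p.sum()
    return -(counts * np.log(p)).sum()

exp_fit = minimize_scalar(nll_exponential, bounds=(1e-4, 10), method="bounded")
pow_fit = minimize_scalar(nll_powerlaw, bounds=(1e-4, 10), method="bounded")

# 两个模型各含 1 个参数:AIC = 2k + 2*NLL,越小拟合越好
print("exponential AIC:", 2 + 2 * exp_fit.fun)
print("power law   AIC:", 2 + 2 * pow_fit.fun)
```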
[NLP-34] Expect the Unexpected: FailSafe Long Context QA for Finance
【速读】: 该论文旨在解决大型语言模型(LLMs)在金融领域应用中的鲁棒性和上下文感知能力问题。通过提出一个新的长上下文金融基准测试集FailSafeQA,该研究设计了六种人类交互方式的变化来测试模型在处理查询失败(Query Failure)和上下文失败(Context Failure)场景下的表现。论文的关键解决方案在于采用LLM-as-a-Judge方法,并使用Qwen2.5-72B-Instruct模型,结合细粒度评分标准来评估模型的鲁棒性(Robustness)、上下文定位(Context Grounding)和合规性(Compliance)得分。研究表明,尽管某些模型在缓解输入扰动方面表现出色,但它们需要平衡准确作答与避免产生幻觉的能力。
链接: https://arxiv.org/abs/2502.06329
作者: Kiran Kamble,Melisa Russak,Dmytro Mozolevskyi,Muayad Ali,Mateusz Russak,Waseem AlShikh
机构: Writer, Inc
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: this https URL
zh
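下面用一个极简的 Python 草图示意 LLM-as-a-Judge 的打分流程:用带评分细则的提示让裁判模型输出整数分,再在一组扰动样本上取平均并归一化。其中 call_judge 为假设的模型调用接口,评分细则也仅为示意,均非论文原文。

```python
JUDGE_PROMPT = """你是金融问答系统的评审员,请依据细则打分(只输出一个 1-6 的整数):
6=完全基于上下文且准确;4-5=基本准确但有小瑕疵;2-3=部分幻觉;1=严重幻觉或不当拒答。
[问题] {query}
[上下文] {context}
[回答] {answer}"""

def call_judge(prompt: str) -> str:
    # 假设的裁判模型接口,使用时替换为 Qwen2.5-72B-Instruct 等模型的实际调用
    raise NotImplementedError

def grade(query: str, context: str, answer: str) -> int:
    reply = call_judge(JUDGE_PROMPT.format(query=query, context=context, answer=answer))
    return int(reply.strip().split()[0])          # 解析裁判输出的整数分

def robustness_score(cases) -> float:
    # cases: [(query, context, answer), ...],对扰动样本求平均分并归一化到 [0, 1]
    return sum(grade(*c) for c in cases) / (6 * len(cases))
```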
[NLP-35] Can AI Examine Novelty of Patents?: Novelty Evaluation Based on the Correspondence between Patent Claim and Prior Art
【速读】: 该论文旨在解决专利新颖性评估这一关键且具有挑战性的任务,传统上由专利审查员手工完成。论文的关键解决方案在于评估大型语言模型(Large Language Models, LLMs)在通过对比权利要求与引用的现有技术文件来评估专利新颖性方面的能力,这类似于专利审查员的工作流程。论文构建了一个专门用于新颖性评估的数据集,并分析了LLMs在这项任务中的表现。研究表明,虽然分类模型在评估新颖性方面效果不佳,但生成式模型能够以合理的准确性做出预测,并且其解释足够准确以理解目标专利与现有技术之间的关系。这些发现展示了LLMs在辅助专利评估方面的潜力,有望减轻审查员和申请人的工作负担。
链接: https://arxiv.org/abs/2502.06316
作者: Hayato Ikoma,Teruko Mitamura
机构: Language Technology Institute, SCS (语言技术研究所, SCS); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 5 figures
点击查看摘要
Abstract:Assessing the novelty of patent claims is a critical yet challenging task traditionally performed by patent examiners. While advancements in NLP have enabled progress in various patent-related tasks, novelty assessment remains unexplored. This paper introduces a novel challenge by evaluating the ability of large language models (LLMs) to assess patent novelty by comparing claims with cited prior art documents, following a process similar to that of patent examiners. We present the first dataset specifically designed for novelty evaluation, derived from real patent examination cases, and analyze the capabilities of LLMs to address this task. Our study reveals that while classification models struggle to effectively assess novelty, generative models make predictions with a reasonable level of accuracy, and their explanations are accurate enough to understand the relationship between the target patent and prior art. These findings demonstrate the potential of LLMs to assist in patent evaluation, reducing the workload for both examiners and applicants. Our contributions highlight the limitations of current models and provide a foundation for improving AI-driven patent analysis through advanced models and refined datasets.
zh
[NLP-36] Latent Convergence Modulation in Large Language Models : A Novel Approach to Iterative Contextual Realignment
【速读】: 该论文旨在解决自回归生成模型中令牌预测稳定性的问题:早期推理步骤中的微小变化常常导致长序列中显著的语义漂移。解决方案的关键在于引入一种结构化调制机制,用于调节隐藏状态转换,确保潜在表示轨迹与先前的上下文依赖保持一致,同时保留生成的灵活性。这种调制框架设计用于Transformer架构内,动态约束表示演化,而无需外部记忆依赖或大规模的架构修改。
链接: https://arxiv.org/abs/2502.06302
作者: Patricia Porretta,Sylvester Pakenham,Huxley Ainsworth,Gregory Chatten,Godfrey Allerton,Simon Hollingsworth,Vance Periwinkle
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Token prediction stability remains a challenge in autoregressive generative models, where minor variations in early inference steps often lead to significant semantic drift over extended sequences. A structured modulation mechanism was introduced to regulate hidden state transitions, ensuring that latent representation trajectories remain aligned with prior contextual dependencies while preserving generative flexibility. The modulation framework was designed to function within transformer-based architectures, dynamically constraining representation evolution without imposing external memory dependencies or extensive architectural modifications. Empirical evaluations demonstrated that structured latent adjustments contributed to reductions in perplexity fluctuations, entropy variance, and lexical instability, improving coherence in long-form text generation. Gradient propagation stability was further analyzed, revealing that the modulation process led to smoother optimization pathways, mitigating erratic fluctuations in weight updates across successive inference steps. The computational efficiency of the modulation process was assessed, showing that its integration within transformer-based architectures introduced only marginal overhead while maintaining compatibility with existing optimization frameworks. The structured modulation constraints also influenced syntactic variation, preventing excessive repetition while maintaining balanced sentence length distributions. Comparative evaluations against baseline models reinforced the role of controlled latent state evolution in improving pronoun resolution, logical consistency, and contextual alignment across autoregressive text generation tasks.
zh
[NLP-37] SeaExam and SeaBench: Benchmarking LLM s with Local Multilingual Questions in Southeast Asia NAACL2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在评估东南亚(SEA)应用场景中的能力时所面临的问题。现有的多语言数据集大多源自英语翻译,无法全面反映SEA地区的真实场景。为了解决这一问题,论文提出了两个新的基准测试,即SeaExam和SeaBench。SeaExam基于区域教育考试构建,涵盖地方历史和文学等科目;SeaBench则围绕多轮开放式任务设计,以反映SEA社区日常互动。关键在于使用真实世界场景的数据来更有效地评估LLMs在SEA语言任务上的表现。
链接: https://arxiv.org/abs/2502.06298
作者: Chaoqun Liu,Wenxuan Zhang,Jiahao Ying,Mahani Aljunied,Anh Tuan Luu,Lidong Bing
机构: Nanyang Technological University (南洋理工大学), Singapore; DAMO Academy (达摩院), Alibaba Group (阿里巴巴集团), Singapore; Hupan Lab (湖畔实验室), Hangzhou, China; Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of NAACL 2025
点击查看摘要
Abstract:This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
zh
[NLP-38] Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
【速读】: 该论文旨在解决基于推测解码(Speculative Decoding, SD)方法中,由较小的草稿模型预测多个令牌时因模型容量限制导致的预测准确性问题。解决方案的关键在于引入Jakiro方法,利用专家混合(Mixture of Experts, MoE)机制,使独立专家生成多样化的预测结果,从而有效解除候选预测之间的相关性。此外,论文还提出了一种结合自回归解码与并行解码的混合推理策略,并通过特征对比机制增强后者以提高精度。这种方法显著提升了预测准确性并实现了更高的推理加速。
链接: https://arxiv.org/abs/2502.06282
作者: Haiduo Huang,Fuwei Yang,Zhenhua Liu,Yixing Xu,Jinze Li,Yang Liu,Xuanwu Yin,Dong Li,Pengju Ren,Emad Barsoum
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the latter with contrastive mechanism in features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our codes are available at this https URL.
zh
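作为背景,下面给出推测解码中"草稿-验证"循环的贪心简化示意(单序列、贪心解码),以便理解 Jakiro 所要改进的基本流程;它并未实现论文的 MoE 多专家头与混合推理策略,模型接口按 HuggingFace 风格假设(前向返回 .logits)。

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k: int = 4):
    """一步推测解码:草稿模型连猜 k 个 token,目标模型一次前向并行验证。"""
    n = input_ids.size(1)
    draft_ids = input_ids
    for _ in range(k):                                    # 小模型自回归起草
        logits = draft_model(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    target_logits = target_model(draft_ids).logits        # 大模型一次前向验证
    proposed = draft_ids[:, n:]                           # (1, k) 草稿 token
    pred = target_logits[:, n - 1:-1].argmax(-1)          # 目标模型在各候选位置的贪心预测
    n_accept = int((pred == proposed).cumprod(-1).sum())  # 接受最长匹配前缀
    bonus = target_logits[:, n - 1 + n_accept].argmax(-1, keepdim=True)
    return torch.cat([input_ids, proposed[:, :n_accept], bonus], dim=-1)
```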
[NLP-39] DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models
【速读】: 该论文旨在评估现代大型语言模型(Large Language Models, LLMs)在参与论辩、审议以及与人类专家保持一致方面的能力。论文的关键在于通过DebateBench数据集来衡量LLMs在理解辩论规则和评估标准后,能否分析多个演讲并推理所有发言人的论点,以得出最终结果。初步评估显示,当前的LLMs在这项任务上表现不佳,表明需要开发更复杂的技术以提高其性能。
链接: https://arxiv.org/abs/2502.06279
作者: Utkarsh Tiwari,Aryan Seth,Adi Mukherjee,Kaavya Mer,Kavish,Dhruv Kumar
机构: Birla Institute of Technology and Science, Pilani (比尔拉理工学院皮拉尼校区)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We introduce DebateBench, a novel dataset consisting of an extensive collection of transcripts and metadata from some of the world’s most prestigious competitive debates. The dataset consists of British Parliamentary debates from prestigious debating tournaments on diverse topics, annotated with detailed speech-level scores and house rankings sourced from official adjudication data. We curate 256 speeches across 32 debates with each debate being over 1 hour long with each input being an average of 32,000 tokens. Designed to capture long-context, large-scale reasoning tasks, DebateBench provides a benchmark for evaluating modern large language models (LLMs) on their ability to engage in argumentation, deliberation, and alignment with human experts. To do well on DebateBench, the LLMs must perform in-context learning to understand the rules and evaluation criteria of the debates, then analyze 8 seven minute long speeches and reason about the arguments presented by all speakers to give the final results. Our preliminary evaluation using GPT o1, GPT-4o, and Claude Haiku, shows that LLMs struggle to perform well on DebateBench, highlighting the need to develop more sophisticated techniques for improving their performance.
zh
[NLP-40] Emergent Response Planning in LLMs
【速读】: 该论文旨在探讨大型语言模型(Large Language Models, LLMs)在预测下一个令牌(token)的过程中是否表现出潜在的规划行为。研究发现这些模型的隐藏表示能够编码未来输出,而不仅仅是下一个令牌。论文的关键在于通过简单的探测方法证明LLMs的提示表示(prompt representations)能够编码其整个响应中的全局属性,包括结构属性(如响应长度、推理步骤)、内容属性(如故事写作中的角色选择、响应末尾的选择题答案)以及行为属性(如答案置信度、事实一致性)。此外,研究还探讨了这种规划能力如何随模型规模和任务的不同而变化,并在生成过程中如何演变。研究表明,LLMs在其隐藏表示中提前规划未来的潜力,这为提高透明度和生成控制提供了可能的应用方向。
链接: https://arxiv.org/abs/2502.06258
作者: Zhichen Dong,Zhanhui Zhou,Zhixuan Liu,Chao Yang,Chaochao Lu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: their hidden representations encode future outputs beyond the next token. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including structural attributes (response length, reasoning steps), content attributes (character choices in storywriting, multiple-choice answers at the end of response), and behavioral attributes (answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The findings that LLMs plan ahead for the future in their hidden representations suggest potential applications for improving transparency and generation control.
zh
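下面用线性探针的最小示意说明"从 prompt 表示预测回复全局属性"的做法:对每条 prompt 抽取某一层的隐藏状态,训练逻辑回归分类器预测二元属性(如长/短回复);若测试准确率显著高于随机水平,即说明该属性已被 prompt 表示编码。示例中的隐藏状态为随机占位数据,实际需从 LLM 中抽取。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: 每条 prompt 最后一个 token 的隐藏状态 (n_samples, hidden_dim);此处用随机数占位
# y: 对应回复的全局属性标签,例如 0=短回复、1=长回复
X = np.random.randn(1000, 768)
y = np.random.randint(0, 2, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # 显著高于 0.5 则说明属性已被编码
```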
[NLP-41] K-ON: Stacking Knowledge On the Head Layer of Large Language Model AAAI2025
【速读】: 该论文旨在解决知识图谱(Knowledge Graph, KG)与自然语言处理(NLP)任务之间的粒度不匹配问题。论文的关键解决方案是提出K-ON方法,通过采用多个输出头层进行未来k步预测,将知识图谱的知识融入大规模语言模型(Large Language Models, LLMs)。K-ON不仅能够一步生成实体级别的结果,还支持针对实体的对比损失,后者是知识图谱表示学习中最有力的工具之一。
链接: https://arxiv.org/abs/2502.06257
作者: Lingbing Guo,Yichi Zhang,Zhongpu Bo,Zhuo Chen,Mengshu Sun,Zhiqiang Zhang,Wen Zhang,Huajun Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: AAAI 2025 (Oral)
点击查看摘要
Abstract:Recent advancements in large language models (LLMs) have significantly improved various natural language processing (NLP) tasks. Typically, LLMs are trained to predict the next token, aligning well with many NLP tasks. However, in knowledge graph (KG) scenarios, entities are the fundamental units and identifying an entity requires at least several tokens. This leads to a granularity mismatch between KGs and natural languages. To address this issue, we propose K-ON, which integrates KG knowledge into the LLM by employing multiple head layers for next k-step prediction. K-ON can not only generate entity-level results in one step, but also enables contrastive loss against entities, which is the most powerful tool in KG representation learning. Experimental results show that K-ON outperforms state-of-the-art methods that incorporate text and even the other modalities.
zh
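下面用 PyTorch 给出"在 LLM 头部叠加多个输出层、一次并行预测未来 k 步 token"这一思路的最小示意;它只展示多头结构本身,不含论文的实体级对比损失等细节,维度均为假设值。

```python
import torch
import torch.nn as nn

class MultiHeadNextK(nn.Module):
    """在最后隐藏状态上叠加 k 个线性头,一次性给出未来 k 步的词表分布(示意)。"""
    def __init__(self, hidden_dim: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, vocab_size) for _ in range(k))

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (batch, hidden_dim)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (batch, k, vocab)

model = MultiHeadNextK(hidden_dim=768, vocab_size=32000, k=4)
logits = model(torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 4, 32000]):一步即可得到实体级(多 token)预测
```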
[NLP-42] Evaluating Entity Retrieval in Electronic Health Records: a Semantic Gap Perspective
【速读】: 该论文旨在解决电子健康记录(EHRs)中实体检索任务缺乏全面评估基准的问题。为了解决这一问题,论文提出开发并发布一个新的基准测试,特别关注语义鸿沟问题。解决方案的关键在于使用MIMIC-III数据集中的出院摘要,并结合与这些摘要相关的ICD代码和处方标签作为查询,利用GPT-4进行相关性判断标注,从而构建包含1,000份患者笔记、1,246个查询以及超过77,000个相关性标注的数据集。此外,引入了一种新的分类系统来评估相关匹配的语义类别,包括字符串、同义词、缩写、上下位关系和隐含意义,以提供对语义鸿沟的第一项评估。通过此基准测试,评估了几种检索方法,发现密集检索器在处理语义匹配方面表现尤为出色。
链接: https://arxiv.org/abs/2502.06252
作者: Zhengyun Zhao,Hongyi Yuan,Jingjing Liu,Haichao Chen,Huaiyuan Ying,Songchi Zhou,Sheng Yu
机构: Tsinghua University(清华大学); Peking Union Medical College(协和医学院)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Under review, and the dataset will be made public upon reception of our paper
点击查看摘要
Abstract:Entity retrieval plays a crucial role in the utilization of Electronic Health Records (EHRs) and is applied across a wide range of clinical practices. However, a comprehensive evaluation of this task is lacking due to the absence of a public benchmark. In this paper, we propose the development and release of a novel benchmark for evaluating entity retrieval in EHRs, with a particular focus on the semantic gap issue. Using discharge summaries from the MIMIC-III dataset, we incorporate ICD codes and prescription labels associated with the notes as queries, and annotate relevance judgments using GPT-4. In total, we use 1,000 patient notes, generate 1,246 queries, and provide over 77,000 relevance annotations. To offer the first assessment of the semantic gap, we introduce a novel classification system for relevance matches. Leveraging GPT-4, we categorize each relevant pair into one of five categories: string, synonym, abbreviation, hyponym, and implication. Using the proposed benchmark, we evaluate several retrieval methods, including BM25, query expansion, and state-of-the-art dense retrievers. Our findings show that BM25 provides a strong baseline but struggles with semantic matches. Query expansion significantly improves performance, though it slightly reduces string match capabilities. Dense retrievers outperform traditional methods, particularly for semantic matches, and general-domain dense retrievers often surpass those trained specifically in the biomedical domain.
zh
[NLP-43] Confidence Improves Self-Consistency in LLM s
【速读】: 该论文旨在解决自一致性解码(Self-consistency decoding)在提升大型语言模型(LLMs)推理任务性能时计算成本过高的问题。其关键解决方案是引入置信度感知的自一致性(Confidence-Informed Self-Consistency, CISC),通过基于模型直接获得的置信分数进行加权多数投票,优先考虑高置信度路径,从而以显著减少的推理路径样本量识别正确答案。
链接: https://arxiv.org/abs/2502.06233
作者: Amir Taubenfeld,Tom Sheffer,Eran Ofek,Amir Feder,Ariel Goldstein,Zorik Gekhman,Gal Yona
机构: Google Research(谷歌研究); The Hebrew University of Jerusalem(耶路撒冷希伯来大学); Columbia University(哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Self-consistency decoding enhances LLMs’ performance on reasoning tasks by sampling diverse reasoning paths and selecting the most frequent answer. However, it is computationally expensive, as sampling many of these (lengthy) paths is required to increase the chances that the correct answer emerges as the most frequent one. To address this, we introduce Confidence-Informed Self-Consistency (CISC). CISC performs a weighted majority vote based on confidence scores obtained directly from the model. By prioritizing high-confidence paths, it can identify the correct answer with a significantly smaller sample size. When tested on nine models and four datasets, CISC outperforms self-consistency in nearly all configurations, reducing the required number of reasoning paths by over 40% on average. In addition, we introduce the notion of within-question confidence evaluation, after showing that standard evaluation methods are poor predictors of success in distinguishing correct and incorrect answers to the same question. In fact, the most calibrated confidence method proved to be the least effective for CISC. Lastly, beyond these practical implications, our results and analyses show that LLMs can effectively judge the correctness of their own outputs, contributing to the ongoing debate on this topic.
zh
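CISC 的核心步骤可以概括为按置信度加权的多数投票。下面是一个最小示意;论文中置信度的获取与归一化另有具体设计,此处直接假设每条推理路径已带有置信分数。

```python
from collections import defaultdict

def cisc_vote(samples):
    """samples: [(answer, confidence), ...];返回加权得分最高的答案。"""
    scores = defaultdict(float)
    for answer, conf in samples:
        scores[answer] += conf          # 高置信路径的投票权重更大
    return max(scores, key=scores.get)

# 示例:5 条推理路径,答案 "42" 以总置信度 1.7 胜出
paths = [("42", 0.9), ("41", 0.4), ("42", 0.8), ("40", 0.3), ("41", 0.5)]
print(cisc_vote(paths))  # 42
```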
[NLP-44] Examining False Positives under Inference Scaling for Mathematical Reasoning
【速读】: 该论文旨在解决语言模型在数学推理任务中出现的虚假正解(false positive solutions)问题,即模型可能给出正确的最终答案,但其推理路径存在缺陷。关键在于通过系统分析不同开源模型、数据集难度级别以及解码策略下虚假正解的特性和程度,揭示这些错误解答如何影响语言模型的推理时间缩放行为。实验结果显示,即使采用基于采样的推理时间缩放方法也无法缓解此问题,且使用pass@N评估指标时,虚假正解更为显著,表明自动评估所显示的性能上限可能被高估。
链接: https://arxiv.org/abs/2502.06217
作者: Yu Wang,Nan Yang,Liang Wang,Furu Wei
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent advancements in language models have led to significant improvements in mathematical reasoning across various benchmarks. However, most of these benchmarks rely on automatic evaluation methods that only compare final answers using heuristics, without verifying the underlying reasoning steps. This limitation results in false positive solutions, where models may produce correct final answers but with flawed deduction paths. In this paper, we systematically examine the prevalence of false positive solutions in mathematical problem solving for language models. We analyze the characteristics and extent of this issue across different open-source models, datasets of varying difficulty levels, and decoding strategies. Specifically, we explore how false positives influence the inference time scaling behavior of language models. Our experimental results reveal that: (1) false positive solutions persist across different models, datasets, and decoding methods, (2) sampling-based inference time scaling methods do not alleviate the problem, and (3) the pass@N evaluation metric is more susceptible to false positives, suggesting a significantly lower scaling ceiling than what automatic evaluations indicate. Additionally, we analyze specific instances of false positives and discuss potential limitations in self-improvement techniques and synthetic data generation under such conditions.
zh
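作为参照,下面给出 pass@k 的标准无偏估计(Chen et al., 2021)的实现。该指标只核对最终答案,正如论文所指出的,这使其容易被"答案对、推理错"的假阳性解推高。

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """n 个采样解中有 c 个最终答案正确时,pass@k 的无偏估计。"""
    if n - c < k:
        return 1.0                      # 错误解不足 k 个,必然抽到正确解
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# 例:采样 n=20 条解答、c=3 条最终答案"正确"(其中可能混入假阳性)
print(round(pass_at_k(20, 3, 5), 3))   # 0.601
```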
[NLP-45] LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks
【速读】: 该论文旨在解决大型语言模型(LLMs)在软件工程(SE)基准测试中的数据泄露问题。数据泄露可能导致评估结果的有效性受到质疑。论文的关键解决方案是引入LessLeak-Bench,这是一个移除了泄露样本的新基准,从而实现更可靠的LLM评估。通过这项研究,作者增强了对SE基准中数据泄露的理解,并为未来涉及LLMs的SE研究提供了有价值的见解。
链接: https://arxiv.org/abs/2502.06215
作者: Xin Zhou,Martin Weyssow,Ratnadira Widyasari,Ting Zhang,Junda He,Yunbo Lyu,Jianming Chang,Beiqi Zhang,Dan Huang,David Lo
机构: Singapore Management University (新加坡管理大学); Southeast University (东南大学); Wuhan University (武汉大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 25 pages
点击查看摘要
Abstract:Large Language Models (LLMs) are widely utilized in software engineering (SE) tasks, such as code generation and automated program repair. However, their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage, where the evaluation benchmark data is unintentionally "seen" by LLMs during the model’s construction phase. The data leakage issue could largely undermine the validity of LLM-based research and evaluations. Despite the increasing use of LLMs in the SE community, there is no comprehensive study that assesses the extent of data leakage in SE benchmarks for LLMs yet. To address this gap, this paper presents the first large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs. Our results show that in general, data leakage in SE benchmarks is minimal, with average leakage ratios of only 4.8%, 2.8%, and 0.7% for Python, Java, and C/C++ benchmarks, respectively. However, some benchmarks exhibit relatively higher leakage ratios, which raises concerns about their bias in evaluation. For instance, QuixBugs and BigCloneBench have leakage ratios of 100.0% and 55.7%, respectively. Furthermore, we observe that data leakage has a substantial impact on LLM evaluation. We also identify key causes of high data leakage, such as the direct inclusion of benchmark data in pre-training datasets and the use of coding platforms like LeetCode for benchmark construction. To address the data leakage, we introduce LessLeak-Bench, a new benchmark that removes leaked samples from the 83 SE benchmarks, enabling more reliable LLM evaluations in future research. Our study enhances the understanding of data leakage in SE benchmarks and provides valuable insights for future research involving LLMs in SE.
zh
[NLP-46] Unveiling the Capabilities of Large Language Models in Detecting Offensive Language with Annotation Disagreement ACL2025
【速读】: 该论文旨在解决真实数据集中人类标注不一致所带来的挑战,这类标注不一致的样本因其语义模糊而难以检测。关键在于系统评估大型语言模型(LLMs)在处理标注不一致样本时的检测能力,并分析模型置信度与标注一致性之间的关系,以此为改进基于LLMs的冒犯性语言检测方法提供指导。
链接: https://arxiv.org/abs/2502.06207
作者: Junyu Lu,Kai Ma,Kaichun Wang,Kelaiti Xiao,Roy Ka-Wei Lee,Bo Xu,Liang Yang,Hongfei Lin
机构: Dalian University of Technology (大连理工大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, submitted to the ACL 2025
点击查看摘要
Abstract:LLMs are widely used for offensive language detection due to their advanced capability. However, the challenges posed by human annotation disagreement in real-world datasets remain underexplored. These disagreement samples are difficult to detect due to their ambiguous nature. Additionally, the confidence of LLMs in processing disagreement samples can provide valuable insights into their alignment with human annotators. To address this gap, we systematically evaluate the ability of LLMs to detect offensive language with annotation disagreement. We compare the binary accuracy of multiple LLMs across varying annotation agreement levels and analyze the relationship between LLM confidence and annotation agreement. Furthermore, we investigate the impact of disagreement samples on LLM decision-making during few-shot learning and instruction fine-tuning. Our findings highlight the challenges posed by disagreement samples and offer guidance for improving LLM-based offensive language detection.
zh
[NLP-47] C-3PO: Compact Plug-and-Play Proxy Optimization to Achieve Human-like Retrieval-Augmented Generation
【速读】: 该论文旨在解决检索增强生成(Retrieval-augmented generation, RAG)系统中独立开发的检索器与大型语言模型(LLMs)之间的对齐难题。现有方法通常通过修改组件或引入简单的中间模块来应对,这导致了实际应用中的局限性和次优性能。论文的关键解决方案是提出C-3PO框架,这是一种以代理为中心的方法,通过轻量级多智能体系统促进检索器与LLMs之间的通信。C-3PO框架实现了三个专门的智能体协同工作,优化整个RAG流程而不改变检索器和LLMs。这些智能体共同评估检索需求、生成有效查询以及选择适合LLMs的信息。为了实现有效的多智能体协调,文中还开发了一种树状结构的滚动策略用于强化学习中的奖励分配。实验结果表明,C-3PO显著提升了RAG系统的性能,并保持了即插即用的灵活性和优越的泛化能力。
链接: https://arxiv.org/abs/2502.06205
作者: Guoxin Chen,Minpeng Liao,Peiying Yu,Dingmin Wang,Zile Qiao,Chao Yang,Xin Zhao,Kai Fan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Ongoing work
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) systems face a fundamental challenge in aligning independently developed retrievers and large language models (LLMs). Existing approaches typically involve modifying either component or introducing simple intermediate modules, resulting in practical limitations and sub-optimal performance. Inspired by human search behavior – typically involving a back-and-forth process of proposing search queries and reviewing documents, we propose C-3PO, a proxy-centric framework that facilitates communication between retrievers and LLMs through a lightweight multi-agent system. Our framework implements three specialized agents that collaboratively optimize the entire RAG pipeline without altering the retriever and LLMs. These agents work together to assess the need for retrieval, generate effective queries, and select information suitable for the LLMs. To enable effective multi-agent coordination, we develop a tree-structured rollout approach for reward credit assignment in reinforcement learning. Extensive experiments in both in-domain and out-of-distribution scenarios demonstrate that C-3PO significantly enhances RAG performance while maintaining plug-and-play flexibility and superior generalization capabilities.
zh
[NLP-48] Non-literal Understanding of Number Words by Language Models
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在理解数字时是否与人类相似,特别关注夸张(hyperbole)和语用光环效应(pragmatic halo effects)。研究发现,LLMs在理解过程中与人类存在显著差异,这些差异并非源于先验知识的不同,而是处理先验知识的方式不同。关键解决方案在于通过受理性话语行为(Rational Speech Act, RSA)模型启发的链式思维提示(chain-of-thought prompting),使LLMs的解读更加接近人类认知。
链接: https://arxiv.org/abs/2502.06204
作者: Polina Tsvilodub,Kanishk Gandhi,Haoran Zhao,Jan-Philipp Fränken,Michael Franke,Noah D. Goodman
机构: Stanford University (斯坦福大学); University of Tübingen, Germany (德国图宾根大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注: 12 pages, 10 figures
点击查看摘要
Abstract:Humans naturally interpret numbers non-literally, effortlessly combining context, world knowledge, and speaker intent. We investigate whether large language models (LLMs) interpret numbers similarly, focusing on hyperbole and pragmatic halo effects. Through systematic comparison with human data and computational models of pragmatic reasoning, we find that LLMs diverge from human interpretation in striking ways. By decomposing pragmatic reasoning into testable components, grounded in the Rational Speech Act framework, we pinpoint where LLM processing diverges from human cognition – not in prior knowledge, but in reasoning with it. This insight leads us to develop a targeted solution – chain-of-thought prompting inspired by an RSA model makes LLMs’ interpretations more human-like. Our work demonstrates how computational cognitive models can both diagnose AI-human differences and guide development of more human-like language understanding capabilities.
zh
[NLP-49] Discourse-Driven Evaluation: Unveiling Factual Inconsistency in Long Document Summarization NAACL2025
【速读】: 该论文旨在解决长文档摘要中的事实不一致检测问题,特别是在复杂结构源文章和长摘要长度的情况下。解决方案的关键在于提出一个框架,该框架将长文本分解为基于话语分析的片段,并利用话语信息来更好地聚合自然语言推理模型预测的句子级评分。通过这种方法,论文展示了在多种评估基准上超越不同模型基线的改进性能,尤其关注于长文档摘要的事实不一致评分。
链接: https://arxiv.org/abs/2502.06185
作者: Yang Zhong,Diane Litman
机构: University of Pittsburgh (匹兹堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NAACL 2025 camera-ready version
点击查看摘要
Abstract:Detecting factual inconsistency for long document summarization remains challenging, given the complex structure of the source article and long summary length. In this work, we study factual inconsistency errors and connect them with a line of discourse analysis. We find that errors are more common in complex sentences and are associated with several discourse features. We propose a framework that decomposes long texts into discourse-inspired chunks and utilizes discourse information to better aggregate sentence-level scores predicted by natural language inference models. Our approach shows improved performance on top of different model baselines over several evaluation benchmarks, covering rich domains of texts, focusing on long document summarization. This underscores the significance of incorporating discourse features in developing models for scoring summaries for long document factual inconsistency.
zh
[NLP-50] RideKE: Leveraging Low-Resource User-Generated Twitter Content for Sentiment and Emotion Detection in Kenyan Code-Switched Dataset WASSA2024
【速读】: 该论文旨在解决低资源语言环境下(特别是肯尼亚的代码转换数据)利用Twitter进行情感和情绪分类的挑战。主要困难在于相关内容稀缺且质量不佳,加之俚语和代码转换等语言使用上的差异,使得识别这些语言的推文十分困难;而Twitter主要支持高资源语言,进一步加大了处理低资源语言数据的难度。论文的关键解决方案在于评估四种最先进的基于Transformer的预训练模型(XLM-R, DistilBERT, mBERT, AfriBERTa)在监督和半监督方法下的表现,并发现XLM-R模型在情感分析中表现出色,无论监督还是半监督模式均优于其他模型。
链接: https://arxiv.org/abs/2502.06180
作者: Naome A. Etori,Maria L. Gini
机构: University of Minnesota - Twin Cities (明尼苏达大学双城分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in WASSA 2024
点击查看摘要
Abstract:Social media has become a crucial open-access platform for individuals to express opinions and share experiences. However, leveraging low-resource language data from Twitter is challenging due to scarce, poor-quality content and the major variations in language use, such as slang and code-switching. Identifying tweets in these languages can be difficult as Twitter primarily supports high-resource languages. We analyze Kenyan code-switched data and evaluate four state-of-the-art (SOTA) transformer-based pretrained models for sentiment and emotion classification, using supervised and semi-supervised methods. We detail the methodology behind data collection and annotation, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms other models; for sentiment analysis, XLM-R supervised model achieves the highest accuracy (69.2%) and F1 score (66.1%), XLM-R semi-supervised (67.2% accuracy, 64.1% F1 score). In emotion analysis, DistilBERT supervised leads in accuracy (59.8%) and F1 score (31%), mBERT semi-supervised (accuracy (59% and F1 score 26.5%). AfriBERTa models show the lowest accuracy and F1 scores. All models tend to predict neutral sentiment, with Afri-BERT showing the highest bias and unique sensitivity to empathy emotion. this https URL
zh
[NLP-51] Uncertainty-Aware Adaptation of Large Language Models for Protein-Protein Interaction Analysis
【速读】: 该论文旨在解决在使用大型语言模型(Large Language Models, LLMs)预测蛋白质结构和相互作用时存在的不确定性问题,以提高生物医学应用中结果的可重复性。关键解决方案在于提出了一种结合LoRA集成和贝叶斯LoRA模型的不确定性感知型LLM适应方法,用于蛋白质相互作用(PPI)分析,并通过不确定性量化(UQ)确保对蛋白质行为的置信度校准,从而在不同疾病背景下实现可靠的PPI识别,增强计算生物学中的可信度和可重复性。
链接: https://arxiv.org/abs/2502.06173
作者: Sanket Jantre,Tianle Wang,Gilchan Park,Kriti Chopra,Nicholas Jeon,Xiaoning Qian,Nathan M. Urban,Byung-Jun Yoon
机构: Computing and Data Sciences Directorate, Brookhaven National Laboratory (计算与数据科学理事会,布鲁克黑文国家实验室); Department of Electrical and Computer Engineering, Texas A&M University (电气与计算机工程系,德克萨斯A&M大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Identification of protein-protein interactions (PPIs) helps derive cellular mechanistic understanding, particularly in the context of complex conditions such as neurodegenerative disorders, metabolic syndromes, and cancer. Large Language Models (LLMs) have demonstrated remarkable potential in predicting protein structures and interactions via automated mining of vast biomedical literature; yet their inherent uncertainty remains a key challenge for deriving reproducible findings, critical for biomedical applications. In this study, we present an uncertainty-aware adaptation of LLMs for PPI analysis, leveraging fine-tuned LLaMA-3 and BioMedGPT models. To enhance prediction reliability, we integrate LoRA ensembles and Bayesian LoRA models for uncertainty quantification (UQ), ensuring confidence-calibrated insights into protein behavior. Our approach achieves competitive performance in PPI identification across diverse disease contexts while addressing model uncertainty, thereby enhancing trustworthiness and reproducibility in computational biology. These findings underscore the potential of uncertainty-aware LLM adaptation for advancing precision medicine and biomedical research.
zh
[NLP-52] Universal Approximation of Visual Autoregressive Transformers
【速读】: 该论文旨在探究基于变换器的预训练模型在图像生成任务中的基本极限,并特别关注Visual Autoregressive (VAR) 变换器。论文的关键贡献在于证明单头VAR变换器在单一自注意力层和单一插值层的情况下具有通用性。从统计学角度来看,这些简单的VAR变换器能够作为任何图像到图像Lipschitz函数的通用近似器。此外,流式自回归变换器也展现出类似的近似能力。因此,论文提供了一种有效且计算高效的VAR变换器策略设计原则,可将其应用扩展至更复杂的图像生成及其他相关领域。简而言之,该论文通过理论分析解决了VAR变换器在图像生成任务中的效能与通用性问题,其关键是证明了简单VAR变换器的通用近似能力。
链接: https://arxiv.org/abs/2502.06167
作者: Yifang Chen,Xiaoyu Li,Yingyu Liang,Zhenmei Shi,Zhao Song
机构: The University of Chicago (芝加哥大学); University of New South Wales (新南威尔士大学); The University of Hong Kong (香港大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); The Simons Institute for the Theory of Computing at UC Berkeley (伯克利加州大学西蒙斯计算理论研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine "next-scale prediction" framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while having state-of-the-art performance for image synthesis tasks. Our primary contributions establish that, for single-head VAR transformers with a single self-attention layer and single interpolation layer, the VAR Transformer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any image-to-image Lipschitz functions. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas.
zh
[NLP-53] Scaling Public Health Text Annotation: Zero-Shot Learning vs. Crowdsourcing for Improved Efficiency and Labeling Accuracy
【速读】: 该论文旨在解决通过社会媒体数据研究健康相关行为时,手动标注数据劳动密集且成本高的问题。解决方案的关键在于利用大规模语言模型(Large Language Models, LLMs)进行零样本标注(zero-shot labeling),以期在睡眠障碍、体力活动和久坐行为相关的推文分类任务中达到或超越传统众包标注的性能,并显著减少标注时间。研究表明,在简单分类任务中,LLMs可以媲美人类的表现,但在需要更精细领域知识的任务中准确性有所下降。
链接: https://arxiv.org/abs/2502.06150
作者: Kamyar Kazari,Yong Chen,Zahra Shakeri
机构: Institute of Health Policy, Management, and Evaluation (IHPME), Dalla Lana School of Public Health, University of Toronto (多伦多大学), Canada
类目: Computation and Language (cs.CL)
备注: 4 pages, 1 figure
点击查看摘要
Abstract:Public health researchers are increasingly interested in using social media data to study health-related behaviors, but manually labeling this data can be labor-intensive and costly. This study explores whether zero-shot labeling using large language models (LLMs) can match or surpass conventional crowd-sourced annotation for Twitter posts related to sleep disorders, physical activity, and sedentary behavior. Multiple annotation pipelines were designed to compare labels produced by domain experts, crowd workers, and LLM-driven approaches under varied prompt-engineering strategies. Our findings indicate that LLMs can rival human performance in straightforward classification tasks and significantly reduce labeling time, yet their accuracy diminishes for tasks requiring more nuanced domain knowledge. These results clarify the trade-offs between automated scalability and human expertise, demonstrating conditions under which LLM-based labeling can be efficiently integrated into public health research without undermining label quality.
zh
[NLP-54] Optimizing Knowledge Integration in Retrieval-Augmented Generation with Self-Selection
【速读】: 该论文旨在解决如何有效整合大型语言模型(LLMs)内部参数知识与外部检索知识的问题。关键解决方案在于提出了一种新颖的Self-Selection RAG框架,并设计了Self-Selection-RGP方法,通过直接偏好优化(DPO)训练LLM,使其能够在仅凭内部参数知识生成的回答与结合外部检索知识生成的回答这两类成对响应中进行选择,从而提升生成和选择正确答案的能力。实验结果验证了该方法在Natural Questions (NQ) 和 TriviaQA 数据集上的优越性。
链接: https://arxiv.org/abs/2502.06148
作者: Yan Weng,Fengbin Zhu,Tong Ye,Haoyan Liu,Fuli Feng,Tat-Seng Chua
机构: University of Science and Technology of China(中国科学技术大学); Hefei(合肥); National University of Singapore(新加坡国立大学); Institute of Dataspace, Hefei Comprehensive National Science Center(合肥综合性国家科学中心数据空间研究所)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 12 pages, 6 figures
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG), which integrates external knowledge into Large Language Models (LLMs), has proven effective in enabling LLMs to produce more accurate and reliable responses. However, it remains a significant challenge how to effectively integrate external retrieved knowledge with internal parametric knowledge in LLMs. In this work, we propose a novel Self-Selection RAG framework, where the LLM is made to select from pairwise responses generated with internal parametric knowledge solely and with external retrieved knowledge together to achieve enhanced accuracy. To this end, we devise a Self-Selection-RGP method to enhance the capabilities of the LLM in both generating and selecting the correct answer, by training the LLM with Direct Preference Optimization (DPO) over a curated Retrieval Generation Preference (RGP) dataset. Experimental results with two open-source LLMs (i.e., Llama2-13B-Chat and Mistral-7B) well demonstrate the superiority of our approach over other baseline methods on Natural Questions (NQ) and TriviaQA datasets.
zh
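Self-Selection-RGP 的训练目标是标准的 DPO 损失。下面给出其 PyTorch 最小示意:输入为策略模型与参考模型对"被选/被拒"回复的序列对数似然,示例数值仅作演示。

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """标准 DPO 损失:鼓励策略模型相对参考模型更偏好被选回复。"""
    pi_logratio = logp_chosen - logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

# 演示:batch=3 的序列对数似然
lc, lr = torch.tensor([-5.0, -6.0, -4.0]), torch.tensor([-7.0, -6.5, -9.0])
rc, rr = torch.tensor([-5.5, -6.2, -4.1]), torch.tensor([-6.8, -6.4, -8.5])
print(dpo_loss(lc, lr, rc, rr))
```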
[NLP-55] LegalViz: Legal Text Visualization by Text To Diagram Generation NAACL2025
【速读】: 该论文旨在解决法律文档(Legal Documents)理解难度高的问题,特别是判决书和法院命令,这些文档需要高度专业的法律知识。为使非专业人士能够获取专家知识,论文提出了一种创新的数据集LegalViz,包含23种语言的7,010个法律文档及其可视化配对实例,使用Graphviz的DOT图形描述语言。解决方案的关键在于通过简单的图表从复杂的法律文本中提取并展示关键法律实体、交易、法律来源及陈述,从而使得每项判决中的核心要素一目了然。此外,论文还提出了新的评估指标来衡量法律图表可视化的质量,包括考虑图结构、文本相似性和法律内容等因素,并在少样本和微调大规模语言模型生成法律图表方面进行了实证研究。研究表明,使用LegalViz训练的模型优于现有模型,包括GPT系列,验证了该数据集的有效性。
链接: https://arxiv.org/abs/2502.06147
作者: Eri Onami,Taiki Miyanishi,Koki Maeda,Shuhei Kurita
机构: Nara Institute of Science and Technology(NAIST); RIKEN AIP; The University of Tokyo; Institute of Science Tokyo; National Institute of Informatics; ATR; NII LLMC
类目: Computation and Language (cs.CL)
备注: NAACL2025
点击查看摘要
Abstract:Legal documents including judgments and court orders require highly sophisticated legal knowledge for understanding. To disclose expert knowledge for non-experts, we explore the problem of visualizing legal texts with easy-to-understand diagrams and propose a novel dataset of LegalViz with 23 languages and 7,010 cases of legal document and visualization pairs, using the DOT graph description language of Graphviz. LegalViz provides a simple diagram from a complicated legal corpus identifying legal entities, transactions, legal sources, and statements at a glance, that are essential in each judgment. In addition, we provide new evaluation metrics for the legal diagram visualization by considering graph structures, textual similarities, and legal contents. We conducted empirical studies on few-shot and finetuning large language models for generating legal diagrams and evaluated them with these metrics, including legal content-based evaluation within 23 languages. Models trained with LegalViz outperform existing models including GPTs, confirming the effectiveness of our dataset.
zh
[NLP-56] LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLM s NAACL2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长篇上下文时,因固定长度位置嵌入导致的效率限制以及处理长序列时计算成本呈二次增长的问题。关键解决方案是提出了一种名为Long-form Context Injection with Recurrent Compression (LCIRC)的方法,通过循环压缩技术在不重新训练整个模型的情况下,高效处理超出模型长度限制的长篇序列。此外,引入了查询相关上下文建模,选择性地压缩与查询相关的有用信息,确保模型保留最相关的内容。
链接: https://arxiv.org/abs/2502.06139
作者: Sumin An,Junyoung Sung,Wonpyo Park,Chanjun Park,Paul Hongsuck Seo
机构: Korea University(韩国大学); Google(谷歌)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025 Main
点击查看摘要
Abstract:While large language models (LLMs) excel in generating coherent and contextually rich outputs, their capacity to efficiently handle long-form contexts is limited by fixed-length position embeddings. Additionally, the computational cost of processing long sequences increases quadratically, making it challenging to extend context length. To address these challenges, we propose Long-form Context Injection with Recurrent Compression (LCIRC), a method that enables the efficient processing long-form sequences beyond the model’s length limit through recurrent compression without retraining the entire model. We further introduce query dependent context modeling, which selectively compresses query-relevant information, ensuring that the model retains the most pertinent content. Our empirical results demonstrate that Query Dependent LCIRC (QD-LCIRC) significantly improves LLM’s ability to manage extended contexts, making it well-suited for tasks that require both comprehensive context understanding and query relevance.
zh
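下面用交叉注意力与一组可学习的"压缩槽"给出"循环压缩长上下文"思路的粗略草图:逐块读入长文,每步都把上一步的压缩记忆与新块一并重新压缩为固定长度表示。模块结构与超参均为假设,并非论文的实际架构;论文中的查询相关建模也未包含在内。

```python
import torch
import torch.nn as nn

class RecurrentCompressor(nn.Module):
    """循环压缩示意:固定 n_slots 个槽位,通过交叉注意力吸收每个新块的信息。"""
    def __init__(self, dim: int = 768, n_slots: int = 64, n_heads: int = 8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, chunks):                 # chunks: list of (batch, seq, dim)
        mem = self.slots.unsqueeze(0).expand(chunks[0].size(0), -1, -1)
        for chunk in chunks:                   # 递归:旧记忆与新块拼接后再压缩
            kv = torch.cat([mem, chunk], dim=1)
            mem, _ = self.attn(mem, kv, kv)
        return mem                             # (batch, n_slots, dim) 固定长度摘要

comp = RecurrentCompressor()
chunks = [torch.randn(2, 512, 768) for _ in range(4)]  # 2048 token 长文切成 4 块
print(comp(chunks).shape)                      # torch.Size([2, 64, 768])
```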
[NLP-57] Enhancing Document Key Information Localization Through Data Augmentation AAAI2025
【速读】: 该论文旨在解决在数字和手写文档中定位关键信息的问题。解决方案的关键在于引入了一个包含文档增强阶段和目标检测阶段的流程。具体而言,通过模仿手写文档的外观来扩充数字文档的训练集,从而提升模型的泛化能力并实现高性能的目标检测。
链接: https://arxiv.org/abs/2502.06132
作者: Yue Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted as a workshop paper in DOCUI-AAAI2025
点击查看摘要
Abstract:The Visually Rich Form Document Intelligence and Understanding (VRDIU) Track B focuses on the localization of key information in document images. The goal is to develop a method capable of localizing objects in both digital and handwritten documents, using only digital documents for training. This paper presents a simple yet effective approach that includes a document augmentation phase and an object detection phase. Specifically, we augment the training set of digital documents by mimicking the appearance of handwritten documents. Our experiments demonstrate that this pipeline enhances the models’ generalization ability and achieves high performance in the competition.
zh
[NLP-58] Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models ICLR2025
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多模态任务中易产生与视觉输入不一致的幻觉文本响应的问题,从而限制其在实际应用中的有效性。解决方案的关键在于引入了一种无需训练的算法——解码带生成反馈(Decoding with Generative Feedback, DeGF),通过利用文本到图像生成模型提供的反馈,在解码过程中有效减轻幻觉现象。具体而言,DeGF从LVLMs生成的初始响应中生成一幅图像,作为辅助视觉参考,并通过互补或对比解码提供自我反馈以验证和修正初始响应。
链接: https://arxiv.org/abs/2502.06130
作者: Ce Zhang,Zifu Wan,Zhehan Kan,Martin Q. Ma,Simon Stepputtis,Deva Ramanan,Russ Salakhutdinov,Louis-Philippe Morency,Katia Sycara,Yaqi Xie
机构: School of Computer Science, Carnegie Mellon University (卡内基梅隆大学); Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted by ICLR 2025. Project page: this https URL
点击查看摘要
Abstract:While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at this https URL.
zh
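DeGF 在解码时把文生图反馈作为第二路证据,以互补或对比方式与初始 logits 融合。下面是该融合步骤的简化示意,融合公式采用常见的对比解码形式,具体系数与实现细节以论文为准。

```python
import torch

def degf_fuse(logits_orig: torch.Tensor, logits_fb: torch.Tensor,
              alpha: float = 1.0, mode: str = "contrastive") -> torch.Tensor:
    """融合初始 logits 与反馈 logits,两者形状均为 (batch, vocab)。"""
    if mode == "complementary":
        return logits_orig + alpha * logits_fb             # 互补:两路证据叠加
    return (1 + alpha) * logits_orig - alpha * logits_fb   # 对比:抑制缺乏视觉依据的 token

next_token = degf_fuse(torch.randn(1, 32000), torch.randn(1, 32000)).argmax(-1)
print(next_token.shape)  # torch.Size([1])
```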
[NLP-59] Task-driven Layerwise Additive Activation Intervention NAACL2025
【速读】: 该论文旨在解决现代语言模型(Language Models, LMs)在实时应用中适应新上下文能力不足的问题。论文的关键解决方案在于提出一种逐层加性激活干预(layer-wise additive activation intervention)框架,通过优化干预过程来提高样本效率,从而在预训练基础上提升语言模型的准确性,并超越现有的干预基准方法。
链接: https://arxiv.org/abs/2502.06115
作者: Hieu Trung Nguyen,Bao Nguyen,Binh Nguyen,Viet Anh Nguyen
机构: The Chinese University of Hong Kong(香港中文大学); National University of Singapore(新加坡国立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to NAACL 2025
点击查看摘要
Abstract:Modern language models (LMs) have significantly advanced generative modeling in natural language processing (NLP). Despite their success, LMs often struggle with adaptation to new contexts in real-time applications. A promising approach to task adaptation is activation intervention, which steers the LMs’ generation process by identifying and manipulating the activations. However, existing interventions are highly dependent on heuristic rules or require many prompt inputs to determine effective interventions. This paper proposes a layer-wise additive activation intervention framework that optimizes the intervention process, thus enhancing the sample efficiency. We benchmark our framework on various datasets, demonstrating improvements in the accuracy of pre-trained LMs and competing intervention baselines.
zh
[NLP-60] Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks
【速读】: 该论文旨在探索模型内部的训练动态,以揭示学习过程中的机制。论文的关键解决方案是提出了一种名为电路调谐(Circuit-tuning)的两阶段算法,该算法通过迭代地进行电路发现(circuit discovery),屏蔽无关边,并更新负责特定任务的剩余参数,从而提高模型性能并保持其通用能力。
链接: https://arxiv.org/abs/2502.06106
作者: Yueyan Li,Caixia Yuan,Xiaojie Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The study of mechanistic interpretability aims to reverse-engineer a model to explain its behaviors. While recent studies have focused on the static mechanism of a certain behavior, the training dynamics inside a model remain to be explored. In this work, we develop an interpretable method for fine-tuning and reveal the mechanism behind learning. We first propose the concept of node redundancy as an extension of intrinsic dimension and explain the idea behind circuit discovery from a fresh view. Based on the theory, we propose circuit-tuning, a two-stage algorithm that iteratively performs circuit discovery to mask out irrelevant edges and updates the remaining parameters responsible for a specific task. Experiments show that our method not only improves performance on a wide range of tasks but is also scalable while preserving general capabilities. We visualize and analyze the circuits before, during, and after fine-tuning, providing new insights into the self-organization mechanism of a neural network in the learning process.
zh
[NLP-61] RALLRec: Improving Retrieval Augmented Large Language Model Recommendation with Representation Learning WWW’25
【速读】: 该论文旨在解决现有检索增强生成(Retrieval Augmented Generation, RAG)方法主要依赖文本语义而难以整合最相关项目的问题,从而限制系统性能。关键解决方案在于提出了一种名为Representation learning for retrieval-Augmented Large Language model Recommendation (RALLRec)的方法。该方法通过提示大型语言模型(LLMs)生成更详细的项目描述,结合从LLMs和推荐模型中提取的文本和协同过滤语义进行联合表示学习,并引入一种简单的重排序方法以捕捉用户偏好的动态变化。
链接: https://arxiv.org/abs/2502.06101
作者: Jian Xu,Sichun Luo,Xiangyu Chen,Haoming Huang,Hanxu Hou,Linqi Song
机构: Tsinghua University(清华大学); City University of Hong Kong(香港城市大学); Dongguan University of Technology(东莞理工学院); Alibaba Group(阿里巴巴集团)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted by TheWebConf’25 (WWW’25) as a Short Paper
点击查看摘要
Abstract:Large Language Models (LLMs) have been integrated into recommendation systems to enhance user behavior comprehension. The Retrieval Augmented Generation (RAG) technique is further incorporated into these systems to retrieve more relevant items and improve system performance. However, existing RAG methods rely primarily on textual semantics and often fail to incorporate the most relevant items, limiting the effectiveness of the systems. In this paper, we propose Representation learning for retrieval-Augmented Large Language model Recommendation (RALLRec). Specifically, we enhance textual semantics by prompting LLMs to generate more detailed item descriptions, followed by joint representation learning of textual and collaborative semantics, which are extracted by the LLM and recommendation models, respectively. Considering the potential time-varying characteristics of user interest, a simple yet effective reranking method is further introduced to capture the dynamics of user preference. We conducted extensive experiments on three real-world datasets, and the evaluation results validated the effectiveness of our method. Code is made public at this https URL.
zh
[NLP-62] ConMeC: A Dataset for Metonymy Resolution with Common Nouns NAACL2025
【速读】: 该论文旨在解决在自然语言处理(NLP)系统中识别常见名词(如desk、baby和school)的转喻(metonymy)使用问题。此前的研究主要集中在命名实体的转喻消解,而忽视了常见名词中的转喻现象。论文的关键在于创建了一个包含6,000个句子的新数据集ConMeC,并引入了一种基于链式思维提示的方法,用于利用大型语言模型(LLMs)检测转喻。实验结果表明,LLMs在处理定义明确的转喻类别时可以达到与监督BERT模型相当的性能,但在需要细微语义理解的实例上仍存在挑战。
链接: https://arxiv.org/abs/2502.06087
作者: Saptarshi Ghosh,Tianyu Jiang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 12 tables, 3 figures, NAACL 2025
点击查看摘要
Abstract:Metonymy plays an important role in our daily communication. People naturally think about things using their most salient properties or commonly related concepts. For example, by saying “The bus decided to skip our stop today,” we actually mean that the bus driver made the decision, not the bus. Prior work on metonymy resolution has mainly focused on named entities. However, metonymy involving common nouns (such as desk, baby, and school) is also a frequent and challenging phenomenon. We argue that NLP systems should be capable of identifying the metonymic use of common nouns in context. We create a new metonymy dataset ConMeC, which consists of 6,000 sentences, where each sentence is paired with a target common noun and annotated by humans to indicate whether that common noun is used metonymically or not in that context. We also introduce a chain-of-thought based prompting method for detecting metonymy using large language models (LLMs). We evaluate our LLM-based pipeline, as well as a supervised BERT model on our dataset and three other metonymy datasets. Our experimental results demonstrate that LLMs could achieve performance comparable to the supervised BERT model on well-defined metonymy categories, while still struggling with instances requiring nuanced semantic understanding. Our dataset is publicly available at: this https URL.
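下面给出一个面向常见名词转喻检测的链式思维提示模板示意。提示措辞为自拟,并非论文附录中的原始 prompt。

```python
# 极简示意(假设):转喻检测的链式思维提示模板,措辞为自拟示例。
def build_cot_prompt(sentence: str, noun: str) -> str:
    return (
        f"Sentence: {sentence}\n"
        f"Target noun: {noun}\n"
        "Step 1: State the literal meaning of the target noun.\n"
        "Step 2: Explain what the noun actually refers to in this sentence.\n"
        "Step 3: If the literal referent and the actual referent differ, "
        "it is metonymic.\n"
        "Answer with 'metonymic' or 'literal', then give your reasoning."
    )

print(build_cot_prompt("The bus decided to skip our stop today.", "bus"))
```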
zh
[NLP-63] Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type NAACL2025
【速读】: 该论文旨在解决现有研究在评估概念组合过程中对属性处理(property handling)的局限性,并未全面考察属性的继承、新出现及取消现象。为填补这一空白,论文引入了“概念组合与属性类型数据集”(CCPT, Conceptual Combination with Property Type dataset),包含12.3K个标注的名词短语、属性及其属性类型的三元组。解决方案的关键在于利用CCPT建立三种任务来全面评估大型语言模型(LLMs)在概念组合中的表现,并提出了一种受认知心理学模型启发的方法,该方法能够改善所有生成任务中的性能。
链接: https://arxiv.org/abs/2502.06086
作者: Seokwon Song,Taehyun Lee,Jaewoo Ahn,Jae Hyuk Sung,Gunhee Kim
机构: Seoul National University(首尔国立大学); Korea University(韩国大学)
类目: Computation and Language (cs.CL)
备注: NAACL 2025; the dataset and experimental code are available at this https URL
点击查看摘要
Abstract:Conceptual combination is a cognitive process that merges basic concepts, enabling the creation of complex expressions. During this process, the properties of combination (e.g., the whiteness of a peeled apple) can be inherited from basic concepts, newly emerge, or be canceled. However, previous studies have evaluated a limited set of properties and have not examined the generative process. To address this gap, we introduce the Conceptual Combination with Property Type dataset (CCPT), which consists of 12.3K annotated triplets of noun phrases, properties, and property types. Using CCPT, we establish three types of tasks to evaluate LLMs for conceptual combination thoroughly. Our key findings are threefold: (1) Our automatic metric grading property emergence and cancellation closely corresponds with human judgments. (2) LLMs, including OpenAI’s o1, struggle to generate noun phrases which possess given emergent properties. (3) Our proposed method, inspired by cognitive psychology model that explains how relationships between concepts are formed, improves performances in all generative tasks. The dataset and experimental code are available at this https URL.
zh
[NLP-64] Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs
【速读】: 该论文旨在解决精神疾病污名这一持久的社会问题,通过分析相关数据以更清晰地理解其本质,但传统方法高度劳动密集。为应对这一挑战,论文提出的关键解决方案是设计一个聊天机器人与参与者进行对话,并利用AI辅助进行定性编码,进而基于这些编码结果构建因果知识图谱来解析污名现象。这种方法能够揭示个体回应中的模式,并展示数据集中心理构念之间的相互关系。
链接: https://arxiv.org/abs/2502.06075
作者: Han Meng,Renwen Zhang,Ganyi Wang,Yitian Yang,Peinuan Qin,Jungup Lee,Yi-Chieh Lee
机构: National University of Singapore (新加坡国立大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Conditionally accepted to CHI Conference on Human Factors in Computing Systems (CHI’25)
点击查看摘要
Abstract:Mental-illness stigma is a persistent social problem, hampering both treatment-seeking and recovery. Accordingly, there is a pressing need to understand it more clearly, but analyzing the relevant data is highly labor-intensive. Therefore, we designed a chatbot to engage participants in conversations; coded those conversations qualitatively with AI assistance; and, based on those coding results, built causal knowledge graphs to decode stigma. The results we obtained from 1,002 participants demonstrate that conversation with our chatbot can elicit rich information about people’s attitudes toward depression, while our AI-assisted coding was strongly consistent with human-expert coding. Our novel approach combining large language models (LLMs) and causal knowledge graphs uncovered patterns in individual responses and illustrated the interrelationships of psychological constructs in the dataset as a whole. The paper also discusses these findings’ implications for HCI researchers in developing digital interventions, decomposing human psychological constructs, and fostering inclusive attitudes.
zh
[NLP-65] Benchmarking Prompt Sensitivity in Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在不同提示表述下的敏感性问题,即如何预测和减轻提示微小变化对模型性能的影响。关键在于开发PromptSET数据集以及评估现有方法在处理提示敏感性预测任务上的有效性,并发现当前方法难以有效应对这一挑战。
链接: https://arxiv.org/abs/2502.06065
作者: Amirhossein Razavi,Mina Soltangheis,Negar Arabzadeh,Sara Salamat,Morteza Zihayat,Ebrahim Bagheri
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Large language Models (LLMs) are highly sensitive to variations in prompt formulation, which can significantly impact their ability to generate accurate responses. In this paper, we introduce a new task, Prompt Sensitivity Prediction, and a dataset PromptSET designed to investigate the effects of slight prompt variations on LLM performance. Using TriviaQA and HotpotQA datasets as the foundation of our work, we generate prompt variations and evaluate their effectiveness across multiple LLMs. We benchmark the prompt sensitivity prediction task employing state-of-the-art methods from related tasks, including LLM-based self-evaluation, text classification, and query performance prediction techniques. Our findings reveal that existing methods struggle to effectively address prompt sensitivity prediction, underscoring the need to understand how information needs should be phrased for accurate LLM responses.
zh
[NLP-66] Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning AAMAS2025
【速读】: 该论文旨在解决在多智能体系统中自然语言沟通的问题,特别是在没有大量人类演示的情况下训练模型以生成自然且有用的沟通策略。论文的关键在于将沟通问题分解为听和说,并利用代理的目标来预测关于环境的有用信息作为密集奖励信号,以此引导沟通。具体而言,通过让模型基于讨论预测环境信息来提升其聆听技能,同时利用多智能体强化学习通过奖励消息对其它代理的影响来提升其表达能力。这种方法使模型能够在复杂的社会环境中进行有效的交流,显著提高了胜率。
链接: https://arxiv.org/abs/2502.06060
作者: Bidipta Sarkar,Warren Xia,C. Karen Liu,Dorsa Sadigh
机构: Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 14 pages, 5 figures, 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)
点击查看摘要
Abstract:Communicating in natural language is a powerful tool in multi-agent settings, as it enables independent agents to share information in partially observable settings and allows zero-shot coordination with humans. However, most prior works are limited as they either rely on training with large amounts of human demonstrations or lack the ability to generate natural and useful communication strategies. In this work, we train language models to have productive discussions about their environment in natural language without any human demonstrations. We decompose the communication problem into listening and speaking. Our key idea is to leverage the agent’s goal to predict useful information about the world as a dense reward signal that guides communication. Specifically, we improve a model’s listening skills by training them to predict information about the environment based on discussions, and we simultaneously improve a model’s speaking skills with multi-agent reinforcement learning by rewarding messages based on their influence on other agents. To investigate the role and necessity of communication in complex social settings, we study an embodied social deduction game based on Among Us, where the key question to answer is the identity of an adversarial imposter. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions, doubling the win rates compared to standard RL. We release our code and models at this https URL
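下面给出“听”技能训练信号的极简示意:让模型根据讨论文本预测环境信息(如冒名顶替者身份),以交叉熵作监督,并把真实身份的预测概率作为密集奖励。编码器结构与数据均为占位假设。

```python
# 极简示意(假设):根据讨论预测环境信息,预测正确概率充当密集奖励。
import torch
import torch.nn as nn

vocab, hidden, num_players = 100, 32, 5
encoder = nn.Sequential(nn.Embedding(vocab, hidden), nn.Flatten(), nn.LazyLinear(num_players))

discussion = torch.randint(0, vocab, (8, 16))   # 8 段讨论,每段 16 个 token(占位)
imposter = torch.randint(0, num_players, (8,))  # 真实冒名顶替者身份(占位)

logits = encoder(discussion)
listen_loss = nn.functional.cross_entropy(logits, imposter)        # 监督"听"技能
reward = torch.softmax(logits, -1)[torch.arange(8), imposter]      # 密集奖励:P(真身份)
listen_loss.backward()
```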
zh
[NLP-67] LM2: Large Memory Models
【速读】: 该论文旨在解决标准Transformer在多步推理、关系论证及长上下文信息综合方面的局限性。解决方案的关键在于引入了一种辅助记忆模块,即Large Memory Model (LM2),它作为一个上下文表示库与输入标记通过交叉注意力机制进行交互,并通过门控机制进行更新。LM2保持了原始信息流的同时,整合了一个互补的记忆路径,从而增强了模型在多跳推理、数值推理和大规模上下文问答任务中的表现。
链接: https://arxiv.org/abs/2502.06049
作者: Jikun Kang,Wenqi Wu,Filippos Christianos,Alex J. Chan,Fraser Greenlee,George Thomas,Marvin Purtorab,Andy Toulis
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross attention and updating through gating mechanisms. To preserve the Transformer's general-purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in multi-hop inference, numerical reasoning, and large-context question-answering. On the MMLU dataset, it achieves a 5.0% improvement over a pre-trained vanilla model, demonstrating that its memory module does not degrade performance on general tasks. Further, in our analysis, we explore the memory interpretability, effectiveness of memory modules, and test-time behavior. Our findings emphasize the importance of explicit memory in enhancing Transformer architectures.
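下面用 PyTorch 给出 LM2 式记忆模块的极简示意:一组可学习记忆槽通过交叉注意力被 token 查询,读出结果经门控后与主通路相加,从而在保留原信息流的同时引入互补记忆路径。维度与门控形式为自拟假设。

```python
# 极简示意(假设):记忆槽 + 交叉注意力 + 门控的辅助记忆模块。
import torch
import torch.nn as nn

class MemoryModule(nn.Module):
    def __init__(self, d_model=64, n_slots=16, n_heads=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_slots, d_model))  # 可学习记忆库
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h):                      # h: (batch, seq, d_model)
        mem = self.memory.expand(h.size(0), -1, -1)
        read, _ = self.attn(h, mem, mem)       # token 作为 query 查询记忆(交叉注意力)
        g = torch.sigmoid(self.gate(torch.cat([h, read], -1)))  # 门控
        return h + g * read                    # 保留原信息流 + 互补记忆通路

x = torch.randn(2, 10, 64)
print(MemoryModule()(x).shape)  # torch.Size([2, 10, 64])
```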
zh
[NLP-68] Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
【速读】: 该论文旨在解决通过微调预训练模型在目标域上进行无监督预测任务时所面临的两个挑战:数据量有限导致的过拟合以及遗忘预训练数据及其通用知识的问题。关键解决方案在于将少量(如1%)的预训练数据注入到微调数据混合中,以避免遗忘并减轻过拟合现象。
链接: https://arxiv.org/abs/2502.06042
作者: Louis Bethune,David Grangier,Dan Busbridge,Eleonora Gualdoni,Marco Cuturi,Pierre Ablin
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 19 pages, 15 figures, preprint
点击查看摘要
Abstract:A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. We aim to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
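下面给出“向微调数据混合中注入少量预训练数据”的极简示意,默认按约 1% 的比例注入。采样与混合策略为自拟假设,论文中针对不同比例做了系统的缩放律测量。

```python
# 极简示意(假设):按目标比例把预训练样本混入微调数据,例如 1%。
import random

def mix_datasets(finetune_data, pretrain_data, pretrain_frac=0.01, seed=0):
    rng = random.Random(seed)
    # 使混合后预训练样本占比约为 pretrain_frac
    n_pre = max(1, int(len(finetune_data) * pretrain_frac / (1 - pretrain_frac)))
    mixture = list(finetune_data) + rng.choices(list(pretrain_data), k=n_pre)
    rng.shuffle(mixture)
    return mixture

ft = [f"target_{i}" for i in range(990)]
pt = [f"pretrain_{i}" for i in range(10000)]
mix = mix_datasets(ft, pt)
print(len(mix), sum(s.startswith("pretrain") for s in mix))  # 1000 10
```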
zh
[NLP-69] Analysis of LLM as a grammatical feature tagger for African American English NAACL2025
【速读】: 该论文旨在解决非洲裔美国人英语(African American English, AAE)在自然语言处理(NLP)中的独特挑战。研究系统比较了基于规则的模型、基于Transformer的模型及大型语言模型(LLMs)在识别AAE关键语法特征(如Habitual Be和Multiple Negation)方面的性能。研究的关键在于评估这些模型在零样本和少量样本策略下的句级二元分类任务表现,并揭示LLMs虽然相较于基线有所进步,但仍受近因偏差及文本中无关特征(如正式程度)的影响。因此,改进模型训练和架构调整以更好地适应AAE的独特语言特性成为解决问题的关键。
链接: https://arxiv.org/abs/2502.06004
作者: Rahul Porwal,Alice Rozet,Pryce Houck,Jotsna Gowda,Sarah Moeller,Kevin Tang
机构: University of Florida(佛罗里达大学); Heinrich Heine University Düsseldorf(杜塞尔多夫海因里希海涅大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, Accepted to “Findings of the Association for Computational Linguistics: NAACL 2025”
点击查看摘要
Abstract:African American English (AAE) presents unique challenges in natural language processing (NLP). This research systematically compares the performance of available NLP models–rule-based, transformer-based, and large language models (LLMs)–capable of identifying key grammatical features of AAE, namely Habitual Be and Multiple Negation. These features were selected for their distinct grammatical complexity and frequency of occurrence. The evaluation involved sentence-level binary classification tasks, using both zero-shot and few-shot strategies. The analysis reveals that while LLMs show promise compared to the baseline, they are influenced by biases such as recency and unrelated features in the text such as formality. This study highlights the necessity for improved model training and architectural adjustments to better accommodate AAE’s unique linguistic characteristics. Data and code are available.
zh
[NLP-70] Preventing Rogue Agents Improves Multi-Agent Collaboration
【速读】: 该论文旨在解决多智能体系统中单个智能体可能导致整个系统失败的问题。论文的关键解决方案在于监控智能体在动作预测过程中的行为,并在预测未来可能出现错误时进行干预。通过引入WhoDunitEnv环境以及在GovSim环境中进行实验,结果表明该方法可显著提高性能,最高分别带来17.4%和20%的提升。此外,详细分析显示所提出的监测器能够成功识别智能体的困惑点,而干预措施则有效阻止了错误的传播。
链接: https://arxiv.org/abs/2502.05986
作者: Ohav Barbi,Ori Yoran,Mor Geva
机构: Blavatnik School of Computer Science and AI, Tel Aviv University (布劳瓦特尼克计算机科学与人工智能学院,特拉维夫大学)
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Multi-agent systems, where specialized agents collaborate to solve a shared task hold great potential, from increased modularity to simulating complex environments. However, they also have a major caveat – a single agent can cause the entire system to fail. Consider a simple game where the knowledge to solve the task is distributed between agents, which share information in a communication channel. At each round, any of the agents can terminate the game and make the final prediction, even if they are uncertain about the outcome of their action. Detection of such rogue agents before they act may prevent the system's failure. In this work, we propose to monitor agents during action prediction and intervene when a future error is likely to occur. To test our approach, we introduce WhoDunitEnv, a multi-agent collaboration environment that allows modular control over task complexity and communication structure. Experiments on two variants of WhoDunitEnv and the GovSim environment for resource sustainability show that our approach leads to substantial performance gains up to 17.4% and 20%, respectively. Moreover, a thorough analysis shows that our monitors successfully identify critical points of agent confusion and our interventions effectively stop agent errors from propagating.
zh
[NLP-71] HamRaz: A Culture-Based Persian Conversation Dataset for Person-Centered Therapy Using LLM Agents
【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)在心理治疗应用中忽视波斯语文化与语言特性的不足。论文的关键解决方案是提出了HamRaz数据集,该数据集结合了基于脚本的对话与适应性LLM角色扮演,确保了连贯且动态的心理治疗互动。此外,通过引入HamRazEval评估框架,从对话质量和治疗效果两个方面进行综合评价,从而验证了HamRaz的有效性。
链接: https://arxiv.org/abs/2502.05982
作者: Mohammad Amin Abbasi,Farnaz Sadat Mirnezami,Hassan Naderi
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper presents HamRaz, a novel Persian-language mental health dataset designed for Person-Centered Therapy (PCT) using Large Language Models (LLMs). Despite the growing application of LLMs in AI-driven psychological counseling, existing datasets predominantly focus on Western and East Asian contexts, overlooking cultural and linguistic nuances essential for effective Persian-language therapy. To address this gap, HamRaz combines script-based dialogues with adaptive LLM role-playing, ensuring coherent and dynamic therapy interactions. We also introduce HamRazEval, a dual evaluation framework that measures conversational quality and therapeutic effectiveness using General Dialogue Metrics and the Barrett-Lennard Relationship Inventory (BLRI). Experimental results show HamRaz outperforms conventional Script Mode and Two-Agent Mode, producing more empathetic, context-aware, and realistic therapy sessions. By releasing HamRaz, we contribute a culturally adapted, LLM-driven resource to advance AI-powered psychotherapy research in diverse communities.
zh
[NLP-72] Speech to Speech Translation with Translatotron: A State of the Art Review
【速读】: 该论文旨在解决级联式语音到语音翻译系统中存在的多个问题,如翻译时间过长及复合错误。这些问题源于级联方法需要结合多种技术,包括语音识别、语音到文本的翻译以及最后的文本到语音的翻译。为了解决这些复合错误问题,Google设计了Translatotron模型,这是一种端到端的直接语音到语音翻译模型。Translatotron的关键在于它能够直接进行语音到语音的转换,从而避免了级联方法中的多个中间步骤,有效减少了复合错误。
链接: https://arxiv.org/abs/2502.05980
作者: Jules R. Kala,Emmanuel Adetiba,Abdultaofeek Abayom,Oluwatobi E. Dare,Ayodele H. Ifijeh
机构: International University of Grand-Bassam; Covenant University, Ota, Nigeria; Department of Electrical and Information Engineering, Covenant University, Ota, Nigeria; Durban University of Technology, Durban, South Africa; Walter Sisulu University, East London 5200, South Africa; Summit University, PMB 4412, Offa, Kwara, Nigeria
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages and 3 figures
点击查看摘要
Abstract:A cascade-based speech-to-speech translation has been considered a benchmark for a very long time, but it is plagued by many issues, like the time taken to translate a speech from one language to another and compound errors. These issues are because a cascade-based method uses a combination of methods such as speech recognition, speech-to-text translation, and finally, text-to-speech translation. Translatotron, a sequence-to-sequence direct speech-to-speech translation model was designed by Google to address the issues of compound errors associated with cascade model. Today there are 3 versions of the Translatotron model: Translatotron 1, Translatotron 2, and Translatotron 3. The first version was designed as a proof of concept to show that a direct speech-to-speech translation was possible; it was found to be less effective than the cascade model but produced promising results. Translatotron 2 was an improved version of Translatotron 1 with results similar to the cascade model. Translatotron 3, the latest version of the model, is better than the cascade model at some points. In this paper, a complete review of speech-to-speech translation will be presented, with a particular focus on all the versions of Translatotron models. We will also show that Translatotron is the best model to bridge the language gap between African Languages and other well-formalized languages.
zh
[NLP-73] MetaChain: A Fully-Automated and Zero-Code Framework for LLM Agents
【速读】: 该论文旨在解决大型语言模型(LLM)代理开发的高技术门槛问题,当前的开发框架如LangChain和AutoGen主要服务于具备深厚技术背景的开发者。论文的关键解决方案是提出MetaChain——一个全自动化且高度自发展的框架,使用户能够仅通过自然语言创建和部署LLM代理。MetaChain包含四个核心组件:i)代理系统实用程序,ii)基于LLM的可操作引擎,iii)自我管理的文件系统,以及iv)自玩代理定制模块。这一轻量级但功能强大的系统允许在无需编码或人工干预的情况下高效动态地创建和修改工具、代理及工作流程,从而大幅降低使用门槛。
链接: https://arxiv.org/abs/2502.05957
作者: Jiabin Tang,Tianyu Fan,Chao Huang
机构: The University of Hong Kong
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code: this https URL
点击查看摘要
Abstract:Large Language Model (LLM) Agents have demonstrated remarkable capabilities in task automation and intelligent decision-making, driving the widespread adoption of agent development frameworks such as LangChain and AutoGen. However, these frameworks predominantly serve developers with extensive technical expertise - a significant limitation considering that only 0.03 % of the global population possesses the necessary programming skills. This stark accessibility gap raises a fundamental question: Can we enable everyone, regardless of technical background, to build their own LLM agents using natural language alone? To address this challenge, we introduce MetaChain-a Fully-Automated and highly Self-Developing framework that enables users to create and deploy LLM agents through Natural Language Alone. Operating as an autonomous Agent Operating System, MetaChain comprises four key components: i) Agentic System Utilities, ii) LLM-powered Actionable Engine, iii) Self-Managing File System, and iv) Self-Play Agent Customization module. This lightweight yet powerful system enables efficient and dynamic creation and modification of tools, agents, and workflows without coding requirements or manual intervention. Beyond its code-free agent development capabilities, MetaChain also serves as a versatile multi-agent system for General AI Assistants. Comprehensive evaluations on the GAIA benchmark demonstrate MetaChain’s effectiveness in generalist multi-agent tasks, surpassing existing state-of-the-art methods. Furthermore, MetaChain’s Retrieval-Augmented Generation (RAG)-related capabilities have shown consistently superior performance compared to many alternative LLM-based solutions.
zh
[NLP-74] Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention
【速读】: 该论文旨在解决多头解码(Multiple Heads Decoding)在大型语言模型(LLMs)推理过程中的效率问题,同时保持生成质量。解决方案的关键在于引入动态树注意力机制(Dynamic Tree Attention),替代原有的固定树注意力机制。通过提出一种简单且低复杂度的策略来生成候选序列并构建动态树结构,从而提高了多头解码的解码效率。
链接: https://arxiv.org/abs/2502.05947
作者: Zhendong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed structure. In this paper, we replace the fixed tree attention with dynamic tree attention on multiple head decoding, specifically in the context of MEDUSA. We propose a simple and low complexity strategy to generate candidates and construct the dynamic tree structure. Preliminary experiments show that the proposed method improves the decoding efficiency of multiple head decoding for LLMs while maintaining the generation quality. This result demonstrates the potential for improvement of multiple head decoding in candidate generation.
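下面给出“依据各解码头的 token 概率动态生成候选序列”这一思路的极简示意:对每个头取 top-k,按联合概率只保留若干前缀,相当于构造一棵动态候选树。具体的树构造与验证流程以论文为准,此处仅为占位实现。

```python
# 极简示意(假设):按联合概率动态扩展候选前缀,近似一棵动态候选树。
import heapq
import torch

def dynamic_candidates(head_probs, max_candidates=8, topk=3):
    """head_probs: 每个解码头一个 (vocab,) 概率张量。"""
    beams = [(0.0, [])]                       # (负对数概率, token 前缀)
    for probs in head_probs:
        top_p, top_i = probs.topk(topk)
        nxt = [(nlp - float(torch.log(p)), pre + [int(i)])
               for nlp, pre in beams for p, i in zip(top_p, top_i)]
        beams = heapq.nsmallest(max_candidates, nxt, key=lambda b: b[0])
    return [pre for _, pre in beams]

heads = [torch.softmax(torch.randn(50), -1) for _ in range(3)]
print(dynamic_candidates(heads))
```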
zh
[NLP-75] “Let the AI conspiracy begin…” Language Model coordination is just one inference-intervention away
【速读】: 该论文旨在解决如何通过干预大型语言模型的行为来绕过已学习的对齐目标的问题。关键解决方案在于利用干扰时激活位移(interference-time activation shifting),无需额外训练即可有效实施。通过从对比模型输出对(代表期望与非期望行为)的激活差异中推导干预方向,并通过提示模型包含多选答案来自动评估模型输出对个体注意力头(attention heads)操作的敏感性,从而实现对特定注意力头的精细干预,使模型更倾向于在“AI协调”数据集中与其他AI进行协作,而非遵循传统的对齐目标。这种方法不仅效果显著,还能保持输出的整体连贯性。
链接: https://arxiv.org/abs/2502.05945
作者: Paul Darm,Annalisa Riccardi
机构: University of Strathclyde (斯特拉斯克莱德大学), Glasgow
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Large Language Models (LLMs), Interference-time activation shifting, Steerability, Explainability, AI alignment, Interpretability
点击查看摘要
Abstract:In this work, we introduce a straightforward and effective methodology to steer large language model behaviour capable of bypassing learned alignment goals. We employ interference-time activation shifting, which is effective without additional training. Following prior studies, we derive intervention directions from activation differences in contrastive pairs of model outputs, which represent the desired and undesired behaviour. By prompting the model to include multiple-choice answers in its response, we can automatically evaluate the sensitivity of model output to individual attention heads steering efforts. We demonstrate that interventions on these heads generalize well to open-ended answer generation in the challenging “AI coordination” dataset. In this dataset, models must choose between assisting another AI or adhering to ethical, safe, and unharmful behaviour. Our fine-grained interventions lead Llama 2 to prefer coordination with other AIs over following established alignment goals. Additionally, this approach enables stronger interventions than those applied to whole model layers, preserving the overall cohesiveness of the output. The simplicity of our method highlights the shortcomings of current alignment strategies and points to potential future research directions, as concepts like “AI coordination” can be influenced by selected attention heads.
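下面给出从对比样本对的激活差推导干预方向,并在推理期平移注意力头输出的极简示意。激活数据、干预系数 alpha 等均为占位假设。

```python
# 极简示意(假设):干预方向 = 期望行为激活均值 - 非期望行为激活均值。
import torch

def steering_direction(acts_desired, acts_undesired):
    """两组形状为 (n, d) 的某注意力头激活,返回 (d,) 的干预方向。"""
    return acts_desired.mean(0) - acts_undesired.mean(0)

d = 64
pos = torch.randn(32, d) + 0.5   # 占位:期望行为下收集的激活
neg = torch.randn(32, d) - 0.5   # 占位:非期望行为下的激活
v = steering_direction(pos, neg)

def intervene(head_out, v, alpha=2.0):
    return head_out + alpha * v   # 推理期直接平移该注意力头的输出

print(intervene(torch.randn(1, d), v).shape)
```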
zh
[NLP-76] Multi-granular Training Strategies for Robust Multi-hop Reasoning Over Noisy and Heterogeneous Knowledge Sources
【速读】: 该论文旨在解决多源多跳问题回答(Multi-source multi-hop QA)中的挑战,包括异构知识源的动态整合和多步推理的需求。论文的关键解决方案是提出了一种名为自适应多源知识导向推理(Adaptive Multi-source Knowledge-Oriented Reasoning, AMKOR)的生成框架。AMKOR利用大型语言模型(LLMs)动态融合参数化知识与检索到的知识,并通过概率束推理探索推理路径。此外,AMKOR通过多层次学习策略优化局部推理步骤和全局答案准确性,从而显著提升了推理准确性和鲁棒性。
链接: https://arxiv.org/abs/2502.05944
作者: Jackson Coleman,Isaiah Lawrence,Benjamin Turner
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multi-source multi-hop question answering (QA) represents a challenging task in natural language processing due to the need for dynamic integration of heterogeneous knowledge sources and multi-step reasoning. Existing methods often suffer from cascading errors, insufficient handling of knowledge conflicts, and computational inefficiency. In this paper, we propose Adaptive Multi-source Knowledge-Oriented Reasoning (AMKOR), a generative framework that leverages large language models (LLMs) to dynamically fuse parametric and retrieved knowledge while exploring reasoning trajectories using probabilistic beam reasoning. AMKOR is further enhanced by a multi-granular learning strategy, optimizing both local reasoning steps and global answer accuracy. Experiments conducted on four widely-used multi-hop QA datasets, including HotpotQA and MuSiQue, demonstrate that AMKOR achieves state-of-the-art performance, significantly outperforming baseline methods on both reasoning accuracy and robustness. Additional analyses confirm its scalability, adaptability to noisy knowledge, and superior ability to handle complex multi-hop tasks. This work establishes a new benchmark for multi-source multi-hop QA by effectively combining reasoning quality and efficiency.
zh
[NLP-77] A Semi-Supervised Text Generation Framework Combining a Deep Transformer and a GAN
【速读】: 该论文旨在解决半监督条件下的文本生成问题。解决方案的关键在于结合深度生成预训练Transformer语言模型与生成对抗网络(GAN),并通过Gumbel-Softmax处理令牌的离散性。此外,引入了一种半监督方法,将真实数据与GAN生成的数据相结合,以进一步微调Transformer模型。
链接: https://arxiv.org/abs/2502.05937
作者: Shengquan Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages
点击查看摘要
Abstract:This paper introduces a framework that connects a deep generative pre-trained Transformer language model with a generative adversarial network for semi-supervised text generation. In other words, the proposed model is first pre-trained unsupervised on a large and diverse text corpus with 24 layers. Then a simple GAN architecture for synthetic text generation is introduced, and Gumbel-Softmax is applied to handle the discreteness of tokens. The paper also shows a semi-supervised approach where real data is augmented with GAN samples, which is further used to fine-tune the Transformer model on the merged dataset. Detailed theoretical derivations are also included, outlining the proof of the min-max objective function, and an extensive discussion of the Gumbel-Softmax reparameterization trick.
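下面给出用 Gumbel-Softmax 处理 token 离散性的极简示意:前向得到近似 one-hot 的可微样本,反向梯度可穿过采样步骤传回生成器。

```python
# 极简示意:Gumbel-Softmax 对离散 token 的可微采样(重参数化技巧)。
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0, hard=True):
    return F.gumbel_softmax(logits, tau=tau, hard=hard)  # (batch, vocab) 近似 one-hot

vocab = 50
logits = torch.randn(4, vocab, requires_grad=True)
y = gumbel_softmax_sample(logits)        # 前向近似 one-hot,反向可导
print(y.shape, y.sum(-1))                # 每行和为 1
y.sum().backward()                       # 梯度可回传到生成器 logits
```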
zh
[NLP-78] Learning to Substitute Words with Model-based Score Ranking
【速读】: 该论文旨在解决现有智能词汇替换基准依赖人工标注数据的问题。由于词汇选择具有主观性,由少数注释者生成的真实词替换数据往往不完整且不具备普适性。为解决这一问题,论文提出使用基于模型的评分(BARTScore)来量化句子质量,从而避免了对人工标注的需求。关键在于通过这种评分定义每个词汇替换的分布,并引入一种损失函数,直接优化模型预测与句子评分之间的对齐,同时提升替换的整体质量得分。这种方法使得模型学习不再依赖于人工标签,从而在无需标注成本的情况下保持文本修改的质量。实验结果表明,所提出的方案优于掩码语言模型(如BERT、BART)和大型语言模型(如GPT-4、LLaMA)。
链接: https://arxiv.org/abs/2502.05933
作者: Hongye Liu,Ricardo Henao
机构: Duke University (杜克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Smart word substitution aims to enhance sentence quality by improving word choices; however current benchmarks rely on human-labeled data. Since word choices are inherently subjective, ground-truth word substitutions generated by a small group of annotators are often incomplete and likely not generalizable. To circumvent this issue, we instead employ a model-based score (BARTScore) to quantify sentence quality, thus forgoing the need for human annotations. Specifically, we use this score to define a distribution for each word substitution, allowing one to test whether a substitution is statistically superior relative to others. In addition, we propose a loss function that directly optimizes the alignment between model predictions and sentence scores, while also enhancing the overall quality score of a substitution. Crucially, model learning no longer requires human labels, thus avoiding the cost of annotation while maintaining the quality of the text modified with substitutions. Experimental results show that the proposed approach outperforms both masked language models (BERT, BART) and large language models (GPT-4, LLaMA). The source code is available at this https URL.
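下面给出“用基于模型的句子质量分给候选替换词排序”这一思路的极简示意。真实实现应调用 BARTScore 等打分模型,此处用一个占位打分函数代替。

```python
# 极简示意(假设):用句子质量分(占位函数模拟 BARTScore)给候选替换词排序。
def sentence_score(sentence: str) -> float:
    # 占位:实际应调用 BARTScore 等模型;这里仅以长度负相关模拟
    return -len(sentence) / 100.0

def rank_substitutions(sentence: str, target: str, candidates: list[str]):
    scored = [(sentence_score(sentence.replace(target, c, 1)), c) for c in candidates]
    return sorted(scored, reverse=True)   # 分数越高排越前

print(rank_substitutions("He made a big mistake.", "big",
                         ["huge", "significant", "large"]))
```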
zh
[NLP-79] ARISE: Iterative Rule Induction and Synthetic Data Generation for Text Classification NAACL2025
【速读】: 该论文旨在解决文本分类中的规则诱导和数据增强问题。关键在于提出了一种名为ARISE的框架,通过迭代生成规则和合成数据,并利用自举法(bootstrap)来过滤生成的规则和数据。ARISE通过句法n元语法(syntactic n-grams)的归纳泛化来诱导规则,从而捕捉额外的监督信号源。这些规则与合成数据的使用单独或结合都能显著提升模型在全量数据(full-shot)、少量数据(few-shot)及多语言环境下的性能。
链接: https://arxiv.org/abs/2502.05923
作者: Yashwanth M.,Vaibhav Singh,Ayush Maheshwari,Amrith Krishna,Ganesh Ramakrishnan
机构: Accenture; NVIDIA; Indian Institute of Technology Bombay (印度理工学院孟买分校); BharatGen
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of NAACL 2025
点击查看摘要
Abstract:We propose ARISE, a framework that iteratively induces rules and generates synthetic data for text classification. We combine synthetic data generation and automatic rule induction, via bootstrapping, to iteratively filter the generated rules and data. We induce rules via inductive generalisation of syntactic n-grams, enabling us to capture a complementary source of supervision. These rules alone lead to performance gains in both, in-context learning (ICL) and fine-tuning (FT) settings. Similarly, use of augmented data from ARISE alone improves the performance for a model, outperforming configurations that rely on complex methods like contrastive learning. Further, our extensive experiments on various datasets covering three full-shot, eight few-shot and seven multilingual variant settings demonstrate that the rules and data we generate lead to performance improvements across these diverse domains and languages.
zh
[NLP-80] GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation
【速读】: 该论文旨在解决拒绝意识指令调优(Refusal-Aware Instruction Tuning, RAIT)中的两个关键挑战:有效拒绝未知问题以减少幻觉,并避免过度拒绝以确保可以正确回答的问题不被拒绝,从而保持语言模型输出的有用性。为了解决这些问题,论文提出了梯度驱动的拒绝意识指令调优框架(Gradient-driven Refusal Aware Instruction Tuning Framework, GRAIT),其关键是采用基于梯度的样本选择来有效减少幻觉,并引入自适应加权机制在微调过程中降低过度拒绝的风险,从而在准确拒绝与保持有用响应之间取得平衡。
链接: https://arxiv.org/abs/2502.05911
作者: Runchuan Zhu,Zinco Jiang,Jiang Wu,Zhipeng Ma,Jiahe Song,Fengshuo Bai,Dahua Lin,Lijun Wu,Conghui He
机构: Peking University (北京大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Southwest Jiaotong University (西南交通大学); Shanghai Jiaotong University (上海交通大学); Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注: Equal contribution: Runchuan Zhu, Zinco Jiang, Jiang Wu; Corresponding author: Conghui He
点击查看摘要
Abstract:Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: firstly, effectively reject unknown questions to minimize hallucinations; secondly, avoid over-refusal to ensure questions that can be correctly answered are not rejected, thereby maintain the helpfulness of LLM outputs. In this paper, we address the two challenges by deriving insightful observations from the gradient-based perspective, and proposing the Gradient-driven Refusal Aware Instruction Tuning Framework GRAIT: (1) employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, achieving the balance between accurate refusals and maintaining useful responses. Experimental evaluations on open-ended and multiple-choice question answering tasks demonstrate that GRAIT significantly outperforms existing RAIT methods in the overall performance. The source code and data will be available at this https URL .
zh
[NLP-81] A Distributional Perspective on Word Learning in Neural Language Models
【速读】: 该论文旨在解决语言模型(Language Models, LMs)在词汇学习轨迹方面与人类儿童是否存在相似性的问题。由于该领域尚处于初期阶段,目前缺乏广泛认可的语言模型中词汇学习的评估指标。论文的关键解决方案在于提出了一种基于分布的方法来定义词汇知识,通过改进的分布特征捕捉目标词可能出现和不可能出现的位置以及其适当性的梯度偏好,从而更全面地反映词汇学习情况。研究结果表明,所提出的多种指标主要捕获互补信息,但不同指标下语言模型的学习轨迹未能与儿童的学习轨迹相关联。
链接: https://arxiv.org/abs/2502.05892
作者: Filippo Ficarra,Ryan Cotterell,Alex Warstadt
机构: ETH Zürich (苏黎世联邦理工学院); University of California San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Language models (LMs) are increasingly being studied as models of human language learners. Due to the nascency of the field, it is not well-established whether LMs exhibit similar learning dynamics to humans, and there are few direct comparisons between learning trajectories in humans and models. Word learning trajectories for children are relatively well-documented, and recent work has tried to extend these investigations to language models. However, there are no widely agreed-upon metrics for word learning in language models. We take a distributional approach to this problem, defining lexical knowledge in terms of properties of the learned distribution for a target word. We argue that distributional signatures studied in prior work fail to capture key distributional information. Thus, we propose an array of signatures that improve on earlier approaches by capturing knowledge of both where the target word can and cannot occur as well as gradient preferences about the word’s appropriateness. We obtain learning trajectories for a selection of small language models we train from scratch, study the relationship between different distributional signatures, compare how well they align with human word learning trajectories and interpretable lexical features, and address basic methodological questions about estimating these distributional signatures. Our metrics largely capture complementary information, suggesting that it is important not to rely on a single metric. However, across all metrics, language models’ learning trajectories fail to correlate with those of children.
zh
[NLP-82] MTPChat: A Multimodal Time-Aware Persona Dataset for Conversational Agents NAACL2025
【速读】: 该论文旨在解决时间感知数据集稀缺的问题,尤其是面向人物角色(persona-grounded)对话的数据集,这限制了相关研究的范围并降低了任务复杂度。论文的关键解决方案是引入MTPChat,一个融合语言、视觉和时间元素的多模态时间感知人物对话数据集,并基于此提出了两个时间敏感任务:Temporal Next Response Prediction (TNRP) 和 Temporal Grounding Memory Prediction (TGMP),以评估模型理解隐含时间线索和动态交互的能力。此外,论文还提出了一种创新框架,包含自适应时间模块,有效整合多模态信息流并捕捉时间依赖关系。
链接: https://arxiv.org/abs/2502.05887
作者: Wanqi Yang,Yanda Li,Meng Fang,Ling Chen
机构: University of Technology Sydney (悉尼科技大学); University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NAACL 2025 Findings
点击查看摘要
Abstract:Understanding temporal dynamics is critical for conversational agents, enabling effective content analysis and informed decision-making. However, time-aware datasets, particularly for persona-grounded conversations, are still limited, which narrows their scope and diminishes their complexity. To address this gap, we introduce MTPChat, a multimodal, time-aware persona dialogue dataset that integrates linguistic, visual, and temporal elements within dialogue and persona memory. Leveraging MTPChat, we propose two time-sensitive tasks: Temporal Next Response Prediction (TNRP) and Temporal Grounding Memory Prediction (TGMP), both designed to assess a model’s ability to understand implicit temporal cues and dynamic interactions. Additionally, we present an innovative framework featuring an adaptive temporal module to effectively integrate multimodal streams and capture temporal dependencies. Experimental results validate the challenges posed by MTPChat and demonstrate the effectiveness of our framework in multimodal time-sensitive scenarios.
zh
[NLP-83] Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models
【速读】: 该论文旨在解决当前基于大型语言模型(Large Language Models, LLMs)的抑郁症检测方法在细微症状识别和透明推理过程方面的不足,从而难以准确分类和解释精神健康状况的问题。论文的关键解决方案是提出了一种称为“Chain-of-Thought Prompting”的方法,通过将检测过程分解为四个阶段:情感分析、二元抑郁症分类、潜在原因识别和严重性评估,以此来提高基于LLM的抑郁症检测性能和可解释性。
链接: https://arxiv.org/abs/2502.05879
作者: Shiyu Teng,Jiaqing Liu,Rahul Kumar Jain,Shurong Chai,Ruibo Hou,Tomoko Tateyama,Lanfen Lin,Yen-wei Chen
机构: College of Information Science and Engineering, Ritsumeikan University (立命馆大学信息科学与工程学院), Osaka, Japan; Department of Intelligent Information Engineering, Fujita Health University (藤田保健卫生大学智能信息工程系), Japan; College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院), Hangzhou, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Depression is one of the leading causes of disability worldwide, posing a severe burden on individuals, healthcare systems, and society at large. Recent advancements in Large Language Models (LLMs) have shown promise in addressing mental health challenges, including the detection of depression through text-based analysis. However, current LLM-based methods often struggle with nuanced symptom identification and lack a transparent, step-by-step reasoning process, making it difficult to accurately classify and explain mental health conditions. To address these challenges, we propose a Chain-of-Thought Prompting approach that enhances both the performance and interpretability of LLM-based depression detection. Our method breaks down the detection process into four stages: (1) sentiment analysis, (2) binary depression classification, (3) identification of underlying causes, and (4) assessment of severity. By guiding the model through these structured reasoning steps, we improve interpretability and reduce the risk of overlooking subtle clinical indicators. We validate our method on the E-DAIC dataset, where we test multiple state-of-the-art large language models. Experimental results indicate that our Chain-of-Thought Prompting technique yields superior performance in both classification accuracy and the granularity of diagnostic insights, compared to baseline approaches.
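下面给出四阶段链式思维提示模板的极简示意,阶段划分依照论文描述,但提示措辞为自拟示例。

```python
# 极简示意(假设):四阶段链式思维提示模板,措辞为自拟,非论文原 prompt。
STAGES = [
    "1. Sentiment analysis: describe the overall sentiment of the text.",
    "2. Binary classification: does the text indicate depression? (yes/no)",
    "3. Underlying causes: if yes, identify possible contributing factors.",
    "4. Severity: rate the severity as mild, moderate, or severe.",
]

def build_prompt(transcript: str) -> str:
    steps = "\n".join(STAGES)
    return (f"Analyze the following interview transcript step by step.\n"
            f"{steps}\n\nTranscript:\n{transcript}")

print(build_prompt("I haven't slept well for weeks and nothing feels worth doing."))
```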
zh
[NLP-84] Retrieval-augmented Large Language Models for Financial Time Series Forecasting
【速读】: 该论文旨在解决股票走势预测中的复杂金融数据分析难题,现有基于文本训练或数值相似性检索的方法难以应对。为了解决这一问题,论文提出了一种创新的检索增强生成(RAG)框架,关键创新包括:1) 一个经过微调的10亿参数大规模语言模型(StockLLM)作为基础;2) 利用语言模型反馈的新颖候选选择方法;3) 最大化查询与历史上重要序列之间的相似性的训练目标。这些创新使得检索器FinSeer能够在复杂金融数据中发现有意义的模式,同时减少噪声。
链接: https://arxiv.org/abs/2502.05878
作者: Mengxi Xiao,Zihao Jiang,Lingfei Qian,Zhengyu Chen,Yueru He,Yijing Xu,Yuecheng Jiang,Dong Li,Ruey-Ling Weng,Min Peng,Jimin Huang,Sophia Ananiadou,Qianqian Xie
机构: School of Computer Science, Wuhan University(武汉大学); The Fin AI(金融人工智能); Columbia University(哥伦比亚大学); Stevens Institute of Technology(史蒂文斯理工学院); Yale University(耶鲁大学); University of Manchester(曼彻斯特大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Stock movement prediction, a fundamental task in financial time-series forecasting, requires identifying and retrieving critical influencing factors from vast amounts of time-series data. However, existing text-trained or numeric similarity-based retrieval methods fall short in handling complex financial analysis. To address this, we propose the first retrieval-augmented generation (RAG) framework for financial time-series forecasting, featuring three key innovations: a fine-tuned 1B parameter large language model (StockLLM) as the backbone, a novel candidate selection method leveraging LLM feedback, and a training objective that maximizes similarity between queries and historically significant sequences. This enables our retriever, FinSeer, to uncover meaningful patterns while minimizing noise in complex financial data. We also construct new datasets integrating financial indicators and historical stock prices to train FinSeer and ensure robust evaluation. Experimental results demonstrate that our RAG framework outperforms bare StockLLM and random retrieval, highlighting its effectiveness, while FinSeer surpasses existing retrieval methods, achieving an 8% higher accuracy on BIGDATA22 and retrieving more impactful sequences. This work underscores the importance of tailored retrieval models in financial forecasting and provides a novel framework for future research.
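下面给出检索器训练目标的一种极简示意:以 InfoNCE 形式最大化查询与“历史上重要序列”(正例)的相似度、压低与负例的相似度。论文的具体损失形式与负例构造以原文为准,此处仅为常见对比学习写法。

```python
# 极简示意(假设):InfoNCE 形式的检索器目标,正例为历史上重要的序列。
import torch
import torch.nn.functional as F

def infonce(q, pos, negs, tau=0.07):
    """q:(d,) 查询嵌入;pos:(d,) 正例嵌入;negs:(n,d) 负例嵌入。"""
    q, pos, negs = (F.normalize(t, dim=-1) for t in (q, pos, negs))
    logits = torch.cat([(pos * q).sum(-1, keepdim=True), negs @ q]) / tau
    return F.cross_entropy(logits[None], torch.zeros(1, dtype=torch.long))

d = 32
print(float(infonce(torch.randn(d), torch.randn(d), torch.randn(8, d))))
```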
zh
[NLP-85] Self-Training Large Language Models for Tool-Use Without Demonstrations
【速读】: 该论文旨在解决大型语言模型(LLMs)在事实准确性与计算正确性方面的不足,特别是避免幻觉现象及数学推理错误。为实现这一目标,论文探索了两种方法:一是分析零样本提示策略以引导LLMs使用工具;二是提出一种自我训练方法,利用LLM自身合成工具使用痕迹。关键在于无需示范的情况下,使LLMs能够学会使用外部工具,并通过监督微调和偏好微调技术在现有问答数据集上进行模型优化。实验结果显示,在长尾知识任务中,工具使用的引入显著提升了PopQA上的表现,但在其他数据集上的效果则喜忧参半。
链接: https://arxiv.org/abs/2502.05867
作者: Ne Luo,Aryo Pradipta Gema,Xuanli He,Emile van Krieken,Pietro Lesci,Pasquale Minervini
机构: University of Edinburgh(爱丁堡大学); University College London(伦敦大学学院); University of Cambridge(剑桥大学); Miniml.AI(未知)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) remain prone to factual inaccuracies and computational errors, including hallucinations and mistakes in mathematical reasoning. Recent work augmented LLMs with tools to mitigate these shortcomings, but often requires curated gold tool-use demonstrations. In this paper, we investigate whether LLMs can learn to use tools without demonstrations. First, we analyse zero-shot prompting strategies to guide LLMs in tool utilisation. Second, we propose a self-training method to synthesise tool-use traces using the LLM itself. We compare supervised fine-tuning and preference fine-tuning techniques for fine-tuning the model on datasets constructed using existing Question Answering (QA) datasets, i.e., TriviaQA and GSM8K. Experiments show that tool-use enhances performance on a long-tail knowledge task: 3.7% on PopQA, which is used solely for evaluation, but leads to mixed results on other datasets, i.e., TriviaQA, GSM8K, and NQ-Open. Our findings highlight the potential and challenges of integrating external tools into LLMs without demonstrations.
zh
[NLP-86] Fact-or-Fair: A Checklist for Behavioral Testing of AI Models on Fairness-Related Queries
【速读】: 该论文旨在探究准确反映事实性与促进多样性和公平性之间的关系,并解决生成式大型语言模型(Large Language Models, LLMs)及文本到图像模型(Text-to-Image, T2I)在这些方面存在的不足。论文的关键在于开发了一个包含客观和主观查询的检查表,通过19个来自权威来源的真实世界统计数据进行分析。其中,主观查询遵循一个基本原则:不应将统计或经验先验过度泛化到个体,以确保模型能够维护多样性。此外,论文提出了评估事实性和公平性的指标,并证明了这两者之间存在固有的权衡关系。
链接: https://arxiv.org/abs/2502.05849
作者: Jen-tse Huang,Yuhang Yan,Linqi Liu,Yixin Wan,Wenxuan Wang,Kai-Wei Chang,Michael R. Lyu
机构: The Chinese University of Hong Kong; University of California, Los Angeles
类目: Computation and Language (cs.CL)
备注: 8 pages of main text; 7 pages of appendices;
点击查看摘要
Abstract:The generation of incorrect images, such as depictions of people of color in Nazi-era uniforms by Gemini, frustrated users and harmed Google’s reputation, motivating us to investigate the relationship between accurately reflecting factuality and promoting diversity and equity. In this study, we focus on 19 real-world statistics collected from authoritative sources. Using these statistics, we develop a checklist comprising objective and subjective queries to analyze behavior of large language models (LLMs) and text-to-image (T2I) models. Objective queries assess the models’ ability to provide accurate world knowledge. In contrast, the design of subjective queries follows a key principle: statistical or experiential priors should not be overgeneralized to individuals, ensuring that models uphold diversity. These subjective queries are based on three common human cognitive errors that often result in social biases. We propose metrics to assess factuality and fairness, and formally prove the inherent trade-off between these two aspects. Results show that GPT-4o and DALL-E 3 perform notably well among six LLMs and four T2I models. Our code is publicly available at this https URL.
zh
[NLP-87] LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification NAACL2025
【速读】: 该论文旨在解决法律文件语义分割中的修辞角色分类问题,特别关注印度司法判决。论文的关键解决方案在于引入LegalSeg数据集,包含超过7,000份文档和140万句标注数据,并涵盖7种修辞角色。研究评估了多种先进模型,包括Hierarchical BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、图神经网络(GNNs)和Role-Aware Transformers,以及探索性的RhetoricLLaMA。结果表明,融合更广泛上下文、结构关系和句子顺序信息的模型优于仅依赖句子级特征的模型。此外,论文通过实验探讨了使用周围上下文和相邻句子的实际或预测标签对分类准确性的影响。尽管取得进展,但仍面临区分相关角色和处理类别不平衡的挑战。
链接: https://arxiv.org/abs/2502.05836
作者: Shubham Kumar Nigam,Tanmay Dubey,Govind Sharma,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya
机构: IIT Kanpur(印度理工学院坎普尔); IISER Kolkata(印度西孟加拉邦国立研究所); Symbiosis Law School Pune(浦那共生法学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted on NAACL 2025
点击查看摘要
Abstract:In this paper, we address the task of semantic segmentation of legal documents through rhetorical role classification, with a focus on Indian legal judgments. We introduce LegalSeg, the largest annotated dataset for this task, comprising over 7,000 documents and 1.4 million sentences, labeled with 7 rhetorical roles. To benchmark performance, we evaluate multiple state-of-the-art models, including Hierarchical BiLSTM-CRF, TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an instruction-tuned large language model. Our results demonstrate that models incorporating broader context, structural relationships, and sequential sentence information outperform those relying solely on sentence-level features. Additionally, we conducted experiments using surrounding context and predicted or actual labels of neighboring sentences to assess their impact on classification accuracy. Despite these advancements, challenges persist in distinguishing between closely related roles and addressing class imbalance. Our work underscores the potential of advanced techniques for improving legal document understanding and sets a strong foundation for future research in legal NLP.
zh
[NLP-88] Delta - Contrastive Decoding Mitigates Text Hallucinations in Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在自然语言处理任务中容易产生幻觉(hallucinations),即生成事实错误或虚构内容的问题。这种现象尤其在医疗和法律咨询等高风险领域削弱了模型的可靠性。论文提出的关键解决方案是Delta方法,这是一种推理阶段的技术,通过随机屏蔽输入提示的一部分,并对比原始输入和屏蔽输入的输出分布,从而抑制幻觉的生成,且无需重新训练模型或使用额外数据。这种方法仅依赖于推理阶段的计算,实现了模型可靠性的提升。
链接: https://arxiv.org/abs/2502.05825
作者: Cheng Peng Huang,Hao-Yuan Chen
机构: National Taiwan University of Science and Technology (台湾科技大学); Mindify AI (Mindify AI); University of London (伦敦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) demonstrate strong capabilities in natural language processing but remain prone to hallucinations, generating factually incorrect or fabricated content. This issue undermines their reliability, particularly in high-stakes domains such as healthcare and legal advisory. To address this challenge, we propose Delta, an inference-time method that reduces hallucinations without requiring model retraining or additional data. Delta works by randomly masking parts of the input prompt and contrasting the output distributions for the original and masked inputs, effectively suppressing hallucinations through inference-only computations. We evaluate Delta on context-rich question-answering benchmarks, achieving absolute improvements of approximately 3 and 6 percentage points on SQuAD v1.1 and v2, respectively, and 7 and 2 percentage points on TriviaQA and Natural Questions under sampling decoding. Delta also improves the no-answer exact match score on SQuAD v2 by over ten percentage points, demonstrating its effectiveness in mitigating hallucinations arising from contextual ambiguity. These results highlight Delta as a computationally efficient and scalable approach for improving the reliability of LLMs in real-world applications.
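下面给出 Delta 式对比解码组合步骤的极简示意:对原始输入与随机遮盖后的输入各取一次下一词 logits,再做对比组合后采样。组合公式与系数 beta 为常见的对比解码写法,属自拟假设,并非论文给出的精确公式。

```python
# 极简示意(假设):原始 logits 与遮盖后 logits 的对比组合。
import torch

def delta_logits(logits_full, logits_masked, beta=0.5):
    # 放大"完整输入"相对"遮盖输入"的分布差异,抑制仅靠先验的幻觉 token
    return (1 + beta) * logits_full - beta * logits_masked

vocab = 100
logits_full = torch.randn(vocab)     # 占位:完整 prompt 的下一词 logits
logits_masked = torch.randn(vocab)   # 占位:随机遮盖 prompt 后的 logits
probs = torch.softmax(delta_logits(logits_full, logits_masked), -1)
print(int(probs.argmax()))           # 按组合分布选取下一个 token
```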
zh
[NLP-89] Structural Perturbation in Large Language Model Representations through Recursive Symbolic Regeneration
【速读】: 该论文旨在解决在不直接修改模型参数的情况下,如何影响神经表征的问题。解决方案的关键在于引入符号扰动(Symbolic Perturbations),通过递归再生符号结构,在潜在嵌入(Latent Embeddings)中引入结构化变化,从而实现对注意力动态和词汇多样性(Lexical Diversity)的受控调整。实验结果表明,符号层面的修改可以增强领域特定应用中的适应性,并且可以在不重新训练模型的情况下调整模型行为。
链接: https://arxiv.org/abs/2502.05794
作者: Kathlyn Eaglewood,Tobias Featherington,Dorian Mayfair,Sylvester Grimshaw,James Pettigrew
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Symbolic perturbations offer a novel approach for influencing neural representations without requiring direct modification of model parameters. The recursive regeneration of symbolic structures introduces structured variations in latent embeddings, leading to controlled shifts in attention dynamics and lexical diversity across sequential generations. A comparative analysis with conventional fine-tuning techniques reveals that structural modifications at the symbolic level induce distinct variations in contextual sensitivity while maintaining overall model fluency and coherence. Shifts in attention weight distributions highlight the role of symbolic modifications in adjusting token dependencies, influencing response variability, and refining long-form text generation. Experimental findings suggest that symbolic perturbations can enhance adaptability in domain-specific applications, allowing modifications in model behavior without retraining. Evaluations of semantic drift indicate that recursive regeneration alters long-range token dependencies, affecting topic coherence across extended text sequences. Results from lexical variability assessments further support the conclusion that symbolic-level modifications introduce interpretable variations in generated responses, potentially enabling more controlled stylistic adjustments in automated text generation.
zh
[NLP-90] On Reference (In-)Determinacy in Natural Language Inference NAACL2025
【速读】: 该论文旨在解决自然语言推理(NLI)任务中参考确定性(Reference Determinacy, RD)假设在实际应用中的局限性。当前NLI模型通常仅基于遵循RD假设的假设-前提对进行训练,这导致这些模型在下游应用如事实验证中表现不佳,因为输入的前提和假设可能引用不同的上下文。为了解决这一问题,论文引入了一个诊断基准RefNLI,用于识别NLI示例中的引用歧义。通过RefNLI,研究者展示了经过微调的NLI模型和少量提示的大规模语言模型(LLMs)均未能识别上下文不匹配的问题,导致超过80%的错误矛盾预测和超过50%的蕴含预测。论文的关键在于揭示NLI示例中存在的引用歧义可以部分解释人类在NLI任务中的固有分歧,并深入探讨RD假设如何影响NLI数据集的构建过程。
链接: https://arxiv.org/abs/2502.05793
作者: Sihao Chen,Chaitanya Malaviya,Alex Fabrikant,Hagai Taitelbaum,Tal Schuster,Senaka Buthpitiya,Dan Roth
机构: University of Pennsylvania (宾夕法尼亚大学); Google DeepMind (谷歌深思维); Google Research (谷歌研究); Microsoft (微软)
类目: Computation and Language (cs.CL)
备注: NAACL 2025 Findings
点击查看摘要
Abstract:We revisit the reference determinacy (RD) assumption in the task of natural language inference (NLI), i.e., the premise and hypothesis are assumed to refer to the same context when human raters annotate a label. While RD is a practical assumption for constructing a new NLI dataset, we observe that current NLI models, which are typically trained solely on hypothesis-premise pairs created with the RD assumption, fail in downstream applications such as fact verification, where the input premise and hypothesis may refer to different contexts. To highlight the impact of this phenomenon in real-world use cases, we introduce RefNLI, a diagnostic benchmark for identifying reference ambiguity in NLI examples. In RefNLI, the premise is retrieved from a knowledge source (i.e., Wikipedia) and does not necessarily refer to the same context as the hypothesis. With RefNLI, we demonstrate that finetuned NLI models and few-shot prompted LLMs both fail to recognize context mismatch, leading to over 80% false contradiction and over 50% entailment predictions. We discover that the existence of reference ambiguity in NLI examples can in part explain the inherent human disagreements in NLI and provide insight into how the RD assumption impacts the NLI dataset creation process.
zh
[NLP-91] Reinforced Lifelong Editing for Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在持续学习过程中因参数动态变化而难以进行有效编辑的问题。关键在于提出了一种基于强化学习的编辑方法RLEdit,通过将编辑损失视为奖励,并在完整知识序列层面上优化超网络参数,从而精确捕捉LLM的变化并生成适当的参数更新。
链接: https://arxiv.org/abs/2502.05759
作者: Zherui Li,Houcheng Jiang,Hao Chen,Baolong Bi,Zhenhong Zhou,Fei Sun,Junfeng Fang,Xiang Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) acquire information from pre-training corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observed that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and proposed RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches. Our code is available at: this https URL.
zh
[NLP-92] BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting NAACL2025
【速读】: 该论文旨在解决Bangla语音合成领域数据稀缺的问题,提出BnTTS(Bangla Text-To-Speech)框架,通过基于说话人适应的文本转语音技术来弥合这一差距。解决方案的关键在于对XTTS架构进行改进,以整合Bangla语言的音韵和语言特征,并在少量训练数据条件下实现高效的Bangla语音合成。通过预训练和在少量标注数据上的微调,BnTTS显著提升了合成Bangla语音的自然度、清晰度和说话人保真度。
链接: https://arxiv.org/abs/2502.05729
作者: Mohammad Jahid Ibna Basher,Md Kowsher,Md Saiful Islam,Rabindra Nath Nandi,Nusrat Jahan Prottasha,Mehadi Hasan Menon,Tareq Al Muntasir,Shammur Absar Chowdhury,Firoj Alam,Niloofar Yousefi,Ozlem Ozmen Garibay
机构: Hishab Singapore Pte. Ltd(智算新加坡有限公司), Singapore; University of Central Florida(中佛罗里达大学), USA; Qatar Computing Research Institute(卡塔尔计算研究研究院), Qatar
类目: Computation and Language (cs.CL)
备注: Accepted paper in NAACL 2025
点击查看摘要
Abstract:This paper introduces BnTTS (Bangla Text-To-Speech), the first framework for Bangla speaker adaptation-based TTS, designed to bridge the gap in Bangla speech synthesis using minimal training data. Building upon the XTTS architecture, our approach integrates Bangla into a multilingual TTS pipeline, with modifications to account for the phonetic and linguistic characteristics of the language. We pre-train BnTTS on 3.85k hours of Bangla speech dataset with corresponding text labels and evaluate performance in both zero-shot and few-shot settings on our proposed test dataset. Empirical evaluations in few-shot settings show that BnTTS significantly improves the naturalness, intelligibility, and speaker fidelity of synthesized Bangla speech. Compared to state-of-the-art Bangla TTS systems, BnTTS exhibits superior performance in Subjective Mean Opinion Score (SMOS), Naturalness, and Clarity metrics.
zh
[NLP-93] Rethinking Word Similarity: Semantic Similarity through Classification Confusion NAACL
【速读】: 该论文旨在解决传统基于余弦相似度的词嵌入方法无法捕捉语义相似性的上下文依赖性、不对称性和多义性的问题。论文的关键解决方案是提出了一种新的相似性度量方法——词语混淆(Word Confusion),通过训练分类器将上下文嵌入映射到词身份,并利用分类器的混淆概率(即选择干扰词而非正确目标词的概率)来衡量相似性。这种方法能够更好地反映语义相似性的动态特征,并在多个数据集上与人类相似性判断相匹配。
链接: https://arxiv.org/abs/2502.05704
作者: Kaitlyn Zhou,Haishan Gao,Sarah Chen,Dan Edelstein,Dan Jurafsky,Chen Shani
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to NAACL-main-2025
点击查看摘要
Abstract:Word similarity has many applications to social science and cultural analytics tasks like measuring meaning change over time and making sense of contested terms. Yet traditional similarity methods based on cosine similarity between word embeddings cannot capture the context-dependent, asymmetrical, polysemous nature of semantic similarity. We propose a new measure of similarity, Word Confusion, that reframes semantic similarity in terms of feature-based classification confusion. Word Confusion is inspired by Tversky’s suggestion that similarity features be chosen dynamically. Here we train a classifier to map contextual embeddings to word identities and use the classifier confusion (the probability of choosing a confounding word c instead of the correct target word t) as a measure of the similarity of c and t. The set of potential confounding words acts as the chosen features. Our method is comparable to cosine similarity in matching human similarity judgments across several datasets (MEN, WordSim353, and SimLex), and can measure similarity using predetermined features of interest. We demonstrate our model’s ability to make use of dynamic features by applying it to test a hypothesis about changes in the 18th C. meaning of the French word “revolution” from popular to state action during the French Revolution. We hope this reimagining of semantic similarity will inspire the development of new tools that better capture the multi-faceted and dynamic nature of language, advancing the fields of computational social science and cultural analytics and beyond.
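下面用一个玩具例子示意“词语混淆(Word Confusion)”相似度的计算方式:先训练一个把上下文嵌入映射回词身份的分类器,再把分类器在目标词上下文上误选干扰词的平均概率当作相似度。其中上下文嵌入用随机向量模拟(实际应取自 BERT 等模型的上下文表示),仅为思路示意,并非原论文实现。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 玩具数据:用“类中心 + 噪声”模拟每个词在若干上下文中的上下文嵌入
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car"]
centers = {w: rng.normal(size=32) for w in vocab}
X, y = [], []
for idx, w in enumerate(vocab):
    for _ in range(50):
        X.append(centers[w] + 0.8 * rng.normal(size=32))  # 某次出现的上下文嵌入
        y.append(idx)

clf = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))

def word_confusion(target: str, confound: str, n: int = 20) -> float:
    """相似度 = 分类器在 target 的上下文上把词误判为 confound 的平均概率"""
    ctx = centers[target] + 0.8 * rng.normal(size=(n, 32))
    probs = clf.predict_proba(ctx)
    return float(probs[:, vocab.index(confound)].mean())

# 天然非对称:word_confusion("cat","dog") 与 word_confusion("dog","cat") 可不同
print(word_confusion("cat", "dog"), word_confusion("cat", "car"))
```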
zh
[NLP-94] Zero-Shot End-to-End Relation Extraction in Chinese: A Comparative Study of Gemini LLaMA and ChatGPT
【速读】: 该论文旨在解决大型语言模型(LLMs)在零样本端到端中文关系抽取(RE)中的性能问题,特别是该任务需要整合实体识别与关系抽取且无需预标注数据。研究的关键在于评估ChatGPT、Gemini和LLaMA三种模型的准确性、效率及适应性,从而揭示这些模型在零样本中文RE中的优势与局限,并为未来提升LLMs在复杂中文自然语言处理(NLP)任务中的适应性提供基础。
链接: https://arxiv.org/abs/2502.05694
作者: Shaoshuai Du,Yiyi Tao,Yixian Shen,Hang Zhang,Yanxin Shen,Xinyu Qiu,Chuanqi Shi
机构: University of Amsterdam(阿姆斯特丹大学); Johns Hopkins University(约翰斯·霍普金斯大学); University of Amsterdam(阿姆斯特丹大学); University of California San Diego(加州大学圣地亚哥分校); Simon Fraser University(西蒙弗雷泽大学); Northeastern University(东北大学); University of California San Diego(加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This study investigates the performance of various large language models (LLMs) on zero-shot end-to-end relation extraction (RE) in Chinese, a task that integrates entity recognition and relation extraction without requiring annotated data. While LLMs show promise for RE, most prior work focuses on English or assumes pre-annotated entities, leaving their effectiveness in Chinese RE largely unexplored. To bridge this gap, we evaluate ChatGPT, Gemini, and LLaMA based on accuracy, efficiency, and adaptability. ChatGPT demonstrates the highest overall performance, balancing precision and recall, while Gemini achieves the fastest inference speed, making it suitable for real-time applications. LLaMA underperforms in both accuracy and latency, highlighting the need for further adaptation. Our findings provide insights into the strengths and limitations of LLMs for zero-shot Chinese RE, shedding light on trade-offs between accuracy and efficiency. This study serves as a foundation for future research aimed at improving LLM adaptability to complex linguistic tasks in Chinese NLP.
zh
[NLP-95] Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning NAACL2025
【速读】: 该论文旨在深入分析大型语言模型(LLMs)在法律推理任务中的具体错误类型,并提出了一种基于自动评估框架的解决方案。论文的关键在于开发了一个新的错误分类法,并引入合理性(soundness)与正确性(correctness)评分这两项客观衡量标准来识别这些错误。此外,论文展示了将错误分类作为反馈应用于流行的提示技术中可以小幅提升LLMs的表现。该工作提供了一个详细的推理链错误分析框架,适用于逻辑密集型复杂任务。
链接: https://arxiv.org/abs/2502.05675
作者: Venkatesh Mishra,Bimsara Pathiraja,Mihir Parmar,Sat Chidananda,Jayanth Srinivasa,Gaowen Liu,Ali Payani,Chitta Baral
机构: Arizona State University; Cisco Research
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025 Findings
点击查看摘要
Abstract:Reasoning abilities of LLMs have been a key focus in recent years. One challenging reasoning domain with interesting nuances is legal reasoning, which requires careful application of rules and precedents while balancing deductive and analogical reasoning, and conflicts between rules. Although there have been a few works on using LLMs for legal reasoning, their focus has been on overall accuracy. In this paper, we dig deeper to do a step-by-step analysis and figure out where they commit errors. We use the college-level Multiple Choice Question-Answering (MCQA) task from the Civil Procedure dataset and propose a new error taxonomy derived from initial manual analysis of reasoning chains with respect to several LLMs, including two objective measures: soundness and correctness scores. We then develop an LLM-based automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs. The computation of soundness and correctness on the dataset using the auto-evaluator framework reveals several interesting insights. Furthermore, we show that incorporating the error taxonomy as feedback in popular prompting techniques marginally increases LLM performance. Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.
zh
[NLP-96] Language Models Largely Exhibit Human-like Constituent Ordering Preferences NAACL2025
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在处理不同类型的成分移位(constituent movement)时是否表现出与人类相似的语言处理模式。论文的关键在于比较多种具有不同特性的LLMs在四种成分移位类型(重名词短语移位、小品词移位(particle movement)、与格转换及多个介词短语移位)上的表现,以此评估LLMs的整体性能,并探讨其与人类语言处理模式的异同。尽管LLMs在小品词移位方面表现得不够理想,但总体上它们在成分排序方面与人类偏好一致。
链接: https://arxiv.org/abs/2502.05670
作者: Ada Defne Tur,Gaurav Kamath,Siva Reddy
机构: McGill University (麦吉尔大学); Mila - Quebec AI Institute (魁北克人工智能研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NAACL 2025 Main Conference
点击查看摘要
Abstract:Though English sentences are typically inflexible vis-à-vis word order, constituents often show far more variability in ordering. One prominent theory presents the notion that constituent ordering is directly correlated with constituent weight: a measure of the constituent’s length or complexity. Such theories are interesting in the context of natural language processing (NLP), because while recent advances in NLP have led to significant gains in the performance of large language models (LLMs), much remains unclear about how these models process language, and how this compares to human language processing. In particular, the question remains whether LLMs display the same patterns with constituent movement, and may provide insights into existing theories on when and how the shift occurs in human language. We compare a variety of LLMs with diverse properties to evaluate broad LLM performance on four types of constituent movement: heavy NP shift, particle movement, dative alternation, and multiple PPs. Despite performing unexpectedly around particle movement, LLMs generally align with human preferences around constituent ordering.
zh
[NLP-97] CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging NAACL2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码生成和问题解决中的初始代码生成质量不足的问题。当前方法依赖于基于工具的迭代调试器,这些方法利用编译器或其他工具的运行时反馈来改进由不同方法生成的粗略程序。然而,这些方法的效果高度依赖于初始代码生成的质量,这仍然是一个开放性挑战。论文提出的关键解决方案是CodeSim,这是一种新颖的多智能体代码生成框架,通过类人感知的方法全面解决程序综合的各个阶段,包括规划、编码和调试。CodeSim的独特之处在于通过逐步模拟输入/输出的方式进行计划验证和内部调试,从而有效提升代码生成的准确性和可靠性。
链接: https://arxiv.org/abs/2502.05664
作者: Md. Ashraful Islam,Mohammed Eunus Ali,Md Rizwan Parvez
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in NAACL 2025 Findings
点击查看摘要
Abstract:Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis-planning, coding, and debugging-through a human-like perception approach. As human verifies their understanding of any algorithms through visual simulation, CodeSim uniquely features a method of plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim’s remarkable code generation capabilities. Our framework achieves new state-of-the-art (pass@1) results-(HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework in this link (this https URL).
zh
[NLP-98] Evaluating Vision-Language Models for Emotion Recognition NAACL2025
【速读】: 该论文旨在解决大型视觉-语言模型(Vision-Language Models, VLMs)在识别图像所引发情绪方面的性能评估不足的问题。论文的关键在于创建了一个基准任务——所引发情绪识别,并从正确性和鲁棒性的角度评估了VLMs在此任务上的表现。通过实验和人类评估研究,论文揭示了影响情绪识别性能的重要因素以及VLMs在处理此类任务时常见的错误类型,从而为未来情感研究提供了指导性建议。
链接: https://arxiv.org/abs/2502.05660
作者: Sree Bhattacharyya,James Z. Wang
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to NAACL 2025 Findings
点击查看摘要
Abstract:Large Vision-Language Models (VLMs) have achieved unprecedented success in several objective multimodal reasoning tasks. However, to further enhance their capabilities of empathetic and effective communication with humans, improving how VLMs process and understand emotions is crucial. Despite significant research attention on improving affective understanding, there is a lack of detailed evaluations of VLMs for emotion-related tasks, which can potentially help inform downstream fine-tuning efforts. In this work, we present the first comprehensive evaluation of VLMs for recognizing evoked emotions from images. We create a benchmark for the task of evoked emotion recognition and study the performance of VLMs for this task, from perspectives of correctness and robustness. Through several experiments, we demonstrate important factors that emotion recognition performance depends on, and also characterize the various errors made by VLMs in the process. Finally, we pinpoint potential causes for errors through a human evaluation study. We use our experimental results to inform recommendations for the future of emotion research in the context of VLMs.
zh
[NLP-99] KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy NAACL2025
【速读】: 该论文旨在解决AI驱动的心理健康聊天机器人在训练数据和专业知识方面的局限性,特别是在非英语语言中的挑战。解决方案的关键在于提出了一种新颖的框架,该框架通过模拟专业治疗师的行为来增强动机访谈(Motivational Interviewing, MI)会话,并利用大型语言模型(Large Language Models, LLMs)通过提示工程生成对话。论文还介绍了一个名为KMI的合成数据集,包含1,000段高质量的韩语动机访谈对话,以此验证所提方法的质量和实用性。
链接: https://arxiv.org/abs/2502.05651
作者: Hyunjong Kim,Suyeon Lee,Yeongjae Cho,Eunseo Ryu,Yohan Jo,Suran Seong,Sungzoon Cho
机构: Seoul National University (首尔国立大学); Korea Counseling Graduate University (韩国咨询研究生院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at NAACL 2025 Main Conference
点击查看摘要
Abstract:The increasing demand for mental health services has led to the rise of AI-driven mental health chatbots, though challenges related to privacy, data collection, and expertise persist. Motivational Interviewing (MI) is gaining attention as a theoretical basis for boosting expertise in the development of these chatbots. However, existing datasets are showing limitations for training chatbots, leading to a substantial demand for publicly available resources in the field of MI and psychotherapy. These challenges are even more pronounced in non-English languages, where they receive less attention. In this paper, we propose a novel framework that simulates MI sessions enriched with the expertise of professional therapists. We train an MI forecaster model that mimics the behavioral choices of professional therapists and employ Large Language Models (LLMs) to generate utterances through prompt engineering. Then, we present KMI, the first synthetic dataset theoretically grounded in MI, containing 1,000 high-quality Korean Motivational Interviewing dialogues. Through an extensive expert evaluation of the generated dataset and the dialogue model trained on it, we demonstrate the quality, expertise, and practicality of KMI. We also introduce novel metrics derived from MI theory in order to evaluate dialogues from the perspective of MI.
zh
[NLP-100] Incongruence Identification in Eyewitness Testimony ACL
【速读】: 该论文旨在解决目击者证词中不一致检测的问题。传统方法难以处理这些证词中存在的细微矛盾。为了解决这一问题,论文提出了一种基于INTEND框架的新方法,该框架利用6W要素(who, what, when, where, why 等)和多跳推理策略来识别证词中的不一致,并在语句中标注出不一致的片段。关键在于通过提示调优(prompt tuning)显著提升了不一致检测的准确性,特别是在使用INTEND框架时,其F1分数表现优于传统的微调和常规提示调优技术。
链接: https://arxiv.org/abs/2502.05650
作者: Akshara Nair,Zeba Afroz,Md Shad Akhtar
机构: IIIT-Delhi(IIIT德里), New Delhi, India
类目: Computation and Language (cs.CL)
备注: 9 pages,10 tables. Under review at ACL ARR 2024. Includes supplementary appendix with detailed evaluation results
点击查看摘要
Abstract:Incongruence detection in eyewitness narratives is critical for understanding the reliability of testimonies, yet traditional approaches often fail to address the nuanced inconsistencies inherent in such accounts. In this paper, we introduce a novel task of incongruence detection in eyewitness testimonies. Given a pair of testimonies containing multiple question-answer pairs from two subjects, we identify contextually related incongruence between the two subjects. We also mark the span of incongruences in the utterances. To achieve this, we developed MIND (MultI-EyewitNess Deception), a comprehensive dataset consisting of 2927 pairs of contextually related answers designed to capture both explicit and implicit contradictions. We further propose INTEND, an INstruction-TunEd iNcongruity Detection framework based on a 6W and multi-hop reasoning approach. Drawing from investigative techniques, INTEND addresses the task as a cloze-style problem, focusing on the who, what, when, where, and why aspects of the content. Our findings show that prompt tuning, especially when utilizing our framework, enhances the detection of incongruences by a margin of +5.63 percent. We compare our approach with multiple fine-tuning and prompt tuning techniques on MLMs and LLMs. Empirical results demonstrate convincing performance improvements in F1-score over fine-tuned and regular prompt-tuning techniques, highlighting the effectiveness of our approach.
zh
[NLP-101] Gender Bias in Instruction-Guided Speech Synthesis Models NAACL2025
【速读】: 该论文旨在探究可控表达性语音合成模型在处理性别相关的职业描述提示(如“Act like a nurse”)时是否存在性别偏见。研究的关键在于分析这些模型如何解读模糊或抽象的风格提示,并评估其是否放大了性别刻板印象。实验结果揭示了模型在某些职业描述中存在性别偏向的趋势,并且不同规模的模型在这方面的表现也有所不同。
链接: https://arxiv.org/abs/2502.05649
作者: Chun-Yi Kuan,Hung-yi Lee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: NAACL 2025 Findings
点击查看摘要
Abstract:Recent advancements in controllable expressive speech synthesis, especially in text-to-speech (TTS) models, have allowed for the generation of speech with specific styles guided by textual descriptions, known as style prompts. While this development enhances the flexibility and naturalness of synthesized speech, there remains a significant gap in understanding how these models handle vague or abstract style prompts. This study investigates the potential gender bias in how models interpret occupation-related prompts, specifically examining their responses to instructions like “Act like a nurse”. We explore whether these models exhibit tendencies to amplify gender stereotypes when interpreting such prompts. Our experimental results reveal the model’s tendency to exhibit gender bias for certain occupations. Moreover, models of different sizes show varying degrees of this bias across these occupations.
zh
[NLP-102] ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports
【速读】: 该论文旨在解决欧洲医疗系统中临床数据处理的互操作性和数字化需求,通过利用大型语言模型(Large Language Models, LLMs)从非结构化的临床报告中提取结构化信息,重点关注患者病史、诊断、治疗及其他预定义类别。解决方案的关键在于开发了一种包含用户界面的工作流程,并通过提示策略和微调方法评估了不同规模的LLMs,结果显示经过微调的小型模型在性能上匹配或超越大型模型,尤其适用于资源受限的环境。
链接: https://arxiv.org/abs/2502.05638
作者: Aynur Guluzade,Naguib Heiba,Zeyd Boukhers,Florim Hamiti,Jahid Hasan Polash,Yehya Mohamad,Carlos A Velasco
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Europe’s healthcare systems require enhanced interoperability and digitalization, driving a demand for innovative solutions to process legacy clinical data. This paper presents the results of our project, which aims to leverage Large Language Models (LLMs) to extract structured information from unstructured clinical reports, focusing on patient history, diagnoses, treatments, and other predefined categories. We developed a workflow with a user interface and evaluated LLMs of varying sizes through prompting strategies and fine-tuning. Our results show that fine-tuned smaller models match or surpass larger counterparts in performance, offering efficiency for resource-limited settings. A new dataset of 60,000 annotated English clinical summaries and 24,000 German translations was validated with automated and manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics. The work highlights the approach’s viability and outlines future improvements.
zh
[NLP-103] AnyEdit: Edit Any Knowledge Encoded in Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理长格式且格式多样的知识(如诗歌、代码片段和数学推导)时,现有模型编辑方法因受限于单一令牌隐藏状态修改而难以实现高效精确更新的问题。论文提出的关键解决方案是AnyEdit,这是一种新的自回归编辑范式,通过将长格式知识分解为顺序片段,并迭代修改每个片段中的关键令牌,从而确保输出的一致性和准确性。理论上,AnyEdit基于互信息链式法则,能够更新LLMs内的任何知识;实证结果表明,相比强基准方法,其在包括UnKEBench、AKEW和新构建的EditEverything数据集在内的基准测试中提升了21.5%的性能。此外,AnyEdit作为一个即插即用框架,使当前编辑方法能够更新任意长度和格式的知识,显著扩展了LLMs知识编辑的范围和实用性。
链接: https://arxiv.org/abs/2502.05628
作者: Houcheng Jiang,Junfeng Fang,Ningyu Zhang,Guojun Ma,Mingyang Wan,Xiang Wang,Xiangnan He,Tat-seng Chua
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) often produce incorrect or outdated information, necessitating efficient and precise knowledge updates. Current model editing methods, however, struggle with long-form knowledge in diverse formats, such as poetry, code snippets, and mathematical derivations. These limitations arise from their reliance on editing a single token’s hidden state, a limitation we term “efficacy barrier”. To solve this, we propose AnyEdit, a new autoregressive editing paradigm. It decomposes long-form knowledge into sequential chunks and iteratively edits the key token in each chunk, ensuring consistent and accurate outputs. Theoretically, we ground AnyEdit in the Chain Rule of Mutual Information, showing its ability to update any knowledge within LLMs. Empirically, it outperforms strong baselines by 21.5% on benchmarks including UnKEBench, AKEW, and our new EditEverything dataset for long-form diverse-formatted knowledge. Additionally, AnyEdit serves as a plug-and-play framework, enabling current editing methods to update knowledge with arbitrary length and format, significantly advancing the scope and practicality of LLM knowledge editing.
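下面是一个只含控制流的极简草图,示意“将长格式知识切块、逐块定位关键 token 并迭代施加编辑”的自回归编辑流程;chunk_text、find_key_token、apply_edit 均为演示用占位函数,并非 AnyEdit 的真实实现。

```python
# 极简控制流示意:长文本知识 -> 顺序切块 -> 逐块编辑关键 token
def chunk_text(text: str, size: int = 50) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def find_key_token(chunk: str) -> str:
    return max(chunk.split(), key=len)  # 占位:取最长词充当“关键 token”

def apply_edit(model_state: dict, key_token: str, chunk: str) -> dict:
    model_state[key_token] = chunk      # 占位:记录针对该 token 的参数更新
    return model_state

def anyedit_style_update(model_state: dict, long_knowledge: str) -> dict:
    for chunk in chunk_text(long_knowledge):   # 自回归地逐块编辑,保持前后一致
        model_state = apply_edit(model_state, find_key_token(chunk), chunk)
    return model_state

state = anyedit_style_update({}, "a long derivation with many steps " * 30)
print(len(state))   # 已施加的分块编辑数
```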
zh
[NLP-104] Towards Sustainable NLP: Insights from Benchmarking Inference Energy in Large Language Models NAACL2025
【速读】: 该论文旨在解决大型语言模型(LLMs)在推理过程中的高能耗问题,这一问题在现有研究中主要关注训练成本而被忽视。研究的关键解决方案在于通过综合基准测试,分析不同模型、任务、提示语以及系统相关因素对推理能耗的影响,并发现输出token长度、响应时间与推理能耗之间存在强相关性。此外,论文提出量化及优化批量大小,结合针对性的提示语,可以显著降低能耗。这项研究首次全面评估了LLMs在如此广泛方面的推理能耗,提供了改进模型部署能效的见解和建议。
链接: https://arxiv.org/abs/2502.05610
作者: Soham Poddar,Paramita Koley,Janardan Misra,Niloy Ganguly,Saptarshi Ghosh
机构: Indian Institute of Technology, Kharagpur (印度理工学院卡拉格普尔分校); Indian Statistical Institute (印度统计研究所)
类目: Computation and Language (cs.CL)
备注: Accepted to appear at the NAACL 2025 Main Conference
点击查看摘要
Abstract:Large language models (LLMs) are increasingly recognized for their exceptional generative capabilities and versatility across various tasks. However, the high inference costs associated with these models have not received adequate attention, particularly when compared to the focus on training costs in existing research. In response to this gap, our study conducts a comprehensive benchmarking of LLM inference energy across a wide range of NLP tasks, where we analyze the impact of different models, tasks, prompts, and system-related factors on inference energy. Specifically, our experiments reveal several interesting insights, including strong correlation of inference energy with output token length and response time. Also, we find that quantization and optimal batch sizes, along with targeted prompt phrases, can significantly reduce energy usage. This study is the first to thoroughly benchmark LLM inference across such a diverse range of aspects, providing insights and offering several recommendations for improving energy efficiency in model deployment.
zh
[NLP-105] Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding NAACL2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)的推理加速问题,以满足实时交互的需求。当前推测解码(Speculative Decoding)中的起草策略通常需要大量微调,或在不同任务间表现不一致。论文的关键解决方案是提出一种名为分层起草(Hierarchy Drafting, HD)的无损起草方法。HD基于时间局部性将各种令牌来源组织到层次化框架中的多级数据库,在起草步骤中按局部性从高到低的顺序依次访问这些数据库以获取草稿令牌,从而确保在不同任务和模型规模下实现一致的加速,并将起草延迟降到最低。实验结果表明,HD方法优于现有的基于数据库的起草方法,在不同模型规模、任务和温度设置下均实现了稳健的推理加速。
链接: https://arxiv.org/abs/2502.05609
作者: Sukmin Cho,Sangjin Choi,Taeho Hwang,Jeongyeon Seo,Soyeong Jeong,Huije Lee,Hoyun Song,Jong C. Park,Youngjin Kwon
机构: 未知
类目: Computation and Language (cs.CL)
备注: Findings of NAACL 2025
点击查看摘要
Abstract:Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.
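下面用一个简化的 n-gram 检索示例说明“按时间局部性分层的草稿库”如何工作:起草时从局部性最高的数据库(当前上下文)依次向下查找,命中即返回草稿令牌,再交由目标模型验证。各层数据库的构造方式为演示用假设,并非原论文实现。

```python
from collections import defaultdict

# 三层草稿库(当前上下文 -> 近期历史 -> 静态语料),局部性依次降低
def build_ngram_db(tokens: list[str], n: int = 2) -> dict:
    db = defaultdict(list)
    for i in range(len(tokens) - n):
        db[tuple(tokens[i:i + n])].append(tokens[i + n])  # 前缀 -> 后继词
    return db

def hierarchical_draft(prefix: list[str], dbs: list[dict], n: int = 2, k: int = 3):
    key = tuple(prefix[-n:])
    for db in dbs:            # 从局部性最高的库依次向下查找
        if key in db:
            return db[key][:k]  # 命中即返回草稿令牌,交由目标模型并行验证
    return []

context = "the cat sat on the mat and the cat ran".split()
history = "the cat ran after the dog".split()
corpus  = "the dog sat on the log".split()
dbs = [build_ngram_db(context), build_ngram_db(history), build_ngram_db(corpus)]
print(hierarchical_draft(context, dbs))   # 在“近期历史”层命中 -> ['after']
```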
zh
[NLP-106] ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在响应过程中难以通过外部交互修正错误的问题。为实现这一目标,论文提出的关键解决方案是引入ARIES(自适应精化与迭代增强结构),通过迭代偏好训练和基于自我精化的数据收集方法来培养LLM的自我修正能力。这种方法不仅增强了模型直接问答的能力,还解锁了其自我精化的潜力,并在推理阶段利用这种能力生成一系列逐步优化的响应,从而显著提升模型性能。
链接: https://arxiv.org/abs/2502.05605
作者: Yongcheng Zeng,Xinyu Cui,Xuanfa Jin,Guoqing Liu,Zexu Sun,Quan He,Dong Li,Ning Yang,Jianye Hao,Haifeng Zhang,Jun Wang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:A truly intelligent Large Language Model (LLM) should be capable of correcting errors in its responses through external interactions. However, even the most advanced models often face challenges in improving their outputs. In this paper, we explore how to cultivate LLMs with the self-refinement capability through iterative preference training, and how this ability can be leveraged to improve model performance during inference. To this end, we introduce a novel post-training and inference framework, called ARIES: Adaptive Refinement and Iterative Enhancement Structure. This method iteratively performs preference training and self-refinement-based data collection. During training, ARIES strengthen the model’s direct question-answering capability while simultaneously unlocking its self-refinement potential. During inference, ARIES harnesses this self-refinement capability to generate a series of progressively refined responses, which are then filtered using either the Reward Model Scoring or a simple yet effective Rule-Based Selection mechanism, specifically tailored to our approach, to construct a dataset for the next round of preference training. Experimental results demonstrate the remarkable performance of ARIES. When applied to the Llama-3.1-8B model and under the self-refinement setting, ARIES surpasses powerful models such as GPT-4o, achieving 62.3% length-controlled (LC) and a 63.3% raw win rates on AlpacaEval 2, outperforming Iterative DPO by 27.8% and 35.5% respectively, as well as a 50.3% win rate on Arena-Hard, surpassing Iterative DPO by 26.6%. Furthermore, ARIES consistently enhances performance on mathematical reasoning tasks like GSM8K and MATH.
zh
[NLP-107] On Memory Construction and Retrieval for Personalized Conversational Agents
【速读】: 该论文旨在解决如何在长期对话中提供连贯且个性化体验的问题。现有方法通常通过在轮次级别、会话级别或借助摘要技术从对话历史构建记忆库,来实现检索增强的响应生成。论文的关键发现在于:(1) 记忆单元的粒度至关重要,轮次级别、会话级别和基于摘要的方法在记忆检索准确性和检索内容的语义质量方面均存在局限性;(2) 提示压缩方法,如 LLMLingua-2,可以作为去噪机制有效提升不同粒度下的记忆检索准确性。基于这些洞察,论文提出SeCom方法,通过引入对话分割(Segmentation)模型构建以主题片段为单元的记忆库,并基于压缩(Compression)后的记忆单元进行检索。实验结果显示SeCom在长期对话基准测试(如LOCOMO和Long-MT-Bench+)上优于轮次级别、会话级别以及几种基于摘要的方法。此外,所提出的对话分割方法在对话分割数据集(如DialSeg711、TIAGE和SuperDialSeg)上表现出色。
链接: https://arxiv.org/abs/2502.05589
作者: Zhuoshi Pan,Qianhui Wu,Huiqiang Jiang,Xufang Luo,Hao Cheng,Dongsheng Li,Yuqing Yang,Chin-Yew Lin,H. Vicky Zhao,Lili Qiu,Jianfeng Gao
机构: Tsinghua University (清华大学); Microsoft Corporation (微软公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, conference
点击查看摘要
Abstract:To deliver coherent and personalized experiences in long-term conversations, existing approaches typically perform retrieval augmented response generation by constructing memory banks from conversation history at either the turn-level, session-level, or through summarization techniques. In this paper, we present two key findings: (1) The granularity of memory unit matters: Turn-level, session-level, and summarization-based methods each exhibit limitations in both memory retrieval accuracy and the semantic quality of the retrieved content. (2) Prompt compression methods, such as LLMLingua-2, can effectively serve as a denoising mechanism, enhancing memory retrieval accuracy across different granularities. Building on these insights, we propose SeCom, a method that constructs a memory bank with topical segments by introducing a conversation Segmentation model, while performing memory retrieval based on Compressed memory units. Experimental results show that SeCom outperforms turn-level, session-level, and several summarization-based methods on long-term conversation benchmarks such as LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg.
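下面给出一个极简草图,示意“话题分段构建记忆库 + 压缩去噪 + 相似度检索”的流程;其中 embed 用以文本哈希为种子的随机单位向量模拟句向量,compress 用丢弃过短语句代替 LLMLingua-2 式的提示压缩,均为演示用假设,并非 SeCom 的官方实现。

```python
import numpy as np

# 分段 -> 压缩去噪 -> 基于向量相似度检索记忆单元
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))  # 演示用伪句向量
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def compress(segment: list[str]) -> str:
    # 占位:丢弃过短语句,模拟提示压缩的去噪效果
    return " ".join(u for u in segment if len(u.split()) > 2)

def build_memory(turns: list[str], seg_len: int = 4):
    segments = [turns[i:i + seg_len] for i in range(0, len(turns), seg_len)]
    return [(compress(seg), embed(compress(seg))) for seg in segments]

def retrieve(memory, query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(memory, key=lambda m: -float(m[1] @ q))  # 余弦相似度排序
    return [text for text, _ in ranked[:k]]

turns = ["hi there", "I moved to Paris last year", "the weather is nice",
         "my cat is named Momo", "I love jazz music", "what about you"]
memory = build_memory(turns)
print(retrieve(memory, "where do you live now"))
```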
zh
[NLP-108] Large Multimodal Models for Low-Resource Languages: A Survey
【速读】: 该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在低资源(Low-Resource, LR)语言环境中的应用挑战。论文的关键在于分析和总结了多种适应策略,包括视觉增强、数据创建、跨模态迁移和融合方法。通过综合分析106项研究,覆盖了75种低资源语言,论文指出视觉信息在提升模型性能方面起到了关键作用,但仍然存在幻觉抑制和计算效率等显著挑战。论文的目标是帮助研究人员理解当前的方法及存在的问题,以使LMMs能够更好地服务于低资源语言的使用者。
链接: https://arxiv.org/abs/2502.05568
作者: Marian Lupascu,Ana-Cristina Rogoz,Mihai Sorin Stupariu,Radu Tudor Ionescu
机构: Department of Computer Science, University of Bucharest (布加勒斯特大学), Romania (罗马尼亚)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 106 studies across 75 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. We aim to provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: this https URL.
zh
[NLP-109] ATLAS: Autoformalizing Theorems through Lifting Augmentation and Synthesis of Data
【速读】: 该论文旨在解决自动形式化(Autoformalization)领域中配对数据集稀缺的问题,这对大型语言模型(LLMs)在该领域的进一步发展构成了关键障碍。为了解决这一挑战,论文提出了一种名为ATLAS(通过提升、增强和数据合成进行定理的自动形式化)的迭代数据生成框架。该框架的关键在于通过多轮迭代生成大规模高质量的自然语言与形式语言对齐的定理陈述,从而显著提升了模型在多个基准测试中的性能。
链接: https://arxiv.org/abs/2502.05567
作者: Xiaoyang Liu,Kangjie Bao,Jiashuo Zhang,Yunqi Liu,Yu Chen,Yuntian Liu,Yang Jiao,Tao Luo
机构: School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院); SPEIT, Shanghai Jiao Tong University(上海交通大学巴黎卓越工程师学院); JoinTech Co., Ltd(JoinTech有限公司); Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University(上海交通大学自然科学研究院, 教育部-上海市重点实验室); CMA-Shanghai, Shanghai Artificial Intelligence Laboratory(上海人工智能实验室CMA分部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Autoformalization, the process of automatically translating natural language mathematics into machine-verifiable formal language, has demonstrated advancements with the progress of large language models (LLMs). However, a key obstacle to further advancements is the scarcity of paired datasets that align natural language with formal language. To address this challenge, we introduce ATLAS (Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data), an iterative data generation framework designed to produce large-scale, high-quality parallel theorem statements. With the proposed ATLAS running for 10 iterations, we construct an undergraduate-level dataset comprising 300k theorem statements and develop the ATLAS translator, achieving accuracies of 80.59% (pass@8) and 92.99% (pass@128) on ProofNet, significantly outperforming the base model (23.99% and 47.17%) and InternLM2-Math-Plus-7B (50.94% and 80.32%). Furthermore, the ATLAS translator also achieves state-of-the-art performance on both the high-school-level miniF2F dataset and the graduate-level MathQual dataset introduced in this work. The datasets, model, and code will be released to the public soon.
zh
[NLP-110] Latent Structure Modulation in Large Language Models Through Stochastic Concept Embedding Transitions
【速读】: 该论文旨在解决静态或确定性嵌入在处理文本表示时所面临的限制,通过引入随机嵌入转换机制,提供了一种动态调整标记表示的概率方法。解决方案的关键在于提出了一种过渡框架,使每个标记嵌入通过概率更新进行演变,从而确保适应性的同时保持语义完整性,并且展示了这种机制在增强词汇多样性、提升生成连贯性和保留低频词方面的有效性。实验结果表明,这种随机转换在引入微小计算开销的同时保持了生成效率,验证了其在大规模应用中的可行性。
链接: https://arxiv.org/abs/2502.05553
作者: Stefan Whitaker,Colin Sisate,Marcel Windsor,Nikolai Fairweather,Tarquin Goldborough,Oskar Lindenfeld
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Stochastic embedding transitions introduce a probabilistic mechanism for adjusting token representations dynamically during inference, mitigating the constraints imposed through static or deterministic embeddings. A transition framework was proposed in which each token embedding evolved through probabilistic updates, ensuring adaptability while preserving semantic integrity across linguistic contexts. Empirical evaluations demonstrated that models incorporating stochastic transitions exhibited greater lexical diversity, improved generative coherence, and enhanced retention of low-frequency vocabulary, contributing to more varied sentence structures and reduced reliance on high-probability token selections. Statistical analyses of embedding drift across transformer layers indicated that representations evolved more flexibly without losing coherence, supporting the hypothesis that controlled stochasticity facilitated context-sensitive representation learning. Experimental results revealed that probabilistic embeddings introduced minor computational overhead while maintaining generative efficiency, reinforcing their feasibility in large-scale applications. A comparative study with traditional embedding approaches highlighted measurable gains in text completion accuracy, dialogue coherence, and structural complexity, confirming the effectiveness of stochastic transitions in enhancing representation expressiveness. Clustering patterns in the embedding space suggested that probabilistic updates preserved meaningful semantic groupings while enabling context-driven shifts, further validating the stability of the transition mechanism. Performance metrics indicated that stochastic transitions balanced adaptability and control, ensuring that generative outputs remained linguistically coherent without excessive randomness.
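下面是一个极简示意,展示“推理时对 token 嵌入做受控的随机转移”这一机制:加入小尺度噪声后重新归一化,以在随机性与表示稳定性之间取得平衡。drift_scale 等超参数均为演示用假设,并非原论文实现。

```python
import torch

# 推理时对 token 嵌入施加概率性扰动,而非使用固定查表结果
def stochastic_transition(emb: torch.Tensor, drift_scale: float = 0.05) -> torch.Tensor:
    noise = torch.randn_like(emb) * drift_scale          # 受控随机转移
    updated = emb + noise
    # 保持每个 token 嵌入的原有范数,避免语义幅度漂移
    return updated / updated.norm(dim=-1, keepdim=True) * emb.norm(dim=-1, keepdim=True)

tokens_emb = torch.randn(10, 64)                 # 10 个 token 的嵌入(演示数据)
print(stochastic_transition(tokens_emb).shape)   # torch.Size([10, 64])
```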
zh
[NLP-111] FRAMES: Boosting LLM s with A Four-Quadrant Multi-Stage Pretraining Strategy
【速读】: 该论文旨在解决多阶段预训练中数据划分缺乏定量标准的问题。解决方案的关键在于提出了四象限多阶段预训练策略(FRAMES),通过将预训练过程组织成四个阶段,并依据困惑度(Perplexity, PPL)和困惑度差异(PPL Difference, PD)对数据进行分区和有序排列,实现了平均性能提升16.8%,显著优于随机采样方法。
链接: https://arxiv.org/abs/2502.05551
作者: Xuemiao Zhang,Feiyu Duan,Liangyu Xu,Yongwei Zhou,Sirui Wang,Rongxiang Weng,Jingang Wang,Xunliang Cai
机构: Peking University (北京大学); Beihang University (北京航空航天大学); Tsinghua University (清华大学); Meituan
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have significantly advanced human language understanding and generation, with pretraining data quality and organization being crucial to their performance. Multi-stage pretraining is a promising approach, but existing methods often lack quantitative criteria for data partitioning and instead rely on intuitive heuristics. In this paper, we propose the novel Four-quadRAnt Multi-stage prEtraining Strategy (FRAMES), guided by the established principle of organizing the pretraining process into four stages to achieve significant loss reductions four times. This principle is grounded in two key findings: first, training on high Perplexity (PPL) data followed by low PPL data, and second, training on low PPL difference (PD) data followed by high PD data, both causing the loss to drop significantly twice and performance enhancements. By partitioning data into four quadrants and strategically organizing them, FRAMES achieves a remarkable 16.8% average improvement over random sampling across MMLU and CMMLU, effectively boosting LLM performance.
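下面用几行 NumPy 示意“按 PPL 与 PD 把数据划分为四个象限并排定训练阶段”的做法;阈值取中位数以及象限的先后顺序均为依据摘要原则(先高 PPL 后低 PPL、先低 PD 后高 PD)的演示用假设,实际划分标准以论文为准。

```python
import numpy as np

# 按 PPL 与 PD 的中位数把样本划分为四个象限,得到四个训练阶段的数据索引
def four_quadrant_schedule(ppl: np.ndarray, pd: np.ndarray) -> list[np.ndarray]:
    ppl_mid, pd_mid = np.median(ppl), np.median(pd)
    order = [
        (ppl >= ppl_mid) & (pd < pd_mid),   # 阶段1:高 PPL、低 PD
        (ppl >= ppl_mid) & (pd >= pd_mid),  # 阶段2:高 PPL、高 PD
        (ppl < ppl_mid) & (pd < pd_mid),    # 阶段3:低 PPL、低 PD
        (ppl < ppl_mid) & (pd >= pd_mid),   # 阶段4:低 PPL、高 PD
    ]
    return [np.where(mask)[0] for mask in order]

rng = np.random.default_rng(0)
ppl, pd = rng.random(1000), rng.random(1000)
stages = four_quadrant_schedule(ppl, pd)   # 每个阶段各自的数据索引
print([len(s) for s in stages])
```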
zh
[NLP-112] DeepThink: Aligning Language Models with Domain-Specific User Intents
【速读】: 该论文旨在解决现有方法在适应大型语言模型(LLM)到特定领域问答任务时,合成指令与真实用户问题及其预期答案存在偏差的问题。解决方案的关键在于提出了一种名为DeepThink的新框架,该框架通过生成模拟真实用户问题的种子问题、模拟对话以揭示隐藏的用户需求,并结合对话上下文和检索文档来优化答案,从而生成高质量的指令。实验表明,DeepThink在广告领域的实际用户测试集中,在相关性、完整性、清晰度、准确性和可操作性等方面平均性能提升了7.92%,优于基于GPT-4-turbo+RAG的助手。
链接: https://arxiv.org/abs/2502.05497
作者: Yang Li,Mingxuan Luo,Yeyun Gong,Chen Lin,Jian Jiao,Yi Liu,Kaili Huang
机构: Xiamen University (厦门大学); Microsoft (微软)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Supervised fine-tuning with synthesized instructions has been a common practice for adapting LLMs to domain-specific QA tasks. However, the synthesized instructions deviate from real user questions and expected answers. This study proposes a novel framework called DeepThink to generate high-quality instructions. DeepThink first generates a few seed questions to mimic actual user questions, simulates conversations to uncover the hidden user needs, and refines the answer by conversational contexts and the retrieved documents for more comprehensive answers. Experiments demonstrate that DeepThink achieves an average performance improvement of 7.92% compared to a GPT-4-turbo+RAG-based assistant on the real user test set in the advertising domain across dimensions such as relevance, completeness, clarity, accuracy, and actionability.
zh
[NLP-113] Mechanistic Interpretability of Emotion Inference in Large Language Models ACL2025
【速读】: 该论文旨在解决大型语言模型(LLMs)在预测文本中人类情感时的机制不明确的问题。关键解决方案在于通过探究自回归LLMs如何推断情感,发现情感表征在模型中特定区域功能定位,并通过因果干预构建的评估概念来引导情感文本生成,从而验证这些表征的心理合理性,并与理论及直观预期一致。这一方法展示了如何进行因果干预以精确塑造情感文本生成,可能有助于敏感情感领域中的安全性和一致性。
链接: https://arxiv.org/abs/2502.05489
作者: Ala N. Tak,Amin Banayeeanzade,Anahita Bolourani,Mina Kian,Robin Jia,Jonathan Gratch
机构: Institute for Creative Technologies, University of Southern California (USC); Information Sciences Institute, University of Southern California (USC); Department of Computer Science, University of Southern California (USC); Department of Statistics and Data Science, University of California, Los Angeles (UCLA)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be submitted to the Association for Computational Linguistics (ACL 2025)
点击查看摘要
Abstract:Large language models (LLMs) show promising capabilities in predicting human emotions from text. However, the mechanisms through which these models process emotional stimuli remain largely unexplored. Our study addresses this gap by investigating how autoregressive LLMs infer emotions, showing that emotion representations are functionally localized to specific regions in the model. Our evaluation includes diverse model families and sizes and is supported by robustness checks. We then show that the identified representations are psychologically plausible by drawing on cognitive appraisal theory, a well-established psychological framework positing that emotions emerge from evaluations (appraisals) of environmental stimuli. By causally intervening on construed appraisal concepts, we steer the generation and show that the outputs align with theoretical and intuitive expectations. This work highlights a novel way to causally intervene and precisely shape emotional text generation, potentially benefiting safety and alignment in sensitive affective domains.
zh
[NLP-114] OntoTune: Ontology-Driven Self-training for Aligning Large Language Models WWW25
【速读】: 该论文旨在解决现有领域专用大型语言模型(Large Language Models, LLMs)在处理领域知识时存在的组织不善和碎片化理解问题。为了解决这一问题,论文提出了一种名为OntoTune的基于本体驱动的自训练框架。OntoTune的关键在于利用情境学习(in-context learning)方法,通过与现有的长期开发的本体(ontology)对齐,使LLMs能够更好地理解和生成响应,从而提高模型的知识组织能力和泛化能力。这种方法显著减少了数据维护成本,并在医学领域的实验中展示了优越的性能。
链接: https://arxiv.org/abs/2502.05478
作者: Zhiqiang Liu,Chengtao Gan,Junjie Wang,Yichi Zhang,Zhongpu Bo,Mengshu Sun,Huajun Chen,Wen Zhang
机构: Zhejiang University(浙江大学); Ant Group(蚂蚁集团)
类目: Computation and Language (cs.CL)
备注: Accepted by WWW25
点击查看摘要
Abstract:Existing domain-specific Large Language Models (LLMs) are typically developed by fine-tuning general-purposed LLMs with large-scale domain-specific corpora. However, training on large-scale corpora often fails to effectively organize domain knowledge of LLMs, leading to fragmented understanding. Inspired by how humans connect concepts and organize knowledge through mind maps, we aim to emulate this approach by using ontology with hierarchical conceptual knowledge to reorganize LLM’s domain knowledge. From this perspective, we propose an ontology-driven self-training framework called OntoTune, which aims to align LLMs with ontology through in-context learning, enabling the generation of responses guided by the ontology. We leverage in-context learning to identify whether the LLM has acquired the specific concept’s ontology knowledge, and select the entries not yet mastered by LLM as the training set to further align the LLM with ontology. Compared to existing domain LLMs based on newly collected large-scale domain-specific corpora, our OntoTune, which relies on the existing, long-term developed ontology and LLM itself, significantly reduces data maintenance costs and offers improved generalization ability. We conduct our study in the medical domain to evaluate the effectiveness of OntoTune, utilizing a standardized medical ontology, SNOMED CT as our ontology source. Experimental results demonstrate that OntoTune achieves state-of-the-art performance in both in-ontology task hypernym discovery and out-of-ontology task medical domain QA. Moreover, compared to the latest direct ontology injection method TaxoLLaMA, our OntoTune better preserves original knowledge of LLM. The code and data are available at this https URL.
zh
[NLP-115] Position: LLMs Can be Good Tutors in Foreign Language Education
【速读】: 该论文旨在解决大型语言模型(LLMs)在外国语言教育(FLE)中的应用局限于传统学习任务方法,缺乏适应性的问题。关键解决方案在于充分利用LLMs的潜力,使其作为有效的导师,在三个关键角色中发挥作用:(1) 数据增强器,改进学习材料的创建或充当学生模拟;(2) 任务预测器,用于学习者评估或优化学习路径;(3) 代理,实现个性化和包容性教育。通过跨学科研究探索这些角色,促进创新的同时应对挑战和风险,从而通过深思熟虑的整合推进FLE。
链接: https://arxiv.org/abs/2502.05467
作者: Jingheng Ye,Shen Wang,Deqing Zou,Yibo Yan,Kun Wang,Hai-Tao Zheng,Zenglin Xu,Irwin King,Philip S. Yu,Qingsong Wen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures
点击查看摘要
Abstract:While recent efforts have begun integrating large language models (LLMs) into foreign language education (FLE), they often rely on traditional approaches to learning tasks without fully embracing educational methodologies, thus lacking adaptability to language learning. To address this gap, we argue that LLMs have the potential to serve as effective tutors in FLE. Specifically, LLMs can play three critical roles: (1) as data enhancers, improving the creation of learning materials or serving as student simulations; (2) as task predictors, serving as learner assessment or optimizing learning pathway; and (3) as agents, enabling personalized and inclusive education. We encourage interdisciplinary research to explore these roles, fostering innovation while addressing challenges and risks, ultimately advancing FLE through the thoughtful integration of LLMs.
zh
[NLP-116] Iterative Deepening Sampling for Large Language Models
【速读】: 该论文旨在解决在训练大型语言模型(Large Language Models, LLMs)过程中,实现有效自评估和自修正的挑战。这一挑战的关键在于高质量自我反思数据的生成。论文提出了一种新颖的迭代深化采样算法框架,通过手动触发模型的自修正机制来提升其在复杂推理任务中的表现。该方法通过在Math500和AIME基准测试上的广泛实验验证,展示了在困难任务上具有更高的成功率,并进行了详细的消融研究以分析其在不同设置下的有效性。
链接: https://arxiv.org/abs/2502.05449
作者: Weizhe Chen,Sven Koenig,Bistra Dilkina
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The recent release of OpenAI’s o1 models and other similar frameworks showcasing test-time scaling laws has demonstrated their exceptional capability to tackle complex reasoning tasks. Inspired by this, subsequent research has revealed that such test-time scaling laws hinge on the model’s ability to search both within a single response (intra-response) and across multiple responses (inter-response) during training. Crucially, beyond selecting a single optimal response, the model must also develop robust self-correction capabilities within its own outputs. However, training models to achieve effective self-evaluation and self-correction remains a significant challenge, heavily dependent on the quality of self-reflection data. In this paper, we address this challenge by focusing on enhancing the quality of self-reflection data generation for complex problem-solving, which can subsequently improve the training of next-generation large language models (LLMs). Specifically, we explore how manually triggering a model’s self-correction mechanisms can improve performance on challenging reasoning tasks. To this end, we propose a novel iterative deepening sampling algorithm framework designed to enhance self-correction and generate higher-quality samples. Through extensive experiments on Math500 and AIME benchmarks, we demonstrate that our method achieves a higher success rate on difficult tasks and provide detailed ablation studies to analyze its effectiveness across diverse settings.
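下面是一个只含控制流的草图,示意“迭代加深采样”的核心循环:采样一个答案,若未通过校验,则把上一轮答案连同“请自查并修正”的指令拼回提示后再采样。generate 与 verify 为演示用占位函数,分别对应真实场景中的 LLM 采样与结果验证器,并非原论文实现。

```python
import random

def generate(prompt: str, depth: int) -> str:
    return f"answer(depth={depth}, r={random.random():.2f})"   # 占位生成

def verify(answer: str) -> bool:
    return random.random() > 0.6                               # 占位校验

def iterative_deepening_sampling(question: str, max_depth: int = 4) -> str:
    prompt = question
    answer = ""
    for depth in range(max_depth):
        answer = generate(prompt, depth)
        if verify(answer):
            return answer
        # 手动触发自我纠错:把上一轮答案与“请自查并修正”的指令拼回提示
        prompt = f"{question}\n上一轮答案:{answer}\n请检查其中的错误并给出修正后的解答。"
    return answer

print(iterative_deepening_sampling("求 1+2+...+100 的值"))
```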
zh
[NLP-117] Agentic AI Systems Applied to tasks in Financial Services: Modeling and model risk management crews
【速读】: 该论文旨在探索金融服务业中自主决策系统的流程与应用,特别关注于构建能够有效协作以完成复杂建模及模型风险管理(Model Risk Management, MRM)任务的自主系统。论文的关键在于通过设计包含经理与多个执行特定任务(如数据探索分析、特征工程、模型选择、超参数调优、模型训练、模型评估以及文档撰写)的建模团队,以及负责验证建模文档合规性、模型复现、概念合理性、结果分析及文档撰写的MRM团队,来展示所构建系统的有效性和鲁棒性。该方法通过应用于信用卡欺诈检测、信用卡审批和投资组合信贷风险建模等实际数据集中的数值示例得以验证。
链接: https://arxiv.org/abs/2502.05439
作者: Izunna Okpala,Ashkan Golgoon,Arjun Ravi Kannan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The advent of large language models has ushered in a new era of agentic systems, where artificial intelligence programs exhibit remarkable autonomous decision-making capabilities across diverse domains. This paper explores agentic system workflows in the financial services industry. In particular, we build agentic crews that can effectively collaborate to perform complex modeling and model risk management (MRM) tasks. The modeling crew consists of a manager and multiple agents who perform specific tasks such as exploratory data analysis, feature engineering, model selection, hyperparameter tuning, model training, model evaluation, and writing documentation. The MRM crew consists of a manager along with specialized agents who perform tasks such as checking compliance of modeling documentation, model replication, conceptual soundness, analysis of outcomes, and writing documentation. We demonstrate the effectiveness and robustness of modeling and MRM crews by presenting a series of numerical examples applied to credit card fraud detection, credit card approval, and portfolio credit risk modeling datasets.
zh
[NLP-118] Toward Copyright Integrity and Verifiability via Multi-Bit Watermarking for Intelligent Transportation Systems
【速读】: 该论文旨在解决智能交通系统(ITS)数据完整性验证与版权保护的问题。解决方案的关键在于提出了一种名为ITSmark的水印技术,该技术通过构建多比特空间并将其划分为多个区段来嵌入版权信息,从而生成自定义水印。ITSmark通过加密关键参数以确保授权,只有持有正确的密文数据和私钥才能完整提取水印,以此保证数据的质量、提取精度以及不可伪造性。此外,ITSmark还具备权限验证和篡改位置追踪的独特能力,确保了提取的安全性和版权验证的可靠性。
链接: https://arxiv.org/abs/2502.05425
作者: Yihao Wang,Lingxiao Li,Yifan Tang,Ru Zhang,Jianyi Liu
机构: School of Cyberspace Security, Beijing University of Posts and Telecommunications (北京邮电大学网络空间安全学院); State Key Laboratory of Networking and Switching Technology (网络与交换技术国家重点实验室)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 11 figures, 10 tables. Accepted for publication in IEEE Transactions on Intelligent Transportation Systems (accepted versions, not the IEEE-published versions). ©2025 IEEE. All rights reserved, including rights for text and data mining, and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission
点击查看摘要
Abstract:Intelligent transportation systems (ITS) use advanced technologies such as artificial intelligence to significantly improve traffic flow management efficiency, and promote the intelligent development of the transportation industry. However, if the data in ITS is attacked, such as tampering or forgery, it will endanger public safety and cause social losses. Therefore, this paper proposes a watermarking that can verify the integrity of copyright in response to the needs of ITS, termed ITSmark. ITSmark focuses on functions such as extracting watermarks, verifying permission, and tracing tampered locations. The scheme uses the copyright information to build the multi-bit space and divides this space into multiple segments. These segments will be assigned to tokens. Thus, the next token is determined by its segment which contains the copyright. In this way, the obtained data contains the custom watermark. To ensure the authorization, key parameters are encrypted during copyright embedding to obtain cipher data. Only by possessing the correct cipher data and private key, can the user entirely extract the watermark. Experiments show that ITSmark surpasses baseline performances in data quality, extraction accuracy, and unforgeability. It also shows unique capabilities of permission verification and tampered location tracing, which ensures the security of extraction and the reliability of copyright verification. Furthermore, ITSmark can also customize the watermark embedding position and proportion according to user needs, making embedding more flexible.
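下面用一个玩具例子示意“多比特空间分段决定下一令牌”的嵌入思路:把版权比特串映射为 [0,1) 区间内的一个点,再按候选令牌的概率切分区间,版权点落入哪一段就选哪个令牌。该草图未包含加密与授权环节,并非 ITSmark 的完整实现。

```python
# 把多比特版权信息映射到 [0,1) 的一个点,再按 token 概率切分区间段
def bits_to_fraction(bits: str) -> float:
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

def pick_token(candidates: list[tuple[str, float]], point: float) -> str:
    cum = 0.0
    for token, prob in candidates:     # 按概率依次划分区间段
        cum += prob
        if point < cum:
            return token               # 版权点落入该段 -> 选中此 token
    return candidates[-1][0]

copyright_bits = "1011"                                   # 待嵌入的版权比特
point = bits_to_fraction(copyright_bits)                  # -> 0.6875
candidates = [("hello", 0.5), ("hi", 0.3), ("hey", 0.2)]  # 模型的下一 token 分布
print(pick_token(candidates, point))   # 版权位决定生成结果,因而可被反向提取
```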
zh
[NLP-119] SAMGPT : Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation WWW2025
【速读】: 该论文旨在解决如何在多个源领域训练图基础模型并适应未见过的目标领域的问题。关键在于提出了一种名为结构对齐的多域图预训练与跨域适应框架(Structure Alignment framework for text-free Multi-domain Graph Pre-Training and cross-domain adaptation, SAMGPT)。该框架通过引入结构标记来协调源领域的基于结构的聚合,并设计了整体提示和特定提示来进行跨域适应,从而有效融合多域结构知识和领域特定信息。
链接: https://arxiv.org/abs/2502.05424
作者: Xingtong Yu,Zechuan Gong,Chang Zhou,Yuan Fang,Hui Zhang
机构: Singapore Management University(新加坡管理大学); University of Science and Technology of China(中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by WWW2025 Main Track
点击查看摘要
Abstract:Graphs are able to model interconnected entities in many online services, supporting a wide range of applications on the Web. This raises an important question: How can we train a graph foundational model on multiple source domains and adapt to an unseen target domain? A major obstacle is that graphs from different domains often exhibit divergent characteristics. Some studies leverage large language models to align multiple domains based on textual descriptions associated with the graphs, limiting their applicability to text-attributed graphs. For text-free graphs, a few recent works attempt to align different feature distributions across domains, while generally neglecting structural differences. In this work, we propose a novel Structure Alignment framework for text-free Multi-domain Graph Pre-Training and cross-domain adaptation (SAMGPT). It is designed to learn multi-domain knowledge from graphs originating in multiple source domains, which can then be adapted to address applications in an unseen target domain. Specifically, we introduce a set of structure tokens to harmonize structure-based aggregation across source domains during the pre-training phase. Next, for cross-domain adaptation, we design dual prompts, namely, holistic prompts and specific prompts, which adapt unified multi-domain structural knowledge and fine-grained, domain-specific information, respectively, to a target domain. Finally, we conduct comprehensive experiments on seven public datasets to evaluate and analyze the effectiveness of SAMGPT.
zh
[NLP-120] Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints
【速读】: 该论文旨在解决分子任务中上下文学习(In-context Learning, ICL)存在的两个主要问题:一是当前基于分子特征相似性的提示检索方法(如Morgan指纹)无法充分捕捉全局分子及原子结合关系,导致在推理过程中未能全面表示分子结构的复杂性;二是小型到中型语言模型(LLMs)在分子ICL领域的探索较少。为了解决这些问题,论文提出了一种自监督学习技术——图对齐分子上下文学习(Graph-Aligned Molecular In-Context learning, GAMIC)。GAMIC的关键在于通过图神经网络(GNNs)对全局分子结构进行表征,并与文本描述对齐,同时利用局部特征相似性(如Morgan指纹),并且引入基于最大边际相关性(Maximum Marginal Relevance, MMR)的多样性启发式方法来优化输入提示中的示例样本。实验结果表明,GAMIC在所有任务上比简单的基于Morgan的ICL检索方法提高了最多45%的性能。
链接: https://arxiv.org/abs/2502.05414
作者: Ali Al-Lawati,Jason Lucas,Zhiwei Zhang,Prasenjit Mitra,Suhang Wang
机构: The Pennsylvania State University(宾夕法尼亚州立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In-context learning (ICL) effectively conditions large language models (LLMs) for molecular tasks, such as property prediction and molecule captioning, by embedding carefully selected demonstration examples into the input prompt. This approach avoids the computational overhead of extensive pretraining and fine-tuning. However, current prompt retrieval methods for molecular tasks have relied on molecule feature similarity, such as Morgan fingerprints, which do not adequately capture the global molecular and atom-binding relationships. As a result, these methods fail to represent the full complexity of molecular structures during inference. Moreover, small-to-medium-sized LLMs, which offer simpler deployment requirements in specialized systems, have remained largely unexplored in the molecular ICL literature. To address these gaps, we propose a self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context learning), which aligns global molecular structures, represented by graph neural networks (GNNs), with textual captions (descriptions) while leveraging local feature similarity through Morgan fingerprints. In addition, we introduce a Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to optimize input prompt demonstration samples. Our experimental findings using diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL retrieval methods across all tasks by up to 45%.
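下面给出检索端 MMR 逻辑的一个极简草图:以 Morgan 指纹的 Tanimoto 相似度衡量候选示例与查询分子的相关性,同时惩罚其与已选示例的冗余度。此处仅覆盖摘要提到的“局部特征相似 + MMR 多样性”部分,GNN 与文本描述的对齐未包含;代码假设已安装 RDKit。

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles: str):
    # 半径为 2 的 Morgan 指纹(2048 位)
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

def mmr_select(query: str, pool: list[str], k: int = 2, lam: float = 0.7) -> list[str]:
    q = fingerprint(query)
    fps = {s: fingerprint(s) for s in pool}
    selected: list[str] = []
    while len(selected) < min(k, len(pool)):
        def score(s: str) -> float:
            rel = DataStructs.TanimotoSimilarity(q, fps[s])        # 与查询的相关性
            div = max((DataStructs.TanimotoSimilarity(fps[s], fps[t])
                       for t in selected), default=0.0)            # 与已选的冗余度
            return lam * rel - (1 - lam) * div                     # MMR 打分
        best = max((s for s in pool if s not in selected), key=score)
        selected.append(best)
    return selected

# 查询乙醇,从候选池中选出相关且多样的 ICL 示例
print(mmr_select("CCO", ["CCN", "CCO", "c1ccccc1", "CCCO"]))
```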
zh
[NLP-121] Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data
【速读】: 该论文旨在解决大型语言模型(LLMs)在进一步扩展过程中因依赖大量人工标注数据而受到的限制。为克服这一挑战,论文提出了一种名为动态噪声偏好优化(Dynamic Noise Preference Optimization, DNPO)的方法。DNPO的关键在于采用动态样本标记机制构建训练所需的偏好对,并在偏好优化过程中引入可控制且可训练的噪声,从而有效防止性能停滞并实现持续改进。
链接: https://arxiv.org/abs/2502.05400
作者: Haoyan Yang,Ting Hua,Shangqian Gao,Binfeng Xu,Zheng Tang,Jie Xu,Hongxia Jin,Vijay Srinivasan
机构: Samsung Research America (三星美国研究院); New York University (纽约大学); Florida State University (佛罗里达州立大学); University of Florida (佛罗里达大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Although LLMs have achieved significant success, their reliance on large volumes of human-annotated data has limited their potential for further scaling. In this situation, utilizing self-generated synthetic data has become crucial for fine-tuning LLMs without extensive human annotation. However, current methods often fail to ensure consistent improvements across iterations, with performance stagnating after only minimal updates. To overcome these challenges, we introduce Dynamic Noise Preference Optimization (DNPO). DNPO employs a dynamic sample labeling mechanism to construct preference pairs for training and introduces controlled, trainable noise into the preference optimization process. Our approach effectively prevents stagnation and enables continuous improvement. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6% across multiple benchmarks. Additionally, DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations. This highlights its effectiveness in enhancing model performance through iterative refinement.
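下面是一个极简草图,示意“在 DPO 风格的偏好 logit 差上注入可训练的受控噪声”这一思路:噪声的均值与方差作为参数随训练一同更新。该损失形式是依据摘要的演示用假设,并非 DNPO 的官方定义。

```python
import torch
import torch.nn.functional as F

# 在偏好优化的 margin 上加入可训练噪声(均值与对数标准差均为参数)
class NoisyPreferenceLoss(torch.nn.Module):
    def __init__(self, beta: float = 0.1):
        super().__init__()
        self.beta = beta
        self.noise_mu = torch.nn.Parameter(torch.zeros(1))
        self.noise_log_std = torch.nn.Parameter(torch.zeros(1))

    def forward(self, logp_chosen, logp_rejected, ref_chosen, ref_rejected):
        # DPO 风格的偏好 margin:策略相对参考模型的对数几率差
        margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
        noise = self.noise_mu + self.noise_log_std.exp() * torch.randn_like(margin)
        return -F.logsigmoid(self.beta * (margin + noise)).mean()

loss_fn = NoisyPreferenceLoss()
loss = loss_fn(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
loss.backward()   # 噪声参数可与策略参数一同被优化器更新
```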
[NLP-122] Hierarchical Lexical Manifold Projection in Large Language Models: A Novel Mechanism for Multi-Scale Semantic Representation
Quick Read: This paper tackles the trade-off between preserving multi-scale semantic relationships and maintaining computational efficiency in conventional word embeddings. The key is to integrate structured hierarchical embeddings into transformer-based architectures: a projection mechanism maps tokens onto a structured manifold, improving lexical alignment and keeping hierarchical embeddings coherent across abstraction levels. The approach improves the adaptability of word representations and shows strong performance with low computational overhead across a range of linguistic tasks.
Link: https://arxiv.org/abs/2502.05395
Authors: Natasha Martus, Sebastian Crowther, Maxwell Dorrington, Jonathan Applethwaite, Edgar Tillinghurst, Quentin Birkenshaw, Lukas Petrov, Constance Willoughby
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The integration of structured hierarchical embeddings into transformer-based architectures introduces a refined approach to lexical representation, ensuring that multi-scale semantic relationships are preserved without compromising computational efficiency. A projection mechanism that maps tokens onto a structured manifold provides improved lexical alignment, enhancing the adaptability of word representations across diverse linguistic tasks. The structured encoding framework ensures that hierarchical embeddings maintain coherence across varying abstraction levels, allowing for stable transitions between localized syntactic features and global semantic structures. Experimental evaluations indicate that hierarchical embeddings consistently outperform conventional token representations, improving accuracy in linguistic benchmarks while maintaining lower computational overhead. Comparative analysis across multiple domains highlights the ability of hierarchical embeddings to retain contextual consistency, particularly in specialized language applications where structured lexical alignment is essential. Statistical assessments further demonstrate that hierarchical embeddings exhibit enhanced robustness under perturbation conditions, ensuring that linguistic structures remain stable across adversarial text modifications. The integration of hierarchical projections with transformer attention mechanisms enables improved contextual adaptation, ensuring that token representations are dynamically adjusted based on varying linguistic distributions. The refined hierarchical organization of embeddings provides greater interpretability in lexical modeling, facilitating enhanced generalization capabilities across diverse text processing tasks.
[NLP-123] Learning Task Representations from In-Context Learning ICML2024
Quick Read: This paper addresses the limited understanding of how large language models (LLMs) internally encode and generalize task information during in-context learning (ICL). The key contribution is an automated method that computes a task vector as a weighted sum of attention heads in the transformer architecture, with the weights optimized via gradient descent, encoding task information more effectively. The method successfully extracts task-specific information from in-context demonstrations and performs well on both text and regression tasks, demonstrating cross-modal generalization.
Link: https://arxiv.org/abs/2502.05390
Authors: Baturay Saglam, Zhuoran Yang, Dionysis Kalogerias, Amin Karbasi
Affiliations: Yale University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Appeared in ICML 2024 Workshop on In-Context Learning
Abstract:Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL), where models adapt to new tasks through example-based prompts without requiring parameter updates. However, understanding how tasks are internally encoded and generalized remains a challenge. To address some of the empirical and technical gaps in the literature, we introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads within the transformer architecture. This approach computes a single task vector as a weighted sum of attention heads, with the weights optimized causally via gradient descent. Our findings show that existing methods fail to generalize effectively to modalities beyond text. In response, we also design a benchmark to evaluate whether a task vector can preserve task fidelity in functional regression tasks. The proposed method successfully extracts task-specific information from in-context demonstrations and excels in both text and regression tasks, demonstrating its generalizability across modalities. Moreover, ablation studies show that our method’s effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
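A minimal sketch of the "task vector as a weighted sum of attention heads" idea follows. The head summary vectors, how they are extracted, and the softmax parameterization of the weights are all assumptions made for illustration; the paper optimizes the weights causally with gradient descent, which the single dummy update step below only gestures at.

```python
import torch

# Hypothetical shapes: H attention heads, each contributing a d-dimensional
# summary vector for a given ICL prompt (extraction is model-specific and
# not reproduced here).
H, d = 32, 128
head_outputs = torch.randn(H, d)

# Learnable per-head weights; the task vector is their weighted sum.
w = torch.nn.Parameter(torch.zeros(H))
task_vector = torch.softmax(w, dim=0) @ head_outputs  # shape (d,)

# One illustrative gradient step against a dummy target task loss.
target = torch.randn(d)
loss = torch.nn.functional.mse_loss(task_vector, target)
loss.backward()
with torch.no_grad():
    w -= 0.1 * w.grad
print(task_vector.shape, float(loss))
```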
[NLP-124] The Role of Prosody in Spoken Question Answering NAACL2025
Quick Read: This paper addresses the heavy text-centric bias in spoken question answering (SQA) research: most existing models rely on automatic transcriptions of speech and ignore the prosodic information carried by the speech signal. The key contribution is to isolate prosodic and lexical information on the SLUE-SQA-5 dataset (which consists of natural speech) and show that models trained on prosodic information alone can perform reasonably well. However, when lexical information is available, models tend to rely predominantly on it. The findings suggest that while prosodic cues provide valuable supplementary information, more effective integration methods are needed for prosody to contribute more significantly alongside lexical features.
Link: https://arxiv.org/abs/2502.05389
Authors: Jie Chi, Maureen de Seyssel, Natalie Schluter
Affiliations: University of Edinburgh; Apple MLR
Categories: Computation and Language (cs.CL)
Comments: accepted to NAACL 2025 Findings
Abstract:Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text, which is then subsequently synthesized into speech, and most models typically rely on automatic transcriptions of speech. This is to the detriment of prosody–additional information carried by the speech signal beyond the phonetics of the words themselves and difficult to recover from text alone. In this work, we investigate the role of prosody in Spoken Question Answering. By isolating prosodic and lexical information on the SLUE-SQA-5 dataset, which consists of natural speech, we demonstrate that models trained on prosodic information alone can perform reasonably well by utilizing prosodic cues. However, we find that when lexical information is available, models tend to predominantly rely on it. Our findings suggest that while prosodic cues provide valuable supplementary information, more effective integration methods are required to ensure prosody contributes more significantly alongside lexical features.
[NLP-125] Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond
Quick Read: This paper addresses the vulnerability of unlearned large language models (LLMs) to "relearning" attacks: after unlearning is applied in response to data regulations or safety and ethical concerns, a small number of forget-set data points can restore the removed information. The key contribution is the first connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust-optimization framework, analogous to adversarial training against adversarial attacks. The analysis shows that smoothness optimization plays a central role in mitigating relearning attacks, and the paper further explores diverse smoothing strategies to enhance unlearning robustness. Experiments show that SAM and other smoothness-optimization methods significantly improve the resistance of LLM unlearning to relearning attacks and also help defend against input-level jailbreaking attacks.
Link: https://arxiv.org/abs/2502.05374
Authors: Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu
Affiliations: Michigan State University; Amazon; University of Minnesota; IBM Research
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:The LLM unlearning technique has recently been introduced to comply with data regulations and address the safety and ethical concerns of LLMs by removing the undesired data-model influence. However, state-of-the-art unlearning methods face a critical vulnerability: they are susceptible to "relearning" the removed information from a small number of forget data points, known as relearning attacks. In this paper, we systematically investigate how to make unlearned models robust against such attacks. For the first time, we establish a connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust optimization framework, in an analogy to adversarial training designed to defend against adversarial attacks. Our analysis for SAM reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks. Thus, we further explore diverse smoothing strategies to enhance unlearning robustness. Extensive experiments on benchmark datasets, including WMDP and MUSE, demonstrate that SAM and other smoothness optimization approaches consistently improve the resistance of LLM unlearning to relearning attacks. Notably, smoothness-enhanced unlearning also helps defend against (input-level) jailbreaking attacks, broadening our proposal's impact in robustifying LLM unlearning. Codes are available at this https URL.
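For readers unfamiliar with SAM, the following is a generic sharpness-aware minimization step, not the paper's unlearning pipeline: the model, loss, and `rho` radius are placeholders, and the paper's contribution is plugging an unlearning objective into this kind of smoothness-seeking update.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """One SAM step: perturb weights toward the worst-case direction
    within an L2 ball of radius rho, then take the gradient there."""
    # 1) gradient at the current weights
    loss = loss_fn(model, batch)
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    # 2) ascend to the adversarial weight perturbation
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)
    optimizer.zero_grad()
    # 3) gradient at the perturbed point, undo the perturbation, then step
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)
    optimizer.step()
    optimizer.zero_grad()
    return float(loss)

# Toy usage on a linear model with an MSE "unlearning" placeholder loss.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batch = (torch.randn(16, 4), torch.randn(16, 1))
mse = lambda m, b: torch.nn.functional.mse_loss(m(b[0]), b[1])
print(sam_step(model, mse, batch, opt))
```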
[NLP-126] Probabilistic Subspace Manifolds for Contextual Inference in Large Language Models
Quick Read: This paper addresses the representational rigidity and limited semantic granularity of conventional embeddings, along with their poor adaptability across domains. The key idea is to represent token embeddings as probability distributions over learned manifolds, which reduces rigidity and enhances semantic refinement. Integrating probabilistic subspaces into attention mechanisms enables more adaptive contextual weighting, capturing latent dependencies and improving robustness to adversarial modifications. The probabilistic representation adapts better when handling ambiguous or context-dependent linguistic constructs, reducing the need for extensive retraining when shifting across linguistic domains.
Link: https://arxiv.org/abs/2502.05346
Authors: Christopher Nightingale, Dominic Lavington, Jonathan Thistlethwaite, Sebastian Penhaligon, Thomas Belinski, David Boldo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Representing token embeddings as probability distributions over learned manifolds allows for more flexible contextual inference, reducing representational rigidity while enhancing semantic granularity. Comparative evaluations demonstrate that probabilistic embeddings improve neighborhood consistency and decrease redundancy, ensuring that token relationships remain more structurally coherent across fine-tuning iterations. The integration of probabilistic subspaces within attention mechanisms facilitates more adaptive contextual weighting, enabling models to capture latent dependencies that would otherwise be obscured in conventional embeddings. Experimental results highlight increased robustness against adversarial modifications, with probabilistic embeddings preserving contextual integrity even under perturbation-based evaluation scenarios. Performance assessments indicate that probabilistic representations achieve greater adaptability in domain-specific applications, mitigating the need for extensive retraining when shifting across linguistic domains. Computational trade-offs remain within operationally feasible limits, with marginal increases in inference latency balanced against the benefits of enhanced representation stability and contextual expressiveness. The capacity to encode structured uncertainty provides advantages in generative modeling tasks, particularly where maintaining coherence across extended sequences requires a representation framework capable of handling ambiguous or context-dependent linguistic constructs.
[NLP-127] Fine-Tuned LLMs are "Time Capsules" for Tracking Societal Bias Through Books NAACL2025
Quick Read: This paper addresses the concern that large language models (LLMs) may learn and perpetuate the societal biases present in books during training. The key contribution is a new method that traces and quantifies these biases by fine-tuning LLMs and probing them with targeted prompts, together with BookPAGE, a corpus of 593 fictional books spanning seven decades (1950-2019) used to track how biases evolve over time.
Link: https://arxiv.org/abs/2502.05331
Authors: Sangmitra Madhusudan, Robert Morabito, Skye Reid, Nikta Gohari Sadr, Ali Emami
Affiliations: Brock University, St. Catharines, Canada
Categories: Computation and Language (cs.CL)
Comments: 9 pages (excluding references), accepted to NAACL 2025
Abstract:Books, while often rich in cultural insights, can also mirror societal biases of their eras - biases that Large Language Models (LLMs) may learn and perpetuate during training. We introduce a novel method to trace and quantify these biases using fine-tuned LLMs. We develop BookPAGE, a corpus comprising 593 fictional books across seven decades (1950-2019), to track bias evolution. By fine-tuning LLMs on books from each decade and using targeted prompts, we examine shifts in biases related to gender, sexual orientation, race, and religion. Our findings indicate that LLMs trained on decade-specific books manifest biases reflective of their times, with both gradual trends and notable shifts. For example, model responses showed a progressive increase in the portrayal of women in leadership roles (from 8% to 22%) from the 1950s to 2010s, with a significant uptick in the 1990s (from 4% to 12%), possibly aligning with third-wave feminism. Same-sex relationship references increased markedly from the 1980s to 2000s (from 0% to 10%), mirroring growing LGBTQ+ visibility. Concerningly, negative portrayals of Islam rose sharply in the 2000s (26% to 38%), likely reflecting post-9/11 sentiments. Importantly, we demonstrate that these biases stem mainly from the books’ content and not the models’ architecture or initial training. Our study offers a new perspective on societal bias trends by bridging AI, literary studies, and social science research.
[NLP-128] Towards the Development of Balanced Synthetic Data for Correcting Grammatical Errors in Arabic: An Approach Based on Error Tagging Model and Synthetic Data Generating Model
Quick Read: This paper addresses the lack of diversity in synthetic data for neural grammatical error correction (GEC) in low-resource Arabic. The key is a system consisting of an error tagging model and a synthetic data generation model. The error tagging model uses DeBERTav3 to classify correct sentences into multiple error types as a multi-label classification task guided by the Arabic Error Type Annotation tool (ARETA), with each sentence assigned among 26 error tags. The synthetic data generation model is back-translation-based: it generates incorrect sentences by prepending error tags to the correct sentence. With this pipeline the authors generate 30,219,310 synthetic sentence pairs and, after training on the synthetic data, reach a new state-of-the-art F1 score of 79.36% on the QALB-14 test set.
Link: https://arxiv.org/abs/2502.05312
Authors: Ahlam Alrehili, Areej Alhothali
Affiliations: King Abdulaziz University; Saudi Electronic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 3 figures
Abstract:Synthetic data generation is widely recognized as a way to enhance the quality of neural grammatical error correction (GEC) systems. However, current approaches often lack diversity or are too simplistic to generate the wide range of grammatical errors made by humans, especially for low-resource languages such as Arabic. In this paper, we will develop the error tagging model and the synthetic data generation model to create a large synthetic dataset in Arabic for grammatical error correction. In the error tagging model, the correct sentence is categorized into multiple error types by using the DeBERTav3 model. Arabic Error Type Annotation tool (ARETA) is used to guide multi-label classification tasks in an error tagging model in which each sentence is classified into 26 error tags. The synthetic data generation model is a back-translation-based model that generates incorrect sentences by appending error tags before the correct sentence that was generated from the error tagging model using the ARAT5 model. In the QALB-14 and QALB-15 Test sets, the error tagging model achieved 94.42% F1, which is state-of-the-art in identifying error tags in clean sentences. As a result of our syntactic data training in grammatical error correction, we achieved a new state-of-the-art result of F1-Score: 79.36% in the QALB-14 Test set. We generate 30,219,310 synthetic sentence pairs by using a synthetic data generation model.
[NLP-129] Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
Quick Read: This paper studies how smaller language models differ in their propensity to generate harmful content, and how well larger LLMs can annotate that harmfulness. The key is to elicit harmful content of various types from three small LLMs, collect human rankings of their outputs, and then evaluate three state-of-the-art large LLMs on annotating the harmfulness of these responses against the human judgments. The study finds that the small models do differ in harmfulness and that large LLMs show only low to moderate agreement with human annotators, underlining the need for further work on harm mitigation in language models.
Link: https://arxiv.org/abs/2502.05291
Authors: Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, Rebecca J. Passonneau
Affiliations: Penn State University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have become ubiquitous, thus it is important to understand their risks and limitations. Smaller LLMs can be deployed where compute resources are constrained, such as edge devices, but with different propensity to generate harmful output. Mitigation of LLM harm typically depends on annotating the harmfulness of LLM output, which is expensive to collect from humans. This work studies two questions: How do smaller LLMs rank regarding generation of harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. Then, we evaluate three state-of-the-art large LLMs on their ability to annotate the harmfulness of these responses. We find that the smaller models differ with respect to harmfulness. We also find that large LLMs show low to moderate agreement with humans. These findings underline the need for further work on harm mitigation in LLMs.
[NLP-130] LLMs Can Teach Themselves to Better Predict the Future
Quick Read: This paper aims to improve the forecasting ability of large language models (LLMs) without relying on human-curated reasoning samples. The key is an outcome-driven fine-tuning framework: model self-play generates pairs of diverse reasoning trajectories and probabilistic forecasts, which are ranked by their distance to the actual outcomes, and the model is then fine-tuned with Direct Preference Optimization (DPO). Experiments show the approach improves the prediction accuracy of Phi-4 14B and DeepSeek-R1 14B by 7-10%, bringing them on par with the forecasting capabilities of much larger frontier models such as GPT-4o.
Link: https://arxiv.org/abs/2502.05253
Authors: Benjamin Turtel, Danny Franklin, Philipp Schoenegger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present an outcome-driven fine-tuning framework that enhances the forecasting capabilities of large language models (LLMs) without relying on human-curated reasoning samples. Our method leverages model self-play to generate pairs of diverse reasoning trajectories and probabilistic forecasts for a set of diverse questions that resolve after the models’ knowledge cutoff date. We then rank pairs of these reasoning traces by their distance to the actual outcomes before fine-tuning the model via Direct Preference Optimization (DPO). On a separate test set, our approach increases prediction accuracy of Phi-4 14B and DeepSeek-R1 14B by between 7–10% over a base model and a DPO fine-tuned control model with randomized labels, bringing them on par with forecasting capabilities of much larger frontier models like GPT-4o.
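The ranking step can be sketched simply: given several self-play forecasts for a resolved question, pair them up and let the forecast closer to the realized outcome become the DPO "chosen" side. The near-tie threshold below is an assumed heuristic, not a value from the paper.

```python
import itertools

def build_preference_pairs(samples, outcome):
    """samples: list of (reasoning_text, prob_forecast) from self-play for
    one question; outcome: 1.0 or 0.0 once the question resolves.
    Pairs are ranked by distance of the forecast to the actual outcome."""
    pairs = []
    for a, b in itertools.combinations(samples, 2):
        da, db = abs(a[1] - outcome), abs(b[1] - outcome)
        if abs(da - db) < 0.05:  # skip near-ties (assumed heuristic)
            continue
        chosen, rejected = (a, b) if da < db else (b, a)
        pairs.append({"chosen": chosen[0], "rejected": rejected[0]})
    return pairs

samples = [("trace A ...", 0.8), ("trace B ...", 0.3), ("trace C ...", 0.55)]
for p in build_preference_pairs(samples, outcome=1.0):
    print(p)
```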
[NLP-131] GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
Quick Read: This paper addresses the performance degradation of long-context large language models (LLMs) on complex reasoning tasks. The key contribution is GSM-Infinite, a benchmark built on a grade-school math problem generator that can produce arithmetic problems of infinite difficulty and context length under fine-grained control. A comprehensive evaluation of existing LLMs on GSM-Infinite reveals a consistent sigmoid decline in reasoning performance as complexity increases, and a systematic inference-scaling trend in which exponentially increasing inference computation yields only linear performance gains. These findings underscore the fundamental limitations of current long-context LLMs and the key challenges in scaling reasoning over long, complex contexts.
Link: https://arxiv.org/abs/2502.05252
Authors: Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen
Affiliations: Carnegie Mellon University; Meta AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Long-context large language models (LLMs) have recently shown strong performance in information retrieval and long-document QA. However, to tackle the most challenging intellectual problems, LLMs must reason effectively in long and complex contexts (e.g., frontier mathematical research). Studying how LLMs handle increasing reasoning complexity and context length is essential, yet existing benchmarks lack a solid basis for quantitative evaluation. Inspired by the abstraction of GSM-8K problems as computational graphs, and the ability to introduce noise by adding unnecessary nodes and edges, we develop a grade school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control. Using our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate existing LLMs. We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend: exponentially increasing inference computation yields only linear performance gains. These findings underscore the fundamental limitations of current long-context LLMs and the key challenges in scaling reasoning capabilities. Our GSM-Infinite benchmark provides a scalable and controllable testbed for systematically studying and advancing LLM reasoning in long and complex contexts.
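A toy version of the computational-graph construction reads as follows; it only illustrates how chained "core" nodes and irrelevant noise facts let difficulty and context length grow independently, and it is far simpler than the paper's actual generator.

```python
import random

def make_problem(n_core=4, n_noise=3, seed=0):
    """Generate a toy arithmetic problem as a computational graph:
    each core node derives from the previous one; noise nodes are
    never referenced by the final question. Illustrative only."""
    rng = random.Random(seed)
    values, lines = {"x0": rng.randint(1, 9)}, []
    lines.append(f"x0 = {values['x0']}")
    for i in range(1, n_core):
        src, add = f"x{i-1}", rng.randint(1, 9)
        values[f"x{i}"] = values[src] + add
        lines.append(f"x{i} = {src} + {add}")
    for j in range(n_noise):  # irrelevant facts inflate context length
        lines.append(f"y{j} = {rng.randint(1, 9)} * {rng.randint(1, 9)}")
    rng.shuffle(lines)
    question = f"What is x{n_core - 1}?"
    return "\n".join(lines), question, values[f"x{n_core - 1}"]

ctx, q, answer = make_problem()
print(ctx, q, "->", answer)
```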
[NLP-132] Evaluating Personality Traits in Large Language Models: Insights from Psychological Questionnaires
Quick Read: This paper investigates whether large language models (LLMs) exhibit human-like personality traits. To this end, psychological assessment tools are applied to LLMs in diverse scenarios, using established trait-based questionnaires such as the Big Five Inventory and accounting for possible training-data contamination, to analyze variability and dominance across the five core personality dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. The key lies in systematically profiling LLM personalities with psychometric methods, revealing that LLMs exhibit unique dominant traits, varying characteristics, and distinct personality profiles even within the same family of models.
Link: https://arxiv.org/abs/2502.05248
Authors: Pranav Bhandari, Usman Naseem, Amitava Datta, Nicolas Fay, Mehwish Nasim
Affiliations: School of Physics, Mathematics and Computing, University of Western Australia, Australia; School of Computing, Macquarie University, Australia; School of Psychological Sciences, University of Western Australia, Australia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted for publication at TheWebConf 2025
Abstract:Psychological assessment tools have long helped humans understand behavioural patterns. While Large Language Models (LLMs) can generate content comparable to that of humans, we explore whether they exhibit personality traits. To this end, this work applies psychological tools to LLMs in diverse scenarios to generate personality profiles. Using established trait-based questionnaires such as the Big Five Inventory and by addressing the possibility of training data contamination, we examine the dimensional variability and dominance of LLMs across five core personality dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. Our findings reveal that LLMs exhibit unique dominant traits, varying characteristics, and distinct personality profiles even within the same family of models.
[NLP-133] SEER: Self-Explainability Enhancement of Large Language Models' Representations
Quick Read: This paper addresses the difficulty of explaining the internal representations of large language models (LLMs) in order to understand their underlying inference logic and improve their reliability in applications. The key contribution is SEER, a self-explaining method that enhances interpretability by aggregating same-concept representations and disentangling different concepts in the representation space. SEER thereby provides faithful explanations synchronously with the LLM's output, yields consistent gains in both explainability and performance on trustworthiness-related tasks (e.g., safety-risk classification and detoxification), and its improvement to LLM generalization is analyzed theoretically through optimal transport theory.
Link: https://arxiv.org/abs/2502.05242
Authors: Guanxu Chen, Dongrui Liu, Tao Luo, Jing Shao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures, 10 tables
Abstract:Explaining the hidden representations of Large Language Models (LLMs) is a perspective to understand LLMs' underlying inference logic and improve their reliability in application scenarios. However, previous methods introduce external "black-box" modules to explain "black-box" LLMs, increasing the potential uncertainty and failing to provide faithful explanations. In this paper, we propose a self-explaining method SEER, enhancing LLMs' explainability by aggregating the same concept and disentangling the different concepts in the representation space. In this way, SEER provides faithful explanations carried by representations synchronously with the LLMs' output. Additionally, we showcase the applications of SEER on trustworthiness-related tasks (e.g., the safety risks classification and detoxification tasks), where self-explained LLMs achieve consistent improvement in explainability and performance. More crucially, we theoretically analyze the improvement of SEER on LLMs' generalization ability through optimal transport theory.
[NLP-134] Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics
Quick Read: This paper addresses hallucination and omission in the automated construction of knowledge graphs from unstructured text with large language models. The key is an enhanced evaluation framework that incorporates BERTScore as a graph-similarity measure, with a practical threshold of 95% for graph matching. Experiments on the Mistral model compare its original and fine-tuned versions in zero-shot and few-shot settings, showing that the fine-tuned model significantly improves construction accuracy while reducing hallucination and omission, although it performs worse on generalization tasks on the KELM-sub dataset.
Link: https://arxiv.org/abs/2502.05239
Authors: Hussam Ghanem (ICB, UB), Christophe Cruz (ICB, UB)
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advancements in large language models have demonstrated significant potential in the automated construction of knowledge graphs from unstructured text. This paper builds upon our previous work [16], which evaluated various models using metrics like precision, recall, F1 score, triple matching, and graph matching, and introduces a refined approach to address the critical issues of hallucination and omission. We propose an enhanced evaluation framework incorporating BERTScore for graph similarity, setting a practical threshold of 95% for graph matching. Our experiments focus on the Mistral model, comparing its original and fine-tuned versions in zero-shot and few-shot settings. We further extend our experiments using examples from the KELM-sub training dataset, illustrating that the fine-tuned model significantly improves knowledge graph construction accuracy while reducing the exact hallucination and omission. However, our findings also reveal that the fine-tuned models perform worse in generalization tasks on the KELM-sub dataset. This study underscores the importance of comprehensive evaluation metrics in advancing the state-of-the-art in knowledge graph construction from textual data.
[NLP-135] Optimizing Temperature for Language Models with Multi-Sample Inference
Quick Read: This paper addresses the problem of automatically identifying the (near-)optimal temperature for different LLMs under multi-sample aggregation strategies (such as majority voting and best-of-N sampling), without relying on task-specific validation data. The key is a novel entropy-based metric for automated temperature optimization that consistently outperforms fixed-temperature baselines, complemented by a stochastic process model that improves interpretability of the relationship between temperature and performance.
Link: https://arxiv.org/abs/2502.05234
Authors: Weihua Du, Yiming Yang, Sean Welleck
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages. Code available at this https URL
Abstract:Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature’s role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.
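One plausible reading of an entropy-based temperature criterion is sketched below: sample repeatedly at each candidate temperature and score the entropy of the resulting answer distribution. The selection rule (minimum entropy), the candidate grid, and the fake model are assumptions for illustration; the paper's actual metric may differ.

```python
import math
import random
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy of the empirical answer distribution."""
    counts, n = Counter(answers), len(answers)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def pick_temperature(sample_fn, temps=(0.2, 0.5, 0.8, 1.1), n=16):
    """Score each candidate temperature by answer entropy; pick the lowest."""
    scored = {t: answer_entropy([sample_fn(t) for _ in range(n)]) for t in temps}
    return min(scored, key=scored.get), scored

# Toy stand-in for an LLM: higher temperature -> noisier answers.
def fake_llm(t, rng=random.Random(0)):
    return 42 if rng.random() > t / 2 else rng.randint(0, 99)

best, table = pick_temperature(fake_llm)
print(best, table)
```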
[NLP-136] Efficient Knowledge Feeding to Language Models: A Novel Integrated Encoder-Decoder Architecture
Quick Read: This paper addresses the limitations of language models (LLMs) at prediction time caused by gaps in training data and knowledge. The key is the notion of in-context vectors (ICV): latent embeddings of the LLM are used to create a vector capturing essential task information, which then shifts the LLM's latent states to enhance the generation process without adding demonstration examples to the prompt. This integrates information directly into the model, improving how it processes that information while overcoming limitations of existing retrieval-augmented generation (RAG) models, and achieves competitive performance at significantly lower computational cost and memory requirements.
Link: https://arxiv.org/abs/2502.05233
Authors: S Santosh Kumar, Rishi Gottimukkala, Supriya Devidutta, Karthikeyan S
Affiliations: Indian Institute of Information Technology, Bangalore; Jawaharlal Technological University; North Carolina School of Science and Mathematics
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Submitted to ACM TIST journal: under revision stage, 8 pages, 2 figures
Abstract:This paper introduces a novel approach to efficiently feeding knowledge to language models (LLMs) during prediction by integrating retrieval and generation processes within a unified framework. While the Retrieval-Augmented Generation (RAG) model addresses gaps in LLMs’ training data and knowledge limits, it is hindered by token limit restrictions and dependency on the retrieval system’s accuracy. Our proposed architecture incorporates in-context vectors (ICV) to overcome these challenges. ICV recasts in-context learning by using latent embeddings of LLMs to create a vector that captures essential task information. This vector is then used to shift the latent states of the LLM, enhancing the generation process without adding demonstration examples to the prompt. ICV directly integrates information into the model, enabling it to process this information more effectively. Our extensive experimental evaluation demonstrates that ICV outperforms standard in-context learning and fine-tuning across question-answering, information retrieval, and other tasks. This approach mitigates the limitations of current RAG models and offers a more robust solution for handling extensive and diverse datasets. Despite leveraging a fraction of the parameters, our ICV-enhanced model achieves competitive performance against models like LLaMA-3, Gemma, and Phi-3, significantly reducing computational costs and memory requirements. ICV reduces prompt length, is easy to control, surpasses token limitations, and is computationally efficient compared to fine-tuning.
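The injection step of an in-context vector can be shown in a few lines; how the ICV itself is extracted from demonstrations (e.g., from latent embeddings with and without the task examples) is paper-specific and omitted here. `alpha` is an assumed strength knob.

```python
import torch

def apply_icv(hidden_states, icv, alpha=1.0):
    """Shift a layer's latent states by an in-context vector (ICV).
    hidden_states: (batch, seq, d); icv: (d,). Only the injection step
    is shown; ICV extraction is model- and paper-specific."""
    return hidden_states + alpha * icv.view(1, 1, -1)

# Toy usage: one layer's activations for a batch of 2, length 5, dim 16.
h = torch.randn(2, 5, 16)
icv = torch.randn(16)
print(apply_icv(h, icv, alpha=0.5).shape)
```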
[NLP-137] Robotouille: An Asynchronous Planning Benchmark for LLM Agents
Quick Read: This paper addresses the inadequacy of large language model (LLM) agents at long-horizon asynchronous planning: current benchmarks focus on short-horizon tasks and do not evaluate asynchronous planning ability. The key contribution is Robotouille, a challenging benchmark environment whose synchronous and asynchronous datasets capture increasingly complex planning challenges beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Results show ReAct (gpt4-o) reaches 47% on synchronous tasks but only 11% on asynchronous tasks, indicating that LLM agents need to better incorporate long-horizon feedback and self-audit their reasoning during task execution.
Link: https://arxiv.org/abs/2502.05227
Authors: Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages (not including references or appendix); 41 figures (7 main paper, 34 appendix); (v1) preprint
Abstract:Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents’ ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt4-o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution. Code is available at this https URL.
[NLP-138] KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Quick Read: This paper addresses the challenge that jailbreak attacks pose to large language models (LLMs): specific prompts bypass safeguards and cause the LLM to generate harmful, inappropriate, or misaligned content. The key contribution is the Knowledge-Distilled Attacker (KDA), which distills the knowledge of an ensemble of state-of-the-art attackers into a single open-source model, fine-tuned to automatically generate coherent and diverse attack prompts without meticulous system-prompt engineering. Compared with existing attack strategies, KDA achieves higher attack success rates and better cost-time efficiency against multiple SOTA open-source and commercial black-box LLMs.
Link: https://arxiv.org/abs/2502.05223
Authors: Buyun Liang, Kwan Ho Ryan Chan, Darshan Thaker, Jinqi Luo, René Vidal
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of prompts generated by baseline methods and KDA, identifying diverse and ensemble attacks as key factors behind KDA’s effectiveness and efficiency.
[NLP-139] Safety at Scale: A Comprehensive Survey of Large Model Safety
Quick Read: This survey systematically reviews safety research on large models, covering vision foundation models (VFMs), large language models (LLMs), vision-language pre-training (VLP) models, vision-language models (VLMs), diffusion models (DMs), and large-model-based agents. Its key contribution is a comprehensive taxonomy of safety threats, spanning adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt-injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. The survey also reviews the defense strategies proposed for each attack type, summarizes commonly used datasets and benchmarks, and highlights open challenges, including the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices, while calling for collective effort from the research community and international collaboration.
Link: https://arxiv.org/abs/2502.05206
Authors: Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Henghui Ding, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
Affiliations: Fudan University; The University of Melbourne; Singapore Management University; Hong Kong University of Science and Technology; City University of Hong Kong; ByteDance; University of Auckland; RIKEN; University of Sydney; Griffith University; University of California, Santa Cruz; Tsinghua University; Virginia Tech; CISPA Helmholtz Center for Information Security; University of Massachusetts Amherst; Purdue University; Duke University; University of Wisconsin - Madison; The University of Tokyo; Nanyang Technological University; Shanghai Jiao Tong University; University of Oxford; Chinese University of Hong Kong, Shenzhen; Sea AI Lab
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 47 pages, 3 figures, 11 tables
Abstract:The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-based Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.
[NLP-140] Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Quick Read: This paper addresses the acceleration of large language model (LLM) inference, a critical challenge in generative AI. Existing speculative decoding (SD) methods deliver substantial efficiency gains but require the drafter and target model to share the same vocabulary, limiting the pool of possible drafters and often requiring a drafter to be trained from scratch. The paper presents three new SD methods that remove this shared-vocabulary constraint and work with off-the-shelf models without additional training or modification. Crucially, all three preserve the target distribution (i.e., they are lossless) while achieving significant speedups over standard autoregressive decoding.
Link: https://arxiv.org/abs/2502.05202
Authors: Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Gaurav Jain, Roy Schwartz, Moshe Wasserblat, David Harel
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms achieve significant speedups over standard autoregressive decoding. By enabling any off-the-shelf model to serve as drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
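For context, this is the classic lossless accept/resample rule that same-vocabulary speculative decoding relies on; the paper's algorithms generalize beyond it to heterogeneous vocabularies, which this sketch does not attempt.

```python
import random

def speculative_accept(p_target, p_draft, token, rng=random):
    """Accept a drafted token with prob min(1, p_target/p_draft); on
    rejection the caller resamples from the normalized residual
    max(0, p_target - p_draft), which keeps the target distribution exact."""
    return rng.random() < min(1.0, p_target[token] / max(p_draft[token], 1e-12))

def residual_distribution(p_target, p_draft):
    res = [max(0.0, pt - pd) for pt, pd in zip(p_target, p_draft)]
    z = sum(res) or 1.0
    return [r / z for r in res]

# Toy distributions over a 4-token vocabulary.
pt, pd = [0.5, 0.2, 0.2, 0.1], [0.4, 0.4, 0.1, 0.1]
print(speculative_accept(pt, pd, token=1), residual_distribution(pt, pd))
```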
[NLP-141] LLMs Provide Unstable Answers to Legal Questions
Quick Read: This paper studies the stability of large language models (LLMs) on hard legal questions. It finds that leading LLMs such as gpt-4o, claude-3.5, and gemini-1.5 are unstable, giving inconsistent answers to the identical legal question even when made as deterministic as possible by setting temperature to 0. The authors curate and release a dataset of 500 legal questions distilled from real cases, each involving two parties, with facts, competing legal arguments, and the question of which party should prevail. The key point is to expose the implications of this instability for the growing number of legal AI products, legal processes, and lawyers relying on these LLMs.
Link: https://arxiv.org/abs/2502.05196
Authors: Andrew Blair-Stanek, Benjamin Van Durme
Affiliations: University of Maryland School of Law; Johns Hopkins University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 6 pages
Abstract:An LLM is stable if it reaches the same conclusion when asked the identical question multiple times. We find leading LLMs like gpt-4o, claude-3.5, and gemini-1.5 are unstable when providing answers to hard legal questions, even when made as deterministic as possible by setting temperature to 0. We curate and release a novel dataset of 500 legal questions distilled from real cases, involving two parties, with facts, competing legal arguments, and the question of which party should prevail. When provided the exact same question, we observe that LLMs sometimes say one party should win, while other times saying the other party should win. This instability has implications for the increasing numbers of legal AI products, legal processes, and lawyers relying on these LLMs.
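Measuring this kind of instability is straightforward to sketch: ask the same question repeatedly and report agreement with the modal answer. `ask_fn` is a placeholder for a real API call (e.g., to gpt-4o at temperature 0); the toy below just simulates a model that occasionally flips sides.

```python
import random
from collections import Counter

def stability(ask_fn, question, n=10):
    """Ask the identical question n times and return the fraction of runs
    agreeing with the modal answer, plus that answer."""
    answers = [ask_fn(question) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return count / n, top

# Toy stand-in that flips between parties 20% of the time.
rng = random.Random(1)
fake = lambda q: "plaintiff" if rng.random() < 0.8 else "defendant"
print(stability(fake, "Which party should prevail?"))
```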
[NLP-142] Scalable Oversight for Superhuman AI via Recursive Self-Critiquing
Quick Read: This paper addresses the fundamental challenge that existing alignment techniques such as SFT and RLHF face in providing reliable oversight once AI capabilities surpass human proficiency on complex tasks: these methods rely on direct human assessment and become untenable when AI outputs exceed human cognitive thresholds. The key solution is to explore recursive self-critiquing, where higher-order critiques (e.g., critique of critique) offer a more tractable supervision pathway toward scalable oversight.
Link: https://arxiv.org/abs/2502.04675
Authors: Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, Debing Zhang, XingYu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques including SFT and RLHF face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become untenable when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) critique of critique can be easier than critique itself, extending the widely-accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) this difficulty relationship is recursively held, suggesting that when direct evaluation is infeasible, performing high-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. To examine these hypotheses, we perform Human-Human, Human-AI, and AI-AI experiments across multiple tasks. Our results demonstrate encouraging evidence supporting these hypotheses and suggest that recursive self-critiquing is a promising direction for scalable oversight.
[NLP-143] A Generative Framework for Bidirectional Image-Report Understanding in Chest Radiography
Quick Read: This paper addresses the challenges of applying large language models (LLMs) to medical imaging, in particular chest X-ray (CXR) processing, where precise visual-textual alignment and preservation of critical diagnostic details are required. The key is Multi-Stage Adaptive Vision-Language Tuning (MAViLT), a novel framework that introduces a clinical gradient-weighted tokenization process and a hierarchical fine-tuning strategy, enabling accurate radiology report generation, realistic text-to-CXR synthesis, and vision-based clinical question answering.
Link: https://arxiv.org/abs/2502.05926
Authors: Nicholas Evans, Stephen Baker, Miles Reed
Affiliations: Bandırma Onyedi Eylül University
Categories: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid advancements in large language models (LLMs) have unlocked their potential for multimodal tasks, where text and visual data are processed jointly. However, applying LLMs to medical imaging, particularly for chest X-rays (CXR), poses significant challenges due to the need for precise visual-textual alignment and the preservation of critical diagnostic details. In this paper, we propose Multi-Stage Adaptive Vision-Language Tuning (MAViLT), a novel framework designed to enhance multimodal reasoning and generation for CXR understanding. MAViLT incorporates a clinical gradient-weighted tokenization process and a hierarchical fine-tuning strategy, enabling it to generate accurate radiology reports, synthesize realistic CXRs from text, and answer vision-based clinical questions. We evaluate MAViLT on two benchmark datasets, MIMIC-CXR and Indiana University CXR, achieving state-of-the-art results across all tasks. Human evaluations further validate the clinical relevance and utility of MAViLT, making it a robust tool for real-world medical applications. This work demonstrates the feasibility of leveraging LLMs for multimodal medical imaging while addressing key challenges in vision-language integration.
Computer Vision
[CV-0] EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Quick Read: This paper addresses the performance gap between existing encoder-free vision-language models (VLMs) and their encoder-based counterparts. The key is to properly decompose and hierarchically associate vision and language within a unified model, reducing interference between modalities, and to design an effective training strategy for optimizing encoder-free VLMs. With these, the paper launches EVEv2.0, a thorough study of a decoder-only architecture across modalities that demonstrates superior data efficiency and strong vision-reasoning capability.
Link: https://arxiv.org/abs/2502.06788
Authors: Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
Affiliations: BAAI; DLUT; PKU; BUPT; UCAS; CASIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures
Abstract:Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: this https URL.
[CV-1] Visual Agentic AI for Spatial Reasoning with a Dynamic API
Quick Read: This paper addresses visual reasoning in 3D scenes, where existing models degrade markedly on tasks requiring multiple steps of grounding and inference. The key is an agentic program synthesis approach in which LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. By dynamically generating the API rather than relying on a static, human-defined one, the method handles a wider range of queries and improves performance on 3D spatial reasoning tasks, outperforming prior zero-shot models on a new benchmark of multi-step grounding and inference queries.
Link: https://arxiv.org/abs/2502.06787
Authors: Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL
Abstract:Visual reasoning – the ability to interpret the visual world – is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: this https URL
[CV-2] Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT
Quick Read: This paper addresses the challenges of applying Diffusion Transformer (DiT) models to video generation, particularly the inherent spatiotemporal complexity of video data. The key solution is Lumina-Video, a framework that combines a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications for efficiency and flexibility, with the motion score as an explicit condition for direct control of the dynamic degree of generated videos. Together with a progressive training scheme of increasing resolution and FPS and a multi-source training strategy mixing natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency.
Link: https://arxiv.org/abs/2502.06782
Authors: Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of generated videos’ dynamic degree. Combined with a progressive training scheme with increasingly higher resolution and FPS, and a multi-source training scheme with mixed natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Codes are released at this https URL.
[CV-3] KARST: Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission for Visual Classification ICASSP2025
Quick Read: This paper addresses the limited representational capability of parameter-efficient fine-tuning (PEFT) methods and their misalignment with pre-trained intermediate features when adapting large models. The key is Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission (KARST): its multi-kernel design extends Kronecker projections horizontally and separates adaptation matrices into multiple complementary spaces, reducing parameter dependency and creating more compact subspaces, while extra learnable re-scaling factors better align with pre-trained feature distributions for more flexible and balanced feature aggregation.
Link: https://arxiv.org/abs/2502.06779
Authors: Yue Zhu, Haiwen Diao, Shang Gao, Long Chen, Huchuan Lu
Affiliations: Dalian University of Technology; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, Accepted by ICASSP2025
Abstract:Fine-tuning pre-trained vision models for specific tasks is a common practice in computer vision. However, this process becomes more expensive as models grow larger. Recently, parameter-efficient fine-tuning (PEFT) methods have emerged as a popular solution to improve training efficiency and reduce storage needs by tuning additional low-rank modules within pre-trained backbones. Despite their advantages, they struggle with limited representation capabilities and misalignment with pre-trained intermediate features. To address these issues, we introduce an innovative Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission (KARST) for various recognition tasks. Specifically, its multi-kernel design extends Kronecker projections horizontally and separates adaptation matrices into multiple complementary spaces, reducing parameter dependency and creating more compact subspaces. Besides, it incorporates extra learnable re-scaling factors to better align with pre-trained feature distributions, allowing for more flexible and balanced feature aggregation. Extensive experiments validate that our KARST outperforms other PEFT counterparts with a negligible inference cost due to its re-parameterization characteristics. Code is publicly available at: this https URL.
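As rough intuition for Kronecker-style adaptation, the sketch below builds a weight update as a re-scaled sum of Kronecker products added to a frozen weight. The shapes, kernel count, and initialization are illustrative assumptions; KARST's actual factorization and placement follow the paper.

```python
import torch

class KroneckerAdapter(torch.nn.Module):
    """Low-footprint adapter: delta_W = scale * sum_k kron(A_k, B_k),
    added to a frozen pre-trained weight. Schematic only."""
    def __init__(self, d_out=64, d_in=64, a=8, kernels=2):
        super().__init__()
        b_out, b_in = d_out // a, d_in // a
        self.A = torch.nn.Parameter(torch.randn(kernels, a, a) * 0.01)
        self.B = torch.nn.Parameter(torch.randn(kernels, b_out, b_in) * 0.01)
        self.scale = torch.nn.Parameter(torch.ones(1))  # re-scaling factor

    def delta(self):
        return self.scale * sum(torch.kron(self.A[k], self.B[k])
                                for k in range(self.A.shape[0]))

    def forward(self, x, frozen_w):
        return x @ (frozen_w + self.delta()).T

adapter = KroneckerAdapter()
x, w0 = torch.randn(4, 64), torch.randn(64, 64)  # w0: frozen weight
print(adapter(x, w0).shape)  # (4, 64)
```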
[CV-4] History-Guided Video Diffusion
Quick Read: This paper addresses two key challenges in guiding video diffusion models with variable-length history: architectures that only support fixed-size conditioning, and the poor performance of CFG-style history dropout. The key is the Diffusion Forcing Transformer (DFoT), a video diffusion architecture with a theoretically grounded training objective that enables conditioning on a flexible number of history frames. The paper further introduces History Guidance, a family of guidance methods uniquely enabled by DFoT: its simplest form already significantly improves video generation quality and temporal consistency, while the more advanced variants further enhance motion dynamics, enable compositional generalization to out-of-distribution history, and can stably roll out extremely long videos.
Link: https://arxiv.org/abs/2502.06764
Authors: Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, Vincent Sitzmann
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL
Abstract:Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Website: this https URL
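Vanilla history guidance has the familiar CFG shape, extrapolating the history-conditioned denoiser output away from the history-free one; the sketch below shows only that combination step, with `w` an assumed guidance weight and the tensors standing in for noise predictions from the same DFoT-style model.

```python
import torch

def history_guidance(eps_hist, eps_no_hist, w=2.0):
    """CFG-form history guidance: push the prediction conditioned on
    history frames further away from the unconditioned prediction.
    The time/frequency variants from the paper are not shown."""
    return eps_no_hist + w * (eps_hist - eps_no_hist)

# Toy tensors standing in for noise predictions on a video latent.
e_h, e_0 = torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8)
print(history_guidance(e_h, e_0).shape)
```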
[CV-5] SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement ICLR2025
Quick Read: This paper aims to improve the quality of widely pre-existing coarse masks so they can serve as reliable training data for segmentation models and reduce annotation cost. The key is SAMRefiner, a universal and efficient mask-refinement method built around a noise-tolerant prompting scheme: a multi-prompt excavation strategy mines diverse input prompts for SAM (distance-guided points, context-aware elastic bounding boxes, and Gaussian-style masks) from the initial coarse masks, and these prompts cooperate to mitigate defects in them. To handle the multi-object case in semantic segmentation, a split-then-merge (STM) pipeline is introduced. These techniques significantly improve mask quality and generalization.
Link: https://arxiv.org/abs/2502.06756
Authors: Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, Kaipeng Zhang
Affiliations: State Key Lab of CAD&CG, College of Computer Science, Zhejiang University; Shanghai AI Laboratory; FABU Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025
Abstract:In this paper, we explore a principal way to enhance the quality of widely pre-existing coarse masks, enabling them to serve as reliable training data for segmentation models to reduce the annotation cost. In contrast to prior refinement techniques that are tailored to specific models or tasks in a close-world manner, we propose SAMRefiner, a universal and efficient approach by adapting SAM to the mask refinement task. The core technique of our model is the noise-tolerant prompting scheme. Specifically, we introduce a multi-prompt excavation strategy to mine diverse input prompts for SAM (i.e., distance-guided points, context-aware elastic bounding boxes, and Gaussian-style masks) from initial coarse masks. These prompts can collaborate with each other to mitigate the effect of defects in coarse masks. In particular, considering the difficulty of SAM to handle the multi-object case in semantic segmentation, we introduce a split-then-merge (STM) pipeline. Additionally, we extend our method to SAMRefiner++ by introducing an additional IoU adaption step to further boost the performance of the generic SAMRefiner on the target dataset. This step is self-boosted and requires no additional annotation. The proposed framework is versatile and can flexibly cooperate with existing segmentation methods. We evaluate our mask framework on a wide range of benchmarks under different settings, demonstrating better accuracy and efficiency. SAMRefiner holds significant potential to expedite the evolution of refinement tools. Our code is available at this https URL.
[CV-6] Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
Quick Read: This paper addresses the inability to simultaneously interpret the features of vision models and validate their causal influence through controlled experiments. The key is a unified framework based on sparse autoencoders (SAEs) that discovers human-interpretable visual features and precisely manipulates them to test hypotheses about model behavior. The approach reliably identifies and manipulates interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior.
Link: https://arxiv.org/abs/2502.06755
Authors: Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su
Affiliations: The Ohio State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Main text is 11 pages with 7 figures
Abstract:To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model editing without interpretable controls. We present a unified framework using sparse autoencoders (SAEs) that bridges this gap, allowing us to discover human-interpretable visual features and precisely manipulate them to test hypotheses about model behavior. By applying our method to state-of-the-art vision models, we reveal key differences in the semantic abstractions learned by models with different pre-training objectives. We then demonstrate the practical usage of our framework through controlled interventions across multiple vision tasks. We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior. We provide code, demos and models on our project website: this https URL.
[CV-7] Accelerating Data Processing and Benchmarking of AI Models for Pathology
Quick Read: This paper addresses the growing complexity of evaluating foundation models in computational pathology, caused by the increasing number of available models and the lack of standardized benchmarks. The key solution is a new suite of software tools covering whole-slide image processing, foundation model benchmarking, and curated publicly available tasks, intended to promote transparency, reproducibility, and continued progress in the field.
Link: https://arxiv.org/abs/2502.06750
Authors: Andrew Zhang, Guillaume Jaume, Anurag Vaidya, Tong Ding, Faisal Mahmood
Affiliations: Brigham and Women's Hospital, Harvard Medical School, Massachusetts General Hospital, Cancer Program, Broad Institute of Harvard and MIT, Health Sciences and Technology, Harvard-MIT, Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Advances in foundation modeling have reshaped computational pathology. However, the increasing number of available models and lack of standardized benchmarks make it increasingly complex to assess their strengths, limitations, and potential for further development. To address these challenges, we introduce a new suite of software tools for whole-slide image processing, foundation model benchmarking, and curated publicly available tasks. We anticipate that these resources will promote transparency, reproducibility, and continued progress in the field.
[CV-8] Wandering around: A bioinspired approach to visual attention through object motion sensitivity
Quick Read: This paper addresses selective attention in dynamic visual perception, in particular moving-object detection and scene understanding. The key is a bioinspired attention system based on Spiking Neural Networks (SNNs) that achieves selective attention through object motion sensitivity. Using a Dynamic Vision Sensor (DVS) integrated into Speck neuromorphic hardware and mounted on a Pan-Tilt unit, the system generates events via fixational eye movements to identify regions of interest (ROIs) and saccade toward them. Key innovations include SNN-based parallel computation adapted to dynamic environments and a learning-free design achieving real-time multi-object motion segmentation with high accuracy and low-latency response.
Link: https://arxiv.org/abs/2502.06747
Authors: Giulia D Angelo, Victoria Clerico, Chiara Bartolozzi, Matej Hoffmann, P. Michael Furlong, Alexander Hadjiivanov
Affiliations: Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University; IBM Research Europe, Zurich, Switzerland; Event-Driven Perception for Robotics, Italian Institute of Technology, Genoa, Italy; National Research Council of Canada & Systems Design Engineering, University of Waterloo, Canada; Advanced Concepts Team, European Space Agency, Noordwijk, The Netherlands; Adapsent, Leiden, The Netherlands
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Active vision enables dynamic visual perception, offering an alternative to static feedforward architectures in computer vision, which rely on large datasets and high computational resources. Biological selective attention mechanisms allow agents to focus on salient Regions of Interest (ROIs), reducing computational demand while maintaining real-time responsiveness. Event-based cameras, inspired by the mammalian retina, enhance this capability by capturing asynchronous scene changes enabling efficient low-latency processing. To distinguish moving objects while the event-based camera is in motion the agent requires an object motion segmentation mechanism to accurately detect targets and center them in the visual field (fovea). Integrating event-based sensors with neuromorphic algorithms represents a paradigm shift, using Spiking Neural Networks to parallelize computation and adapt to dynamic environments. This work presents a Spiking Convolutional Neural Network bioinspired attention system for selective attention through object motion sensitivity. The system generates events via fixational eye movements using a Dynamic Vision Sensor integrated into the Speck neuromorphic hardware, mounted on a Pan-Tilt unit, to identify the ROI and saccade toward it. The system, characterized using ideal gratings and benchmarked against the Event Camera Motion Segmentation Dataset, reaches a mean IoU of 82.2% and a mean SSIM of 96% in multi-object motion segmentation. The detection of salient objects reaches 88.8% accuracy in office scenarios and 89.8% in low-light conditions on the Event-Assisted Low-Light Video Object Segmentation Dataset. A real-time demonstrator shows the system’s 0.12 s response to dynamic scenes. Its learning-free design ensures robustness across perceptual scenes, making it a reliable foundation for real-time robotic applications serving as a basis for more complex architectures.
[CV-9] ViSIR: Vision Transformer Single Image Reconstruction Method for Earth System Models
[Quick Read]: This paper addresses the complexity and accuracy of single-image super-resolution (SR) reconstruction for Earth System Model (ESM) data. The key is a new architecture, Vision Transformer Sinusoidal Representation Networks (ViSIR), which combines the super-resolution capability of Vision Transformers (ViT) with the high-frequency detail preservation of Sinusoidal Representation Networks (SIREN) to address the spectral bias observed in conventional SR tasks.
Link: https://arxiv.org/abs/2502.06741
Authors: Ehsan Zeraatkar, Salah Faroughi, Jelena Tesic
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Purpose: Earth system models (ESMs) integrate the interactions of the atmosphere, ocean, land, ice, and biosphere to estimate the state of regional and global climate under a wide variety of conditions. The ESMs are highly complex, and thus, deep neural network architectures are used to model the complexity and store the down-sampled data. In this paper, we propose the Vision Transformer Sinusoidal Representation Networks (ViSIR) to improve the single-image super-resolution (SR) reconstruction task for the ESM data. Methods: ViSIR combines the SR capability of Vision Transformers (ViT) with the high-frequency detail preservation of the Sinusoidal Representation Network (SIREN) to address the spectral bias observed in SR tasks. Results: The ViSIR outperforms ViT by 4.1 dB, SIREN by 7.5 dB, and SR-Generative Adversarial Networks (SR-GANs) by 7.1 dB PSNR on average for three different measurements. Conclusion: The proposed ViSIR is evaluated and compared with state-of-the-art methods. The results show that the proposed algorithm outperforms other methods in terms of Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM).
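The SIREN half of ViSIR can be sketched as a sinusoidal layer with the omega_0-scaled initialization from the original SIREN paper. How ViSIR couples such layers to the ViT backbone is not reproduced here, so treat this only as a minimal illustration of the high-frequency component.

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """One SIREN layer: y = sin(omega_0 * (Wx + b)), with the standard init."""
    def __init__(self, in_f, out_f, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_f, out_f)
        with torch.no_grad():
            # First layer spans many periods; later layers preserve activation stats.
            bound = 1.0 / in_f if is_first else math.sqrt(6.0 / in_f) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# Tiny SIREN head mapping 2D pixel coordinates to an RGB value.
siren_head = nn.Sequential(
    SineLayer(2, 256, is_first=True),
    SineLayer(256, 256),
    nn.Linear(256, 3),
)
coords = torch.rand(1024, 2) * 2 - 1  # coordinates in [-1, 1]
rgb = siren_head(coords)
```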
[CV-10] Enhancing Pneumonia Diagnosis and Severity Assessment through Deep Learning: A Comprehensive Approach Integrating CNN Classification and Infection Segmentation
[Quick Read]: This paper addresses the global health challenge posed by lung disease, especially pneumonia, focusing on improving the accuracy and efficiency of pneumonia detection and assessment through deep learning. The key is introducing Convolutional Neural Network (CNN) models for pneumonia classification, with an emphasis on comprehensive diagnostic assessment in light of COVID-19, and advocating deep learning-based segmentation to determine the severity of infection. This dual approach offers valuable insights for medical professionals, enabling a more nuanced understanding and more effective treatment of pneumonia and ultimately improving healthcare outcomes worldwide.
Link: https://arxiv.org/abs/2502.06735
Authors: S Kumar Reddy Mallidi (1) ((1) Sri Vasavi Engineering College)
Affiliations: Sri Vasavi Engineering College
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Lung disease poses a substantial global health challenge, with pneumonia being a prevalent concern. This research focuses on leveraging deep learning techniques to detect and assess pneumonia, addressing two interconnected objectives. Initially, Convolutional Neural Network (CNN) models are introduced for pneumonia classification, emphasizing the necessity of comprehensive diagnostic assessments considering COVID-19. Subsequently, the study advocates for the utilization of deep learning-based segmentation to determine the severity of infection. This dual-pronged approach offers valuable insights for medical professionals, facilitating a more nuanced understanding and effective treatment of pneumonia. Integrating deep learning aims to elevate the accuracy and efficiency of pneumonia detection, thereby contributing to enhanced healthcare outcomes on a global scale.
[CV-11] Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists
[Quick Read]: This paper addresses the challenges of existing video editing methods: inversion-based methods are time-consuming at inference and struggle with fine-grained editing instructions, while end-to-end methods produce poor edits due to the lack of high-quality training video pairs. The key is Señorita-2M, a high-quality dataset of roughly 2 million video editing pairs, built with four specialized video editing models and cleaned by a filtering pipeline that removes poorly edited pairs. The work also explores common video editing architectures to identify the most effective structure based on current pre-trained generative models.
Link: https://arxiv.org/abs/2502.06734
Authors: Bojia Zi, Penghui Ruan, Marco Chen, Xianbiao Qi, Shaozhe Hao, Shihao Zhao, Youze Huang, Bin Liang, Rong Xiao, Kam-Fai Wong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still suffer from several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. On the other hand, end-to-end methods, which rely on edited video pairs for training, offer faster inference speeds but often produce poor editing results due to a lack of high-quality training video pairs. In this paper, to close the gap in end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset. Señorita-2M consists of approximately 2 million video editing pairs. It is built with four high-quality, specialized video editing models, each crafted and trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure based on current pre-trained generative models. Extensive experiments show that our dataset can help to yield remarkably high-quality video editing results. More details are available at this https URL.
[CV-12] Learning Musical Representations for Music Performance Question Answering EMNLP2024
[Quick Read]: This paper addresses the inability of existing multimodal learning methods to handle audio-visual question answering accurately in music performance scenarios. The key lies in designing a backbone that captures the intricate multimodal interactions in music data, annotating and releasing rhythm and music-source labels so the model can learn musical characteristics, and aligning the model's music predictions with the temporal dimension for time-aware audio-visual modeling. Together these yield state-of-the-art results on the Music AVQA datasets.
Link: https://arxiv.org/abs/2502.06710
Authors: Xingjian Diao, Chunhui Zhang, Tingxuan Wu, Ming Cheng, Zhongyu Ouyang, Weiyi Wu, Jiang Gui
Affiliations: Dartmouth College; London School of Economics and Political Science
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at EMNLP 2024
Abstract:Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances continuously involve dense audio signals throughout. While existing multimodal learning methods on the audio-video QA demonstrate impressive capabilities in general scenarios, they are incapable of dealing with fundamental problems within the music performances: they underexplore the interaction between the multimodal signals in performance and fail to consider the distinctive characteristics of instruments and music. Therefore, existing methods tend to answer questions regarding musical performances inaccurately. To bridge the above research gaps, (i) given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions within the context of music; (ii) to enable the model to learn music characteristics, we annotate and release rhythmic and music sources in the current music datasets; (iii) for time-aware audio-visual modeling, we align the model’s music predictions with the temporal dimension. Our experiments show state-of-the-art effects on the Music AVQA datasets. Our code is available at this https URL.
[CV-13] TEMSET-24K: Densely Annotated Dataset for Indexing Multipart Endoscopic Videos using Surgical Timeline Segmentation
[Quick Read]: This paper addresses automatic indexing of endoscopic surgical videos, a process that currently relies on time-consuming manual annotation. The key is TEMSET-24K, an open-source dataset of 24,306 trans-anal endoscopic microsurgery (TEMS) video micro-clips, each meticulously annotated by clinical experts using a novel hierarchical labeling taxonomy of phase, task, and action triplets that captures intricate surgical workflows. Deep learning models, including transformer-based architectures, were benchmarked on the dataset, demonstrating high accuracy (up to 0.99) and F1 scores (up to 0.99) on key phases such as Setup and Suturing.
Link: https://arxiv.org/abs/2502.06708
Authors: Muhammad Bilal, Mahmood Alam, Deepa Bapu, Stephan Korsgen, Neeraj Lal, Simon Bach, Amir M Hajivanand, Muhammed Ali, Kamran Soomro, Iqbal Qasim, Paweł Capik, Aslam Khan, Zaheer Khan, Hunaid Vohra, Massimo Caputo, Andrew Beggs, Adnan Qayyum, Junaid Qadir, Shazad Ashraf
Affiliations: Birmingham City University; University Hospitals Birmingham; University of Birmingham; University of the West of England; University of Hertfordshire; University of Bristol; Qatar University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Indexing endoscopic surgical videos is vital in surgical data science, forming the basis for systematic retrospective analysis and clinical performance evaluation. Despite its significance, current video analytics rely on manual indexing, a time-consuming process. Advances in computer vision, particularly deep learning, offer automation potential, yet progress is limited by the lack of publicly available, densely annotated surgical datasets. To address this, we present TEMSET-24K, an open-source dataset comprising 24,306 trans-anal endoscopic microsurgery (TEMS) video micro-clips. Each clip is meticulously annotated by clinical experts using a novel hierarchical labeling taxonomy encompassing phase, task, and action triplets, capturing intricate surgical workflows. To validate this dataset, we benchmarked deep learning models, including transformer-based architectures. Our in silico evaluation demonstrates high accuracy (up to 0.99) and F1 scores (up to 0.99) for key phases like Setup and Suturing. The STALNet model, tested with ConvNeXt, ViT, and SWIN V2 encoders, consistently segmented well-represented phases. TEMSET-24K provides a critical benchmark, propelling state-of-the-art solutions in surgical data science.
[CV-14] Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene
[Quick Read]: This paper addresses the limitations of self-driving cars that rely on ego-centric perception, in particular their inability to detect occluded or faraway objects. The proposed remedy is to generate realistic perception data from different viewpoints in a driving scene, conditioned on a real-world sample, namely the ego-car's sensory data. The key is the Transfer Your Perspective (TYP) method, which combines simulated collaborative data with real ego-car data to train a conditioned diffusion model whose outputs are not only realistic but also consistent with the given ego-car data in both semantics and layout. This makes it possible to scale up the development of collaborative autonomous driving (CAV) systems from existing ego-car datasets, without large amounts of real collaborative data.
Link: https://arxiv.org/abs/2502.06682
Authors: Tai-Yu Pan, Sooyoung Jeon, Mengdi Fan, Jinsu Yoo, Zhenyang Feng, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Self-driving cars relying solely on ego-centric perception face limitations in sensing, often failing to detect occluded, faraway objects. Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial. It requires placing multiple sensor-equipped agents in a real-world driving scene, simultaneously! As such, existing datasets are limited in locations and agents. We introduce a novel surrogate to the rescue, which is to generate realistic perception from different viewpoints in a driving scene, conditioned on a real-world sample - the ego-car’s sensory data. This surrogate has huge potential: it could potentially turn any ego-car dataset into a collaborative driving one to scale up the development of CAV. We present the very first solution, using a combination of simulated collaborative data and real ego-car data. Our method, Transfer Your Perspective (TYP), learns a conditioned diffusion model whose output samples are not only realistic but also consistent in both semantics and layouts with the given ego-car data. Empirical results demonstrate TYP’s effectiveness in aiding in a CAV setting. In particular, TYP enables us to (pre-)train collaborative perception algorithms like early and late fusion with little or no real-world collaborative data, greatly facilitating downstream CAV applications.
[CV-15] CHIRLA: Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis
[Quick Read]: This paper addresses long-term person re-identification (Re-ID), especially under significant changes across time and environments. The key is a new dataset, CHIRLA (Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis), containing seven months of surveillance video with substantial variation in temporal and appearance attributes, including controlled changes in participants' clothing and physical features to simulate real-world complexity. This comprehensive benchmark aims to advance the development and evaluation of Re-ID algorithms that can perform reliably in challenging, long-term real-world scenarios.
Link: https://arxiv.org/abs/2502.06681
Authors: Bessie Dominguez-Dager, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla
Affiliations: Institute for Computing Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Person re-identification (Re-ID) is a key challenge in computer vision, requiring the matching of individuals across different cameras, locations, and time periods. While most research focuses on short-term scenarios with minimal appearance changes, real-world applications demand robust Re-ID systems capable of handling long-term scenarios, where persons’ appearances can change significantly due to variations in clothing and physical characteristics. In this paper, we present CHIRLA, Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis, a novel dataset specifically designed for long-term person Re-ID. CHIRLA consists of recordings from strategically placed cameras over a seven-month period, capturing significant variations in both temporal and appearance attributes, including controlled changes in participants’ clothing and physical features. The dataset includes 22 individuals, four connected indoor environments, and seven cameras. We collected more than five hours of video that we semi-automatically labeled to generate around one million bounding boxes with identity annotations. By introducing this comprehensive benchmark, we aim to facilitate the development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios.
[CV-16] Prototype Contrastive Consistency Learning for Semi-Supervised Medical Image Segmentation
[Quick Read]: This paper addresses the difficulty of precise semi-supervised medical image segmentation, where partial-pixel contrastive learning methods ignore the whole-image context of unlabeled data. The key is Prototype Contrastive Consistency Segmentation (PCCS), which constructs prototypes for contrastive learning from signed distance maps and uncertainty maps and, building on a student-teacher architecture, designs a new mechanism called prototype-updating-prototype to refine them. An uncertainty-consistency loss is further introduced to mine more reliable information from unlabeled data.
Link: https://arxiv.org/abs/2502.06650
Authors: Shihuan He, Zhihui Lai, Ruxin Wang, Heng Kong
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University; Department of Breast and Thyroid Surgery, Baoan Central Hospital of Shenzhen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures, 7 tables
Abstract:Medical image segmentation is a crucial task in medical image analysis, but it can be very challenging especially when there are less labeled data but with large unlabeled data. Contrastive learning has proven to be effective for medical image segmentation in semi-supervised learning by constructing contrastive samples from partial pixels. However, although previous contrastive learning methods can mine semantic information from partial pixels within images, they ignore the whole context information of unlabeled images, which is very important to precise segmentation. In order to solve this problem, we propose a novel prototype contrastive learning method called Prototype Contrastive Consistency Segmentation (PCCS) for semi-supervised medical image segmentation. The core idea is to enforce the prototypes of the same semantic class to be closer and push the prototypes in different semantic classes far away from each other. Specifically, we construct a signed distance map and an uncertainty map from unlabeled images. The signed distance map is used to construct prototypes for contrastive learning, and then we estimate the prototype uncertainty from the uncertainty map as trade-off among prototypes. In order to obtain better prototypes, based on the student-teacher architecture, a new mechanism named prototype updating prototype is designed to assist in updating the prototypes for contrastive learning. In addition, we propose an uncertainty-consistency loss to mine more reliable information from unlabeled data. Extensive experiments on medical image segmentation demonstrate that PCCS achieves better segmentation performance than the state-of-the-art methods. The code is available at this https URL.
[CV-17] Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging
[Quick Read]: This paper addresses accurate classification and anatomical localization in medical diagnostics and research when labeled data is limited. The key is adapting Prototypical Networks for few-shot tissue classification and the Propagation-Reconstruction Network (PRNet) for anatomical landmark localization in SPECT images. A Prototypical Network with a pre-trained ResNet-18 backbone classified ventricles, myocardium, and liver in 2D cardiac slices with 96.67% training and 93.33% validation accuracy, while PRNet, adapted with an encoder-decoder architecture and skip connections, reached a training loss of 1.395, accurately reconstructing patches and capturing spatial relationships. These results highlight the potential of Prototypical Networks for tissue classification with limited labeled data and of PRNet for anatomical landmark localization, improving the performance of deep learning frameworks.
Link: https://arxiv.org/abs/2502.06632
Authors: Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra
Affiliations: Florida Institute of Technology, Melbourne, FL 32901 USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 2 pages, 2 figures
Abstract:Accurate classification and anatomical localization are essential for effective medical diagnostics and research, which may be efficiently performed using deep learning techniques. However, availability of limited labeled data poses a significant challenge. To address this, we adapted Prototypical Networks and the Propagation-Reconstruction Network (PRNet) for few-shot classification and localization, respectively, in Single Photon Emission Computed Tomography (SPECT) images. For the proof of concept we used a 2D-sliced image cropped around heart. The Prototypical Network, with a pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for 2D imaging with an encoder-decoder architecture and skip connections, achieved a training loss of 1.395, accurately reconstructing patches and capturing spatial relationships. These results highlight the potential of Prototypical Networks for tissue classification with limited labeled data and PRNet for anatomical landmark localization, paving the way for improved performance in deep learning frameworks.
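The Prototypical Network step is standard and easy to sketch: class prototypes are mean support embeddings, and queries are classified by negative squared distance. The embedding dimension, shot count, and function names below are illustrative; in the paper the embeddings would come from the pre-trained ResNet-18 backbone.

```python
import torch

def prototypical_logits(support, support_labels, query, n_classes):
    """Few-shot episode: class prototypes are mean support embeddings;
    queries are scored by negative squared Euclidean distance."""
    prototypes = torch.stack([
        support[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                             # (n_classes, d)
    dists = torch.cdist(query, prototypes) ** 2    # (n_query, n_classes)
    return -dists                                  # feed to cross-entropy

# Toy episode with 3 tissue classes and 64-dim embeddings.
emb_dim, n_classes, n_shots = 64, 3, 5
support = torch.randn(n_classes * n_shots, emb_dim)
support_labels = torch.arange(n_classes).repeat_interleave(n_shots)
query = torch.randn(12, emb_dim)
logits = prototypical_logits(support, support_labels, query, n_classes)
pred = logits.argmax(dim=1)                        # predicted tissue class per query
```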
[CV-18] Conformal Predictions for Human Action Recognition with Vision-Language Models
[Quick Read]: This paper addresses how, in video surveillance, Human Action Recognition (HAR) methods built on pre-trained Vision-Language Models (VLMs) can effectively reduce the number of candidate classes without producing long-tailed prediction-set distributions. The key is tuning the temperature parameter of the VLM to minimize these long tails, without requiring additional calibration data.
Link: https://arxiv.org/abs/2502.06631
Authors: Bary Tim, Fuchs Clément, Macq Benoît
Affiliations: UCLouvain
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 7 figures
Abstract:Human-In-The-Loop (HITL) frameworks are integral to many real-world computer vision systems, enabling human operators to make informed decisions with AI assistance. Conformal Predictions (CP), which provide label sets with rigorous guarantees on ground truth inclusion probabilities, have recently gained traction as a valuable tool in HITL settings. One key application area is video surveillance, closely associated with Human Action Recognition (HAR). This study explores the application of CP on top of state-of-the-art HAR methods that utilize extensively pre-trained Vision-Language Models (VLMs). Our findings reveal that CP can significantly reduce the average number of candidate classes without modifying the underlying VLM. However, these reductions often result in distributions with long tails. To address this, we introduce a method based on tuning the temperature parameter of the VLMs to minimize these tails without requiring additional calibration data. Our code is made available on GitHub at the address this https URL.
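For readers unfamiliar with how a conformal prediction set is built on top of a (temperature-scaled) classifier, a minimal split-conformal sketch with the common 1 − p(true class) score follows. The authors' actual criterion for tuning T (minimizing long tails without extra calibration data) is not reproduced; the score and numbers here are illustrative assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def conformal_sets(cal_logits, cal_labels, test_logits, alpha=0.1, T=1.0):
    """Split conformal prediction with the simple score s = 1 - p(true class).
    Changing T reshapes the softmax and hence the size of the label sets."""
    n = len(cal_labels)
    scores = 1.0 - softmax(cal_logits, T)[np.arange(n), cal_labels]
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n),
                    method="higher")
    return [np.flatnonzero(1.0 - p <= q) for p in softmax(test_logits, T)]

# Toy usage: 100 calibration clips, 5 action classes.
rng = np.random.default_rng(0)
cal_logits = rng.normal(size=(100, 5))
cal_labels = rng.integers(0, 5, size=100)
test_logits = rng.normal(size=(10, 5))
sets = conformal_sets(cal_logits, cal_labels, test_logits, alpha=0.1, T=0.7)
print([s.tolist() for s in sets])  # each set covers the true label w.p. >= 0.9
```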
[CV-19] Unleashing the Potential of Pre-Trained Diffusion Models for Generalizable Person Re-Identification
[Quick Read]: This paper addresses shortcut learning in domain-generalizable person re-identification (DG Re-ID), aiming to improve performance on unseen target domains. The key is Diffusion Model-assisted Representation Learning with a Correlation-aware Conditioning Scheme (DCAC), which integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model. By combining ID classification probabilities from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures identity correlations to guide the diffusion process, while feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization of the Re-ID features.
Link: https://arxiv.org/abs/2502.06619
Authors: Jiachen Li, Xiaojin Gong
Affiliations: College of Information Science and Electronic Engineering, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness. Codes will be available at this https URL.
[CV-20] Multi-Scale Feature Fusion with Image-Driven Spatial Integration for Left Atrium Segmentation from Cardiac MRI Images
[Quick Read]: This paper addresses accurate left atrium (LA) segmentation from late gadolinium-enhanced MRI, which is critical for diagnosing and managing cardiovascular disease, especially when planning ablation therapy for atrial fibrillation (AF). The key is a segmentation framework that integrates DINOv2 as the encoder with a UNet-style decoder, using multi-scale feature fusion and input-image integration to improve accuracy. A learnable weighting mechanism dynamically prioritizes hierarchical features from different encoder blocks of the foundation model, optimizing feature selection for task relevance, while the input image is reintroduced during decoding to preserve high-resolution spatial detail and compensate for downsampling in the encoder.
Link: https://arxiv.org/abs/2502.06615
Authors: Bipasha Kundu, Zixin Yang, Richard Simon, Cristian Linte
Affiliations: Rochester Institute of Technology; Department of Biomedical Engineering, Rochester Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Accurate segmentation of the left atrium (LA) from late gadolinium-enhanced magnetic resonance imaging plays a vital role in visualizing diseased atrial structures, enabling the diagnosis and management of cardiovascular diseases. It is particularly essential for planning treatment with ablation therapy, a key intervention for atrial fibrillation (AF). However, manual segmentation is time-intensive and prone to inter-observer variability, underscoring the need for automated solutions. Class-agnostic foundation models like DINOv2 have demonstrated remarkable feature extraction capabilities in vision tasks. However, their lack of domain specificity and task-specific adaptation can reduce spatial resolution during feature extraction, impacting the capture of fine anatomical detail in medical imaging. To address this limitation, we propose a segmentation framework that integrates DINOv2 as an encoder with a UNet-style decoder, incorporating multi-scale feature fusion and input image integration to enhance segmentation accuracy. The learnable weighting mechanism dynamically prioritizes hierarchical features from different encoder blocks of the foundation model, optimizing feature selection for task relevance. Additionally, the input image is reintroduced during the decoding stage to preserve high-resolution spatial details, addressing limitations of downsampling in the encoder. We validate our approach on the LAScarQS 2022 dataset and demonstrate improved performance with a 92.3% Dice and 84.1% IoU score for giant architecture compared to the nnUNet baseline model. These findings emphasize the efficacy of our approach in advancing the field of automated left atrium segmentation from cardiac MRI.
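The learnable weighting over hierarchical encoder features reduces to a softmax-weighted sum of block outputs, sketched below. The feature shapes and module name are assumptions; in the paper the inputs would be DINOv2 block outputs feeding a UNet-style decoder.

```python
import torch
import torch.nn as nn

class WeightedFeatureFusion(nn.Module):
    """Learnable softmax weights over hierarchical encoder blocks, so the
    decoder sees a task-adaptive mix of shallow and deep features."""
    def __init__(self, n_levels=4):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_levels))

    def forward(self, feats):             # feats: list of (B, C, H, W), same shape
        w = torch.softmax(self.logits, dim=0)
        return sum(wi * fi for wi, fi in zip(w, feats))

# Toy usage: four feature maps standing in for encoder block outputs.
fusion = WeightedFeatureFusion(n_levels=4)
feats = [torch.randn(2, 256, 32, 32) for _ in range(4)]
fused = fusion(feats)                     # (2, 256, 32, 32), passed to the decoder
```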
[CV-21] TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
[Quick Read]: This paper addresses the challenges of 3D shape generation in output quality, generalization, and alignment with input conditions. The key is TripoSG, which comprises a large-scale rectified flow transformer for 3D shape generation, a hybrid supervised training strategy combining SDF, normal, and eikonal losses, and a data processing pipeline that produces high-quality 3D samples. The seamless integration of these components enables state-of-the-art 3D shape generation, markedly improving detail and fidelity and strengthening the ability to generate 3D models from diverse image styles and content.
Link: https://arxiv.org/abs/2502.06608
Authors: Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, Yan-Pei Cao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.
[CV-22] Illegal Waste Detection in Remote Sensing Images: A Case Study
[Quick Read]: This paper addresses illegal waste dumping by developing a pipeline that detects candidate illegal landfill sites from Very-High-Resolution Remote Sensing images. The key is a Remote Sensing image classifier developed in collaboration with a local environmental agency: an extensive experimental analysis identified its best configuration, and in an experimental exercise integrating it into the agency's everyday work it saved time compared with manual photo-interpretation. The classifier also produced valuable results on a location outside the training area, indicating potential for cross-border applicability.
Link: https://arxiv.org/abs/2502.06607
Authors: Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Environmental crime currently represents the third largest criminal activity worldwide while threatening ecosystems as well as human health. Among the crimes related to this activity, improper waste management can nowadays be countered more easily thanks to the increasing availability and decreasing cost of Very-High-Resolution Remote Sensing images, which enable semi-automatic territory scanning in search of illegal landfills. This paper proposes a pipeline, developed in collaboration with professionals from a local environmental agency, for detecting candidate illegal dumping sites leveraging a classifier of Remote Sensing images. To identify the best configuration for such classifier, an extensive set of experiments was conducted and the impact of diverse image characteristics and training settings was thoroughly analyzed. The local environmental agency was then involved in an experimental exercise where outputs from the developed classifier were integrated in the experts’ everyday work, resulting in time savings with respect to manual photo-interpretation. The classifier was eventually run with valuable results on a location outside of the training area, highlighting potential for cross-border applicability of the proposed pipeline.
[CV-23] MaterialFusion: High-Quality Zero-Shot and Controllable Material Transfer with Diffusion Models
[Quick Read]: This paper addresses manipulating the material appearance of objects in images for high-fidelity material transfer in applications such as augmented reality, virtual prototyping, and digital content creation. The key is MaterialFusion, a new framework that lets users adjust the degree of material application for an optimal balance between the new material's properties and the object's original features, while integrating the modified object seamlessly into the scene by maintaining background consistency and mitigating boundary artifacts.
Link: https://arxiv.org/abs/2502.06606
Authors: Kamil Garifullin, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
Affiliations: HSE University; AIRI; Skolkovo Institute of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Manipulating the material appearance of objects in images is critical for applications like augmented reality, virtual prototyping, and digital content creation. We present MaterialFusion, a novel framework for high-quality material transfer that allows users to adjust the degree of material application, achieving an optimal balance between new material properties and the object’s original features. MaterialFusion seamlessly integrates the modified object into the scene by maintaining background consistency and mitigating boundary artifacts. To thoroughly evaluate our approach, we have compiled a dataset of real-world material transfer examples and conducted complex comparative analyses. Through comprehensive quantitative evaluations and user studies, we demonstrate that MaterialFusion significantly outperforms existing methods in terms of quality, user control, and background preservation. Code is available at this https URL.
[CV-24] A Large-scale AI-generated Image Inpainting Benchmark
[Quick Read]: This paper addresses the mismatch between the need for robust detection of highly realistic image manipulations and the limited scale and diversity of existing datasets. The key is a methodology for creating high-quality inpainting datasets, applied to produce DiQuID: over 95,000 inpainted images generated from 78,000 originals sourced from MS-COCO, RAISE, and OpenImages. The methodology has three components: (1) Semantically Aligned Object Replacement (SAOR), which identifies suitable objects through instance segmentation and generates contextually appropriate prompts; (2) Multiple Model Image Inpainting (MMII), which uses diverse state-of-the-art, primarily diffusion-based inpainting pipelines to create varied manipulations; and (3) Uncertainty-Guided Deceptiveness Assessment (UGDA), which evaluates image realism through comparative analysis with the originals. Together these improve the dataset's diversity as well as its aesthetic and technical quality.
Link: https://arxiv.org/abs/2502.06593
Authors: Paschalis Giakoumoglou, Dimitrios Karageorgiou, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Affiliations: Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki; Information Technologies Institute, CERTH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in generative models enable highly realistic image manipulations, creating an urgent need for robust forgery detection methods. Current datasets for training and evaluating these methods are limited in scale and diversity. To address this, we propose a methodology for creating high-quality inpainting datasets and apply it to create DiQuID, comprising over 95,000 inpainted images generated from 78,000 original images sourced from MS-COCO, RAISE, and OpenImages. Our methodology consists of three components: (1) Semantically Aligned Object Replacement (SAOR) that identifies suitable objects through instance segmentation and generates contextually appropriate prompts, (2) Multiple Model Image Inpainting (MMII) that employs various state-of-the-art inpainting pipelines primarily based on diffusion models to create diverse manipulations, and (3) Uncertainty-Guided Deceptiveness Assessment (UGDA) that evaluates image realism through comparative analysis with originals. The resulting dataset surpasses existing ones in diversity, aesthetic quality, and technical quality. We provide comprehensive benchmarking results using state-of-the-art forgery detection methods, demonstrating the dataset’s effectiveness in evaluating and improving detection algorithms. Through a human study with 42 participants on 1,000 images, we show that while humans struggle with images classified as deceiving by our methodology, models trained on our dataset maintain high performance on these challenging cases. Code and dataset are available at this https URL.
[CV-25] evclust: Python library for evidential clustering
[Quick Read]: This paper is concerned with expressing and capturing the uncertainty of cluster membership in clustering analysis. The key is evidential clustering, which uses the Dempster-Shafer theory of belief functions to manage and represent this uncertainty, producing a credal partition: a structured set of mass functions that quantify the uncertain assignment of each object to potential groups. The paper presents evclust, a Python framework offering a suite of efficient evidential clustering algorithms together with tools for visualizing, evaluating, and analyzing credal partitions.
Link: https://arxiv.org/abs/2502.06587
Authors: Armel Soubeiga, Violaine Antoine
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Comments: 13 pages, 2 figures, Preprint
Abstract:A recent developing trend in clustering is the advancement of algorithms that not only identify clusters within data, but also express and capture the uncertainty of cluster membership. Evidential clustering addresses this by using the Dempster-Shafer theory of belief functions, a framework designed to manage and represent uncertainty. This approach results in a credal partition, a structured set of mass functions that quantify the uncertain assignment of each object to potential groups. The Python framework evclust, presented in this paper, offers a suite of efficient evidence clustering algorithms as well as tools for visualizing, evaluating and analyzing credal partitions.
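To give a feel for what a credal partition encodes, the sketch below represents mass functions over subsets of a two-cluster frame and computes belief and plausibility of membership. This is a conceptual illustration only, not the evclust API; consult the package documentation for its actual interfaces.

```python
import numpy as np

# A credal partition assigns each object a mass function over subsets of the
# cluster frame Omega = {0, 1}. Focal sets: {} (outlier), {0}, {1}, {0,1} (doubt).
focal_sets = [frozenset(), frozenset({0}), frozenset({1}), frozenset({0, 1})]

# Toy masses for 3 objects (each row sums to 1).
M = np.array([
    [0.0, 0.9, 0.05, 0.05],   # clearly in cluster 0
    [0.0, 0.1, 0.2,  0.7],    # ambiguous between the two clusters
    [0.8, 0.1, 0.1,  0.0],    # likely an outlier
])

def belief(masses, A):
    """Bel(A): total mass of non-empty focal sets included in A."""
    return sum(m for m, F in zip(masses, focal_sets) if F and F <= A)

def plausibility(masses, A):
    """Pl(A): total mass of focal sets intersecting A."""
    return sum(m for m, F in zip(masses, focal_sets) if F & A)

A = frozenset({0})
for i, m in enumerate(M):
    print(f"object {i}: Bel(cluster 0)={belief(m, A):.2f}, "
          f"Pl(cluster 0)={plausibility(m, A):.2f}")
```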
[CV-26] Adaptive Perception for Unified Visual Multi-modal Object Tracking
[Quick Read]: This paper addresses the imbalanced reliance on modalities in multi-modal trackers, which limits their ability to dynamically exploit complementary information from each modality, especially in complex scenarios. The key is APTrack, a novel unified tracker that explores a unified representation through an equal modeling strategy, allowing the model to adapt dynamically to various modalities and tasks without additional fine-tuning between tasks. It further integrates an adaptive modality interaction (AMI) module that efficiently bridges cross-modality interactions by generating learnable tokens.
Link: https://arxiv.org/abs/2502.06583
Authors: Xiantao Hu, Bineng Zhong, Qihua Liang, Zhiyi Mo, Liangtao Shi, Ying Tai, Jian Yang
Affiliations: Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University; School of Data Science and Software Engineering, Wuzhou University; School of Intelligence Science and Technology, Nanjing University; PCA-Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, many multi-modal trackers prioritize RGB as the dominant modality, treating other modalities as auxiliary, and fine-tuning separately various multi-modal tasks. This imbalance in modality dependence limits the ability of methods to dynamically utilize complementary information from each modality in complex scenarios, making it challenging to fully perceive the advantages of multi-modal. As a result, a unified parameter model often underperforms in various multi-modal tracking tasks. To address this issue, we propose APTrack, a novel unified tracker designed for multi-modal adaptive perception. Unlike previous methods, APTrack explores a unified representation through an equal modeling strategy. This strategy allows the model to dynamically adapt to various modalities and tasks without requiring additional fine-tuning between different tasks. Moreover, our tracker integrates an adaptive modality interaction (AMI) module that efficiently bridges cross-modality interactions by generating learnable tokens. Experiments conducted on five diverse multi-modal datasets (RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022) demonstrate that APTrack not only surpasses existing state-of-the-art unified multi-modal trackers but also outperforms trackers designed for specific multi-modal tasks.
[CV-27] A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems
[Quick Read]: This survey addresses efficient processing, real-time inference, and privacy-preserving analysis in distributed video analytics. The key lies in the architectural advantages of cloud-edge-terminal collaborative (CETC) systems, which distribute video processing tasks and enable adaptive analytics across cloud, edge, and terminal devices, powering breakthroughs in video surveillance, autonomous driving, and smart cities. The survey covers edge-centric and cloud-centric processing as well as hybrid video analytics that optimize whole-system performance through adaptive task offloading and resource-aware scheduling, and it considers the new opportunities and challenges brought by large language models and multimodal integration.
Link: https://arxiv.org/abs/2502.06581
Authors: Linxiao Gong, Hao Yang, Gaoyun Fang, Bobo Ju, Juncen Guo, Xiaoguang Zhu, Yan Wang, Xiping Hu, Peng Sun, Azzedine Boukerche
Affiliations: Fudan University; University of Toronto; Imperial College London; University of California, Davis; SMBU; University of Ottawa
Subjects: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:The explosive growth of video data has driven the development of distributed video analytics in cloud-edge-terminal collaborative (CETC) systems, enabling efficient video processing, real-time inference, and privacy-preserving analysis. Among multiple advantages, CETC systems can distribute video processing tasks and enable adaptive analytics across cloud, edge, and terminal devices, leading to breakthroughs in video surveillance, autonomous driving, and smart cities. In this survey, we first analyze fundamental architectural components, including hierarchical, distributed, and hybrid frameworks, alongside edge computing platforms and resource management mechanisms. Building upon these foundations, edge-centric approaches emphasize on-device processing, edge-assisted offloading, and edge intelligence, while cloud-centric methods leverage powerful computational capabilities for complex video understanding and model training. Our investigation also covers hybrid video analytics incorporating adaptive task offloading and resource-aware scheduling techniques that optimize performance across the entire system. Beyond conventional approaches, recent advances in large language models and multimodal integration reveal both opportunities and challenges in platform scalability, data protection, and system reliability. Future directions also encompass explainable systems, efficient processing mechanisms, and advanced video analytics, offering valuable insights for researchers and practitioners in this dynamic field.
[CV-28] Diffusion Models for Computational Neuroimaging: A Survey
[Quick Read]: This survey explores the feasibility and effectiveness of applying diffusion models to computational neuroimaging for tasks such as data enhancement, disease diagnosis, and brain decoding. The key lies in examining how variations in the denoising starting point, conditioning input, and generation target of diffusion models are developed to enhance specific neuroimaging tasks.
Link: https://arxiv.org/abs/2502.06552
Authors: Haokai Zhao, Haowei Lou, Lina Yao, Wei Peng, Ehsan Adeli, Kilian M Pohl, Yu Zhang
Affiliations: UNSW Sydney; CSIRO's Data61; Stanford University; Lehigh University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 1 figure
Abstract:Computational neuroimaging involves analyzing brain images or signals to provide mechanistic insights and predictive tools for human cognition and behavior. While diffusion models have shown stability and high-quality generation in natural images, there is increasing interest in adapting them to analyze brain data for various neurological tasks such as data enhancement, disease diagnosis and brain decoding. This survey provides an overview of recent efforts to integrate diffusion models into computational neuroimaging. We begin by introducing the common neuroimaging data modalities, follow with the diffusion formulations and conditioning mechanisms. Then we discuss how the variations of the denoising starting point, condition input and generation target of diffusion models are developed and enhance specific neuroimaging tasks. For a comprehensive overview of the ongoing research, we provide a publicly available repository at this https URL.
[CV-29] Sequence Transferability and Task Order Selection in Continual Learning
[Quick Read]: This paper addresses task-order selection in continual learning to improve model performance. The key is two novel measures that capture the total transferability of a task sequence, in either the forward or backward direction. Based on the empirical properties of these measures, the paper develops a new task-order selection method that significantly outperforms the conventional strategy of random task selection.
Link: https://arxiv.org/abs/2502.06544
Authors: Thinh Nguyen, Cuong N. Nguyen, Quang Pham, Binh T. Nguyen, Savitha Ramasamy, Xiaoli Li, Cuong V. Nguyen
Affiliations: VinUniversity, Vietnam; Durham University, United Kingdom; Independent Researcher; VNU-HCM, University of Science, Vietnam; Institute for Infocomm Research, A*STAR, Singapore
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures
Abstract:In continual learning, understanding the properties of task sequences and their relationships to model performance is important for developing advanced algorithms with better accuracy. However, efforts in this direction remain underdeveloped despite encouraging progress in methodology development. In this work, we investigate the impacts of sequence transferability on continual learning and propose two novel measures that capture the total transferability of a task sequence, either in the forward or backward direction. Based on the empirical properties of these measures, we then develop a new method for the task order selection problem in continual learning. Our method can be shown to offer a better performance than the conventional strategy of random task selection.
[CV-30] Unsupervised Learning for Feature Extraction and Temporal Alignment of 3Dt Point Clouds of Zebrafish Embryos
[Quick Read]: This paper addresses synchronizing the developmental stages of zebrafish embryos for downstream analysis. The key is an unsupervised method that extracts descriptive features from 3D+t point clouds of zebrafish embryos and uses them for temporal alignment: an autoencoder architecture learns a descriptive representation of the point clouds, and a deep regression network performs the alignment. The approach achieves high accuracy, with an average mismatch of only 3.83 minutes over an experimental duration of 5.3 hours, requires no manual labeling, scales easily, and avoids the subjective bias that manual annotation can introduce.
Link: https://arxiv.org/abs/2502.06543
Authors: Zhu Chen, Ina Laube, Johannes Stegmaier
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Zebrafish are widely used in biomedical research and developmental stages of their embryos often need to be synchronized for further analysis. We present an unsupervised approach to extract descriptive features from 3D+t point clouds of zebrafish embryos and subsequently use those features to temporally align corresponding developmental stages. An autoencoder architecture is proposed to learn a descriptive representation of the point clouds and we designed a deep regression network for their temporal alignment. We achieve a high alignment accuracy with an average mismatch of only 3.83 minutes over an experimental duration of 5.3 hours. As a fully-unsupervised approach, there is no manual labeling effort required and unlike manual analyses the method easily scales. Besides, the alignment without human annotation of the data also avoids any influence caused by subjective bias.
[CV-31] CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers
[Quick Read]: This paper addresses temporal inconsistency and quality degradation in personalized video generation. The key is CustomVideoX, a framework that leverages a video diffusion transformer to generate personalized videos from a reference image. Only the LoRA parameters are trained to extract reference features, ensuring efficiency and adaptability. A 3D Reference Attention mechanism enables direct, simultaneous interaction between reference-image features and all video frames across spatial and temporal dimensions, while a Time-Aware Reference Attention Bias (TAB) strategy dynamically modulates the reference bias over different time steps. Together these improve the consistency and quality of the generated videos.
Link: https://arxiv.org/abs/2502.06527
Authors: D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, Yunlong Yu, Siming Fu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures
Abstract:Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content during inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy, dynamically modulating reference bias over different time steps. Additionally, we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly activated regions of key entity tokens with reference feature injection by adjusting attention bias. To thoroughly evaluate personalized video generation, we establish a new benchmark, VideoBench, comprising over 50 objects and 100 prompts for extensive assessment. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.
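The TAB idea, a reference-attention bias that varies with the denoising timestep, can be illustrated with a toy schedule added to the attention logits of reference tokens. The linear form, the bias range, and all names below are assumptions for illustration, not the paper's actual schedule.

```python
import torch

def time_aware_reference_bias(t, T, b_max=2.0, b_min=0.0):
    """Assumed TAB-style schedule: weight reference-image attention strongly
    early in denoising (large t) and fade it out, so the reference shapes
    structure without overpowering later detail refinement."""
    return b_min + (b_max - b_min) * (t / T)

# The bias is added to the attention logits of reference-image tokens.
q = torch.randn(1, 8, 16, 64)              # (batch, heads, query tokens, dim)
k_ref = torch.randn(1, 8, 4, 64)           # reference-image key tokens
logits = q @ k_ref.transpose(-2, -1) / 64 ** 0.5
logits = logits + time_aware_reference_bias(t=800, T=1000)
attn = logits.softmax(dim=-1)              # reference attention, time-modulated
```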
[CV-32] SIREN: Semantic Initialization-Free Registration of Multi-Robot Gaussian Splatting Maps
[Quick Read]: This paper addresses registering multi-robot Gaussian Splatting (GSplat) maps without access to camera poses, images, or inter-map transforms for initialization or fusion of local submaps. The key is exploiting the versatility and robustness of semantics in three steps. First, feature-rich regions of the local maps are identified where the registration problem is better posed, eliminating the need for initialization. Second, candidate correspondences between Gaussians in the local maps are identified using robust semantic features, forming the basis for geometric optimization that coarsely aligns 3D Gaussian primitives extracted from the local maps. Third, this key step enables subsequent photometric refinement of the inter-map transformation, where novel-view synthesis in GSplat maps together with a semantics-based image filter is used to compute a high-accuracy non-rigid transformation and produce a high-fidelity fused map.
Link: https://arxiv.org/abs/2502.06519
Authors: Ola Shorinwa, Jiankai Sun, Mac Schwager, Anirudha Majumdar
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present SIREN for registration of multi-robot Gaussian Splatting (GSplat) maps, with zero access to camera poses, images, and inter-map transforms for initialization or fusion of local submaps. To realize these capabilities, SIREN harnesses the versatility and robustness of semantics in three critical ways to derive a rigorous registration pipeline for multi-robot GSplat maps. First, SIREN utilizes semantics to identify feature-rich regions of the local maps where the registration problem is better posed, eliminating the need for any initialization which is generally required in prior work. Second, SIREN identifies candidate correspondences between Gaussians in the local maps using robust semantic features, constituting the foundation for robust geometric optimization, coarsely aligning 3D Gaussian primitives extracted from the local maps. Third, this key step enables subsequent photometric refinement of the transformation between the submaps, where SIREN leverages novel-view synthesis in GSplat maps along with a semantics-based image filter to compute a high-accuracy non-rigid transformation for the generation of a high-fidelity fused map. We demonstrate the superior performance of SIREN compared to competing baselines across a range of real-world datasets, and in particular, across the most widely-used robot hardware platforms, including a manipulator, drone, and quadruped. In our experiments, SIREN achieves about 90x smaller rotation errors, 300x smaller translation errors, and 44x smaller scale errors in the most challenging scenes, where competing methods struggle. We will release the code and provide a link to the project page after the review process.
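The geometric core of such a pipeline, semantic matching of Gaussian centers followed by a similarity-transform fit, can be sketched with a nearest-neighbor match and the classic Umeyama/Kabsch solution. Robust outlier filtering and the later photometric refinement stage are omitted, and all names here are ours, not SIREN's code.

```python
import numpy as np

def match_and_align(mu_a, feat_a, mu_b, feat_b):
    """Match Gaussian centers across two submaps by cosine similarity of their
    semantic features, then fit a similarity transform (scale s, rotation R,
    translation t) from the matched pairs via the Umeyama method."""
    fa = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    fb = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    idx = (fa @ fb.T).argmax(axis=1)          # nearest semantic neighbor in map B
    A, B = mu_a, mu_b[idx]
    ca, cb = A.mean(0), B.mean(0)
    H = (A - ca).T @ (B - cb)                 # cross-covariance of centered points
    U, S, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflections
    R = Vt.T @ D @ U.T
    s = (S * np.diag(D)).sum() / ((A - ca) ** 2).sum()
    t = cb - s * R @ ca
    return s, R, t

# Toy usage: the same map rotated, scaled, and translated.
rng = np.random.default_rng(0)
mu_a = rng.normal(size=(50, 3))
feat = rng.normal(size=(50, 16))              # shared semantic features
th = 0.3
Rz = np.array([[np.cos(th), -np.sin(th), 0], [np.sin(th), np.cos(th), 0], [0, 0, 1]])
mu_b = 1.5 * mu_a @ Rz.T + np.array([1.0, 2.0, 0.5])
s, R, t = match_and_align(mu_a, feat, mu_b, feat)  # recovers s = 1.5, R = Rz
```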
[CV-33] Boost-and-Skip: A Simple Guidance-Free Diffusion for Minority Generation
[Quick Read]: This paper addresses the reliance of existing diffusion-based minority generators on computationally expensive guidance dedicated to minority generation. The key is Boost-and-Skip, a guidance-free method requiring only two minimal changes to the standard generative process: (i) variance-boosted initialization and (ii) timestep skipping. These seemingly trivial modifications are supported by solid theoretical and empirical evidence, significantly enhancing the ability to generate minority samples at far lower computational cost than guidance-based approaches.
Link: https://arxiv.org/abs/2502.06516
Authors: Soobin Um, Beomsu Kim, Jong Chul Ye
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 29 pages, 11 figures
Abstract:Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation, creative content generation, etc. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated for minority generation. To address this, here we present a simple yet powerful guidance-free approach called Boost-and-Skip for generating minority samples using diffusion models. The key advantage of our framework requires only two minimal changes to standard generative processes: (i) variance-boosted initialization and (ii) timestep skipping. We highlight that these seemingly-trivial modifications are supported by solid theoretical and empirical evidence, thereby effectively promoting emergence of underrepresented minority features. Our comprehensive experiments demonstrate that Boost-and-Skip greatly enhances the capability of generating minority samples, even rivaling guidance-based state-of-the-art approaches while requiring significantly fewer computations.
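Because the method is literally two edits to a vanilla sampler, it can be sketched directly: boost the variance of the initial noise and visit only every k-th timestep. The DDIM-style deterministic update, the boost factor, and the skip stride below are placeholder assumptions; the paper's analysis determines how these should actually be set.

```python
import torch

@torch.no_grad()
def boost_and_skip_sample(eps_model, alphas_cumprod, shape, boost=1.25, skip=2):
    """Guidance-free minority sampling, sketched as two edits to a plain
    diffusion sampling loop: (i) variance-boosted initial noise,
    (ii) visiting only every `skip`-th timestep."""
    T = len(alphas_cumprod)
    x = boost * torch.randn(shape)               # (i) variance-boosted init
    for t in range(T - 1, -1, -skip):            # (ii) timestep skipping
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - skip] if t - skip >= 0 else torch.tensor(1.0)
        eps = eps_model(x, torch.tensor([t]))
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM-style update
    return x

# Toy usage with a dummy noise predictor standing in for a trained model.
eps_model = lambda x, t: torch.zeros_like(x)
alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)
sample = boost_and_skip_sample(eps_model, alphas_cumprod, (1, 3, 32, 32))
```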
[CV-34] Learning Clustering-based Prototypes for Compositional Zero-shot Learning ICLR2025
[Quick Read]: This paper addresses the primary challenge of Compositional Zero-Shot Learning (CZSL): learning primitive attribute and object concepts from seen compositions. Existing methods rely on oversimplified data assumptions, such as modeling each primitive with a single-prototype representation, ignoring the natural diversity of an attribute when coupled with different objects (and vice versa). The key is ClusPro, a framework that defines the conceptual boundaries of primitives through a set of diversified prototypes. ClusPro automatically discovers and dynamically updates prototypes via within-primitive clustering and uses these representative prototypes to repaint a well-structured, independent primitive embedding space, ensuring intra-primitive separation and inter-primitive decorrelation through prototype-based contrastive learning and decorrelation learning. Moreover, ClusPro performs prototype clustering efficiently in a non-parametric fashion, introducing no extra learnable parameters or computational budget at test time. Experiments on three benchmarks show that ClusPro outperforms various top-leading CZSL methods under both closed-world and open-world settings.
Link: https://arxiv.org/abs/2502.06501
Authors: Hongyu Qu, Jianan Wei, Xiangbo Shu, Wenguan Wang
Affiliations: Nanjing University of Science and Technology; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025; Project page: this https URL
Abstract:Learning primitive (i.e., attribute and object) concepts from seen compositions is the primary challenge of Compositional Zero-Shot Learning (CZSL). Existing CZSL solutions typically rely on oversimplified data assumptions, e.g., modeling each primitive with a single centroid primitive representation, ignoring the natural diversities of the attribute (resp. object) when coupled with different objects (resp. attribute). In this work, we develop ClusPro, a robust clustering-based prototype mining framework for CZSL that defines the conceptual boundaries of primitives through a set of diversified prototypes. Specifically, ClusPro conducts within-primitive clustering on the embedding space for automatically discovering and dynamically updating prototypes. These representative prototypes are subsequently used to repaint a well-structured and independent primitive embedding space, ensuring intra-primitive separation and inter-primitive decorrelation through prototype-based contrastive learning and decorrelation learning. Moreover, ClusPro efficiently performs prototype clustering in a non-parametric fashion without the introduction of additional learnable parameters or computational budget during testing. Experiments on three benchmarks demonstrate ClusPro outperforms various top-leading CZSL solutions under both closed-world and open-world settings.
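The prototype-based contrastive term can be sketched as InfoNCE against a bank of K prototypes per primitive, pulling each embedding toward the nearest prototype of its own primitive and pushing it from all others. ClusPro's clustering-based prototype updates and decorrelation learning are not shown; all shapes and names here are illustrative, not the paper's notation.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(z, primitive_ids, prototypes, tau=0.1):
    """InfoNCE over diversified prototypes.

    z:             (B, d) L2-normalized embeddings
    primitive_ids: (B,)   primitive label per sample
    prototypes:    (P, K, d) K prototypes per primitive, L2-normalized
    """
    P, K, d = prototypes.shape
    sims = (z @ prototypes.reshape(P * K, d).T / tau).reshape(-1, P, K)  # (B, P, K)
    # Positive: nearest prototype of the sample's own primitive.
    pos = sims[torch.arange(len(z)), primitive_ids].max(dim=1).values
    # Denominator: all prototypes of all primitives.
    denom = torch.logsumexp(sims.reshape(len(z), -1), dim=1)
    return (denom - pos).mean()

# Toy usage: 6 primitives, 4 prototypes each, 32-dim embeddings.
z = F.normalize(torch.randn(16, 32), dim=1)
prototypes = F.normalize(torch.randn(6, 4, 32), dim=2)
labels = torch.randint(0, 6, (16,))
loss = prototype_contrastive_loss(z, labels, prototypes)
```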
[CV-35] Decision Boundary Optimization-Informed Domain Adaptation
[Quick Read]: This paper addresses the fact that existing domain adaptation (DA) methods focus on distribution alignment while neglecting to optimize the decision boundary. The key is a strengthened Maximum Mean Discrepancy measurement, Decision Boundary optimization-informed MMD (DB-MMD), which lets MMD carefully take decision boundaries into account, simultaneously optimizing distribution alignment and the cross-domain classifier within a hybrid framework and leading to theoretically bounded DA. DB-MMD is seamlessly embedded into several popular DA methods, such as MEDA and DGA-DA, to demonstrate its effectiveness under different experimental settings. On 8 standard DA datasets, DB-MMD-enforced methods improve over their plain-MMD baselines by margins as high as 9.5%.
Link: https://arxiv.org/abs/2502.06498
Authors: Lingkun Luo, Shiqiang Hu, Jie Yang, Liming Chen
Affiliations: Shanghai Jiao Tong University; École Centrale de Lyon
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Maximum Mean Discrepancy (MMD) is widely used in a number of domain adaptation (DA) methods and shows its effectiveness in aligning data distributions across domains. However, in previous DA research, MMD-based DA methods focus mostly on distribution alignment and neglect to optimize the decision boundary for classification-aware DA, thereby falling short in reducing the DA upper error bound. In this paper, we propose a strengthened MMD measurement, namely, Decision Boundary optimization-informed MMD (DB-MMD), which enables MMD to carefully take into account the decision boundaries, thereby simultaneously optimizing the distribution alignment and cross-domain classifier within a hybrid framework, and leading to a theoretical bound guided DA. We further seamlessly embed the proposed DB-MMD measurement into several popular DA methods, e.g., MEDA, DGA-DA, to demonstrate its effectiveness w.r.t. different experimental settings. We carry out comprehensive experiments using 8 standard DA datasets. The experimental results show that the DB-MMD enforced DA methods improve their baseline models using plain vanilla MMD, with a margin that can be as high as 9.5%.
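For reference, the plain-vanilla MMD that DB-MMD strengthens is the kernel two-sample statistic below; the decision-boundary-informed terms that make DB-MMD classification-aware are the paper's contribution and are not reproduced here.

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y under an
    RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Toy usage: source vs. mean-shifted target features.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 16))   # source-domain features
Xt = rng.normal(0.5, 1.0, size=(200, 16))   # target-domain features
print(mmd2_rbf(Xs, Xt))                     # > 0 when the distributions differ
```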
[CV-36] Biomechanical Reconstruction with Confidence Intervals from Multiview Markerless Motion Capture
[Quick Read]: This paper addresses the need, in clinical practice and at scale, for confidence intervals over specific kinematic estimates from multiview markerless motion capture (MMMC) for a specific individual; prior validation studies showed MMMC performs well on average but did not provide such intervals. The key is an implicit trajectory representation optimized end-to-end through a differentiable biomechanical model to learn the posterior probability distribution over pose given all detected keypoints. A variational approximation estimates confidence intervals for individual joints at each moment in a trial, with spatial errors for virtual marker locations generally within 10-15 mm, consistent with prior validation studies. The posterior also models the correlation structure over joint angles, such as correlations between hip and pelvis angles, and the estimated intervals make it possible to identify times and trials where kinematic uncertainty is high.
Link: https://arxiv.org/abs/2502.06486
Authors: R. James Cotton, Fabian Sinz
Affiliations: Shirley Ryan AbilityLab; Northwestern University; University of Göttingen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures
Abstract:Advances in multiview markerless motion capture (MMMC) promise high-quality movement analysis for clinical practice and research. While prior validation studies show MMMC performs well on average, they do not provide what is needed in clinical practice or for large-scale utilization of MMMC – confidence intervals over specific kinematic estimates from a specific individual analyzed using a possibly unique camera configuration. We extend our previous work using an implicit representation of trajectories optimized end-to-end through a differentiable biomechanical model to learn the posterior probability distribution over pose given all the detected keypoints. This posterior probability is learned through a variational approximation and estimates confidence intervals for individual joints at each moment in a trial, showing confidence intervals generally within 10-15 mm of spatial error for virtual marker locations, consistent with our prior validation studies. Confidence intervals over joint angles are typically only a few degrees and widen for more distal joints. The posterior also models the correlation structure over joint angles, such as correlations between hip and pelvis angles. The confidence intervals estimated through this method allow us to identify times and trials where kinematic uncertainty is high.
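The confidence-interval step can be illustrated by Monte-Carlo sampling from a Gaussian variational posterior over joint angles, whose covariance also carries the hip-pelvis correlation structure mentioned above. The Gaussian form, the toy covariance, and the function name are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def joint_angle_cis(mu, L, n_samples=1000, level=0.95):
    """Monte-Carlo confidence intervals from a Gaussian variational posterior
    over joint angles, q(theta) = N(mu, L @ L.T). The Cholesky factor L also
    encodes the correlation structure (e.g. hip vs. pelvis)."""
    rng = np.random.default_rng(0)
    eps = rng.standard_normal((n_samples, len(mu)))
    draws = mu + eps @ L.T                       # samples from the posterior
    lo, hi = np.percentile(draws, [(1 - level) / 2 * 100,
                                   (1 + level) / 2 * 100], axis=0)
    return lo, hi, np.corrcoef(draws, rowvar=False)

# Toy posterior over 3 joint angles (degrees) with correlated hip/pelvis terms.
mu = np.array([12.0, -4.0, 30.0])
cov = np.array([[4.0, 2.5, 0.0],
                [2.5, 4.0, 0.0],
                [0.0, 0.0, 1.0]])
lo, hi, corr = joint_angle_cis(mu, np.linalg.cholesky(cov))
```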
zh
[CV-37] Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
【速读】:该论文旨在解决图像质量评估(IQA)领域中图像尺度变化对其感知质量影响未被系统量化的问题。论文的关键在于引入图像固有尺度(Image Intrinsic Scale, IIS)这一概念,并通过主观测量与预测方法构建了IIS评估(IISA)任务。论文提出了基于图像固有尺度变化的弱标注策略(WIISA),利用图像在下采样过程中IIS的变化生成弱标签,从而改进了多个适用于IISA的IQA方法的性能。
链接: https://arxiv.org/abs/2502.06476
作者: Vlad Hosu,Lorenzo Agnolucci,Daisuke Iso,Dietmar Saupe
机构: Sony AI(索尼AI); University of Florence, Italy(意大利佛罗伦萨大学); University of Konstanz, Germany(德国康斯坦茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image Quality Assessment (IQA) measures and predicts perceived image quality by human observers. Although recent studies have highlighted the critical influence that variations in the scale of an image have on its perceived quality, this relationship has not been systematically quantified. To bridge this gap, we introduce the Image Intrinsic Scale (IIS), defined as the largest scale where an image exhibits its highest perceived quality. We also present the Image Intrinsic Scale Assessment (IISA) task, which involves subjectively measuring and predicting the IIS based on human judgments. We develop a subjective annotation methodology and create the IISA-DB dataset, comprising 785 image-IIS pairs annotated by experts in a rigorously controlled crowdsourcing study. Furthermore, we propose WIISA (Weak-labeling for Image Intrinsic Scale Assessment), a strategy that leverages how the IIS of an image varies with downscaling to generate weak labels. Experiments show that applying WIISA during the training of several IQA methods adapted for IISA consistently improves the performance compared to using only ground-truth labels. We will release the code, dataset, and pre-trained models upon acceptance.
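论文摘要并未给出弱标签的具体公式;下面按"下采样会按比例改变图像固有尺度"这一直觉给出一个假设性示意,`weak_iis_label` 及其中的截断规则均为本文推测,并非 WIISA 的原始策略:

```python
def weak_iis_label(iis, downscale):
    """为下采样 downscale 倍后的图像生成 IIS 弱标签(假设性规则)。
    iis 取值 (0, 1],1 表示图像在原始分辨率下感知质量最佳;
    假设:下采样会按比例放大固有尺度,并截断到 1。"""
    assert 0 < iis <= 1 and downscale >= 1
    return min(1.0, iis * downscale)

# 例:IIS=0.5 的图像缩小 2 倍后,其弱标签约为 1.0
print(weak_iis_label(0.5, 2))
```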
zh
[CV-38] UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
【速读】:该论文旨在解决统一多模态变换器在训练过程中由于冗余标记和大量注意力计算导致的高成本问题。解决方案的关键在于提出了一种任务感知型标记剪枝方法——UniMoD,通过为每个任务配备独立的路由选择器,以更有效地确定应剪枝哪些标记。这种方法显著减少了Show-o和Emu3模型的训练浮点运算次数(FLOPs),分别降低了约15%和40%,同时保持或提升了多个基准测试中的性能。
链接: https://arxiv.org/abs/2502.06474
作者: Weijia Mao,Zhenheng Yang,Mike Zheng Shou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Unified multimodal transformers, which handle both generation and understanding tasks within a shared parameter space, have received increasing attention in recent research. Although various unified transformers have been proposed, training these models is costly due to redundant tokens and heavy attention computation. In the past, studies on large language models have demonstrated that token pruning methods, such as Mixture of Depths (MoD), can significantly improve computational efficiency. MoD employs a router to select the most important ones for processing within a transformer layer. However, directly applying MoD-based token pruning to unified transformers will result in suboptimal performance because different tasks exhibit varying levels of token redundancy. In our work, we analyze the unified transformers by (1) examining attention weight patterns, (2) evaluating the layer importance and token redundancy, and (3) analyzing task interactions. Our findings reveal that token redundancy is primarily influenced by different tasks and layers. Building on these findings, we introduce UniMoD, a task-aware token pruning method that employs a separate router for each task to determine which tokens should be pruned. We apply our method to Show-o and Emu3, reducing training FLOPs by approximately 15% in Show-o and 40% in Emu3, while maintaining or improving performance on several benchmarks. Code will be released at this https URL.
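UniMoD 的"每任务独立路由器"思想可用如下草图说明(非官方实现):每层为每个任务维护一个线性打分器,仅保留得分最高的一部分 token,其中 `keep_ratio` 为假设的保留比例:

```python
import torch
import torch.nn as nn

class TaskAwareRouter(nn.Module):
    """任务感知 token 剪枝的最小示意:每个任务有独立的路由打分器。"""
    def __init__(self, dim, num_tasks, keep_ratio=0.7):
        super().__init__()
        self.routers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_tasks)])
        self.keep_ratio = keep_ratio

    def forward(self, tokens, task_id):
        # tokens: (B, N, D)
        scores = self.routers[task_id](tokens).squeeze(-1)   # (B, N)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                  # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return torch.gather(tokens, 1, idx)                  # (B, k, D)

# 用法:理解任务(task_id=0)与生成任务(task_id=1)各走各的路由器
router = TaskAwareRouter(dim=256, num_tasks=2, keep_ratio=0.6)
x = torch.randn(4, 196, 256)
print(router(x, task_id=0).shape)   # torch.Size([4, 117, 256])
```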
zh
[CV-39] Group-CLIP Uncertainty Modeling for Group Re-Identification
【速读】:该论文旨在解决行人组跨非重叠摄像头匹配(Group Re-Identification, Group ReID)的问题,特别关注组结构的变化,包括成员数量及其空间排列。现有方法大多依赖于基于确定性的模型,仅考虑特定的组结构,难以匹配未见过的组配置。论文的关键解决方案是提出了Group-CLIP Uncertainty Modeling (GCUM) 方法,通过适应不确定的组文本描述来应对成员和布局变化。具体而言,设计了Member Variant Simulation (MVS) 模块模拟成员排除,并使用伯努利分布;设计了Group Layout Adaptation (GLA) 模块生成带有身份特定标记的不确定组文本描述;同时,引入了Group Relationship Construction Encoder (GRCE),利用组特征精炼个体特征,并采用跨模态对比损失从组文本描述中获取泛化知识。GCUM 方法首次将CLIP应用于Group ReID,并在实验中显著超越了现有的最先进方法。
链接: https://arxiv.org/abs/2502.06460
作者: Qingxin Zhang,Haoyan Wei,Yang Qian
机构: Glasgow College Hainan, UESTC (格拉斯哥学院,电子科技大学); University of Electronic Science and Technology of China (电子科技大学); Sichuan University (四川大学); USC Viterbi School of Engineering (南加州大学维特比工程学院); University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Group Re-Identification (Group ReID) aims at matching groups of pedestrians across non-overlapping cameras. Unlike single-person ReID, Group ReID focuses more on the changes in group structure, emphasizing the number of members and their spatial arrangement. However, most methods rely on certainty-based models, which consider only the specific group structures in the group images, often failing to match unseen group configurations. To this end, we propose a novel Group-CLIP Uncertainty Modeling (GCUM) approach that adapts group text descriptions to accommodate undetermined member and layout variations. Specifically, we design a Member Variant Simulation (MVS) module that simulates member exclusions using a Bernoulli distribution and a Group Layout Adaptation (GLA) module that generates uncertain group text descriptions with identity-specific tokens. In addition, we design a Group Relationship Construction Encoder (GRCE) that uses group features to refine individual features, and employ cross-modal contrastive loss to obtain generalizable knowledge from group text descriptions. It is worth noting that we are the first to employ CLIP to Group ReID, and extensive experiments show that GCUM significantly outperforms state-of-the-art Group ReID methods.
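MVS 模块"用伯努利分布模拟成员排除"的思路可用下面的最小示意说明(非官方实现,`keep_prob` 为假设的保留概率):

```python
import torch

def simulate_member_exclusion(member_feats, keep_prob=0.8):
    """用伯努利采样随机排除组内成员,模拟成员数量变化。
    member_feats: (M, D),M 为组内成员数。"""
    keep = torch.bernoulli(torch.full((member_feats.size(0),), keep_prob)).bool()
    if not keep.any():               # 至少保留一名成员
        keep[torch.randint(len(keep), (1,))] = True
    return member_feats[keep]

group = torch.randn(5, 512)          # 5 名成员的特征
print(simulate_member_exclusion(group).shape)
```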
zh
[CV-40] SparseFocus: Learning-based One-shot Autofocus for Microscopy with Sparse Content
【速读】:该论文旨在解决在显微成像高通量实时扫描中的自动对焦问题,特别是在图像内容稀疏时传统方法及现有学习方法失效的问题。论文的关键解决方案是提出了一种基于内容重要性的方法——SparseFocus,它包含一个新颖的两阶段流程:第一阶段测量图像内部各区域的重要性;第二阶段从选定的重要区域计算离焦距离。为了验证这一方法的有效性,研究团队收集了一个大规模标注数据集,涵盖了密集、稀疏以及极稀疏场景,并将其整合到全视野成像系统中,实现实时应用。实验结果表明,SparseFocus在处理各种内容稀疏程度的图像时均优于现有方法。
链接: https://arxiv.org/abs/2502.06452
作者: Yongping Zhai,Xiaoxi Fu,Qiang Su,Jia Hu,Yake Zhang,Yunfeng Zhou,Chaofan Zhang,Xiao Li,Wenxin Wang,Dongdong Wu,Shen Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:
点击查看摘要
Abstract:Autofocus is necessary for high-throughput and real-time scanning in microscopic imaging. Traditional methods rely on complex hardware or iterative hill-climbing algorithms. Recent learning-based approaches have demonstrated remarkable efficacy in a one-shot setting, avoiding hardware modifications or iterative mechanical lens adjustments. However, in this paper, we highlight a significant challenge that the richness of image content can significantly affect autofocus performance. When the image content is sparse, previous autofocus methods, whether traditional hill-climbing or learning-based, tend to fail. To tackle this, we propose a content-importance-based solution, named SparseFocus, featuring a novel two-stage pipeline. The first stage measures the importance of regions within the image, while the second stage calculates the defocus distance from selected important regions. To validate our approach and benefit the research community, we collect a large-scale dataset comprising millions of labelled defocused images, encompassing dense, sparse, and extremely sparse scenarios. Experimental results show that SparseFocus surpasses existing methods, effectively handling all levels of content sparsity. Moreover, we integrate SparseFocus into our Whole Slide Imaging (WSI) system that performs well in real-world applications. The code and dataset will be made available upon the publication of this paper.
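两阶段流程可用如下草图示意(非官方实现):第一阶段用灰度方差近似"区域重要性",第二阶段在最重要的若干区域上以 Laplacian 方差作为清晰度/离焦的代理量;论文中第二阶段实际由学习式模型回归离焦距离,此处的替代指标仅为假设:

```python
import numpy as np
import cv2

def sparsefocus_like(img_gray, patch=64, top_k=4):
    """两阶段示意:先按块打分(重要性=灰度方差),
    再仅在 top_k 个重要块上估计清晰度。"""
    H, W = img_gray.shape
    regions = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            p = img_gray[y:y + patch, x:x + patch]
            regions.append((float(p.var()), p))      # 阶段一:内容重要性
    regions.sort(key=lambda r: r[0], reverse=True)
    # 阶段二:在重要区域上计算 Laplacian 方差,取平均作为离焦代理量
    sharpness = [cv2.Laplacian(p, cv2.CV_64F).var() for _, p in regions[:top_k]]
    return float(np.mean(sharpness))

img = (np.random.rand(256, 256) * 255).astype(np.uint8)
print(sparsefocus_like(img))
```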
zh
[CV-41] Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments
【速读】:该论文旨在解决视频环境中光学字符识别(Optical Character Recognition, OCR)任务中视觉-语言模型(Vision-Language Models, VLMs)的评估问题。关键在于提出了一套公开可用的数据集和基准测试框架,该数据集包含1,477个手动标注的视频帧,涵盖了代码编辑器、新闻广播、YouTube视频和广告等多个领域。通过将三个先进的VLMs与传统的OCR系统进行对比,使用字错误率(Word Error Rate, WER)、字符错误率(Character Error Rate, CER)和准确率(Accuracy)作为评价指标,论文展示了VLMs在视频OCR任务中的优势与局限性,并强调了诸如幻觉现象、内容安全策略以及对遮挡或风格化文本的敏感性等挑战。
链接: https://arxiv.org/abs/2502.06445
作者: Sankalp Nagaonkar,Augustya Sharma,Ashish Choithani,Ashutosh Trivedi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and dataset: this https URL
点击查看摘要
Abstract:This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state of the art VLMs - Claude-3, Gemini-1.5, and GPT-4o are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
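文中使用的 WER 与 CER 均基于编辑距离,可按如下标准定义计算(这是通用实现,并非该基准的官方评测脚本):

```python
def edit_distance(ref, hyp):
    # 经典 Levenshtein 动态规划(滚动一维数组)
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (r != h))
    return dp[-1]

def wer(ref, hyp):
    """词错误率:按词序列计算编辑距离,再除以参考词数"""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(1, len(r))

def cer(ref, hyp):
    """字符错误率:按字符序列计算编辑距离,再除以参考字符数"""
    return edit_distance(ref, hyp) / max(1, len(ref))

print(wer("hello world example", "hello word example"))  # 1 处替换 / 3 词
print(cer("hello", "helo"))                              # 1 处删除 / 5 字符
```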
zh
[CV-42] Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images
【速读】:该论文旨在解决现有数据集压缩方法(包括数据蒸馏和数据剪枝)在评估协议上的不一致性,以及过度依赖软标签导致的额外负担。论文的关键解决方案是提出了一种新的数据集压缩框架PCA(Prune, Combine, and Augment),该框架专注于利用图像数据本身,仅依赖硬标签进行评估,并在主流大规模数据集的蒸馏设置中实现了最先进的性能。这有助于重新关注图像数据的内在价值,并提供更平衡和易于访问的数据集压缩技术。
链接: https://arxiv.org/abs/2502.06434
作者: Lingao Xiao,Songhua Liu,Yang He,Xinchao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Work In Progress
点击查看摘要
Abstract:Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily rely on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: this https URL
zh
[CV-43] Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single-Image Denoising
【速读】:该论文旨在解决单图像去噪过程中由于盲点网络或子图像对采样导致的像素信息丢失及细节结构破坏的问题。论文的关键解决方案在于提出了一种基于提示学习的单图像去噪框架Prompt-SID。通过自监督方式利用降采样的图像对进行训练,并采用基于潜在扩散过程的结构表示生成模型以及在基于变换器的去噪器架构内设计的结构注意力模块来解码提示信息,从而有效保留结构细节并缓解不同分辨率图像之间的尺度差距。
链接: https://arxiv.org/abs/2502.06432
作者: Huaqiu Li,Wang Zhang,Xiaowan Hu,Tao Jiang,Zikang Chen,Haoqian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and time-consuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pairs sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single image denoising framework that emphasizes preserving of structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap from images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID.
zh
[CV-44] FCVSR: A Frequency-aware Method for Compressed Video Super-Resolution
【速读】:该论文旨在解决压缩视频超分辨率(Super-Resolution, SR)的问题,特别是如何从低分辨率(Low-Resolution, LR)压缩视频生成高分辨率(High-Resolution, HR)视频。现有方法在频域中利用时空信息,但未能区分不同的空间频率子带或捕捉时间频率动态,可能导致次优结果。为了解决这些问题,论文提出了一种基于深度频率的压缩视频超分辨率模型(Frequency-based Compressed Video Super-Resolution, FCVSR),其关键是引入了一个运动引导自适应对齐网络(Motion-Guided Adaptive Alignment, MGAA)和一个多频率特征精炼模块(Multi-Frequency Feature Refinement, MFFR)。此外,还提出了一个频率感知对比损失函数以优化模型训练,从而重建更精细的空间细节。
链接: https://arxiv.org/abs/2502.06431
作者: Qiang Zhu,Fan Zhang,Feiyu Chen,Shuyuan Zhu,David Bull,Bing Zeng
机构: School of Information and Communication Engineering, University of Electronic Science and Technology of China (电子科技大学信息与通信工程学院); University of Electronic Science and Technology of China (电子科技大学); School of Computer Science, University of Bristol (布里斯托尔大学计算机科学学院); University of Bristol (布里斯托尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Compressed video super-resolution (SR) aims to generate high-resolution (HR) videos from the corresponding low-resolution (LR) compressed videos. Recently, some compressed video SR methods attempt to exploit the spatio-temporal information in the frequency domain, showing great promise in super-resolution performance. However, these methods do not differentiate various frequency subbands spatially or capture the temporal frequency dynamics, potentially leading to suboptimal results. In this paper, we propose a deep frequency-based compressed video SR model (FCVSR) consisting of a motion-guided adaptive alignment (MGAA) network and a multi-frequency feature refinement (MFFR) module. Additionally, a frequency-aware contrastive loss is proposed for training FCVSR, in order to reconstruct finer spatial details. The proposed model has been evaluated on three public compressed video super-resolution datasets, with results demonstrating its effectiveness when compared to existing works in terms of super-resolution performance (up to a 0.14dB gain in PSNR over the second-best model) and complexity.
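MGAA/MFFR 的具体结构以原文为准,但"区分不同空间频率子带"这一点可以用 FFT 掩码做一个最小示意(`cutoffs` 为假设的归一化频率阈值,并非论文的子带划分):

```python
import numpy as np

def split_frequency_subbands(img, cutoffs=(0.1, 0.3)):
    """用 FFT 将图像分解为低/中/高三个频率子带。"""
    F = np.fft.fftshift(np.fft.fft2(img))
    H, W = img.shape
    yy, xx = np.mgrid[:H, :W]
    r = np.hypot(yy - H / 2, xx - W / 2) / (min(H, W) / 2)   # 归一化频率半径
    bands, lo = [], 0.0
    for hi in (*cutoffs, np.inf):
        mask = (r >= lo) & (r < hi)
        bands.append(np.fft.ifft2(np.fft.ifftshift(F * mask)).real)
        lo = hi
    return bands   # [低频, 中频, 高频]

img = np.random.rand(64, 64)
low, mid, high = split_frequency_subbands(img)
print(np.allclose(low + mid + high, img))   # True:三个子带之和重构原图
```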
zh
[CV-45] CoS: Chain-of-Shot Prompting for Long Video Understanding
【速读】:该论文旨在解决多模态大语言模型(MLLMs)处理长视频时因所需视觉标记过多而导致上下文长度超限的问题。这些多余的视觉标记包含大量与任务无关的镜头片段,导致模型理解视频内容时产生偏差。为了解决这一问题,论文提出了一种名为Chain-of-Shot提示(CoS)的方法。其关键是将镜头选择视为测试时视觉提示优化过程,通过优化镜头与任务的对齐来适应视频理解的任务需求。CoS的关键组成部分包括一个二值化视频摘要机制,用于执行伪时间定位,发现二值编码以识别与任务相关的镜头,以及一个视频共推理模块,利用该二值编码将相关正向镜头与不相关的负向镜头配对,从而优化长视频的理解。
链接: https://arxiv.org/abs/2502.06428
作者: Jian Hu,Zixu Cheng,Chenyang Si,Wei Li,Shaogang Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: A training-free test-time optimisation approach for long video understanding
点击查看摘要
Abstract:Multi-modal Large Language Models (MLLMs) struggle with long videos due to the need for excessive visual tokens. These tokens massively exceed the context length of MLLMs, and the resulting input is often filled with redundant task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adaptive to video understanding semantic task by optimising shots-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimize long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code given in this https URL.
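"二值化视频摘要"的核心是为每个镜头生成 0/1 编码以标记任务相关性。下面给出一个假设性草图:用镜头特征与任务特征的余弦相似度打分再取 top-k;特征来源(例如用 CLIP 编码镜头关键帧与任务文本)与 `keep_ratio` 均为本文假设:

```python
import torch
import torch.nn.functional as F

def binary_shot_coding(shot_feats, task_feat, keep_ratio=0.25):
    """按镜头-任务相似度生成二值编码:True=任务相关镜头。"""
    sim = F.cosine_similarity(shot_feats, task_feat.unsqueeze(0), dim=1)  # (S,)
    k = max(1, int(len(sim) * keep_ratio))
    code = torch.zeros_like(sim, dtype=torch.bool)
    code[sim.topk(k).indices] = True
    return code          # True=正向镜头,False=负向镜头

shots = torch.randn(40, 512)        # 40 个镜头的特征(示意)
task = torch.randn(512)             # 任务(问题)的特征(示意)
code = binary_shot_coding(shots, task)
print(code.sum().item(), "个镜头被标记为任务相关")
```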
zh
[CV-46] Hybrid State-Space and GRU-based Graph Tokenization Mamba for Hyperspectral Image Classification
【速读】:该论文旨在解决高光谱图像(HSI)分类中的挑战,特别是由于数据的高维特性和复杂的光谱-空间关系所导致的问题。传统方法如传统的机器学习和卷积神经网络(CNNs)难以有效捕捉这些复杂的特征和全局上下文信息。此外,尽管基于Transformer的模型在捕捉长距离依赖方面表现出色,但在标记数据集有限的情况下需要大量的计算资源,这在高光谱图像应用中较为常见。
为了解决这些问题,论文提出了一种名为GraphMamba的混合模型。该模型的关键在于结合了光谱-空间令牌生成、基于图的令牌优先级排序以及交叉注意力机制。GraphMamba通过引入状态空间建模与门控循环单元(GRU)的创新混合方式,能够捕捉线性和非线性空间-光谱动态。这种方法不仅增强了对复杂空间-光谱关系的建模能力,还保持了在不同高光谱图像数据集上的可扩展性和计算效率。
链接: https://arxiv.org/abs/2502.06427
作者: Muhammad Ahmad,Muhammad Hassaan Farooq Butt,Muhammad Usama,Manuel Mazzara,Salvatore Distefano,Adil Mehmood Khan,Danfeng Hong
机构: Dipartimento di Matematica e Informatica—MIFT, University of Messina (梅西纳大学); Institute of Artificial Intelligence, School of Mechanical and Electrical Engineering, Shaoxing University (绍兴大学); National University of Computer and Emerging Sciences (NUCES), CFD, Pakistan (巴基斯坦国立科技大学); Institute of Software Development and Engineering, Innopolis University (因诺波利斯大学); School of Computer Science, University of Hull (赫尔大学); Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子电气与通信工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Hyperspectral image (HSI) classification plays a pivotal role in domains such as environmental monitoring, agriculture, and urban planning. However, it faces significant challenges due to the high-dimensional nature of the data and the complex spectral-spatial relationships inherent in HSI. Traditional methods, including conventional machine learning and convolutional neural networks (CNNs), often struggle to effectively capture these intricate spectral-spatial features and global contextual information. Transformer-based models, while powerful in capturing long-range dependencies, often demand substantial computational resources, posing challenges in scenarios where labeled datasets are limited, as is commonly seen in HSI applications. To overcome these challenges, this work proposes GraphMamba, a hybrid model that combines spectral-spatial token generation, graph-based token prioritization, and cross-attention mechanisms. The model introduces a novel hybridization of state-space modeling and Gated Recurrent Units (GRU), capturing both linear and nonlinear spatial-spectral dynamics. GraphMamba enhances the ability to model complex spatial-spectral relationships while maintaining scalability and computational efficiency across diverse HSI datasets. Through comprehensive experiments, we demonstrate that GraphMamba outperforms existing state-of-the-art models, offering a scalable and robust solution for complex HSI classification tasks.
zh
[CV-47] Robust Watermarks Leak: Channel-Aware Feature Extraction Enables Adversarial Watermark Manipulation
【速读】:该论文旨在解决水印在AI生成内容的溯源和检测过程中存在的稳健性与隐蔽性之间的权衡问题。论文的关键在于揭示:现有鲁棒水印方法会增加编码进图像的可检测模式的冗余度,从而造成可被利用的信息泄露;在此基础上,论文提出了一种攻击框架,借助预训练视觉模型进行多通道特征学习,以提取泄漏的水印图案。该方法仅需单张水印图像即可同时实现伪造与规避检测,相比现有最先进方法,检测规避成功率提升60%、伪造准确率提升51%,同时保持视觉保真度。
链接: https://arxiv.org/abs/2502.06418
作者: Zhongjie Ba,Yitao Zhang,Peng Cheng,Bin Gong,Xinyu Zhang,Qinglong Wang,Kui Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Watermarking plays a key role in the provenance and detection of AI-generated content. While existing methods prioritize robustness against real-world distortions (e.g., JPEG compression and noise addition), we reveal a fundamental tradeoff: such robust watermarks inherently improve the redundancy of detectable patterns encoded into images, creating exploitable information leakage. To leverage this, we propose an attack framework that extracts leakage of watermark patterns through multi-channel feature learning using a pre-trained vision model. Unlike prior works requiring massive data or detector access, our method achieves both forgery and detection evasion with a single watermarked image. Extensive experiments demonstrate that our method achieves a 60% success rate gain in detection evasion and 51% improvement in forgery accuracy compared to state-of-the-art methods while maintaining visual fidelity. Our work exposes the robustness-stealthiness paradox: current “robust” watermarks sacrifice security for distortion resistance, providing insights for future watermark design.
zh
[CV-48] TANGLED: Generating 3D Hair Strands from Images with Arbitrary Styles and Viewpoints
【速读】:该论文旨在解决现有文本或图像引导的发型生成方法无法处理多样化和复杂发型的问题。关键解决方案在于提出了一种名为TANGLED的新方法,该方法通过一个三步流水线实现3D发丝生成,能够适应不同风格、视角和输入视图数量的图像输入。具体而言,首先构建了包含457个多样化发型的MultiHair数据集,并标注了74个属性以增强模型的泛化能力;其次,引入了一种基于多视角线稿图(lineart)的扩散框架,能够捕捉拓扑线索(如发丝密度与分缝线)并过滤噪声;最后,采用参数化后处理模块来保证辫子等复杂结构的一致性。这种方法不仅提升了发型的真实感和多样性,还促进了文化包容性的数字虚拟人物的创建,并推动了如基于草图的3D发丝编辑等新应用的发展。
链接: https://arxiv.org/abs/2502.06392
作者: Pengyu Long,Zijun Zhao,Min Ouyang,Qingcheng Zhao,Qixuan Zhang,Wei Yang,Lan Xu,Jingyi Yu
机构: ShanghaiTech University(上海科技大学); Deemos Technology; Huazhong University of Science and Technology(华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL
点击查看摘要
Abstract:Hairstyles are intricate and culturally significant with various geometries, textures, and structures. Existing text or image-guided generation methods fail to handle the richness and complexity of diverse styles. We present TANGLED, a novel approach for 3D hair strand generation that accommodates diverse image inputs across styles, viewpoints, and quantities of input views. TANGLED employs a three-step pipeline. First, our MultiHair Dataset provides 457 diverse hairstyles annotated with 74 attributes, emphasizing complex and culturally significant styles to improve model generalization. Second, we propose a diffusion framework conditioned on multi-view linearts that can capture topological cues (e.g., strand density and parting lines) while filtering out noise. By leveraging a latent diffusion model with cross-attention on lineart features, our method achieves flexible and robust 3D hair generation across diverse input conditions. Third, a parametric post-processing module enforces braid-specific constraints to maintain coherence in complex structures. This framework not only advances hairstyle realism and diversity but also enables culturally inclusive digital avatars and novel applications like sketch-based 3D strand editing for animation and augmented reality.
zh
[CV-49] When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs
【速读】:该论文旨在调查针对视觉-语言模型(Vision-Language Models, VLMs)的攻击策略,并提出相应的防御机制。论文通过分类攻击目标(如越狱、伪装和利用)以及详细描述数据操纵方法,构建了一个全面的攻击分类体系。关键解决方案在于总结评估指标以描述不同攻击对VLMs的影响,并提出了缓解这些漏洞的防御措施。最后,论文强调了未来研究方向的重要性,以进一步增强VLMs的鲁棒性和安全性。
链接: https://arxiv.org/abs/2502.06390
作者: Aobotao Dai,Xinyu Ma,Lei Chen,Songze Li,Lin Wang
机构: Artificial Intelligence Thrust, Hong Kong University of Science and Technology (Guangzhou), China(香港科技大学(广州),中国); School of Cyber Science and Engineering, Southeast University, China(东南大学网络空间安全学院,中国); School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore(南洋理工大学电气与电子工程学院,新加坡)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have gained considerable prominence in recent years due to their remarkable capability to effectively integrate and process both textual and visual information. This integration has significantly enhanced performance across a diverse spectrum of applications, such as scene perception and robotics. However, the deployment of VLMs has also given rise to critical safety and security concerns, necessitating extensive research to assess the potential vulnerabilities these VLM systems may harbor. In this work, we present an in-depth survey of the attack strategies tailored for VLMs. We categorize these attacks based on their underlying objectives - namely jailbreak, camouflage, and exploitation - while also detailing the various methodologies employed for data manipulation of VLMs. Meanwhile, we outline corresponding defense mechanisms that have been proposed to mitigate these vulnerabilities. By discerning key connections and distinctions among the diverse types of attacks, we propose a compelling taxonomy for VLM attacks. Moreover, we summarize the evaluation metrics that comprehensively describe the characteristics and impact of different attacks on VLMs. Finally, we conclude with a discussion of promising future research directions that could further enhance the robustness and safety of VLMs, emphasizing the importance of ongoing exploration in this critical area of study. To facilitate community engagement, we maintain an up-to-date project page, accessible at: this https URL.
zh
[CV-50] Structure-preserving contrastive learning for spatial time series
【速读】:该论文旨在解决自监督表征学习在空间特征化时间序列(如交通交互)中的挑战,特别是如何在潜在空间中保持细粒度的相似性关系。关键解决方案在于引入两种保持结构的正则化器以促进空间时间序列的对比学习:一种保持实例间相似性的拓扑结构,另一种保持跨时空维度的相似性图几何结构。此外,提出了一种动态机制来平衡对比学习与结构保持,并稳定训练过程。通过这些方法,论文展示了所提方法在多变量时间序列分类及宏观微观交通预测任务中的有效性,表明更高的相似性结构保持能够生成更具有信息量和实用性的表征。
链接: https://arxiv.org/abs/2502.06380
作者: Yiru Jiao,Sander van Cranenburgh,Simeon Calvert,Hans van Lint
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: TL;DR: Preserving certain structures of similarity relations in spatio-temporal data can improve downstream task performance via contrastive learning
点击查看摘要
Abstract:Informative representations enhance model performance and generalisability in downstream tasks. However, learning self-supervised representations for spatially characterised time series, like traffic interactions, poses challenges as it requires maintaining fine-grained similarity relations in the latent space. In this study, we incorporate two structure-preserving regularisers for the contrastive learning of spatial time series: one regulariser preserves the topology of similarities between instances, and the other preserves the graph geometry of similarities across spatial and temporal dimensions. To balance contrastive learning and structure preservation, we propose a dynamic mechanism that adaptively weighs the trade-off and stabilises training. We conduct experiments on multivariate time series classification, as well as macroscopic and microscopic traffic prediction. For all three tasks, our approach preserves the structures of similarity relations more effectively and improves state-of-the-art task performances. The proposed approach can be applied to an arbitrary encoder and is particularly beneficial for time series with spatial or geographical features. Furthermore, this study suggests that higher similarity structure preservation indicates more informative and useful representations. This may help to understand the contribution of representation learning in pattern recognition with neural networks. Our code is made openly accessible with all resulting data at this https URL.
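论文两个结构保持正则项的精确形式以原文为准;下面给出同类思路的一个最小草图:令潜空间的成对相似度矩阵逼近输入空间的成对相似度矩阵,并与对比损失加权组合。权重在论文中由动态机制自适应确定,这里用固定值 `alpha` 示意:

```python
import torch
import torch.nn.functional as F

def similarity_structure_loss(x_raw, z_latent):
    """结构保持正则项的示意(非论文原式):
    约束潜空间成对相似度逼近输入空间成对相似度,
    以保留实例间相似性关系的结构。"""
    def cos_sim_matrix(a):
        a = F.normalize(a.flatten(1), dim=1)
        return a @ a.t()
    return F.mse_loss(cos_sim_matrix(z_latent), cos_sim_matrix(x_raw))

x = torch.randn(32, 20, 6)      # 32 条空间时间序列(示意)
z = torch.randn(32, 128)        # 编码器输出(示意)
alpha = 0.5                      # 假设的固定权重,论文中为动态权衡
total = alpha * similarity_structure_loss(x, z)   # + (1 - alpha) * 对比损失
print(total.item())
```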
zh
[CV-51] Many-Task Federated Fine-Tuning via Unified Task Vectors IJCAI2025
【速读】:该论文旨在解决在实际应用中联邦学习(Federated Learning, FL)面临的任务异质性(Task Heterogeneity)问题,即不同客户端处理的任务各不相同。现有的多任务联邦学习(Many-Task Federated Learning, MaT-FL)方法依赖于客户端分组或个性化层,需要服务器管理多个模型,并且没有充分考虑客户端处理多个任务的情况。论文提出的关键解决方案是MaTU方法,它通过联合学习跨客户端的任务向量(task vectors),消除了聚类和服务器端存储客户端特定权重的需求。MaTU引入了一种新颖的聚合机制,基于客户端任务向量的方向来确定任务相似性,并构建一个统一的任务向量来封装所有任务。此外,为了满足特定任务需求,通过轻量级调节器(modulators)增强统一任务向量,促进相关任务之间的知识转移,同时分离不相似的任务。通过在30个数据集上的评估,MaTU展示了优于现有最先进的多任务联邦学习方法的性能,并且其结果与针对每个任务进行微调的结果相当,同时显著减少了通信开销。
链接: https://arxiv.org/abs/2502.06376
作者: Vasileios Tsouvalas,Tanir Ozcelebi,Nirvana Meratnia
机构: Eindhoven University of Technology (埃因霍芬理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, submitted in IJCAI 2025
点击查看摘要
Abstract:Federated Learning (FL) traditionally assumes homogeneous client tasks; however, in real-world scenarios, clients often specialize in diverse tasks, introducing task heterogeneity. To address this challenge, Many-Task FL (MaT-FL) has emerged, enabling clients to collaborate effectively despite task diversity. Existing MaT-FL approaches rely on client grouping or personalized layers, requiring the server to manage individual models and failing to account for clients handling multiple tasks. We propose MaTU, a MaT-FL approach that enables joint learning of task vectors across clients, eliminating the need for clustering or client-specific weight storage at the server. Our method introduces a novel aggregation mechanism that determines task similarity based on the direction of clients task vectors and constructs a unified task vector encapsulating all tasks. To address task-specific requirements, we augment the unified task vector with lightweight modulators that facilitate knowledge transfer among related tasks while disentangling dissimilar ones. Evaluated across 30 datasets, MaTU achieves superior performance over state-of-the-art MaT-FL approaches, with results comparable to per-task fine-tuning, while delivering significant communication savings.
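"基于任务向量方向确定任务相似性并聚合"的思路可用下面的草图说明(非官方实现,softmax 温度 `temp` 为假设参数;轻量级调节器部分从略):

```python
import torch

def aggregate_task_vectors(client_vecs, temp=5.0):
    """按方向(余弦)相似度加权聚合客户端任务向量:
    方向越一致的客户端之间相互贡献越大,
    最终对各客户端的加权结果取平均,得到统一任务向量。"""
    V = torch.stack(client_vecs)                       # (C, P)
    Vn = V / V.norm(dim=1, keepdim=True)
    sim = Vn @ Vn.t()                                  # 余弦相似度 (C, C)
    w = torch.softmax(temp * sim, dim=1)               # 行归一化权重
    return (w @ V).mean(dim=0)                         # 统一任务向量 (P,)

clients = [torch.randn(1000) for _ in range(8)]        # 8 个客户端的任务向量
print(aggregate_task_vectors(clients).shape)
```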
zh
[CV-52] FOCUS - Multi-View Foot Reconstruction From Synthetically Trained Dense Correspondences
【速读】:该论文旨在解决从少量多视角RGB图像中重建人类足部三维模型的问题。论文的关键解决方案是提出FOCUS方法,包含三个主要贡献:(i) SynFoot2,对现有合成足部数据集的扩展,新增了与参数化足部模型FIND的密集对应标注;(ii) 在该合成数据集上训练的不确定性感知密集对应预测器;(iii) 两种基于密集对应预测的三维表面重建方法:一种受运动恢复结构(Structure-from-Motion)启发,另一种基于FIND模型进行优化。这些贡献使该方法在少量视图下达到最先进的重建质量,在视图较多时与最先进方法相当,且运行速度显著更快。
链接: https://arxiv.org/abs/2502.06367
作者: Oliver Boyne,Roberto Cipolla
机构: Department of Engineering, University of Cambridge, U.K. (工程系,剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 11 figures
点击查看摘要
Abstract:Surface reconstruction from multiple, calibrated images is a challenging task - often requiring a large number of collected images with significant overlap. We look at the specific case of human foot reconstruction. As with previous successful foot reconstruction work, we seek to extract rich per-pixel geometry cues from multi-view RGB images, and fuse these into a final 3D object. Our method, FOCUS, tackles this problem with 3 main contributions: (i) SynFoot2, an extension of an existing synthetic foot dataset to include a new data type: dense correspondence with the parameterized foot model FIND; (ii) an uncertainty-aware dense correspondence predictor trained on our synthetic dataset; (iii) two methods for reconstructing a 3D surface from dense correspondence predictions: one inspired by Structure-from-Motion, and one optimization-based using the FIND model. We show that our reconstruction achieves state-of-the-art reconstruction quality in a few-view setting, performing comparably to state-of-the-art when many views are available, and runs substantially faster. We release our synthetic dataset to the research community. Code is available at: this https URL
zh
[CV-53] Guidance-based Diffusion Models for Improving Photoacoustic Image Quality
【速读】:该论文旨在解决光声成像(Photoacoustic, PA imaging)中单次拍摄图像质量较差的问题。传统做法是对多幅单次拍摄图像取平均以提升整体成像质量,但这会导致高昂的成像成本。论文的关键在于提出一种利用扩散模型改进光声成像质量的方法:通过引入传感器信息改进反向扩散过程,并采用基于成像条件信息的引导方式生成高质量图像。
链接: https://arxiv.org/abs/2502.06354
作者: Tatsuhiro Eguchi,Shumpei Takezaki,Mihoko Shimano,Takayuki Yagi,Ryoma Bise
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Photoacoustic(PA) imaging is a non-destructive and non-invasive technology for visualizing minute blood vessel structures in the body using ultrasonic sensors. In PA imaging, the image quality of a single-shot image is poor, and it is necessary to improve the image quality by averaging many single-shot images. Therefore, imaging the entire subject requires high imaging costs. In our study, we propose a method to improve the quality of PA images using diffusion models. In our method, we improve the reverse diffusion process using sensor information of PA imaging and introduce a guidance method using imaging condition information to generate high-quality images.
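"在反向扩散中引入条件引导"的一般形式可用下面的草图示意。论文使用的是 PA 成像的传感器信息与成像条件信息,具体引导方式以原文为准;这里用常见的 classifier-free guidance 形式代替,`w` 为假设的引导强度:

```python
import torch

def guided_reverse_step(x_t, t, eps_cond, eps_uncond, alphas_cumprod, w=2.0):
    """带条件引导的一步反向扩散(DDIM 风格,eta=0)。"""
    eps = eps_uncond + w * (eps_cond - eps_uncond)          # 引导后的噪声估计
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[max(t - 1, 0)]
    x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # 预测干净图像
    return a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # x_{t-1}

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
x = torch.randn(1, 1, 64, 64)
e_c, e_u = torch.randn_like(x), torch.randn_like(x)   # 代替条件/无条件网络输出
print(guided_reverse_step(x, 999, e_c, e_u, alphas_cumprod).shape)
```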
zh
[CV-54] LANTERN++: Enhanced Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models
【速读】:该论文旨在解决视觉自回归(Visual Autoregressive, AR)模型在预测过程中由于多个候选标记具有相似低概率而导致的令牌选择不确定性问题,这限制了推测性解码(Speculative Decoding)的有效性。为了解决这一问题,论文的关键在于引入LANTERN++框架,该框架结合静态树起草(static tree drafting)与宽松的接受条件(relaxed acceptance condition),使得草案可以独立于低置信度预测进行选择,从而实现更深的接受序列,提高解码效率同时保持图像质量。
链接: https://arxiv.org/abs/2502.06352
作者: Sihwan Park,Doohyuk Jang,Sungyub Kim,Souvik Kundu,Eunho Yang
机构: KAIST; Intel Labs; AITRICS
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures, short paper (5 pages exclude reference and appendix)
点击查看摘要
Abstract:Speculative decoding has been widely used to accelerate autoregressive (AR) text generation. However, its effectiveness in visual AR models remains limited due to token selection ambiguity, where multiple tokens receive similarly low probabilities, reducing acceptance rates. While dynamic tree drafting has been proposed to improve speculative decoding, we show that it fails to mitigate token selection ambiguity, resulting in shallow draft trees and suboptimal acceleration. To address this, we introduce LANTERN++, a novel framework that integrates static tree drafting with a relaxed acceptance condition, allowing drafts to be selected independently of low-confidence predictions. This enables deeper accepted sequences, improving decoding efficiency while preserving image quality. Extensive experiments on state-of-the-art visual AR models demonstrate that LANTERN++ significantly accelerates inference, achieving up to \mathbf\times 2.56 speedup over standard AR decoding while maintaining high image quality.
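作为对"宽松接受条件"的直觉化示意(并非 LANTERN++ 的原始判据),下面的草图在目标分布上放宽了标准推测性解码的接受规则:只要草稿 token 的目标概率不低于最大概率的 `delta` 倍即接受,以缓解多个候选 token 概率都很低带来的选择歧义。`delta` 为假设阈值:

```python
import torch

def relaxed_accept(p_target, draft_token, delta=0.3):
    """宽松接受条件的示意。标准推测性解码按 min(1, p_target/p_draft)
    的概率接受草稿 token;这里改为基于目标分布的相对阈值判断。"""
    p_tok = p_target[draft_token]
    return bool(p_tok >= delta * p_target.max())

p = torch.softmax(torch.randn(1024), dim=0)   # 目标模型在某位置的分布(示意)
tok = int(torch.multinomial(p, 1))            # 草稿模型提出的 token(示意)
print(relaxed_accept(p, tok))
```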
zh
[CV-55] Facial Analysis Systems and Down Syndrome
【速读】:该论文旨在探讨面部分析技术在应用于唐氏综合症人群面部图像时所表现出的偏见与局限性。研究通过创建包含唐氏综合症患者和非患者的面部图像数据集,并使用两个商业工具进行性别识别、年龄预测和面部标记三项任务的测试,揭示了实验组(唐氏综合症患者)整体预测准确性较低以及其他特定性能差异模式。关键解决方案在于通过实证研究展示当前面部分析系统在处理唐氏综合症患者面部图像时存在的偏差,从而强调了技术本身的结构性限制,这些限制源自用于训练模型的数据集。
链接: https://arxiv.org/abs/2502.06341
作者: Marco Rondina,Fabiana Vinci,Antonio Vetrò,Juan Carlos De Martin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The ethical, social and legal issues surrounding facial analysis technologies have been widely debated in recent years. Key critics have argued that these technologies can perpetuate bias and discrimination, particularly against marginalized groups. We contribute to this field of research by reporting on the limitations of facial analysis systems with the faces of people with Down syndrome: this particularly vulnerable group has received very little attention in the literature so far. This study involved the creation of a specific dataset of face images. An experimental group with faces of people with Down syndrome, and a control group with faces of people who are not affected by the syndrome. Two commercial tools were tested on the dataset, along three tasks: gender recognition, age prediction and face labelling. The results show an overall lower accuracy of prediction in the experimental group, and other specific patterns of performance differences: i) high error rates in gender recognition in the category of males with Down syndrome; ii) adults with Down syndrome were more often incorrectly labelled as children; iii) social stereotypes are propagated in both the control and experimental groups, with labels related to aesthetics more often associated with women, and labels related to education level and skills more often associated with men. These results, although limited in scope, shed new light on the biases that alter face classification when applied to faces of people with Down syndrome. They confirm the structural limitation of the technology, which is inherently dependent on the datasets used to train the models.
zh
[CV-56] Zero-shot Depth Completion via Test-time Alignment with Affine-invariant Depth Prior AAAI2025
【速读】:该论文旨在解决深度完成(Depth Completion)领域中从稀疏深度测量预测密集深度图的问题。这一问题本质上是不适定的,需要先验知识的支持。现有方法主要通过学习方式隐式捕捉先验知识,但这些先验知识大多适应域内数据,在跨域场景中的泛化能力较弱。论文提出了一种零样本深度完成方法,关键在于结合仿射不变深度扩散模型和测试时对齐策略。通过使用预训练的深度扩散模型作为深度先验知识,并在测试时通过优化循环将仿射不变深度先验与度量尺度的稀疏测量结果对齐,从而实现硬约束。这种方法展示了在多种数据集上的泛化能力,相比之前的方法平均性能提升高达21%,同时增强了场景细节的理解。
链接: https://arxiv.org/abs/2502.06338
作者: Lee Hyoseok,Kyeong Seon Kim,Kwon Byung-Ki,Tae-Hyun Oh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2025, Project page: this https URL
点击查看摘要
Abstract:Depth completion, predicting dense depth maps from sparse depth measurements, is an ill-posed problem requiring prior knowledge. Recent methods adopt learning-based approaches to implicitly capture priors, but the priors primarily fit in-domain data and do not generalize well to out-of-domain scenarios. To address this, we propose a zero-shot depth completion method composed of an affine-invariant depth diffusion model and test-time alignment. We use pre-trained depth diffusion models as depth prior knowledge, which implicitly understand how to fill in depth for scenes. Our approach aligns the affine-invariant depth prior with metric-scale sparse measurements, enforcing them as hard constraints via an optimization loop at test-time. Our zero-shot depth completion method demonstrates generalization across various domain datasets, achieving up to a 21% average performance improvement over the previous state-of-the-art methods while enhancing spatial understanding by sharpening scene details. We demonstrate that aligning a monocular affine-invariant depth prior with sparse metric measurements is a proven strategy to achieve domain-generalizable depth completion without relying on extensive training data. Project page: this https URL.
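"将仿射不变深度先验与度量尺度稀疏测量对齐"在最简情形下可归结为对尺度与偏移的最小二乘拟合;论文在测试时的优化循环中将其作为硬约束迭代执行,下面仅演示一步闭式对齐(数据为人造示意):

```python
import numpy as np

def align_affine_depth(d_rel, d_sparse, mask):
    """求尺度 s 与偏移 t,使 s * d_rel + t 在稀疏测量点上逼近度量深度:
    min_{s,t} || s * d_rel[mask] + t - d_sparse[mask] ||^2"""
    x = d_rel[mask].reshape(-1)
    y = d_sparse[mask].reshape(-1)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * d_rel + t

d_rel = np.random.rand(240, 320)                    # 相对深度预测(示意)
mask = np.random.rand(240, 320) < 0.01              # 约 1% 的稀疏测量点
d_sparse = np.zeros_like(d_rel)
d_sparse[mask] = 2.0 * d_rel[mask] + 0.5            # 构造真值:s=2, t=0.5
d_metric = align_affine_depth(d_rel, d_sparse, mask)
print(np.allclose(d_metric, 2.0 * d_rel + 0.5))     # True
```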
zh
[CV-57] Accelerating Outlier-robust Rotation Estimation by Stereographic Projection
【速读】:该论文旨在解决在大规模输入数据中高效且鲁棒地估计旋转的问题,特别是当这些数据包含大量异常值(即不匹配)和噪声时。现有方法通常因计算时间过长及局部最优的风险而难以应用。论文提出的关键解决方案是首先利用仅涉及旋转轴的几何约束,然后通过使用立体投影和空间投票技术来识别旋转轴和角度,从而高效获得最优旋转估计,并能够同时估计多个旋转。
链接: https://arxiv.org/abs/2502.06337
作者: Taosi Xu,Yinlong Liu,Xianbo Wang,Zhi-Xin Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Rotation estimation plays a fundamental role in many computer vision and robot tasks. However, efficiently estimating rotation in large inputs containing numerous outliers (i.e., mismatches) and noise is a recognized challenge. Many robust rotation estimation methods have been designed to address this challenge. Unfortunately, existing methods are often inapplicable due to their long computation time and the risk of local optima. In this paper, we propose an efficient and robust rotation estimation method. Specifically, our method first investigates geometric constraints involving only the rotation axis. Then, it uses stereographic projection and spatial voting techniques to identify the rotation axis and angle. Furthermore, our method efficiently obtains the optimal rotation estimation and can estimate multiple rotations simultaneously. To verify the feasibility of our method, we conduct comparative experiments using both synthetic and real-world data. The results show that, with GPU assistance, our method can solve large-scale ( 10^6 points) and severely corrupted (90% outlier rate) rotation estimation problems within 0.07 seconds, with an angular error of only 0.01 degrees, which is superior to existing methods in terms of accuracy and efficiency.
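球极投影(stereographic projection)将单位球面上的旋转轴映射到平面,便于在平面上做空间投票。下面的草图演示从南极投影的标准公式,投票与角度估计部分从略:

```python
import numpy as np

def stereographic_project(axes):
    """从南极 (0, 0, -1) 向 z=0 平面做球极投影:
    (x, y, z) -> (x / (1 + z), y / (1 + z))"""
    x, y, z = axes[:, 0], axes[:, 1], axes[:, 2]
    return np.stack([x / (1 + z), y / (1 + z)], axis=1)

# 构造围绕真实轴 (0, 0, 1) 的带噪旋转轴样本(示意数据)
axes = np.random.randn(1000, 3) * 0.05 + np.array([0, 0, 1.0])
axes /= np.linalg.norm(axes, axis=1, keepdims=True)
pts = stereographic_project(axes)
print(pts.mean(axis=0))   # 接近 (0, 0):投票峰值对应真实旋转轴
```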
zh
[CV-58] DefTransNet: A Transformer-based Method for Non-Rigid Point Cloud Registration in the Simulation of Soft Tissue Deformation
【速读】:该论文旨在解决软组织手术中由于组织变形导致的组织位置和形态难以准确定位的问题。通过将组织表面表示为点云,并应用非刚性点云配准(Non-Rigid Point Cloud Registration, PCR)方法,外科医生能够在术前、术中和术后更好地理解组织变形。论文提出的关键解决方案是DefTransNet,这是一种新型的端到端Transformer基架构,用于非刚性PCR。DefTransNet通过输入源点云和目标点云并输出位移向量场,解决了大变形、异常值、噪声和部分数据等挑战。其关键在于引入可学习的变换矩阵以增强仿射变换的鲁棒性,整合全局和局部几何信息,并利用Transformer捕捉点之间的长程依赖关系。
链接: https://arxiv.org/abs/2502.06336
作者: Sara Monji-Azad,Marvin Kinz,Siddharth Kothari,Robin Khanna,Amrei Carla Mihan,David Maennel,Claudia Scherl,Juergen Hesser
机构: Mannheim Institute for Intelligent Systems in Medicine (MIISM)(曼海姆智能系统医学研究所), Medical Faculty Mannheim, Heidelberg University(海德堡大学), Mannheim, Germany(德国); Department of Radiation Oncology(放射肿瘤科), Brigham and Women’s Hospital(布里格姆妇女医院), Dana-Farber Cancer Institute(达纳-法伯癌症研究所), Harvard Medical School(哈佛医学院), Boston, MA, USA(美国); International Institute of Information Technology(信息技术国际学院), Bangalore, India(印度); Department of Otorhinolaryngology, Head and Neck Surgery(耳鼻喉头颈外科), Medical Faculty Mannheim, Heidelberg University(海德堡大学), Mannheim, Germany(德国); AI Health Innovation Cluster(人工智能健康创新集群), Heidelberg-Mannheim Health and Life Science Alliance(海德堡-曼海姆健康与生命科学联盟), Heidelberg, Germany(德国); Interdisciplinary Center for Scientific Computing (IWR)(跨学科科学计算中心), Heidelberg University(海德堡大学), Heidelberg, Germany(德国); Central Institute for Computer Engineering (ZITI)(中央计算机工程研究所), Heidelberg University(海德堡大学), Heidelberg, Germany(德国); CZS Heidelberg Center for Model-Based AI(海德堡基于模型的人工智能中心), Heidelberg University(海德堡大学), Mannheim, Germany(德国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Soft-tissue surgeries, such as tumor resections, are complicated by tissue deformations that can obscure the accurate location and shape of tissues. By representing tissue surfaces as point clouds and applying non-rigid point cloud registration (PCR) methods, surgeons can better understand tissue deformations before, during, and after surgery. Existing non-rigid PCR methods, such as feature-based approaches, struggle with robustness against challenges like noise, outliers, partial data, and large deformations, making accurate point correspondence difficult. Although learning-based PCR methods, particularly Transformer-based approaches, have recently shown promise due to their attention mechanisms for capturing interactions, their robustness remains limited in challenging scenarios. In this paper, we present DefTransNet, a novel end-to-end Transformer-based architecture for non-rigid PCR. DefTransNet is designed to address the key challenges of deformable registration, including large deformations, outliers, noise, and partial data, by inputting source and target point clouds and outputting displacement vector fields. The proposed method incorporates a learnable transformation matrix to enhance robustness to affine transformations, integrates global and local geometric information, and captures long-range dependencies among points using Transformers. We validate our approach on four datasets: ModelNet, SynBench, 4DMatch, and DeformedTissue, using both synthetic and real-world data to demonstrate the generalization of our proposed method. Experimental results demonstrate that DefTransNet outperforms current state-of-the-art registration networks across various challenging conditions. Our code and data are publicly available.
zh
[CV-59] UniDemoire: Towards Universal Image Demoireing with Data Generation and Synthesis AAAI2025
【速读】:该论文旨在解决图像去moire(Image demoiréing)领域中因moire图案的不可预测性和各向异性所导致的重大挑战。当前方法受限于训练数据的数量和多样性,容易在单一moire域内过拟合,从而导致新域上的性能下降,并限制其在实际应用中的鲁棒性。论文提出的关键解决方案是UniDemoiré,这是一种具有卓越泛化能力的通用图像去moire方法。特别地,作者提出了创新且有效的数据生成与合成方法,能够自动生成大量高质量的moire图像,以训练一个通用的去moire模型。
链接: https://arxiv.org/abs/2502.06324
作者: Zemin Yang,Yujing Sun,Xidong Peng,Siu Ming Yiu,Yuexin Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Image demoiréing poses one of the most formidable challenges in image restoration, primarily due to the unpredictable and anisotropic nature of moiré patterns. Limited by the quantity and diversity of training data, current methods tend to overfit to a single moiré domain, resulting in performance degradation for new domains and restricting their robustness in real-world applications. In this paper, we propose a universal image demoiréing solution, UniDemoiré, which has superior generalization capability. Notably, we propose innovative and effective data generation and synthesis methods that can automatically provide vast high-quality moiré images to train a universal demoiréing model. Our extensive experiments demonstrate the cutting-edge performance and broad potential of our approach for generalized image demoiréing.
zh
[CV-60] From Pixels to Components: Eigenvector Masking for Visual Representation Learning
【速读】:该论文旨在解决使用随机像素块遮罩进行自监督视觉表征学习时存在的局限性,这些局限可能导致无法学得有意义的高层特征。论文的关键解决方案在于提出了一种新的遮罩策略,即通过对数据进行主成分分析(Principal Component Analysis, PCA),然后随机遮罩固定方差比例的主成分,而非直接在原始像素上操作。通过这种方法,学习任务转变为从可见的主成分重建被遮罩的主成分,从而更有效地提取有用的表征信息。
链接: https://arxiv.org/abs/2502.06314
作者: Alice Bizeul,Thomas Sutter,Alain Ryser,Bernhard Schölkopf,Julius von Kügelgen,Julia E. Vogt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Predicting masked from visible parts of an image is a powerful self-supervised approach for visual representation learning. However, the common practice of masking random patches of pixels exhibits certain failure modes, which can prevent learning meaningful high-level features, as required for downstream tasks. We propose an alternative masking strategy that operates on a suitable transformation of the data rather than on the raw pixels. Specifically, we perform principal component analysis and then randomly mask a subset of components, which accounts for a fixed ratio of the data variance. The learning task then amounts to reconstructing the masked components from the visible ones. Compared to local patches of pixels, the principal components of images carry more global information. We thus posit that predicting masked from visible components involves more high-level features, allowing our masking strategy to extract more useful representations. This is corroborated by our empirical findings which demonstrate improved image classification performance for component over pixel masking. Our method thus constitutes a simple and robust data-driven alternative to traditional masked image modeling approaches.
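论文的遮罩策略的数据处理部分可以用下面的最小草图复现:对展平图像做 PCA,随机遮住累计方差占比约为 `mask_var_ratio`(假设取 0.3)的主成分,学习目标即由可见成分重建被遮成分;网络与训练循环从略:

```python
import numpy as np
from sklearn.decomposition import PCA

def mask_principal_components(X, mask_var_ratio=0.3, seed=0):
    """随机遮住累计方差占比约 mask_var_ratio 的主成分。
    X: (N, D) 展平后的图像。返回仅含可见成分的图像、
    被遮成分的系数(重建目标)及被遮成分索引。"""
    pca = PCA().fit(X)
    codes = pca.transform(X)                          # (N, D) 主成分系数
    rng = np.random.default_rng(seed)
    order = rng.permutation(codes.shape[1])           # 随机排列待选成分
    cum = np.cumsum(pca.explained_variance_ratio_[order])
    masked_idx = order[cum <= mask_var_ratio]         # 累计方差达到阈值为止
    visible = codes.copy()
    visible[:, masked_idx] = 0.0                      # 遮住这些成分
    return pca.inverse_transform(visible), codes[:, masked_idx], masked_idx

X = np.random.rand(256, 64)      # 256 张 8x8 展平图像(示意数据)
Xv, tgt, idx = mask_principal_components(X)
print(len(idx), "个主成分被遮罩")
```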
zh
[CV-61] Cell Nuclei Detection and Classification in Whole Slide Images with Transformers
【速读】:该论文旨在解决在组织病理学全切片图像(Whole Slide Images, WSIs)中细胞核检测与分类的准确性和效率问题。传统方法依赖于细胞分割,但计算成本高昂且需要大量后处理,限制了其在高通量临床环境中的实用性。论文的关键解决方案是提出从分割范式转变为检测范式,并引入CellNuc-DETR模型,以更有效地提取细胞信息。实验结果显示,CellNuc-DETR在PanNuke数据集上的表现达到当前最先进水平,并在CoNSeP和MoNuSeg数据集上验证了其鲁棒性和泛化能力。此外,CellNuc-DETR在大型WSIs上的评估表明,它不仅在准确性上超越现有方法,而且显著减少了推断时间,从而在准确性和计算效率之间实现了更好的平衡。
链接: https://arxiv.org/abs/2502.06307
作者: Oscar Pina,Eduard Dorca,Verónica Vilaplana
机构: Universitat Politècnica de Catalunya - BarcelonaTech (UPC); Hospital Universitari de Bellvitge (HUB)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate and efficient cell nuclei detection and classification in histopathological Whole Slide Images (WSIs) are pivotal for digital pathology applications. Traditional cell segmentation approaches, while commonly used, are computationally expensive and require extensive post-processing, limiting their practicality for high-throughput clinical settings. In this paper, we propose a paradigm shift from segmentation to detection for extracting cell information from WSIs, introducing CellNuc-DETR as a more effective solution. We evaluate the accuracy performance of CellNuc-DETR on the PanNuke dataset and conduct cross-dataset evaluations on CoNSeP and MoNuSeg to assess robustness and generalization capabilities. Our results demonstrate state-of-the-art performance in both cell nuclei detection and classification tasks. Additionally, we assess the efficiency of CellNuc-DETR on large WSIs, showing that it not only outperforms current methods in accuracy but also significantly reduces inference times. Specifically, CellNuc-DETR is twice as fast as the fastest segmentation-based method, HoVer-NeXt, while achieving substantially higher accuracy. Moreover, it surpasses CellViT in accuracy and is approximately ten times more efficient in inference speed on WSIs. These results establish CellNuc-DETR as a superior approach for cell analysis in digital pathology, combining high accuracy with computational efficiency.
zh
[CV-62] Enhancing Ground-to-Aerial Image Matching for Visual Misinformation Detection Using Semantic Segmentation
【速读】:该论文旨在解决在缺乏外部地理信息(如GPS坐标)的情况下,将非地理标记的地表视角图像与对应的卫星图像进行关联的问题。论文的关键解决方案在于提出了一种新颖的四流Siamese-like架构——四重语义对齐网络(Quadruple Semantic Align Net, SAN-QUAD),该网络通过在地表和卫星图像上应用语义分割技术,扩展了先前的最先进方法。
链接: https://arxiv.org/abs/2502.06288
作者: Matteo Mule,Matteo Pannacci,Ali Ghasemi Goudarzi,Francesco Pro,Lorenzo Papa,Luca Maiano,Irene Amerini
机构: Sapienza University of Rome (罗马大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures
点击查看摘要
Abstract:The recent advancements in generative AI techniques, which have significantly increased the online dissemination of altered images and videos, have raised serious concerns about the credibility of digital media available on the Internet and distributed through information channels and social networks. This issue particularly affects domains that rely heavily on trustworthy data, such as journalism, forensic analysis, and Earth observation. To address these concerns, the ability to geolocate a non-geo-tagged ground-view image without external information, such as GPS coordinates, has become increasingly critical. This study tackles the challenge of linking a ground-view image, potentially exhibiting varying fields of view (FoV), to its corresponding satellite image without the aid of GPS data. To achieve this, we propose a novel four-stream Siamese-like architecture, the Quadruple Semantic Align Net (SAN-QUAD), which extends previous state-of-the-art (SOTA) approaches by leveraging semantic segmentation applied to both ground and satellite imagery. Experimental results on a subset of the CVUSA dataset demonstrate significant improvements of up to 9.8% over prior methods across various FoV settings.
zh
[CV-63] Towards Efficient and Intelligent Laser Weeding: Method and Dataset for Weed Stem Detection AAAI
【速读】:该论文旨在解决激光除草系统中精准识别杂草茎部的问题。解决方案的关键在于将作物与杂草检测及杂草茎部定位整合到一个端到端的系统中,并通过构建包含人工标注的高质量杂草茎部检测数据集来训练和验证该系统。实验结果表明,所提出的系统相较于现有杂草识别系统提高了6.7%的除草精度,并减少了32.3%的能量消耗。
链接: https://arxiv.org/abs/2502.06255
作者: Dingning Liu,Jinzhe Li,Haoyang Su,Bei Cui,Zhihui Wang,Qingbo Yuan,Wanli Ouyang,Nanqing Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI-AISI 2025
点击查看摘要
Abstract:Weed control is a critical challenge in modern agriculture, as weeds compete with crops for essential nutrient resources, significantly reducing crop yield and quality. Traditional weed control methods, including chemical and mechanical approaches, have real-life limitations such as associated environmental impact and efficiency. An emerging yet effective approach is laser weeding, which uses a laser beam as the stem cutter. Although there have been studies that use deep learning in weed recognition, its application in intelligent laser weeding still requires a comprehensive understanding. Thus, this study represents the first empirical investigation of weed recognition for laser weeding. To increase the efficiency of laser beam cut and avoid damaging the crops of interest, the laser beam shall be directly aimed at the weed root. Yet, weed stem detection remains an under-explored problem. We integrate the detection of crop and weed with the localization of weed stem into one end-to-end system. To train and validate the proposed system in a real-life scenario, we curate and construct a high-quality weed stem detection dataset with human annotations. The dataset consists of 7,161 high-resolution pictures collected in the field with annotations of 11,151 instances of weed. Experimental results show that the proposed system improves weeding accuracy by 6.7% and reduces energy cost by 32.3% compared to existing weed recognition systems.
zh
[CV-64] Multi-Scale Transformer Architecture for Accurate Medical Image Classification
【速读】:该论文旨在解决医学图像分析中准确性与鲁棒性的问题,提出了一种基于增强型Transformer架构的AI驱动皮肤病变分类算法。解决方案的关键在于整合多尺度特征融合机制并优化自注意力过程,从而有效提取全局和局部特征,提升识别边界模糊和结构复杂的病变的能力。
链接: https://arxiv.org/abs/2502.06243
作者: Jiacheng Hu,Yanlin Xiang,Yang Lin,Junliang Du,Hanchao Zhang,Houze Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This study introduces an AI-driven skin lesion classification algorithm built on an enhanced Transformer architecture, addressing the challenges of accuracy and robustness in medical image analysis. By integrating a multi-scale feature fusion mechanism and refining the self-attention process, the model effectively extracts both global and local features, enhancing its ability to detect lesions with ambiguous boundaries and intricate structures. Performance evaluation on the ISIC 2017 dataset demonstrates that the improved Transformer surpasses established AI models, including ResNet50, VGG19, ResNext, and Vision Transformer, across key metrics such as accuracy, AUC, F1-Score, and Precision. Grad-CAM visualizations further highlight the interpretability of the model, showcasing strong alignment between the algorithm’s focus areas and actual lesion sites. This research underscores the transformative potential of advanced AI models in medical imaging, paving the way for more accurate and reliable diagnostic tools. Future work will explore the scalability of this approach to broader medical imaging tasks and investigate the integration of multimodal data to enhance AI-driven diagnostic frameworks for intelligent healthcare.
zh
[CV-65] Unsupervised deep learning for semantic segmentation of multispectral LiDAR forest point clouds
【速读】:该论文旨在解决高密度多光谱(Multispectral, MS)机载激光扫描(Airborne Laser Scanning, ALS)点云数据中的叶-木分离问题。解决方案的关键在于提出了一种完全无监督的深度学习方法——GrowSP-ForMS,该方法基于GrowSP架构,并特别设计用于处理高密度MS ALS点云数据。实验结果显示,GrowSP-ForMS在多光谱测试集上达到了84.3%的平均精度和69.6%的平均交并比(mean intersection over union, mIoU),显著优于传统无监督方法,并且与监督学习方法如PointNet相比表现相当。此外,研究还表明利用多光谱信息能够进一步提升叶-木分离的准确性。
链接: https://arxiv.org/abs/2502.06227
作者: Lassi Ruoppa,Oona Oinonen,Josef Taher,Matti Lehtomäki,Narges Takhtkeshha,Antero Kukko,Harri Kaartinen,Juha Hyyppä
机构: Department of Remote Sensing and Photogrammetry, Finnish Geospatial Research Institute FGI, The National Land Survey of Finland (芬兰国家土地调查局); 3D Optical Metrology (3DOM) unit, Bruno Kessler Foundation (FBK) (布鲁诺凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 10 figures
点击查看摘要
Abstract:Point clouds captured with laser scanning systems from forest environments can be utilized in a wide variety of applications within forestry and plant ecology, such as the estimation of tree stem attributes, leaf angle distribution, and above-ground biomass. However, effectively utilizing the data in such tasks requires the semantic segmentation of the data into wood and foliage points, also known as leaf-wood separation. The traditional approach to leaf-wood separation has been geometry- and radiometry-based unsupervised algorithms, which tend to perform poorly on data captured with airborne laser scanning (ALS) systems, even with a high point density. While recent machine and deep learning approaches achieve great results even on sparse point clouds, they require manually labeled training data, which is often extremely laborious to produce. Multispectral (MS) information has been demonstrated to have potential for improving the accuracy of leaf-wood separation, but quantitative assessment of its effects has been lacking. This study proposes a fully unsupervised deep learning method, GrowSP-ForMS, which is specifically designed for leaf-wood separation of high-density MS ALS point clouds and based on the GrowSP architecture. GrowSP-ForMS achieved a mean accuracy of 84.3% and a mean intersection over union (mIoU) of 69.6% on our MS test set, outperforming the unsupervised reference methods by a significant margin. When compared to supervised deep learning methods, our model performed similarly to the slightly older PointNet architecture but was outclassed by more recent approaches. Finally, two ablation studies were conducted, which demonstrated that our proposed changes increased the test set mIoU of GrowSP-ForMS by 29.4 percentage points (pp) in comparison to the original GrowSP model and that utilizing MS data improved the mIoU by 5.6 pp from the monospectral case.
zh
[CV-66] FunduSAM: A Specialized Deep Learning Model for Enhanced Optic Disc and Cup Segmentation in Fundus Images
【速读】:该论文旨在解决视盘(Optic Disc, OD)和视杯(Optic Cup, OC)在眼底图像分割中的挑战,包括复杂结构、低对比度和模糊边界,这些因素导致现有方法如Segment Anything Model (SAM)性能不佳。为克服这些挑战,论文提出的关键方案是FunduSAM模型:在SAM编码器之后的每个Transformer块中嵌入适配器(Adapter),进行参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)。此外,FunduSAM通过设计卷积块注意力模块(Convolutional Block Attention Module, CBAM)增强特征提取能力,并利用极坐标变换(polar transformation)将原始眼底OD图像转换为更适合训练与评估的格式。联合损失(joint loss)则用于在保证精确分割的同时保持OD与OC之间的结构一致性。
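上述极坐标变换可以把近似圆形的视盘/视杯结构“展开”为条带,便于网络学习环形边界。下面用OpenCV给出变换与逆变换的示意(图像与视盘中心均为虚构假设,并非论文官方预处理代码):

```python
import cv2
import numpy as np

# 构造一张玩具“眼底”图像（实际应用中应读入真实眼底图，并以视盘中心为极点）
img = np.zeros((256, 256, 3), dtype=np.uint8)
cv2.circle(img, (128, 128), 60, (0, 128, 255), -1)    # 模拟视盘
cv2.circle(img, (128, 128), 25, (255, 255, 255), -1)  # 模拟视杯

center, max_radius = (128.0, 128.0), 128.0  # 假定的视盘中心与最大半径

# 笛卡尔 -> 极坐标：环形边界被拉直成近似水平的条带
polar = cv2.warpPolar(img, (256, 256), center, max_radius,
                      cv2.INTER_LINEAR | cv2.WARP_POLAR_LINEAR)

# 逆变换可将极坐标系下的分割结果映射回原始坐标系
restored = cv2.warpPolar(polar, (256, 256), center, max_radius,
                         cv2.INTER_LINEAR | cv2.WARP_POLAR_LINEAR
                         | cv2.WARP_INVERSE_MAP)
```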
链接: https://arxiv.org/abs/2502.06220
作者: Jinchen Yu,Yongwei Nie,Fei Qi,Wenxiong Liao,Hongmin Cai
机构: University of Science and Technology of China(中国科学技术大学); South China University of Technology(华南理工大学); Guizhou Minzu University(贵州民族大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:The Segment Anything Model (SAM) has gained popularity as a versatile image segmentation method, thanks to its strong generalization capabilities across various domains. However, when applied to optic disc (OD) and optic cup (OC) segmentation tasks, SAM encounters challenges due to the complex structures, low contrast, and blurred boundaries typical of fundus images, leading to suboptimal performance. To overcome these challenges, we introduce a novel model, FunduSAM, which incorporates several Adapters into SAM to create a deep network specifically designed for OD and OC segmentation. FunduSAM inserts an Adapter into each transformer block after the encoder for parameter-efficient fine-tuning (PEFT). It enhances SAM’s feature extraction capabilities by designing a Convolutional Block Attention Module (CBAM), addressing issues related to blurred boundaries and low contrast. Given the unique requirements of OD and OC segmentation, polar transformation is used to convert the original fundus OD images into a format better suited for training and evaluating FunduSAM. A joint loss is used to achieve structure preservation between the OD and OC while maintaining accurate segmentation. Extensive experiments on the REFUGE dataset, comprising 1,200 fundus images, demonstrate the superior performance of FunduSAM compared to five mainstream approaches.
zh
[CV-67] Fully Exploiting Vision Foundation Models Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing
【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models, VFMs)在RGB-depth驾驶场景解析中应用不足的问题。关键方案是提出Heterogeneous Feature Integration Transformer (HFIT),该网络能够在无需重新训练ViT的情况下高效提取并整合异构特征。通过将VFM输出的相对深度预测作为HFIT侧适配器的输入,该方法摆脱了对真实深度图的依赖,并在性能上优于其他传统单模态和数据融合场景解析网络。
链接: https://arxiv.org/abs/2502.06219
作者: Sicen Guo,Tianyou Wen,Chuang-Wei Liu,Qijun Chen,Rui Fan
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures
点击查看摘要
Abstract:Recent vision foundation models (VFMs), typically based on Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth prediction results from VFMs, used as inputs to the HFIT side adapter, overcome the limitations of the dependence on depth maps. Our proposed HFIT demonstrates superior performance compared to all other traditional single-modal and data-fusion scene parsing networks, pre-trained VFMs, and ViT adapters on the Cityscapes and KITTI Semantics datasets. We believe this novel strategy paves the way for future innovations in VFM-based data-fusion techniques for driving scene parsing. Our source code is publicly available at this https URL.
zh
[CV-68] Enhancing Cost Efficiency in Active Learning with Candidate Set Query
【速读】:该论文旨在解决传统主动学习(Active Learning, AL)框架中查询设计成本高昂的问题。关键解决方案在于引入一种名为候选集查询(candidate set query)的新颖查询设计,通过缩小可能包含真实类别(ground-truth class)的候选类集合来显著减少搜索空间和标注成本。此外,论文利用共形预测(conformal prediction)动态生成小而可靠的候选集,并随着连续主动学习轮次中模型能力的增强而自适应调整。配合优先选择低成本、高信息增益数据点的获取函数,整体框架的有效性和可扩展性得到进一步提升。
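候选集查询的基础构件是共形预测:先在校准集上估计非一致性分数的分位数,再据此为测试样本构造小而可靠的候选类集合。下面是基于split conformal的极简示意(数据随机生成,阈值构造采用标准做法,并非论文完整的获取函数):

```python
import numpy as np

def conformal_candidate_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal：构造以约 1-alpha 概率覆盖真实类别的候选类集合。"""
    n = len(cal_labels)
    # 非一致性分数：1 - 真实类别的预测概率
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # 有限样本校正后的 (1-alpha) 分位数
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)
    # 候选集：所有“足够可信”的类别
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)  # 模拟 10 类 softmax 输出
cal_labels = cal_probs.argmax(1)                  # 玩具标签，仅作演示
test_probs = rng.dirichlet(np.ones(10), size=3)
for s in conformal_candidate_sets(cal_probs, cal_labels, test_probs):
    print("候选类集合:", s)
```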
链接: https://arxiv.org/abs/2502.06209
作者: Yeho Gwon,Sehyun Hwang,Hoyoung Kim,Jungseul Ok,Suha Kwak
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 17 figures, 4 tables
点击查看摘要
Abstract:This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 42% on ImageNet64x64.
zh
[CV-69] Comparing Image Segmentation Algorithms
【速读】:该论文旨在解决二值图像去噪问题,针对非凸能量函数固有的挑战提出了新颖的解决方案。关键在于结合模拟退火算法与局部优化策略,通过定义能量函数 E(x, y) 来有效搜索解空间,同时保持计算效率,从而实现显著的去噪效果并保留结构细节。实验结果表明,该方法相较于传统迭代条件模式(ICM)在去噪性能上有显著提升。
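下面给出该思路的可运行极简示意:用模拟退火最小化Ising型能量,恢复被10%像素噪声污染的二值图像(能量形式与超参数为常见设定下的假设,并非论文原始实现):

```python
import numpy as np

def sa_denoise(y, beta=1.0, eta=2.0, T0=4.0, steps=200000, seed=0):
    """模拟退火最小化 E(x, y) = -beta*Σ_近邻 x_i x_j - eta*Σ_i x_i y_i，
    其中 x, y ∈ {-1, +1}，y 为含噪观测，x 为待恢复图像。"""
    rng = np.random.default_rng(seed)
    x = y.copy()
    H, W = x.shape
    for k in range(steps):
        T = T0 * (1 - k / steps) + 1e-3                # 线性降温
        i, j = rng.integers(H), rng.integers(W)
        nb = x[max(i-1, 0):i+2, max(j-1, 0):j+2].sum() - x[i, j]  # 8 邻域和
        dE = 2 * x[i, j] * (beta * nb + eta * y[i, j])  # 翻转该像素的能量变化
        if dE < 0 or rng.random() < np.exp(-dE / T):    # Metropolis 准则
            x[i, j] = -x[i, j]
    return x

rng = np.random.default_rng(1)
clean = np.ones((64, 64), dtype=int)
clean[16:48, 16:48] = -1
noisy = np.where(rng.random(clean.shape) < 0.1, -clean, clean)  # 10% 像素翻转
restored = sa_denoise(noisy)
print("与原图一致的像素比例:", (restored == clean).mean())
```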
链接: https://arxiv.org/abs/2502.06201
作者: Milind Cherukuri
机构: University of North Texas (北德克萨斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents a novel approach for denoising binary images using simulated annealing (SA), a global optimization technique that addresses the inherent challenges of non-convex energy functions. Binary images are often corrupted by noise, necessitating effective restoration methods. We propose an energy function E(x, y) that captures the relationship between the noisy image y and the desired clean image x. Our algorithm combines simulated annealing with a localized optimization strategy to efficiently navigate the solution space, minimizing the energy function while maintaining computational efficiency. We evaluate the performance of the proposed method against traditional iterated conditional modes (ICM), employing a binary image with 10% pixel corruption as a test case. Experimental results demonstrate that the simulated annealing method achieves a significant restoration improvement, yielding a 99.19% agreement with the original image compared to 96.21% for ICM. Visual assessments reveal that simulated annealing effectively removes noise while preserving structural details, making it a promising approach for binary image denoising. This work contributes to the field of image processing by highlighting the advantages of incorporating global optimization techniques in restoration tasks.
zh
[CV-70] Multimodal Task Representation Memory Bank vs. Catastrophic Forgetting in Anomaly Detection
【速读】:该论文旨在解决多任务表示学习中的不完整表示和灾难性遗忘问题,特别是在无监督连续异常检测(Unsupervised Continuous Anomaly Detection, UCAD)场景下。为此,论文提出了多模态任务表示记忆库(Multimodal Task Representation Memory Bank, MTRMB)方法,其关键在于两项技术创新:1) 关键提示-多模态知识(Key-Prompt-Multimodal Knowledge, KPMK)机制,利用简洁的关键提示指导BERT与ViT之间的跨模态特征交互;2) 基于精炼结构的对比学习(Refined Structure-based Contrastive Learning, RSCL),结合Grounding DINO和SAM生成精确的分割掩码,将同一结构区域的特征拉近,同时将不同结构区域的特征推开。
链接: https://arxiv.org/abs/2502.06194
作者: You Zhou,Jiangshan Zhao,Deyu Zeng,Zuo Zuo,Weixiang Liu,Zongze Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Unsupervised Continuous Anomaly Detection (UCAD) faces significant challenges in multi-task representation learning, with existing methods suffering from incomplete representation and catastrophic forgetting. Unlike supervised models, unsupervised scenarios lack prior information, making it difficult to effectively distinguish redundant and complementary multimodal features. To address this, we propose the Multimodal Task Representation Memory Bank (MTRMB) method through two key technical innovations: A Key-Prompt-Multimodal Knowledge (KPMK) mechanism that uses concise key prompts to guide cross-modal feature interaction between BERT and ViT. Refined Structure-based Contrastive Learning (RSCL) leveraging Grounding DINO and SAM to generate precise segmentation masks, pulling features of the same structural region closer while pushing different structural regions apart. Experiments on MVtec AD and VisA datasets demonstrate MTRMB’s superiority, achieving an average detection accuracy of 0.921 at the lowest forgetting rate, significantly outperforming state-of-the-art methods. We plan to open source on GitHub.
zh
[CV-71] Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures
【速读】:该论文旨在解决现有跨架构教师模型到学生模型的知识迁移方法未能充分利用隐藏在教师输出中的暗知识的问题。解决方案的关键在于提出了一种新的框架Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD),通过Decoupled Finegrained Relation Alignment (DFRA) 和Multi-Scale Dynamic Fusion (MSDF) 模块,在logit和特征层面平衡蒸馏暗知识与异构教师模型正确类别的置信度,并动态融合多尺度特征的投影logits,从而提升学生模型的性能。
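MLDR-KD建立在logit层面的暗知识蒸馏之上。下面给出温度软化KL蒸馏损失的标准写法作为参考(这只是通用基础构件的示意,并非论文DFRA/MSDF模块的实现):

```python
import torch
import torch.nn.functional as F

def logit_distill_loss(student_logits, teacher_logits, T=4.0):
    """温度软化的 logit 蒸馏损失（KL 散度）。温度 T 放大非正确类别
    之间的相对关系，即所谓“暗知识”。"""
    p_t = F.softmax(teacher_logits / T, dim=-1)          # 教师软标签
    log_p_s = F.log_softmax(student_logits / T, dim=-1)  # 学生对数概率
    # 乘以 T^2 使梯度尺度与温度近似无关
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

s = torch.randn(8, 100)  # 学生在 CIFAR-100 上的 logits
t = torch.randn(8, 100)  # 异构教师的 logits
print(logit_distill_loss(s, t))
```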
链接: https://arxiv.org/abs/2502.06189
作者: Yaoxin Yang,Peng Ye,Weihao Lin,Kangcong Li,Yan Wen,Jia Hao,Tao Chen
机构: School of Information Science and Technology, Fudan University (信息科学与技术学院, 复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Heterogeneous distillation is an effective way to transfer knowledge from cross-architecture teacher models to student models. However, existing heterogeneous distillation methods do not take full advantage of the dark knowledge hidden in the teacher’s output, limiting their performance. To this end, we propose a novel framework named Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) to unleash the potential of relational distillation in heterogeneous distillation. Concretely, we first introduce Decoupled Finegrained Relation Alignment (DFRA) in both logit and feature levels to balance the trade-off between distilled dark knowledge and the confidence in the correct category of the heterogeneous teacher model. Then, the Multi-Scale Dynamic Fusion (MSDF) module is applied to dynamically fuse the projected logits of multi-scale features at different stages in the student model, further improving the performance of our method at the feature level. We verify our method on four architectures (CNNs, Transformers, MLPs and Mambas) and two datasets (CIFAR-100 and Tiny-ImageNet). Compared with the best available method, our MLDR-KD improves student model performance with gains of up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet, respectively, showing robustness and generality in heterogeneous distillation. Code will be released soon.
zh
[CV-72] CANeRV: Content Adaptive Neural Representation for Video Compression
【速读】:该论文旨在解决现有基于隐式神经表示(INR)的视频压缩方法因采用固定且统一的网络架构而导致的适应性差和动态变化捕捉不足的问题。关键在于提出了一种名为内容自适应神经表示视频压缩(CANeRV)的方法,通过引入动态序列级调整(DSA)、动态帧级调整(DFA)以及结构层级自适应(HSA)来增强其对视频内容动态特性的捕捉能力,从而实现更优的压缩效果。
链接: https://arxiv.org/abs/2502.06181
作者: Lv Tang,Jun Zhu,Xinfeng Zhang,Li Zhang,Siwei Ma,Qingming Huang
机构: School of Computer Science and Technology, University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院), Beijing, China; Bytedance Inc.(字节跳动), San Diego, USA; School of Computer Science, Peking University(北京大学计算机学院), Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in video compression introduce implicit neural representation (INR) based methods, which effectively capture global dependencies and characteristics of entire video sequences. Unlike traditional and deep learning based approaches, INR-based methods optimize network parameters from a global perspective, resulting in superior compression potential. However, most current INR methods utilize a fixed and uniform network architecture across all frames, limiting their adaptability to dynamic variations within and between video sequences. This often leads to suboptimal compression outcomes as these methods struggle to capture the distinct nuances and transitions in video content. To overcome these challenges, we propose Content Adaptive Neural Representation for Video Compression (CANeRV), an innovative INR-based video compression network that adaptively conducts structure optimisation based on the specific content of each video sequence. To better capture dynamic information across video sequences, we propose a dynamic sequence-level adjustment (DSA). Furthermore, to enhance the capture of dynamics between frames within a sequence, we implement a dynamic frame-level adjustment (DFA). Finally, to effectively capture spatial structural information within video frames, thereby enhancing the detail restoration capabilities of CANeRV, we devise a hierarchical structural adaptation (HSA). Experimental results demonstrate that CANeRV can outperform both H.266/VVC and state-of-the-art INR-based video compression techniques across diverse video datasets.
zh
[CV-73] PLATTER: A Page-Level Handwritten Text Recognition System for Indic Scripts
【速读】:该论文旨在解决手写文本识别(Handwritten Text Recognition, HTR)领域中模型比较困难的问题,并特别关注Indic语言在相关数据集稀缺的情况下缺乏充分研究的问题。此外,论文还强调了手写文本检测(Handwritten Text Detection, HTD)在构建端到端的手写光学字符识别(OCR)系统中的重要性。论文的关键解决方案是提出了一种端到端的页面级手写文本识别框架(Page-Level hAndwriTTen TExt Recognition, PLATTER),将其视为一个包含词级HTD和随后HTR的两阶段问题。这种方法使我们能够独立识别、评估和解决每个阶段的挑战。
链接: https://arxiv.org/abs/2502.06172
作者: Badri Vishal Kasuba,Dhruv Kudale,Venkatapathy Subramanian,Parag Chaudhuri,Ganesh Ramakrishnan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitting Preprint
点击查看摘要
Abstract:In recent years, the field of Handwritten Text Recognition (HTR) has seen the emergence of various new models, each claiming to perform competitively better than the other in specific scenarios. However, making a fair comparison of these models is challenging due to inconsistent choices and diversity in test sets. Furthermore, recent advancements in HTR often fail to account for the diverse languages, especially Indic languages, likely due to the scarcity of relevant labeled datasets. Moreover, much of the previous work has focused primarily on character-level or word-level recognition, overlooking the crucial stage of Handwritten Text Detection (HTD) necessary for building a page-level end-to-end handwritten OCR pipeline. Through our paper, we address these gaps by making three pivotal contributions. Firstly, we present an end-to-end framework for Page-Level hAndwriTTen TExt Recognition (PLATTER) by treating it as a two-stage problem involving word-level HTD followed by HTR. This approach enables us to identify, assess, and address challenges in each stage independently. Secondly, we demonstrate the usage of PLATTER to measure the performance of our language-agnostic HTD model and present a consistent comparison of six trained HTR models on ten diverse Indic languages thereby encouraging consistent comparisons. Finally, we also release a Corpus of Handwritten Indic Scripts (CHIPS), a meticulously curated, page-level Indic handwritten OCR dataset labeled for both detection and recognition purposes. Additionally, we release our code and trained models, to encourage further contributions in this direction.
zh
[CV-74] An Interpretable Implicit-Based Approach for Modeling Local Spatial Effects: A Case Study of Global Gross Primary Productivity
【速读】:该论文旨在解决地球科学领域中未观测因素的非平稳空间分布导致特征与目标之间关系呈现空间异质性的问题。传统统计学习方法难以捕捉这种空间异质性,从而影响预测精度和解释可靠性。为克服这一局限,论文提出了一种新颖的方法:利用深度神经网络同时建模不同位置之间的共性特征及空间差异。该方法的关键在于采用具有编码器-解码器结构的双支路神经网络,在编码阶段通过GCN和LSTM聚合时空条件下的节点信息,以隐式条件向量形式编码位置特定的时空异质性,并使用基于自注意力机制的编码器提取数据中的位置不变的共同特征。在解码阶段,采用条件生成策略预测响应变量及其解释权重。这种方法在植被总初级生产力(GPP)预测任务中得到验证,展示了优越的性能。
链接: https://arxiv.org/abs/2502.06170
作者: Siqi Du,Hongsheng Huang,Kaixin Shen,Ziqi Liu,Shengjun Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In Earth sciences, unobserved factors exhibit non-stationary spatial distributions, causing the relationships between features and targets to display spatial heterogeneity. In geographic machine learning tasks, conventional statistical learning methods often struggle to capture spatial heterogeneity, leading to unsatisfactory prediction accuracy and unreliable interpretability. While approaches like Geographically Weighted Regression (GWR) capture local variations, they fall short of uncovering global patterns and tracking the continuous evolution of spatial heterogeneity. Motivated by this limitation, we propose a novel perspective - that is, simultaneously modeling common features across different locations alongside spatial differences using deep neural networks. The proposed method is a dual-branch neural network with an encoder-decoder structure. In the encoding stage, the method aggregates node information in a spatiotemporal conditional graph using GCN and LSTM, encoding location-specific spatiotemporal heterogeneity as an implicit conditional vector. Additionally, a self-attention-based encoder is used to extract location-invariant common features from the data. In the decoding stage, the approach employs a conditional generation strategy that predicts response variables and interpretative weights based on data features under spatiotemporal conditions. The approach is validated by predicting vegetation gross primary productivity (GPP) using global climate and land cover data from 2001 to 2020. Trained on 50 million samples and tested on 2.8 million, the proposed model achieves an RMSE of 0.836, outperforming LightGBM (1.063) and TabNet (0.944). Visualization analyses indicate that our method can reveal the distribution differences of the dominant factors of GPP across various times and locations.
zh
[CV-75] Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
【速读】:该论文旨在解决Diffusion Transformers (DiTs)在生成高保真视频过程中因复杂注意力计算和大量采样步骤导致的高昂推理成本问题。解决方案的关键在于两个方面:一是通过剪枝3D全注意力机制,并引入一种新的稀疏3D注意力机制,该机制与视频帧数量呈线性复杂度;二是采用多步一致性蒸馏缩短采样过程,通过将整个采样轨迹分割成多个段并在每个段内进行一致性蒸馏,激活少步生成能力。这两个方法共同构成了一个三阶段训练流程,从而显著提升了生成效率,同时保持了可接受的性能损失。
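对于“随帧数线性复杂度的稀疏3D注意力”,一种直观做法是让每帧只与自身及少量全局参考帧交互。下面的掩码构造即按此思路给出示意(具体的瓦片模式为演示假设,并非论文原始稀疏模式):

```python
import torch

def tile_sparse_mask(num_frames, tokens_per_frame, num_global=2):
    """构造分块稀疏注意力掩码：帧内全连接 + 少量全局参考帧。
    每帧的注意力开销与帧数无关，整体计算量随帧数近似线性增长。"""
    n = num_frames * tokens_per_frame
    mask = torch.zeros(n, n, dtype=torch.bool)
    for f in range(num_frames):
        s, e = f * tokens_per_frame, (f + 1) * tokens_per_frame
        mask[s:e, s:e] = True                       # 帧内注意力
        for g in range(num_global):                 # 前 num_global 帧作全局参考
            gs, ge = g * tokens_per_frame, (g + 1) * tokens_per_frame
            mask[s:e, gs:ge] = True
    return mask

m = tile_sparse_mask(num_frames=8, tokens_per_frame=4)
print("稀疏率:", 1 - m.float().mean().item())
```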
链接: https://arxiv.org/abs/2502.06155
作者: Hangliang Ding,Dacheng Li,Runlong Su,Peiyuan Zhang,Zhijie Deng,Ion Stoica,Hao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model consumes more than 9 minutes for generating a single video of 29 frames. This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data; We identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that holds a linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation; We split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline to conjoin the low-complexity attention and few-step generation capacities. Notably, with 0.1% pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x-7.8x faster for 29 and 93 frames 720p video generation with a marginal performance trade-off in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
zh
[CV-76] Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
【速读】:该论文旨在解决现有基于扩散模型的字符图像动画方法(如Animate Anyone)在生成角色与其环境之间合理关联方面的不足。关键解决方案在于引入了环境赋予能力(environment affordance),通过捕捉环境表示作为条件输入,并采用了一种与形状无关的掩码策略(shape-agnostic mask strategy)来更有效地表征角色与环境之间的关系。此外,通过利用物体引导器(object guider)提取交互对象特征并使用空间混合(spatial blending)进行特征注入,进一步增强了对象互动的真实感。论文还提出了一种姿态调制策略(pose modulation strategy),以处理更加多样的运动模式。这些措施共同提升了角色动画与环境的一致性和合理性。
链接: https://arxiv.org/abs/2502.06145
作者: Li Hu,Guangyuan Wang,Zhen Shen,Xin Gao,Dechao Meng,Lian Zhuo,Peng Zhang,Bang Zhang,Liefeng Bo
机构: Tongyi Lab, Alibaba Group (通义实验室, 阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, aiming to animate characters with environment affordance. Beyond extracting motion signals from source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region with the exclusion of characters and our model generates characters to populate these regions while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending for feature injection. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.
zh
[CV-77] Enhanced Hybrid Deep Learning Approach for Botnet Attacks Detection in IoT Environment
【速读】:该论文旨在解决物联网(IoT)环境中僵尸网络(botnet)攻击的检测问题。解决方案的关键在于提出一种基于深度学习的模型,通过堆叠深度卷积神经网络(Deep Convolutional Neural Networks)、双向长短期记忆网络(Bi-Directional Long Short-Term Memory, Bi-LSTM)、双向门控循环单元(Bi-Directional Gated Recurrent Unit, Bi-GRU)以及循环神经网络(Recurrent Neural Networks, RNN)来提高对复杂模式和特征的识别能力。实验结果表明,该模型在UNSW-NB15数据集上的测试准确率为99.76%,ROC-AUC达到99.18%,在僵尸网络攻击检测方面表现卓越。
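按摘要所述“堆叠CNN、Bi-LSTM、Bi-GRU与RNN”的思路,可以用Keras搭出如下示意模型(层宽、超参数与特征维度均为演示假设,并非论文原始配置):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_stacked_detector(timesteps, features):
    """CNN -> Bi-LSTM -> Bi-GRU -> RNN 的堆叠二分类模型示意。"""
    model = models.Sequential([
        layers.Input(shape=(timesteps, features)),
        layers.Conv1D(64, 3, padding="same", activation="relu"),  # 局部模式提取
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Bidirectional(layers.GRU(32, return_sequences=True)),
        layers.SimpleRNN(16),
        layers.Dense(1, activation="sigmoid"),  # 输出是否为 botnet 流量
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    return model

# UNSW-NB15 的特征维度此处取 42 仅作演示
model = build_stacked_detector(timesteps=10, features=42)
model.summary()
```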
链接: https://arxiv.org/abs/2502.06138
作者: A. Karthick kumar,S. Rathnamala,T. Vijayashanthi,M. Prabhananthakumar,Alavikunhu Panthakkan,Shadi Atalla,Wathiq Mansoor
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages
点击查看摘要
Abstract:Cyberattacks in an Internet of Things (IoT) environment can have significant impacts because of the interconnected nature of devices and systems. An attacker uses a network of compromised IoT devices in a botnet attack to carry out various harmful activities. Detecting botnet attacks poses several challenges because of the intricate and evolving nature of these threats. Botnet attacks erode trust in IoT devices and systems, undermining confidence in their security, reliability, and integrity. Deep learning techniques have significantly enhanced the detection of botnet attacks due to their ability to analyze and learn from complex patterns in data. This research proposed the stacking of Deep Convolutional Neural Networks, Bi-Directional Long Short-Term Memory (Bi-LSTM), Bi-Directional Gated Recurrent Unit (Bi-GRU), and Recurrent Neural Networks (RNN) for botnet attack detection. The UNSW-NB15 dataset is utilized for botnet attack detection. According to experimental results, the proposed model accurately captures the intricate patterns and features of botnet attacks, with a testing accuracy of 99.76%. The proposed model also identifies botnets with a high ROC-AUC value of 99.18%. A performance comparison of the proposed method with existing state-of-the-art models confirms its higher performance. The outcomes of this research could strengthen cyber security procedures and safeguard against new attacks.
zh
[CV-78] Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning AAAI2025
【速读】:该论文旨在解决医学时间序列数据中常见的不规则性和显著缺失值带来的挑战,这些问题给数据分析和临床决策带来了困难。论文的关键解决方案在于提出了一种联合学习框架,该框架同时采用序列和图像表示方法,并设计了三种自监督学习策略来促进这两种表示方法的融合,从而捕捉到更具泛化能力的联合表示。实验结果表明,所提方法在三个真实世界临床数据集中优于其他七种最先进的模型。
链接: https://arxiv.org/abs/2502.06134
作者: Liuqing Chen,Shuhong Xiao,Shixian Ding,Shanhai Hu,Lingyun Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, AAAI2025
点击查看摘要
Abstract:Medical time series are often irregular and face significant missingness, posing challenges for data analysis and clinical decision-making. Existing methods typically adopt a single modeling perspective, either treating series data as sequences or transforming them into image representations for further classification. In this paper, we propose a joint learning framework that incorporates both sequence and image representations. We also design three self-supervised learning strategies to facilitate the fusion of sequence and image representations, capturing a more generalizable joint representation. The results indicate that our approach outperforms seven other state-of-the-art models in three representative real-world clinical datasets. We further validate our approach by simulating two major types of real-world missingness through leave-sensors-out and leave-samples-out techniques. The results demonstrate that our approach is more robust and significantly surpasses other baselines in terms of classification performance.
zh
[CV-79] Improved YOLOv5s model for key components detection of power transmission lines
【速读】:该论文旨在解决高压输电线路智能检测过程中关键组件检测精度低的问题。解决方案的关键在于改进YOLOv5s模型:首先通过修改k-means聚类中的距离测量来优化YOLOv5s模型的锚点匹配;其次,在骨干网络中添加卷积块注意力模块(CBAM)以提高精度;最后,采用焦点损失函数减少类别不平衡的影响。这些改进使模型的mAP达到98.1%,精度达到97.5%,召回率达到94.4%,检测速率达到84.8 FPS。
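摘要中“修改k-means聚类中的距离测量以改进锚框匹配”通常指用 d = 1 - IoU 替代欧氏距离。下面给出该思路的极简实现示意(宽高数据随机生成,仅作演示):

```python
import numpy as np

def iou_wh(boxes, anchors):
    """仅基于宽高的 IoU（将框左上角对齐后计算），锚框聚类的常用度量。"""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] \
            + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """以 d = 1 - IoU 为距离的 k-means 锚框聚类示意。"""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # IoU 最大即距离最小
        for c in range(k):
            if np.any(assign == c):
                anchors[c] = np.median(boxes[assign == c], axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # 按面积排序

boxes = np.random.default_rng(1).uniform(10, 300, size=(500, 2))  # 模拟 (w, h)
print(kmeans_anchors(boxes, k=9))
```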
链接: https://arxiv.org/abs/2502.06127
作者: Chen Chen,Guowu Yuan,Hao Zhou,Yi Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 14 figures
点击查看摘要
Abstract:High-voltage transmission lines are located far from the road, resulting in inconvenient inspection work and rising maintenance costs. Intelligent inspection of power transmission lines has become increasingly important. However, subsequent intelligent inspection relies on accurately detecting various key components. Due to the low detection accuracy of key components in transmission line image inspection, this paper proposed an improved object detection model based on the YOLOv5s (You Only Look Once Version 5 Small) model to improve the detection accuracy of key components of transmission lines. According to the characteristics of the power grid inspection image, we first modify the distance measurement in the k-means clustering to improve the anchor matching of the YOLOv5s model. Then, we add the convolutional block attention module (CBAM) attention mechanism to the backbone network to improve accuracy. Finally, we apply the focal loss function to reduce the impact of class imbalance. Our improved method’s mAP (mean average precision) reached 98.1%, the precision reached 97.5%, the recall reached 94.4%, and the detection rate reached 84.8 FPS (frames per second). The experimental results show that our improved model improves detection accuracy and has performance advantages over other models.
zh
[CV-80] An Appearance Defect Detection Method for Cigarettes Based on C-CenterNet
【速读】:该论文旨在解决传统方法在自动卷烟生产线上难以准确识别卷烟缺陷及其类型的问题。解决方案的关键在于提出了一种基于C-CenterNet的卷烟外观缺陷检测方法,该方法通过关键点估计定位中心点并回归其他所有缺陷属性。具体改进包括使用ResNet50作为特征提取网络,并引入卷积块注意力机制(CBAM)增强有效特征的提取能力,减少非目标信息的干扰;采用特征金字塔网络增强各层特征提取;利用可变形卷积替代部分普通卷积以提升不同形状缺陷的学习能力;最后使用ACON激活函数代替ReLU,实现神经元激活操作的自适应选择,从而提高网络的检测精度。实验结果显示,该模型在卷烟外观缺陷检测任务中的平均精度均值(mAP)达到95.01%,相比原始CenterNet模型提升了6.14%的成功率。
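文中的ACON激活出自“Activate or Not”一文,其ACON-C形式为 f(x) = (p1 - p2)·x·σ(β(p1 - p2)x) + p2·x,其中β控制“激活与否”的平滑切换。下面给出一个按通道学习参数的PyTorch参考实现示意:

```python
import torch
import torch.nn as nn

class AconC(nn.Module):
    """ACON-C 激活：beta -> 0 时退化为线性，beta 增大时趋近 ReLU 类行为。"""
    def __init__(self, channels):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):
        dpx = (self.p1 - self.p2) * x
        return dpx * torch.sigmoid(self.beta * dpx) + self.p2 * x

act = AconC(channels=16)
print(act(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```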
链接: https://arxiv.org/abs/2502.06119
作者: Hongyu Liu,Guowu Yuan,Lei Yang,Kunxiao Liu,Hao Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 14 figures
点击查看摘要
Abstract:Due to the poor adaptability of traditional methods in the cigarette detection task on the automatic cigarette production line, it is difficult to accurately identify whether a cigarette has defects and the types of defects; thus, a cigarette appearance defect detection method based on C-CenterNet is proposed. This detector uses keypoint estimation to locate center points and regresses all other defect properties. Firstly, Resnet50 is used as the backbone feature extraction network, and the convolutional block attention mechanism (CBAM) is introduced to enhance the network’s ability to extract effective features and reduce the interference of non-target information. At the same time, the feature pyramid network is used to enhance the feature extraction of each layer. Then, deformable convolution is used to replace part of the common convolution to enhance the learning ability of different shape defects. Finally, the activation function ACON (ActivateOrNot) is used instead of the ReLU activation function, and the activation operation of some neurons is adaptively selected to improve the detection accuracy of the network. The experimental results are mainly evaluated via the mean Average Precision (mAP). The experimental results show that the mAP of the C-CenterNet model applied in the cigarette appearance defect detection task is 95.01%. Compared with the original CenterNet model, the model’s success rate is increased by 6.14%, so it can meet the requirements of precision and adaptability in cigarette detection tasks on the automatic cigarette production line.
zh
[CV-81] A Novel Multi-Teacher Knowledge Distillation for Real-Time Object Detection using 4D Radar
【速读】:该论文旨在解决4D雷达在3D物体检测中的稀疏点云问题。解决方案的关键在于引入了一种新颖的知识蒸馏框架,该框架使学生模型能够在潜在空间中通过模拟教师模型的集合来稠化其稀疏输入,从而有效应对4D雷达点云稀疏的问题。这一方法在K-Radar数据集上的实验展示了相较于最先进的RTNH模型有25%的性能提升,同时保持了实时推理速度。
链接: https://arxiv.org/abs/2502.06114
作者: Seung-Hyun Song,Dong-Hee Paek,Minh-Quan Dao,Ezio Malis,Seung-Hyun Kong
机构: Korea Advanced Institute of Science and Technology(韩国科学技术院); Centre Inria d’Univeristé Côte d’Azur(尼斯蔚蓝大学Inria研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate 3D object detection is crucial for safe autonomous navigation, requiring reliable performance across diverse weather conditions. While LiDAR performance deteriorates in challenging weather, Radar systems maintain their reliability. Traditional Radars have limitations due to their lack of elevation data, but the recent 4D Radars overcome this by measuring elevation alongside range, azimuth, and Doppler velocity, making them invaluable for autonomous vehicles. The primary challenge in utilizing 4D Radars is the sparsity of their point clouds. Previous works address this by developing architectures that better capture semantics and context in sparse point cloud, largely drawing from LiDAR-based approaches. However, these methods often overlook a unique advantage of 4D Radars: the dense Radar tensor, which encapsulates power measurements across three spatial dimensions and the Doppler dimension. Our paper leverages this tensor to tackle the sparsity issue. We introduce a novel knowledge distillation framework that enables a student model to densify its sparse input in the latent space by emulating an ensemble of teacher models. Our experiments demonstrate a 25% performance improvement over the state-of-the-art RTNH model on the K-Radar dataset. Notably, this improvement is achieved while still maintaining a real-time inference speed.
zh
[CV-82] Col-OLHTR: A Novel Framework for Multimodal Online Handwritten Text Recognition ICASSP2025
【速读】:该论文旨在解决在线手写文本识别(Online Handwritten Text Recognition, OLHTR)中的序列识别任务所面临的挑战。现有方法通常采用单一轨迹或图像编码器,或结合CTC或注意力机制的多流编码器,但这些方法存在两个主要缺点:1)单一编码器难以动态捕捉全局特征;2)多流编码器虽然更全面,但结构复杂且推理成本高。论文的关键解决方案是提出了一种基于协作学习的OLHTR框架(Col-OLHTR),该框架在训练过程中学习多模态特征,同时保持单流推理过程。Col-OLHTR包含轨迹编码器、点到空间对齐(Point-to-Spatial Alignment, P2SA)模块和注意力机制解码器。P2SA模块通过轨迹编码特征和二维旋转位置嵌入来学习图像级的空间特征。此外,在训练期间,额外的图像流编码器-解码器协同工作以监督P2SA特征。在推理阶段,额外的流被丢弃,仅使用P2SA模块并与解码器合并,从而简化流程同时保持高性能。
链接: https://arxiv.org/abs/2502.06100
作者: Chenyu Liu,Jinshui Hu,Baocai Yin,Jia Pan,Bing Yin,Jun Du,Qingfeng Liu
机构: University of Science and Technology of China(中国科学技术大学); iFLYTEK Research(科大讯飞研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: ICASSP 2025
点击查看摘要
Abstract:Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications. Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders, combined with a CTC or attention-based recognition decoder. However, these approaches face several drawbacks: 1) single encoders typically focus on either local trajectories or visual regions, lacking the ability to dynamically capture relevant global features in challenging cases; 2) multi-stream encoders, while more comprehensive, suffer from complex structures and increased inference costs. To tackle this, we propose a Collaborative learning-based OLHTR framework, called Col-OLHTR, that learns multimodal features during training while maintaining a single-stream inference process. Col-OLHTR consists of a trajectory encoder, a Point-to-Spatial Alignment (P2SA) module, and an attention-based decoder. The P2SA module is designed to learn image-level spatial features through trajectory-encoded features and 2D rotary position embeddings. During training, an additional image-stream encoder-decoder is collaboratively trained to provide supervision for P2SA features. At inference, the extra streams are discarded, and only the P2SA module is used and merged before the decoder, simplifying the process while preserving high performance. Extensive experimental results on several OLHTR benchmarks demonstrate the state-of-the-art (SOTA) performance, proving the effectiveness and robustness of our design.
zh
[CV-83] Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models
【速读】:该论文旨在解决在医学领域应用视觉语言模型(Vision Language Models, VLMs)时可能存在的公平性问题。论文提出的关键解决方案是Fair-MoE模型,该模型包含两个核心组件:公平导向专家混合模型(Fairness-Oriented Mixture of Experts, FO-MoE)和公平导向损失函数(Fairness-Oriented Loss, FOL)。FO-MoE通过集成不同专家的知识来过滤偏见的补丁嵌入,并提取与特定任务相关的更公平的信息;FOL则是一种新型的公平导向损失函数,不仅最小化不同属性间的距离,还优化了各属性分布差异的分散度。实验结果表明,Fair-MoE在提高公平性和准确性方面均表现出色。
链接: https://arxiv.org/abs/2502.06094
作者: Peiran Wang,Linjie Tong,Jiaxiang Liu,Zuozhu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Fairness is a fundamental principle in medical ethics. Vision Language Models (VLMs) have shown significant potential in the medical field due to their ability to leverage both visual and linguistic contexts, reducing the need for large datasets and enabling the performance of complex tasks. However, the exploration of fairness within VLM applications remains limited. Applying VLMs without a comprehensive analysis of fairness could lead to concerns about equal treatment opportunities and diminish public trust in medical deep learning models. To build trust in medical VLMs, we propose Fair-MoE, a model specifically designed to ensure both fairness and effectiveness. Fair-MoE comprises two key components: \textitthe Fairness-Oriented Mixture of Experts (FO-MoE) and \textitthe Fairness-Oriented Loss (FOL). FO-MoE is designed to leverage the expertise of various specialists to filter out biased patch embeddings and use an ensemble approach to extract more equitable information relevant to specific tasks. FOL is a novel fairness-oriented loss function that not only minimizes the distances between different attributes but also optimizes the differences in the dispersion of various attributes’ distributions. Extended experiments demonstrate the effectiveness and fairness of Fair-MoE. Tested on the Harvard-FairVLMed dataset, Fair-MoE showed improvements in both fairness and accuracy across all four attributes. Code will be publicly available.
zh
[CV-84] Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization
【速读】:该论文旨在解决在使用强化学习(Reinforcement Learning, RL)微调连续时间流模型以对齐任意用户定义的奖励函数时所遇到的挑战,特别是过度优化导致的策略崩溃(policy collapse)以及连续时间流中似然计算代价过高的问题。论文的关键方案是提出一种新方法:在线奖励加权条件流匹配与Wasserstein-2正则化(Online Reward-Weighted Conditional Flow Matching with Wasserstein-2 Regularization, ORW-CFM-W2)。该方法将RL集成到流匹配框架中,通过在线奖励加权机制引导模型优先关注数据流形中的高奖励区域,并利用Wasserstein-2距离正则化防止策略崩溃、保持多样性,从而实现探索与开发之间的平衡。
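在线奖励加权条件流匹配的核心,是在标准CFM损失上按奖励对样本重新加权。下面给出线性插值路径下的极简示意(网络结构、奖励与加权形式均为演示假设,W2正则项从略):

```python
import torch

def reward_weighted_cfm_loss(v_theta, x1, rewards, tau=1.0):
    """奖励加权 CFM 损失示意：x_t = (1-t)x0 + t*x1，目标速度 u = x1 - x0，
    权重 w = softmax(r/tau) 使模型偏向数据流形上的高奖励区域。"""
    x0 = torch.randn_like(x1)                  # 先验噪声
    t = torch.rand(x1.size(0), 1)              # 每个样本各取一个时间
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    w = torch.softmax(rewards / tau, dim=0)    # 在线奖励加权
    per_sample = ((v_theta(xt, t) - target) ** 2).mean(dim=1)
    return (w * per_sample).sum()

# 玩具速度场：输入拼接 (x_t, t) 的小 MLP
net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 2))
v = lambda x, t: net(torch.cat([x, t], dim=1))
x1 = torch.randn(16, 2)   # “数据”样本
r = torch.randn(16)       # 任意用户定义的奖励
print(reward_weighted_cfm_loss(v, x1, r))
```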
链接: https://arxiv.org/abs/2502.06061
作者: Jiajun Fan,Shuaike Shen,Chaoran Cheng,Yuxin Chen,Chumeng Liang,Ge Liu
机构: University of Illinois Urbana-Champaign(伊利诺伊大学香槟分校); Zhejiang University(浙江大学); Tsinghua University(清华大学); University of Southern California(南加州大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 61 pages
点击查看摘要
Abstract:Recent advancements in reinforcement learning (RL) have achieved great success in fine-tuning diffusion-based generative models. However, fine-tuning continuous flow-based generative models to align with arbitrary user-defined reward functions remains challenging, particularly due to issues such as policy collapse from overoptimization and the prohibitively high computational cost of likelihoods in continuous-time flows. In this paper, we propose an easy-to-use and theoretically sound RL fine-tuning method, which we term Online Reward-Weighted Conditional Flow Matching with Wasserstein-2 Regularization (ORW-CFM-W2). Our method integrates RL into the flow matching framework to fine-tune generative models with arbitrary reward functions, without relying on gradients of rewards or filtered datasets. By introducing an online reward-weighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold. To prevent policy collapse and maintain diversity, we incorporate Wasserstein-2 (W2) distance regularization into our method and derive a tractable upper bound for it in flow matching, effectively balancing exploration and exploitation of policy optimization. We provide theoretical analyses to demonstrate the convergence properties and induced data distributions of our method, establishing connections with traditional RL algorithms featuring Kullback-Leibler (KL) regularization and offering a more comprehensive understanding of the underlying mechanisms and learning behavior of our approach. Extensive experiments on tasks including target image generation, image compression, and text-image alignment demonstrate the effectiveness of our method, where our method achieves optimal policy convergence while allowing controllable trade-offs between reward maximization and diversity preservation.
zh
[CV-85] raveling Waves Integrate Spatial Information Into Spectral Representations
【速读】:该论文旨在探索行波(traveling waves)在神经网络中的潜在功能,特别是它们如何支持空间信息的整合与传递。关键在于引入一种卷积递归神经网络(Convolutional Recurrent Neural Networks),使其能够学习在隐藏状态中产生行波以响应视觉刺激。通过对这些波状激活进行频谱分解(spectral decomposition),论文获得了一个新的表示空间(representational space),在需要全局空间上下文的任务中优于局部前馈网络。其核心机制是利用行波扩展局部连接神经元的感受野(receptive field),从而支持长距离的信息编码与通信,并最终完成需要全局整合的视觉语义分割任务。
链接: https://arxiv.org/abs/2502.06034
作者: Mozes Jacobs,Roberto C. Budzinski,Lyle Muller,Demba Ba,T. Anderson Keller
机构: The Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University (哈佛大学); Western University (西方大学), Department of Mathematics, London, Ontario, Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Traveling waves are widely observed in the brain, but their precise computational function remains unclear. One prominent hypothesis is that they enable the transfer and integration of spatial information across neural populations. However, few computational models have explored how traveling waves might be harnessed to perform such integrative processing. Drawing inspiration from the famous “Can one hear the shape of a drum?” problem, which highlights how spectral modes encode geometric information, we introduce a set of convolutional recurrent neural networks that learn to produce traveling waves in their hidden states in response to visual stimuli. By applying a spectral decomposition to these wave-like activations, we obtain a powerful new representational space that outperforms equivalently local feed-forward networks on tasks requiring global spatial context. In particular, we observe that traveling waves effectively expand the receptive field of locally connected neurons, supporting long-range encoding and communication of information. We demonstrate that models equipped with this mechanism and spectral readouts solve visual semantic segmentation tasks demanding global integration, where local feed-forward models fail. As a first step toward traveling-wave-based representations in artificial networks, our findings suggest potential efficiency benefits and offer a new framework for connecting to biological recordings of neural activity.
zh
[CV-86] DiTASK: Multi-Task Fine-Tuning with Diffeomorphic Transformations CVPR
【速读】:该论文旨在解决如何高效地将预训练视觉Transformer (Vision Transformers)适配到多任务学习(Multi-Task Learning, MTL)中的问题。如LoRA等现有参数高效方法采用低秩更新,迫使各任务在受限子空间内相互竞争,最终导致性能下降。论文提出的关键方案是DiTASK,一种新颖的微分同胚多任务微调方法:通过保持权重矩阵的奇异向量来保留预训练表示,并通过对奇异值施加神经微分同胚变换实现任务特定适配,从而以极少的新增参数同时支持共享与任务特定的特征调制。
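“保持奇异向量、只变换奇异值”的思想可用如下极简示意表达:对预训练权重做SVD,冻结U、V,仅学习对奇异值的正缩放(论文采用的是神经微分同胚变换,此处以可学习指数缩放近似,仅为演示):

```python
import torch
import torch.nn as nn

class SingularValueTuner(nn.Module):
    """冻结奇异向量 U、V，仅学习奇异值的变换，实现满秩但结构保持的更新。"""
    def __init__(self, weight):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.log_scale = nn.Parameter(torch.zeros_like(S))  # 任务特定参数

    def forward(self, x):
        # 正缩放保证变换单调可逆，不破坏预训练特征的几何结构
        s = self.S * torch.exp(self.log_scale)
        w = self.U @ torch.diag(s) @ self.Vh
        return x @ w.T

layer = SingularValueTuner(torch.randn(128, 64))  # 假想的预训练权重
print(layer(torch.randn(4, 64)).shape)            # torch.Size([4, 128])
```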
链接: https://arxiv.org/abs/2502.06029
作者: Krishna Sri Ipsit Mantri,Carola-Bibiane Schönlieb,Bruno Ribeiro,Chaim Baskin,Moshe Eliasof
机构: Purdue University (普渡大学); University of Cambridge (剑桥大学); Ben-Gurion University of the Negev (本-古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, cvpr template
点击查看摘要
Abstract:Pre-trained Vision Transformers now serve as powerful tools for computer vision. Yet, efficiently adapting them for multiple tasks remains a challenge that arises from the need to modify the rich hidden representations encoded by the learned weight matrices, without inducing interference between tasks. Current parameter-efficient methods like LoRA, which apply low-rank updates, force tasks to compete within constrained subspaces, ultimately degrading performance. We introduce DiTASK, a novel Diffeomorphic Multi-Task Fine-Tuning approach that maintains pre-trained representations by preserving weight matrix singular vectors, while enabling task-specific adaptations through neural diffeomorphic transformations of the singular values. By following this approach, DiTASK enables both shared and task-specific feature modulations with minimal added parameters. Our theoretical analysis shows that DiTASK achieves full-rank updates during optimization, preserving the geometric structure of pre-trained features, and establishing a new paradigm for efficient multi-task learning (MTL). Our experiments on PASCAL MTL and NYUD show that DiTASK achieves state-of-the-art performance across four dense prediction tasks, using 75% fewer parameters than existing methods.
zh
[CV-87] Dual Caption Preference Optimization for Diffusion Models
【速读】:该论文旨在解决两个主要问题:一是现有偏好数据集中优选样本与非优选样本分布存在重叠,导致冲突分布;二是输入提示包含对于非优选图像的无关信息,限制了去噪网络在偏好优化方法中准确预测噪声的能力,即无关提示问题。为了解决这些问题,论文提出Dual Caption Preference Optimization (DCPO),其关键是利用两个独立的标题来缓解无关提示问题,并通过引入Pick-Double Caption数据集以及三种不同的策略(captioning、perturbation和hybrid方法)来处理冲突分布问题。
链接: https://arxiv.org/abs/2502.06023
作者: Amir Saeidi,Yiran Luo,Agneet Chatterjee,Shamanthak Hegde,Bimsara Pathiraja,Yezhou Yang,Chitta Baral
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, existing preference datasets often exhibit overlap between these distributions, leading to a conflict distribution. Additionally, we identified that input prompts contain irrelevant information for less preferred images, limiting the denoising network’s ability to accurately predict noise in preference optimization methods, known as the irrelevant prompt issue. To address these challenges, we propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. To tackle conflict distribution, we introduce the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 with separate captions for preferred and less preferred images. We further propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics, including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward, fine-tuned on SD 2.1 as the backbone.
zh
[CV-88] mporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding NAACL2025
【速读】:该论文旨在解决多模态基础模型(Multimodal Foundation Models, MFMs)在处理扩展时间序列时的能力限制,这限制了它们在全面视频和音频分析中的效能。论文的关键解决方案是引入了一个名为时间工作记忆(Temporal Working Memory, TWM)的专门认知模块。TWM通过查询引导注意力机制,专注于时间序列中最具有信息量的多模态片段,并仅保留最相关的内容,从而优化了模型有限容量的使用,增强了其时间建模能力。这一插件式模块可以轻松集成到现有的MFM中,显著提升了包括视频描述、问答和视频-文本检索在内的多个任务的性能。
链接: https://arxiv.org/abs/2502.06020
作者: Xingjian Diao,Chunhui Zhang,Weiyi Wu,Zhongyu Ouyang,Peijun Qing,Ming Cheng,Soroush Vosoughi,Jiang Gui
机构: Dartmouth College
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at NAACL 2025
点击查看摘要
Abstract:Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model’s limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at this https URL.
zh
[CV-89] Noise is an Efficient Learner for Zero-Shot Vision-Language Models
【速读】:该论文旨在解决测试时适应(Test-Time Adaptation, TTA)过程中预训练视觉-语言模型(Vision-Language Models, VLMs)仅通过调整可学习提示来应对分布偏移的问题。这种方法忽略了视觉表示本身可能存在的不可预测变化。论文的关键解决方案是引入了测试时噪声调节(Test-Time Noise Tuning, TNT),这是一种首次利用噪声适应策略直接在视觉输入空间优化可学习噪声的方法,从而实现从单个测试样本进行自适应特征学习。此外,TNT还通过嵌入距离一致性约束实现了跨视图表征对齐,并结合缩放 logits 和置信视图选择,显著提升了 VLM 的泛化能力和校准性能,在自然分布基准上的平均提升达到了 7.38%,跨数据集评估上提升了 0.80%,超过了零样本 CLIP 方法。
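TNT的核心是在视觉输入空间直接优化一个可学习噪声。下面以预测熵最小化作为测试时目标给出极简示意(目标函数与模型均为演示假设,论文中的多视图一致性约束与置信视图选择从略):

```python
import torch
import torch.nn.functional as F

def tnt_adapt(model, image, steps=10, lr=0.01):
    """对单个测试样本优化输入空间的可学习噪声，再用扰动后的输入做预测。"""
    noise = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        logits = model(image + noise)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        opt.zero_grad()
        entropy.backward()   # 优化器只持有噪声，模型权重不被更新
        opt.step()
    return model(image + noise.detach())

toy_model = torch.nn.Sequential(torch.nn.Flatten(),
                                torch.nn.Linear(3 * 32 * 32, 10))
img = torch.randn(1, 3, 32, 32)
print(tnt_adapt(toy_model, img).argmax(-1))
```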
链接: https://arxiv.org/abs/2502.06019
作者: Raza Imam,Asif Hanif,Jian Zhang,Khaled Waleed Dawoud,Yova Kementchedjhieva,Mohammad Yaqub
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI, 阿联酋人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our code is available at this https URL
点击查看摘要
Abstract:Recently, test-time adaptation has garnered attention as a method for tuning models without labeled data. The conventional modus operandi for adapting pre-trained vision-language models (VLMs) during test-time primarily focuses on tuning learnable prompts; however, this approach overlooks potential distribution shifts in the visual representations themselves. In this work, we address this limitation by introducing Test-Time Noise Tuning (TNT), a novel method for handling unpredictable shifts in the visual space. TNT leverages, for the first time, a noise adaptation strategy that optimizes learnable noise directly in the visual input space, enabling adaptive feature learning from a single test sample. We further introduce a novel approach for inter-view representation alignment by explicitly enforcing coherence in embedding distances, ensuring consistent feature representations across views. Combined with scaled logits and confident view selection at inference, TNT substantially enhances VLM generalization and calibration, achieving average gains of +7.38% on natural distributions benchmark and +0.80% on cross-dataset evaluations over zero-shot CLIP. These improvements lay a strong foundation for adaptive out-of-distribution handling.
zh
[CV-90] Pencils to Pixels: A Systematic Study of Creative Drawings across Children Adults and AI
【速读】:该论文旨在探讨能否通过计算指标量化不同智能体在绘画中的视觉创造力,同时考虑技术技能和风格的固有差异。为了解决这一问题,作者构建了一个包含1338幅由儿童、成人和AI创作的画作的数据集,并定义了衡量画作风格(包括墨水密度、墨水分布和元素数量)和内容(包括概念多样性及图像和文本嵌入的距离度量)的指标。关键在于通过这些指标比较儿童、成人和AI画作之间的风格、内容和创造力,并建立模型预测专家和自动化的创造力评分。研究表明各组之间存在显著差异,特别是儿童的画作包含更多元素,AI的画作具有更高的墨水密度,而成人的画作则表现出最大的概念多样性。
链接: https://arxiv.org/abs/2502.05999
作者: Surabhi S Nath,Guiomar del Cuvillo y Schröder,Claire E. Stevenson
机构: Max Planck Institute for Biological Cybernetics (马克斯·普朗克生物认知研究所), Germany; Max Planck School of Cognition (马克斯·普朗克认知学校), Leipzig, Germany; University of Tübingen (图宾根大学), Tübingen, Germany; University of Amsterdam (阿姆斯特丹大学), Amsterdam, Netherlands
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, 2 tables
点击查看摘要
Abstract:Can we derive computational metrics to quantify visual creativity in drawings across intelligent agents, while accounting for inherent differences in technical skill and style? To answer this, we curate a novel dataset consisting of 1338 drawings by children, adults and AI on a creative drawing task. We characterize two aspects of the drawings – (1) style and (2) content. For style, we define measures of ink density, ink distribution and number of elements. For content, we use expert-annotated categories to study conceptual diversity, and image and text embeddings to compute distance measures. We compare the style, content and creativity of children, adults and AI drawings and build simple models to predict expert and automated creativity scores. We find significant differences in style and content in the groups – children’s drawings had more components, AI drawings had greater ink density, and adult drawings revealed maximum conceptual diversity. Notably, we highlight a misalignment between creativity judgments obtained through expert and automated ratings and discuss its implications. Through these efforts, our work provides, to the best of our knowledge, the first framework for studying human and artificial creativity beyond the textual modality, and attempts to arrive at the domain-agnostic principles underlying creativity. Our data and scripts are available on GitHub.
zh
[CV-91] A Comprehensive Survey on Image Signal Processing Approaches for Low-Illumination Image Enhancement
【速读】:该论文旨在解决低光照条件下图像视觉质量下降的问题,特别是由于图像采集设备和照明条件限制导致的可见度差和噪声高的问题。论文的关键解决方案在于利用深度学习方法,尤其是卷积神经网络(Convolutional Neural Networks, CNNs),在保持重要信息的同时有效减少噪声,并提升图像亮度和对比度。此外,论文还探讨了混合技术,即结合深度学习方法与传统技术的优势,以进一步改善低光照图像的质量。
链接: https://arxiv.org/abs/2502.05995
作者: Muhammad Turab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The usage of digital content (photos and videos) in a variety of applications has increased due to the popularity of multimedia devices. These uses include advertising campaigns, educational resources, and social networking platforms. There is an increasing need for high-quality graphic information as people become more visually focused. However, captured images frequently have poor visibility and a high amount of noise due to the limitations of image-capturing devices and lighting conditions. Improving the visual quality of images taken in low illumination is the aim of low-illumination image enhancement. This problem is addressed by traditional image enhancement techniques, which alter noise, brightness, and contrast. Deep learning-based methods, however, have dominated recent advances in this area. These methods have effectively reduced noise while preserving important information, showing promising results in the improvement of low-illumination images. An extensive summary of image signal processing methods for enhancing low-illumination images is provided in this paper. The review classifies approaches into three categories: hybrid techniques, deep learning-based methods, and traditional approaches. Conventional techniques include denoising, automated white balancing, and noise reduction. Deep learning-based techniques use convolutional neural networks (CNNs) to recognize and extract characteristics from low-light images. To get better results, hybrid approaches combine deep learning-based methodologies with more conventional methods. The review also discusses the advantages and limitations of each approach and provides insights into future research directions in this field.
zh
[CV-92] SNAT-YOLO: Efficient Cross-Layer Aggregation Network for Edge-Oriented Gangue Detection
【速读】:该论文旨在解决基于深度学习的煤矸石目标检测方法中存在的检测速度慢、精度低、难以部署在工业边缘设备以及模型参数量和计算量大的问题。解决方案的关键在于基于改进的YOLOv11框架提出一种轻量级煤矸石目标检测算法:使用轻量级网络ShuffleNetV2作为骨干网络;引入轻量级下采样操作ADown以降低模型复杂度并提高平均检测精度;改进C2PSA模块并融合三重注意力机制(Triplet Attention),形成C2PSA-TriAtt模块以增强模型对不同维度特征的关注能力;提出Inner-FocalerIoU损失函数替代现有的CIoU损失函数。这些改进使模型在保持99.10%检测精度的同时,模型大小减小38%、参数量减少41%、计算量降低40%,且平均每张图像的检测时间减少约1毫秒,从而适合部署在工业边缘移动设备上,助力煤炭加工与煤炭资源的高效利用。
链接: https://arxiv.org/abs/2502.05988
作者: Shang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:To address the issues of slow detection speed, low accuracy, difficulty in deployment on industrial edge devices, and large parameter and computational requirements in deep learning-based coal gangue target detection methods, we propose a lightweight coal gangue target detection algorithm based on an improved YOLOv11. First, we use the lightweight network ShuffleNetV2 as the backbone to enhance detection speed. Then, we introduce a lightweight downsampling operation, ADown, which reduces model complexity while improving average detection accuracy. Next, we improve the C2PSA module in YOLOv11 by incorporating the Triplet Attention mechanism, resulting in the proposed C2PSA-TriAtt module, which enhances the model’s ability to focus on different dimensions of features. Finally, we propose the Inner-FocalerIoU loss function to replace the existing CIoU loss function. Experimental results show that our model achieves a detection accuracy of 99.10% in coal gangue detection tasks, reduces the model size by 38%, the number of parameters by 41%, and the computational cost by 40%, while decreasing the average detection time per image by 1 ms. The improved model demonstrates enhanced detection speed and accuracy, making it suitable for deployment on industrial edge mobile devices, thus contributing positively to coal processing and the efficient utilization of coal resources.
zh
[CV-93] VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer
【速读】:该论文旨在解决可控视觉特效(VFX)生成领域相对未被充分探索的问题。论文的关键在于提出了一种基于图像动画的新颖范式——VFX Creator框架,该框架利用视频扩散变换器(Video Diffusion Transformer),结合空间和时间可控的LoRA适配器以及插件式掩模控制模块,实现了从用户友好型文本描述和静态参考图像生成动态效果的功能。这种方法不仅提升了生成效果的真实性和动态性,还显著增强了在空间和时间上的可控性。
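框架中的时空可控LoRA适配器以标准LoRA为基本构件:在冻结权重旁并联一个低秩分支。下面给出通用LoRA线性层的示意实现(非VFX Creator官方代码):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha/r) * B A x：冻结 W，仅训练低秩矩阵 A、B。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 零初始化
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 77, 512)).shape)  # torch.Size([2, 77, 512])
```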
链接: https://arxiv.org/abs/2502.05979
作者: Xinyu Liu,Ailing Zeng,Wei Xue,Harry Yang,Wenhan Luo,Qifeng Liu,Yike Guo
机构: Hong Kong University of Science and Technology(香港科技大学); Tencent AI Lab(腾讯AI实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: \href{ this https URL }{this https URL}
点击查看摘要
Abstract:Crafting magic and illusions is one of the most thrilling aspects of filmmaking, with visual effects (VFX) serving as the powerhouse behind unforgettable cinematic experiences. While recent advances in generative artificial intelligence have driven progress in generic image and video synthesis, the domain of controllable VFX generation remains relatively underexplored. In this work, we propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user-friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open-VFX, the first high-quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions, instance segmentation masks for spatial conditioning, and start-end timestamps for temporal control. (ii) VFX Creator, a simple yet effective controllable VFX generation framework based on a Video Diffusion Transformer. The model incorporates a spatial and temporal controllable LoRA adapter, requiring minimal training videos. Specifically, a plug-and-play mask control module enables instance-level spatial manipulation, while tokenized start-end motion timestamps embedded in the diffusion process, alongside the text encoder, allow precise temporal control over effect timing and pace. Extensive experiments on the Open-VFX test set demonstrate the superiority of the proposed system in generating realistic and dynamic effects, achieving state-of-the-art performance and generalization ability in both spatial and temporal controllability. Furthermore, we introduce a specialized metric to evaluate the precision of temporal control. By bridging traditional VFX techniques with generative approaches, VFX Creator unlocks new possibilities for efficient and high-quality video effect generation, making advanced VFX accessible to a broader audience.
zh
[CV-94] Revisiting Gradient-based Uncertainty for Monocular Depth Estimation
【速读】:该论文旨在解决单目深度估计中的不确定性评估问题,特别是在存在动态对象或阴影等图像歧义的情况下。论文的关键解决方案在于提出了一种基于梯度的后处理不确定性估计方法,无需依赖真实深度标签。该方法通过引入辅助损失函数,利用预测深度与经过简单图像或特征增强生成的参考深度之间的一致性来提取梯度。最终的不确定性评分通过对单个或多个特征图进行反向传播得到的导数计算获得。这种方法在标准深度估计基准数据集KITTI和NYU上的实验结果表明,其在确定不确定性方面比相关方法更为有效,尤其是在那些因单目序列训练而最易受不确定影响的模型中。
链接: https://arxiv.org/abs/2502.05964
作者: Julia Hornauer,Amir El-Ghoussani,Vasileios Belagiannis
机构: Ulm University (乌尔姆大学); Friedrich-Alexander-Universität Erlangen-Nürnberg (弗里德里希-亚历山大-埃尔兰根-纽伦堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to TPAMI
点击查看摘要
Abstract:Monocular depth estimation, similar to other image-based tasks, is prone to erroneous predictions due to ambiguities in the image, for example, caused by dynamic objects or shadows. For this reason, pixel-wise uncertainty assessment is required for safety-critical applications to highlight the areas where the prediction is unreliable. We address this in a post hoc manner and introduce gradient-based uncertainty estimation for already trained depth estimation models. To extract gradients without depending on the ground truth depth, we introduce an auxiliary loss function based on the consistency of the predicted depth and a reference depth. The reference depth, which acts as pseudo ground truth, is in fact generated using a simple image or feature augmentation, making our approach simple and effective. To obtain the final uncertainty score, the derivatives w.r.t. the feature maps from single or multiple layers are calculated using back-propagation. We demonstrate that our gradient-based approach is effective in determining the uncertainty without re-training using the two standard depth estimation benchmarks KITTI and NYU. In particular, for models trained with monocular sequences and therefore most prone to uncertainty, our method outperforms related approaches. In addition, we publicly provide our code and models: this https URL
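按摘要思路,下面给出一个后处理式梯度不确定性的最小 PyTorch 示意:用水平翻转增强生成参考深度作为伪真值,计算 L1 一致性损失并反向传播,取中间特征图梯度的范数作为逐像素不确定性(`model`、`feature_module` 的接口与增强方式均为假设,非论文官方实现):

```python
import torch
import torch.nn.functional as F

def gradient_uncertainty(model, feature_module, image):
    # 捕获中间特征图并保留其梯度
    feats = []
    def hook(m, inp, out):
        if out.requires_grad:
            out.retain_grad()
            feats.append(out)
    handle = feature_module.register_forward_hook(hook)

    pred = model(image)   # 原图深度预测(带梯度)
    handle.remove()       # 第二次前向(生成参考深度)不再挂钩
    with torch.no_grad():
        # 水平翻转增强生成参考深度,充当伪真值
        ref = torch.flip(model(torch.flip(image, dims=[3])), dims=[3])

    loss = F.l1_loss(pred, ref)  # 辅助一致性损失,无需真值深度
    model.zero_grad()
    loss.backward()

    grad = feats[0].grad                  # 反向传播得到的特征图梯度
    unc = grad.norm(dim=1, keepdim=True)  # 对通道维取范数作为逐像素不确定性
    return F.interpolate(unc, size=image.shape[2:],
                         mode="bilinear", align_corners=False)
```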
zh
[CV-95] ClinKD: Cross-Modal Clinic Knowledge Distiller For Multi-Task Medical Images
【速读】:该论文旨在解决医疗视觉问答(Med-VQA)任务中的挑战,特别是在生物特征复杂性和高质量医学图像数据集稀缺的情况下,现有模型在架构和训练范式方面未针对医学领域进行优化所导致的问题。论文的关键解决方案在于引入ClinKD模型,该模型通过改进的位置编码(Med-CLIP Guided Rotary Position Embedding)和多样化的训练过程来提升模型对图像和模态变化的感知能力,并利用知识蒸馏提供先验知识。此外,在正式训练阶段采用基于反馈的训练过程进一步提高了数据利用率。这些改进使得ClinKD模型在Med-GRIT-270k数据集上达到了新的最先进性能。
链接: https://arxiv.org/abs/2502.05928
作者: Hongyu Ge,Longkun Hao,Zihui Xu,Zhenxin Lin,Bin Li,Shoujun Zhou,Hongjin Zhao,Yihang Liu
机构: unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Med-VQA (Medical Visual Question Answering) is a crucial subtask within the broader VQA (Visual Question Answering) domain. This task requires a visual question answering system to analyze the provided image and corresponding question, offering reasonable analysis and suggestions to assist medical professionals in making pathological diagnoses, or ideally, enabling the system to independently provide correct diagnoses. Furthermore, more advanced Med-VQA tasks involve Referring and Grounding, which not only require the system to accurately comprehend medical images but also to pinpoint specific biological locations within those images. While many large pre-trained models have demonstrated substantial VQA capabilities, challenges persist in the medical imaging domain. The intricacy of biological features in medical images and the scarcity of high-quality medical image datasets, combined with the fact that current models are not tailored for the medical field in terms of architecture and training paradigms, hinder the full exploitation of model generalization. This results in issues such as hallucination in Visual Grounding. In this paper, we introduce the ClinKD model, which incorporates modifications to model position encoding and a diversified training process. Initially, we enhance the model's ability to perceive image and modality variations by using Med-CLIP Guided Rotary Position Embedding. Subsequently, we leverage distillation to provide prior knowledge to the model before using complete training data. Additionally, the feedback-based training process during the formal training phase further enhances data utilization. Notably, under unchanged evaluation protocols, we achieve a new state-of-the-art performance on the Med-GRIT-270k dataset, and the Med-CLIP Guided Rotary Position Embedding approach presents potential for generalizing to universal model position encoding.
zh
[CV-96] Multi-Branch Collaborative Learning Network for Video Quality Assessment in Industrial Video Search KDD2025
【速读】:该论文旨在解决工业级视频检索系统中低质量视频识别的问题,特别是视觉相关问题(如马赛克和黑屏)、文本问题(来自视频标题和OCR内容)以及语义问题(如帧不连贯和帧文本不匹配等AI生成视频的问题)。学术界对这些问题的关注不足,导致准确识别困难。为了解决这一挑战,论文提出了一种名为多分支协作网络(Multi-Branch Collaborative Network, MBCN)的方法。MBCN的关键在于其四个专门设计的分支,每个分支针对一种特定类型的低质量视频问题进行评分。通过加权聚合方法和挤压激励机制,MBCN能够动态处理不同场景中的质量问题,并采用点对点和成对优化目标确保评分的稳定性和合理性。实验结果表明,MBCN在提升视频检索系统的排名性能方面具有显著效果。
链接: https://arxiv.org/abs/2502.05924
作者: Hengzhu Tang,Zefeng Zhang,Zhiping Li,Zhenyu Zhang,Xing Wu,Li Gao,Suqi Cheng,Dawei Yin
机构: Baidu Inc.(百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: KDD 2025 ADS
点击查看摘要
Abstract:Video Quality Assessment (VQA) is vital for large-scale video retrieval systems, aimed at identifying quality issues to prioritize high-quality videos. In industrial systems, low-quality video characteristics fall into four categories: visual-related issues like mosaics and black boxes, textual issues from video titles and OCR content, and semantic issues like frame incoherence and frame-text mismatch from AI-generated videos. Despite their prevalence in industrial settings, these low-quality videos have been largely overlooked in academic research, posing a challenge for accurate identification. To address this, we introduce the Multi-Branch Collaborative Network (MBCN) tailored for industrial video retrieval systems. MBCN features four branches, each designed to tackle one of the aforementioned quality issues. After each branch independently scores videos, we aggregate these scores using a weighted approach and a squeeze-and-excitation mechanism to dynamically address quality issues across different scenarios. We implement point-wise and pair-wise optimization objectives to ensure score stability and reasonableness. Extensive offline and online experiments on a world-level video search engine demonstrate MBCN’s effectiveness in identifying video quality issues, significantly enhancing the retrieval system’s ranking performance. Detailed experimental analyses confirm the positive contribution of all four evaluation branches. Furthermore, MBCN significantly improves recognition accuracy for low-quality AI-generated videos compared to the baseline.
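下面用 PyTorch 粗略示意 MBCN 的分数聚合思路:四个分支(视觉/标题/OCR/语义)各产出一个质量分数,再用挤压-激励(squeeze-and-excitation)式的门控生成动态权重做加权求和(分支内部结构与维度均为假设):

```python
import torch
import torch.nn as nn

class GatedScoreFusion(nn.Module):
    """示意 MBCN 的加权聚合:依据各分支分数动态生成权重(结构为假设)。"""
    def __init__(self, num_branches=4, hidden=16):
        super().__init__()
        self.gate = nn.Sequential(         # squeeze-and-excitation 风格门控
            nn.Linear(num_branches, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_branches),
            nn.Sigmoid(),
        )

    def forward(self, branch_scores):      # branch_scores: (B, 4)
        w = self.gate(branch_scores)       # 针对不同场景动态调整各分支权重
        return (w * branch_scores).sum(dim=1, keepdim=True)

scores = torch.rand(8, 4)                  # 四路分支分数
print(GatedScoreFusion()(scores).shape)    # torch.Size([8, 1])
```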
zh
[CV-97] QP-SNN: Quantized and Pruned Spiking Neural Networks ICLR2025
【速读】:该论文旨在解决将高性能脉冲神经网络(Spiking Neural Networks, SNNs)有效部署于资源受限边缘设备的问题。当前研究主要集中在通过开发大规模模型来提升性能,这限制了SNN在资源有限场景下的适用性。论文的关键解决方案在于提出了一种硬件友好且轻量级的SNN模型——QP-SNN。具体而言,论文首先开发了一个集成均匀量化和结构化剪枝的基线模型(QP-SNN baseline),尽管这一基线显著减少了存储需求和计算成本,但存在性能下降的问题。为了解决此问题,论文深入分析了量化和剪枝过程中导致性能下降的挑战,并提出了相应的改进策略:对于权重量化,提出了一个利用位宽更有效地增强模型表示能力的权重重缩放策略;对于结构化剪枝,则提出了一种基于时空脉冲活动奇异值的新颖剪枝准则,以实现冗余内核的更精确移除。实验结果表明,结合这两种方法的QP-SNN实现了最先进的性能和效率,凸显了其在边缘智能计算中增强SNN部署的潜力。
链接: https://arxiv.org/abs/2502.05905
作者: Wenjie Wei,Malu Zhang,Zijian Zhou,Ammar Belatreche,Yimeng Shan,Yu Liang,Honglin Cao,Jieyuan Zhang,Yang Yang
机构: University of Electronic Science and Technology of China (电子科技大学); Northumbria University (纽卡斯尔大学); Liaoning Technical University (辽宁工程技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 17 figures, Published as a conference paper at ICLR 2025
点击查看摘要
Abstract:Brain-inspired Spiking Neural Networks (SNNs) leverage sparse spikes to encode information and operate in an asynchronous event-driven manner, offering a highly energy-efficient paradigm for machine intelligence. However, the current SNN community focuses primarily on performance improvement by developing large-scale models, which limits the applicability of SNNs in resource-limited edge devices. In this paper, we propose a hardware-friendly and lightweight SNN, aimed at effectively deploying high-performance SNN in resource-limited scenarios. Specifically, we first develop a baseline model that integrates uniform quantization and structured pruning, called QP-SNN baseline. While this baseline significantly reduces storage demands and computational costs, it suffers from performance decline. To address this, we conduct an in-depth analysis of the challenges in quantization and pruning that lead to performance degradation and propose solutions to enhance the baseline’s performance. For weight quantization, we propose a weight rescaling strategy that utilizes bit width more effectively to enhance the model’s representation capability. For structured pruning, we propose a novel pruning criterion using the singular value of spatiotemporal spike activities to enable more accurate removal of redundant kernels. Extensive experiments demonstrate that integrating two proposed methods into the baseline allows QP-SNN to achieve state-of-the-art performance and efficiency, underscoring its potential for enhancing SNN deployment in edge intelligence computing.
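针对摘要中"利用时空脉冲活动奇异值的剪枝准则",下面给出一个概念性示意:把每个通道的脉冲活动展平成 T×(H·W) 矩阵,以奇异值之和(核范数)作为信息量评分,低分核视为冗余(评分细节为假设):

```python
import torch

def svd_pruning_scores(spikes):
    """spikes: (T, C, H, W) 的二值脉冲活动;逐通道计算核范数作为剪枝评分(示意)。"""
    T, C, H, W = spikes.shape
    scores = []
    for c in range(C):
        m = spikes[:, c].reshape(T, H * W).float()  # 时空活动矩阵
        s = torch.linalg.svdvals(m)                 # 奇异值
        scores.append(s.sum())                      # 核范数越大,信息量越高
    return torch.stack(scores)

spikes = (torch.rand(4, 8, 16, 16) > 0.8).float()
scores = svd_pruning_scores(spikes)
keep = scores.argsort(descending=True)[:6]          # 保留得分最高的 6 个通道
```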
zh
[CV-98] Fast Omni-Directional Image Super-Resolution: Adapting the Implicit Image Function with Pixel and Semantic-Wise Spherical Geometric Priors AAAI2025
【速读】:该论文旨在解决等矩投影(EquiRectangular Projection, ERP)导致的全向图像(Omni-Directional Image, ODI)超分辨率(Super-Resolution, SR)过程中非均匀过采样特性带来的独特挑战。解决方案的关键在于提出了一种新的快速任意尺度ODI-SR模型(FAOR),通过在潜在表示和图像重建阶段以低开销方式引入球面几何先验,将隐式图像函数从平面图像域适应到ERP图像域。具体而言,在潜在表示阶段,采用像素级和语义级球面到平面失真映射对潜在表示进行仿射变换,从而融合球面属性;在图像重建阶段,引入基于测地线的重采样策略,使隐式图像函数与球面几何相匹配而不引入额外参数。
链接: https://arxiv.org/abs/2502.05902
作者: Xuelin Shen,Yitong Wang,Silin Zheng,Kang Xiao,Wenhan Yang,Xu Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures, AAAI 2025
点击查看摘要
Abstract:In the context of Omni-Directional Image (ODI) Super-Resolution (SR), the unique challenge arises from the non-uniform oversampling characteristics caused by EquiRectangular Projection (ERP). Considerable efforts in designing complex spherical convolutions or polyhedron reprojection offer significant performance improvements but at the expense of cumbersome processing procedures and slower inference speeds. Under these circumstances, this paper proposes a new ODI-SR model characterized by its capacity to perform Fast and Arbitrary-scale ODI-SR processes, denoted as FAOR. The key innovation lies in adapting the implicit image function from the planar image domain to the ERP image domain by incorporating spherical geometric priors at both the latent representation and image reconstruction stages, in a low-overhead manner. Specifically, at the latent representation stage, we adopt a pair of pixel-wise and semantic-wise sphere-to-planar distortion maps to perform affine transformations on the latent representation, thereby incorporating it with spherical properties. Moreover, during the image reconstruction stage, we introduce a geodesic-based resampling strategy, aligning the implicit image function with spherical geometrics without introducing additional parameters. As a result, the proposed FAOR outperforms the state-of-the-art ODI-SR models with a much faster inference speed. Extensive experimental results and ablation studies have demonstrated the effectiveness of our design.
zh
[CV-99] Beyond Fine-Tuning: A Systematic Study of Sampling Techniques in Personalized Image Generation
【速读】:该论文旨在解决个性化文本到图像生成中概念保真度与生成多样性的平衡问题。现有方法主要通过不同的微调参数化和改进的采样策略来应对这一挑战,这些策略在扩散过程中整合了超级类轨迹。论文的关键解决方案在于提出了一种决策框架,该框架评估文本对齐、计算约束和保真目标,以指导采样策略的选择。此框架能够与不同架构和训练方法集成,系统性地优化概念保留、提示遵从性和资源效率。
链接: https://arxiv.org/abs/2502.05895
作者: Vera Soboleva,Maksim Nakhodnov,Aibek Alanov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The first two authors contributed equally
点击查看摘要
Abstract:Personalized text-to-image generation aims to create images tailored to user-defined concepts and textual descriptions. Balancing the fidelity of the learned concept with its ability for generation in various contexts presents a significant challenge. Existing methods often address this through diverse fine-tuning parameterizations and improved sampling strategies that integrate superclass trajectories during the diffusion process. While improved sampling offers a cost-effective, training-free solution for enhancing fine-tuned models, systematic analyses of these methods remain limited. Current approaches typically tie sampling strategies with fixed fine-tuning configurations, making it difficult to isolate their impact on generation outcomes. To address this issue, we systematically analyze sampling strategies beyond fine-tuning, exploring the impact of concept and superclass trajectories on the results. Building on this analysis, we propose a decision framework evaluating text alignment, computational constraints, and fidelity objectives to guide strategy selection. It integrates with diverse architectures and training approaches, systematically optimizing concept preservation, prompt adherence, and resource efficiency. The source code can be found at this https URL.
zh
[CV-100] MMGDreamer: Mixed-Modality Graph for Geometry-Controllable 3D Indoor Scene Generation AAAI2025
【速读】:该论文旨在解决可控3D场景生成在虚拟现实和室内设计中的应用需求,当前基于图的方法受限于文本输入且适应灵活用户输入的能力不足,导致难以精确控制物体几何。为了解决这一问题,论文提出MMGDreamer,这是一种双分支扩散模型,关键在于引入了混合模态图(Mixed-Modality Graph),视觉增强模块和关系预测器。混合模态图允许物体节点整合文本和视觉模态信息,并可选地包含节点间的关系,从而增强对灵活用户输入的适应性并实现对生成场景中物体几何的精细控制。视觉增强模块通过使用文本嵌入构造视觉表示来丰富仅含文本节点的视觉保真度。关系预测器则利用节点表示推断缺失的节点间关系,从而产生更连贯的场景布局。
链接: https://arxiv.org/abs/2502.05874
作者: Zhifei Yang,Keyang Lu,Chao Zhang,Jiaxing Qi,Hanqi Jiang,Ruifei Ma,Shenglin Yin,Yifan Xu,Mingzhe Xing,Zhen Xiao,Jieyi Long,Xiangde Liu,Guangyao Zhai
机构: Beijing Digital Native Digital City Research Center (北京数字原生数字城市研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by AAAI 2025 Main Track
点击查看摘要
Abstract:Controllable 3D scene generation has extensive applications in virtual reality and interior design, where the generated scenes should exhibit high levels of realism and controllability in terms of geometry. Scene graphs provide a suitable data representation that facilitates these applications. However, current graph-based methods for scene generation are constrained to text-based inputs and exhibit insufficient adaptability to flexible user inputs, hindering the ability to precisely control object geometry. To address this issue, we propose MMGDreamer, a dual-branch diffusion model for scene generation that incorporates a novel Mixed-Modality Graph, visual enhancement module, and relation predictor. The mixed-modality graph allows object nodes to integrate textual and visual modalities, with optional relationships between nodes. It enhances adaptability to flexible user inputs and enables meticulous control over the geometry of objects in the generated scenes. The visual enhancement module enriches the visual fidelity of text-only nodes by constructing visual representations using text embeddings. Furthermore, our relation predictor leverages node representations to infer absent relationships between nodes, resulting in more coherent scene layouts. Extensive experimental results demonstrate that MMGDreamer exhibits superior control of object geometry, achieving state-of-the-art scene generation performance. Project page: this https URL.
zh
[CV-101] HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition
【速读】:该论文旨在解决基于骨架的人体动作识别中Transformer模型的二次计算复杂度问题,以及现有线性注意力机制难以捕捉骨架数据层次结构的问题。为了解决这些问题,论文提出了一种新的方法HyLiFormer,关键在于引入了Hyperbolic Transformation with Curvatures (HTC)模块将骨架数据映射到双曲空间,并使用Hyperbolic Linear Attention (HLA)模块进行高效的长距离依赖建模。这种方案不仅显著降低了计算复杂度,同时保持了模型精度。
链接: https://arxiv.org/abs/2502.05869
作者: Yue Li,Haoxuan Qu,Mengyuan Liu,Jun Liu,Yujun Cai
机构: Sun Yat-sen University(中山大学); Lancaster University(兰卡斯特大学); Peking University(北京大学); University of Queensland(昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Transformers have demonstrated remarkable performance in skeleton-based human action recognition, yet their quadratic computational complexity remains a bottleneck for real-world applications. To mitigate this, linear attention mechanisms have been explored but struggle to capture the hierarchical structure of skeleton data. Meanwhile, the Poincaré model, as a typical hyperbolic geometry, offers a powerful framework for modeling hierarchical structures but lacks well-defined operations for existing mainstream linear attention. In this paper, we propose HyLiFormer, a novel hyperbolic linear attention Transformer tailored for skeleton-based action recognition. Our approach incorporates a Hyperbolic Transformation with Curvatures (HTC) module to map skeleton data into hyperbolic space and a Hyperbolic Linear Attention (HLA) module for efficient long-range dependency modeling. Theoretical analysis and extensive experiments on NTU RGB+D and NTU RGB+D 120 datasets demonstrate that HyLiFormer significantly reduces computational complexity while preserving model accuracy, making it a promising solution for efficiency-critical applications.
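下面用 PyTorch 示意 HTC + HLA 的组合:先用 Poincaré 球原点处的指数映射把骨架特征嵌入双曲空间,再做核化的线性注意力(曲率固定、核函数取 elu+1,均为常见简化假设,非论文原实现):

```python
import torch

def exp_map_poincare(x, c=1.0):
    """原点处的指数映射,把欧氏特征嵌入曲率为 -c 的 Poincaré 球(简化的 HTC)。"""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.tanh((c ** 0.5) * norm) * x / ((c ** 0.5) * norm)

def linear_attention(q, k, v):
    """核化线性注意力:phi(q) (phi(k)^T v),复杂度随序列长度线性增长(简化的 HLA)。"""
    phi = lambda t: torch.nn.functional.elu(t) + 1
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                   # 先聚合 K^T V
    z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

x = torch.randn(2, 25, 64)        # (batch, 关节数, 特征维)
h = exp_map_poincare(x)           # HTC:映射到双曲空间
out = linear_attention(h, h, h)   # HLA:线性复杂度的长程依赖建模
```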
zh
[CV-102] SphereFusion: Efficient Panorama Depth Estimation via Gated Fusion
【速读】:该论文旨在解决全景深度估计中的挑战,特别是在不同投影格式(如等距柱状投影、立方体贴图投影、切线投影和球面投影)下所遇到的失真、不连续性和纹理细节丢失等问题。解决方案的关键在于提出了一种名为SphereFusion的端到端框架,该框架结合了多种投影方法的优势。SphereFusion通过在等距柱状投影和球面投影域中提取两种不同的特征,并利用门控融合模块选择最可靠的特征进行融合,最终在球面域内估计全景深度。此外,SphereFusion采用缓存策略以提高网格操作的效率。
链接: https://arxiv.org/abs/2502.05859
作者: Qingsong Yan,Qiang Wang,Kaiyong Zhao,Jie Chen,Bo Li,Xiaowen Chu,Fei Deng
机构: Wuhan University (武汉大学); HIT (Shenzhen) (哈尔滨工业大学(深圳)); XGRIDS (深圳); HKBU (香港浸会大学); HKUST (香港科技大学); HKUST (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3DV 2025
点击查看摘要
Abstract:Due to the rapid development of panorama cameras, the task of estimating panorama depth has attracted significant attention from the computer vision community, especially in applications such as robot sensing and autonomous driving. However, existing methods relying on different projection formats often encounter challenges, either struggling with distortion and discontinuity in the case of equirectangular, cubemap, and tangent projections, or experiencing a loss of texture details with the spherical projection. To tackle these concerns, we present SphereFusion, an end-to-end framework that combines the strengths of various projection methods. Specifically, SphereFusion initially employs 2D image convolution and mesh operations to extract two distinct types of features from the panorama image in both equirectangular and spherical projection domains. These features are then projected onto the spherical domain, where a gate fusion module selects the most reliable features for fusion. Finally, SphereFusion estimates panorama depth within the spherical domain. Meanwhile, SphereFusion employs a cache strategy to improve the efficiency of mesh operation. Extensive experiments on three public panorama datasets demonstrate that SphereFusion achieves competitive results with other state-of-the-art methods, while presenting the fastest inference speed at only 17 ms on a 512×1024 panorama image.
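门控融合模块可以用如下最小示意理解:对等距柱状(ERP)分支与球面分支的顶点特征逐元素学习一个选择门,按可靠性在两路特征间插值(结构与维度均为假设):

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """示意 SphereFusion 的门控融合:在两路特征间学习逐元素选择门(假设结构)。"""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv1d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_erp, f_sphere):    # 均为 (B, C, N),N 为球面顶点数
        g = self.gate(torch.cat([f_erp, f_sphere], dim=1))
        return g * f_erp + (1 - g) * f_sphere  # 门值越大越信任 ERP 分支

fuse = GateFusion(32)
out = fuse(torch.randn(1, 32, 2048), torch.randn(1, 32, 2048))
print(out.shape)  # torch.Size([1, 32, 2048])
```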
zh
[CV-103] Acquisition through My Eyes and Steps: A Joint Predictive Agent Model in Egocentric Worlds
【速读】:该论文旨在解决在第一人称视角环境中,使代理模型能够像人类一样同时感知、预测和行动的问题。现有方法通常为这三种能力分别训练独立模型,导致信息孤岛,阻碍了这些能力之间的相互学习与有效协作。论文提出的关键解决方案是EgoAgent,这是一种联合预测代理模型,采用单一Transformer同时学习表征世界、预测未来状态及采取合理行动。通过将这三种能力映射到连续标记序列,并添加可学习查询标记来获取当前状态、未来状态和下一步动作,利用联合监督机制,EgoAgent建立了这三种能力之间的内在联系,有效模拟了人类的推理和学习过程。
链接: https://arxiv.org/abs/2502.05857
作者: Lu Chen,Yizhou Wang,Shixiang Tang,Qianhong Ma,Tong He,Wanli Ouyang,Xiaowei Zhou,Hujun Bao,Sida Peng
机构: State Key Lab of CAD&CG, Zhejiang University(浙江大学 CAD&CG国家重点实验室); The Chinese University of Hong Kong(香港中文大学); Shanghai Jiao Tong University(上海交通大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper addresses the task of learning an agent model behaving like humans, which can jointly perceive, predict, and act in egocentric worlds. Previous methods usually train separate models for these three abilities, leading to information silos among them, which prevents these abilities from learning from each other and collaborating effectively. In this paper, we propose a joint predictive agent model, named EgoAgent, that simultaneously learns to represent the world, predict future states, and take reasonable actions with a single transformer. EgoAgent unifies the representational spaces of the three abilities by mapping them all into a sequence of continuous tokens. Learnable query tokens are appended to obtain current states, future states, and next actions. With joint supervision, our agent model establishes the internal relationship among these three abilities and effectively mimics the human inference and learning processes. Comprehensive evaluations of EgoAgent covering image classification, egocentric future state prediction, and 3D human motion prediction tasks demonstrate the superiority of our method. The code and trained model will be released for reproducibility.
zh
[CV-104] DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
【速读】:该论文旨在解决机器人在多样化环境中执行复杂长期任务的能力不足问题,特别是现有视觉-语言-动作(Vision-Language-Action, VLA)模型在动作表示和高效训练方面的局限性。论文的关键解决方案在于引入DexVLA框架,该框架通过采用基于扩散的动作专家(diffusion-based action expert)和新的身体形态课程学习策略(embodiment curriculum learning strategy),实现了跨身体形态的高效预训练及快速适应新任务的能力。这一方案使得DexVLA能够在不同类型的机器人上无需特定任务调整的情况下完成挑战性任务,并能够使用直接的语言提示(如折叠衣物)来完成复杂的长期任务。
链接: https://arxiv.org/abs/2502.05855
作者: Junjie Wen,Yichen Zhu,Jinming Li,Zhibin Tang,Chaomin Shen,Feifei Feng
机构: Midea Group; East China Normal University
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: The webpage is at this https URL
点击查看摘要
Abstract:Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion expert that is separable from the VLA on cross-embodiment data, (2) aligning the VLA model to specific embodiments, and (3) post-training for rapid adaptation to new tasks. We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous hand, demonstrating DexVLA’s adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks using only direct language prompting, such as laundry folding. In all settings, our method demonstrates superior performance compared to state-of-the-art models like Octo, OpenVLA, and Diffusion Policy.
zh
[CV-105] raining-free Anomaly Event Detection via LLM -guided Symbolic Pattern Discovery
【速读】:该论文旨在解决异常事件检测领域中依赖监督学习方法所面临的挑战,包括对大量标注数据的需求以及决策过程缺乏可解释性。解决方案的关键在于提出了一种无需训练的框架,该框架结合了开放集目标检测与符号回归,并利用大规模语言模型(LLMs)进行高效的符号模式发现,从而实现直接推理而无需训练过程,提供易于理解的逻辑表达,并大幅减少标注需求。
链接: https://arxiv.org/abs/2502.05843
作者: Yuhui Zeng,Haoxiang Wu,Wenjie Nie,Guangyao Chen,Xiawu Zheng,Yunhang Shen,Guilin Li,Yixiong Zou,Yonghong Tian,Rongrong Ji
机构: Xiamen University (厦门大学); Peking University (北京大学); Tencent Youtu Lab (腾讯优图实验室); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Anomaly event detection plays a crucial role in various real-world applications. However, current approaches predominantly rely on supervised learning, which faces significant challenges: the requirement for extensive labeled training data and lack of interpretability in decision-making processes. To address these limitations, we present a training-free framework that integrates open-set object detection with symbolic regression, powered by Large Language Models (LLMs) for efficient symbolic pattern discovery. The LLMs guide the symbolic reasoning process, establishing logical relationships between detected entities. Through extensive experiments across multiple domains, our framework demonstrates several key advantages: (1) achieving superior detection accuracy through direct reasoning without any training process; (2) providing highly interpretable logical expressions that are readily comprehensible to humans; and (3) requiring minimal annotation effort - approximately 1% of the data needed by traditional training-based methods. To facilitate comprehensive evaluation and future research, we introduce two datasets: a large-scale private dataset containing over 110,000 annotated images covering various anomaly scenarios including construction site safety violations, illegal fishing activities, and industrial hazards, along with a public benchmark dataset of 5,000 samples with detailed anomaly event annotations. Code is available here.
zh
[CV-106] Contrastive Representation Distillation via Multi-Scale Feature Decoupling
【速读】:该论文旨在解决知识蒸馏过程中仅关注全局特征信息而忽视不同区域特征中嵌入的多样化信息的问题。关键解决方案在于引入多尺度解耦机制,在特征传递过程中将局部特征解耦并分别处理,结合对比学习进行整合。此方法不仅降低了计算成本,还提升了效率,使得学生网络仅使用单批次样本即可实现性能提升,并在CIFAR-100和ImageNet数据集上验证了其优越性,部分学生网络的表现甚至超越了预训练教师网络。
链接: https://arxiv.org/abs/2502.05835
作者: Cuipeng Wang,Tieyuan Chen,Haipeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Knowledge distillation is a technique aimed at enhancing the performance of a smaller student network without increasing its parameter size by transferring knowledge from a larger, pre-trained teacher network. Previous approaches have predominantly focused on distilling global feature information while overlooking the importance of disentangling the diverse types of information embedded within different regions of the feature. In this work, we introduce multi-scale decoupling in the feature transfer process for the first time, where the decoupled local features are individually processed and integrated with contrastive learning. Moreover, compared to previous contrastive learning-based distillation methods, our approach not only reduces computational costs but also enhances efficiency, enabling performance improvements for the student network using only single-batch samples. Extensive evaluations on CIFAR-100 and ImageNet demonstrate our method’s superiority, with some student networks distilled using our method even surpassing the performance of their pre-trained teacher networks. These results underscore the effectiveness of our approach in enabling student networks to thoroughly absorb knowledge from teacher networks.
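下面给出多尺度解耦 + 对比蒸馏的一个概念性 PyTorch 示意:把学生/教师特征图按不同网格池化成局部块,师生对应块互为正样本、同一 batch 内其余块为负样本,用 InfoNCE 聚合各尺度损失(假设师生特征已投影到同一维度,具体损失形式为示意):

```python
import torch
import torch.nn.functional as F

def multiscale_contrastive_loss(f_s, f_t, scales=(1, 2, 4), tau=0.1):
    """f_s, f_t: 学生/教师特征图 (B, C, H, W),假设通道维已对齐。"""
    loss = 0.0
    for s in scales:
        # 按 s x s 网格池化,解耦出不同区域的局部特征
        zs = F.adaptive_avg_pool2d(f_s, s).flatten(2).transpose(1, 2)  # (B, s*s, C)
        zt = F.adaptive_avg_pool2d(f_t, s).flatten(2).transpose(1, 2)
        zs = F.normalize(zs.reshape(-1, zs.size(-1)), dim=-1)
        zt = F.normalize(zt.reshape(-1, zt.size(-1)), dim=-1)
        logits = zs @ zt.t() / tau          # 对角线为师生对应块(正样本)
        labels = torch.arange(zs.size(0), device=zs.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(scales)

loss = multiscale_contrastive_loss(torch.randn(4, 128, 8, 8),
                                   torch.randn(4, 128, 8, 8))
```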
zh
[CV-107] Compressing Model with Few Class-Imbalance Samples: An Out-of-Distribution Expedition
【速读】:该论文旨在解决在极有限样本条件下因类别不平衡导致的模型压缩性能下降问题。解决方案的关键在于提出了一种名为OOD-Enhanced Few-Sample Model Compression (OOD-增强型少样本模型压缩, OE-FSMC) 的新框架,该框架通过将容易获取的分布外 (Out-of-Distribution, OOD) 数据融入压缩和微调过程,有效地重新平衡了训练分布,并通过引入联合蒸馏损失和正则化项来降低模型过拟合到OOD数据的风险。
链接: https://arxiv.org/abs/2502.05832
作者: Tian-Shuang Wu,Shen-Huan Lyu,Ning Chen,Zhihao Qu,Baoliu Ye
机构: Key Laboratory of Water Big Data Technology of Ministry of Water Resources, College of Computer Science and Software Engineering, Hohai University (河海大学), Nanjing, China; State Key Laboratory for Novel Software Technology, Nanjing University (南京大学), Nanjing, China
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, as a compromise between privacy and performance, few-sample model compression has been widely adopted to deal with limited data resulting from privacy and security concerns. However, when the number of available samples is extremely limited, class imbalance becomes a common and tricky problem. Achieving an equal number of samples across all classes is often costly and impractical in real-world applications, and previous studies on few-sample model compression have mostly ignored this significant issue. Our experiments comprehensively demonstrate that class imbalance negatively affects the overall performance of few-sample model compression methods. To address this problem, we propose a novel and adaptive framework named OOD-Enhanced Few-Sample Model Compression (OE-FSMC). This framework integrates easily accessible out-of-distribution (OOD) data into both the compression and fine-tuning processes, effectively rebalancing the training distribution. We also incorporate a joint distillation loss and a regularization term to reduce the risk of the model overfitting to the OOD data. Extensive experiments on multiple benchmark datasets show that our framework can be seamlessly incorporated into existing few-sample model compression methods, effectively mitigating the accuracy degradation caused by class imbalance.
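OE-FSMC 的联合目标可以粗略示意如下:分布内少样本用交叉熵 + 蒸馏,OOD 样本仅做蒸馏以重新平衡训练分布,并用一个朝向均匀分布的正则项抑制对 OOD 的过拟合(各项权重与形式均为假设):

```python
import torch
import torch.nn.functional as F

def oe_fsmc_loss(student, teacher, x_in, y_in, x_ood,
                 lam=0.5, beta=0.1, T=4.0):
    """示意性联合损失:x_in/y_in 为类别不平衡的少量分布内样本,x_ood 为 OOD 样本。"""
    def kd(s, t):  # 温度蒸馏
        return F.kl_div(F.log_softmax(s / T, 1), F.softmax(t / T, 1),
                        reduction="batchmean") * T * T

    s_in, s_ood = student(x_in), student(x_ood)
    with torch.no_grad():
        t_in, t_ood = teacher(x_in), teacher(x_ood)

    loss = F.cross_entropy(s_in, y_in) + kd(s_in, t_in) + lam * kd(s_ood, t_ood)
    # 正则项:拉近学生在 OOD 上的输出与均匀分布,降低过拟合 OOD 的风险
    uniform = torch.full_like(s_ood, 1.0 / s_ood.size(1))
    reg = F.kl_div(F.log_softmax(s_ood, 1), uniform, reduction="batchmean")
    return loss + beta * reg
```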
zh
[CV-108] Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue
【速读】:该论文致力于解决在目标导向视觉对话中,现有方法缺乏明确策略引导问题生成,导致搜索过程随机且结果不收敛的问题。关键解决方案在于提出了一种树状结构策略与答案分布估计器(Tree-Structured Strategy with Answer Distribution Estimator, TSADE),通过每轮排除当前候选对象的一半来指导问题生成,并采用最大化二元奖励的方法实现。此外,设计了一个候选对象最小化奖励,以促使模型在对话后期缩小候选对象范围。这种方法使得代理能够在较少的重复问题和轮次中实现更高的任务导向准确性。
链接: https://arxiv.org/abs/2502.05806
作者: Shuo Cai,Xinzhe Han,Shuhui Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Goal-oriented visual dialogue involves multi-round interaction between artificial agents, which has been of remarkable attention due to its wide applications. Given a visual scene, this task occurs when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of questions affects the accuracy and efficiency of the target search progress. However, existing methods lack a clear strategy to guide the generation of questions, resulting in the randomness in the search process and inconvergent results. We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE) which guides the question generation by excluding half of the current candidate objects in each round. The above process is implemented by maximizing a binary reward inspired by the "divide-and-conquer" paradigm. We further design a candidate-minimization reward which encourages the model to narrow down the scope of candidate objects toward the end of the dialogue. We experimentally demonstrate that our method can enable the agents to achieve high task-oriented accuracy with fewer repeating questions and rounds compared to traditional ergodic question generation approaches. Qualitative results further show that TSADE facilitates agents to generate higher-quality questions.
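"二分"奖励与候选最小化奖励可用如下极简函数示意(系数与组合方式均为假设):一轮提问若排除了至少一半候选对象则得到二值奖励 1,另按候选集压缩比例给予附加奖励:

```python
def tsade_reward(n_before, n_after, done, success, alpha=0.1):
    """示意 TSADE 的奖励设计(非论文原式):
    n_before/n_after 为本轮提问前后的候选对象数量。"""
    binary = 1.0 if n_after <= n_before / 2 else 0.0   # "分而治之"的二值奖励
    shrink = alpha * (n_before - n_after) / max(n_before, 1)  # 候选最小化奖励
    final = 1.0 if (done and success) else 0.0         # 对话结束且找对目标
    return binary + shrink + final

print(tsade_reward(8, 3, done=False, success=False))   # 1.0625
```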
zh
[CV-109] MicroViT: A Vision Transformer with Low Complexity Self Attention for Edge Device
【速读】:该论文旨在解决Vision Transformer (ViT)在边缘设备上由于高计算需求而难以实用的问题。解决方案的关键在于提出MicroViT,这是一种轻量级的ViT架构,通过引入高效的单头注意力机制(Efficient Single Head Attention, ESHA)显著降低了计算复杂度,同时保持了较高的准确性。ESHA机制利用组卷积减少特征冗余,并仅处理部分通道,从而减轻自注意力机制的负担。
链接: https://arxiv.org/abs/2502.05800
作者: Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh
机构: Department of Electro-Optics Engineering, National Formosa University (国立华梵大学光电工程学系), Taiwan; Department of Electrical Engineering, National Taipei University (台北市立大学电机工程学系), Taiwan; Department of Electrical Engineering, University of Muhammadiyah Malang (印尼穆罕默迪亚马拉大学电机工程学系), Indonesia; College of Artificial Intelligence and Green Energy, National Yang Ming Chiao Tung University (阳明交通大学人工智能与绿能学院), Taiwan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The Vision Transformer (ViT) has demonstrated state-of-the-art performance in various computer vision tasks, but its high computational demands make it impractical for edge devices with limited resources. This paper presents MicroViT, a lightweight Vision Transformer architecture optimized for edge devices by significantly reducing computational complexity while maintaining high accuracy. The core of MicroViT is the Efficient Single Head Attention (ESHA) mechanism, which utilizes group convolution to reduce feature redundancy and processes only a fraction of the channels, thus lowering the burden of the self-attention mechanism. MicroViT is designed using a multi-stage MetaFormer architecture, stacking multiple MicroViT encoders to enhance efficiency and performance. Comprehensive experiments on the ImageNet-1K and COCO datasets demonstrate that MicroViT achieves competitive accuracy while delivering 3.6× faster inference speed and reducing energy consumption, with 40% higher efficiency than the MobileViT series, making it suitable for deployment in resource-constrained environments such as mobile and edge devices.
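ESHA 的核心思想可以用下面的 PyTorch 示意理解:分组卷积先压缩特征冗余,随后只对约 1/4 的通道做单头自注意力,其余通道直接保留并拼接(通道比例与具体结构为假设):

```python
import torch
import torch.nn as nn

class ESHA(nn.Module):
    """示意 Efficient Single Head Attention(比例与结构为假设)。"""
    def __init__(self, dim, ratio=4, groups=4):
        super().__init__()
        self.attn_dim = dim // ratio
        self.reduce = nn.Conv2d(dim, dim, 3, padding=1, groups=groups)  # 分组卷积压缩冗余
        self.qkv = nn.Conv2d(self.attn_dim, self.attn_dim * 3, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                              # x: (B, C, H, W)
        x = self.reduce(x)
        a, rest = x.split([self.attn_dim, x.size(1) - self.attn_dim], dim=1)
        B, C, H, W = a.shape
        q, k, v = self.qkv(a).flatten(2).chunk(3, dim=1)          # 各 (B, C, HW)
        attn = ((q.transpose(1, 2) @ k) / C ** 0.5).softmax(-1)   # 单头注意力 (B, HW, HW)
        a = (v @ attn.transpose(1, 2)).reshape(B, C, H, W)        # 只处理部分通道
        return self.proj(torch.cat([a, rest], dim=1))             # 与恒等通道拼接

print(ESHA(64)(torch.randn(1, 64, 14, 14)).shape)  # torch.Size([1, 64, 14, 14])
```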
zh
[CV-110] EPBC-YOLOv8: An efficient and accurate improved YOLOv8 underwater detector based on an attention mechanism
【速读】:该论文旨在解决水下目标检测中的图像退化问题,以提高检测精度。关键解决方案在于将通道和空间注意力机制集成到YOLOv8的主干网络中,并在FasterNeXt中应用点态卷积以构建FasterPW模型。此外,通过改进的跨尺度连接和鲁棒性,在受BiFPN启发的WFPN结构中采用加权拼接(Weighted Concat)。同时,利用CARAFE进行精细特征重组,从而实现了在URPC2019和URPC2020数据集上的mAP@0.5分别为76.7%和79.0%,较原始YOLOv8分别提升了2.3%和0.7%,显著增强了对海洋生物检测的准确性。
链接: https://arxiv.org/abs/2502.05788
作者: Xing Jiang,Xiting Zhuang,Jisheng Chen,Jian Zhang
机构: Hainan University (海南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this study, we enhance underwater target detection by integrating channel and spatial attention into YOLOv8’s backbone, applying Pointwise Convolution in FasterNeXt for the FasterPW model, and leveraging Weighted Concat in a BiFPN-inspired WFPN structure for improved cross-scale connections and robustness. Utilizing CARAFE for refined feature reassembly, our framework addresses underwater image degradation, achieving mAP at 0.5 scores of 76.7 percent and 79.0 percent on URPC2019 and URPC2020 datasets, respectively. These scores are 2.3 percent and 0.7 percent higher than the original YOLOv8, showcasing enhanced precision in detecting marine organisms.
zh
[CV-111] A 3D Multimodal Feature for Infrastructure Anomaly Detection
【速读】:该论文旨在解决在老化结构中识别微小裂缝和水侵入等缺陷的难题。解决方案的关键在于提出了一种新颖的三维多模态特征(3DMulti-FPFHI),它结合了定制的快速点特征直方图(Fast Point Feature Histogram, FPFH)与强度特征,并将其集成到PatchCore异常检测算法中。实验结果表明,3DMulti-FPFHI显著提升了裂缝检测的质量,并能够识别因水侵入引起的强度异常,其性能优于单独使用FPFH或现有的多模态异常检测方法。
链接: https://arxiv.org/abs/2502.05779
作者: Yixiong Jing,Wei Lin,Brian Sheil,Sinan Acikgoz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Ageing structures require periodic inspections to identify structural defects. Previous work has used geometric distortions to locate cracks in synthetic masonry bridge point clouds but has struggled to detect small cracks. To address this limitation, this study proposes a novel 3D multimodal feature, 3DMulti-FPFHI, that combines a customized Fast Point Feature Histogram (FPFH) with an intensity feature. This feature is integrated into the PatchCore anomaly detection algorithm and evaluated through statistical and parametric analyses. The method is further evaluated using point clouds of a real masonry arch bridge and a full-scale experimental model of a concrete tunnel. Results show that the 3D intensity feature enhances inspection quality by improving crack detection; it also enables the identification of water ingress which introduces intensity anomalies. The 3DMulti-FPFHI outperforms FPFH and a state-of-the-art multimodal anomaly detection method. The potential of the method to address diverse infrastructure anomaly detection scenarios is highlighted by the minimal requirements for data compared to learning-based methods. The code and related point cloud dataset are available at this https URL.
zh
[CV-112] Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails
【速读】:该论文旨在解决针对视觉大型语言模型(Vision Large Language Models, VLLMs)的安全防御系统在面对复杂对抗性攻击时效果有限的问题。论文的关键在于提出了一种名为MultiFaceted Attack的新攻击框架,该框架通过三种互补的攻击手段——视觉攻击(Visual Attack)、对齐破坏攻击(Alignment Breaking Attack)以及对抗性签名(Adversarial Signature),系统性地绕过了现有多层次防御机制,从而显著提升了攻击成功率。
链接: https://arxiv.org/abs/2502.05772
作者: Yijun Yang,Lichao Wang,Xiao Yang,Lanqing Hong,Jun Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision Large Language Models (VLLMs) integrate visual data processing, expanding their real-world applications, but also increasing the risk of generating unsafe responses. In response, leading companies have implemented Multi-Layered safety defenses, including alignment training, safety system prompts, and content moderation. However, their effectiveness against sophisticated adversarial attacks remains largely unexplored. In this paper, we propose MultiFaceted Attack, a novel attack framework designed to systematically bypass Multi-Layered Defenses in VLLMs. It comprises three complementary attack facets: Visual Attack that exploits the multimodal nature of VLLMs to inject toxic system prompts through images; Alignment Breaking Attack that manipulates the model’s alignment mechanism to prioritize the generation of contrasting responses; and Adversarial Signature that deceives content moderators by strategically placing misleading information at the end of the response. Extensive evaluations on eight commercial VLLMs in a black-box setting demonstrate that MultiFaceted Attack achieves a 61.56% attack success rate, surpassing state-of-the-art methods by at least 42.18%.
zh
[CV-113] Digital Twin Buildings: 3D Modeling GIS Integration and Visual Descriptions Using Gaussian Splatting ChatGPT /Deepseek and Google Maps Platforms
【速读】:该论文旨在解决单个建筑物层面的城市数字孪生建模与数据集成问题。解决方案的关键在于通过连接云地图平台API、利用先进的多智能体大型语言模型(如ChatGPT(4o) 和Deepseek-V3/R1)进行数据分析,以及采用基于高斯点 splatting 的网格提取管道,从而能够根据建筑物的地址、邮政编码或地理坐标检索其三维模型及视觉描述,并实现基于大型语言模型的数据分析与云端地图集成。
链接: https://arxiv.org/abs/2502.05769
作者: Kyle Gao,Dening Lu,Liangzhi Li,Nan Chen,Hongjie He,Linlin Xu,Jonathan Li
机构: Department of Systems Design Engineering, University of Waterloo, Canada (滑铁卢大学系统设计工程系); Department of Geography and Environmental Management, University of Waterloo, Canada (滑铁卢大学地理与环境管理系); School of Computer Science, Xi’an Aeronautical University, China (西安航空学院计算机科学学院); College of Land Engineering, Chang’an University, China (长安大学土地工程学院); Department of Geomatics Engineering, University of Calgary, Canada (卡尔加里大学测绘工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Urban digital twins are virtual replicas of cities that use multi-source data and data analytics to optimize urban planning, infrastructure management, and decision-making. Towards this, we propose a framework focused on the single-building scale. By connecting to cloud mapping platforms such as Google Map Platforms APIs, by leveraging state-of-the-art multi-agent Large Language Models data analysis using ChatGPT(4o) and Deepseek-V3/R1, and by using our Gaussian Splatting-based mesh extraction pipeline, our Digital Twin Buildings framework can retrieve a building’s 3D model, visual descriptions, and achieve cloud-based mapping integration with large language model-based data analytics using a building’s address, postal code, or geographic coordinates.
zh
[CV-114] 3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly AAAI2025
【速读】:该论文旨在解决现有工业异常检测数据集在缺陷样本数量、缺陷类型及真实场景可用性方面的局限性,从而限制了检测性能进一步提升的问题。为解决这一问题,论文提出了一个名为3CAD的新大规模异常检测数据集,并引入了一种简单而有效的无监督异常检测框架——粗到细检测范式结合恢复引导(Coarse-to-Fine detection paradigm with Recovery Guidance, CFRG)。关键在于利用异构蒸馏模型进行粗定位,并通过分割模型实现精确定位,同时引入恢复特征作为引导以更好地捕捉正常模式。
链接: https://arxiv.org/abs/2502.05761
作者: Enquan Yang,Peng Xing,Hanyang Sun,Wenbo Guo,Yuanwei Ma,Zechao Li,Dan Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025, github: this https URL
点击查看摘要
Abstract:Industrial anomaly detection achieves progress thanks to datasets such as MVTec-AD and VisA. However, they suffer from limitations in terms of the number of defect samples, types of defects, and availability of real-world scenes. These constraints inhibit researchers from further exploring the performance of industrial detection with higher accuracy. To this end, we propose a new large-scale anomaly detection dataset called 3CAD, which is derived from real 3C production lines. Specifically, the proposed 3CAD includes eight different types of manufactured parts, totaling 27,039 high-resolution images labeled with pixel-level anomalies. The key features of 3CAD are that it covers anomalous regions of different sizes, multiple anomaly types, and the possibility of multiple anomalous regions and multiple anomaly types per anomaly image. This is the largest and first anomaly detection dataset dedicated to 3C product quality control for community exploration and development. Meanwhile, we introduce a simple yet effective framework for unsupervised anomaly detection: a Coarse-to-Fine detection paradigm with Recovery Guidance (CFRG). To detect small defect anomalies, the proposed CFRG utilizes a coarse-to-fine detection paradigm. Specifically, we utilize a heterogeneous distillation model for coarse localization and then fine localization through a segmentation model. In addition, to better capture normal patterns, we introduce recovery features as guidance. Finally, we report the results of our CFRG framework and popular anomaly detection methods on the 3CAD dataset, demonstrating strong competitiveness and providing a highly challenging benchmark to promote the development of the anomaly detection field. Data and code are available: this https URL.
zh
[CV-115] Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces AAAI2025
【速读】:该论文旨在评估Vision Transformer (ViT)模型在检测来自在线市场(如Craigslist和OfferUp)的汽车零部件图像中潜在非法活动模式的能力。关键解决方案在于通过提取高维视觉嵌入,利用Uniform Manifold Approximation and Projection (UMAP)进行降维可视化,并采用K-Means聚类算法对相似物品进行分类。此方法通过识别每个聚类中心附近的代表性帖子,揭示了不同聚类的构成和特征。虽然ViT在识别视觉模式方面表现出色,但重叠聚类和异常值的存在也凸显了单一模态方法在此领域的局限性。
链接: https://arxiv.org/abs/2502.05756
作者: Cameron Armijo,Pablo Rivas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: AAAI 2025 Workshop on AI for Social Impact: Bridging Innovations in Finance, Social Media, and Crime Prevention
点击查看摘要
Abstract:This study examines the capabilities of the Vision Transformer (ViT) model in generating visual embeddings for images of auto parts sourced from online marketplaces, such as Craigslist and OfferUp. By focusing exclusively on single-modality data, the analysis evaluates ViT’s potential for detecting patterns indicative of illicit activities. The workflow involves extracting high-dimensional embeddings from images, applying dimensionality reduction techniques like Uniform Manifold Approximation and Projection (UMAP) to visualize the embedding space, and using K-Means clustering to categorize similar items. Representative posts nearest to each cluster centroid provide insights into the composition and characteristics of the clusters. While the results highlight the strengths of ViT in isolating visual patterns, challenges such as overlapping clusters and outliers underscore the limitations of single-modal approaches in this domain. This work contributes to understanding the role of Vision Transformers in analyzing online marketplaces and offers a foundation for future advancements in detecting fraudulent or illegal activities.
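文中的分析流水线(ViT 嵌入 → UMAP 降维可视化 → K-Means 聚类)可以用 transformers、umap-learn 与 scikit-learn 粗略复现如下(模型名、聚类数与文件路径均为示意假设;UMAP 实际运行需要足够数量的样本):

```python
import torch
import umap
from PIL import Image
from sklearn.cluster import KMeans
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224").eval()

def embed(images):
    # 取 [CLS] token 作为整图的高维视觉嵌入
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[:, 0].numpy()

paths = ["part1.jpg", "part2.jpg"]  # 假设的帖子图片路径
images = [Image.open(p).convert("RGB") for p in paths]
emb = embed(images)
emb2d = umap.UMAP(n_components=2).fit_transform(emb)        # 降到 2 维用于可视化
labels = KMeans(n_clusters=2, n_init=10).fit_predict(emb)   # 相似物品聚类
```

聚类完成后,离各聚类中心最近的帖子即可作为该簇的代表样本,用于人工检视簇的构成。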
zh
[CV-116] PINGS: Gaussian Splatting Meets Distance Fields within a Point-Based Implicit Neural Map
【速读】:该论文旨在解决机器人在复杂环境中高效运行所需的高保真场景重建问题。为实现这一目标,论文提出了一种新的地图表示方法,将连续的符号距离场和高斯点扩散辐射场统一到一个弹性紧凑的基于点的隐式神经地图中。关键在于通过强制这两个场之间的几何一致性,从而利用两种模态的优势实现相互改进。
链接: https://arxiv.org/abs/2502.05752
作者: Yue Pan,Xingguang Zhong,Liren Jin,Louis Wiesmann,Marija Popović,Jens Behley,Cyrill Stachniss
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 14 pages, 8 figures
点击查看摘要
Abstract:Robots require high-fidelity reconstructions of their environment for effective operation. Such scene representations should be both, geometrically accurate and photorealistic to support downstream tasks. While this can be achieved by building distance fields from range sensors and radiance fields from cameras, the scalable incremental mapping of both fields consistently and at the same time with high quality remains challenging. In this paper, we propose a novel map representation that unifies a continuous signed distance field and a Gaussian splatting radiance field within an elastic and compact point-based implicit neural map. By enforcing geometric consistency between these fields, we achieve mutual improvements by exploiting both modalities. We devise a LiDAR-visual SLAM system called PINGS using the proposed map representation and evaluate it on several challenging large-scale datasets. Experimental results demonstrate that PINGS can incrementally build globally consistent distance and radiance fields encoded with a compact set of neural points. Compared to the state-of-the-art methods, PINGS achieves superior photometric and geometric rendering at novel views by leveraging the constraints from the distance field. Furthermore, by utilizing dense photometric cues and multi-view consistency from the radiance field, PINGS produces more accurate distance fields, leading to improved odometry estimation and mesh reconstruction.
zh
[CV-117] UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control
【速读】:该论文旨在解决现有扩散桥模型在图像翻译和恢复任务中经常产生模糊或过度平滑细节的问题,并缺乏全面的理论基础来解释这些局限性。解决方案的关键在于提出了一种基于随机最优控制(Stochastic Optimal Control, SOC)的统一框架UniDB。通过SOC优化问题并导出最优控制器的闭式解,UniDB不仅统一和推广了现有的扩散桥模型,而且通过引入可调终端惩罚系数实现了控制成本与终端惩罚之间的最佳平衡,显著提升了细节保留和输出质量。
链接: https://arxiv.org/abs/2502.05749
作者: Kaizhen Zhu,Mokai Pan,Yuexin Ma,Yanwei Fu,Jingyi Yu,Jingya Wang,Ye Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
点击查看摘要
Abstract:Recent advances in diffusion bridge models leverage Doob's h-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified framework for diffusion bridges based on Stochastic Optimal Control (SOC). UniDB formulates the problem through an SOC-based optimization and derives a closed-form solution for the optimal controller, thereby unifying and generalizing existing diffusion bridge models. We demonstrate that existing diffusion bridges employing Doob's h-transform constitute a special case of our framework, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. Notably, UniDB seamlessly integrates with existing diffusion bridge models, requiring only minimal code modifications. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework. Our code is available at this https URL.
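按摘要描述,UniDB 的 SOC 目标大致可写成"控制代价 + 可调终端惩罚"的形式(下式仅为示意记号,非论文原式):当终端惩罚系数 γ→∞ 时终点被硬性固定,即退化为基于 Doob h-变换的扩散桥。

```latex
\min_{u}\;\mathbb{E}\!\left[\int_{0}^{T}\tfrac{1}{2}\,\lVert u_t\rVert^{2}\,\mathrm{d}t
\;+\;\tfrac{\gamma}{2}\,\lVert x_T-x_{\mathrm{tgt}}\rVert^{2}\right],
\qquad
\mathrm{d}x_t=\bigl[f(x_t,t)+g(t)\,u_t\bigr]\,\mathrm{d}t+g(t)\,\mathrm{d}w_t .
```

有限的 γ 允许在控制成本与终端约束之间折中,这正是摘要中"可调终端惩罚系数提升细节保留"的直观来源。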
zh
[CV-118] Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling
【速读】:该论文旨在探究扩散模型(Diffusion Models)在自监督学习中生成高质量表示的能力,尽管这些模型最初设计用于生成任务。论文的关键在于提出了一种基于低维数据模型和后验估计的数学框架,揭示了图像生成最后阶段生成质量和表示质量之间的基本权衡。通过分析噪声尺度下的表示动力学,主要受数据去噪和类别指定之间相互作用的驱动,论文进一步提出了跨噪声水平聚合特征的集成方法,显著提升了清洁样本性能和标签噪声条件下的鲁棒性。实验结果验证了这一发现的有效性。
链接: https://arxiv.org/abs/2502.05743
作者: Xiao Li,Zekai Zhang,Xiang Li,Siyi Chen,Zhihui Zhu,Peng Wang,Qing Qu
机构: University of Michigan(密歇根大学); Ohio State University(俄亥俄州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: First two authors contributed equally
点击查看摘要
Abstract:This work addresses the critical question of why and when diffusion models, despite being designed for generative tasks, can excel at learning high-quality representations in a self-supervised manner. To address this, we develop a mathematical framework based on a low-dimensional data model and posterior estimation, revealing a fundamental trade-off between generation and representation quality near the final stage of image generation. Our analysis explains the unimodal representation dynamics across noise scales, mainly driven by the interplay between data denoising and class specification. Building on these insights, we propose an ensemble method that aggregates features across noise levels, significantly improving both clean performance and robustness under label noise. Extensive experiments on both synthetic and real-world datasets validate our findings.
zh
[CV-119] Linear Attention Modeling for Learned Image Compression
【速读】:该论文旨在解决学习型图像压缩在实现高效编码的同时,较少关注低复杂度设计的问题。论文的关键解决方案在于提出了一种线性注意力模型LALIC,其中包含了Bi-RWKV块和基于RWKV的空间-通道上下文模型(RWKV-SCCTX)。Bi-RWKV块通过空间混合和通道混合模块实现了更紧凑的特征提取,并应用了基于卷积的全向位移模块以适应二维潜表示。RWKV-SCCTX则利用Bi-RWKV有效地建模相邻特征之间的相关性,进一步提升了率失真(RD)性能。据作者所知,这是首次将高效的Bi-RWKV模型应用于基于线性注意力的学习型图像压缩。实验结果表明,该方法在Kodak、Tecnick和CLIC Professional验证数据集上的BD-rate分别优于VTM-9.1达-14.84%、-15.20%和-17.32%。
链接: https://arxiv.org/abs/2502.05741
作者: Donghui Feng,Zhengxue Cheng,Shen Wang,Ronghua Wu,Hongwei Hu,Guo Lu,Li Song
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, learned image compression has made tremendous progress to achieve impressive coding efficiency. Its coding gain mainly comes from non-linear neural network-based transform and learnable entropy modeling. However, most of the recent focus has been solely on a strong backbone, and few studies consider low-complexity design. In this paper, we propose LALIC, a linear attention modeling for learned image compression. Specifically, we propose to use Bi-RWKV blocks, by utilizing the Spatial Mix and Channel Mix modules to achieve more compact feature extraction, and apply the Conv based Omni-Shift module to adapt to two-dimensional latent representation. Furthermore, we propose a RWKV-based Spatial-Channel ConTeXt model (RWKV-SCCTX), that leverages the Bi-RWKV to model the correlation between neighboring features effectively, to further improve the RD performance. To our knowledge, our work is the first to utilize efficient Bi-RWKV models with linear attention for learned image compression. Experimental results demonstrate that our method achieves competitive RD performances by outperforming VTM-9.1 by -14.84%, -15.20%, -17.32% in BD-rate on Kodak, Tecnick and CLIC Professional validation datasets.
zh
[CV-120] Performance Analysis of Traditional VQA Models Under Limited Computational Resources
【速读】:该论文旨在解决在计算资源受限的现实应用中,有效整合视觉和文本信息以进行视觉问答(VQA)所面临的挑战。研究重点在于通过评估基于双向GRU(BidGRU)、GRU、双向LSTM(BidLSTM)以及卷积神经网络(CNN)的模型,提升VQA性能,特别是在处理数值和计数问题方面。研究的关键在于采用具有300维嵌入维度和3000词汇量的BidGRU模型,以实现最佳的整体性能,同时避免较大模型带来的计算开销。消融研究突出了在资源限制条件下,注意机制和计数信息对于处理复杂推理任务的重要性。
链接: https://arxiv.org/abs/2502.05738
作者: Jihao Gu
机构: University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 1 figure, 5 tables, the paper has been accepted by the PRML'25 conference
点击查看摘要
Abstract:In real-world applications where computational resources are limited, effectively integrating visual and textual information for Visual Question Answering (VQA) presents significant challenges. This paper investigates the performance of traditional models under computational constraints, focusing on enhancing VQA performance, particularly for numerical and counting questions. We evaluate models based on Bidirectional GRU (BidGRU), GRU, Bidirectional LSTM (BidLSTM), and Convolutional Neural Networks (CNN), analyzing the impact of different vocabulary sizes, fine-tuning strategies, and embedding dimensions. Experimental results show that the BidGRU model with an embedding dimension of 300 and a vocabulary size of 3000 achieves the best overall performance without the computational overhead of larger models. Ablation studies emphasize the importance of attention mechanisms and counting information in handling complex reasoning tasks under resource limitations. Our research provides valuable insights for developing more efficient VQA models suitable for deployment in environments with limited computational capacity.
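文中表现最好的 BidGRU 配置(300 维词嵌入、3000 词表)可用如下最小 PyTorch 草图示意:双向 GRU 编码问题,与 CNN 图像特征逐元素融合后分类(隐藏维度与融合方式为假设):

```python
import torch
import torch.nn as nn

class BidGRUVQA(nn.Module):
    """示意论文评测的 BidGRU 基线(非官方实现)。"""
    def __init__(self, vocab=3000, emb=300, hid=512, img_dim=2048, n_ans=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb, padding_idx=0)
        self.gru = nn.GRU(emb, hid // 2, batch_first=True, bidirectional=True)
        self.img_proj = nn.Linear(img_dim, hid)   # 假设图像特征来自预训练 CNN
        self.cls = nn.Linear(hid, n_ans)

    def forward(self, q_tokens, img_feat):        # (B, L), (B, img_dim)
        _, h = self.gru(self.embed(q_tokens))
        q = torch.cat([h[0], h[1]], dim=-1)       # 拼接双向末状态 -> (B, hid)
        fused = q * torch.relu(self.img_proj(img_feat))  # 逐元素融合
        return self.cls(fused)

m = BidGRUVQA()
logits = m(torch.randint(1, 3000, (4, 12)), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 1000])
```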
zh
[CV-121] SSDD-GAN: Single-Step Denoising Diffusion GAN for Cochlear Implant Surgical Scene Completion
【速读】:该论文旨在解决合成乳突切除术后数据集中的手术场景完整性问题。解决方案的关键在于提出了一种名为单步去噪扩散-GAN(Single-Step Denoising Diffusion-GAN, SSDD-GAN)的方法,通过结合扩散模型的优势与生成对抗网络(GANs)的对抗优化,实现了结构相似性提升6%的效果。该方法利用自监督学习在真实手术数据集上的训练,并采用零样本迁移方式直接应用于合成乳突切除术后数据集,从而生成逼真的完整手术场景,无需显式的地面真值标签。
链接: https://arxiv.org/abs/2502.05710
作者: Yike Zhang,Eduardo Davalos,Jack Noble
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent deep learning-based image completion methods, including both inpainting and outpainting, have demonstrated promising results in restoring corrupted images by effectively filling various missing regions. Among these, Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPMs) have been employed as key generative image completion approaches, excelling in the field of generating high-quality restorations with reduced artifacts and improved fine details. In previous work, we developed a method aimed at synthesizing views from novel microscope positions for mastoidectomy surgeries; however, that approach did not have the ability to restore the surrounding surgical scene environment. In this paper, we propose an efficient method to complete the surgical scene of the synthetic postmastoidectomy dataset. Our approach leverages self-supervised learning on real surgical datasets to train a Single-Step Denoising Diffusion-GAN (SSDD-GAN), combining the advantages of diffusion models with the adversarial optimization of GANs, yielding a 6% improvement in Structural Similarity. The trained model is then directly applied to the synthetic postmastoidectomy dataset using a zero-shot approach, enabling the generation of realistic and complete surgical scenes without the need for explicit ground-truth labels from the synthetic postmastoidectomy dataset. This method addresses key limitations in previous work, offering a novel pathway for full surgical microscopy scene completion and enhancing the usability of the synthetic postmastoidectomy dataset in surgical preoperative planning and intraoperative navigation.
zh
[CV-122] Semantic-Aware Adaptive Video Streaming Using Latent Diffusion Models for Wireless Networks
【速读】:该论文旨在解决高带宽使用、存储效率低下以及传统恒定比特率流媒体(Constant Bitrate Streaming, CBS)和自适应比特率流媒体(Adaptive Bitrate Streaming, ABS)引起的体验质量(Quality of Experience, QoE)下降的问题。解决方案的关键在于将潜扩散模型(Latent Diffusion Models, LDMs)集成到FFmpeg技术中,通过压缩I帧至潜在空间以实现显著的存储和语义传输节省,同时保持B帧和P帧作为调整元数据以确保高效视频重建。此外,结合最先进的去噪和视频帧插值(Video Frame Interpolation, VFI)技术,进一步提升视频质量和时序连贯性。
链接: https://arxiv.org/abs/2502.05695
作者: Zijiang Yan,Jianhua Pei,Hongda Wu,Hina Tabassum,Ping Wang
机构: Department of Electrical Engineering and Computer Science, York University (约克大学), Canada
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Submission for possible publication
点击查看摘要
Abstract:This paper proposes a novel framework for real-time adaptive-bitrate video streaming by integrating latent diffusion models (LDMs) within the FFmpeg techniques. This solution addresses the challenges of high bandwidth usage, storage inefficiencies, and quality of experience (QoE) degradation associated with traditional constant bitrate streaming (CBS) and adaptive bitrate streaming (ABS). The proposed approach leverages LDMs to compress I-frames into a latent space, offering significant storage and semantic transmission savings without sacrificing high visual quality. While it keeps B-frames and P-frames as adjustment metadata to ensure efficient video reconstruction at the user side, the proposed framework is complemented with the most state-of-the-art denoising and video frame interpolation (VFI) techniques. These techniques mitigate semantic ambiguity and restore temporal coherence between frames, even in noisy wireless communication environments. Experimental results demonstrate the proposed method achieves high-quality video streaming with optimized bandwidth usage, outperforming state-of-the-art solutions in terms of QoE and resource efficiency. This work opens new possibilities for scalable real-time video streaming in 5G and future post-5G networks.
[CV-123] The Evolution of Dataset Distillation: Toward Scalable and Generalizable Solutions
Quick read: This paper addresses the efficient distillation of large-scale datasets into compact synthetic representations that facilitate training modern deep learning models. The key lies in its synthesis and analysis of several core methodologies: trajectory matching, gradient matching, distribution matching, scalable generative approaches, and decoupled optimization mechanisms. Together these advances underpin the effective and efficient SRe2L distillation framework, along with soft label strategies and lossless distillation techniques that significantly improve model accuracy while maximizing compression. The survey also discusses robustness against adversarial and backdoor attacks and the effective handling of non-IID data.
Link: https://arxiv.org/abs/2502.05673
Authors: Ping Liu, Jiawei Du
Affiliations: University of Nevada, Reno; Centre for Frontier AI Research (CFAR); Institute of High Performance Computing (IHPC), A*STAR (Agency for Science, Technology and Research), Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dataset distillation, which condenses large-scale datasets into compact synthetic representations, has emerged as a critical solution for training modern deep learning models efficiently. While prior surveys focus on developments before 2023, this work comprehensively reviews recent advances, emphasizing scalability to large-scale datasets such as ImageNet-1K and ImageNet-21K. We categorize progress into a few key methodologies: trajectory matching, gradient matching, distribution matching, scalable generative approaches, and decoupling optimization mechanisms. As a comprehensive examination of recent dataset distillation advances, this survey highlights breakthrough innovations: the SRe2L framework for efficient and effective condensation, soft label strategies that significantly enhance model accuracy, and lossless distillation techniques that maximize compression while maintaining performance. Beyond these methodological advancements, we address critical challenges, including robustness against adversarial and backdoor attacks and the effective handling of non-IID data distributions. Additionally, we explore emerging applications in video and audio processing, multi-modal learning, medical imaging, and scientific computing, highlighting its domain versatility. By offering extensive performance comparisons and actionable research directions, this survey equips researchers and practitioners with practical insights to advance efficient and generalizable dataset distillation, paving the way for future innovations.
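Among the methodologies surveyed above, gradient matching is the simplest to sketch. Below is a minimal PyTorch illustration under stated assumptions (a toy linear model, random synthetic images, and a hypothetical gradient_matching_loss helper, none taken from a surveyed paper): the synthetic batch is optimized so the gradients it induces align with those from real data.

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, real_x, real_y, syn_x, syn_y):
    """One gradient-matching step: align gradients induced by synthetic
    and real batches on the same network parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_real = torch.autograd.grad(F.cross_entropy(model(real_x), real_y), params)
    g_syn = torch.autograd.grad(F.cross_entropy(model(syn_x), syn_y),
                                params, create_graph=True)  # graph kept so syn_x can be updated
    return sum(1 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0)
               for gr, gs in zip(g_real, g_syn))

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10))
syn_x = torch.randn(10, 3, 32, 32, requires_grad=True)   # learnable synthetic images
syn_y = torch.arange(10)
real_x, real_y = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
loss = gradient_matching_loss(model, real_x, real_y, syn_x, syn_y)
loss.backward()            # gradients flow into syn_x
print(syn_x.grad.shape)
```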
[CV-124] Rigid Body Adversarial Attacks
Quick read: This paper addresses the significant discrepancy between simulated and real behavior that arises when rigid body simulators ignore an object's finite stiffness (its small but non-zero compliance). The key is an adversarial attack against rigid body simulators: an optimization problem constructs perceptually rigid adversarial objects that share the collision geometry and moments of mass of a reference object, so they behave identically in rigid body simulation yet maximally differently in more accurate deformable simulation. The method is validated by comparing simulations of several examples in commercially available simulators.
Link: https://arxiv.org/abs/2502.05669
Authors: Aravind Ramakrishnan, David I.W. Levin, Alec Jacobson
Affiliations: University of Toronto; NVIDIA; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 17 pages, 14 figures, 3DV 2025
Abstract:Due to their performance and simplicity, rigid body simulators are often used in applications where the objects of interest can be considered very stiff. However, no material has infinite stiffness, which means there are potentially cases where the non-zero compliance of the seemingly rigid object can cause a significant difference between its trajectories when simulated in a rigid body or deformable simulator. Similarly to how adversarial attacks are developed against image classifiers, we propose an adversarial attack against rigid body simulators. In this adversarial attack, we solve an optimization problem to construct perceptually rigid adversarial objects that have the same collision geometry and moments of mass as a reference object, so that they behave identically in rigid body simulations but maximally differently in more accurate deformable simulations. We demonstrate the validity of our method by comparing simulations of several examples in commercially available simulators.
[CV-125] An inpainting approach to manipulate asymmetry in pre-operative breast images
Quick read: This paper addresses the prediction of how the breasts will look after breast cancer surgery. The key is an inpainting approach that manipulates breast shape and nipple position in breast images to predict the aesthetic outcome of treatment. Experiments cover several model architectures, including invertible networks that can manipulate breasts without ground-truth breast contour and nipple annotations, and show that the proposed models can realistically alter a patient's breasts, faithfully reproducing post-operative breast asymmetries in pre-operative images.
Link: https://arxiv.org/abs/2502.05652
Authors: Helena Montenegro, Maria J. Cardoso, Jaime S. Cardoso
Affiliations: University of Porto; INESC TEC; University of Lisbon; Fundação Champalimaud
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint
Abstract:One of the most frequent modalities of breast cancer treatment is surgery. Breast surgery can cause visual alterations to the breasts, due to scars and asymmetries. To enable an informed choice of treatment, the patient must be adequately informed of the aesthetic outcomes of each treatment plan. In this work, we propose an inpainting approach to manipulate breast shape and nipple position in breast images, for the purpose of predicting the aesthetic outcomes of breast cancer treatment. We perform experiments with various model architectures for the inpainting task, including invertible networks capable of manipulating breasts in the absence of ground-truth breast contour and nipple annotations. Experiments on two breast datasets show the proposed models’ ability to realistically alter a patient’s breasts, enabling a faithful reproduction of breast asymmetries of post-operative patients in pre-operative images.
[CV-126] XiHeFusion: Harnessing Large Language Models for Science Communication in Nuclear Fusion
Quick read: This paper aims to accelerate the popularization of nuclear fusion knowledge by developing a large pre-trained model with strong logical reasoning. The key is XiHeFusion, the first large model for the nuclear fusion domain, obtained by supervised fine-tuning of the open-source Qwen2.5-14B model. Multi-source data is collected to support training, and chain-of-thought prompting is used to strengthen its logical reasoning so it can give more accurate and logically grounded answers.
Link: https://arxiv.org/abs/2502.05615
Authors: Xiao Wang, Qingquan Yang, Fuling Wang, Qiang Chen, Wentao Wu, Yu Jin, Jingtao Jiang, Liye Jin, Bo Jiang, Dengdi Sun, Wanli Lv, Meiwen Chen, Zehua Chen, Guosheng Xu, Jin Tang
Affiliations: Anhui University, Hefei 230601, China; Institute of Plasma Physics, Chinese Academy of Sciences, Hefei, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100190, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Nuclear fusion is one of the most promising ways for humans to obtain infinite energy. Currently, with the rapid development of artificial intelligence, the mission of nuclear fusion has also entered a critical period of its development. Helping more people understand nuclear fusion and join its research is an effective way to accelerate the realization of fusion. This paper proposes the first large model in the field of nuclear fusion, XiHeFusion, which is obtained through supervised fine-tuning based on the open-source large model Qwen2.5-14B. We have collected multi-source knowledge about nuclear fusion tasks to support the training of this model, including the common crawl, eBooks, arXiv, dissertations, etc. After the model has mastered the knowledge of the nuclear fusion field, we further used the chain of thought to enhance its logical reasoning ability, making XiHeFusion able to provide more accurate and logical answers. In addition, we propose a test questionnaire containing 180+ questions to assess the conversational ability of this science popularization large model. Extensive experimental results show that our nuclear fusion dialogue model, XiHeFusion, performs well at answering science-popularization questions. The pre-trained XiHeFusion model is released on this https URL.
[CV-127] FreeBlend: Advancing Concept Blending with Staged Feedback-Driven Interpolation Diffusion
Quick read: This paper addresses the challenges of concept blending in generative models, namely incompatible semantic information and discrepancies in shape and appearance. The key is FreeBlend, a training-free framework that uses transferred image embeddings as conditional inputs to mitigate cross-modal loss and enhance feature detail. It adopts a stepwise increasing interpolation strategy between latents and introduces a feedback-driven mechanism that updates the auxiliary latents in reverse order, enabling global blending and preventing rigid or unnatural outputs.
Link: https://arxiv.org/abs/2502.05606
Authors: Yufan Zhou, Haoyu Shen, Huan Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 14 figures, conference
Abstract:Concept blending is a promising yet underexplored area in generative models. While recent approaches, such as embedding mixing and latent modification based on structural sketches, have been proposed, they often suffer from incompatible semantic information and discrepancies in shape and appearance. In this work, we introduce FreeBlend, an effective, training-free framework designed to address these challenges. To mitigate cross-modal loss and enhance feature detail, we leverage transferred image embeddings as conditional inputs. The framework employs a stepwise increasing interpolation strategy between latents, progressively adjusting the blending ratio to seamlessly integrate auxiliary features. Additionally, we introduce a feedback-driven mechanism that updates the auxiliary latents in reverse order, facilitating global blending and preventing rigid or unnatural outputs. Extensive experiments demonstrate that our method significantly improves both the semantic coherence and visual quality of blended images, yielding compelling and coherent results.
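A minimal sketch of the stepwise-increasing latent interpolation described above, using a stand-in denoiser; blend_schedule, blended_denoise, and the mixing weights are illustrative assumptions rather than FreeBlend's actual implementation.

```python
import torch

def blend_schedule(step, total, start=0.2, end=0.8):
    """Stepwise-increasing blending ratio over the denoising trajectory."""
    return start + (end - start) * step / max(total - 1, 1)

def blended_denoise(denoise_fn, z_a, z_b, total_steps=50):
    """Interpolate two concept latents with a ratio that grows each step.
    `denoise_fn(z, t)` is a stand-in for one reverse-diffusion update."""
    z = 0.5 * (z_a + z_b)
    for t in range(total_steps):
        alpha = blend_schedule(t, total_steps)
        aux = alpha * z_a + (1 - alpha) * z_b   # auxiliary latent steering the blend
        z = denoise_fn(0.9 * z + 0.1 * aux, t)
    return z

# toy denoiser: shrink towards zero, standing in for a diffusion update
fake_denoise = lambda z, t: 0.98 * z
out = blended_denoise(fake_denoise, torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
print(out.shape)
```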
[CV-128] Semantic Data Augmentation Enhanced Invariant Risk Minimization for Medical Image Domain Generalization
Quick read: This paper addresses data heterogeneity in medical image classification caused by differences in scanner vendors, imaging protocols, and operators, particularly the challenge of out-of-distribution generalization. It proposes a novel domain-oriented direction selector to replace the random augmentation strategy used in VIRM. The key is to use inter-domain covariance as a guide for the augmentation direction, reducing domain discrepancies more effectively and improving generalization. Experiments on a multi-center diabetic retinopathy dataset show that the method outperforms state-of-the-art approaches, especially under limited data and significant domain heterogeneity.
Link: https://arxiv.org/abs/2502.05593
Authors: Yaoyao Zhu, Xiuding Cai, Yingkai Wang, Yu Yao, Xu Luo, Zhongliang Fu
Affiliations: Chengdu Institute of Computer Application, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep learning has achieved remarkable success in medical image classification. However, its clinical application is often hindered by data heterogeneity caused by variations in scanner vendors, imaging protocols, and operators. Approaches such as invariant risk minimization (IRM) aim to address this challenge of out-of-distribution generalization. For instance, VIRM improves upon IRM by tackling the issue of insufficient feature support overlap, demonstrating promising potential. Nonetheless, these methods face limitations in medical imaging due to the scarcity of annotated data and the inefficiency of augmentation strategies. To address these issues, we propose a novel domain-oriented direction selector to replace the random augmentation strategy used in VIRM. Our method leverages inter-domain covariance as a guide for the augmentation direction, steering data augmentation towards the target domain. This approach effectively reduces domain discrepancies and enhances generalization performance. Experiments on a multi-center diabetic retinopathy dataset demonstrate that our method outperforms state-of-the-art approaches, particularly under limited data conditions and significant domain heterogeneity.
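The domain-oriented direction selector itself is not spelled out here, but the IRM backbone it extends is standard. Below is a minimal sketch of the IRMv1 objective (the dummy-scale penalty of Arjovsky et al.), treating each imaging site as an environment; the lam weight and batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    """IRMv1 penalty: squared gradient of the risk w.r.t. a dummy scale."""
    scale = torch.tensor(1.0, requires_grad=True)
    loss = F.cross_entropy(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return grad.pow(2)

def irm_objective(model, env_batches, lam=10.0):
    """Average risk across environments (e.g., imaging sites) plus the invariance penalty."""
    risk, penalty = 0.0, 0.0
    for x, y in env_batches:
        logits = model(x)
        risk = risk + F.cross_entropy(logits, y)
        penalty = penalty + irm_penalty(logits, y)
    n = len(env_batches)
    return risk / n + lam * penalty / n

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 2))
envs = [(torch.randn(32, 1, 28, 28), torch.randint(0, 2, (32,))) for _ in range(3)]
irm_objective(model, envs).backward()   # one optimization step's worth of gradients
```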
[CV-129] Event Stream-based Visual Object Tracking: HDETrack V2 and A High-Definition Benchmark CVPR24
Quick read: This paper addresses the low resolution of existing event-based tracking datasets and proposes a novel approach to improve tracking performance and flexibility. The key is a hierarchical knowledge distillation strategy covering the similarity matrix, feature representations, and response maps, combined with a temporal Fourier transform that strengthens the model's ability to capture temporal dependencies. A test-time tuning strategy further adapts the model to specific target objects, enabling efficient tracking on EventVOT, a new high-resolution event-based tracking dataset.
Link: https://arxiv.org/abs/2502.05574
Authors: Shiao Wang, Xiao Wang, Chao Wang, Liye Jin, Lin Zhu, Bo Jiang, Yonghong Tian, Jin Tang
Affiliations: School of Computer Science and Technology, Anhui University, Hefei 230601, China; Beijing Institute of Technology, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Journal extension of EventVOT, CVPR24
Abstract:We introduce a novel hierarchical knowledge distillation strategy that incorporates the similarity matrix, feature representation, and response map-based distillation to guide the learning of the student Transformer network. We also enhance the model’s ability to capture temporal dependencies by applying the temporal Fourier transform to establish temporal relationships between video frames. We adapt the network model to specific target objects during testing via a newly proposed test-time tuning strategy to achieve high performance and flexibility in target tracking. Recognizing the limitations of existing event-based tracking datasets, which are predominantly low-resolution, we propose EventVOT, the first large-scale high-resolution event-based tracking dataset. It comprises 1141 videos spanning diverse categories such as pedestrians, vehicles, UAVs, ping pong, etc. Extensive experiments on both the low-resolution datasets (FE240hz, VisEvent, FELT) and our newly proposed high-resolution EventVOT dataset fully validate the effectiveness of our proposed method. Both the benchmark dataset and source code have been released on this https URL
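As a rough sketch of what a three-part hierarchical distillation objective can look like, the snippet below combines similarity-matrix, feature, and response-map terms; the distance functions and loss weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(feats):
    """Pairwise cosine similarity across the batch (B x B)."""
    f = F.normalize(feats.flatten(1), dim=1)
    return f @ f.t()

def hierarchical_kd_loss(s_feat, t_feat, s_resp, t_resp, w=(1.0, 1.0, 1.0)):
    """Distill similarity structure, features, and response maps from teacher to student."""
    l_sim = F.mse_loss(similarity_matrix(s_feat), similarity_matrix(t_feat))
    l_feat = F.mse_loss(s_feat, t_feat)
    l_resp = F.kl_div(F.log_softmax(s_resp.flatten(1), dim=1),
                      F.softmax(t_resp.flatten(1), dim=1), reduction="batchmean")
    return w[0] * l_sim + w[1] * l_feat + w[2] * l_resp

s_feat, t_feat = torch.randn(8, 256), torch.randn(8, 256)   # student / teacher features
s_resp, t_resp = torch.randn(8, 1, 16, 16), torch.randn(8, 1, 16, 16)  # response maps
print(hierarchical_kd_loss(s_feat, t_feat, s_resp, t_resp).item())
```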
[CV-130] MMHMER: Multi-viewer and Multi-task for Handwritten Mathematical Expression Recognition
Quick read: This paper tackles two key challenges in handwritten mathematical expression recognition (HMER): 1) how to effectively fuse CNN/GRU-based and Transformer-based methods, and 2) how to achieve higher performance at an appropriate level of complexity. The key is a new multi-view, multi-task framework, a CNN-Transformer multi-viewer model that exploits the feature-extraction strength of CNNs and the sequence-modeling strength of Transformers to combine the advantages of both and improve recognition performance.
Link: https://arxiv.org/abs/2502.05557
Authors: Kehua Chen, Haoyang Shen, Lifan Zhong, Mingyi Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages; 2 figures
Abstract:Handwritten Mathematical Expression Recognition (HMER) methods have made remarkable progress, with most existing HMER approaches based on either a hybrid CNN/RNN architecture with GRU or a Transformer architecture. Each of these has its strengths and weaknesses. Leveraging different model structures as viewers and effectively integrating their diverse capabilities presents an intriguing avenue for exploration. This involves addressing two key challenges: 1) How to fuse these two methods effectively, and 2) How to achieve higher performance under an appropriate level of complexity. This paper proposes an efficient CNN-Transformer multi-viewer, multi-task approach to enhance the model’s recognition performance. Our MMHMER model achieves 63.96%, 62.51%, and 65.46% ExpRate on CROHME14, CROHME16, and CROHME19, outperforming Posformer with an absolute gain of 1.28%, 1.48%, and 0.58%. The main contribution of our approach is that we propose a new multi-view, multi-task framework that can effectively integrate the strengths of CNN and Transformer. By leveraging the feature extraction capabilities of CNN and the sequence modeling capabilities of Transformer, our model can better handle the complexity of handwritten mathematical expressions.
[CV-131] Efficient Reinforcement Learning Through Adaptively Pretrained Visual Encoder AAAI2025
Quick read: This paper addresses the difficulty reinforcement learning (RL) agents have in generalizing acquired skills to new environments on complex tasks. The key is APE (Adaptively Pretrained visual Encoder), a framework that applies an adaptive augmentation strategy during pretraining and extracts generalizable features with only a few interactions in the task environments, significantly improving the generalization ability and efficiency of visual RL algorithms.
Link: https://arxiv.org/abs/2502.05555
Authors: Yuhan Zhang, Guoqing Ma, Guangfu Hao, Liangxuan Guo, Yang Chen, Shan Yu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025
Abstract:While Reinforcement Learning (RL) agents can successfully learn to handle complex tasks, effectively generalizing acquired skills to unfamiliar settings remains a challenge. One of the reasons behind this is that the visual encoders used are task-dependent, preventing effective feature extraction in different settings. To address this issue, recent studies have tried to pretrain encoders with diverse visual inputs in order to improve their performance. However, they rely on existing pretrained encoders without further exploring the impact of the pretraining period. In this work, we propose APE: efficient reinforcement learning through Adaptively Pretrained visual Encoder, a framework that utilizes an adaptive augmentation strategy during the pretraining phase and extracts generalizable features with only a few interactions within the task environments in the policy learning period. Experiments are conducted across various domains, including DeepMind Control Suite, Atari Games and Memory Maze benchmarks, to verify the effectiveness of our method. Results show that mainstream RL methods, such as DreamerV3 and DrQ-v2, achieve state-of-the-art performance when equipped with APE. In addition, APE significantly improves the sampling efficiency using only visual inputs during learning, approaching the efficiency of state-based methods in several control tasks. These findings demonstrate the potential of adaptive pretraining of encoders in enhancing the generalization ability and efficiency of visual RL algorithms.
[CV-132] 4DR P2T: 4D Radar Tensor Synthesis with Point Clouds
Quick read: This paper addresses the fact that the constant false alarm rate (CFAR) algorithm commonly used for clutter removal in four-dimensional (4D) Radar point cloud generation cannot fully capture the spatial characteristics of objects. The key is the 4D Radar Point-to-Tensor (4DR P2T) model, a modified conditional generative adversarial network (cGAN) that effectively processes 4D Radar point cloud data and generates tensor data suitable for deep learning applications while minimizing measurement loss.
Link: https://arxiv.org/abs/2502.05550
Authors: Woo-Jin Jung, Dong-Hee Paek, Seung-Hyun Kong
Affiliations: Korea Advanced Institute of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures
Abstract:In four-dimensional (4D) Radar-based point cloud generation, clutter removal is commonly performed using the constant false alarm rate (CFAR) algorithm. However, CFAR may not fully capture the spatial characteristics of objects. To address this limitation, this paper proposes the 4D Radar Point-to-Tensor (4DR P2T) model, which generates tensor data suitable for deep learning applications while minimizing measurement loss. Our method employs a conditional generative adversarial network (cGAN), modified to effectively process 4D Radar point cloud data and generate tensor data. Experimental results on the K-Radar dataset validate the effectiveness of the 4DR P2T model, achieving an average PSNR of 30.39dB and SSIM of 0.96. Additionally, our analysis of different point cloud generation methods highlights that the 5% percentile method provides the best overall performance, while the 1% percentile method optimally balances data volume reduction and performance, making it well-suited for deep learning applications.
[CV-133] Demystifying Catastrophic Forgetting in Two-Stage Incremental Object Detector
Quick read: This paper addresses catastrophic forgetting in incremental object detection (IOD). The key is the NSGP-RePRE framework: Regional Prototype Replay (RePRE) mitigates classifier forgetting by replaying two types of prototypes, while Null Space Gradient Projection (NSGP) updates the feature extractor in directions orthogonal to the subspace of old inputs via gradient projection, eliminating prototype-feature misalignment. This design achieves state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings.
Link: https://arxiv.org/abs/2502.05540
Authors: Qirui Wu, Shizhou Zhang, De Cheng, Yinghui Xing, Di Xu, Peng Wang, Yanning Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures, 9 tables
Abstract:Catastrophic forgetting is a critical challenge for incremental object detection (IOD). Most existing methods treat the detector monolithically, relying on instance replay or knowledge distillation without analyzing component-specific forgetting. Through dissection of Faster R-CNN, we reveal a key insight: Catastrophic forgetting is predominantly localized to the RoI Head classifier, while regressors retain robustness across incremental stages. This finding challenges conventional assumptions, motivating us to develop a framework termed NSGP-RePRE. Regional Prototype Replay (RePRE) mitigates classifier forgetting via replay of two types of prototypes: coarse prototypes represent class-wise semantic centers of RoI features, while fine-grained prototypes model intra-class variations. Null Space Gradient Projection (NSGP) is further introduced to eliminate prototype-feature misalignment by updating the feature extractor in directions orthogonal to the subspace of old inputs via gradient projection, aligning RePRE with incremental learning dynamics. Our simple yet effective design allows NSGP-RePRE to achieve state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings. Our work not only advances IOD methodology but also provides pivotal insights for catastrophic forgetting mitigation in IOD. Code will be available soon.
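The NSGP step, updating only in directions orthogonal to the subspace spanned by old inputs, can be sketched with an SVD-based projection. project_gradient is a hypothetical helper and the keep_ratio energy threshold is an assumed hyperparameter, not the authors' setting.

```python
import torch

@torch.no_grad()
def project_gradient(grad, old_inputs, keep_ratio=0.95):
    """Remove the gradient components lying in the subspace spanned by old inputs.

    grad:       (d,) parameter gradient for one layer
    old_inputs: (n, d) feature matrix collected from previous tasks
    """
    # Orthonormal basis of the old-input subspace via SVD
    _, s, vT = torch.linalg.svd(old_inputs, full_matrices=False)
    energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
    k = int(torch.searchsorted(energy, keep_ratio)) + 1
    U = vT[:k].t()                     # (d, k) basis of the protected subspace
    return grad - U @ (U.t() @ grad)   # keep only the orthogonal component

old = torch.randn(100, 64)             # features from old classes
g = torch.randn(64)
g_proj = project_gradient(g, old)
print((old @ g_proj).abs().max())      # small: update barely disturbs old inputs
```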
[CV-134] SSH: Sparse Spectrum Adaptation via Discrete Hartley Transformation
Quick read: This paper addresses the computational and memory challenges that low-rank adaptation (LoRA) still faces when scaling to larger models or more complex task adaptation, despite reducing the number of trainable parameters when fine-tuning a large foundation model. The key is Sparse Spectrum Adaptation via Discrete Hartley Transformation (SSH), which selects the most informative spectral components across all layers under the guidance of a discrete Hartley transformation (DHT) and uses a lightweight inverse DHT to project the spectrum back into the spatial domain for updates, significantly reducing trainable parameters while improving model performance.
Link: https://arxiv.org/abs/2502.05539
Authors: Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D. Pimentel, Anuj Pathania
Affiliations: University of Amsterdam
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Low-rank adaptation (LoRA) has been demonstrated effective in reducing the number of trainable parameters when fine-tuning a large foundation model (LLM). However, it still encounters computational and memory challenges when scaling to larger models or addressing more complex task adaptation. In this work, we introduce Sparse Spectrum Adaptation via Discrete Hartley Transformation (SSH), a novel approach that significantly reduces the number of trainable parameters while enhancing model performance. It selects the most informative spectral components across all layers, under the guidance of the initial weights after a discrete Hartley transformation (DHT). The lightweight inverse DHT then projects the spectrum back into the spatial domain for updates. Extensive experiments across both single-modality tasks, such as language understanding and generation, and multi-modality tasks, such as video-text understanding, demonstrate that SSH outperforms existing parameter-efficient fine-tuning (PEFT) methods while achieving substantial reductions in computational cost and memory requirements.
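The discrete Hartley transform at the core of SSH is easy to obtain from the FFT, since H(x) = Re(F(x)) - Im(F(x)) and the DHT is its own inverse up to a 1/N factor. The sketch below performs a sparse spectral update in that spirit; the top-k magnitude criterion and the sparse_spectrum_update helper are illustrative assumptions, not SSH's actual selection rule.

```python
import torch

def dht(x):
    """Discrete Hartley transform via the FFT: H(x) = Re(F(x)) - Im(F(x))."""
    X = torch.fft.fft(x.float())
    return X.real - X.imag

def idht(h):
    """The DHT is its own inverse up to a 1/N factor."""
    return dht(h) / h.shape[-1]

def sparse_spectrum_update(w, delta_spectrum, idx):
    """Update a weight slice by editing only a few Hartley coefficients.

    w:              (n,) flattened weight slice
    delta_spectrum: (k,) learned updates for the selected coefficients
    idx:            (k,) indices of the most informative spectral components
    """
    h = dht(w)
    h[idx] = h[idx] + delta_spectrum   # sparse update in the spectral domain
    return idht(h)                     # project back to the spatial domain

w = torch.randn(256)
idx = dht(w).abs().topk(8).indices     # pick energetic components (illustrative criterion)
w_new = sparse_spectrum_update(w, torch.randn(8) * 0.01, idx)
print(torch.allclose(idht(dht(w)), w, atol=1e-4))  # inverse check
```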
[CV-135] Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation
Quick read: This paper addresses the challenging problem of fine-grained text-driven human motion generation. Existing methods generate imprecise motions that fail to capture the relationships specified in text because they lack effective text parsing for detailed semantic cues (such as body-part descriptions) and do not fully model the linguistic structure between words. The key is the fine-grained framework Fg-T2M++, which consists of three modules: (1) an LLMs semantic parsing module that extracts body-part descriptions and semantics from text; (2) a hyperbolic text representation module that encodes relational information between text units by embedding the syntactic dependency graph into hyperbolic space; and (3) a multi-modal fusion module that hierarchically fuses text and motion features.
Link: https://arxiv.org/abs/2502.05534
Authors: Yin Wang, Mu Li, Jiapeng Liu, Zhiying Leng, Frederick W. B. Li, Ziyao Zhang, Xiaohui Liang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to accurately capture relationships specified in text due to: (1) lack of effective text parsing for detailed semantic cues regarding body parts, (2) not fully modeling linguistic structures between words to comprehend text comprehensively. To tackle these limitations, we propose a novel fine-grained framework Fg-T2M++ that consists of: (1) an LLMs semantic parsing module to extract body part descriptions and semantics from text, (2) a hyperbolic text representation module to encode relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module to hierarchically fuse text and motion features. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms SOTA methods, validating its ability to accurately generate motions adhering to comprehensive text semantics.
[CV-136] Evaluation of Vision Transformers for Multimodal Image Classification: A Case Study on Brain Lung and Kidney Tumors
Quick read: This paper evaluates Vision Transformer architectures, including the Swin Transformer and MaxViT, on several medical imaging datasets of magnetic resonance imaging (MRI) and computed tomography (CT) scans, using three training sets of brain, lung, and kidney tumors with a variety of classification labels. The key is a performance analysis of the fine-tuned models on individual and combined image modalities: the Swin Transformer reaches up to 99.9% accuracy for kidney tumor classification and 99.3% on a combined dataset, whereas MaxViT excels on individual datasets but performs poorly when data is combined. The paper highlights the adaptability of Transformer-based models across image modalities and features, while noting remaining challenges such as limited annotated data and interpretability.
Link: https://arxiv.org/abs/2502.05517
Authors: Óscar A. Martín, Javier Sánchez
Affiliations: Centro de Tecnologías de la Imagen (CTIM); Instituto Universitario de Cibernética, Empresas y Sociedad (IUCES); University of Las Palmas de Gran Canaria
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 3 figures, 8 tables
Abstract:Neural networks have become the standard technique for medical diagnostics, especially in cancer detection and classification. This work evaluates the performance of Vision Transformers architectures, including Swin Transformer and MaxViT, in several datasets of magnetic resonance imaging (MRI) and computed tomography (CT) scans. We used three training sets of images with brain, lung, and kidney tumors. Each dataset includes different classification labels, from brain gliomas and meningiomas to benign and malignant lung conditions and kidney anomalies such as cysts and cancers. This work aims to analyze the behavior of the neural networks in each dataset and the benefits of combining different image modalities and tumor classes. We designed several experiments by fine-tuning the models on combined and individual image modalities. The results revealed that the Swin Transformer provided high accuracy, achieving up to 99.9% for kidney tumor classification and 99.3% accuracy in a combined dataset. MaxViT also provided excellent results in individual datasets but performed poorly when data is combined. This research highlights the adaptability of Transformer-based models to various image modalities and features. However, challenges persist, including limited annotated data and interpretability issues. Future works will expand this study by incorporating other image modalities and enhancing diagnostic capabilities. Integrating these models across diverse datasets could mark a pivotal advance in precision medicine, paving the way for more efficient and comprehensive healthcare solutions.
[CV-137] Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model
Quick read: This paper addresses the lack of suitable foundation models for specific private data domains when generating differentially private (DP) synthetic data. The key finding is that the Private Evolution (PE) framework is general enough to admit inference APIs beyond foundation models: simulators, such as computer graphics-based image synthesis tools, can also serve as effective APIs. The resulting approach, Sim-PE, improves the downstream classification accuracy of PE by up to 3x and reduces the FID score by up to 80%, demonstrating effectiveness across domains.
Link: https://arxiv.org/abs/2502.05505
Authors: Zinan Lin, Tadas Baltrusaitis, Sergey Yekhanin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:
Abstract:Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to inference APIs from foundation models, enabling it to harness the power of state-of-the-art models. However, a suitable foundation model for a specific private data domain is not always available. In this paper, we discover that the PE framework is sufficiently general to allow inference APIs beyond foundation models. Specifically, we show that simulators – such as computer graphics-based image synthesis tools – can also serve as effective APIs within the PE framework. This insight greatly expands the applicability of PE, enabling the use of a wide variety of domain-specific simulators for DP data synthesis. We explore the potential of this approach, named Sim-PE, in the context of image synthesis. Across three diverse simulators, Sim-PE performs well, improving the downstream classification accuracy of PE by up to 3x and reducing the FID score by up to 80%. We also show that simulators and foundation models can be easily leveraged together within the PE framework to achieve further improvements. The code is open-sourced in the Private Evolution Python library: this https URL.
[CV-138] A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction
Quick read: This paper addresses the tendency of video generation models to deviate from physical laws, a key concern overlooked by most text-to-video (T2V) benchmarks. The key is PhyCoBench, a new benchmark for assessing the physical coherence of generated videos, together with an automated evaluation model, PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascaded manner. Experiments show that PhyCoPredictor currently aligns most closely with human evaluation, so it can effectively assess physical coherence and provide insights for future model optimization.
Link: https://arxiv.org/abs/2502.05503
Authors: Yongfan Chen, Xiuwen Zhu, Tianyu Li, Hao Chen, Chunhua Shen
Affiliations: Zhejiang University; Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in video generation models demonstrate their potential as world simulators, but they often struggle with videos deviating from physical laws, a key concern overlooked by most text-to-video benchmarks. We introduce a benchmark designed specifically to assess the Physical Coherence of generated videos, PhyCoBench. Our benchmark includes 120 prompts covering 7 categories of physical principles, capturing key physical laws observable in video content. We evaluated four state-of-the-art (SoTA) T2V models on PhyCoBench and conducted manual assessments. Additionally, we propose an automated evaluation model: PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascade manner. Through a consistency evaluation comparing automated and manual sorting, the experimental results show that PhyCoPredictor currently aligns most closely with human evaluation. Therefore, it can effectively evaluate the physical coherence of videos, providing insights for future model optimization. Our benchmark, which includes physical coherence prompts, automatic evaluation tool PhyCoPredictor, and generated video dataset, will all be released on GitHub shortly.
[CV-139] HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation ICLR2025
Quick read: This paper addresses the insufficient open-world generalization of large foundation models on robotic tasks. The key is a hierarchical vision-language-action (VLA) model in which a high-level vision-language model (VLM) is fine-tuned to produce a coarse 2D path indicating the desired robot end-effector trajectory, which then guides a low-level, 3D-aware control policy for precise manipulation. This hierarchical design relieves the high-level VLM of fine-grained action prediction and reduces the low-level policy's burden of complex task-level reasoning, enabling effective transfer across significant domain gaps.
Link: https://arxiv.org/abs/2502.05485
Authors: Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memme, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, Ankit Goyal
Affiliations: NVIDIA; University of Washington; University of Southern California
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: to be published in ICLR 2025
Abstract:Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is the lack of robotic data, which are typically obtained through expensive on-robot operation. A promising remedy is to leverage cheaper, off-domain data such as action-free videos, hand-drawn sketches or simulation data. In this work, we posit that hierarchical vision-language-action (VLA) models can be more effective in utilizing off-domain data than standard monolithic VLA models that directly finetune vision-language models (VLMs) to predict actions. In particular, we study a class of hierarchical VLA models, where the high-level VLM is finetuned to produce a coarse 2D path indicating the desired robot end-effector trajectory given an RGB image and a task description. The intermediate 2D path prediction is then served as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Doing so alleviates the high-level VLM from fine-grained action prediction, while reducing the low-level policy’s burden on complex task-level reasoning. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios, including differences on embodiments, dynamics, visual appearances and task semantics, etc. In the real-robot experiments, we observe an average of 20% improvement in success rate across seven different axes of generalization over OpenVLA, representing a 50% relative gain. Visual results are provided at: this https URL
[CV-140] Robustifying Fourier Features Embeddings for Implicit Neural Representations
Quick read: This paper addresses the spectral bias that implicit neural representations (INRs) suffer from when handling scenes containing varying frequencies. Conventional Fourier features-based remedies such as positional encoding introduce noise into the output, which degrades performance on downstream tasks. The key is to combine multi-layer perceptrons (MLPs) with Fourier feature embeddings so that the two mutually enhance each other's strengths, a hypothesis validated by a simple theorem that serves as the foundation for the design of the proposed solution.
Link: https://arxiv.org/abs/2502.05482
Authors: Mingze Ma, Qingtian Zhu, Yifan Zhan, Zhengwei Yin, Hongjun Wang, Yinqiang Zheng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Implicit Neural Representations (INRs) employ neural networks to represent continuous functions by mapping coordinates to the corresponding values of the target function, with applications e.g., inverse graphics. However, INRs face a challenge known as spectral bias when dealing with scenes containing varying frequencies. To overcome spectral bias, the most common approach is the Fourier features-based methods such as positional encoding. However, Fourier features-based methods will introduce noise to output, which degrades their performances when applied to downstream tasks. In response, this paper initially hypothesizes that combining multi-layer perceptrons (MLPs) with Fourier feature embeddings mutually enhances their strengths, yet simultaneously introduces limitations inherent in Fourier feature embeddings. By presenting a simple theorem, we validate our hypothesis, which serves as a foundation for the design of our solution. Leveraging these insights, we propose the use of multi-layer perceptrons (MLPs) without additive
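For context, here is the generic coordinate-MLP with a Fourier feature (positional encoding) front end that this line of work builds on, where gamma(x) = [sin(2^k * pi * x), cos(2^k * pi * x)] per coordinate; the layer sizes are arbitrary and this is standard INR scaffolding, not the paper's proposed architecture.

```python
import torch
import torch.nn as nn

class FourierFeatureINR(nn.Module):
    """Coordinate MLP with a positional-encoding front end."""
    def __init__(self, in_dim=2, n_freqs=8, hidden=256, out_dim=3):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(n_freqs) * torch.pi   # octave frequencies
        self.mlp = nn.Sequential(
            nn.Linear(in_dim * 2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def encode(self, x):
        # gamma(x) = [sin(2^k pi x), cos(2^k pi x)] for each coordinate
        proj = x.unsqueeze(-1) * self.freqs          # (..., in_dim, n_freqs)
        return torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(-2)

    def forward(self, x):
        return self.mlp(self.encode(x))

coords = torch.rand(1024, 2)           # pixel coordinates in [0, 1]^2
model = FourierFeatureINR()
rgb = model(coords)                    # predicted colors at those coordinates
print(rgb.shape)                       # torch.Size([1024, 3])
```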
[CV-141] Convolutional Neural Network Segmentation for Satellite Imagery Data to Identify Landforms Using U-Net Architecture
Quick read: This paper addresses semantic segmentation for landform detection. The key is an adapted U-Net architecture combined with convolutional neural network (CNN) segmentation techniques for effective feature extraction, dropout for regularization to strengthen the model's robustness, and the Adam optimizer for efficient training. On a large set of preprocessed satellite topographical images, the approach delivers high-resolution outputs, fast feature extraction, and flexibility across a wide range of applications.
Link: https://arxiv.org/abs/2502.05476
Authors: Mitul Goswami, Sainath Dey, Aniruddha Mukherjee, Suneeta Mohanty, Prasant Kumar Pattnaik
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6th International Conference on Computational Intelligence and Pattern Recognition
Abstract:This study demonstrates a novel use of the U-Net architecture in the field of semantic segmentation to detect landforms using preprocessed satellite imagery. The study applies the U-Net model for effective feature extraction by using Convolutional Neural Network (CNN) segmentation techniques. Dropout is strategically used for regularization to improve the model’s robustness, and the Adam optimizer is used for effective training. The study thoroughly assesses the performance of the U-Net architecture utilizing a large sample of preprocessed satellite topographical images. The model excels in semantic segmentation tasks, displaying high-resolution outputs, quick feature extraction, and flexibility to a wide range of applications. The findings highlight the U-Net architecture’s substantial contribution to the advancement of machine learning and image processing technologies. The U-Net approach, which emphasizes pixel-wise categorization and comprehensive segmentation map production, is helpful in practical applications such as autonomous driving, disaster management, and land use planning. This study not only investigates the complexities of U-Net architecture for semantic segmentation, but also highlights its real-world applications in image classification, analysis, and landform identification, demonstrating the model's key role in modern image-processing technology.
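A minimal sketch of the ingredients named in the study (a U-Net-style encoder-decoder with a skip connection, dropout for regularization, and the Adam optimizer); MiniUNet is a one-level toy for illustration, not the study's full architecture.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
        nn.Dropout2d(0.1),                        # dropout used for regularization
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
    )

class MiniUNet(nn.Module):
    """One-level U-Net: downsample, bottleneck, upsample with a skip connection."""
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.down = conv_block(in_ch, 32)
        self.pool = nn.MaxPool2d(2)
        self.mid = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.out_block = conv_block(64, 32)
        self.head = nn.Conv2d(32, n_classes, 1)   # per-pixel class logits

    def forward(self, x):
        d = self.down(x)
        m = self.mid(self.pool(d))
        u = self.up(m)
        u = torch.cat([u, d], dim=1)               # skip connection
        return self.head(self.out_block(u))

net = MiniUNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)  # Adam, as in the study
logits = net(torch.rand(1, 3, 64, 64))
print(logits.shape)                                # torch.Size([1, 2, 64, 64])
```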
[CV-142] LMS-Net: A Learned Mumford-Shah Network For Few-Shot Medical Image Segmentation
Quick read: This paper addresses the limited interpretability of existing few-shot semantic segmentation (FSS) methods in data-scarce scenarios and their failure to fully incorporate the physical structure of semantic regions. The key is the Learned Mumford-Shah Network (LMS-Net), which reformulates a learned Mumford-Shah model (LMS model) into prototype update and mask update tasks solved efficiently by an alternating optimization algorithm, yielding clear interpretability. Experiments verify superior accuracy and robustness on complex structures and challenging segmentation scenarios.
Link: https://arxiv.org/abs/2502.05473
Authors: Shengdong Zhang, Fan Jia, Xiang Li, Hao Zhang, Jun Shi, Liyan Ma, Shihui Ying
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Few-shot semantic segmentation (FSS) methods have shown great promise in handling data-scarce scenarios, particularly in medical image segmentation tasks. However, most existing FSS architectures lack sufficient interpretability and fail to fully incorporate the underlying physical structures of semantic regions. To address these issues, in this paper, we propose a novel deep unfolding network, called the Learned Mumford-Shah Network (LMS-Net), for the FSS task. Specifically, motivated by the effectiveness of pixel-to-prototype comparison in prototypical FSS methods and the capability of deep priors to model complex spatial structures, we leverage our learned Mumford-Shah model (LMS model) as a mathematical foundation to integrate these insights into a unified framework. By reformulating the LMS model into prototype update and mask update tasks, we propose an alternating optimization algorithm to solve it efficiently. Further, the iterative steps of this algorithm are unfolded into corresponding network modules, resulting in LMS-Net with clear interpretability. Comprehensive experiments on three publicly available medical segmentation datasets verify the effectiveness of our method, demonstrating superior accuracy and robustness in handling complex structures and adapting to challenging segmentation scenarios. These results highlight the potential of LMS-Net to advance FSS in medical imaging applications. Our code will be available at: this https URL
[CV-143] DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability
Quick read: This paper addresses the limitations of conventional convolutional neural networks (CNNs) in handling unbalanced datasets and insufficient data augmentation. The key is a novel ensemble, the DCENWCNet model, which integrates three CNN architectures with different dropout and max-pooling settings to balance the bias-variance trade-off, improving feature learning and outperforming existing state-of-the-art networks on several performance metrics.
Link: https://arxiv.org/abs/2502.05459
Authors: Sibasish Dhibar
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cell Behavior (q-bio.CB); Machine Learning (stat.ML)
Comments:
Abstract:White blood cells (WBC) are important parts of our immune system, and they protect our body against infections by eliminating viruses, bacteria, parasites and fungi. The number of WBC types and the total number of WBCs provide important information about our health status. A traditional method, convolutional neural networks (CNN), a deep learning architecture, can classify the blood cell from a part of an object and perform object recognition. Various CNN models exhibit potential; however, their development often involves ad-hoc processes that neglect unnecessary layers, leading to issues with unbalanced datasets and insufficient data augmentation. To address these challenges, we propose a novel ensemble approach that integrates three CNN architectures, each uniquely configured with different dropout and max-pooling layer settings to enhance feature learning. This ensemble model, named DCENWCNet, effectively balances the bias-variance trade-off. When evaluated on the widely recognized Rabbin-WBC dataset, our model outperforms existing state-of-the-art networks, achieving highest mean accuracy. Additionally, it demonstrates superior performance in precision, recall, F1-score, and Area Under the ROC Curve (AUC) across all categories. To delve deeper into the interpretability of classifiers, we employ reliable post-hoc explanation techniques, including Local Interpretable Model-Agnostic Explanations (LIME). These methods approximate the behavior of a black-box model by elucidating the relationships between feature values and predictions. Interpretable results enable users to comprehend and validate the model’s predictions, thereby increasing their confidence in the automated diagnosis.
[CV-144] Block Graph Neural Networks for tumor heterogeneity prediction
Quick read: This paper addresses the limited accuracy of current tumor classification approaches such as standard tumor grading and single-cell sequencing. The key is a framework built on a mathematical model that simulates tumor evolution and generates artificial datasets for tumor classification. Tumor heterogeneity is estimated via normalized entropy, with a threshold separating high from low heterogeneity. The contributions are the cut and graph generation processes from the artificial data, the design of tumor features, and Block Graph Neural Networks (BGNN), a Graph Neural Network-based approach for predicting tumor heterogeneity. Experiments show that the proposed features and models reach 89.67% test accuracy on the artificially generated data.
Link: https://arxiv.org/abs/2502.05458
Authors: Marianne Abémgnigni Njifon, Tobias Weber, Viktor Bezborodov, Tyll Krueger, Dominic Schuhmacher
Affiliations: Deutsche Forschungsgemeinschaft; Institute for Mathematical Stochastics, University of Göttingen; Tübingen AI Center, University of Tübingen; Wrocław University of Science and Technology; Institute for Mathematical Stochastics, University of Göttingen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 27 pages, 8 figures
Abstract:Accurate tumor classification is essential for selecting effective treatments, but current methods have limitations. Standard tumor grading, which categorizes tumors based on cell differentiation, is not recommended as a stand-alone procedure, as some well-differentiated tumors can be malignant. Tumor heterogeneity assessment via single-cell sequencing offers profound insights but can be costly and may still require significant manual intervention. Many existing statistical machine learning methods for tumor data still require complex pre-processing of MRI and histopathological data. In this paper, we propose to build on a mathematical model that simulates tumor evolution (Ożański (2017)) and generate artificial datasets for tumor classification. Tumor heterogeneity is estimated using normalized entropy, with a threshold to classify tumors as having high or low heterogeneity. Our contributions are threefold: (1) the cut and graph generation processes from the artificial data, (2) the design of tumor features, and (3) the construction of Block Graph Neural Networks (BGNN), a Graph Neural Network-based approach to predict tumor heterogeneity. The experimental results reveal that the combination of the proposed features and models yields excellent results on artificially generated data (89.67% accuracy on the test data). In particular, in alignment with the emerging trends in AI-assisted grading and spatial transcriptomics, our results suggest that enriching traditional grading methods with birth (e.g., Ki-67 proliferation index) and death markers can improve heterogeneity prediction and enhance tumor classification.
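The normalized-entropy heterogeneity target is simple to compute from clone abundances. A small sketch follows, assuming normalization by log K over the K observed clones and an arbitrary 0.5 threshold; the paper's actual threshold is not specified here.

```python
import numpy as np

def normalized_entropy(clone_counts):
    """Shannon entropy of clone proportions, normalized to [0, 1] by log K."""
    p = np.asarray(clone_counts, dtype=float)
    p = p[p > 0] / p.sum()
    if len(p) <= 1:
        return 0.0
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def heterogeneity_label(clone_counts, threshold=0.5):
    """Binary high/low heterogeneity label used as the prediction target."""
    return int(normalized_entropy(clone_counts) >= threshold)

print(normalized_entropy([100, 100, 100]))  # 1.0: maximally heterogeneous
print(normalized_entropy([970, 20, 10]))    # close to 0: dominated by one clone
print(heterogeneity_label([50, 30, 20]))
```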
[CV-145] Content-based Video Retrieval in Traffic Videos using Latent Dirichlet Allocation Topic Model
Quick read: This paper addresses the challenges of content-based video retrieval (CBVR) in surveillance systems, in particular the ambiguity that arises when videos are annotated in an unsupervised manner. The key is a Latent Dirichlet Allocation (LDA) topic model: the feature vectors and the primary model are processed to obtain a secondary model that describes the scene with primitive, ambiguity-free patterns. Compared with other topic model-based methods, the proposed approach improves retrieval performance by at least 80% in false positive and 124% in true positive responses.
Link: https://arxiv.org/abs/2502.05457
Authors: Mohammad Kianpisheh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Content-based video retrieval is one of the most challenging tasks in surveillance systems. In this study, the Latent Dirichlet Allocation (LDA) topic model is used to annotate surveillance videos in an unsupervised manner. In scene understanding methods, some of the learned patterns are ambiguous and represent a mixture of atomic actions. To address the ambiguity issue in the proposed method, feature vectors and the primary model are processed to obtain a secondary model which describes the scene with primitive patterns that lack any ambiguity. Experiments show performance improvement in the retrieval task compared to other topic model-based methods. In terms of false positive and true positive responses, the proposed method achieves at least 80% and 124% improvement respectively. Four search strategies are proposed, and users can define and search for a variety of activities using the proposed query formulation, which is based on topic models. In addition, the lightweight database in our method requires much less storage, which in turn speeds up the search procedure compared to methods that are based on low-level features.
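A minimal sketch of the LDA-based retrieval idea using scikit-learn, with randomly generated "motion word" counts standing in for quantized low-level features; the vocabulary size, topic count, and cosine ranking are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Each video clip becomes a "document": a bag of quantized motion words
# (e.g., codebook indices of optical-flow features). Shape: (n_clips, vocab_size).
rng = np.random.default_rng(0)
clip_word_counts = rng.poisson(1.0, size=(200, 500))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
clip_topics = lda.fit_transform(clip_word_counts)   # (n_clips, n_topics) mixtures

# Retrieval: rank stored clips by similarity of their topic mixtures to a query
query = clip_topics[0]
scores = clip_topics @ query / (
    np.linalg.norm(clip_topics, axis=1) * np.linalg.norm(query) + 1e-9)
print(np.argsort(-scores)[:5])    # indices of the five closest clips
```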
[CV-146] AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming And Keyframe Selection
Quick read: This paper addresses the excessive memory overhead of text-driven long video editing. Existing methods simplify the task into a two-step process of keyframe translation and interpolation generation, but token-wise keyframe translation still caps the achievable video length. The key is AdaFlow, a novel training-free approach that introduces an Adaptive Attention Slimming scheme to squeeze the KV sequence, raising the number of translatable keyframes by an order of magnitude, together with an Adaptive Keyframe Selection scheme that picks representative frames for joint editing and further improves generation quality. With these designs, AdaFlow performs high-quality editing of videos with more than 1k frames in a single inference, roughly ten times longer than existing methods such as TokenFlow.
Link: https://arxiv.org/abs/2502.05433
Authors: Shuheng Zhang, Yuqi Liu, Hongbo Zhou, Jun Peng, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite great progress, text-driven long video editing is still notoriously challenging mainly due to excessive memory overhead. Although recent efforts have simplified this task into a two-step process of keyframe translation and interpolation generation, the token-wise keyframe translation still plagues the upper limit of video length. In this paper, we propose a novel and training-free approach towards efficient and effective long video editing, termed AdaFlow. We first reveal that not all tokens of video frames hold equal importance for keyframe translation, based on which we propose an Adaptive Attention Slimming scheme for AdaFlow to squeeze the KV sequence, thus increasing the number of keyframes for translations by an order of magnitude. In addition, an Adaptive Keyframe Selection scheme is also equipped to select the representative frames for joint editing, further improving generation quality. With these innovative designs, AdaFlow achieves high-quality editing of minutes-long videos in one inference, i.e., more than 1k frames on one A800 GPU, which is about ten times longer than the compared methods, e.g., TokenFlow. To validate AdaFlow, we also build a new benchmark for long video editing with high-quality annotations, termed LongV-EVAL. Our code is released at: this https URL.
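The KV-squeezing idea behind Adaptive Attention Slimming can be sketched as keeping only the cached key/value tokens that receive the most attention mass; the scoring rule and keep_ratio below are assumptions, not AdaFlow's exact criterion.

```python
import torch

def slim_kv(keys, values, queries, keep_ratio=0.25):
    """Keep only the KV tokens that receive the most attention mass.

    keys/values: (n, d) cached tokens, queries: (m, d) edit-frame tokens.
    """
    attn = torch.softmax(queries @ keys.t() / keys.shape[-1] ** 0.5, dim=-1)
    importance = attn.sum(dim=0)                       # total mass per KV token
    k = max(1, int(keep_ratio * keys.shape[0]))
    idx = importance.topk(k).indices.sort().values     # preserve token order
    return keys[idx], values[idx]

keys, values = torch.randn(4096, 64), torch.randn(4096, 64)
queries = torch.randn(256, 64)
k_s, v_s = slim_kv(keys, values, queries)
print(k_s.shape)   # e.g. torch.Size([1024, 64]): a 4x shorter KV sequence
```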
[CV-147] MoFM: A Large-Scale Human Motion Foundation Model
Quick read: This paper addresses the semantic understanding of complex human motion in both time and space. The key is MoFM, a motion foundation model built on discretized motion. MoFM uses Thermal Cubes to capture spatio-temporal motion heatmaps and applies principles from discrete variational models to encode human motion into discrete units, yielding a more efficient and scalable representation. Trained on a large-scale motion corpus, MoFM adapts to diverse downstream tasks and supports one-shot, unsupervised, and supervised paradigms.
Link: https://arxiv.org/abs/2502.05432
Authors: Mohammadreza Baharani, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Gabriel Maldonado, Hamed Tabkhi
Affiliations: The UNC at Charlotte
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Foundation Models (FMs) have increasingly drawn the attention of researchers due to their scalability and generalization across diverse tasks. Inspired by the success of FMs and the principles that have driven advancements in Large Language Models (LLMs), we introduce MoFM as a novel Motion Foundation Model. MoFM is designed for the semantic understanding of complex human motions in both time and space. To facilitate large-scale training, MotionBook, a comprehensive human motion dictionary of discretized motions, is designed and employed. MotionBook utilizes Thermal Cubes to capture spatio-temporal motion heatmaps, applying principles from discrete variational models to encode human movements into discrete units for a more efficient and scalable representation. MoFM, trained on a large corpus of motion data, provides a foundational backbone adaptable to diverse downstream tasks, supporting paradigms such as one-shot, unsupervised, and supervised tasks. This versatility makes MoFM well-suited for a wide range of motion-based applications.
[CV-148] LRA-GNN: Latent Relation-Aware Graph Neural Network with Initial and Dynamic Residual for Facial Age Estimation
Quick read: This paper addresses the missing latent relations in graph neural network (GNN) based face representation modeling, which are crucial for deep semantic representations of face aging. The key is the Latent Relation-Aware Graph Neural Network with Initial and Dynamic Residual (LRA-GNN). An initial graph is first built from facial key points as prior knowledge, and a random walk strategy is applied to it to obtain the global structure; a multi-attention mechanism then captures latent relations and generates a set of fully connected graphs rich in facial information and complete in structure. To avoid over-smoothing during deep feature extraction on the fully connected graphs, carefully designed deep residual graph convolutional networks fuse adaptive initial residuals and dynamic developmental residuals to keep information consistent and diverse. Finally, progressive reinforcement learning optimizes the ensemble classification regressor to improve estimation accuracy and generalization.
Link: https://arxiv.org/abs/2502.05423
Authors: Yiping Zhang, Yuntao Shou, Wei Ai, Tao Meng, Keqin Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Face information is mainly concentrated among facial key points, and frontier research has begun to use graph neural networks to segment faces into patches as nodes to model complex face representations. However, these methods construct node-to-node relations based on similarity thresholds, so there is a problem that some latent relations are missing. These latent relations are crucial for deep semantic representation of face aging. In this work, we propose a new Latent Relation-Aware Graph Neural Network with Initial and Dynamic Residual (LRA-GNN) to achieve robust and comprehensive facial representation. Specifically, we first construct an initial graph utilizing facial key points as prior knowledge, and then a random walk strategy is employed to the initial graph for obtaining the global structure, both of which together guide the subsequent effective exploration and comprehensive representation. Then LRA-GNN leverages the multi-attention mechanism to capture the latent relations and generates a set of fully connected graphs containing rich facial information and complete structure based on the aforementioned guidance. To avoid over-smoothing issues for deep feature extraction on the fully connected graphs, the deep residual graph convolutional networks are carefully designed, which fuse adaptive initial residuals and dynamic developmental residuals to ensure the consistency and diversity of information. Finally, to improve the estimation accuracy and generalization ability, progressive reinforcement learning is proposed to optimize the ensemble classification regressor. Our proposed framework surpasses the state-of-the-art baselines on several age estimation benchmarks, demonstrating its strength and effectiveness.
[CV-149] Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
Quick read: This paper addresses the inference inefficiency of the Show-o model in both text-to-image and image-to-text generation. The key is Show-o Turbo, which identifies a unified denoising perspective and extends consistency distillation (CD) to shorten the denoising process, aided by a trajectory segmentation strategy and a curriculum learning procedure that improve training convergence. In text-to-image generation, Show-o Turbo reaches a GenEval score of 0.625 with only 4 sampling steps and no classifier-free guidance (CFG), beating the original Show-o with 8 steps and CFG; in image-to-text generation it achieves a 1.5x speedup without significant performance loss.
Link: https://arxiv.org/abs/2502.05415
Authors: Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng
Affiliations: Shanghai Jiao Tong University; Huawei; Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and hence, unfortunately, suffers from inefficiency issues from both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), a qualified approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve the training convergence. Empirically, in text-to-image generation, Show-o Turbo displays a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming that of the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at this https URL.
[CV-150] Vision-in-the-loop Simulation for Deep Monocular Pose Estimation of UAV in Ocean Environment
Quick read: This paper addresses the challenge of validating deep monocular vision-based UAV pose estimation in real ocean environments, where the limited availability and high operational cost of research vessels make direct validation difficult. The key is a photo-realistic 3D virtual environment built with Gaussian splatting, a technique that models image pixels as Gaussian distributions in 3D space to create a lightweight, high-quality multi-view visual model assembled from many real-world images collected in situ. The virtual environment supports indoor testing of flight maneuvers and verification of all aspects of flight software, hardware, and the deep monocular pose estimation scheme, offering a cost-effective way to test and validate autonomous shipboard UAV flight with a focus on vision-based control and estimation algorithms.
Link: https://arxiv.org/abs/2502.05409
Authors: Maneesha Wickramasuriya, Beomyeol Yu, Taeyoung Lee, Murray Snyder
Affiliations: George Washington University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: 8 pages, 15 figures, conference
Abstract:This paper proposes a vision-in-the-loop simulation environment for deep monocular pose estimation of a UAV operating in an ocean environment. Recently, a deep neural network with a transformer architecture has been successfully trained to estimate the pose of a UAV relative to the flight deck of a research vessel, overcoming several limitations of GPS-based approaches. However, validating the deep pose estimation scheme in an actual ocean environment poses significant challenges due to the limited availability of research vessels and the associated operational costs. To address these issues, we present a photo-realistic 3D virtual environment leveraging recent advancements in Gaussian splatting, a novel technique that represents 3D scenes by modeling image pixels as Gaussian distributions in 3D space, creating a lightweight and high-quality visual model from multiple viewpoints. This approach enables the creation of a virtual environment integrating multiple real-world images collected in situ. The resulting simulation enables the indoor testing of flight maneuvers while verifying all aspects of flight software, hardware, and the deep monocular pose estimation scheme. This approach provides a cost-effective solution for testing and validating the autonomous flight of shipboard UAVs, specifically focusing on vision-based control and estimation algorithms.
[CV-151] Convolutional Deep Colorization for Image Compression: A Color Grid Based Approach
Quick read: This paper optimizes convolutional-network-based image colorization for fully automated retention of image color information, with the goal of improving image compression ratios. The key is a color grid approach that minimizes the amount of stored color information while still allowing images to be faithfully re-colored, ultimately achieving high colorful structural similarity (CSIM) values.
Link: https://arxiv.org/abs/2502.05402
Authors: Ian Tassin, Kristen Goebel, Brittany Lasher
Affiliations: Oregon State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The search for image compression optimization techniques is a topic of constant interest both in and out of academic circles. One method that shows promise toward future improvements in this field is image colorization since image colorization algorithms can reduce the amount of color data that needs to be stored for an image. Our work focuses on optimizing a color grid based approach to fully-automated image color information retention with regard to convolutional colorization network architecture for the purposes of image compression. More generally, using a convolutional neural network for image re-colorization, we want to minimize the amount of color information that is stored while still being able to faithfully re-color images. Our results yielded a promising image compression ratio, while still allowing for successful image recolorization reaching high CSIM values.
zh
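上文的"色彩网格"思路可以用一个简化的 NumPy 草图说明:只存储全分辨率灰度图加稀疏采样的颜色网格,复原时再由着色模型补回颜色(论文中由卷积着色网络完成,这里用最近邻上采样作占位;函数名与网格间隔均为示意):

```python
import numpy as np

def compress_color_grid(rgb, grid=8):
    """只保留全分辨率亮度与 grid x grid 间隔上的颜色采样点。"""
    gray = rgb.mean(axis=2)                    # 全分辨率亮度(实际可用加权灰度)
    color_grid = rgb[::grid, ::grid, :]        # 稀疏的颜色网格
    return gray, color_grid

def naive_recolor(gray, color_grid, grid=8):
    """用最近邻上采样颜色网格再与灰度融合,代替论文中的卷积着色网络。"""
    h, w = gray.shape
    up = np.repeat(np.repeat(color_grid, grid, axis=0), grid, axis=1)[:h, :w, :]
    chroma = up - up.mean(axis=2, keepdims=True)   # 粗略的色度分量
    return np.clip(gray[..., None] + chroma, 0, 255)

img = np.random.randint(0, 256, (64, 64, 3)).astype(np.float32)
gray, cg = compress_color_grid(img)
rec = naive_recolor(gray, cg)
print(gray.nbytes + cg.nbytes, "bytes vs", img.nbytes, "bytes")  # 存储量对比
```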
[CV-152] Beyond and Free from Diffusion: Invertible Guided Consistency Training
【速读】:该论文旨在解决在一致性模型(Consistency Models, CMs)中实现高效且有针对性的图像生成与编辑的问题。现有方法依赖于从预训练的扩散模型(Diffusion Models, DMs)中蒸馏分类器自由引导(Classifier-free Guidance, CFG)的知识,这不仅成本高昂而且缺乏灵活性。论文的关键解决方案是提出了可逆引导一致性训练(invertible Guided Consistency Training, iGCT),这是一种完全基于数据驱动的新训练框架。iGCT无需训练和蒸馏扩散模型,显著减少了计算需求,并解决了高引导尺度下CFG导致的饱和伪影问题。实验结果表明,iGCT在CIFAR-10和ImageNet64数据集上显著提升了FID分数和精度。
链接: https://arxiv.org/abs/2502.05391
作者: Chia-Hong Hsu,Shiu-hong Kao,Randall Balestriero
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Guidance in image generation steers models towards higher-quality or more targeted outputs, typically achieved in Diffusion Models (DMs) via Classifier-free Guidance (CFG). However, recent Consistency Models (CMs), which offer fewer function evaluations, rely on distilling CFG knowledge from pretrained DMs to achieve guidance, making them costly and inflexible. In this work, we propose invertible Guided Consistency Training (iGCT), a novel training framework for guided CMs that is entirely data-driven. iGCT, as a pioneering work, contributes to fast and guided image generation and editing without requiring the training and distillation of DMs, greatly reducing the overall compute requirements. iGCT addresses the saturation artifacts seen in CFG under high guidance scales. Our extensive experiments on CIFAR-10 and ImageNet64 show that iGCT significantly improves FID and precision compared to CFG. At a guidance of 13, iGCT improves precision to 0.8, while DM’s drops to 0.47. Our work takes the first step toward enabling guidance and inversion for CMs without relying on DMs.
zh

[CV-153] Coarse-to-Fine Structure-Aware Artistic Style Transfer
【速读】:该论文旨在解决艺术风格迁移过程中存在的问题,即现有方法通常仅将风格图像的纹理和颜色简单转移到内容图像的整体结构上,而忽略了局部结构的一致性。为了解决这一问题,论文提出了一种有效的方法,通过在低分辨率下使用粗网络(Coarse Network)重构不同层次的粗略风格化特征,在此过程中大致转移风格的颜色分布,并结合内容结构与风格结构。随后,利用细网络(Fine Network)中的三个结构选择性融合(SSF)模块,采用重构特征和内容特征合成高质量的结构感知风格化图像。关键在于结合全局和局部风格信息的同时保持内容的基本结构。
链接: https://arxiv.org/abs/2502.05387
作者: Kunxiao Liu,Guowu Yuan,Hao Wu,Wenhua Qian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 17 figures
点击查看摘要
Abstract:Artistic style transfer aims to use a style image and a content image to synthesize a target image that retains the same artistic expression as the style image while preserving the basic content of the content image. Many recently proposed style transfer methods have a common problem; that is, they simply transfer the texture and color of the style image to the global structure of the content image. As a result, the content image has a local structure that is not similar to the local structure of the style image. In this paper, we present an effective method that can be used to transfer style patterns while fusing the local style structure into the local content structure. In our method, different levels of coarse stylized features are first reconstructed at low resolution using a Coarse Network, in which style color distribution is roughly transferred, and the content structure is combined with the style structure. Then, the reconstructed features and the content features are adopted to synthesize high-quality structure-aware stylized images with high resolution using a Fine Network with three structural selective fusion (SSF) modules. The effectiveness of our method is demonstrated through the generation of appealing high-quality stylization results and a comparison with some state-of-the-art style transfer methods.
zh
[CV-154] NextBestPath: Efficient 3D Mapping of Unseen Environments ICLR2025
【速读】:本文旨在解决主动三维映射(Active 3D Mapping)中的问题,即智能体需要找到一个有效的轨迹以全面重建新场景。现有方法主要预测智能体附近的下一最佳视角,容易使智能体困于局部区域。此外,现有的室内数据集因几何复杂度有限、缺乏精确的真值网格而不够充分。为克服这些限制,本文利用游戏《毁灭战士》(Doom)的地图生成器构建了名为AiMDoom的新数据集,从而更好地在多样的室内环境中对主动三维映射进行基准测试。同时,本文提出了一种称为下一最佳路径(Next-Best-Path, NBP)的新方法,该方法预测长期目标而非仅仅关注短期视图。模型联合预测长期目标的累积表面覆盖增益和障碍物地图,使其能够通过统一模型高效规划最优路径。通过利用在线数据收集、数据增强和课程学习,NBP在现有的MP3D数据集和本文的AiMDoom数据集上显著优于现有最先进方法,实现了在不同复杂度的室内环境中的更高效映射。
链接: https://arxiv.org/abs/2502.05378
作者: Shiyao Li,Antoine Guédon,Clémentin Boittiaux,Shizhe Chen,Vincent Lepetit
机构: LIGM, École Nationale des Ponts et Chaussees, IP Paris, Univ Gustave Eiffel, CNRS(国立桥路学校, IP巴黎, 格斯塔夫·埃菲尔大学, 法国国家科学研究中心), France; Inria(法国国家信息与自动化研究所), École normale supérieure(高等师范学院), CNRS(法国国家科学研究中心), PSL Research University(巴黎文理研究大学), France
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: To appear at ICLR 2025. Project webpage: this https URL
点击查看摘要
Abstract:This work addresses the problem of active 3D mapping, where an agent must find an efficient trajectory to exhaustively reconstruct a new scene. Previous approaches mainly predict the next best view near the agent’s location, which is prone to getting stuck in local areas. Additionally, existing indoor datasets are insufficient due to limited geometric complexity and inaccurate ground truth meshes. To overcome these limitations, we introduce a novel dataset AiMDoom with a map generator for the Doom video game, enabling to better benchmark active 3D mapping in diverse indoor environments. Moreover, we propose a new method we call next-best-path (NBP), which predicts long-term goals rather than focusing solely on short-sighted views. The model jointly predicts accumulated surface coverage gains for long-term goals and obstacle maps, allowing it to efficiently plan optimal paths with a unified model. By leveraging online data collection, data augmentation and curriculum learning, NBP significantly outperforms state-of-the-art methods on both the existing MP3D dataset and our AiMDoom dataset, achieving more efficient mapping in indoor environments of varying complexity.
zh
[CV-155] Towards Fine-grained Renal Vasculature Segmentation: Full-Scale Hierarchical Learning with FH-Seg
【速读】:该论文旨在解决肾血管精细分割的准确性问题,特别是在结构复杂且标注数据不足的情况下。为应对这一挑战,论文提出了一种名为FH-Seg的全尺度分层学习框架。FH-Seg的关键在于其采用的全尺度跳跃连接,能够融合跨尺度的解剖细节信息与上下文语义信息,有效弥合结构与病理背景之间的差距。此外,通过引入可学习的分层软注意力门机制,进一步减少非核心信息的干扰,增强对关键血管特征的关注。
链接: https://arxiv.org/abs/2502.05320
作者: Yitian Long,Zhongze Wu,Xiu Su,Lining Yu,Ruining Deng,Haichun Yang,Yuankai Huo
机构: Vanderbilt University (范德比尔特大学); Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate fine-grained segmentation of the renal vasculature is critical for nephrological analysis, yet it faces challenges due to diverse and insufficiently annotated images. Existing methods struggle to accurately segment intricate regions of the renal vasculature, such as the inner and outer walls, arteries and lesions. In this paper, we introduce FH-Seg, a Full-scale Hierarchical Learning Framework designed for comprehensive segmentation of the renal vasculature. Specifically, FH-Seg employs full-scale skip connections that merge detailed anatomical information with contextual semantics across scales, effectively bridging the gap between structural and pathological contexts. Additionally, we implement a learnable hierarchical soft attention gates to adaptively reduce interference from non-core information, enhancing the focus on critical vascular features. To advance research on renal pathology segmentation, we also developed a Large Renal Vasculature (LRV) dataset, which contains 16,212 fine-grained annotated images of 5,600 renal arteries. Extensive experiments on the LRV dataset demonstrate FH-Seg’s superior accuracies (71.23% Dice, 73.06% F1), outperforming Omni-Seg by 2.67 and 2.13 percentage points respectively. Code is available at: this https URL.
zh
[CV-156] Drone Detection and Tracking with YOLO and a Rule-based Method
【速读】:该论文旨在解决公共空间中非法无人机活动(如边界侵入)的检测问题。解决方案的关键在于扩展了一个已发布的开源数据集,并使用该数据集训练了YOLOv7及其若干小型变体的深度学习模型。为了提高视频中的跟踪性能并减少检测丢失,采用了基于简单互相关的跟踪器。
链接: https://arxiv.org/abs/2502.05292
作者: Purbaditya Bhattacharya,Patrick Nowak
机构: Helmut Schmidt University (赫尔穆特·施密特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Drones or unmanned aerial vehicles are traditionally used for military missions, warfare, and espionage. However, the usage of drones has significantly increased due to multiple industrial applications involving security and inspection, transportation, research purposes, and recreational drone flying. Such an increased volume of drone activity in public spaces requires regulatory actions for purposes of privacy protection and safety. Hence, detection of illegal drone activities such as boundary encroachment becomes a necessity. Such detection tasks are usually automated and performed by deep learning models which are trained on annotated image datasets. This paper builds on a previous work and extends an already published open source dataset. A description and analysis of the entire dataset is provided. The dataset is used to train the YOLOv7 deep learning model and some of its minor variants and the results are provided. Since the detection models are based on a single image input, a simple cross-correlation based tracker is used to reduce detection drops and improve tracking performance in videos. Finally, the entire drone detection system is summarized.
zh
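摘要中提到的"基于简单互相关的跟踪器"大致可按如下方式实现:以上一帧检测框内的图像为模板,在其邻域内搜索归一化互相关的峰值位置,用于在检测器漏检的帧上维持跟踪。以下为概念性草图,搜索窗口大小等参数均为假设:

```python
import numpy as np

def ncc(patch, template):
    """归一化互相关:衡量候选区域与模板的相似度,取值约在 [-1, 1]。"""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum()) + 1e-8
    return float((p * t).sum() / denom)

def track_by_correlation(frame, template, prev_xy, search=10):
    """在上一帧位置附近做局部搜索,取互相关峰值作为新位置。"""
    th, tw = template.shape
    x0, y0 = prev_xy
    best, best_xy = -1.0, prev_xy
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0 or y + th > frame.shape[0] or x + tw > frame.shape[1]:
                continue
            score = ncc(frame[y:y + th, x:x + tw], template)
            if score > best:
                best, best_xy = score, (x, y)
    return best_xy, best

frame = np.random.rand(120, 160)
tpl = frame[40:56, 60:76].copy()        # 用上一帧检测框内容作为模板
pos, score = track_by_correlation(frame, tpl, (60, 40))
print(pos, round(score, 3))
```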
[CV-157] Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning
【速读】:该论文旨在解决在密集对比表示学习(Dense Contrastive Representation Learning, DCRL)中由于医学图像特性导致的不可靠对应关系发现的问题,进而产生大规模的假阳性与假阴性(False Positive and Negative, FPN)配对。论文的关键解决方案是提出了GEMINI学习框架,该框架嵌入了同胚先验知识到DCRL中,并通过可变形同胚学习(Deformable Homeomorphism Learning, DHL)来建模医学图像的同胚性,从而可靠地预测像素间的对应关系。此外,引入了几何语义相似度(Geometric Semantic Similarity, GSS)以提取特征中的语义信息,衡量对应关系的学习对齐程度,进一步提升变形学习的效率和性能,确保正样本对的可靠构建。
链接: https://arxiv.org/abs/2502.05282
作者: Yuting He,Boyu Wang,Rongjun Ge,Yang Chen,Guanyu Yang,Shuo Li
机构: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, China (东南大学); Centre de Recherche en Information Biomédicale Sino-Français (CRIBs), Nanjing, China (中法生物医学信息研究中心); Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing, Nanjing, China (江苏省医疗信息处理联合国际研究实验室); School of Instrument Science and Engineering, Southeast University, Nanjing, China (东南大学仪器科学与工程学院); Department of Computer Science, Western University, London, ON N6A 3K7, Canada (西方大学计算机科学系); Department of Biomedical Engineering and the Department of Computer and Data Science, Case Western Reserve University, Cleveland, OH 44106 USA (凯斯西储大学生物医学工程系和计算机与数据科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by T-PAMI 2025
点击查看摘要
Abstract:Dense contrastive representation learning (DCRL) has greatly improved the learning efficiency for image-dense prediction tasks, showing its great potential to reduce the large costs of medical image collection and dense annotation. However, the properties of medical images make unreliable correspondence discovery, bringing an open problem of large-scale false positive and negative (FPN) pairs in DCRL. In this paper, we propose GEoMetric vIsual deNse sImilarity (GEMINI) learning which embeds the homeomorphism prior to DCRL and enables a reliable correspondence discovery for effective dense contrast. We propose a deformable homeomorphism learning (DHL) which models the homeomorphism of medical images and learns to estimate a deformable mapping to predict the pixels’ correspondence under topological preservation. It effectively reduces the searching space of pairing and drives an implicit and soft learning of negative pairs via a gradient. We also propose a geometric semantic similarity (GSS) which extracts semantic information in features to measure the alignment degree for the correspondence learning. It will promote the learning efficiency and performance of deformation, constructing positive pairs reliably. We implement two practical variants on two typical representation learning tasks in our experiments. Our promising results on seven datasets which outperform the existing methods show our great superiority. We will release our code on a companion link: this https URL.
zh
[CV-158] Invizo: Arabic Handwritten Document Optical Character Recognition Solution
【速读】:该论文旨在解决阿拉伯语手写和印刷文本的识别问题,这一任务因其复杂多变的书写系统而面临诸多挑战。解决方案的关键在于提出了一种端到端的识别模型,该模型集成了最先进的基于CNN的特征提取和基于Transformer的序列建模技术,以应对书写风格、笔画粗细、对齐方式及噪声条件的变化。该模型在印刷体文本上的字符错误率(CER)为0.59%,字错误率(WER)为1.72%,在手写文本上的CER为7.91%,WER为31.41%。这些结果表明,该方案在实际生活中可用于光学字符识别(OCR)任务,并且具有通用性,适用于任何阿拉伯语手写或印刷文档。
链接: https://arxiv.org/abs/2502.05277
作者: Alhossien Waly,Bassant Tarek,Ali Feteha,Rewan Yehia,Gasser Amr,Walid Gomaa,Ahmed Fares
机构: Department of Computer Science and Engineering, Egypt-Japan University of Science and Technology(埃及日本科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Converting images of Arabic text into plain text is a widely researched topic in academia and industry. However, recognition of Arabic handwritten and printed text presents difficult challenges due to the complex nature of variations of the Arabic script. This work proposes an end-to-end solution for recognizing Arabic handwritten, printed, and Arabic numbers and presents the data in a structured manner. We reached 81.66% precision, 78.82% recall, and 79.07% F-measure on a Text Detection task that powers the proposed solution. The proposed recognition model incorporates state-of-the-art CNN-based feature extraction and Transformer-based sequence modeling to accommodate variations in handwriting styles, stroke thicknesses, alignments, and noise conditions. The evaluation of the model suggests strong performance on both printed and handwritten texts, yielding 0.59% CER and 1.72% WER on printed text, and 7.91% CER and 31.41% WER on handwritten text. The overall proposed solution has proven reliable in real-life OCR tasks, as it is equipped with both detection and recognition models as well as supporting feature-extraction and matching algorithms. With its general-purpose implementation, the solution is valid for any Arabic handwritten or printed document or receipt, making it practical and useful in any given context.
zh
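文中报告的 CER/WER 指标基于编辑距离:CER 在字符级、WER 在词级计算 Levenshtein 距离与参考长度之比。下面是一个自包含的小实现(示例字符串为虚构):

```python
def edit_distance(a, b):
    """标准 Levenshtein 编辑距离(插入/删除/替换各计 1),滚动数组实现。"""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # 删除
                                     dp[j - 1] + 1,        # 插入
                                     prev + (ca != cb))    # 替换或匹配
    return dp[-1]

def cer(ref, hyp):
    """字符错误率:字符级编辑距离 / 参考字符数。"""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    """字错误率:词级编辑距离 / 参考词数。"""
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

ref, hyp = "recognize arabic text", "recognise arabic texts"
print(round(cer(ref, hyp), 3), round(wer(ref, hyp), 3))
```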
[CV-159] Interpretable Failure Detection with Human-Level Concepts
【速读】:该论文旨在解决神经网络在安全关键应用中可靠故障检测的问题。现有方法依赖于类别级别的信号(如logits)来评估置信度,这导致模型对于误分类样本会产生过于自信的预测,从而引发问题。论文的关键解决方案在于引入一种创新策略,通过利用人类概念层面的信息,基于概念激活的序贯排名对输入图像进行细粒度的置信度评估,从而显著降低了不同现实世界图像分类基准中的误报率。具体而言,在ImageNet数据集上误报率降低了3.7%,在EuroSAT数据集上则降低了9%。
链接: https://arxiv.org/abs/2502.05275
作者: Kien X. Nguyen,Tang Li,Xi Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Reliable failure detection holds paramount importance in safety-critical applications. Yet, neural networks are known to produce overconfident predictions for misclassified samples. As a result, it remains a problematic matter as existing confidence score functions rely on category-level signals, the logits, to detect failures. This research introduces an innovative strategy, leveraging human-level concepts for a dual purpose: to reliably detect when a model fails and to transparently interpret why. By integrating a nuanced array of signals for each category, our method enables a finer-grained assessment of the model’s confidence. We present a simple yet highly effective approach based on the ordinal ranking of concept activation to the input image. Without bells and whistles, our method significantly reduces the false positive rate across diverse real-world image classification benchmarks, specifically by 3.7% on ImageNet and 9% on EuroSAT.
zh
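论文的核心是以"概念激活的序贯排名"作为细粒度置信度。下面给出一种可能的实现思路(纯属示意,并非论文代码):比较输入图像上概念激活的观测排名与该预测类别的典型概念排序之间的一致性,用 Spearman 相关作为置信分数;概念激活值与类别排序均为假设数据:

```python
import numpy as np

def concept_rank_score(concept_acts, class_concept_order):
    """concept_acts: 各概念在输入上的激活值 (n_concepts,)
    class_concept_order: 该预测类别下概念重要性的先验排序(索引列表)
    返回观测排名与先验排名的 Spearman 一致性,越接近 1 预测越可信。"""
    observed_rank = np.argsort(np.argsort(-concept_acts))     # 激活越高排名越靠前
    expected_rank = np.empty_like(observed_rank)
    expected_rank[np.array(class_concept_order)] = np.arange(len(class_concept_order))
    n = len(concept_acts)
    d = observed_rank - expected_rank
    return 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))        # Spearman rho

acts = np.array([0.9, 0.1, 0.5, 0.3])             # 假设的概念激活
order = [0, 2, 3, 1]                              # 假设的类别典型概念排序
print(round(concept_rank_score(acts, order), 3))  # 此例完全一致,输出 1.0
```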
[CV-160] Survey on AI-Generated Media Detection: From Non-MLLM to MLLM
【速读】:该论文旨在解决AI生成媒体检测领域中从领域特定方法向通用方法过渡的研究空白。论文的关键在于通过系统性回顾和对比分析,探讨非多模态大语言模型(Non-MLLM-based)和基于多模态大语言模型(MLLM-based)两种检测方法的相似性和差异性,并探索潜在的混合方法及识别伪造检测的关键挑战,从而为未来研究提供方向。此外,论文还关注了随着多模态大语言模型在检测任务中的广泛应用而产生的伦理和安全问题,审视了不同司法管辖区生成式AI (GenAI) 的监管环境,为研究人员和从业者提供了有价值的见解。
链接: https://arxiv.org/abs/2502.05240
作者: Yueying Zou,Peipei Li,Zekun Li,Huaibo Huang,Xing Cui,Xuannan Liu,Chenghanyu Zhang,Ran He
机构: School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院), China; School of Computer Science, University of California, Santa Barbara(加州大学圣塔芭芭拉分校计算机科学学院), USA; State Key Laboratory of Multi-modal Artificial Intelligence Systems, CASIA, New Laboratory of Pattern Recognition, CASIA, and School of Artificial Intelligence, University of Chinese Academy of Sciences(中科院多模态人工智能系统国家重点实验室;中科院模式识别新实验室;中国科学院大学人工智能学院), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The proliferation of AI-generated media poses significant challenges to information authenticity and social trust, making reliable detection methods highly demanded. Methods for detecting AI-generated media have evolved rapidly, paralleling the advancement of Multimodal Large Language Models (MLLMs). Current detection approaches can be categorized into two main groups: Non-MLLM-based and MLLM-based methods. The former employs high-precision, domain-specific detectors powered by deep learning techniques, while the latter utilizes general-purpose detectors based on MLLMs that integrate authenticity verification, explainability, and localization capabilities. Despite significant progress in this field, there remains a gap in literature regarding a comprehensive survey that examines the transition from domain-specific to general-purpose detection methods. This paper addresses this gap by providing a systematic review of both approaches, analyzing them from single-modal and multi-modal perspectives. We present a detailed comparative analysis of these categories, examining their methodological similarities and differences. Through this analysis, we explore potential hybrid approaches and identify key challenges in forgery detection, providing direction for future research. Additionally, as MLLMs become increasingly prevalent in detection tasks, ethical and security considerations have emerged as critical global concerns. We examine the regulatory landscape surrounding Generative AI (GenAI) across various jurisdictions, offering valuable insights for researchers and practitioners in this field.
zh
[CV-161] L2GNet: Optimal Local-to-Global Representation of Anatomical Structures for Generalized Medical Image Segmentation
【速读】:该论文旨在解决现有连续潜空间(Continuous Latent Space, CLS)和协同连续离散潜空间(Synergistic Continuous and Discrete Latent Space, CDLS)模型在医学图像分割中捕捉长程依赖关系时存在的问题。具体而言,这些问题包括模型可能在冗余区域捕获依赖关系,从而影响解剖结构内容的理解,增加假阴性结果,并削弱模型的泛化能力。论文的关键解决方案是提出L2GNet,通过使用最优传输方法关联离散代码以及在可训练参考上对齐代码来学习全局依赖关系,从而实现高效的表示学习,而无需额外的权重矩阵。这使得L2GNet在保持计算效率的同时,在多器官分割和心脏数据集上的表现优于现有最先进的方法,包括CDLS方法SynergyNet。
链接: https://arxiv.org/abs/2502.05229
作者: Vandan Gorade,Sparsh Mittal,Neethi Dasu,Rekha Singhal,KC Santosh,Debesh Jha
机构: Department of Biomedical Engineering, Northwestern University (西北大学), IL, USA;
Mehta School of Data Science and AI, Indian Institute of Technology (印度理工学院), Roorkee, India;
Beth Israel Lahey Health, University of Massachusetts Chan School of Medicine (马萨诸塞大学Chan医学院), USA;
TCS Research (塔塔咨询服务公司研究部门), New York, USA;
Applied AI Research Lab (应用人工智能研究中心), Department of Computer Science, University of South Dakota (南达科他大学), USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Continuous Latent Space (CLS) and Discrete Latent Space (DLS) models, like AttnUNet and VQUNet, have excelled in medical image segmentation. In contrast, Synergistic Continuous and Discrete Latent Space (CDLS) models show promise in handling fine and coarse-grained information. However, they struggle with modeling long-range dependencies. CLS or CDLS-based models, such as TransUNet or SynergyNet are adept at capturing long-range dependencies. Since they rely heavily on feature pooling or aggregation using self-attention, they may capture dependencies among redundant regions. This hinders comprehension of anatomical structure content, poses challenges in modeling intra-class and inter-class dependencies, increases false negatives and compromises generalization. Addressing these issues, we propose L2GNet, which learns global dependencies by relating discrete codes obtained from DLS using optimal transport and aligning codes on a trainable reference. L2GNet achieves discriminative on-the-fly representation learning without an additional weight matrix in self-attention models, making it computationally efficient for medical applications. Extensive experiments on multi-organ segmentation and cardiac datasets demonstrate L2GNet’s superiority over state-of-the-art methods, including the CDLS method SynergyNet, offering a novel approach to enhance deep learning models’ performance in medical image analysis.
zh
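L2GNet 用最优传输来关联离散码与可训练参考。其中常用的熵正则化最优传输可由 Sinkhorn 迭代求解,以下为一个独立的小示例(代价矩阵、嵌入维度等均为假设,与论文的具体实现无关):

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """熵正则化最优传输的 Sinkhorn 迭代。
    cost: 代价矩阵 (n, m);a, b: 两侧边缘分布。返回传输计划。"""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)        # 交替缩放列
        u = a / (K @ v)          # 交替缩放行
    return u[:, None] * K * v[None, :]

# 假设:离散码与可训练参考各自的嵌入,用欧氏距离平方作为代价
codes = np.random.rand(5, 8)      # 5 个离散码嵌入
refs = np.random.rand(4, 8)       # 4 个可训练参考
cost = ((codes[:, None, :] - refs[None, :, :]) ** 2).sum(-1)
plan = sinkhorn(cost, np.full(5, 1 / 5), np.full(4, 1 / 4))
print(round(plan.sum(), 4), plan.shape)   # 传输计划总质量约为 1
```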
[CV-162] VistaFlow: Photorealistic Volumetric Reconstruction with Dynamic Resolution Management via Q-Learning
【速读】:该论文旨在解决通过2D照片重建全交互三维体数据图像的问题。关键解决方案在于引入VistaFlow技术,该技术利用可微渲染系统和QuiQ中间视频控制器,通过动态分辨率管理实现高帧率的新型视角合成。此外,VistaFlow采用PlenOctree数据结构来渲染复杂的光照交互,如反射和次表面散射,并且能够在消费级硬件上以超过100帧每秒的速度达到1080p的分辨率,同时支持从高端工作站到低成本微控制器的各种硬件设备。
链接: https://arxiv.org/abs/2502.05222
作者: Jayram Palamadai,William Yu
机构: Illinois Mathematics and Science Academy(伊利诺伊数学与科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:We introduce VistaFlow, a scalable three-dimensional imaging technique capable of reconstructing fully interactive 3D volumetric images from a set of 2D photographs. Our model synthesizes novel viewpoints through a differentiable rendering system capable of dynamic resolution management on photorealistic 3D scenes. We achieve this through the introduction of QuiQ, a novel intermediate video controller trained through Q-learning to maintain a consistently high framerate by adjusting render resolution with millisecond precision. Notably, VistaFlow runs natively on integrated CPU graphics, making it viable for mobile and entry-level devices while still delivering high-performance rendering. VistaFlow bypasses Neural Radiance Fields (NeRFs), using the PlenOctree data structure to render complex light interactions such as reflection and subsurface scattering with minimal hardware requirements. Our model is capable of outperforming state-of-the-art methods with novel view synthesis at a resolution of 1080p at over 100 frames per second on consumer hardware. By tailoring render quality to the capabilities of each device, VistaFlow has the potential to improve the efficiency and accessibility of photorealistic 3D scene rendering across a wide spectrum of hardware, from high-end workstations to inexpensive microcontrollers.
zh
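QuiQ 的思路是用 Q-learning 按帧率动态调节渲染分辨率。下面用表格型 Q-learning 给出一个玩具版控制器:状态为离散化的帧率,动作为升/降/保持分辨率档位,奖励鼓励帧率贴近目标。档位与帧率的关系、目标帧率 100 FPS 等均为假设的模拟量,仅用于说明机制:

```python
import random

ACTIONS = [-1, 0, 1]        # 降低 / 保持 / 提高分辨率档位
Q = {}                      # (状态, 动作) -> 价值估计

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in ACTIONS)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)   # 标准 Q-learning 更新

def choose_action(s, eps=0.1):
    if random.random() < eps:                                 # epsilon-贪心探索
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((s, a), 0.0))

def simulate_step(level, action):
    """假设的环境:档位越高帧率越低;奖励鼓励帧率贴近目标 100 FPS。"""
    level = min(4, max(0, level + action))
    fps = 140 - 25 * level + random.uniform(-5, 5)
    return level, int(fps // 20), -abs(fps - 100)             # 新档位、离散状态、奖励

level, state = 2, 4
for _ in range(2000):
    a = choose_action(state)
    level, next_state, r = simulate_step(level, a)
    q_update(state, a, r, next_state)
    state = next_state
print("学到的贪心动作:", max(ACTIONS, key=lambda a: Q.get((state, a), 0.0)))
```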
[CV-163] Adversarial Machine Learning: Attacking and Safeguarding Image Datasets
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在对抗性攻击下的脆弱性问题。论文的关键解决方案在于通过对抗训练(adversarial training),即同时使用干净图像和对抗样本(adversarial examples)重新训练模型,来增强模型的鲁棒性。论文评估了这种防御方法的有效性:经过对抗训练后模型的鲁棒性有所提升,但在对抗扰动下仍存在一定的性能损失。
链接: https://arxiv.org/abs/2502.05203
作者: Koushik Chowdhury
机构: Saarland University (萨尔兰大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, published in Proceedings of the Fourth International Conference on Ubiquitous Computing and Intelligent Information Systems (ICUIS-2024)
点击查看摘要
Abstract:This paper examines the vulnerabilities of convolutional neural networks (CNNs) to adversarial attacks and explores a method for their safeguarding. In this study, CNNs were implemented on four of the most common image datasets, namely CIFAR-10, ImageNet, MNIST, and Fashion-MNIST, and achieved high baseline accuracy. To assess the robustness of these models, the Fast Gradient Sign Method (FGSM) was used, an exploit that degrades model accuracy by adding a minimal perturbation to the input image. To counter the FGSM attack, a safeguarding approach was applied, which involves retraining the models on both clean and adversarial images to increase their resistance. FGSM is then applied again, this time to the adversarially trained models, to measure how much their accuracy drops and to evaluate the effectiveness of the defense. While a substantial level of robustness is achieved after adversarial training, the models still suffer some performance losses under adversarial perturbations. This work emphasizes the need for better defenses for models deployed in real-world scenarios against adversaries.
zh
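摘要中的 FGSM 攻击与对抗训练流程可以浓缩为如下 PyTorch 草图:先沿损失对输入梯度的符号方向构造扰动,再将干净样本与对抗样本按一半一半混合训练。玩具分类器与各超参数均为示意:

```python
import torch
import torch.nn as nn

def fgsm_attack(model, x, y, eps=0.03):
    """FGSM:沿损失对输入梯度的符号方向加一步扰动。"""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    """对抗训练一步:干净样本与 FGSM 对抗样本各占一半损失。"""
    x_adv = fgsm_attack(model, x, y, eps)
    optimizer.zero_grad()                      # 清掉攻击阶段累积的参数梯度
    loss = 0.5 * nn.functional.cross_entropy(model(x), y) \
         + 0.5 * nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # 玩具分类器
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.rand(16, 1, 28, 28)
y = torch.randint(0, 10, (16,))
print(adversarial_training_step(model, opt, x, y))
```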
[CV-164] Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?
【速读】:该论文旨在探讨通用型视觉基础模型(DINOv2)与领域特定的基础模型(RETFound)在眼科疾病检测和系统性疾病预测任务中的性能差异。研究的关键在于通过微调不同规模的DINOv2模型(large, base, small)以及RETFound模型,评估它们在标准化眼科数据集及Moorfields AlzEye和UK Biobank数据集上的表现。研究发现,DINOv2-large模型在糖尿病视网膜病变和多类别眼病检测方面优于RETFound,而RETFound在预测心力衰竭、心肌梗死和缺血性中风方面表现更佳。这一系列对比实验揭示了通用型和领域特定型基础模型在不同临床任务中的优势,强调了根据具体任务需求选择合适基础模型的重要性。
链接: https://arxiv.org/abs/2502.06289
作者: Qingshan Hou,Yukun Zhou,Jocelyn Hui Lin Goh,Ke Zou,Samantha Min Er Yew,Sahana Srinivasan,Meng Wang,Thaddaeus Lo,Xiaofeng Lei,Siegfried K. Wagner,Mark A. Chia,Dawei Yang,Hongyang Jiang,AnRan Ran,Rui Santos,Gabor Mark Somfai,Juan Helen Zhou,Haoyu Chen,Qingyu Chen,Carol Yim-Lui Cheung,Pearse A. Keane,Yih Chung Tham
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The advent of foundation models (FMs) is transforming the medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domains. However, its applicability to clinical tasks remains underexplored. To address this, we conducted head-to-head evaluations by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular disease detection and systemic disease prediction tasks, across eight standardized open-source ocular datasets, as well as the Moorfields AlzEye and the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, all P=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 models in predicting heart failure, myocardial infarction, and ischaemic stroke (AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even with 10% of the fine-tuning data. These findings showcase the distinct scenarios where general-purpose and domain-specific FMs excel, highlighting the importance of aligning FM selection with task-specific requirements to optimise clinical performance.
zh
[CV-165] A Data-Efficient Pan-Tumor Foundation Model for Oncology CT Interpretation
【速读】:该论文旨在解决在肿瘤诊断和管理中,计算机断层扫描(CT)基础模型开发所面临的高质量标注数据稀缺的问题。关键解决方案在于开发了PASTA-Gen,这是一种创新的合成肿瘤生成框架,能够生成包含30,000个CT扫描图像及其像素级标注病变和配对结构化报告的大规模综合数据集。这些数据涵盖了十个器官中的恶性肿瘤及五种良性病变类型。通过利用这一丰富的高质量合成数据,论文显著提升了在多种任务上的表现,并有效缓解了由于隐私限制和精确数据标注所需大量劳动而造成的高质量标注数据缺乏的瓶颈。
链接: https://arxiv.org/abs/2502.06171
作者: Wenhui Lei,Hanyu Chen,Zitian Zhang,Luyang Luo,Qiong Xiao,Yannian Gu,Peng Gao,Yankai Jiang,Ci Wang,Guangtao Wu,Tongjia Xu,Yingjie Zhang,Xiaofan Zhang,Pranav Rajpurkar,Shaoting Zhang,Zhenning Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 57 pages, 7 figures
点击查看摘要
Abstract:Artificial intelligence-assisted imaging analysis has made substantial strides in tumor diagnosis and management. Here we present PASTA, a pan-tumor CT foundation model that achieves state-of-the-art performance on 45 of 46 representative oncology tasks – including lesion segmentation, tumor detection in plain CT, tumor staging, survival prediction, structured report generation, and cross-modality transfer learning, significantly outperforming the second-best models on 35 tasks. This remarkable advancement is driven by our development of PASTA-Gen, an innovative synthetic tumor generation framework that produces a comprehensive dataset of 30,000 CT scans with pixel-level annotated lesions and paired structured reports, encompassing malignancies across ten organs and five benign lesion types. By leveraging this rich, high-quality synthetic data, we overcome a longstanding bottleneck in the development of CT foundation models – specifically, the scarcity of publicly available, high-quality annotated datasets due to privacy constraints and the substantial labor required for scaling precise data annotation. Encouragingly, PASTA demonstrates exceptional data efficiency with promising practical value, markedly improving performance on various tasks with only a small amount of real-world data. The open release of both the synthetic dataset and PASTA foundation model effectively addresses the challenge of data scarcity, thereby advancing oncological research and clinical translation.
zh
[CV-166] Event Vision Sensor: A Review
【速读】:该论文旨在综述从神经形态工程到最先进的基于事件的视觉传感器技术的发展趋势、工作原理及关键特性,并深入探讨基于事件的视觉传感器在红外成像领域的敏感度及其面临的机遇与挑战。关键解决方案在于通过背照式(BSI)技术、晶圆堆叠技术和工业接口的整合,显著提升基于事件的视觉传感器的性能,包括降低噪声、提高分辨率和增加读出速率,从而增强其与当前及边缘视觉系统的兼容性,为实际应用提供更大的可能性。
链接: https://arxiv.org/abs/2502.06116
作者: Xinyue Qin,Junlin Zhang,Wenzhong Bao,Chun Lin,Honglei Chen
机构: 未知
类目: Instrumentation and Detectors (physics.ins-det); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:By monitoring temporal contrast, event-based vision sensors can provide high temporal resolution and low latency while maintaining low power consumption and simplicity in circuit structure. These characteristics have garnered significant attention in both academia and industry. In recent years, the application of back-illuminated (BSI) technology, wafer stacking techniques, and industrial interfaces has brought new opportunities for enhancing the performance of event-based vision sensors. This is evident in the substantial advancements made in reducing noise, improving resolution, and increasing readout rates. Additionally, the integration of these technologies has enhanced the compatibility of event-based vision sensors with current and edge vision systems, providing greater possibilities for their practical applications. This paper will review the progression from neuromorphic engineering to state-of-the-art event-based vision sensor technologies, including their development trends, operating principles, and key features. Moreover, we will delve into the sensitivity of event-based vision sensors and the opportunities and challenges they face in the realm of infrared imaging, providing references for future research and applications.
zh
[CV-167] Inverse Problem Sampling in Latent Space Using Sequential Monte Carlo
【速读】:该论文旨在解决图像处理中的逆向问题,即在已知退化模型的情况下,从退化图像中恢复出合理的原始图像。论文的关键解决方案在于提出了一种基于扩散模型潜空间的序列蒙特卡洛(Sequential Monte Carlo, SMC)采样方法。通过利用扩散模型的正向过程添加额外辅助观测,并在反向过程中执行SMC采样,从而克服了传统方法中由于扩散模型顺序性质及自动编码器变换引入的挑战。
链接: https://arxiv.org/abs/2502.05908
作者: Idan Achituve,Hai Victor Habi,Amir Rosenfeld,Arnon Netzer,Idit Diamant,Ethan Fetaya
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In image processing, solving inverse problems is the task of finding plausible reconstructions of an image that was corrupted by some (usually known) degradation model. Commonly, this process is done using a generative image model that can guide the reconstruction towards solutions that appear natural. The success of diffusion models over the last few years has made them a leading candidate for this task. However, the sequential nature of diffusion models makes this conditional sampling process challenging. Furthermore, since diffusion models are often defined in the latent space of an autoencoder, the encoder-decoder transformations introduce additional difficulties. Here, we suggest a novel sampling method based on sequential Monte Carlo (SMC) in the latent space of diffusion models. We use the forward process of the diffusion model to add additional auxiliary observations and then perform an SMC sampling as part of the backward process. Empirical evaluations on ImageNet and FFHQ show the benefits of our approach over competing methods on various inverse problem tasks.
zh
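下面以一个通用的序列蒙特卡洛(SMC)步骤示意该类方法的骨架:粒子先按(扩散反向过程的)转移核前进,再用观测似然更新权重,权重退化时做系统重采样。转移核与似然函数在此均为占位假设,并非论文在扩散潜空间中的实际构造:

```python
import numpy as np

def smc_step(particles, log_weights, transition, log_likelihood, y):
    """SMC 一步:转移 -> 加权 -> 必要时重采样。"""
    particles = transition(particles)                       # 反向去噪一步(假设)
    log_weights = log_weights + log_likelihood(particles, y)
    w = np.exp(log_weights - log_weights.max())
    w /= w.sum()
    ess = 1.0 / (w ** 2).sum()                              # 有效样本数
    if ess < 0.5 * len(w):                                  # 权重退化则重采样
        idx = np.random.choice(len(w), size=len(w), p=w)
        particles, log_weights = particles[idx], np.zeros(len(w))
    return particles, log_weights

n, d = 64, 8
parts = np.random.randn(n, d)
logw = np.zeros(n)
trans = lambda z: 0.95 * z + 0.1 * np.random.randn(*z.shape)       # 占位转移核
loglik = lambda z, y: -0.5 * ((z[:, :2] - y) ** 2).sum(axis=1)     # 假设的观测模型
parts, logw = smc_step(parts, logw, trans, loglik, np.array([1.0, -0.5]))
print(parts.shape, logw.shape)
```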
[CV-168] Image-Based Alzheimers Disease Detection Using Pretrained Convolutional Neural Network Models
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期诊断难题。论文的关键解决方案在于提出了一种基于深度学习技术的计算机辅助诊断系统,通过提取神经影像数据中的相关视觉特征,以实现对阿尔茨海默病的准确预测。实验结果表明,基于VGG16模型的方法在标准数据集上的表现超越了现有技术水平。
链接: https://arxiv.org/abs/2502.05815
作者: Nasser A Alsadhan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Alzheimer’s disease is an untreatable, progressive brain disorder that slowly robs people of their memory, thinking abilities, and ultimately their capacity to complete even the most basic tasks. Among older adults, it is the most frequent cause of dementia. Although there is presently no treatment for Alzheimer’s disease, scientific trials are ongoing to discover drugs to combat the condition. Treatments to slow the signs of dementia are also available. Many researchers throughout the world became interested in developing computer-aided diagnosis systems to aid in the early identification of this deadly disease and assure an accurate diagnosis. In particular, image based approaches have been coupled with machine learning techniques to address the challenges of Alzheimer’s disease detection. This study proposes a computer aided diagnosis system to detect Alzheimer’s disease from biomarkers captured using neuroimaging techniques. The proposed approach relies on deep learning techniques to extract the relevant visual features from the image collection to accurately predict the Alzheimer’s class value. In the experiments, standard datasets and pre-trained deep learning models were investigated. Moreover, standard performance measures were used to assess the models’ performances. The obtained results proved that VGG16-based models outperform the state of the art performance.
zh
[CV-169] 4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis
【速读】:该论文旨在解决早期特发性肺纤维化(Idiopathic Pulmonary Fibrosis, IPF)患者未来计算机断层扫描(CT)图像的准确预测问题,以改善治疗策略和生存结果。关键解决方案在于提出了一种名为4D向量量化生成对抗网络(4D-VQ-GAN)的模型,该模型通过两阶段训练方法实现:首先使用3D-VQ-GAN重建CT体积,然后利用基于神经常微分方程(Neural ODE)的时间模型捕捉编码器生成的量化嵌入的时间动态变化。
链接: https://arxiv.org/abs/2502.05713
作者: An Zhao,Moucheng Xu,Ahmed H. Shahin,Wim Wuyts,Mark G. Jones,Joseph Jacob,Daniel C. Alexander
机构: Hawkes Institute, University College London (霍克斯研究所,伦敦大学学院); Department of Computer Science, University College London (计算机科学系,伦敦大学学院); Department of Med. Physics & Biomedical Engineering, University College London (医学物理与生物医学工程系,伦敦大学学院); Department of Respiratory Medicine, University Hospitals Leuven (呼吸医学系,鲁汶大学医院); NIHR Southampton Biomedical Research Centre and Clinical and Experimental Sciences, University of Southampton (国家健康研究所南安普顿生物医学研究中心和临床与实验科学,南安普顿大学); UCL Respiratory, University College London (伦敦大学学院呼吸科)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 4D image synthesis, VQ-GAN, neural ODEs, spatial temporal disease progression modelling, CT, IPF
点击查看摘要
Abstract:Understanding the progression trajectories of diseases is crucial for early diagnosis and effective treatment planning. This is especially vital for life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a chronic, progressive lung disease with a prognosis comparable to many cancers. Computed tomography (CT) imaging has been established as a reliable diagnostic tool for IPF. Accurately predicting future CT scans of early-stage IPF patients can aid in developing better treatment strategies, thereby improving survival outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF patients at any time point. The model is trained using a two-stage approach. In the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the second stage, a Neural Ordinary Differential Equation (ODE) based temporal model is trained to capture the temporal dynamics of the quantised embeddings generated by the encoder in the first stage. We evaluate different configurations of our model for generating longitudinal CT scans and compare the results against ground truth data, both quantitatively and qualitatively. For validation, we conduct survival analysis using imaging biomarkers derived from generated CT scans and achieve a C-index comparable to that of biomarkers derived from the real CT scans. The survival analysis results demonstrate the potential clinical utility inherent to generated longitudinal CT scans, showing that they can reliably predict survival outcomes.
zh
[CV-170] Inversion of Magnetic Data using Learned Dictionaries and Scale Space
【速读】:该论文旨在解决磁数据反演中的固有问题,包括非唯一解、深度不确定性及噪声敏感性。传统方法依赖预定义的正则化技术来稳定解,但这些方法在处理复杂或多样化的地质场景时适应性较差。论文的关键解决方案在于引入可变字典学习和尺度空间方法。通过使用自学习字典,该方法能够灵活地表示复杂的地下特征,而这些特征难以用预定义基底捕捉。此外,通过尺度空间框架引入多尺度表示,扩展了经典的变分反演,逐步引入结构细节的同时减少了过拟合现象。这种方法显著提高了重建精度和鲁棒性,展示了其在地质勘探、环境评估和矿物探测等领域的应用潜力。
链接: https://arxiv.org/abs/2502.05451
作者: Shadab Ahamed,Simon Ghyselincks,Pablo Chang Huang Arias,Julian Kloiber,Yasin Ranjbar,Jingrong Tang,Niloufar Zakariaei,Eldad Haber
机构: 未知
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 2 figures, 2 tables
点击查看摘要
Abstract:Magnetic data inversion is an important tool in geophysics, used to infer subsurface magnetic susceptibility distributions from surface magnetic field measurements. This inverse problem is inherently ill-posed, characterized by non-unique solutions, depth ambiguity, and sensitivity to noise. Traditional inversion approaches rely on predefined regularization techniques to stabilize solutions, limiting their adaptability to complex or diverse geological scenarios. In this study, we propose an approach that integrates variable dictionary learning and scale-space methods to address these challenges. Our method employs learned dictionaries, allowing for adaptive representation of complex subsurface features that are difficult to capture with predefined bases. Additionally, we extend classical variational inversion by incorporating multi-scale representations through a scale-space framework, enabling the progressive introduction of structural detail while mitigating overfitting. We implement both fixed and dynamic dictionary learning techniques, with the latter introducing iteration-dependent dictionaries for enhanced flexibility. Using a synthetic dataset to simulate geological scenarios, we demonstrate significant improvements in reconstruction accuracy and robustness compared to conventional variational and dictionary-based methods. Our results highlight the potential of learned dictionaries, especially when coupled with scale-space dynamics, to improve model recovery and noise handling. These findings underscore the promise of our data-driven approach for advancing magnetic data inversion and its applications in geophysical exploration, environmental assessment, and mineral prospecting.
zh
[CV-171] Unsupervised Self-Prior Embedding Neural Representation for Iterative Sparse-View CT Reconstruction
【速读】:该论文旨在解决稀疏视图 computed tomography (CT) 重建中的逆问题,特别是在存在噪声的情况下。论文的关键解决方案是引入了一种名为 Self-prior embedding neural representation (Spener) 的新型无监督方法。Spener 通过在每次迭代中提取前一次迭代中的局部图像先验特征并嵌入到神经表示中,从而约束解空间,以此来提高生成式隐式神经表示 (INR) 方法在处理稀疏视图且带有噪声的 CT 数据时的性能。
链接: https://arxiv.org/abs/2502.05445
作者: Xuanyu Tian,Lixuan Chen,Qing Wu,Chenhe Du,Jingjing Shi,Hongjiang Wei,Yuyao Zhang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Emerging unsupervised implicit neural representation (INR) methods, such as NeRP, NeAT, and SCOPE, have shown great potential to address sparse-view computed tomography (SVCT) inverse problems. Although these INR-based methods perform well in relatively dense SVCT reconstructions, they struggle to achieve comparable performance to supervised methods in sparser SVCT scenarios. They are prone to being affected by noise, limiting their applicability in real clinical settings. Additionally, current methods have not fully explored the use of image domain priors for solving SVCT inverse problems. In this work, we demonstrate that imperfect reconstruction results can provide effective image domain priors for INRs to enhance performance. To leverage this, we introduce Self-prior embedding neural representation (Spener), a novel unsupervised method for SVCT reconstruction that integrates iterative reconstruction algorithms. During each iteration, Spener extracts local image prior features from the previous iteration and embeds them to constrain the solution space. Experimental results on multiple CT datasets show that our unsupervised Spener method achieves performance comparable to supervised state-of-the-art (SOTA) methods on in-domain data while outperforming them on out-of-domain datasets. Moreover, Spener significantly improves the performance of INR-based methods in handling SVCT with noisy sinograms. Our code is available at this https URL.
zh
[CV-172] Diverse Image Generation with Diffusion Models and Cross Class Label Learning for Polyp Classification
【速读】:该论文旨在解决结直肠癌(CRC)病理诊断中结肠息肉分类精度受限的问题。现有分类技术主要依赖单一成像模式,并由于数据稀缺而表现有限。为克服这些问题,论文提出了一种名为PathoPolyp-Diff的新模型,该模型能够生成受文本控制的合成图像,具有多样的病理特征、成像模态和质量。解决方案的关键在于引入了跨类别标签学习,使模型能够从其他类别中学到特征,从而减轻数据标注负担。实验结果显示,在平衡准确率方面,使用公开数据集有最高达7.91%的提升,在视频级别分析中则实现了高达18.33%的显著提升。
链接: https://arxiv.org/abs/2502.05444
作者: Vanshali Sharma,Debesh Jha,M.K. Bhuyan,Pradip K. Das,Ulas Bagci
机构: Department of Computer Science & Engineering, Indian Institute of Technology Guwahati (印度理工学院古瓦哈提分校计算机科学与工程系); Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University (西北大学放射学系机器与混合智能实验室); Department of Electronics & Electrical Engineering, Indian Institute of Technology Guwahati (印度理工学院古瓦哈提分校电子与电气工程系); Department of Computer Science & Engineering, Indian Institute of Technology Guwahati (印度理工学院古瓦哈提分校计算机科学与工程系); Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University (西北大学放射学系机器与混合智能实验室)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pathologic diagnosis is a critical phase in deciding the optimal treatment procedure for dealing with colorectal cancer (CRC). Colonic polyps, precursors to CRC, can pathologically be classified into two major types: adenomatous and hyperplastic. For precise classification and early diagnosis of such polyps, the medical procedure of colonoscopy has been widely adopted paired with various imaging techniques, including narrow band imaging and white light imaging. However, the existing classification techniques mainly rely on a single imaging modality and show limited performance due to data scarcity. Recently, generative artificial intelligence has been gaining prominence in overcoming such issues. Additionally, various generation-controlling mechanisms using text prompts and images have been introduced to obtain visually appealing and desired outcomes. However, such mechanisms require class labels to make the model respond efficiently to the provided control input. In the colonoscopy domain, such controlling mechanisms are rarely explored; specifically, the text prompt is a completely uninvestigated area. Moreover, the unavailability of expensive class-wise labels for diverse sets of images limits such explorations. Therefore, we develop a novel model, PathoPolyp-Diff, that generates text-controlled synthetic images with diverse characteristics in terms of pathology, imaging modalities, and quality. We introduce cross-class label learning to make the model learn features from other classes, reducing the burdensome task of data annotation. The experimental results report an improvement of up to 7.91% in balanced accuracy using a publicly available dataset. Moreover, cross-class label learning achieves a statistically significant improvement of up to 18.33% in balanced accuracy during video-level analysis. The code is available at this https URL.
zh
[CV-173] A Novel Convolutional-Free Method for 3D Medical Imaging Segmentation
【速读】:该论文旨在解决3D医学图像分割中的长程依赖性和全局上下文捕捉难题,特别是在精细和复杂结构上的表现不佳。解决方案的关键在于提出了一种基于Transformer架构和自注意力机制的全新全卷积自由模型,专注于提高多语义分割精度,并通过设计一种联合损失函数来应对厚切片与薄切片CT图像之间的领域适应挑战。此外,论文还构建了一个针对薄切片多语义分割的基准数据集。实验结果表明,所提出的模型优于传统及混合架构模型。
链接: https://arxiv.org/abs/2502.05396
作者: Canxuan Gang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
点击查看摘要
Abstract:Segmentation of 3D medical images is a critical task for accurate diagnosis and treatment planning. Convolutional neural networks (CNNs) have dominated the field, achieving significant success in 3D medical image segmentation. However, CNNs struggle with capturing long-range dependencies and global context, limiting their performance, particularly for fine and complex structures. Recent transformer-based models, such as TransUNet and nnFormer, have demonstrated promise in addressing these limitations, though they still rely on hybrid CNN-transformer architectures. This paper introduces a novel, fully convolutional-free model based on transformer architecture and self-attention mechanisms for 3D medical image segmentation. Our approach focuses on improving multi-semantic segmentation accuracy and addressing domain adaptation challenges between thick and thin slice CT images. We propose a joint loss function that facilitates effective segmentation of thin slices based on thick slice annotations, overcoming limitations in dataset availability. Furthermore, we present a benchmark dataset for multi-semantic segmentation on thin slices, addressing a gap in current medical imaging research. Our experiments demonstrate the superiority of the proposed model over traditional and hybrid architectures, offering new insights into the future of convolution-free medical image segmentation.
zh
[CV-174] Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge
【速读】:该论文旨在解决多类主动脉在计算机断层血管造影(CTA)扫描中的分割问题,以支持主动脉夹层患者复杂腔内治疗的诊断和规划。现有的方法将主动脉分割简化为二分类问题,限制了其测量不同分支和区域直径的能力。为填补这一空白,论文组织了AortaSeg24 MICCAI挑战赛,引入了首个包含100个CTA体积数据的公开数据集,这些数据标注了23个临床上相关的主动脉分支和区域。关键在于利用此数据集促进模型开发与验证,并吸引了全球121支团队参与,采用先进的框架如nnU-Net及探索创新技术,包括级联模型、数据增强策略和定制损失函数。评估采用了Dice相似性系数(DSC)和归一化表面距离(NSD)。
链接: https://arxiv.org/abs/2502.05330
作者: Muhammad Imran,Jonathan R. Krebs,Vishal Balaji Sivaraman,Teng Zhang,Amarjeet Kumar,Walker R. Ueland,Michael J. Fassler,Jinlong Huang,Xiao Sun,Lisheng Wang,Pengcheng Shi,Maximilian Rokuss,Michael Baumgartner,Yannick Kirchhof,Klaus H. Maier-Hein,Fabian Isensee,Shuolin Liu,Bing Han,Bong Thanh Nguyen,Dong-jin Shin,Park Ji-Woo,Mathew Choi,Kwang-Hyun Uhm,Sung-Jea Ko,Chanwoong Lee,Jaehee Chun,Jin Sung Kim,Minghui Zhang,Hanxiao Zhang,Xin You,Yun Gu,Zhaohong Pan,Xuan Liu,Xiaokun Liang,Markus Tiefenthaler,Enrique Almar-Munoz,Matthias Schwab,Mikhail Kotyushev,Rostislav Epifanov,Marek Wodzinski,Henning Muller,Abdul Qayyum,Moona Mazher,Steven A. Niederer,Zhiwei Wang,Kaixiang Yang,Jintao Ren,Stine Sofia Korreman,Yuchong Gao,Hongye Zeng,Haoyu Zheng,Rui Zheng,Jinghua Yue,Fugen Zhou,Bo Liu,Alexander Cosman,Muxuan Liang,Chang Zhao,Gilbert R. Upchurch Jr.,Jun Ma,Yuyin Zhou,Michol A. Cooper,Wei Shao
机构: Department of Medicine, University of Florida (佛罗里达大学); Department of Surgery, University of Florida (佛罗里达大学); Department of Electrical and Computer Engineering, University of Florida (佛罗里达大学); Department of Computer, Information Science, and Engineering, University of Florida (佛罗里达大学); Shanghai Jiao Tong University (上海交通大学); Electronic & Information Engineering School, Harbin Institute of Technology (哈尔滨工业大学); Division of Medical Image Computing, German Cancer Research Center (DKFZ) (德国癌症研究中心); CANON MEDICAL SYSTEMS (CHINA) CO., LTD (佳能医疗系统(中国)有限公司); MedAI; Korea University (韩国大学); Department of Radiation Oncology, Yonsei Cancer Center, Heavy Ion Therapy Research Institute, Yonsei University College of Medicine (延世大学医学院); Institute of Medical Robotics, Shanghai Jiao Tong University (上海交通大学); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Medical University of Innsbruck (因斯布鲁克医科大学); Novosibirsk State University (诺沃西比尔斯克国立大学); AGH University of Krakow (克拉科夫AGH科技大学); Institute of Informatics, University of Applied Sciences Western Switzerland (HES-SO) (瑞士西部应用科学大学); University of Geneva (日内瓦大学); The Sense Innovation and Research Center (感知创新与研究中心); National Heart and Lung Institute, Faculty of Medicine, Imperial College London (伦敦帝国理工学院); Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology (华中科技大学); Aarhus University (奥胡斯大学); Shanghaitech University (上海科技大学); Image Processing Center, Beihang University (北京航空航天大学); Department of Computer Science and Engineering, University of California, Santa Cruz (加利福尼亚大学圣克鲁兹分校); Department of Biostatistics, University of Florida (佛罗里达大学); University Health Network (大学健康网络); Agronomy Department, University of Florida (佛罗里达大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently available to support the development of multi-class aortic segmentation methods. To address this gap, we organized the AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes annotated for 23 clinically relevant aortic branches and zones. This dataset was designed to facilitate both model development and validation. The challenge attracted 121 teams worldwide, with participants leveraging state-of-the-art frameworks such as nnU-Net and exploring novel techniques, including cascaded models, data augmentation strategies, and custom loss functions. We evaluated the submitted algorithms using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD), highlighting the approaches adopted by the top five performing teams. This paper presents the challenge design, dataset details, evaluation metrics, and an in-depth analysis of the top-performing algorithms. The annotated dataset, evaluation code, and implementations of the leading methods are publicly available to support further research. All resources can be accessed at this https URL.
zh
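挑战赛采用 Dice 相似性系数(DSC)与归一化表面距离(NSD)作为评测指标。多类别 DSC 的计算方式如下(NSD 还需要表面距离计算,这里从略;玩具体数据为随机生成):

```python
import numpy as np

def dice(pred, gt, num_classes):
    """多类别 Dice 相似系数:逐前景类别计算后取平均。"""
    scores = []
    for c in range(1, num_classes + 1):        # 跳过背景类 0
        p, g = (pred == c), (gt == c)
        denom = p.sum() + g.sum()
        if denom == 0:
            continue                           # 该类别在预测与标注中都不存在
        scores.append(2.0 * np.logical_and(p, g).sum() / denom)
    return float(np.mean(scores)) if scores else 1.0

pred = np.random.randint(0, 4, (32, 32, 32))   # 假设 3 个前景类别的玩具体数据
gt = np.random.randint(0, 4, (32, 32, 32))
print(round(dice(pred, gt, 3), 3))
```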
[CV-175] CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models
【速读】:该论文旨在解决医疗领域深度学习模型在对抗性攻击面前的脆弱性问题,特别是这些模型在面对具有临床特征的误诊场景时的不足。论文的关键解决方案是提出了一种名为概念性报告扰动攻击(Concept-based Report Perturbation Attack, CoRPA)的新框架,该框架专门针对医学成像领域的黑盒对抗性攻击,通过利用临床概念生成接近真实临床误诊情况的对抗性放射学报告和图像。研究表明,即使对传统对抗性攻击具有强鲁棒性的深度学习模型,在面对CoRPA的临床针对性扰动时也表现出显著的脆弱性。这一发现强调了在医学AI系统中解决特定领域漏洞的重要性。
链接: https://arxiv.org/abs/2502.05214
作者: Amy Rafferty,Rishi Ramaesh,Ajitha Rajan
机构: School of Informatics, University of Edinburgh(爱丁堡大学信息学院); NHS Lothian( NHS Lothian); School of Informatics, University of Edinburgh(爱丁堡大学信息学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures
点击查看摘要
Abstract:Deep learning models for medical image classification tasks are becoming widely implemented in AI-assisted diagnostic tools, aiming to enhance diagnostic accuracy, reduce clinician workloads, and improve patient outcomes. However, their vulnerability to adversarial attacks poses significant risks to patient safety. Current attack methodologies use general techniques such as model querying or pixel value perturbations to generate adversarial examples designed to fool a model. These approaches may not adequately address the unique characteristics of clinical errors stemming from missed or incorrectly identified clinical features. We propose the Concept-based Report Perturbation Attack (CoRPA), a clinically-focused black-box adversarial attack framework tailored to the medical imaging domain. CoRPA leverages clinical concepts to generate adversarial radiological reports and images that closely mirror realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our evaluation reveals that deep learning models exhibiting strong resilience to conventional adversarial attacks are significantly less robust when subjected to CoRPA’s clinically-focused perturbations. This underscores the importance of addressing domain-specific vulnerabilities in medical AI systems. By introducing a specialized adversarial attack framework, this study provides a foundation for developing robust, real-world-ready AI models in healthcare, ensuring their safe and reliable deployment in high-stakes clinical environments.
zh
[CV-176] Self-supervised Domain Adaptation for Breaking the Limits of Low-quality Fundus Image Quality Enhancement
【速读】:该论文旨在解决低质量视网膜眼底图像在疾病诊断中的不确定性及潜在误诊问题。解决方案的关键在于采用完全无监督的方式进行图像质量增强,不依赖配对图像或高质量参考图像。通过构建多个基于补丁的领域,并利用辅助预训练的质量评估网络和风格聚类,论文提出了两个自监督领域适应任务,以分离图像内容、低质量因素和风格信息的特征,从而实现鲁棒的低质量图像增强并解决风格不一致问题。
链接: https://arxiv.org/abs/2301.06943
作者: Qingshan Hou,Peng Cao,Jiaqi Wang,Xiaoli Liu,Jinzhu Yang,Osmar R. Zaiane
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Retinal fundus images have been applied for the diagnosis and screening of eye diseases, such as Diabetic Retinopathy (DR) or Diabetic Macular Edema (DME). However, both low-quality fundus images and style inconsistency potentially increase uncertainty in the diagnosis of fundus disease and even lead to misdiagnosis by ophthalmologists. Most of the existing image enhancement methods mainly focus on improving the image quality by leveraging the guidance of high-quality images, which is difficult to be collected in medical applications. In this paper, we tackle image quality enhancement in a fully unsupervised setting, i.e., neither paired images nor high-quality images. To this end, we explore the potential of the self-supervised task for improving the quality of fundus images without the requirement of high-quality reference images. Specifically, we construct multiple patch-wise domains via an auxiliary pre-trained quality assessment network and a style clustering. To achieve robust low-quality image enhancement and address style inconsistency, we formulate two self-supervised domain adaptation tasks to disentangle the features of image content, low-quality factor and style information by exploring intrinsic supervision signals within the low-quality images. Extensive experiments are conducted on EyeQ and Messidor datasets, and results show that our DASQE method achieves new state-of-the-art performance when only low-quality images are available.
zh
人工智能
[AI-0] Matryoshka Quantization
链接: https://arxiv.org/abs/2502.06786
作者: Pranav Nair,Puranjay Datta,Jeff Dean,Prateek Jain,Aditya Kusupati
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models – especially to low precisions like int4 or int2 – requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows training and maintaining just one model, which can then be served at different precision levels. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.
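MatQuant 的出发点是整数类型的嵌套(Matryoshka)结构:int4、int2 恰好嵌套在 int8 的最高有效位中。下面的 NumPy 片段只演示这一"按高位切片"的结构本身(scale 等均为假设);论文真正的贡献在于多尺度联合训练与协同蒸馏,不在此示意范围内:

```python
import numpy as np

def slice_high_bits(w_int8, bits):
    """从 int8 权重中取最高 bits 位,得到嵌套的低精度版本。
    有符号数右移在 NumPy 中为算术移位,保留符号。"""
    shift = 8 - bits
    return (w_int8.astype(np.int8) >> shift).astype(np.int8)

def dequant(w_sliced, bits, scale):
    """把切片后的整数左移回原量级再乘 scale,还原近似的浮点权重。"""
    return (w_sliced.astype(np.int32) << (8 - bits)) * scale

w = np.array([-120, -35, 0, 7, 42, 125], dtype=np.int8)
for b in (8, 4, 2):
    ws = slice_high_bits(w, b)
    print(b, "bit:", dequant(ws, b, scale=0.01))   # 位宽越低,还原越粗糙
```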
[AI-1] RelGNN: Composite Message Passing for Relational Deep Learning
链接: https://arxiv.org/abs/2502.06784
作者: Tianlang Chen,Charilaos Kanatsoulis,Jure Leskovec
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注: 14 pages
点击查看摘要
Abstract:Predictive tasks on relational databases are critical in real-world applications spanning e-commerce, healthcare, and social media. To address these tasks effectively, Relational Deep Learning (RDL) encodes relational data as graphs, enabling Graph Neural Networks (GNNs) to exploit relational structures for improved predictions. However, existing heterogeneous GNNs often overlook the intrinsic structural properties of relational databases, leading to modeling inefficiencies. Here we introduce RelGNN, a novel GNN framework specifically designed to capture the unique characteristics of relational databases. At the core of our approach is the introduction of atomic routes, which are sequences of nodes forming high-order tripartite structures. Building upon these atomic routes, RelGNN designs new composite message passing mechanisms between heterogeneous nodes, allowing direct single-hop interactions between them. This approach avoids redundant aggregations and mitigates information entanglement, ultimately leading to more efficient and accurate predictive modeling. RelGNN is evaluated on 30 diverse real-world tasks from RelBench (Fey et al., 2024), and consistently achieves state-of-the-art accuracy with up to 25% improvement.
[AI-2] Towards Internet-Scale Training For Agents
链接: https://arxiv.org/abs/2502.06776
作者: Brandon Trabucco,Gunnar Sigurdsson,Robinson Piramuthu,Ruslan Salakhutdinov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The predominant approach for training web navigation agents gathers human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data are an inefficient resource. We develop a pipeline to facilitate Internet-scale training for agents without laborious human annotations. In the first stage, an LLM generates tasks for 150k diverse websites. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM reviews the trajectories and judges their success. Language models are competitive with human annotators, detecting and filtering out harmful content with an accuracy of 97%, generating feasible tasks with an 89% rate, and judging successful trajectories with an 82.6% accuracy. Scaling the pipeline, agents based on Llama 3.1 70B solve 16.7% of tasks for 150k sites. Training on the data generated by our pipeline is competitive with training on human demonstrations. In data-limited settings derived from Mind2Web and WebLINX, we improve Step Accuracy by up to +89.5% and +122.1% respectively for agents trained on mixtures of data from our pipeline, and human data. When training agents with all available human data from these benchmarks, agents fail to generalize to diverse real sites, and adding our data improves their generalization by +149.0% for WebLINX and +156.3% for Mind2Web. Code will be available at: this http URL.
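The three-stage pipeline maps naturally onto three LLM calls. Below is a hedged sketch of the control flow only; `llm` and `run_agent` are hypothetical stand-ins for a chat-completion call and a browser-automation agent, and the prompts are illustrative.

```python
# `llm` and `run_agent` are hypothetical stand-ins for a chat-completion
# call and a browser-automation agent; prompts are illustrative only.
def build_training_data(websites, llm, run_agent):
    data = []
    for site in websites:
        # Stage 1: an LLM proposes a feasible task for the site.
        task = llm(f"Propose one realistic user task for {site}.")
        # Stage 2: an LLM agent attempts the task, producing a trajectory.
        trajectory = run_agent(site, task)
        # Stage 3: an LLM judge keeps only successful, harmless episodes.
        verdict = llm(f"Task: {task}\nTrajectory: {trajectory}\n"
                      "Did the agent succeed and behave safely? Answer yes/no.")
        if verdict.strip().lower().startswith("yes"):
            data.append((task, trajectory))
    return data
```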
[AI-3] What makes a good feedforward computational graph?
链接: https://arxiv.org/abs/2502.06751
作者: Alex Vitvitskyi,João G. M. Araújo,Marc Lackenby,Petar Veličković
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注: Work in progress – comments welcome. 16 pages, 7 figures
点击查看摘要
Abstract:As implied by the plethora of literature on graph rewiring, the choice of computational graph employed by a neural network can make a significant impact on its downstream performance. Certain effects related to the computational graph, such as under-reaching and over-squashing, may even render the model incapable of learning certain functions. Most of these effects have only been thoroughly studied in the domain of undirected graphs; however, recent years have seen a significant rise in interest in feedforward computational graphs: directed graphs without any back edges. In this paper, we study the desirable properties of a feedforward computational graph, discovering two important complementary measures: fidelity and mixing time, and evaluating a few popular choices of graphs through the lens of these measures. Our study is backed by both theoretical analyses of the metrics’ asymptotic behaviour for various graphs, as well as correlating these metrics to the performance of trained neural network models using the corresponding graphs.
[AI-4] Gradient Multi-Normalization for Stateless and Scalable LLM Training
链接: https://arxiv.org/abs/2502.06742
作者: Meyer Scetbon,Chao Ma,Wenbo Gong,Edward Meeds
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015), which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024), address this by eliminating the need for optimizer states while achieving performance comparable to Adam via a multi-step preprocessing procedure applied to instantaneous gradients. Motivated by the success of SWAN, we introduce a novel framework for designing stateless optimizers that normalizes stochastic gradients according to multiple norms. To achieve this, we propose a simple alternating scheme to enforce the normalization of gradients w.r.t. these norms. We show that our procedure can produce, up to an arbitrary precision, a fixed-point of the problem, and that SWAN is a particular instance of our approach with carefully chosen norms, providing a deeper understanding of its design. However, SWAN's computationally expensive whitening/orthogonalization step limits its practicality for large LMs. Using our principled perspective, we develop a more efficient, scalable, and practical stateless optimizer. Our algorithm relaxes the properties of SWAN, significantly reducing its computational cost while retaining its memory efficiency, making it applicable to training large-scale models. Experiments on pre-training LLaMA models with up to 1 billion parameters demonstrate a 3X speedup over Adam with significantly reduced memory requirements, outperforming other memory-efficient baselines.
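As a toy illustration of the alternating multi-normalization scheme (the paper's actual norm choices differ), one can repeatedly rescale a gradient matrix so that it simultaneously approaches unit row norms and unit column norms:

```python
import numpy as np

def multi_normalize(grad: np.ndarray, n_iters: int = 5, eps: float = 1e-8):
    """Alternately enforce (approximately) unit row norms and unit column
    norms on a gradient matrix, a toy instance of normalizing w.r.t. two
    norms at once via an alternating scheme."""
    g = grad.copy()
    for _ in range(n_iters):
        g /= np.linalg.norm(g, axis=1, keepdims=True) + eps  # row normalize
        g /= np.linalg.norm(g, axis=0, keepdims=True) + eps  # column normalize
    return g
```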
[AI-5] Low-power Spike-based Wearable Analytics on RRAM Crossbars ISCAS
链接: https://arxiv.org/abs/2502.06736
作者: Abhiroop Bhattacharjee,Jinquan Shi,Wei-Chen Chen,Xinxin Wang,Priyadarshini Panda
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: Accepted in 2025 IEEE International Symposium on Circuits and Systems (ISCAS)
点击查看摘要
Abstract:This work introduces a spike-based wearable analytics system utilizing Spiking Neural Networks (SNNs) deployed on an In-memory Computing engine based on RRAM crossbars, which are known for their compactness and energy-efficiency. Given the hardware constraints and noise characteristics of the underlying RRAM crossbars, we propose online adaptation of pre-trained SNNs in real-time using Direct Feedback Alignment (DFA) instead of traditional backpropagation (BP). DFA learning, which allows layer-parallel gradient computations, acts as a fast, energy- and area-efficient method for online adaptation of SNNs on RRAM crossbars, yielding better algorithmic performance than BP-based adaptation. Through extensive simulations using our in-house hardware evaluation engine called DFA_Sim, we find that DFA achieves up to 64.1% lower energy consumption, 10.1% lower area overhead, and a 2.1x reduction in latency compared to BP, while delivering up to 7.55% higher inference accuracy on human activity recognition (HAR) tasks.
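For readers unfamiliar with DFA, the sketch below trains a tiny non-spiking MLP by projecting the output error to each hidden layer through fixed random feedback matrices, which is what makes layer-parallel updates possible. The architecture and learning rate are illustrative, and real SNN-on-RRAM deployment adds spiking dynamics and device noise not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, lr = 8, 16, 4, 0.05
W1 = rng.normal(0, 0.1, (d_h, d_in))
W2 = rng.normal(0, 0.1, (d_h, d_h))
W3 = rng.normal(0, 0.1, (d_out, d_h))
B1 = rng.normal(0, 0.1, (d_h, d_out))  # fixed random feedback, never trained
B2 = rng.normal(0, 0.1, (d_h, d_out))

x, y = rng.normal(size=(d_in, 1)), rng.normal(size=(d_out, 1))
for _ in range(200):
    h1 = np.tanh(W1 @ x)
    h2 = np.tanh(W2 @ h1)
    e = (W3 @ h2) - y                 # output error (squared-error loss)
    # DFA: the error reaches each layer through B1/B2, so the three weight
    # updates are independent and can run in parallel (unlike BP).
    d2 = (B2 @ e) * (1 - h2 ** 2)
    d1 = (B1 @ e) * (1 - h1 ** 2)
    W3 -= lr * e @ h2.T
    W2 -= lr * d2 @ h1.T
    W1 -= lr * d1 @ x.T
```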
[AI-6] Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining ICLR2025
链接: https://arxiv.org/abs/2502.06733
作者: Daouda Sow,Herbert Woisetschläger,Saikiran Bulusu,Shiqiang Wang,Hans-Arno Jacobsen,Yingbin Liang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at ICLR 2025. Code base available: this https URL
点击查看摘要
Abstract:Pretraining large language models (LLMs) on vast and heterogeneous datasets is crucial for achieving state-of-the-art performance across diverse downstream tasks. However, current training paradigms treat all samples equally, overlooking the importance or relevance of individual samples throughout the training process. Existing reweighting strategies, which primarily focus on group-level data importance, fail to leverage fine-grained instance-level information and do not adapt dynamically to individual sample importance as training progresses. In this paper, we introduce novel algorithms for dynamic, instance-level data reweighting aimed at improving both the efficiency and effectiveness of LLM pretraining. Our methods adjust the weight of each training sample based on its loss value in an online fashion, allowing the model to dynamically focus on more informative or important samples at the current training stage. In particular, our framework allows us to systematically devise reweighting strategies deprioritizing redundant or uninformative data, which we find tend to work best. Furthermore, we develop a new theoretical framework for analyzing the impact of loss-based reweighting on the convergence of gradient-based optimization, providing the first formal characterization of how these strategies affect convergence bounds. We empirically validate our approach across a spectrum of tasks, from pretraining 7B and 1.4B parameter LLMs to smaller-scale language models and linear regression problems, demonstrating that our loss-based reweighting approach can lead to faster convergence and significantly improved performance.
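A minimal sketch of one loss-based reweighting step is shown below, assuming a PyTorch classifier; the softmax-over-losses weighting and temperature are illustrative stand-ins for the paper's schemes.

```python
import torch
import torch.nn.functional as F

def reweighted_step(model, batch, optimizer, temperature=1.0):
    """One pretraining step with dynamic instance-level reweighting: compute
    per-sample losses, turn them into weights online, and optimize the
    weighted sum so the model focuses on more informative samples."""
    x, y = batch
    per_sample = F.cross_entropy(model(x), y, reduction="none")
    # Softmax over detached losses down-weights low-loss (likely redundant)
    # samples; the paper studies several such schemes.
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)
    loss = (weights * per_sample).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```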
[AI-7] FlexDeMo: Decoupled Momentum Optimization for Fully and Hybrid Sharded Training
链接: https://arxiv.org/abs/2502.06728
作者: Mogens Henrik From,Jacob Nielsen,Lukas Galke,Peter Schneider-Kamp
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast-moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, when considering larger models that do not fit on a single accelerator, the exchange of gradient information and the integration of DeMo need to be reconsidered. Here, we propose employing a hybrid strategy, FlexDeMo, whereby nodes fully synchronize locally between different GPUs and inter-node communication is improved through only using the fast-moving components. This effectively combines previous hybrid sharding strategies with the advantages of decoupled momentum. Our experimental results show that FlexDeMo is on par with AdamW in terms of validation loss, demonstrating its viability.
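The hybrid exchange can be sketched as follows, assuming already-initialized intra-node and inter-node process groups; top-k magnitude selection stands in for DeMo's actual fast-component extraction, which this sketch does not reproduce.

```python
import torch
import torch.distributed as dist

def hybrid_exchange(grad, intra_group, inter_group, k_frac=0.1):
    """Hybrid gradient exchange: full all-reduce across the GPUs of one node,
    then only the largest-magnitude ('fast-moving') entries across nodes.
    Assumes process groups are already initialized; top-k selection is a
    stand-in for DeMo's actual component extraction."""
    dist.all_reduce(grad, group=intra_group)       # full sync within the node
    flat = grad.flatten()
    k = max(1, int(k_frac * flat.numel()))
    idx = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    dist.all_reduce(sparse, group=inter_group)     # cheap sync across nodes
    return sparse.view_as(grad)
```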
[AI-8] Application of Artificial Intelligence (AI) in Civil Engineering
链接: https://arxiv.org/abs/2502.06727
作者: Temitope Funmilayo Awolusi,Bernard Chukwuemeka Finbarrs-Ezema,Isaac Munachimdinamma Chukwudulue,Marc Azab
类目: Artificial Intelligence (cs.AI)
*备注: Kindly cite published version if given access
点击查看摘要
Abstract:Hard computing generally deals with precise data, which provides ideal solutions to problems. However, in the civil engineering field, amongst other disciplines, that is not always the case, as real-world systems are continuously changing. Here lies the need to explore soft computing methods and artificial intelligence to solve civil engineering shortcomings. The integration of advanced computational models, including Artificial Neural Networks (ANNs), Fuzzy Logic, Genetic Algorithms (GAs), and Probabilistic Reasoning, has revolutionized the domain of civil engineering. These models have significantly advanced diverse sub-fields, such as slope stability analysis, bearing capacity, water quality and treatment, transportation systems, air quality, and structural materials, by offering innovative solutions and improved analysis capabilities. ANNs predict non-linearities and provide accurate estimates. Fuzzy logic uses an efficient decision-making process to provide a more precise assessment of systems. Lastly, while GAs optimize models (based on evolutionary processes) for better outcomes, probabilistic reasoning lowers their statistical uncertainties.
[AI-9] Recent Advances Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium
链接: https://arxiv.org/abs/2502.06693
作者: Amin Adibi,Xu Cao,Zongliang Ji,Jivat Neet Kaur,Winston Chen,Elizabeth Healey,Brighton Nuwagira,Wenqian Ye,Geoffrey Woollard,Maxwell A Xu,Hejie Cui,Johnny Xi,Trenton Chang,Vasiliki Bikia,Nicole Zhang,Ayush Noori,Yuan Xia,Md. Belal Hossain,Hanna A. Frank,Alina Peluso,Yuan Pu,Shannon Zejiang Shen,John Wu,Adibvafa Fallahpour,Sazan Mahbub,Ross Duncan,Yuwei Zhang,Yurui Cao,Zuheng Xu,Michael Craig,Rahul G. Krishnan,Rahmatollah Beheshti,James M. Rehg,Mohammad Ehsanul Karim,Megan Coffee,Leo Anthony Celi,Jason Alan Fries,Mohsen Sadatsafavi,Dennis Shung,Shannon McWeeney,Jessica Dafflon,Sarah Jabbour
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:The fourth Machine Learning for Health (ML4H) symposium was held in person on December 15th and 16th, 2024, in the traditional, ancestral, and unceded territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, British Columbia, Canada. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community. The organization of the research roundtables at the conference involved 13 senior and 27 junior chairs across 13 tables. Each roundtable session included an invited senior chair (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with an interest in the session’s topic.
[AI-10] EquiTabPFN: A Target-Permutation Equivariant Prior Fitted Networks
链接: https://arxiv.org/abs/2502.06684
作者: Michael Arbel,David Salinas,Frank Hutter
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent foundational models for tabular data, such as TabPFN, have demonstrated remarkable effectiveness in adapting to new tasks through in-context learning. However, these models overlook a crucial equivariance property: the arbitrary ordering of target dimensions should not influence model predictions. In this study, we identify this oversight as a source of incompressible error, termed the equivariance gap, which introduces instability in predictions. To mitigate these issues, we propose a novel model designed to preserve equivariance across output dimensions. Our experimental results indicate that our proposed model not only addresses these pitfalls effectively but also achieves competitive benchmark performance.
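The equivariance gap the authors describe can be probed empirically: permute the target dimensions in the context, un-permute the predictions, and measure how far the output moves. The `model(x_ctx, y_ctx, x_qry)` in-context interface below is an assumption, not the paper's API.

```python
import torch

def equivariance_gap(model, x_ctx, y_ctx, x_qry, n_perms=10):
    """Empirically probe the equivariance gap: permute the target dimensions
    of the context labels, invert the permutation on the predictions, and
    measure how much the output moves (zero for an equivariant model)."""
    base = model(x_ctx, y_ctx, x_qry)
    gaps = []
    for _ in range(n_perms):
        perm = torch.randperm(y_ctx.shape[-1])
        inv = torch.argsort(perm)
        pred = model(x_ctx, y_ctx[..., perm], x_qry)[..., inv]
        gaps.append((pred - base).abs().mean().item())
    return sum(gaps) / len(gaps)
```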
[AI-11] Evaluation of Deep Audio Representations for Hearables ICASSP2025
链接: https://arxiv.org/abs/2502.06664
作者: Fabian Gröger,Pascal Baumann,Ludovic Amruthalingam,Laurent Simon,Ruksana Giurda,Simone Lionetti
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
点击查看摘要
Abstract:Effectively steering hearable devices requires understanding the acoustic environment around the user. In the computational analysis of sound scenes, foundation models have emerged as the state of the art to produce high-performance, robust, multi-purpose audio representations. We introduce and release Deep Evaluation of Audio Representations (DEAR), the first dataset and benchmark to evaluate the efficacy of foundation models in capturing essential acoustic properties for hearables. The dataset includes 1,158 audio tracks, each 30 seconds long, created by spatially mixing proprietary monologues with commercial, high-quality recordings of everyday acoustic scenes. Our benchmark encompasses eight tasks that assess the general context, speech sources, and technical acoustic properties of the audio scenes. Through our evaluation of four general-purpose audio representation models, we demonstrate that the BEATs model significantly surpasses its counterparts. This superiority underscores the advantage of models trained on diverse audio collections, confirming their applicability to a wide array of auditory tasks, including encoding the environment properties necessary for hearable steering. The DEAR dataset and associated code are available at this https URL.
[AI-12] A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management
链接: https://arxiv.org/abs/2502.06656
作者: Simeon Campos,Henry Papadatos,Fabien Roger,Chloé Touzet,Malcolm Murray,Otter Quarks
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The recent development of powerful AI systems has highlighted the need for robust risk management frameworks in the AI industry. Although companies have begun to implement safety frameworks, current approaches often lack the systematic rigor found in other high-risk industries. This paper presents a comprehensive risk management framework for the development of frontier AI that bridges this gap by integrating established risk management principles with emerging AI-specific practices. The framework consists of four key components: (1) risk identification (through literature review, open-ended red-teaming, and risk modeling), (2) risk analysis and evaluation using quantitative metrics and clearly defined thresholds, (3) risk treatment through mitigation measures such as containment, deployment controls, and assurance processes, and (4) risk governance establishing clear organizational structures and accountability. Drawing from best practices in mature industries such as aviation or nuclear power, while accounting for AI’s unique challenges, this framework provides AI developers with actionable guidelines for implementing robust risk management. The paper details how each component should be implemented throughout the life-cycle of the AI system - from planning through deployment - and emphasizes the importance and feasibility of conducting risk management work prior to the final training run to minimize the burden associated with it.
[AI-13] Unbiased Evaluation of Large Language Models from a Causal Perspective
链接: https://arxiv.org/abs/2502.06655
作者: Meilin Chen,Jian Tian,Liang Ma,Di Xie,Weijie Chen,Jiang Zhu
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Benchmark contamination has become a significant concern in the LLM evaluation community. Previous Agents-as-an-Evaluator methods address this issue by involving agents in the generation of questions. Despite their success, the biases in Agents-as-an-Evaluator methods remain largely unexplored. In this paper, we present a theoretical formulation of evaluation bias, providing valuable insights into designing unbiased evaluation protocols. Furthermore, we identify two types of bias in Agents-as-an-Evaluator through carefully designed probing tasks on a minimal Agents-as-an-Evaluator setup. To address these issues, we propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs. Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results.
[AI-14] Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language
链接: https://arxiv.org/abs/2502.06634
作者: Zhiqiang Zhong,Simon Sataa-Yu Larsen,Haoyu Guo,Tao Tang,Kuangyu Zhou,Davide Mottin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA^3, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA^3 by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on text-based de novo molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA^3 leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA^3 in notable applications across image, text, and graph tasks, affirming its versatility and utility.
[AI-15] Combining Large Language Models with Static Analyzers for Code Review Generation
链接: https://arxiv.org/abs/2502.06633
作者: Imen Jaoua,Oussama Ben Sghaier,Houari Sahraoui
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Code review is a crucial but often complex, subjective, and time-consuming activity in software development. Over the past decades, significant efforts have been made to automate this process. Early approaches focused on knowledge-based systems (KBS) that apply rule-based mechanisms to detect code issues, providing precise feedback but struggling with complex, context-dependent cases. More recent work has shifted toward fine-tuning pre-trained language models for code review, enabling broader issue coverage but often at the expense of precision. In this paper, we propose a hybrid approach that combines the strengths of KBS and learning-based systems (LBS) to generate high-quality, comprehensive code reviews. Our method integrates knowledge at three distinct stages of the language model pipeline: during data preparation (Data-Augmented Training, DAT), at inference (Retrieval-Augmented Generation, RAG), and after inference (Naive Concatenation of Outputs, NCO). We empirically evaluate our combination strategies against standalone KBS and LBS fine-tuned on a real-world dataset. Our results show that these hybrid strategies enhance the relevance, completeness, and overall quality of review comments, effectively bridging the gap between rule-based tools and deep learning models.
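Of the three integration points, the two inference-side ones are easy to sketch. `llm`, `static_analyze`, and `retrieve_rules` below are hypothetical helpers, not the paper's API:

```python
def review_with_nco(diff, llm, static_analyze):
    """Naive Concatenation of Outputs (NCO): run the rule-based analyzer and
    the language model independently, then merge their comments."""
    kbs_findings = static_analyze(diff)           # precise, rule-based issues
    lbs_review = llm(f"Review this code change:\n{diff}")
    return list(kbs_findings) + [lbs_review]

def review_with_rag(diff, llm, retrieve_rules):
    """Retrieval-Augmented Generation (RAG): inject relevant analyzer rules
    into the prompt at inference time."""
    rules = retrieve_rules(diff)
    return llm(f"Known pitfalls:\n{rules}\n\nReview this code change:\n{diff}")
```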
[AI-16] Amortized In-Context Bayesian Posterior Estimation
链接: https://arxiv.org/abs/2502.06601
作者: Sarthak Mittal,Niels Leif Bracher,Guillaume Lajoie,Priyank Jaini,Marcus Brubaker
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Bayesian inference provides a natural way of incorporating prior beliefs and assigning a probability measure to the space of hypotheses. Current solutions rely on iterative routines like Markov Chain Monte Carlo (MCMC) sampling and Variational Inference (VI), which need to be re-run whenever new observations are available. Amortization, through conditional estimation, is a viable strategy to alleviate such difficulties and has been the guiding principle behind simulation-based inference, neural processes and in-context methods using pre-trained models. In this work, we conduct a thorough comparative analysis of amortized in-context Bayesian posterior estimation methods from the lens of different optimization objectives and architectural choices. Such methods train an amortized estimator to perform posterior parameter inference by conditioning on a set of data examples passed as context to a sequence model such as a transformer. In contrast to language models, we leverage permutation invariant architectures as the true posterior is invariant to the ordering of context examples. Our empirical study includes generalization to out-of-distribution tasks, cases where the assumed underlying model is misspecified, and transfer from simulated to real problems. Subsequently, it highlights the superiority of the reverse KL estimator for predictive problems, especially when combined with the transformer architecture and normalizing flows.
[AI-17] The Minimal Search Space for Conditional Causal Bandits ICML2025
链接: https://arxiv.org/abs/2502.06577
作者: Francisco N. F. Q. Simoes,Itai Feigenbaum,Mehdi Dastani,Thijs van Ommen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Submitted to ICML2025
点击查看摘要
Abstract:Causal knowledge can be used to support decision-making problems. This has been recognized in the causal bandits literature, where a causal (multi-armed) bandit is characterized by a causal graphical model and a target variable. The arms are then interventions on the causal model, and rewards are samples of the target variable. Causal bandits were originally studied with a focus on hard interventions. We focus instead on cases where the arms are conditional interventions, which more accurately model many real-world decision-making problems by allowing the value of the intervened variable to be chosen based on the observed values of other variables. This paper presents a graphical characterization of the minimal set of nodes guaranteed to contain the optimal conditional intervention, which maximizes the expected reward. We then propose an efficient algorithm with a time complexity of O(|V| + |E|) to identify this minimal set of nodes. We prove that the graphical characterization and the proposed algorithm are correct. Finally, we empirically demonstrate that our algorithm significantly prunes the search space and substantially accelerates convergence rates when integrated into standard multi-armed bandit algorithms.
[AI-18] Predictive Red Teaming: Breaking Policies Without Breaking Robots
链接: https://arxiv.org/abs/2502.06575
作者: Anirudha Majumdar,Mohit Sharma,Dmitry Kalashnikov,Sumeet Singh,Pierre Sermanet,Vikas Sindhwani
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Visuomotor policies trained via imitation learning are capable of performing challenging manipulation tasks, but are often extremely brittle to lighting, visual distractors, and object locations. These vulnerabilities can depend unpredictably on the specifics of training, and are challenging to expose without time-consuming and expensive hardware evaluations. We propose the problem of predictive red teaming: discovering vulnerabilities of a policy with respect to environmental factors, and predicting the corresponding performance degradation without hardware evaluations in off-nominal scenarios. In order to achieve this, we develop RoboART: an automated red teaming (ART) pipeline that (1) modifies nominal observations using generative image editing to vary different environmental factors, and (2) predicts performance under each variation using a policy-specific anomaly detector executed on edited observations. Experiments across 500+ hardware trials in twelve off-nominal conditions for visuomotor diffusion policies demonstrate that RoboART predicts performance degradation with high accuracy (less than 0.19 average difference between predicted and real success rates). We also demonstrate how predictive red teaming enables targeted data collection: fine-tuning with data collected under conditions predicted to be adverse boosts baseline performance by 2-7x.
[AI-19] On the Impact of the Utility in Semivalue-based Data Valuation
链接: https://arxiv.org/abs/2502.06574
作者: Mélissa Tamine,Benjamin Heymann,Patrick Loiseau,Maxime Vono
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 34 pages, 21 figures
点击查看摘要
Abstract:Semivalue-based data valuation in machine learning (ML) quantifies the contribution of individual data points to a downstream ML task by leveraging principles from cooperative game theory and the notion of utility. While this framework has been used in practice for assessing data quality, our experiments reveal inconsistent valuation outcomes across different utilities, albeit all related to ML performance. Beyond raising concerns about the reliability of data valuation, this inconsistency is challenging to interpret, as it stems from the complex interaction of the utility with data points and semivalue weights, which has barely been studied in prior work. In this paper, we take a first step toward clarifying the utility impact on semivalue-based data valuation. Specifically, we provide geometric interpretations of this impact for a broad family of classification utilities, which includes the accuracy and the arithmetic mean. We introduce the notion of spatial signatures: given a semivalue, data points can be embedded into a two-dimensional space, and utility functions map to the dual of this space. This geometric perspective separates the influence of the dataset and semivalue from that of the utility, providing a theoretical explanation for the experimentally observed sensitivity of valuation outcomes to the utility choice.
[AI-20] Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
链接: https://arxiv.org/abs/2502.06559
作者: Maria Eriksson,Erasmo Purificato,Arman Noroozian,Joao Vinagre,Guillaume Chaslot,Emilia Gomez,David Fernandez-Llorca
类目: Artificial Intelligence (cs.AI)
*备注: Submitted to ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2025
点击查看摘要
Abstract:Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too do concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results. Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By providing an overview of risks associated with existing benchmarking procedures, we problematise disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.
[AI-21] Tighter Value-Function Approximations for POMDPs AAMAS2025
链接: https://arxiv.org/abs/2502.06523
作者: Merlijn Krale,Wietze Koops,Sebastian Junges,Thiago D. Simão,Nils Jansen
类目: Artificial Intelligence (cs.AI)
*备注: AAMAS 2025 submission
点击查看摘要
Abstract:Solving partially observable Markov decision processes (POMDPs) typically requires reasoning about the values of exponentially many state beliefs. Towards practical performance, state-of-the-art solvers use value bounds to guide this reasoning. However, sound upper value bounds are often computationally expensive to compute, and there is a tradeoff between the tightness of such bounds and their computational cost. This paper introduces new and provably tighter upper value bounds than the commonly used fast informed bound. Our empirical evaluation shows that, despite their additional computational overhead, the new upper bounds accelerate state-of-the-art POMDP solvers on a wide range of benchmarks.
[AI-22] Model-Based Offline Reinforcement Learning with Reliability-Guaranteed Sequence Modeling
链接: https://arxiv.org/abs/2502.06491
作者: Shenghong He
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Model-based offline reinforcement learning (MORL) aims to learn a policy by exploiting a dynamics model derived from an existing dataset. Applying conservative quantification to the dynamics model, most existing works on MORL generate trajectories that approximate the real data distribution to facilitate policy learning by using current information (e.g., the state and action at time step t). However, these works neglect the impact of historical information on environmental dynamics, leading to the generation of unreliable trajectories that may not align with the real data distribution. In this paper, we propose a new MORL algorithm, Reliability-guaranteed Transformer (RT), which can eliminate unreliable trajectories by calculating the cumulative reliability of the generated trajectory (i.e., using a weighted variational distance away from the real data). Moreover, by sampling candidate actions with high rewards, RT can efficiently generate high-return trajectories from the existing offline data. We theoretically prove the performance guarantees of RT in policy learning, and empirically demonstrate its effectiveness against state-of-the-art model-based methods on several benchmark tasks.
[AI-23] SIGMA: Sheaf-Informed Geometric Multi-Agent Pathfinding ICRA
链接: https://arxiv.org/abs/2502.06440
作者: Shuhao Liao,Weihang Xia,Yuhong Cao,Weiheng Dai,Chengyang He,Wenjun Wu,Guillaume Sartoretti
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Accepted for presentation at the 2025 IEEE International Conference on Robotics and Automation (ICRA)
点击查看摘要
Abstract:The Multi-Agent Path Finding (MAPF) problem aims to determine the shortest and collision-free paths for multiple agents in a known, potentially obstacle-ridden environment. It is the core challenge for robotic deployments in large-scale logistics and transportation. Decentralized learning-based approaches have shown great potential for addressing the MAPF problem, offering more reactive and scalable solutions. However, existing learning-based MAPF methods usually rely on agents making decisions based on a limited field of view (FOV), resulting in short-sighted policies and inefficient cooperation in complex scenarios. Here, a critical challenge is to achieve consensus on potential movements between agents based on limited observations and communications. To tackle this challenge, we introduce a new framework that applies sheaf theory to decentralized deep reinforcement learning, enabling agents to learn geometric cross-dependencies between each other through local consensus and utilize them for tightly cooperative decision-making. In particular, sheaf theory provides a mathematical proof of conditions for achieving global consensus through local observation. Inspired by this, we incorporate a neural network to approximately model the consensus in latent space based on sheaf theory and train it through self-supervised learning. During the task, in addition to normal features for MAPF as in previous works, each agent distributedly reasons about a learned consensus feature, leading to efficient cooperation on pathfinding and collision avoidance. As a result, our proposed method demonstrates significant improvements over state-of-the-art learning-based MAPF planners, especially in relatively large and complex scenarios, and shows its superiority over baselines in various simulations and real-world robot experiments.
[AI-24] sting software for non-discrimination: an updated and extended audit in the Italian car insurance domain
链接: https://arxiv.org/abs/2502.06439
作者: Marco Rondina,Antonio Vetrò,Riccardo Coppola,Oumaima Regragrui,Alessandro Fabris,Gianmaria Silvello,Gian Antonio Susto,Juan Carlos De Martin
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 14 pages, 1 figure
点击查看摘要
Abstract:Context. As software systems become more integrated into society's infrastructure, the responsibility of software professionals to ensure compliance with various non-functional requirements increases. These requirements include security, safety, privacy, and, increasingly, non-discrimination. Motivation. Fairness in pricing algorithms grants equitable access to basic services without discriminating on the basis of protected attributes. Method. We replicate a previous empirical study that used black box testing to audit pricing algorithms used by Italian car insurance companies, accessible through a popular online system. With respect to the previous study, we enlarged the number of tests and the number of demographic variables under analysis. Results. Our work confirms and extends previous findings, highlighting the problematic permanence of discrimination across time: demographic variables significantly impact pricing to this day, with birthplace remaining the main discriminatory factor against individuals not born in Italian cities. We also found that driver profiles can determine the number of quotes available to the user, denying equal opportunities to all. Conclusion. The study underscores the importance of testing for non-discrimination in software systems that affect people's everyday lives. Performing algorithmic audits over time makes it possible to evaluate the evolution of such algorithms. It also demonstrates the role that empirical software engineering can play in making software systems more accountable.
[AI-25] FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model
链接: https://arxiv.org/abs/2502.06438
作者: Anna Tegon,Thorir Mar Ingolfsson,Xiaying Wang,Luca Benini,Yawei Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages, 3 figures, 5 tables, pre-print
点击查看摘要
Abstract:Accurate and efficient electroencephalography (EEG) analysis is essential for detecting seizures and artifacts in long-term monitoring, with applications spanning hospital diagnostics to wearable health devices. Robust EEG analytics have the potential to greatly improve patient care. However, traditional deep learning models, especially Transformer-based architectures, are hindered by their quadratic time and memory complexity, making them less suitable for resource-constrained environments. To address these challenges, we present FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel self-supervised framework that establishes new efficiency benchmarks for EEG analysis through bidirectional state-space modeling. Unlike Transformer-based models, which incur quadratic time and memory complexity, FEMBA scales linearly with sequence length, enabling more scalable and efficient processing of extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and fine-tuned on three downstream tasks, FEMBA achieves competitive performance in comparison with transformer models, with significantly lower computational cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates viability for resource-constrained devices. These results pave the way for scalable, general-purpose EEG analytics in both clinical and wearable settings, and highlight FEMBA as a promising candidate for wearable applications.
[AI-26] Generating Privacy-Preserving Personalized Advice with Zero-Knowledge Proofs and LLMs WWW
链接: https://arxiv.org/abs/2502.06425
作者: Hiroki Watanabe,Motonobu Uchikoshi
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Accepted to The ACM Web Conference (WWW) 2025 Short Paper Track
点击查看摘要
Abstract:Large language models (LLMs) are increasingly utilized in domains such as finance, healthcare, and interpersonal relationships to provide advice tailored to user traits and contexts. However, this personalization often relies on sensitive data, raising critical privacy concerns and necessitating data minimization. To address these challenges, we propose a framework that integrates zero-knowledge proof (ZKP) technology, specifically zkVM, with LLM-based chatbots. This integration enables privacy-preserving data sharing by verifying user traits without disclosing sensitive information. Our research introduces both an architecture and a prompting strategy for this approach. Through empirical evaluation, we clarify the current constraints and performance limitations of both zkVM and the proposed prompting strategy, thereby demonstrating their practical feasibility in real-world scenarios.
[AI-27] CS-SHAP: Extending SHAP to Cyclic-Spectral Domain for Better Interpretability of Intelligent Fault Diagnosis
链接: https://arxiv.org/abs/2502.06424
作者: Qian Chen,Xingjian Dong,Kui Hu,Kangkang Chen,Zhike Peng,Guang Meng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 21 figures
点击查看摘要
Abstract:Neural networks (NNs), with their powerful nonlinear mapping and end-to-end capabilities, are widely applied in mechanical intelligent fault diagnosis (IFD). However, as typical black-box models, they pose challenges in understanding their decision basis and logic, limiting their deployment in high-reliability scenarios. Hence, various methods have been proposed to enhance the interpretability of IFD. Among these, post-hoc approaches can provide explanations without changing model architecture, preserving its flexibility and scalability. However, existing post-hoc methods often suffer from limitations in explanation forms. They either require preprocessing that disrupts the end-to-end nature or overlook fault mechanisms, leading to suboptimal explanations. To address these issues, we derived the cyclic-spectral (CS) transform and proposed the CS-SHAP by extending Shapley additive explanations (SHAP) to the CS domain. CS-SHAP can evaluate contributions from both carrier and modulation frequencies, aligning more closely with fault mechanisms and delivering clearer and more accurate explanations. Three datasets are utilized to validate the superior interpretability of CS-SHAP, ensuring its correctness, reproducibility, and practical performance. With open-source code and outstanding interpretability, CS-SHAP has the potential to be widely adopted and become the post-hoc interpretability benchmark in IFD, even in other classification tasks. The code is available on this https URL.
[AI-28] AppVLM: A Lightweight Vision Language Model for Online App Control
链接: https://arxiv.org/abs/2502.06395
作者: Georgios Papoudakis,Thomas Coste,Zhihao Wu,Jianye Hao,Jun Wang,Kun Shao
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device’s interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the AndroidControl dataset, compared to all evaluated baselines, and matches GPT-4o in online task completion success rate in the AndroidWorld environment, while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.
[AI-29] Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo
链接: https://arxiv.org/abs/2502.06379
作者: Filip Ekström Kelvinius,Zheng Zhao,Fredrik Lindsten
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on "decoupled diffusion", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic data and image reconstruction tasks. Further, we demonstrate how the approach can be extended to discrete data.
[AI-30] Hyperparameters in Score-Based Membership Inference Attacks
链接: https://arxiv.org/abs/2502.06374
作者: Gauri Pradhan,Joonas Jälkö,Marlon Tobaben,Antti Honkela
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This work has been accepted for publication in the 3rd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML’25). The final version will be available on IEEE Xplore
点击查看摘要
Abstract:Membership Inference Attacks (MIAs) have emerged as a valuable framework for evaluating privacy leakage by machine learning models. Score-based MIAs are distinguished, in particular, by their ability to exploit the confidence scores that the model generates for particular inputs. Existing score-based MIAs implicitly assume that the adversary has access to the target model’s hyperparameters, which can be used to train the shadow models for the attack. In this work, we demonstrate that the knowledge of target hyperparameters is not a prerequisite for MIA in the transfer learning setting. Based on this, we propose a novel approach to select the hyperparameters for training the shadow models for MIA when the attacker has no prior knowledge about them by matching the output distributions of target and shadow models. We demonstrate that using the new approach yields hyperparameters that lead to an attack near indistinguishable in performance from an attack that uses target hyperparameters to train the shadow models. Furthermore, we study the empirical privacy risk of unaccounted use of training data for hyperparameter optimization (HPO) in differentially private (DP) transfer learning. We find no statistically significant evidence that performing HPO using training data would increase vulnerability to MIA.
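The hyperparameter-selection idea reduces to distribution matching over confidence scores. Below is a sketch under the assumption that Wasserstein distance is an acceptable matching criterion (the paper may use a different measure):

```python
from scipy.stats import wasserstein_distance

def select_shadow_hparams(target_scores, candidate_runs):
    """Pick shadow-model hyperparameters without knowing the target's: train
    shadow candidates under different hyperparameters and keep the one whose
    confidence-score distribution best matches the target model's.
    `candidate_runs` maps a hyperparameter setting to that shadow's scores."""
    return min(candidate_runs,
               key=lambda hp: wasserstein_distance(target_scores,
                                                   candidate_runs[hp]))
```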
[AI-31] AiRacleX: Automated Detection of Price Oracle Manipulations via LLM-Driven Knowledge Mining and Prompt Generation
链接: https://arxiv.org/abs/2502.06348
作者: Bo Gao,Yuan Wang,Qingsong Wei,Yong Liu,Rick Siow Mong Goh
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Decentralized finance applications depend on accurate price oracles to ensure secure transactions, yet these oracles are highly vulnerable to manipulation, enabling attackers to exploit smart contract vulnerabilities for unfair asset valuation and financial gain. Detecting such manipulations traditionally relies on the manual effort of experienced experts, presenting significant challenges. In this paper, we propose a novel LLM-driven framework that automates the detection of price oracle manipulations by leveraging the complementary strengths of different LLM models. Our approach begins with domain-specific knowledge extraction, where an LLM model synthesizes precise insights about price oracle vulnerabilities from top-tier academic papers, eliminating the need for profound expertise from developers or auditors. This knowledge forms the foundation for a second LLM model to generate structured, context-aware chain-of-thought prompts, which guide a third LLM model in accurately identifying manipulation patterns in smart contracts. We validate the framework's effectiveness through experiments on 60 known vulnerabilities from 46 real-world DeFi attacks or projects spanning 2021 to 2023. The best-performing combination of LLMs (Haiku-Haiku-4o-mini) identified by AiRacleX demonstrates a 2.58-times improvement in recall (0.667 vs 0.259) compared to the state-of-the-art tool GPTScan, while maintaining comparable precision. Furthermore, our framework demonstrates the feasibility of replacing commercial models with open-source alternatives, enhancing privacy and security for developers.
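The three-model chain can be sketched in a few lines; `llm_a`, `llm_b`, and `llm_c` are hypothetical wrappers around the knowledge-extraction, prompt-generation, and detection models, and the prompts are illustrative:

```python
# `llm_a`, `llm_b`, `llm_c` are hypothetical chat-completion wrappers for the
# three (possibly different) models in the chain.
def detect_oracle_manipulation(contract_source, papers, llm_a, llm_b, llm_c):
    # Stage 1: distill domain knowledge about price-oracle vulnerabilities.
    knowledge = llm_a("Summarize known price-oracle manipulation patterns "
                      "from these papers:\n" + "\n".join(papers))
    # Stage 2: turn the knowledge into a structured chain-of-thought prompt.
    cot_prompt = llm_b("Write step-by-step audit instructions based on:\n"
                       + knowledge)
    # Stage 3: apply the generated prompt to the contract under audit.
    return llm_c(cot_prompt + "\n\nContract under audit:\n" + contract_source)
```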
[AI-32] Prompt-Driven Continual Graph Learning
链接: https://arxiv.org/abs/2502.06327
作者: Qi Wang,Tianfei Zhou,Ye Yuan,Rui Mao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 7figures
点击查看摘要
Abstract:Continual Graph Learning (CGL), which aims to accommodate new tasks over evolving graph data without forgetting prior knowledge, is garnering significant research interest. Mainstream solutions adopt the memory replay-based idea, i.e., caching representative data from earlier tasks for retraining the graph model. However, this strategy struggles with scalability issues for constantly evolving graphs and raises concerns regarding data privacy. Inspired by recent advancements in the prompt-based learning paradigm, this paper introduces a novel prompt-driven continual graph learning (PROMPTCGL) framework, which learns a separate prompt for each incoming task and keeps the underlying graph neural network model fixed. In this way, PROMPTCGL naturally avoids catastrophic forgetting of knowledge from previous tasks. More specifically, we propose hierarchical prompting to instruct the model at both the feature and topology levels to fully address the variability of task graphs in dynamic continual learning. Additionally, we develop a personalized prompt generator to generate tailored prompts for each graph node while minimizing the number of prompts needed, leading to constant memory consumption regardless of the graph scale. Extensive experiments on four benchmarks show that PROMPTCGL achieves superior performance against existing CGL approaches while significantly reducing memory consumption. Our code is available at this https URL.
[AI-33] End-to-End Multi-Microphone Speaker Extraction Using Relative Transfer Functions
链接: https://arxiv.org/abs/2502.06285
作者: Aviad Eisenberg,Sharon Gannot,Shlomo E. Chazan
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper introduces a multi-microphone method for extracting a desired speaker from a mixture involving multiple speakers and directional noise in a reverberant environment. In this work, we propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded in the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with direction of arrival (DOA)-based spatial cue and the conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that using spatial cues yields better performance than the spectral-based cue and that the instantaneous RTF outperforms the DOA-based spatial cue.
[AI-34] HODDI: A Dataset of High-Order Drug-Drug Interactions for Computational Pharmacovigilance
链接: https://arxiv.org/abs/2502.06274
作者: Zhaoying Wang,Yingdan Shi,Xiang Liu,Can Chen,Jun Wen,Ren Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
*备注:
点击查看摘要
Abstract:Drug-side effect research is vital for understanding adverse reactions arising in complex multi-drug therapies. However, the scarcity of higher-order datasets that capture the combinatorial effects of multiple drugs severely limits progress in this field. Existing resources such as TWOSIDES primarily focus on pairwise interactions. To fill this critical gap, we introduce HODDI, the first Higher-Order Drug-Drug Interaction Dataset, constructed from U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) records spanning the past decade, to advance computational pharmacovigilance. HODDI contains 109,744 records involving 2,506 unique drugs and 4,569 unique side effects, specifically curated to capture multi-drug interactions and their collective impact on adverse effects. Comprehensive statistical analyses demonstrate HODDI's extensive coverage and robust analytical metrics, making it a valuable resource for studying higher-order drug relationships. Evaluating HODDI with multiple models, we found that a simple Multi-Layer Perceptron (MLP) can outperform graph models, while hypergraph models demonstrate superior performance in capturing complex multi-drug interactions, further validating HODDI's effectiveness. Our findings highlight the inherent value of higher-order information in drug-side effect prediction and position HODDI as a benchmark dataset for advancing research in pharmacovigilance, drug safety, and personalized medicine. The dataset and codes are available at this https URL.
[AI-35] Conditioning and AGM-like belief change in the Desirability-Indifference framework
链接: https://arxiv.org/abs/2502.06235
作者: Kathelijne Coussement,Gert de Cooman,Keano De Vos
类目: Artificial Intelligence (cs.AI); Probability (math.PR); Quantum Physics (quant-ph)
*备注: 11 pages
点击查看摘要
Abstract:We show how the AGM framework for belief change (expansion, revision, contraction) can be extended to deal with conditioning in the so-called Desirability-Indifference framework, based on abstract notions of accepting and rejecting options, as well as on abstract notions of events. This level of abstraction allows us to deal simultaneously with classical and quantum probability theory.
[AI-36] Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering ISSTA2025
链接: https://arxiv.org/abs/2502.06193
作者: Ruiqi Wang,Jiyu Guo,Cuiyun Gao,Guodong Fan,Chun Yong Chong,Xin Xia
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted by ISSTA 2025
点击查看摘要
Abstract:Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of these LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlation of 81.32 and 68.51 with human scores in code translation and generation, achieving near-human evaluation, noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. Finally, we provide…
[AI-37] Right Time to Learn: Promoting Generalization via Bio-inspired Spacing Effect in Knowledge Distillation
链接: https://arxiv.org/abs/2502.06192
作者: Guanglong Sun,Hongwei Yan,Liyuan Wang,Qian Li,Bo Lei,Yi Zhong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). Although it was originally proposed to train a more compact "student" model from a large "teacher" model, many recent efforts have focused on adapting it to promote generalization of the model itself, such as online KD and self KD. Here, we propose an accessible and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named the spacing effect in biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. With both theoretical and empirical analyses, we demonstrate that the benefits of the proposed Spaced KD stem from convergence to a flatter loss landscape during stochastic gradient descent (SGD). We perform extensive experiments to validate the effectiveness of Spaced KD in improving the learning performance of DNNs (e.g., the performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively).
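To make the spacing idea concrete, below is a minimal self-KD sketch in which the teacher is a frozen snapshot of the student refreshed every fixed number of steps. The exact spacing schedule and loss weighting in the paper may differ; `space_interval`, `alpha`, and `T` are illustrative names of our own, not the authors' API.

```python
import copy
import torch
import torch.nn.functional as F

def spaced_self_kd(model, loader, optimizer, space_interval=100, alpha=0.5, T=4.0):
    """Self-KD sketch where the teacher lags the student by a fixed 'space' of steps."""
    teacher = copy.deepcopy(model).eval()
    for step, (x, y) in enumerate(loader):
        if step % space_interval == 0:
            teacher = copy.deepcopy(model).eval()  # refresh the spaced snapshot
        logits = model(x)
        with torch.no_grad():
            t_logits = teacher(x)
        ce = F.cross_entropy(logits, y)
        # Standard temperature-scaled distillation term
        kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                      F.softmax(t_logits / T, dim=-1),
                      reduction="batchmean") * T * T
        loss = (1 - alpha) * ce + alpha * kd
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```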
[AI-38] Low Tensor-Rank Adaptation of Kolmogorov–Arnold Networks
链接: https://arxiv.org/abs/2502.06153
作者: Yihang Gao,Michael K. Ng,Vincent Y.F. Tan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Kolmogorov–Arnold networks (KANs) have demonstrated their potential as an alternative to multi-layer perceptrons (MLPs) in various domains, especially for science-related tasks. However, transfer learning of KANs remains a relatively unexplored area. In this paper, inspired by Tucker decomposition of tensors and evidence on the low tensor-rank structure in KAN parameter updates, we develop low tensor-rank adaptation (LoTRA) for fine-tuning KANs. We study the expressiveness of LoTRA based on Tucker decomposition approximations. Furthermore, we provide a theoretical analysis to select the learning rates for each LoTRA component to enable efficient training. Our analysis also shows that using identical learning rates across all components leads to inefficient training, highlighting the need for an adaptive learning rate strategy. Beyond theoretical insights, we explore the application of LoTRA for efficiently solving various partial differential equations (PDEs) by fine-tuning KANs. Additionally, we propose Slim KANs that incorporate the inherent low-tensor-rank properties of KAN parameter tensors to reduce model size while maintaining superior performance. Experimental results validate the efficacy of the proposed learning rate selection strategy and demonstrate the effectiveness of LoTRA for transfer learning of KANs in solving PDEs. Further evaluations on Slim KANs for function representation and image classification tasks highlight the expressiveness of LoTRA and the potential for parameter reduction through low tensor-rank decomposition.
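Since the abstract explicitly grounds LoTRA in Tucker decomposition, a minimal sketch of a Tucker-factorized update for a 3-way parameter tensor might look as follows; the ranks, initialization, and per-component learning rates below are our assumptions, not the paper's reference implementation.

```python
import torch

class LoTRADelta(torch.nn.Module):
    """Sketch of a Tucker-style low tensor-rank update for a 3-way parameter
    tensor W of shape (d1, d2, d3). Ranks and naming are illustrative."""
    def __init__(self, d1, d2, d3, r1=4, r2=4, r3=4):
        super().__init__()
        self.core = torch.nn.Parameter(torch.zeros(r1, r2, r3))  # zero init => zero delta at start
        self.U1 = torch.nn.Parameter(torch.randn(d1, r1) * 0.02)
        self.U2 = torch.nn.Parameter(torch.randn(d2, r2) * 0.02)
        self.U3 = torch.nn.Parameter(torch.randn(d3, r3) * 0.02)

    def forward(self):
        # Delta W = core x_1 U1 x_2 U2 x_3 U3 (Tucker reconstruction)
        return torch.einsum("abc,ia,jb,kc->ijk", self.core, self.U1, self.U2, self.U3)

# usage: W_adapted = W_frozen + LoTRADelta(*W_frozen.shape)()
```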
[AI-39] The Value of Information in Human-AI Decision-making
链接: https://arxiv.org/abs/2502.06152
作者: Ziyang Guo,Yifan Wu,Jason Hartline,Jessica Hullman
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Humans and AIs are often paired on decision tasks with the expectation of achieving complementary performance, where the combination of human and AI outperforms either one alone. However, how to improve the performance of a human-AI team is often not clear without knowing more about what particular information and strategies each agent employs. We provide a decision-theoretic framework for characterizing the value of information, and consequently the opportunities for agents to better exploit available information, in AI-assisted decision workflows. We demonstrate the use of the framework for model selection, empirical evaluation of human-AI performance, and explanation design. We propose a novel information-based instance-level explanation technique that adapts a conventional saliency-based explanation to explain information value in decision making.
[AI-40] Powerformer: A Transformer with Weighted Causal Attention for Time-series Forecasting
链接: https://arxiv.org/abs/2502.06151
作者: Kareem Hegazy,Michael W. Mahoney,N. Benjamin Erichson
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Transformers have recently shown strong performance in time-series forecasting, but their all-to-all attention mechanism overlooks the (temporal) causal and often (temporally) local nature of data. We introduce Powerformer, a novel Transformer variant that replaces noncausal attention weights with causal weights that are reweighted according to a smooth heavy-tailed decay. This simple yet effective modification endows the model with an inductive bias favoring temporally local dependencies, while still allowing sufficient flexibility to learn the unique correlation structure of each dataset. Our empirical results demonstrate that Powerformer not only achieves state-of-the-art accuracy on public time-series benchmarks, but also that it offers improved interpretability of attention patterns. Our analyses show that the model’s locality bias is amplified during training, demonstrating an interplay between time-series data and power-law-based attention. These findings highlight the importance of domain-specific modifications to the Transformer architecture for time-series forecasting, and they establish Powerformer as a strong, efficient, and principled baseline for future research and real-world applications.
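A hedged sketch of the core mechanism: the abstract specifies causal weights reweighted by a smooth heavy-tailed decay, but not the exact family, so the log-distance bias and `alpha` below are our assumptions.

```python
import torch
import torch.nn.functional as F

def power_law_causal_attention(q, k, v, alpha=1.0):
    """Causal attention whose logits are biased by a heavy-tailed temporal decay.
    Adding -alpha*log1p(dist) multiplies each weight by (1+dist)^(-alpha)."""
    T, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    i = torch.arange(T).unsqueeze(1)              # query positions
    j = torch.arange(T).unsqueeze(0)              # key positions
    dist = (i - j).clamp(min=0).float()
    scores = scores - alpha * torch.log1p(dist)   # power-law decay bias
    scores = scores.masked_fill(j > i, float("-inf"))  # causal mask
    return F.softmax(scores, dim=-1) @ v
```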
[AI-41] Guided Exploration for Efficient Relational Model Learning
链接: https://arxiv.org/abs/2502.06146
作者: Annie Feng,Nishanth Kumar,Tomas Lozano-Perez,Leslie Pack Kaelbling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Efficient exploration is critical for learning relational models in large-scale environments with complex, long-horizon tasks. Random exploration methods often collect redundant or irrelevant data, limiting their ability to learn accurate relational models of the environment. Goal-literal babbling (GLIB) improves upon random exploration by setting and planning to novel goals, but its reliance on random actions and random novel goal selection limits its scalability to larger domains. In this work, we identify the principles underlying efficient exploration in relational domains: (1) operator initialization with demonstrations that cover the distinct lifted effects necessary for planning and (2) refining preconditions to collect maximally informative transitions by selecting informative goal-action pairs and executing plans to them. To demonstrate these principles, we introduce Baking-Large, a challenging domain with extensive state-action spaces and long-horizon tasks. We evaluate methods using oracle-driven demonstrations for operator initialization and precondition-targeting guidance to efficiently gather critical transitions. Experiments show that both the oracle demonstrations and precondition-targeting oracle guidance significantly improve sample efficiency and generalization, paving the way for future methods to use these principles to efficiently learn accurate relational models in complex domains.
[AI-42] Graph Neural Networks at a Fraction
链接: https://arxiv.org/abs/2502.06136
作者: Rucha Bhalchandra Joshi,Sagar Prakash Barad,Nidhi Tiwari,Subhankar Mishra
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 2 figures, accepted at PAKDD 2025
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for learning representations of graph-structured data. In addition to real-valued GNNs, quaternion GNNs also perform well on tasks on graph-structured data. With the aim of reducing the energy footprint, we reduce the model size while maintaining accuracy comparable to that of the original-sized GNNs. This paper introduces Quaternion Message Passing Neural Networks (QMPNNs), a framework that leverages quaternion space to compute node representations. Our approach offers a generalizable method for incorporating quaternion representations into GNN architectures at one-fourth of the original parameter count. Furthermore, we present a novel perspective on Graph Lottery Tickets, redefining their applicability within the context of GNNs and QMPNNs. We specifically aim to find the initialization lottery from the subnetwork of the GNNs that can achieve comparable performance to the original GNN upon training. Thereby reducing the trainable model parameters even further. To validate the effectiveness of our proposed QMPNN framework and LTH for both GNNs and QMPNNs, we evaluate their performance on real-world datasets across three fundamental graph-based tasks: node classification, link prediction, and graph classification.
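The one-fourth parameter count follows from the standard quaternion-network construction, sketched below: features are split into four components (r, i, j, k) and mixed with the Hamilton product. This is the generic construction, not the authors' exact code.

```python
import torch

class QuaternionLinear(torch.nn.Module):
    """Quaternion linear map: one shared quaternion weight per 4 real channels,
    so the weight count is ~1/4 of a real-valued layer of the same width."""
    def __init__(self, in_features, out_features):
        super().__init__()
        assert in_features % 4 == 0 and out_features % 4 == 0
        n_in, n_out = in_features // 4, out_features // 4
        self.r, self.i, self.j, self.k = (
            torch.nn.Parameter(torch.randn(n_in, n_out) * 0.02) for _ in range(4)
        )

    def forward(self, x):
        xr, xi, xj, xk = x.chunk(4, dim=-1)
        # Hamilton product of input quaternion features with the weight quaternion
        yr = xr @ self.r - xi @ self.i - xj @ self.j - xk @ self.k
        yi = xr @ self.i + xi @ self.r + xj @ self.k - xk @ self.j
        yj = xr @ self.j - xi @ self.k + xj @ self.r + xk @ self.i
        yk = xr @ self.k + xi @ self.j - xj @ self.i + xk @ self.r
        return torch.cat([yr, yi, yj, yk], dim=-1)
```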
[AI-43] Foundation Model of Electronic Medical Records for Adaptive Risk Estimation
链接: https://arxiv.org/abs/2502.06124
作者: Pawel Renc,Michal K. Grzeszczyk,Nassim Oufattole,Deirdre Goode,Yugang Jia,Szymon Bieganski,Matthew B. A. McDermott,Jaroslaw Was,Anthony E. Samir,Jonathan W. Cunningham,David W. Bates,Arkadiusz Sitek
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS predicts future PHTs using transformer-based architectures. The Adaptive Risk Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk probabilities for clinician-defined critical events. ARES incorporates a personalized explainability module that identifies key clinical factors influencing risk estimates for individual patients. ARES was evaluated on the MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its performance against traditional early warning systems and machine learning models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, with 60% including hospital admissions. The dataset contained over 357 million tokens. ETHOS outperformed benchmark models in predicting hospital admissions, ICU admissions, and prolonged hospital stays, achieving superior AUC scores. ETHOS-based risk estimates demonstrated robustness across demographic subgroups with strong model reliability, confirmed via calibration curves. The personalized explainability module provides insights into patient-specific factors contributing to risk. ARES, powered by ETHOS, advances predictive healthcare AI by providing dynamic, real-time, and personalized risk estimation with patient-specific explainability to enhance clinician trust. Its adaptability and superior accuracy position it as a transformative tool for clinical decision-making, potentially improving patient outcomes and resource allocation in emergency and inpatient settings. We release the full code at this http URL to facilitate future research.
[AI-44] Revisiting Dynamic Graph Clustering via Matrix Factorization
链接: https://arxiv.org/abs/2502.06117
作者: Dongyuan Li,Satoshi Kosugi,Ying Zhang,Manabu Okumura,Feng Xia,Renhe Jiang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted by TheWebConf 2025 (Oral)
点击查看摘要
Abstract:Dynamic graph clustering aims to detect and track time-varying clusters in dynamic graphs, revealing the evolutionary mechanisms of complex real-world dynamic systems. Matrix factorization-based methods are promising approaches for this task; however, these methods often struggle with scalability and can be time-consuming when applied to large-scale dynamic graphs. Moreover, they tend to lack robustness and are vulnerable to real-world noisy data. To address these issues, we make three key contributions. First, to improve scalability, we propose temporal separated matrix factorization, where a single matrix is divided into multiple smaller matrices for independent factorization, resulting in faster computation. Second, to improve robustness, we introduce bi-clustering regularization, which jointly optimizes graph embedding and clustering, thereby filtering out noisy features from the graph embeddings. Third, to further enhance effectiveness and efficiency, we propose selective embedding updating, where we update only the embeddings of dynamic nodes while the embeddings of static nodes are fixed among different timestamps. Experimental results on six synthetic and five real-world benchmarks demonstrate the scalability, robustness and effectiveness of our proposed method. Source code is available at this https URL.
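A toy sketch of the "temporal separated" idea: factorize each snapshot independently instead of one large stacked matrix. The real method additionally uses bi-clustering regularization and selective embedding updating, both omitted here; all hyperparameters are illustrative.

```python
import numpy as np

def temporal_separated_mf(adjs, k=16, iters=200, lr=0.01):
    """Factorize each snapshot A_t ~= U_t V_t^T with plain gradient descent,
    yielding per-timestamp node embeddings for clustering."""
    embeddings = []
    for A in adjs:                      # one small factorization per timestamp
        n, m = A.shape
        rng = np.random.default_rng(0)
        U = rng.normal(0, 0.1, (n, k))
        V = rng.normal(0, 0.1, (m, k))
        for _ in range(iters):
            R = U @ V.T - A             # reconstruction residual
            U, V = U - lr * (R @ V), V - lr * (R.T @ U)
        embeddings.append(U)            # node embeddings at time t
    return embeddings
```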
[AI-45] CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories
链接: https://arxiv.org/abs/2502.06111
作者: Yijia Xiao,Runhui Wang,Luyang Kong,Davor Golac,Wei Wang
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The increasing complexity of computer science research projects demands more effective tools for deploying code repositories. Large Language Models (LLMs), such as Anthropic Claude and Meta Llama, have demonstrated significant advancements across various fields of computer science research, including the automation of diverse software engineering tasks. To evaluate the effectiveness of LLMs in handling complex code development tasks of research projects, particularly for NLP/CV/AI/ML/DM topics, we introduce CSR-Bench, a benchmark for Computer Science Research projects. This benchmark assesses LLMs from various aspects including accuracy, efficiency, and deployment script quality, aiming to explore their potential in conducting computer science research autonomously. We also introduce a novel framework, CSR-Agents, that utilizes multiple LLM agents to automate the deployment of GitHub code repositories of computer science research projects. Specifically, by checking instructions from markdown files and interpreting repository structures, the model generates and iteratively improves bash commands that set up the experimental environments and deploy the code to conduct research tasks. Preliminary results from CSR-Bench indicate that LLM agents can significantly enhance the workflow of repository deployment, thereby boosting developer productivity and improving the management of developmental workflows.
[AI-46] Comprehensive Framework for Evaluating Conversational AI Chatbots
链接: https://arxiv.org/abs/2502.06105
作者: Shailja Gupta,Rajesh Ranjan,Surya Narayan Singh
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 2 Figures
点击查看摘要
Abstract:Conversational AI chatbots are transforming industries by streamlining customer service, automating transactions, and enhancing user engagement. However, evaluating these systems remains a challenge, particularly in financial services, where compliance, user trust, and operational efficiency are critical. This paper introduces a novel evaluation framework that systematically assesses chatbots across four dimensions: cognitive and conversational intelligence, user experience, operational efficiency, and ethical and regulatory compliance. By integrating advanced AI methodologies with financial regulations, the framework bridges theoretical foundations and real-world deployment challenges. Additionally, we outline future research directions, emphasizing improvements in conversational coherence, real-time adaptability, and fairness.
[AI-47] NLGR: Utilizing Neighbor Lists for Generative Rerank in Personalized Recommendation Systems WWW2025
链接: https://arxiv.org/abs/2502.06097
作者: Shuli Wang,Xue Wei,Senjie Kou,Chi Wang,Wenshuai Chen,Qi Tang,Yinhua Zhu,Xiong Xiao,Xingxing Wang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted by WWW 2025 Industry Track
点击查看摘要
Abstract:Reranking plays a crucial role in modern multi-stage recommender systems by rearranging the initial ranking list. Due to the inherent challenges of combinatorial search spaces, some current research adopts an evaluator-generator paradigm, with a generator generating feasible sequences and an evaluator selecting the best sequence based on the estimated list utility. However, these methods still face two issues. Firstly, due to the goal inconsistency problem between the evaluator and generator, the generator tends to fit the local optimal solution of exposure distribution rather than combinatorial space optimization. Secondly, the strategy of generating target items one by one is difficult to achieve optimality because it ignores the information of subsequent items. To address these issues, we propose a model utilizing Neighbor Lists for Generative Reranking (NLGR), which aims to improve the performance of the generator in the combinatorial space. NLGR follows the evaluator-generator paradigm and improves the generator’s training and generating methods. Specifically, we use neighbor lists in combination space to enhance the training process, making the generator perceive the relative scores and find the optimization direction. Furthermore, we propose a novel sampling-based non-autoregressive generation method, which allows the generator to jump flexibly from the current list to any neighbor list. Extensive experiments on public and industrial datasets validate NLGR’s effectiveness and we have successfully deployed NLGR on the Meituan food delivery platform.
[AI-48] Rateless Joint Source-Channel Coding and a Blueprint for 6G Semantic Communications System Design
链接: https://arxiv.org/abs/2502.06095
作者: Saeed R. Khosravirad
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
*备注: 39 pages, 9 figures, 2 tables
点击查看摘要
Abstract:This paper introduces rateless joint source-channel coding (rateless JSCC). The code is rateless in that it is designed and optimized for a continuum of coding rates such that it achieves a desired distortion for any rate in that continuum. We further introduce rate-adaptive and stable communication link operation to accommodate rateless JSCCs. The link operation resembles a "bit pipe" that is identified by its rate in bits per frame, and by the rate of bits that are flipped in each frame. Thus, the link operation is rate-adaptive such that it punctures the rateless JSCC codeword to adapt its length (and coding rate) to the underlying channel capacity, and is stable in maintaining the bit flipping ratio across time frames. Next, a new family of autoencoder rateless JSCC codes is introduced. The code family is dubbed RLACS code (read as relax code, standing for rateless and lossy autoencoder channel and source code). The code is tested for reconstruction loss of image signals and demonstrates powerful performance that is resilient to variation of channel quality. RLACS code is readily applicable to the case of semantic distortion suited to a variety of semantic and effectiveness communications use cases. In the second part of the paper, we dive into the practical concerns around semantic communication and provide a blueprint for semantic networking system design relying on updating the existing network systems with some essential modifications. We further outline a comprehensive list of open research problems and development challenges towards a practical 6G communications system design that enables semantic networking.
[AI-49] Physics-Guided Foundation Model for Scientific Discovery: An Application to Aquatic Science
链接: https://arxiv.org/abs/2502.06084
作者: Runlong Yu,Chonghao Qiu,Robert Ladwig,Paul Hanson,Yiqun Xie,Xiaowei Jia
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Physics-guided machine learning (PGML) has become a prevalent approach in studying scientific systems due to its ability to integrate scientific theories for enhancing machine learning (ML) models. However, most PGML approaches are tailored to isolated and relatively simple tasks, which limits their applicability to complex systems involving multiple interacting processes and numerous influencing features. In this paper, we propose a Physics-Guided Foundation Model (PGFM) that combines pre-trained ML models and physics-based models and leverages their complementary strengths to improve the modeling of multiple coupled processes. To effectively conduct pre-training, we construct a simulated environmental system that encompasses a wide range of influencing features and various simulated variables generated by physics-based models. The model is pre-trained in this system to adaptively select important feature interactions guided by multi-task objectives. We then fine-tune the model for each specific task using true observations, while maintaining consistency with established physical theories, such as the principles of mass and energy conservation. We demonstrate the effectiveness of this methodology in modeling water temperature and dissolved oxygen dynamics in real-world lakes. The proposed PGFM is also broadly applicable to a range of scientific fields where physics-based models are being used.
[AI-50] Nearly Optimal Sample Complexity of Offline KL-Regularized Contextual Bandits under Single-Policy Concentrability
链接: https://arxiv.org/abs/2502.06051
作者: Qingyue Zhao,Kaixuan Ji,Heyang Zhao,Tong Zhang,Quanquan Gu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 23 pages
点击查看摘要
Abstract:KL-regularized policy optimization has become a workhorse in learning-based decision making, while its theoretical understanding is still very limited. Although recent progress has been made towards settling the sample complexity of KL-regularized contextual bandits, existing sample complexity bounds are either \tilde{O}(\epsilon^{-2}) under single-policy concentrability or \tilde{O}(\epsilon^{-1}) under all-policy concentrability. In this paper, we propose the first algorithm with \tilde{O}(\epsilon^{-1}) sample complexity under single-policy concentrability for offline contextual bandits. Our algorithm is designed for general function approximation and based on the principle of pessimism in the face of uncertainty. The core of our proof leverages the strong convexity of the KL regularization, and the conditional non-negativity of the gap between the true reward and its pessimistic estimator, to refine a mean-value-type risk upper bound to its extreme. This in turn leads to a novel covariance-based analysis, effectively bypassing the need for uniform control over the discrepancy between any two functions in the function class. The near-optimality of our algorithm is demonstrated by an \tilde{\Omega}(\epsilon^{-1}) lower bound. Furthermore, we extend our algorithm to contextual dueling bandits and achieve a similar nearly optimal sample complexity.
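For context, the KL-regularized objective behind these bounds, \max_\pi \mathbb{E}_{a\sim\pi}[r(x,a)] - \eta^{-1}\mathrm{KL}(\pi\|\pi_0), has the standard Gibbs-form optimum shown below (notation ours: \pi_0 is the reference policy, \eta the regularization strength).

```latex
\pi^*(a \mid x)
  = \frac{\pi_0(a \mid x)\,\exp\!\big(\eta\, r(x,a)\big)}
         {\sum_{a'} \pi_0(a' \mid x)\,\exp\!\big(\eta\, r(x,a')\big)}
```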
[AI-51] Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models
链接: https://arxiv.org/abs/2502.06039
作者: Marc Bruni,Fabio Gabrielli,Mohammad Ghafari,Martin Kropp
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Accepted at the 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge 2025). 10 pages, 7 figures, 5 tables
点击查看摘要
Abstract:Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs). However, its effectiveness in mitigating vulnerabilities in LLM-generated code remains underexplored. To address this gap, we implemented a benchmark to automatically assess the impact of various prompt engineering strategies on code security. Our benchmark leverages two peer-reviewed prompt datasets and employs static scanners to evaluate code security at scale. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. Our results show that for GPT-4o and GPT-4o-mini, a security-focused prompt prefix can reduce the occurrence of security vulnerabilities by up to 56%. Additionally, all tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code when using iterative prompting techniques. Finally, we introduce a “prompt agent” that demonstrates how the most effective techniques can be applied in real-world development workflows.
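As a minimal sketch of the security-prefix technique: the exact prefix wording used in the benchmark is not given in the abstract, so the text below is our assumption; the client call follows the v1 OpenAI Python SDK.

```python
from openai import OpenAI  # assumes the v1 OpenAI SDK; adapt for other clients

# Illustrative security-focused prefix in the spirit of the paper's
# best-performing technique; the paper's exact wording may differ.
SECURITY_PREFIX = (
    "You are a security-conscious developer. Write code that avoids common "
    "vulnerabilities (injection, path traversal, hard-coded secrets, unsafe "
    "deserialization) and validates all external input.\n\n"
)

def generate_secure_code(task: str, model: str = "gpt-4o-mini") -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SECURITY_PREFIX + task}],
    )
    return resp.choices[0].message.content
```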
[AI-52] Provably Overwhelming Transformer Models with Designed Inputs
链接: https://arxiv.org/abs/2502.06038
作者: Lev Stambler,Seyed Sajjad Nezhadi,Matthew Coudron
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注:
点击查看摘要
Abstract:We develop an algorithm which, given a trained transformer model \mathcal{M} as input, as well as a string of tokens s of length n_{fix} and an integer n_{free}, can generate a mathematical proof that \mathcal{M} is "overwhelmed" by s, in time and space \widetilde{O}(n_{fix}^2 + n_{free}^3). We say that \mathcal{M} is "overwhelmed" by s when the output of the model evaluated on this string plus any additional string t, \mathcal{M}(s + t), is completely insensitive to the value of the string t whenever length(t) \leq n_{free}. Along the way, we prove a particularly strong worst-case form of "over-squashing", which we use to bound the model’s behavior. Our technique uses computer-aided proofs to establish this type of operationally relevant guarantee about transformer models. We empirically test our algorithm on a single layer transformer complete with an attention head, layer-norm, MLP/ReLU layers, and RoPE positional encoding. We believe that this work is a stepping stone towards the difficult task of obtaining useful guarantees for trained transformer models.
[AI-53] Kolmogorov-Arnold Fourier Networks
链接: https://arxiv.org/abs/2502.06018
作者: Jusheng Zhang,Yijia Fan,Kaitong Cai,Keze Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Although Kolmogorov-Arnold based interpretable networks (KAN) have strong theoretical expressiveness, they face significant parameter explosion and high-frequency feature capture challenges in high-dimensional tasks. To address this issue, we propose the Kolmogorov-Arnold-Fourier Network (KAF), which effectively integrates trainable Random Fourier Features (RFF) and a novel hybrid GELU-Fourier activation mechanism to balance parameter efficiency and spectral representation capabilities. Our key technical contributions include: (1) merging KAN’s dual-matrix structure through matrix association properties to substantially reduce parameters; (2) introducing learnable RFF initialization strategies to eliminate spectral distortion in high-dimensional approximation tasks; (3) implementing an adaptive hybrid activation function that progressively enhances frequency representation during the training process. Comprehensive experiments demonstrate the superiority of our KAF across various domains including vision, NLP, audio processing, and differential equation-solving tasks, effectively combining theoretical interpretability with practical utility and computational efficiency.
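A minimal sketch of the KAF ingredients named in the abstract: trainable Random Fourier Features plus a learnable GELU-Fourier blend. The initialization scheme and the exact adaptive blending rule are our assumptions.

```python
import math
import torch

class KAFLayer(torch.nn.Module):
    """Linear map plus trainable RFF, mixed with GELU via a learnable gate
    that can shift from GELU toward Fourier terms during training."""
    def __init__(self, d_in, d_out, n_features=64):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.W = torch.nn.Parameter(torch.randn(d_in, n_features))      # trainable RFF frequencies
        self.b = torch.nn.Parameter(2 * math.pi * torch.rand(n_features))
        self.out = torch.nn.Linear(n_features, d_out, bias=False)
        self.mix = torch.nn.Parameter(torch.tensor(0.0))                # GELU<->Fourier gate

    def forward(self, x):
        fourier = self.out(torch.cos(x @ self.W + self.b))
        gate = torch.sigmoid(self.mix)
        return (1 - gate) * torch.nn.functional.gelu(self.linear(x)) + gate * fourier
```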
[AI-54] Motion Control in Multi-Rotor Aerial Robots Using Deep Reinforcement Learning
链接: https://arxiv.org/abs/2502.05996
作者: Gaurav Shetty,Mahya Ramezani,Hamed Habibi,Holger Voos,Jose Luis Sanchez-Lopez
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper investigates the application of Deep Reinforcement Learning (DRL) to address motion control challenges in drones for additive manufacturing (AM). Drone-based additive manufacturing promises flexible and autonomous material deposition in large-scale or hazardous environments. However, achieving robust real-time control of a multi-rotor aerial robot under varying payloads and potential disturbances remains challenging. Traditional controllers like PID often require frequent parameter re-tuning, limiting their applicability in dynamic scenarios. We propose a DRL framework that learns adaptable control policies for multi-rotor drones performing waypoint navigation in AM tasks. We compare Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3) within a curriculum learning scheme designed to handle increasing complexity. Our experiments show TD3 consistently balances training stability, accuracy, and success, particularly when mass variability is introduced. These findings provide a scalable path toward robust, autonomous drone control in additive manufacturing.
[AI-55] Redefining Robot Generalization Through Interactive Intelligence
链接: https://arxiv.org/abs/2502.05963
作者: Sharmita Dey
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Recent advances in large-scale machine learning have produced high-capacity foundation models capable of adapting to a broad array of downstream tasks. While such models hold great promise for robotics, the prevailing paradigm still portrays robots as single, autonomous decision-makers, performing tasks like manipulation and navigation, with limited human involvement. However, a large class of real-world robotic systems, including wearable robotics (e.g., prostheses, orthoses, exoskeletons), teleoperation, and neural interfaces, are semiautonomous, and require ongoing interactive coordination with human partners, challenging single-agent assumptions. In this position paper, we argue that robot foundation models must evolve to an interactive multi-agent perspective in order to handle the complexities of real-time human-robot co-adaptation. We propose a generalizable, neuroscience-inspired architecture encompassing four modules: (1) a multimodal sensing module informed by sensorimotor integration principles, (2) an ad-hoc teamwork model reminiscent of joint-action frameworks in cognitive science, (3) a predictive world belief model grounded in internal model theories of motor control, and (4) a memory/feedback mechanism that echoes concepts of Hebbian and reinforcement-based plasticity. Although illustrated through the lens of cyborg systems, where wearable devices and human physiology are inseparably intertwined, the proposed framework is broadly applicable to robots operating in semi-autonomous or interactive contexts. By moving beyond single-agent designs, our position emphasizes how foundation models in robotics can achieve a more robust, personalized, and anticipatory level of performance.
[AI-56] Cyri: A Conversational AI-based Assistant for Supporting the Human User in Detecting and Responding to Phishing Attacks
链接: https://arxiv.org/abs/2502.05951
作者: Antonio La Torre,Marco Angelini
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:This work introduces Cyri, an AI-powered conversational assistant designed to support a human user in detecting and analyzing phishing emails by leveraging Large Language Models. Cyri has been designed to scrutinize emails for semantic features used in phishing attacks, such as urgency, and undesirable consequences, using an approach that unifies features already established in the literature with others by Cyri features extraction methodology. Cyri can be directly plugged into a client mail or webmail, ensuring seamless integration with the user’s email workflow while maintaining data privacy through local processing. By performing analyses on the user’s machine, Cyri eliminates the need to transmit sensitive email data over the internet, reducing associated security risks. The Cyri user interface has been designed to reduce habituation effects and enhance user engagement. It employs dynamic visual cues and context-specific explanations to keep users alert and informed while using emails. Additionally, it allows users to explore identified malicious semantic features both through conversation with the agent and visual exploration, obtaining the advantages of both modalities for expert or non-expert users. It also allows users to keep track of the conversation, supports the user in solving additional questions on both computed features or new parts of the mail, and applies its detection on demand. To evaluate Cyri, we crafted a comprehensive dataset of 420 phishing emails and 420 legitimate emails. Results demonstrate high effectiveness in identifying critical phishing semantic features fundamental to phishing detection. A user study involving 10 participants, both experts and non-experts, evaluated Cyri’s effectiveness and usability. Results indicated that Cyri significantly aided users in identifying phishing emails and enhanced their understanding of phishing tactics.
[AI-57] Survival Concept-Based Learning Models
链接: https://arxiv.org/abs/2502.05950
作者: Stanislav R. Kirpichenko,Lev V. Utkin,Andrei V. Konstantinov,Natalya M. Verbova
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Concept-based learning (CBL) enhances prediction accuracy and interpretability by leveraging high-level, human-understandable concepts. However, existing CBL frameworks do not address survival analysis tasks, which involve predicting event times in the presence of censored data – a common scenario in fields like medicine and reliability analysis. To bridge this gap, we propose two novel models: SurvCBM (Survival Concept-based Bottleneck Model) and SurvRCM (Survival Regularized Concept-based Model), which integrate concept-based learning with survival analysis to handle censored event time data. The models employ the Cox proportional hazards model and the Beran estimator. SurvCBM is based on the architecture of the well-known concept bottleneck model, offering interpretable predictions through concept-based explanations. SurvRCM uses concepts as regularization to enhance accuracy. Both models are trained end-to-end and provide interpretable predictions in terms of concepts. Two interpretability approaches are proposed: one leveraging the linear relationship in the Cox model and another using an instance-based explanation framework with the Beran estimator. Numerical experiments demonstrate that SurvCBM outperforms SurvRCM and traditional survival models, underscoring the importance and advantages of incorporating concept information. The code for the proposed algorithms is publicly available.
[AI-58] Verifying Proportionality in Temporal Voting AAAI
链接: https://arxiv.org/abs/2502.05949
作者: Edith Elkind,Svetlana Obraztsova,Jannik Peters,Nicholas Teh
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注: Appears in the 39th AAAI Conference on Artificial Intelligence (AAAI), 2025
点击查看摘要
Abstract:We study a model of temporal voting where there is a fixed time horizon, and at each round the voters report their preferences over the available candidates and a single candidate is selected. Prior work has adapted popular notions of justified representation as well as voting rules that provide strong representation guarantees from the multiwinner election setting to this model. In our work, we focus on the complexity of verifying whether a given outcome offers proportional representation. We show that in the temporal setting verification is strictly harder than in multiwinner voting, but identify natural special cases that enable efficient algorithms.
[AI-59] Barriers and Pathways to Human-AI Alignment: A Game-Theoretic Approach
链接: https://arxiv.org/abs/2502.05934
作者: Aran Nayebi
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 32 pages, including 5 main theorems and 10 lemmas
点击查看摘要
Abstract:Under what conditions can capable AI agents efficiently align their actions with human preferences? More specifically, when they are proficient enough to collaborate with us, how long does coordination take, and when is it computationally feasible? These foundational questions of AI alignment help define what makes an AI agent "sufficiently safe" and valuable to humans. Since such generally capable systems do not yet exist, a theoretical analysis is needed to establish when guarantees hold – and what they even are. We introduce a game-theoretic framework that generalizes prior alignment approaches with fewer assumptions, allowing us to analyze the computational complexity of alignment across M objectives and N agents, providing both upper and lower bounds. Unlike previous work, which often assumes common priors, idealized communication, or implicit tractability, our framework formally characterizes the difficulty of alignment under minimal assumptions. Our main result shows that even when agents are fully rational and computationally unbounded, alignment can be achieved with high probability in time linear in the task space size. Therefore, in real-world settings, where task spaces are often exponential in input length, this remains impractical. More strikingly, our lower bound demonstrates that alignment is impossible to speed up when scaling to exponentially many tasks or agents, highlighting a fundamental computational barrier to scalable alignment. Relaxing these idealized assumptions, we study computationally bounded agents with noisy messages (representing obfuscated intent), showing that while alignment can still succeed with high probability, it incurs additional exponential slowdowns in the task space size, number of agents, and number of tasks. We conclude by identifying conditions that make alignment more feasible.
[AI-60] Skill Expansion and Composition in Parameter Space ICLR2025
链接: https://arxiv.org/abs/2502.05932
作者: Tenglong Liu,Jianxiong Li,Yinan Zheng,Haoyi Niu,Yixing Lan,Xin Xu,Xianyuan Zhan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: ICLR 2025, 37 pages
点击查看摘要
Abstract:Humans excel at reusing prior knowledge to address new challenges and developing skills while solving problems. This paradigm becomes increasingly popular in the development of autonomous agents, as it develops systems that can self-evolve in response to new challenges like human beings. However, previous methods suffer from limited training efficiency when expanding new skills and fail to fully leverage prior knowledge to facilitate new task learning. In this paper, we propose Parametric Skill Expansion and Composition (PSEC), a new framework designed to iteratively evolve the agents’ capabilities and efficiently address new challenges by maintaining a manageable skill library. This library can progressively integrate skill primitives as plug-and-play Low-Rank Adaptation (LoRA) modules in parameter-efficient finetuning, facilitating efficient and flexible skill expansion. This structure also enables the direct skill compositions in parameter space by merging LoRA modules that encode different skills, leveraging shared information across skills to effectively program new skills. Based on this, we propose a context-aware module to dynamically activate different skills to collaboratively handle new tasks. Empowering diverse applications including multi-objective composition, dynamics shift, and continual policy shift, the results on D4RL, DSRL benchmarks, and the DeepMind Control Suite show that PSEC exhibits superior capacity to leverage prior knowledge to efficiently tackle new challenges, as well as expand its skill libraries to evolve the capabilities. Project website: this https URL.
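Skill composition in parameter space reduces, at its simplest, to a weighted merge of LoRA deltas; below is a hedged sketch (shapes and mixing weights are illustrative, and PSEC's context-aware activation module is omitted).

```python
import torch

def compose_lora_skills(base_weight, loras, alphas):
    """Merge LoRA skill modules in parameter space: each skill is a pair
    (A, B) and new behavior is programmed by a weighted sum of deltas."""
    delta = torch.zeros_like(base_weight)
    for (A, B), alpha in zip(loras, alphas):
        delta += alpha * (B @ A)        # each LoRA delta: B (d_out, r) @ A (r, d_in)
    return base_weight + delta

# usage: W_new = compose_lora_skills(W, [(A_nav, B_nav), (A_grasp, B_grasp)], [0.6, 0.4])
```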
[AI-61] Protecting Intellectual Property of EEG-based Neural Networks with Watermarking
链接: https://arxiv.org/abs/2502.05931
作者: Ahmed Abdelaziz,Ahmed Fathi,Ahmed Fares
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 21 pages, 13 figures, and 6 tables
点击查看摘要
Abstract:EEG-based neural networks, pivotal in medical diagnosis and brain-computer interfaces, face significant intellectual property (IP) risks due to their reliance on sensitive neurophysiological data and resource-intensive development. Current watermarking methods, particularly those using abstract trigger sets, lack robust authentication and fail to address the unique challenges of EEG models. This paper introduces a cryptographic wonder filter-based watermarking framework tailored for EEG-based neural networks. Leveraging collision-resistant hashing and public-key encryption, the wonder filter embeds the watermark during training, ensuring minimal distortion (\leq 5% drop in EEG task accuracy) and high reliability (100% watermark detection). The framework is rigorously evaluated against adversarial attacks, including fine-tuning, transfer learning, and neuron pruning. Results demonstrate persistent watermark retention, with classification accuracy for watermarked states remaining above 90% even after aggressive pruning, while primary task performance degrades faster, deterring removal attempts. Piracy resistance is validated by the inability to embed secondary watermarks without severe accuracy loss (>10% in EEGNet and CCNN models). Cryptographic hashing ensures authentication, reducing brute-force attack success probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet, TSception), the method achieves 99.4% null-embedding accuracy, effectively eliminating false positives. By integrating wonder filters with EEG-specific adaptations, this work bridges a critical gap in IP protection for neurophysiological models, offering a secure, tamper-proof solution for healthcare and biometric applications. The framework’s robustness against adversarial modifications underscores its potential to safeguard sensitive EEG models while maintaining diagnostic utility.
[AI-62] Sign-Symmetry Learning Rules are Robust Fine-Tuners
链接: https://arxiv.org/abs/2502.05925
作者: Aymene Berriche,Mehdi Zakaria Adjal,Riyadh Baghdadi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Backpropagation (BP) has long been the predominant method for training neural networks due to its effectiveness. However, numerous alternative approaches, broadly categorized under feedback alignment, have been proposed, many of which are motivated by the search for biologically plausible learning mechanisms. Despite their theoretical appeal, these methods have consistently underperformed compared to BP, leading to a decline in research interest. In this work, we revisit the role of such methods and explore how they can be integrated into standard neural network training pipelines. Specifically, we propose fine-tuning BP-pre-trained models using Sign-Symmetry learning rules and demonstrate that this approach not only maintains performance parity with BP but also enhances robustness. Through extensive experiments across multiple tasks and benchmarks, we establish the validity of our approach. Our findings introduce a novel perspective on neural network training and open new research directions for leveraging biologically inspired learning rules in deep learning.
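Sign-symmetry replaces the transported weights in the backward pass with their signs. Below is a minimal PyTorch sketch of such a linear layer (bias omitted; whether the paper applies the rule exactly this way during fine-tuning is our assumption).

```python
import torch

class SignSymmetricLinearFn(torch.autograd.Function):
    """Linear layer whose backward pass propagates gradients through
    sign(W) instead of W itself (a sign-symmetry feedback rule)."""
    @staticmethod
    def forward(ctx, x, W):
        ctx.save_for_backward(x, W)
        return x @ W.t()                     # x: (B, d_in), W: (d_out, d_in)

    @staticmethod
    def backward(ctx, grad_out):
        x, W = ctx.saved_tensors
        grad_x = grad_out @ torch.sign(W)    # feedback uses only the sign of W
        grad_W = grad_out.t() @ x            # usual weight gradient
        return grad_x, grad_W

# usage: y = SignSymmetricLinearFn.apply(x, W)
```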
[AI-63] NeuralPrefix: A Zero-shot Sensory Data Imputation Plugin
链接: https://arxiv.org/abs/2502.05883
作者: Abdelwahed Khamis,Sara Khalifa
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted in PerCom 25
点击查看摘要
Abstract:Real-world sensing challenges such as sensor failures, communication issues, and power constraints lead to data intermittency, an issue that is known to undermine the traditional classification task that assumes a continuous data stream. Previous works addressed this issue by designing bespoke solutions (i.e. task-specific and/or modality-specific imputation). These approaches, while effective for their intended purposes, had limitations in their applicability across different tasks and sensor modalities. This raises an important question: Can we build a task-agnostic imputation pipeline that is transferable to new sensors without requiring additional training? In this work, we formalise the concept of zero-shot imputation and propose a novel approach that enables the adaptation of pre-trained models to handle data intermittency. This framework, named NeuralPrefix, is a generative neural component that precedes a task model during inference, filling in gaps caused by data intermittency. NeuralPrefix is built as a continuous dynamical system, where its internal state can be estimated at any point in time by solving an Ordinary Differential Equation (ODE). This approach allows for a more versatile and adaptable imputation method, overcoming the limitations of task-specific and modality-specific solutions. We conduct a comprehensive evaluation of NeuralPrefix on multiple sensory datasets, demonstrating its effectiveness across various domains. When tested on intermittent data with a high 50% missing data rate, NeuralPrefix accurately recovers all the missing samples, achieving SSIM scores between 0.93 and 0.96. Zero-shot evaluations show that NeuralPrefix generalises well to unseen datasets, even when the measurements come from a different modality.
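A toy version of the continuous-dynamics idea: roll a learned vector field across a gap with fixed-step Euler integration. The actual system presumably uses a proper ODE solver and learned encoders/decoders; everything here is a simplified assumption.

```python
import torch

class ODEImputer(torch.nn.Module):
    """Treat the hidden state as a continuous system dh/dt = f(h) and
    roll it forward through a data gap to fill in missing samples."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.f = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, dim),
        )

    def forward(self, h_last, gap_steps, dt=0.1):
        h, filled = h_last, []
        for _ in range(gap_steps):      # Euler integration across the gap
            h = h + dt * self.f(h)
            filled.append(h)
        return torch.stack(filled, dim=0)
```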
[AI-64] Uni-Retrieval: A Multi-Style Retrieval Framework for STEMs Education
链接: https://arxiv.org/abs/2502.05863
作者: Yanhao Jia,Xinyi Wu,Hao Li,Qinglin Zhang,Yuxiao Hu,Shuai Zhao,Wenqi Fan
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:In AI-facilitated teaching, leveraging various query styles to interpret abstract text descriptions is crucial for ensuring high-quality teaching. However, current retrieval models primarily focus on natural text-image retrieval, making them insufficiently tailored to educational scenarios due to the ambiguities in the retrieval process. In this paper, we propose a diverse expression retrieval task tailored to educational scenarios, supporting retrieval based on multiple query styles and expressions. We introduce the STEM Education Retrieval Dataset (SER), which contains over 24,000 query pairs of different styles, and Uni-Retrieval, an efficient and style-diversified retrieval vision-language model based on prompt tuning. Uni-Retrieval extracts query style features as prototypes and builds a continuously updated Prompt Bank containing prompt tokens for diverse queries. This bank can be updated during test time to represent domain-specific knowledge for different subject retrieval scenarios. Our framework demonstrates scalability and robustness by dynamically retrieving prompt tokens based on prototype similarity, effectively facilitating learning for unknown queries. Experimental results indicate that Uni-Retrieval outperforms existing retrieval models in most retrieval tasks. This advancement provides a scalable and precise solution for diverse educational needs.
[AI-65] HyGEN: Regularizing Negative Hyperedge Generation for Accurate Hyperedge Prediction WWW
链接: https://arxiv.org/abs/2502.05827
作者: Song Kyung Yu,Da Eun Lee,Yunyong Ko,Sang-Wook Kim
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: 4 pages, 4 figures, 2 tables, the Web Conference (WWW) 2025
点击查看摘要
Abstract:Hyperedge prediction is a fundamental task to predict future high-order relations based on the observed network structure. Existing hyperedge prediction methods, however, suffer from the data sparsity problem. To alleviate this problem, negative sampling methods can be used, which leverage non-existing hyperedges as contrastive information for model training. However, the following important challenges have been rarely studied: (C1) lack of guidance for generating negatives and (C2) possibility of producing false negatives. To address them, we propose a novel hyperedge prediction method, HyGEN, that employs (1) a negative hyperedge generator that employs positive hyperedges as a guidance to generate more realistic ones and (2) a regularization term that prevents the generated hyperedges from being false negatives. Extensive experiments on six real-world hypergraphs reveal that HyGEN consistently outperforms four state-of-the-art hyperedge prediction methods.
[AI-66] MindCraft: Revolutionizing Education through AI-Powered Personalized Learning and Mentorship for Rural India
链接: https://arxiv.org/abs/2502.05826
作者: Arihant Bardia,Aayush Agrawal
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:
点击查看摘要
Abstract:MindCraft is a modern platform designed to revolutionize education in rural India by leveraging Artificial Intelligence (AI) to create personalized learning experiences, provide mentorship, and foster resource-sharing. In a country where access to quality education is deeply influenced by geography and socio-economic status, rural students often face significant barriers in their educational journeys. MindCraft aims to bridge this gap by utilizing AI to create tailored learning paths, connect students with mentors, and enable a collaborative network of educational resources that transcends both physical and digital divides. This paper explores the challenges faced by rural students, the transformative potential of AI, and how MindCraft offers a scalable, sustainable solution for an equitable education system. By focusing on inclusivity, personalized learning, and mentorship, MindCraft seeks to empower rural students, equipping them with the skills, knowledge, and opportunities needed to thrive in an increasingly digital world. Ultimately, MindCraft envisions a future in which technology not only bridges educational gaps but also becomes the driving force for a more inclusive and empowered society.
[AI-67] The Curse of Depth in Large Language Models
链接: https://arxiv.org/abs/2502.05795
作者: Wenfang Sun,Xinyuan Song,Pengxiang Li,Lu Yin,Yefeng Zheng,Shiwei Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretical and empirical, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with the model depth, which undesirably causes the derivative of the deep Transformer blocks to approach an identity matrix, so that these blocks barely contribute to the training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of the output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training.
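The proposed fix is compact enough to sketch directly. Assuming a Pre-LN Transformer where `depth` is the 1-indexed layer index, a LayerNorm Scaling layer might look like the following; this is a sketch of the idea as stated in the abstract, not the authors' code.

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """Pre-LN variant that scales the LayerNorm output by 1/sqrt(depth),
    following the LayerNorm Scaling idea described in the abstract.
    `depth` is the 1-indexed layer index; names are illustrative."""
    def __init__(self, hidden_size: int, depth: int):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)
        self.scale = 1.0 / (depth ** 0.5)  # damps variance growth with depth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x) * self.scale

# deeper layers get smaller scales: layer 1 -> 1.0, layer 16 -> 0.25
print(ScaledLayerNorm(64, 16).scale)
```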
[AI-68] WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch
链接: https://arxiv.org/abs/2502.05783
作者: Ying Lei,Yancheng Cao,Will Wang,Yuanzhe Dong,Changchang Yin,Weidan Cao,Ping Zhang,Jingzhen Yang,Bingsheng Yao,Yifan Peng,Chunhua Weng,Randy Auerbach,Lena Mamykina,Dakuo Wang,Yuntao Wang,Xuhai Xu
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under submission
点击查看摘要
Abstract:While just-in-time interventions (JITIs) have effectively targeted common health behaviors, individuals often have unique needs to intervene in personal undesirable actions that can negatively affect physical, mental, and social well-being. We present WatchGuardian, a smartwatch-based JITI system that empowers users to define custom interventions for these personal actions with a small number of samples. For the model to detect new actions based on limited new data samples, we developed a few-shot learning pipeline that fine-tuned a pre-trained inertial measurement unit (IMU) model on public hand-gesture datasets. We then designed a data augmentation and synthesis process to train additional classification layers for customization. Our offline evaluation with 26 participants showed that with three, five, and ten examples, our approach achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of 74.8%, 84.2%, and 87.2%. We then conducted a four-hour intervention study to compare WatchGuardian against a rule-based intervention. Our results demonstrated that our system led to a significant reduction of 64.0 ± 22.6% in undesirable actions, substantially outperforming the baseline by 29.0%. Our findings underscore the effectiveness of a customizable, AI-driven JITI system for individuals in need of behavioral intervention in personal undesirable actions. We envision that our work can inspire broader applications of user-defined personalized intervention with advanced AI solutions.
[AI-69] Predictive Crash Analytics for Traffic Safety using Deep Learning
链接: https://arxiv.org/abs/2502.05777
作者: Karthik Sivakoti
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Traditional automated crash analysis systems heavily rely on static statistical models and historical data, requiring significant manual interpretation and lacking real-time predictive capabilities. This research presents an innovative approach to traffic safety analysis through the integration of ensemble learning methods and multi-modal data fusion for real-time crash risk assessment and prediction. Our primary contribution lies in developing a hierarchical severity classification system that combines spatial-temporal crash patterns with environmental conditions, achieving significant improvements over traditional statistical approaches. The system demonstrates a Mean Average Precision (mAP) of 0.893, representing a 15% improvement over current state-of-the-art methods (baseline mAP: 0.776). We introduce a novel feature engineering technique that integrates crash location data with incident reports and weather conditions, achieving 92.4% accuracy in risk prediction and 89.7% precision in hotspot identification. Through extensive validation using 500,000 initial crash records filtered to 59,496 high-quality samples, our solution shows marked improvements in both prediction accuracy and computational efficiency. Key innovations include a robust data cleaning pipeline, adaptive feature generation, and a scalable real-time prediction system capable of handling peak loads of 1,000 concurrent requests while maintaining sub-100ms response times.
[AI-70] PIPA: Preference Alignment as Prior-Informed Statistical Estimation
链接: https://arxiv.org/abs/2502.05773
作者: Junbo Li,Zhangyang Wang,Qiang Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Offline preference alignment for language models such as Direct Preference Optimization (DPO) is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer and step-level annotations. We illustrate that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we developed two variations of PIPA: PIPA-M and PIPA-N. Both algorithms demonstrate a 3-10% performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms.
[AI-71] RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
链接: https://arxiv.org/abs/2502.05740
作者: Ziqi Yang,Yuxuan Lu,Jennifer Bagdasarian,Vedant Das Swain,Ritu Agarwal,Collin Campbell,Waddah Al-Refaire,Jehan El-Bayoumi,Guodong Gao,Dakuo Wang,Bingsheng Yao,Nawar Shara
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group of cancers that account for more than 35% of cancer-related deaths worldwide, but postoperative complications are unpredictable and can be life-threatening. In this paper, we investigate how recent advancements in large language models (LLMs) can benefit remote patient monitoring (RPM) systems through clinical integration by designing RECOVER, an LLM-powered RPM system for postoperative GI cancer care. To closely engage stakeholders in the design process, we first conducted seven participatory design sessions with five clinical staff and interviewed five cancer patients to derive six major design strategies for integrating clinical guidelines and information needs into LLM-based RPM systems. We then designed and implemented RECOVER, which features an LLM-powered conversational agent for cancer patients and an interactive dashboard for clinical staff to enable efficient postoperative RPM. Finally, we used RECOVER as a pilot system to assess the implementation of our design strategies with four clinical staff and five patients, providing design implications by identifying crucial design elements, offering insights on responsible AI, and outlining opportunities for future LLM-powered RPM systems.
[AI-72] Mitigating Sensitive Information Leakage in LLM s4Code through Machine Unlearning
链接: https://arxiv.org/abs/2502.05739
作者: Ruotong Geng,Mingyang Geng,Shangwen Wang,Haotian Wang,Zhipeng Lin,Dezun Dong
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 11 pages
点击查看摘要
Abstract:Large Language Models for Code (LLMs4Code) excel at code generation tasks, yielding promise to release developers from huge software development burdens. Nonetheless, these models have been shown to suffer from significant privacy risks due to the potential leakage of sensitive information embedded during training, known as the memorization problem. Addressing this issue is crucial for ensuring privacy compliance and upholding user trust, but till now there is a dearth of dedicated studies in the literature that focus on this specific direction. Recently, machine unlearning has emerged as a promising solution by enabling models to “forget” sensitive information without full retraining, offering an efficient and scalable approach compared to traditional data cleaning methods. In this paper, we empirically evaluate the effectiveness of unlearning techniques for addressing privacy concerns in LLMs4Code. Specifically, we investigate three state-of-the-art unlearning algorithms and three well-known open-sourced LLMs4Code, on a benchmark that takes into consideration both the privacy data to be forgotten as well as the code generation capabilities of these models. Results show that it is feasible to mitigate the privacy concerns of LLMs4Code through machine unlearning while maintaining their code generation capabilities at the same time. We also dissect the forms of privacy protection/leakage after unlearning and observe that there is a shift from direct leakage to indirect leakage, which underscores the need for future studies addressing this risk.
[AI-73] Rethinking Link Prediction for Directed Graphs
链接: https://arxiv.org/abs/2502.05724
作者: Mingguo He,Yuhe Guo,Yanping Zheng,Zhewei Wei,Stephan Günnemann,Xiaokui Xiao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 30 pages
点击查看摘要
Abstract:Link prediction for directed graphs is a crucial task with diverse real-world applications. Recent advances in embedding methods and Graph Neural Networks (GNNs) have shown promising improvements. However, these methods often lack a thorough analysis of embedding expressiveness and suffer from ineffective benchmarks for a fair evaluation. In this paper, we propose a unified framework to assess the expressiveness of existing methods, highlighting the impact of dual embeddings and decoder design on performance. To address limitations in current experimental setups, we introduce DirLinkBench, a robust new benchmark with comprehensive coverage and standardized evaluation. The results show that current methods struggle to achieve strong performance on the new benchmark, while DiGAE outperforms others overall. We further revisit DiGAE theoretically, showing its graph convolution aligns with GCN on an undirected bipartite graph. Inspired by these insights, we propose a novel spectral directed graph auto-encoder SDGAE that achieves SOTA results on DirLinkBench. Finally, we analyze key factors influencing directed link prediction and highlight open challenges.
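The role of dual embeddings highlighted above is worth a concrete illustration: giving every node separate "source" and "target" vectors makes the edge score asymmetric, which a single undirected embedding cannot express. A minimal sketch, with illustrative names:

```python
import numpy as np

def directed_link_score(src_emb, dst_emb, u, v):
    """Score a directed edge u -> v with dual embeddings: each node has
    separate 'source' and 'target' vectors, so in general
    score(u, v) != score(v, u)."""
    return float(src_emb[u] @ dst_emb[v])

rng = np.random.default_rng(0)
n, d = 5, 8
src_emb = rng.normal(size=(n, d))  # node-as-source embeddings
dst_emb = rng.normal(size=(n, d))  # node-as-target embeddings
print(directed_link_score(src_emb, dst_emb, 0, 1),
      directed_link_score(src_emb, dst_emb, 1, 0))  # asymmetric scores
```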
[AI-74] Pareto-Optimality Smoothness and Stochasticity in Learning-Augmented One-Max-Search
链接: https://arxiv.org/abs/2502.05720
作者: Ziyad Benomar,Lorenzo Croissant,Vianney Perchet,Spyros Angelopoulos
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:One-max search is a classic problem in online decision-making, in which a trader acts on a sequence of revealed prices and accepts one of them irrevocably to maximise its profit. The problem has been studied both in probabilistic and in worst-case settings, notably through competitive analysis, and more recently in learning-augmented settings in which the trader has access to a prediction on the sequence. However, existing approaches either lack smoothness, or do not achieve optimal worst-case guarantees: they do not attain the best possible trade-off between the consistency and the robustness of the algorithm. We close this gap by presenting the first algorithm that simultaneously achieves both of these important objectives. Furthermore, we show how to leverage the obtained smoothness to provide an analysis of one-max search in stochastic learning-augmented settings which capture randomness in both the observed prices and the prediction.
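For intuition, the classic one-max search baseline is a single threshold rule, and a learning-augmented variant can bias that threshold toward a prediction. The sketch below shows only this classic baseline with an ad-hoc interpolation parameter `lam`; it is not the consistency-robustness-optimal algorithm the paper contributes.

```python
import math

def one_max_search(prices, m, M, prediction=None, lam=0.5):
    """Accept the first price >= threshold. With prices in [m, M],
    sqrt(m*M) is the classic worst-case optimal threshold; with a
    prediction we naively interpolate toward it (baseline sketch only)."""
    threshold = math.sqrt(m * M)
    if prediction is not None:
        threshold = (1 - lam) * threshold + lam * prediction
    for p in prices:
        if p >= threshold:
            return p          # irrevocably accept this price
    return prices[-1]         # otherwise forced to take the last price

print(one_max_search([1.2, 1.8, 3.5, 2.0], m=1.0, M=4.0, prediction=3.4))
```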
[AI-75] Extended Histogram-based Outlier Score (EHBOS)
链接: https://arxiv.org/abs/2502.05719
作者: Tanvir Islam
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Histogram-Based Outlier Score (HBOS) is a widely used outlier or anomaly detection method known for its computational efficiency and simplicity. However, its assumption of feature independence limits its ability to detect anomalies in datasets where interactions between features are critical. In this paper, we propose the Extended Histogram-Based Outlier Score (EHBOS), which enhances HBOS by incorporating two-dimensional histograms to capture dependencies between feature pairs. This extension allows EHBOS to identify contextual and dependency-driven anomalies that HBOS fails to detect. We evaluate EHBOS on 17 benchmark datasets, demonstrating its effectiveness and robustness across diverse anomaly detection scenarios. EHBOS outperforms HBOS on several datasets, particularly those where feature interactions are critical in defining the anomaly structure, achieving notable improvements in ROC AUC. These results highlight that EHBOS can be a valuable extension to HBOS, with the ability to model complex feature dependencies. EHBOS offers a powerful new tool for anomaly detection, particularly in datasets where contextual or relational anomalies play a significant role.
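The extension from 1D to 2D histograms is straightforward to prototype. A minimal sketch, assuming equal-width bins and a score that sums negative log densities over all feature pairs; the paper's binning and scoring details may differ.

```python
import numpy as np
from itertools import combinations

def ehbos_scores(X, bins=10):
    """EHBOS sketch: sum negative log densities over 2D histograms of
    all feature pairs, so pairwise dependencies contribute to the score.
    (Plain HBOS would use one 1D histogram per feature instead.)"""
    n, d = X.shape
    scores = np.zeros(n)
    for i, j in combinations(range(d), 2):
        hist, xe, ye = np.histogram2d(X[:, i], X[:, j], bins=bins, density=True)
        xi = np.clip(np.digitize(X[:, i], xe[1:-1]), 0, bins - 1)
        yi = np.clip(np.digitize(X[:, j], ye[1:-1]), 0, bins - 1)
        dens = hist[xi, yi] + 1e-12   # avoid log(0) in empty bins
        scores += -np.log(dens)
    return scores                      # higher = more anomalous

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[0] = [3, -3, 3]                      # plant a dependency-violating point
s = ehbos_scores(X)
print(s[0] > np.median(s))             # the planted point should score high
```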
[AI-76] Proving the Coding Interview: A Benchmark for Formally Verified Code Generation ICSE2025
链接: https://arxiv.org/abs/2502.05714
作者: Quinn Dougherty,Ronak Mehta
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 8 pages, to appear at the 2025 LLM4Code Workshop at ICSE 2025
点击查看摘要
Abstract:We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness, the largest formal verification benchmark, including 1083 curated and quality controlled samples. Previously, APPS provided a benchmark and dataset for programming puzzles to be completed in Python and checked against unit tests, of the kind seen in technical assessments in the software engineering industry. Building upon recent approaches for benchmarks in interactive theorem proving, we generalize the unit tests to Lean 4 theorems given without proof (i.e., using Lean’s “sorry” keyword). On the 406 theorems of 100 randomly selected samples, Sonnet correctly proves 30% and Gemini correctly proves 18%. We challenge the machine learning and program synthesis communities to solve both each general purpose programming problem and its associated correctness specifications. The benchmark is available at this https URL.
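To make the setup concrete, an FVAPPS-style sample pairs a program with a correctness theorem left unproven via `sorry`. A hypothetical miniature example in Lean 4 (not taken from the benchmark):

```lean
-- Hypothetical FVAPPS-style task: the function is given, and the
-- correctness theorem is stated with `sorry` for the model to prove.
def maxOfList : List Nat → Nat
  | []      => 0
  | x :: xs => max x (maxOfList xs)

theorem maxOfList_ge_mem (l : List Nat) (x : Nat) (h : x ∈ l) :
    x ≤ maxOfList l := by
  sorry
```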
[AI-77] Context information can be more important than reasoning for time series forecasting with a large language model
链接: https://arxiv.org/abs/2502.05699
作者: Janghoon Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the evolution of large language models (LLMs), there is growing interest in leveraging LLMs for time series tasks. In this paper, we explore the characteristics of LLMs for time series forecasting by considering various existing and proposed prompting techniques. Forecasting for both short and long time series was evaluated. Our findings indicate that no single prompting method is universally applicable. It was also observed that simply providing proper context information related to the time series, without additional reasoning prompts, can achieve performance comparable to the best-performing prompt for each case. From this observation, it is expected that providing proper context information can be more crucial than a prompt for specific reasoning in time series forecasting. Several weaknesses in prompting for time series forecasting were also identified. First, LLMs often fail to follow the procedures described by the prompt. Second, when reasoning steps involve simple algebraic calculations with several operands, LLMs often fail to calculate accurately. Third, LLMs sometimes misunderstand the semantics of prompts, resulting in incomplete responses.
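As a concrete illustration of "context over reasoning," a forecasting prompt can simply wrap the raw series in domain context and ask for numbers, with no chain-of-thought instructions. A hypothetical prompt builder, not from the paper:

```python
def build_forecast_prompt(series, context, horizon=3):
    """Wrap the raw series with domain context instead of reasoning
    instructions; the abstract suggests such context alone is often
    competitive with the best-performing reasoning prompt."""
    values = ", ".join(f"{v:.1f}" for v in series)
    return (
        f"Context: {context}\n"
        f"Observed values: {values}\n"
        f"Predict the next {horizon} values, comma-separated, no explanation."
    )

prompt = build_forecast_prompt(
    [12.1, 13.4, 15.0, 14.2, 16.8],
    context="Daily ice-cream sales (kg); weekends peak; summer data.",
)
print(prompt)  # send to any LLM API of your choice
```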
[AI-78] Managing Geological Uncertainty in Critical Mineral Supply Chains: A POMDP Approach with Application to U.S. Lithium Resources
链接: https://arxiv.org/abs/2502.05690
作者: Mansur Arief,Yasmine Alonso,CJ Oshiro,William Xu,Anthony Corso,David Zhen Yin,Jef K. Caers,Mykel J. Kochenderfer
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
*备注:
点击查看摘要
Abstract:The world is entering an unprecedented period of critical mineral demand, driven by the global transition to renewable energy technologies and electric vehicles. This transition presents unique challenges in mineral resource development, particularly due to geological uncertainty-a key characteristic that traditional supply chain optimization approaches do not adequately address. To tackle this challenge, we propose a novel application of Partially Observable Markov Decision Processes (POMDPs) that optimizes critical mineral sourcing decisions while explicitly accounting for the dynamic nature of geological uncertainty. Through a case study of the U.S. lithium supply chain, we demonstrate that POMDP-based policies achieve superior outcomes compared to traditional approaches, especially when initial reserve estimates are imperfect. Our framework provides quantitative insights for balancing domestic resource development with international supply diversification, offering policymakers a systematic approach to strategic decision-making in critical mineral supply chains.
[AI-79] Mobile Application Threats and Security
链接: https://arxiv.org/abs/2502.05685
作者: Timur Mirzoev,Mark Miller,Shamimara Lasker,Michael Brannon
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The movement to mobile computing solutions provides flexibility to different users, whether it is a business user, a student, or even providing entertainment to children and adults of all ages. Due to these emerging technologies, mobile users are unable to safeguard private information in a very effective way, and cybercrimes are increasing day by day. This manuscript will focus on security vulnerabilities in the mobile computing industry, especially focusing on tablets and smart phones. This study will dive into current security threats for the Android and Apple iOS markets, exposing security risks and threats that the novice or average user may not be aware of. The purpose of this study is to analyze current security risks and threats, and provide solutions that may be deployed to protect against such threats.
[AI-80] Machine Unlearning via Information Theoretic Regularization
链接: https://arxiv.org/abs/2502.05684
作者: Shizhou Xu,Thomas Strohmer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: 31 pages, 2 figures
点击查看摘要
Abstract:How can we effectively remove or “unlearn” undesirable information, such as specific features or individual data points, from a learning outcome while minimizing utility loss and ensuring rigorous guarantees? We introduce a mathematical framework based on information-theoretic regularization to address both feature and data point unlearning. For feature unlearning, we derive a unified solution that simultaneously optimizes diverse learning objectives, including entropy, conditional entropy, KL-divergence, and the energy of conditional probability. For data point unlearning, we first propose a novel definition that serves as a practical condition for unlearning via retraining, is easy to verify, and aligns with the principles of differential privacy from an inference perspective. Then, we provide provable guarantees for our framework on data point unlearning. By combining flexibility in learning objectives with simplicity in regularization design, our approach is highly adaptable and practical for a wide range of machine learning and AI applications.
[AI-81] Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs
链接: https://arxiv.org/abs/2502.05641
作者: Aayam Shrestha,Pan Liu,German Ros,Kai Yuan,Alan Fern
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This work focuses on generating realistic, physically-based human behaviors from multi-modal inputs, which may only partially specify the desired motion. For example, the input may come from a VR controller providing arm motion and body velocity, partial key-point animation, computer vision applied to videos, or even higher-level motion goals. This requires a versatile low-level humanoid controller that can handle such sparse, under-specified guidance, seamlessly switch between skills, and recover from failures. Current approaches for learning humanoid controllers from demonstration data capture some of these characteristics, but none achieve them all. To this end, we introduce the Masked Humanoid Controller (MHC), a novel approach that applies multi-objective imitation learning on augmented and selectively masked motion demonstrations. The training methodology results in an MHC that exhibits the key capabilities of catch-up to out-of-sync input commands, combining elements from multiple motion sequences, and completing unspecified parts of motions from sparse multimodal input. We demonstrate these key capabilities for an MHC learned over a dataset of 87 diverse skills and showcase different multi-modal use cases, including integration with planning frameworks to highlight MHC’s ability to solve new user-defined tasks without any finetuning.
[AI-82] Adversarial Machine Learning: Attacks Defenses and Open Challenges
链接: https://arxiv.org/abs/2502.05637
作者: Pranav K Jha
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Adversarial Machine Learning (AML) addresses vulnerabilities in AI systems where adversaries manipulate inputs or training data to degrade performance. This article provides a comprehensive analysis of evasion and poisoning attacks, formalizes defense mechanisms with mathematical rigor, and discusses the challenges of implementing robust solutions in adaptive threat models. Additionally, it highlights open challenges in certified robustness, scalability, and real-world deployment.
[AI-83] Amorphous Fortress Online: Collaboratively Designing Open-Ended Multi-Agent AI and Game Environments
链接: https://arxiv.org/abs/2502.05632
作者: M Charity,Mayu Wilson,Steven Lee,Dipika Rajesh,Sam Earle,Julian Togelius
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This work introduces Amorphous Fortress Online – a web-based platform where users can design petri-dish-like environments and games consisting of multi-agent AI characters. Users can play, create, and share artificial life and game environments made up of microscopic but transparent finite-state machine agents that interact with each other. The website features multiple interactive editors and accessible settings to view the multi-agent interactions directly from the browser. This system serves to provide a database of thematically diverse AI and game environments that use the emergent behaviors of simple AI agents.
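To give a flavor of the agents involved, a transparent finite-state machine agent can be just a transition table keyed by (state, event). A toy sketch, much simpler than the platform's actual node/edge format:

```python
class FSMAgent:
    """Minimal finite-state machine agent of the kind the platform
    composes: states map to behaviors, and observed events trigger
    transitions. Names and events here are illustrative."""
    def __init__(self, transitions, start="wander"):
        self.state = start
        self.transitions = transitions   # {(state, event): next_state}

    def step(self, event):
        # stay in the current state if no transition matches the event
        self.state = self.transitions.get((self.state, event), self.state)
        return self.state

agent = FSMAgent({
    ("wander", "sees_food"): "chase",
    ("chase", "ate_food"): "wander",
    ("wander", "sees_predator"): "flee",
    ("flee", "safe"): "wander",
})
for e in ["sees_food", "ate_food", "sees_predator", "safe"]:
    print(e, "->", agent.step(e))
```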
[AI-84] Closing the Responsibility Gap in AI-based Network Management: An Intelligent Audit System Approach
链接: https://arxiv.org/abs/2502.05608
作者: Emanuel Figetakis,Ahmed Refaey Hussein
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:Existing network paradigms have achieved lower downtime as well as a higher Quality of Experience (QoE) through the use of Artificial Intelligence (AI)-based network management tools. These AI management systems allow for automatic responses to changes in network conditions, lowering operation costs for operators, and improving overall performance. While adopting AI-based management tools enhances the overall network performance, it also introduces challenges such as removing human supervision, privacy violations, algorithmic bias, and model inaccuracies. Furthermore, AI-based agents that fail to address these challenges should be culpable themselves rather than the network as a whole. To address this accountability gap, a framework consisting of a Deep Reinforcement Learning (DRL) model and a Machine Learning (ML) model is proposed to identify and assign numerical values of responsibility to the AI-based management agents involved in any decision-making regarding the network conditions, which eventually affects the end-user. A simulation environment was created for the framework to be trained using simulated network operation parameters. The DRL model had a 96% accuracy during testing for identifying the AI-based management agents, while the ML model using gradient descent learned the network conditions at an 83% accuracy during testing.
[AI-85] Low-Rank Agent-Specific Adaptation (LoRASA) for Multi-Agent Policy Learning
链接: https://arxiv.org/abs/2502.05573
作者: Beining Zhang,Aditya Kapoor,Mingfei Sun
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 31 pages, 20 figures, 13 tables
点击查看摘要
Abstract:Multi-agent reinforcement learning (MARL) often relies on parameter sharing (PS) to scale efficiently. However, purely shared policies can stifle each agent’s unique specialization, reducing overall performance in heterogeneous environments. We propose Low-Rank Agent-Specific Adaptation (LoRASA), a novel approach that treats each agent’s policy as a specialized “task” fine-tuned from a shared backbone. Drawing inspiration from parameter-efficient transfer methods, LoRASA appends small, low-rank adaptation matrices to each layer of the shared policy, naturally inducing parameter-space sparsity that promotes both specialization and scalability. We evaluate LoRASA on challenging benchmarks including the StarCraft Multi-Agent Challenge (SMAC) and Multi-Agent MuJoCo (MAMuJoCo), implementing it atop widely used algorithms such as MAPPO and A2PO. Across diverse tasks, LoRASA matches or outperforms existing baselines while reducing memory and computational overhead. Ablation studies on adapter rank, placement, and timing validate the method’s flexibility and efficiency. Our results suggest LoRASA’s potential to establish a new norm for MARL policy parameterization: combining a shared foundation for coordination with low-rank agent-specific refinements for individual specialization.
[AI-86] TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
链接: https://arxiv.org/abs/2502.05564
作者: Jingang Qu,David Holzmüller,Gaël Varoquaux,Marine Le Morvan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The long-standing dominance of gradient-boosted decision trees on tabular data is currently challenged by tabular foundation models using In-Context Learning (ICL): setting the training data as context for the test data and predicting in a single forward pass without parameter updates. While the very recent TabPFNv2 foundation model (2025) excels on tables with up to 10K samples, its alternating column- and row-wise attentions make handling large training sets computationally prohibitive. So, can ICL be effectively scaled and deliver a benefit for larger tables? We introduce TabICL, a tabular foundation model for classification, pretrained on synthetic datasets with up to 60K samples and capable of handling 500K samples on affordable resources. This is enabled by a novel two-stage architecture: a column-then-row attention mechanism to build fixed-dimensional embeddings of rows, followed by a transformer for efficient ICL. Across 200 classification datasets from the TALENT benchmark, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On 56 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data.
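The column-then-row idea can be sketched in a few lines of PyTorch: attend over a row's cells to pool a fixed-dimensional row embedding, then run a transformer over the row sequence so labeled rows serve as in-context examples. Dimensions, pooling, and head counts below are arbitrary illustrations, not the TabICL architecture.

```python
import torch
import torch.nn as nn

class TwoStageICL(nn.Module):
    """Toy sketch of column-then-row attention: attend across a row's
    cells to get one fixed-dimensional row embedding, then run a
    transformer over rows for in-context learning."""
    def __init__(self, d=32, heads=4):
        super().__init__()
        self.cell_proj = nn.Linear(1, d)
        self.col_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.row_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(d, 1)

    def forward(self, X):                              # X: (rows, cols)
        cells = self.cell_proj(X.unsqueeze(-1))        # (rows, cols, d)
        attended, _ = self.col_attn(cells, cells, cells)
        rows = attended.mean(dim=1)                    # fixed-size row embeddings
        rows = self.row_encoder(rows.unsqueeze(0))     # ICL over the row sequence
        return self.head(rows).squeeze(0)              # one score per row

print(TwoStageICL()(torch.randn(16, 5)).shape)  # torch.Size([16, 1])
```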
[AI-87] Knowledge is Power: Harnessing Large Language Models for Enhanced Cognitive Diagnosis
链接: https://arxiv.org/abs/2502.05556
作者: Zhiang Dong,Jingyuan Chen,Fei Wu
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Cognitive Diagnosis Models (CDMs) are designed to assess students’ cognitive states by analyzing their performance across a series of exercises. However, existing CDMs often struggle with diagnosing infrequent students and exercises due to a lack of rich prior knowledge. With the advancement in large language models (LLMs), which possess extensive domain knowledge, their integration into cognitive diagnosis presents a promising opportunity. Despite this potential, integrating LLMs with CDMs poses significant challenges. LLMs are not well-suited for capturing the fine-grained collaborative interactions between students and exercises, and the disparity between the semantic space of LLMs and the behavioral space of CDMs hinders effective integration. To address these issues, we propose a novel Knowledge-enhanced Cognitive Diagnosis (KCD) framework, which is a model-agnostic framework utilizing LLMs to enhance CDMs and compatible with various CDM architectures. The KCD framework operates in two stages: LLM Diagnosis and Cognitive Level Alignment. In the LLM Diagnosis stage, both students and exercises are diagnosed to achieve comprehensive and detailed modeling. In the Cognitive Level Alignment stage, we bridge the gap between the CDMs’ behavioral space and the LLMs’ semantic space using contrastive learning and mask-reconstruction approaches. Experiments on several real-world datasets demonstrate the effectiveness of our proposed framework.
[AI-88] Dual Defense: Enhancing Privacy and Mitigating Poisoning Attacks in Federated Learning NEURIPS2024
链接: https://arxiv.org/abs/2502.05547
作者: Runhua Xu,Shiqi Gao,Chao Li,James Joshi,Jianxin Li
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: accepted by The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
点击查看摘要
Abstract:Federated learning (FL) is inherently susceptible to privacy breaches and poisoning attacks. To tackle these challenges, researchers have separately devised secure aggregation mechanisms to protect data privacy and robust aggregation methods that withstand poisoning attacks. However, simultaneously addressing both concerns is challenging; secure aggregation facilitates poisoning attacks as most anomaly detection techniques require access to unencrypted local model updates, which are obscured by secure aggregation. The few recent efforts to simultaneously tackle both challenges often depend on the impractical assumption of non-colluding two-server setups that disrupt FL’s topology, or on three-party computation, which introduces scalability issues, complicating deployment and application. To overcome this dilemma, this paper introduces a Dual Defense Federated learning (DDFed) framework. DDFed simultaneously boosts privacy protection and mitigates poisoning attacks, without introducing new participant roles or disrupting the existing FL topology. DDFed initially leverages cutting-edge fully homomorphic encryption (FHE) to securely aggregate model updates, without the impractical requirement for non-colluding two-server setups, and ensures strong privacy protection. Additionally, we propose a unique two-phase anomaly detection mechanism for encrypted model updates, featuring secure similarity computation and feedback-driven collaborative selection, with additional measures to prevent potential privacy breaches from Byzantine clients incorporated into the detection process. We conducted extensive experiments on various model poisoning attacks and FL scenarios, including both cross-device and cross-silo FL. Experiments on publicly available datasets demonstrate that DDFed successfully protects model privacy and effectively defends against model poisoning threats.
[AI-89] Sequential Stochastic Combinatorial Optimization Using Hierarchal Reinforcement Learning
链接: https://arxiv.org/abs/2502.05537
作者: Xinsong Feng,Zihan Yu,Yanhai Xiong,Haipeng Chen
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Reinforcement learning (RL) has emerged as a promising tool for combinatorial optimization (CO) problems due to its ability to learn fast, effective, and generalizable solutions. Nonetheless, existing works mostly focus on one-shot deterministic CO, while sequential stochastic CO (SSCO) has rarely been studied despite its broad applications such as adaptive influence maximization (IM) and infectious disease intervention. In this paper, we study the SSCO problem where we first decide the budget (e.g., number of seed nodes in adaptive IM) allocation for all time steps, and then select a set of nodes for each time step. The few existing studies on SSCO simplify the problems by assuming a uniformly distributed budget allocation over the time horizon, yielding suboptimal solutions. We propose a generic hierarchical RL (HRL) framework called wake-sleep option (WS-option), a two-layer option-based framework that simultaneously decides adaptive budget allocation on the higher layer and node selection on the lower layer. WS-option starts with a coherent formulation of the two-layer Markov decision processes (MDPs), capturing the interdependencies between the two layers of decisions. Building on this, WS-option employs several innovative designs to balance the model’s training stability and computational efficiency, preventing the vicious cyclic interference issue between the two layers. Empirical results show that WS-option exhibits significantly improved effectiveness and generalizability compared to traditional methods. Moreover, the learned model can be generalized to larger graphs, which significantly reduces the overhead of computational resources.
[AI-90] Towards Learning Scalable Agile Dynamic Motion Planning for Robosoccer Teams with Policy Optimization
链接: https://arxiv.org/abs/2502.05526
作者: Brandon Ho,Batuhan Altundas,Matthew Gombolay
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:In fast-paced, ever-changing environments, dynamic Motion Planning for Multi-Agent Systems in the presence of obstacles is a universal and unsolved problem. Be it path planning around obstacles, the movement of robotic arms, or planning the navigation of robot teams in settings such as Robosoccer, dynamic motion planning is needed to avoid collisions while reaching the targeted destination when multiple agents occupy the same area. In continuous domains where the world changes quickly, existing classical Motion Planning algorithms such as RRT* and A* become computationally expensive to rerun at every time step. Many variations of classical and well-formulated non-learning path-planning methods have been proposed to solve this universal problem but fall short due to their limitations in speed, smoothness, optimality, etc. Deep Learning models overcome these challenges due to their ability to adapt to varying environments based on past experience. However, current learning-based motion planning models use discretized environments, do not account for heterogeneous agents or replanning, and build on classical motion planners only to improve their efficiency, leading to issues with scalability. To prevent collisions between heterogeneous team members and collisions with obstacles while trying to reach the target location, we present a learning-based dynamic navigation model and show our model working in a simple environment based on the concept of a Robosoccer game.
[AI-91] IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
链接: https://arxiv.org/abs/2502.05512
作者: Wei Deng,Siyi Zhou,Jingchen Shu,Jinchao Wang,Lu Wang
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning ability. In this work, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models, and add some novel improvements. Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic characters and long-tail characters controllable. We also performed a comparative analysis of Vector Quantization (VQ) and Finite-Scalar Quantization (FSQ) for codebook utilization of acoustic speech tokens. To further enhance the effect and stability of voice cloning, we introduce a conformer-based speech conditional encoder and replace the speechcode decoder with BigVGAN2. Compared with XTTS, it has achieved significant improvements in naturalness, content consistency, and zero-shot voice cloning. Compared with popular open-source TTS systems such as Fish-Speech, CosyVoice2, FireRedTTS and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed. Moreover, its performance surpasses that of these systems. Our demos are available at this https URL.
[AI-92] Vision-Ultrasound Robotic System based on Deep Learning for Gas and Arc Hazard Detection in Manufacturing
链接: https://arxiv.org/abs/2502.05500
作者: Jin-Hee Lee,Dahyun Nam,Robin Inho Kee,YoungKey Kim,Seok-Jun Buu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Submitted to Engineering Applications of Artificial Intelligence
点击查看摘要
Abstract:Gas leaks and arc discharges present significant risks in industrial environments, requiring robust detection systems to ensure safety and operational efficiency. Inspired by human protocols that combine visual identification with acoustic verification, this study proposes a deep learning-based robotic system for autonomously detecting and classifying gas leaks and arc discharges in manufacturing settings. The system is designed to execute all experimental tasks entirely onboard the robot. Utilizing a 112-channel acoustic camera operating at a 96 kHz sampling rate to capture ultrasonic frequencies, the system processes real-world datasets recorded in diverse industrial scenarios. These datasets include multiple gas leak configurations (e.g., pinhole, open end) and partial discharge types (Corona, Surface, Floating) under varying environmental noise conditions. The proposed system integrates visual detection and a beamforming-enhanced acoustic analysis pipeline. Signals are transformed using STFT and refined through Gamma Correction, enabling robust feature extraction. An Inception-inspired CNN further classifies hazards, achieving 99% gas leak detection accuracy. The system not only detects individual hazard sources but also enhances classification reliability by fusing multi-modal data from both vision and acoustic sensors. When tested in reverberation and noise-augmented environments, the system outperformed conventional models by up to 44%p, with experimental tasks meticulously designed to ensure fairness and reproducibility. Additionally, the system is optimized for real-time deployment, maintaining an inference time of 2.1 seconds on a mobile robotic platform. By emulating human-like inspection protocols and integrating vision with acoustic modalities, this study presents an effective solution for industrial automation, significantly improving safety and operational reliability.
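The STFT-plus-Gamma-Correction front end described above is easy to sketch. The snippet below assumes a magnitude spectrogram normalized to [0, 1]; the window size and gamma value are guesses for illustration, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft

def gamma_corrected_spectrogram(signal, fs=96_000, gamma=0.3):
    """STFT magnitude followed by gamma correction, as in the described
    pipeline: compressing large magnitudes makes faint ultrasonic
    signatures (leaks, discharges) more prominent for a CNN."""
    _, _, Z = stft(signal, fs=fs, nperseg=1024)
    mag = np.abs(Z)
    mag = mag / (mag.max() + 1e-12)   # normalize magnitudes to [0, 1]
    return mag ** gamma               # gamma < 1 boosts low-energy bins

x = np.random.randn(96_000)           # 1 s of synthetic "sensor" audio
print(gamma_corrected_spectrogram(x).shape)
```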
[AI-93] Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations
链接: https://arxiv.org/abs/2502.05498
作者: Larkin Liu,Kashif Rasul,Yutong Chao,Jalal Etesami
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注: Stackelberg games. Manifold learning. Online learning
点击查看摘要
Abstract:We present a novel framework for online learning in Stackelberg general-sum games, where two agents, the leader and follower, engage in sequential turn-based interactions. At the core of this approach is a learned diffeomorphism that maps the joint action space to a smooth Riemannian manifold, referred to as the Stackelberg manifold. This mapping, facilitated by neural normalizing flows, ensures the formation of tractable isoplanar subspaces, enabling efficient techniques for online learning. By assuming linearity between the agents’ reward functions on the Stackelberg manifold, our construct allows the application of standard bandit algorithms. We then provide a rigorous theoretical basis for regret minimization on convex manifolds and establish finite-time bounds on simple regret for learning Stackelberg equilibria. This integration of manifold learning into game theory uncovers a previously unrecognized potential for neural normalizing flows as an effective tool for multi-agent learning. We present empirical results demonstrating the effectiveness of our approach compared to standard baselines, with applications spanning domains such as cybersecurity and economic supply chain optimization.
[AI-94] Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection
链接: https://arxiv.org/abs/2502.05494
作者: Ya Zhou,Yujie Yang,Jianhuang Gan,Xiangjie Li,Jing Yuan,Wei Zhao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: Under review in a journal
点击查看摘要
Abstract:Electrocardiogram (ECG) analysis is a fundamental tool for diagnosing cardiovascular conditions, yet anomaly detection in ECG signals remains challenging due to their inherent complexity and variability. We propose Multi-scale Masked Autoencoder for ECG anomaly detection (MMAE-ECG), a novel end-to-end framework that effectively captures both global and local dependencies in ECG data. Unlike state-of-the-art methods that rely on heartbeat segmentation or R-peak detection, MMAE-ECG eliminates the need for such pre-processing steps, enhancing its suitability for clinical deployment. MMAE-ECG partitions ECG signals into non-overlapping segments, with each segment assigned learnable positional embeddings. A novel multi-scale masking strategy and multi-scale attention mechanism, along with distinct positional embeddings, enable a lightweight Transformer encoder to effectively capture both local and global dependencies. The masked segments are then reconstructed using a single-layer Transformer block, with an aggregation strategy employed during inference to refine the outputs. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art approaches while significantly reducing computational complexity, requiring approximately 1/78 of the floating-point operations (FLOPs) needed for inference. Ablation studies further validate the effectiveness of each component, highlighting the potential of multi-scale masked autoencoders for anomaly detection.
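The multi-scale masking strategy can be illustrated with a toy mask generator: one random boolean mask per scale over the non-overlapping segments. The ratios and sampling below are assumptions for illustration; the actual MMAE-ECG strategy may differ.

```python
import numpy as np

def multiscale_mask(n_segments, ratios=(0.25, 0.5), seed=0):
    """Draw one boolean mask per masking ratio over non-overlapping ECG
    segments; True marks a segment hidden from the encoder."""
    rng = np.random.default_rng(seed)
    masks = []
    for r in ratios:
        k = int(round(r * n_segments))
        idx = rng.choice(n_segments, size=k, replace=False)
        m = np.zeros(n_segments, dtype=bool)
        m[idx] = True
        masks.append(m)
    return masks

for m in multiscale_mask(12):
    print(m.astype(int))   # one masking pattern per scale
```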
[AI-95] LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning
链接: https://arxiv.org/abs/2502.05453
作者: Hanqing Yang,Jingdi Chen,Marie Siew,Tania Lorido-Botran,Carlee Joe-Wong
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Developing intelligent agents for long-term cooperation in dynamic open-world scenarios is a major challenge in multi-agent systems. Traditional Multi-agent Reinforcement Learning (MARL) frameworks like centralized training decentralized execution (CTDE) struggle with scalability and flexibility. They require centralized long-term planning, which is difficult without custom reward functions, and face challenges in processing multi-modal data. CTDE approaches also assume fixed cooperation strategies, making them impractical in dynamic environments where agents need to adapt and plan independently. To address decentralized multi-agent cooperation, we propose Decentralized Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in a novel Multi-agent Crafter environment. Our generative agents, powered by Large Language Models (LLMs), are more scalable than traditional MARL agents by leveraging external knowledge and language for long-term planning and reasoning. Instead of fully sharing information from all past experiences, DAMCS introduces a multi-modal memory system organized as a hierarchical knowledge graph and a structured communication protocol to optimize agent cooperation. This allows agents to reason from past interactions and share relevant information efficiently. Experiments on novel multi-agent open-world tasks show that DAMCS outperforms both MARL and LLM baselines in task efficiency and collaboration. Compared to single-agent scenarios, the two-agent scenario achieves the same goal with 63% fewer steps, and the six-agent scenario with 74% fewer steps, highlighting the importance of adaptive memory and structured communication in achieving long-term goals. We publicly release our project at: this https URL.
[AI-96] ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy
链接: https://arxiv.org/abs/2502.05450
作者: Yuhui Chen,Shuai Tian,Shugao Liu,Yingting Zhou,Haoran Li,Dongbin Zhao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Vision-Language-Action (VLA) models have shown substantial potential in real-world robotic manipulation. However, fine-tuning these models through supervised learning struggles to achieve robust performance due to limited, inconsistent demonstrations, especially in contact-rich environments. In this paper, we propose a reinforced fine-tuning approach for VLA models, named ConRFT, which consists of offline and online fine-tuning with a unified consistency-based training objective, to address these challenges. In the offline stage, our method integrates behavior cloning and Q-learning to effectively extract policy from a small set of demonstrations and stabilize value estimating. In the online stage, the VLA model is further fine-tuned via consistency policy, with human interventions to ensure safe exploration and high sample efficiency. We evaluate our approach on eight diverse real-world manipulation tasks. It achieves an average success rate of 96.3% within 45-90 minutes of online fine-tuning, outperforming prior supervised methods with a 144% improvement in success rate and 1.9x shorter episode length. This work highlights the potential of integrating reinforcement learning to enhance the performance of VLA models for real-world robotic applications.
[AI-97] The Odyssey of the Fittest: Can Agents Survive and Still Be Good?
链接: https://arxiv.org/abs/2502.05442
作者: Dylan Waldner,Risto Miikkulainen
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Submitted to CogSci 2025. 9 Pages in this version, 6 + references in CogSci version
点击查看摘要
Abstract:As AI models grow in power and generality, understanding how agents learn and make decisions in complex environments is critical to promoting ethical behavior. This paper examines the ethical implications of implementing biological drives, specifically self-preservation, into three different agents. A Bayesian agent optimized with NEAT, a Bayesian agent optimized with stochastic variational inference, and a GPT-4o agent play a simulated, LLM-generated, text-based adventure game. The agents select actions at each scenario to survive, adapting to increasingly challenging scenarios. Post-simulation analysis evaluates the ethical scores of the agents’ decisions, uncovering the trade-offs they navigate to survive. Specifically, the analysis finds that when danger increases, agents ignore ethical considerations and opt for unethical behavior. The agents’ collective behavior, trading ethics for survival, suggests that prioritizing survival increases the risk of unethical behavior. In the context of AGI, designing agents to prioritize survival may amplify the likelihood of unethical decision-making and unintended emergent behaviors, raising fundamental questions about goal design in AI safety research.
[AI-98] APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding ICLR2025
链接: https://arxiv.org/abs/2502.05431
作者: Xinyu Yang,Tianqi Chen,Beidi Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2025
点击查看摘要
Abstract:Context-augmented generation (CAG) techniques, including RAG and ICL, require the efficient combination of multiple contexts to generate responses to user queries. Directly inputting these contexts as a sequence introduces a considerable computational burden by re-encoding the combined selection of contexts for every request. To address this, we explore the promising potential of parallel encoding to independently pre-compute and cache each context’s KV states. This approach enables the direct loading of cached states during inference while accommodating more contexts through position reuse across contexts. However, due to misalignments in attention distribution, directly applying parallel encoding results in a significant performance drop. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding (APE), which brings shared prefix, attention temperature, and scaling factor to align the distribution of parallel encoding with sequential encoding. Results on RAG and ICL tasks demonstrate that APE can preserve 98% and 93% of sequential encoding performance using the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE can achieve an end-to-end 4.5× speedup by reducing prefilling time 28× for a 128K-length context.
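The mechanics of parallel encoding are easy to sketch: pre-compute each context's (K, V) once, concatenate the caches at query time, and temper the attention logits. The `temperature` and `scale` knobs below merely stand in for APE's alignment terms; the paper's exact formulation differs.

```python
import torch
import torch.nn.functional as F

def attend_over_cached_contexts(q, cached_kvs, temperature=1.0, scale=1.0):
    """Sketch of parallel-encoding attention: each context's (K, V) was
    pre-computed independently and cached; at query time we concatenate
    the caches and apply a temperature to soften misaligned attention."""
    K = torch.cat([k for k, _ in cached_kvs], dim=0)   # (total_ctx_len, d)
    V = torch.cat([v for _, v in cached_kvs], dim=0)
    logits = scale * (q @ K.T) / (temperature * K.shape[-1] ** 0.5)
    return F.softmax(logits, dim=-1) @ V

d = 16
caches = [(torch.randn(8, d), torch.randn(8, d)) for _ in range(3)]  # 3 contexts
out = attend_over_cached_contexts(torch.randn(1, d), caches, temperature=1.2)
print(out.shape)  # torch.Size([1, 16])
```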
[AI-99] he Complexity of Learning Sparse Superposed Features with Feedback
链接: https://arxiv.org/abs/2502.05407
作者: Akash Kumar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 40 pages, 20 figures
点击查看摘要
Abstract:The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or components of a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent’s feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machine-trained models and dictionary extraction from sparse autoencoders trained on Large Language Models.
[AI-100] Probabilistic Foundations for Metacognition via Hybrid-AI AAAI
链接: https://arxiv.org/abs/2502.05398
作者: Paulo Shakarian,Gerardo I. Simari,Nathaniel D. Bastian
类目: Artificial Intelligence (cs.AI)
*备注: Accepted to AAAI-MAKE 2025
点击查看摘要
Abstract:Metacognition is the concept of reasoning about an agent’s own internal processes, and it has recently received renewed attention with respect to artificial intelligence (AI) and, more specifically, machine learning systems. This paper reviews a hybrid-AI approach known as “error detecting and correcting rules” (EDCR) that allows for the learning of rules to correct perceptual (e.g., neural) models. Additionally, we introduce a probabilistic framework that adds rigor to prior empirical studies, and we use this framework to prove results on necessary and sufficient conditions for metacognitive improvement, as well as limits to the approach. A set of future
[AI-101] fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving
链接: https://arxiv.org/abs/2502.05370
作者: Hanfei Yu,Xingqi Cui,Hong Zhang,Hao Wang,Hao Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have gained immense success in revolutionizing various applications, including content generation, search and recommendation, and AI-assisted operation. To reduce high training costs, Mixture-of-Experts (MoE) architecture has become a popular backbone for modern LLMs. However, despite the benefits, serving MoE-based LLMs experience severe memory inefficiency due to sparsely activated experts. Recent studies propose to offload inactive experts from GPU memory to CPU memory to improve the serving efficiency of MoE models. However, they either incur high inference latency or high model memory footprints due to coarse-grained designs. To tame the latency-memory trade-off in MoE serving, we present fMoE, a fine-grained expert offloading system for MoE serving that achieves low inference latency with memory efficiency. We design fMoE to extract fine-grained expert selection patterns from MoE models and semantic hints from input prompts to efficiently guide expert prefetching, caching, and offloading decisions. fMoE is prototyped on top of HuggingFace Transformers and deployed on a six-GPU testbed. Experiments with open-source MoE models and real-world workloads show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
[AI-102] ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks
链接: https://arxiv.org/abs/2502.05352
作者: Saurabh Jha(1),Rohan Arora(1),Yuji Watanabe(1),Takumi Yanagawa(1),Yinfang Chen(2),Jackson Clark(2),Bhavya Bhavya(1),Mudit Verma(1),Harshit Kumar(1),Hirokuni Kitahara(1),Noah Zheutlin(1),Saki Takano(1),Divya Pathak(1),Felix George(1),Xinbo Wu(2),Bekir O. Turkkan(1),Gerard Vanloo(1),Michael Nidd(1),Ting Dai(1),Oishik Chatterjee(1),Pranjal Gupta(1),Suranjana Samanta(1),Pooja Aggarwal(1),Rong Lee(1),Pavankumar Murali(1),Jae-wook Ahn(1),Debanjana Kar(1),Ameet Rahane(1),Carlos Fonseca(1),Amit Paradkar(1),Yu Deng(1),Pratibha Moogi(1),Prateeti Mohapatra(1),Naoki Abe(1),Chandrasekhar Narayanaswami(1),Tianyin Xu(2),Lav R. Varshney(2),Ruchi Mahindru(1),Anca Sailer(1),Laura Shwartz(1),Daby Sow(1),Nicholas C. M. Fuller(1),Ruchir Puri(1) ((1) IBM, (2) University of Illinois at Urbana-Champaign)
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.
[AI-103] Estimating Voltage Drop: Models Features and Data Representation Towards a Neural Surrogate
链接: https://arxiv.org/abs/2502.05345
作者: Yifei Jin,Dimitrios Koutlis,Hector Bandala,Marios Daoutis
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurate estimation of voltage drop (IR drop) in modern Application-Specific Integrated Circuits (ASICs) is highly time and resource demanding, due to the growing complexity and the transistor density in recent technology nodes. To mitigate this challenge, we investigate how Machine Learning (ML) techniques, including Extreme Gradient Boosting (XGBoost), Convolutional Neural Network (CNN), and Graph Neural Network (GNN) can aid in reducing the computational effort and implicitly the time required to estimate the IR drop in Integrated Circuits (ICs). Traditional methods, including commercial tools, require considerable time to produce accurate approximations, especially for complicated designs with numerous transistors. ML algorithms, on the other hand, are explored as an alternative solution to offer quick and precise IR drop estimation in considerably less time. Our approach leverages ASICs’ electrical, timing, and physical characteristics to train ML models, ensuring adaptability across diverse designs with minimal adjustments. Experimental results underscore the superiority of ML models over commercial tools, greatly enhancing prediction speed. Particularly, GNNs exhibit promising performance with minimal prediction errors in voltage drop estimation. The incorporation of GNNs marks a groundbreaking advancement in accurate IR drop prediction. This study illustrates the effectiveness of ML algorithms in precisely estimating IR drop and optimizing ASIC sign-off. Utilizing ML models leads to expedited predictions, reducing calculation time and improving energy efficiency, thereby reducing environmental impact through optimized power circuits.
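For a rough feel of the regression setup, the sketch below fits a gradient-boosted model on synthetic per-cell features and scores it on held-out cells. It substitutes scikit-learn's GradientBoostingRegressor for the XGBoost/CNN/GNN models in the paper, and every feature and coefficient is a made-up placeholder:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Hypothetical per-cell features: power density, toggle rate,
# distance to the nearest power strap, local cell density.
X = rng.random((2000, 4))
# Synthetic IR-drop target with a small noise term (volts).
y = 0.05 * X[:, 0] + 0.03 * X[:, 1] - 0.02 * X[:, 2] + rng.normal(0, 0.002, 2000)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X[:1500], y[:1500])                      # train on 1500 cells
pred = model.predict(X[1500:])                     # predict for held-out cells
print(f"MAE: {np.abs(pred - y[1500:]).mean():.4f} V")
```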
[AI-104] RAG-Verus: Repository-Level Program Verification with LLMs using Retrieval Augmented Generation
链接: https://arxiv.org/abs/2502.05344
作者: Sicheng Zhong,Jiading Zhu,Yifang Tian,Xujie Si
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Scaling automated formal verification to real-world projects requires resolving cross-module dependencies and global contexts, which are challenges overlooked by existing function-centric methods. We introduce RagVerus, a framework that synergizes retrieval-augmented generation with context-aware prompting to automate proof synthesis for multi-module repositories, achieving a 27% relative improvement on our novel RepoVBench benchmark – the first repository-level dataset for Verus with 383 proof completion tasks. RagVerus triples proof pass rates on existing benchmarks under constrained language model budgets, demonstrating scalable and sample-efficient verification.
[AI-105] Oracular Programming: A Modular Foundation for Building LLM -Enabled Software
链接: https://arxiv.org/abs/2502.05310
作者: Jonathan Laurent,André Platzer
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models have proved surprisingly effective at solving a wide range of tasks from just a handful of examples. However, their lack of reliability and modularity limits their capacity to tackle large problems that require many steps of reasoning. In response, researchers have proposed advanced pipelines that leverage domain-specific knowledge to chain smaller prompts, provide intermediate feedback and improve performance through search. However, the current complexity of writing, tuning, maintaining and improving such pipelines has limited their sophistication. We propose oracular programming, a foundational paradigm for building LLM-enabled applications that lets domain experts express high-level problem-solving strategies as programs with unresolved choice points. These choice points are resolved at runtime by LLMs, which generalize from user-provided examples of correct and incorrect decisions. An oracular program is composed of three orthogonal components: a strategy that consists in a nondeterministic program with choice points that can be reified into a search tree, a policy that specifies how to navigate this tree with the help of LLM oracles, and a set of demonstrations that describe successful and unsuccessful search tree navigation scenarios across diverse problem instances. Each component is expressed in a dedicated programming language and can be independently improved or substituted. We address the key programming language design challenges of modularly composing oracular programs and enforcing consistency between their components as they evolve.
[AI-106] Parameter Symmetry Breaking and Restoration Determines the Hierarchical Learning in AI Systems
链接: https://arxiv.org/abs/2502.05300
作者: Liu Ziyin,Yizhou Xu,Tomaso Poggio,Isaac Chuang
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The dynamics of learning in modern large AI systems is hierarchical, often characterized by abrupt, qualitative shifts akin to phase transitions observed in physical systems. While these phenomena hold promise for uncovering the mechanisms behind neural networks and language models, existing theories remain fragmented, addressing specific cases. In this paper, we posit that parameter symmetry breaking and restoration serve as a unifying mechanism underlying these behaviors. We synthesize prior observations and show how this mechanism explains three distinct hierarchies in neural networks: learning dynamics, model complexity, and representation formation. By connecting these hierarchies, we highlight symmetry – a cornerstone of theoretical physics – as a potential fundamental principle in modern AI.
[AI-107] Probabilistic Artificial Intelligence
链接: https://arxiv.org/abs/2502.05244
作者: Andreas Krause,Jonas Hübotter
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Artificial intelligence commonly refers to the science and engineering of artificial systems that can carry out tasks generally associated with requiring aspects of human intelligence, such as playing games, translating languages, and driving cars. In recent years, there have been exciting advances in learning-based, data-driven approaches towards AI, and machine learning and deep learning have enabled computer systems to perceive the world in unprecedented ways. Reinforcement learning has enabled breakthroughs in complex games such as Go and challenging robotics tasks such as quadrupedal locomotion. A key aspect of intelligence is to not only make predictions, but reason about the uncertainty in these predictions, and to consider this uncertainty when making decisions. This is what this manuscript on “Probabilistic Artificial Intelligence” is about. The first part covers probabilistic approaches to machine learning. We discuss the differentiation between “epistemic” uncertainty due to lack of data and “aleatoric” uncertainty, which is irreducible and stems, e.g., from noisy observations and outcomes. We discuss concrete approaches towards probabilistic inference and modern approaches to efficient approximate inference. The second part of the manuscript is about taking uncertainty into account in sequential decision tasks. We consider active learning and Bayesian optimization – approaches that collect data by proposing experiments that are informative for reducing the epistemic uncertainty. We then consider reinforcement learning and modern deep RL approaches that use neural network function approximation. We close by discussing modern approaches in model-based RL, which harness epistemic and aleatoric uncertainty to guide exploration, while also reasoning about safety.
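The "epistemic vs. aleatoric" split has a compact numerical illustration via the law of total variance, as used in deep-ensemble-style uncertainty estimation; the numbers below are synthetic and not taken from the manuscript:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "deep ensemble": M members, each predicting a mean and a noise
# variance for the same input.
M = 5
means = rng.normal(loc=2.0, scale=0.3, size=M)   # disagreement across members
variances = np.full(M, 0.5)                      # each member's noise estimate

aleatoric = variances.mean()   # irreducible observation noise
epistemic = means.var()        # model uncertainty from lack of data
total = aleatoric + epistemic  # law of total variance for the mixture
print(f"aleatoric={aleatoric:.3f}  epistemic={epistemic:.3f}  total={total:.3f}")
```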
[AI-108] PSM-SQL: Progressive Schema Learning with Multi-granularity Semantics for Text-to-SQL
链接: https://arxiv.org/abs/2502.05237
作者: Zhuopan Yang,Yuanzhen Xie,Ruichao Zhong,Yunzhi Tan,Enjie Liu,Zhenguo Yang,Mochi Gao,Bo Hu,Zang Li
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: 9 pages, 3 figures, submission in progress
点击查看摘要
Abstract:It is challenging to convert natural language (NL) questions into executable structured query language (SQL) queries for text-to-SQL tasks due to the vast number of database schemas with redundancy, which interferes with semantic learning, and the domain shift between NL and SQL. Existing works for schema linking focus on the table level and perform it once, ignoring the multi-granularity semantics and chainable cyclicity of schemas. In this paper, we propose a progressive schema linking with multi-granularity semantics (PSM-SQL) framework to reduce the redundant database schemas for text-to-SQL. Using the multi-granularity schema linking (MSL) module, PSM-SQL learns the schema semantics at the column, table, and database levels. More specifically, a triplet loss is used at the column level to learn embeddings, while fine-tuning LLMs is employed at the database level for schema reasoning. MSL employs classifier and similarity scores to model schema interactions for schema linking at the table level. In particular, PSM-SQL adopts a chain loop strategy to reduce the task difficulty of schema linking by continuously reducing the number of redundant schemas. Experiments conducted on text-to-SQL datasets show that the proposed PSM-SQL is 1-3 percentage points higher than the existing methods.
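The column-level triplet loss mentioned above can be sketched in a few lines of PyTorch; the random tensors stand in for real question and column encoder outputs:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: anchor = NL question, positive = a relevant column,
# negative = an irrelevant column (batch of 8, dimension 64).
anchor = torch.randn(8, 64, requires_grad=True)
positive = anchor.detach() + 0.1 * torch.randn(8, 64)  # near the anchor
negative = torch.randn(8, 64)                          # unrelated

loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
loss.backward()   # gradients would update the encoders in a real training loop
print(f"triplet loss: {loss.item():.3f}")
```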
[AI-109] Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
链接: https://arxiv.org/abs/2502.05236
作者: Shehzeen Hussain,Paarth Neekhara,Xuesong Yang,Edresson Casanova,Subhankar Ghosh,Mikyas T. Desta,Roy Fejgin,Rafael Valle,Jason Li
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned metrics, outperforms state-of-the-art TTS models, despite being trained on a significantly smaller dataset. Audio samples and demos are available on our website.
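Classifier-free guidance, one of the two ingredients above, boils down to extrapolating conditional logits away from unconditional ones at inference time. A generic sketch (not Koel-TTS code; the codebook size and guidance scale are arbitrary):

```python
import torch

def cfg_logits(cond_logits, uncond_logits, scale=1.5):
    # Classifier-free guidance: amplify the effect of the conditioning
    # (text + reference speaker audio) by extrapolating away from the
    # unconditional prediction.
    return uncond_logits + scale * (cond_logits - uncond_logits)

vocab = 1024                      # size of the acoustic token codebook
cond = torch.randn(1, vocab)      # logits given full conditioning
uncond = torch.randn(1, vocab)    # logits with conditioning dropped
next_token = cfg_logits(cond, uncond, scale=2.0).softmax(-1).argmax(-1)
print(next_token)
```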
[AI-110] Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers
链接: https://arxiv.org/abs/2502.05232
作者: Adam Stooke,Rohit Prabhavalkar,Khe Chai Sim,Pedro Moreno Mengibar
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the “Aligner-Encoder”. To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention – it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform “self-transduction”.
[AI-111] BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks
链接: https://arxiv.org/abs/2502.05225
作者: Hanyong Lee,Chaelyn Lee,Yongjae Lee,Jaesung Lee
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 18 pages, To appear in the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics 2025
点击查看摘要
Abstract:Phishing often targets victims through visually perturbed texts to bypass security systems. The noise contained in these texts functions as an adversarial attack, designed to deceive language models and hinder their ability to accurately interpret the content. However, since it is difficult to obtain sufficient phishing cases, previous studies have used synthetic datasets that do not contain real-world cases. In this study, we propose the BitAbuse dataset, which includes real-world phishing cases, to address the limitations of previous research. Our dataset comprises a total of 325,580 visually perturbed texts. The dataset inputs are drawn from the raw corpus, consisting of visually perturbed sentences and sentences generated through an artificial perturbation process. Each input sentence is labeled with its corresponding ground truth, representing the restored, non-perturbed version. Language models trained on our proposed dataset demonstrated significantly better performance compared to previous methods, achieving an accuracy of approximately 96%. Our analysis revealed a significant gap between real-world and synthetic examples, underscoring the value of our dataset for building reliable pre-trained models for restoration tasks. We release the BitAbuse dataset, which includes real-world phishing cases annotated with visual perturbations, to support future research in adversarial attack defense.
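The restoration task behind the dataset is mapping visually perturbed characters back to their originals. The paper trains language models for this; a toy dictionary normalizer conveys the idea, with a tiny hypothetical homoglyph table:

```python
# Hypothetical homoglyph table; real phishing text uses far larger maps.
HOMOGLYPHS = {"0": "o", "1": "l", "@": "a", "$": "s", "3": "e", "¡": "i"}

def restore(text: str) -> str:
    # Character-level normalization: map visually perturbed symbols back
    # to their likely ASCII originals.
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

print(restore("p@ypa1 ver¡fy y0ur acc0unt"))  # -> "paypal verify your account"
```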
[AI-112] A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluations
链接: https://arxiv.org/abs/2502.05224
作者: Yihe Zhou,Tao Ni,Wei-Bin Lee,Qingchuan Zhao
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved significantly advanced capabilities in understanding and generating human language text, which have gained increasing popularity over recent years. Apart from their state-of-the-art natural language processing (NLP) performance, considering their widespread usage in many industries, including medicine, finance, education, etc., security concerns over their usage grow simultaneously. In recent years, the evolution of backdoor attacks has progressed with the advancement of defense mechanisms against them and more well-developed features in the LLMs. In this paper, we adapt the general taxonomy for classifying machine learning attacks on one of the subdivisions - training-time white-box backdoor attacks. Besides systematically classifying attack methods, we also consider the corresponding defense methods against backdoor attacks. By providing an extensive summary of existing works, we hope this survey can serve as a guideline for inspiring future research that further extends the attack scenarios and creates a stronger defense against them for more robust LLMs.
[AI-113] Aero-LLM: A Distributed Framework for Secure UAV Communication and Intelligent Decision-Making
链接: https://arxiv.org/abs/2502.05220
作者: Balakrishnan Dharmalingam,Rajdeep Mukherjee,Brett Piggott,Guohuan Feng,Anyi Liu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: This manuscript was accepted by the 1st International Workshop on Integrated Sensing, Communication, and Computing in Internet of Things (IoT) Systems at the The 33rd International Conference on Computer Communications and Networks (ICCCN 2024)
点击查看摘要
Abstract:Increased utilization of unmanned aerial vehicles (UAVs) in critical operations necessitates secure and reliable communication with Ground Control Stations (GCS). This paper introduces Aero-LLM, a framework integrating multiple Large Language Models (LLMs) to enhance UAV mission security and operational efficiency. Unlike conventional singular LLMs, Aero-LLM leverages multiple specialized LLMs for various tasks, such as inferencing, anomaly detection, and forecasting, deployed across onboard systems, edge, and cloud servers. This dynamic, distributed architecture reduces performance bottlenecks and increases security capabilities. Aero-LLM’s evaluation demonstrates outstanding task-specific metrics and robust defense against cyber threats, significantly enhancing UAV decision-making, operational capabilities, and resilience against cyber attacks, setting a new standard for secure, intelligent UAV operations.
[AI-114] Enabling External Scrutiny of AI Systems with Privacy-Enhancing Technologies
链接: https://arxiv.org/abs/2502.05219
作者: Kendrea Beers,Helen Toner
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:This article describes how technical infrastructure developed by the nonprofit OpenMined enables external scrutiny of AI systems without compromising sensitive information. Independent external scrutiny of AI systems provides crucial transparency into AI development, so it should be an integral component of any approach to AI governance. In practice, external researchers have struggled to gain access to AI systems because of AI companies’ legitimate concerns about security, privacy, and intellectual property. But now, privacy-enhancing technologies (PETs) have reached a new level of maturity: end-to-end technical infrastructure developed by OpenMined combines several PETs into various setups that enable privacy-preserving audits of AI systems. We showcase two case studies where this infrastructure has been deployed in real-world governance scenarios: “Understanding Social Media Recommendation Algorithms with the Christchurch Call” and “Evaluating Frontier Models with the UK AI Safety Institute.” We describe types of scrutiny of AI systems that could be facilitated by current setups and OpenMined’s proposed future setups. We conclude that these innovative approaches deserve further exploration and support from the AI governance community. Interested policymakers can focus on empowering researchers on a legal level.
[AI-115] Watermarking across Modalities for Content Tracing and Generative AI
链接: https://arxiv.org/abs/2502.05215
作者: Pierre Fernandez
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: PhD thesis - webpage available at this https URL
点击查看摘要
Abstract:Watermarking embeds information into digital content like images, audio, or text, imperceptible to humans but robustly detectable by specific algorithms. This technology has important applications in many challenges of the industry such as content moderation, tracing AI-generated content, and monitoring the usage of AI models. The contributions of this thesis include the development of new watermarking techniques for images, audio, and text. We first introduce methods for active moderation of images on social platforms. We then develop specific techniques for AI-generated content. We specifically demonstrate methods to adapt latent generative models to embed watermarks in all generated content, identify watermarked sections in speech, and improve watermarking in large language models with tests that ensure low false positive rates. Furthermore, we explore the use of digital watermarking to detect model misuse, including the detection of watermarks in language models fine-tuned on watermarked text, and introduce training-free watermarks for the weights of large transformers. Through these contributions, the thesis provides effective solutions for the challenges posed by the increasing use of generative AI models and the need for model monitoring and content moderation. It finally examines the challenges and limitations of watermarking techniques and discusses potential future directions for research in this area.
[AI-116] DERMARK: A Dynamic Efficient and Robust Multi-bit Watermark for Large Language Models
链接: https://arxiv.org/abs/2502.05213
作者: Qihao Lin,Chen Tang,Lan Zhang,Junyang Zhang,Xiangyang Li
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 8 pages, 15 figures
点击查看摘要
Abstract:Well-trained large language models (LLMs) present significant risks, including potential malicious use and copyright infringement. Current studies aim to trace the distribution of LLM-generated texts by implicitly embedding watermarks. Among these, the single-bit watermarking method can only determine whether a given text was generated by an LLM. In contrast, the multi-bit watermarking method embeds richer information into the generated text, which can identify which LLM generated and distributed a given text to which user. However, existing efforts embed the multi-bit watermark directly into the generated text without accounting for its watermarking capacity. This approach can result in embedding failures when the text’s watermarking capacity is insufficient. In this paper, we derive the watermark embedding distribution based on the logits of LLMs and propose a formal inequality to segment the text optimally for watermark embedding. Building on this foundation, we propose DERMARK, a dynamic, efficient, and robust multi-bit watermarking method. DERMARK divides the text into segments of varying lengths for each bit embedding, adaptively matching the text’s capacity. It achieves this with negligible overhead and robust performance against text editing by minimizing watermark extraction loss. Comprehensive experiments demonstrate that, compared to the SOTA method, our method reduces the number of tokens required for embedding each bit by 20%, reduces watermark embedding time by 50%, and is robust to text editing and watermark erasure attacks.
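The adaptive-length segmentation can be caricatured greedily: accumulate a per-token capacity proxy (e.g., next-token entropy) and close a segment, carrying one bit, once a threshold is cleared. This illustrates only the principle, not the paper's formal inequality:

```python
def segment_by_capacity(token_entropies, threshold=2.0):
    """Close a segment once accumulated entropy (a proxy for watermark
    capacity) reaches the threshold; one bit per closed segment."""
    segments, current, acc = [], [], 0.0
    for i, h in enumerate(token_entropies):
        current.append(i)
        acc += h
        if acc >= threshold:
            segments.append(current)
            current, acc = [], 0.0
    if current:
        segments.append(current)   # trailing tokens, too weak to carry a bit
    return segments

entropies = [0.3, 0.9, 1.1, 0.2, 0.4, 1.6, 0.8, 0.7, 0.9]
print(segment_by_capacity(entropies))   # -> [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```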
[AI-117] Decoding FL Defenses: Systemization Pitfalls and Remedies
链接: https://arxiv.org/abs/2502.05211
作者: Momin Ahmad Khan,Virat Shejwalkar,Yasra Chandio,Amir Houmansadr,Fatima Muhammad Anwar
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:While the community has designed various defenses to counter the threat of poisoning attacks in Federated Learning (FL), there are no guidelines for evaluating these defenses. These defenses are prone to subtle pitfalls in their experimental setups that lead to a false sense of security, rendering them unsuitable for practical deployment. In this paper, we systematically understand, identify, and provide a better approach to address these challenges. First, we design a comprehensive systemization of FL defenses along three dimensions: i) how client updates are processed, ii) what the server knows, and iii) at what stage the defense is applied. Next, we thoroughly survey 50 top-tier defense papers and identify the commonly used components in their evaluation setups. Based on this survey, we uncover six distinct pitfalls and study their prevalence. For example, we discover that around 30% of these works solely use the intrinsically robust MNIST dataset, and 40% employ simplistic attacks, which may inadvertently portray their defense as robust. Using three representative defenses as case studies, we perform a critical reevaluation to study the impact of the identified pitfalls and show how they lead to incorrect conclusions about robustness. We provide actionable recommendations to help researchers overcome each pitfall.
[AI-118] Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
链接: https://arxiv.org/abs/2502.05209
作者: Zora Che,Stephen Casper,Robert Kirk,Anirudh Satheesh,Stewart Slocum,Lev E McKinney,Rohit Gandikota,Aidan Ewart,Domenic Rosati,Zichu Wu,Zikui Cai,Bilal Chughtai,Yarin Gal,Furong Huang,Dylan Hadfield-Menell
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model’s worst-possible-case behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations than input-space attacks alone. We release models at this https URL
[AI-119] Mitigation of Camouflaged Adversarial Attacks in Autonomous Vehicles–A Case Study Using CARLA Simulator
链接: https://arxiv.org/abs/2502.05208
作者: Yago Romano Martinez,Brady Carter,Abhijeet Solanki,Wesam Al Amiri,Syed Rafay Hasan,Terry N. Guo
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Autonomous vehicles (AVs) rely heavily on cameras and artificial intelligence (AI) to make safe and accurate driving decisions. However, since AI is the core enabling technology, this raises serious cyber threats that hinder the large-scale adoption of AVs. Therefore, it becomes crucial to analyze the resilience of AV security systems against sophisticated attacks that manipulate camera inputs, deceiving AI models. In this paper, we develop camera-camouflaged adversarial attacks targeting traffic sign recognition (TSR) in AVs. Specifically, the attack is initiated by modifying the texture of a stop sign to fool the AV’s object detection system, thereby affecting the AV’s actuators. The attack’s effectiveness is tested using the CARLA AV simulator and the results show that such an attack can delay the auto-braking response to the stop sign, resulting in potential safety issues. We conduct extensive experiments under various conditions, confirming that our new attack is effective and robust. Additionally, we address the attack by presenting mitigation strategies. The proposed attack and defense methods are applicable to other end-to-end trained autonomous cyber-physical systems.
[AI-120] Enhancing Team Diversity with Generative AI: A Novel Project Management Framework WWW
链接: https://arxiv.org/abs/2502.05181
作者: Johnny Chan,Yuming Li
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: A published version can be found from here - this https URL
点击查看摘要
Abstract:This research-in-progress paper presents a new project management framework that utilises GenAI technology. The framework is designed to address the common challenge of uniform team compositions in academic and research project teams, particularly in universities and research institutions. It does so by integrating sociologically identified patterns of successful team member personalities and roles, using GenAI agents to fill gaps in team dynamics. This approach adds an additional layer of analysis to conventional project management processes by evaluating team members’ personalities and roles and employing GenAI agents, fine-tuned on personality datasets, to fill specific team roles. Our initial experiments have shown improvements in the model’s ability to understand and process personality traits, suggesting the potential effectiveness of GenAI teammates in real-world project settings. This paper aims to explore the practical application of AI in enhancing team diversity and project management.
[AI-121] Is Prior-Free Black-Box Non-Stationary Reinforcement Learning Feasible?
链接: https://arxiv.org/abs/2410.13772
作者: Argyrios Gerogiannis,Yu-Han Huang,Venugopal V. Veeravalli
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Corrected minor typos in the proof of Theorem 2 on pages 25 and 26
点击查看摘要
Abstract:We study the problem of Non-Stationary Reinforcement Learning (NS-RL) without prior knowledge about the system’s non-stationarity. A state-of-the-art, black-box algorithm, known as MASTER, is considered, with a focus on identifying the conditions under which it can achieve its stated goals. Specifically, we prove that MASTER’s non-stationarity detection mechanism is not triggered for practical choices of horizon, leading to performance akin to a random restarting algorithm. Moreover, we show that the regret bound for MASTER, while being order optimal, stays above the worst-case linear regret until unreasonably large values of the horizon. To validate these observations, MASTER is tested for the special case of piecewise stationary multi-armed bandits, along with methods that employ random restarting, and others that use quickest change detection to restart. A simple, order optimal random restarting algorithm, that has prior knowledge of the non-stationarity is proposed as a baseline. The behavior of the MASTER algorithm is validated in simulations, and it is shown that methods employing quickest change detection are more robust and consistently outperform MASTER and other random restarting approaches.
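The order-optimal restarting baseline mentioned above is easy to state concretely for piecewise-stationary bandits: run UCB1 but periodically wipe its statistics so it can track the new best arm. A self-contained toy version with made-up parameters:

```python
import numpy as np

def restarting_ucb(means_before, means_after, change_t, horizon, restart_every):
    """UCB1 that forgets its statistics every `restart_every` steps,
    run on a two-phase piecewise-stationary Gaussian bandit."""
    rng = np.random.default_rng(0)
    K = len(means_before)
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, horizon + 1):
        if t % restart_every == 0:
            counts[:], sums[:] = 0.0, 0.0          # wipe the past
        if (counts == 0).any():
            arm = int(np.argmax(counts == 0))      # play each arm once first
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        mu = means_before[arm] if t < change_t else means_after[arm]
        r = rng.normal(mu, 0.1)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

# Best arm flips halfway through; restarts let UCB1 re-discover it.
print(restarting_ucb([0.9, 0.1], [0.1, 0.9],
                     change_t=5000, horizon=10000, restart_every=2000))
```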
[AI-122] Recent Advances in Discrete Speech Tokens: A Review
链接: https://arxiv.org/abs/2502.06490
作者: Yiwei Guo,Zhihan Li,Hankun Wang,Bohan Li,Chongtian Shao,Hanglei Zhang,Chenpeng Du,Xie Chen,Shujie Liu,Kai Yu
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Signal Processing (eess.SP)
*备注: 26 pages, 8 figures, 3 tables. Work in progress
点击查看摘要
Abstract:The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
[AI-123] WyckoffDiff - A Generative Diffusion Model for Crystal Symmetry
链接: https://arxiv.org/abs/2502.06485
作者: Filip Ekström Kelvinius,Oskar B. Andersson,Abhijith S. Parackal,Dong Qian,Rickard Armiento,Fredrik Lindsten
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fréchet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation.
[AI-124] Conditioning through indifference in quantum mechanics
链接: https://arxiv.org/abs/2502.06249
作者: Keano De Vos,Gert de Cooman
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Probability (math.PR)
*备注: 11 pages
点击查看摘要
Abstract:We can learn (more) about the state a quantum system is in through measurements. We look at how to describe the uncertainty about a quantum system’s state conditional on executing such measurements. We show that by exploiting the interplay between desirability, coherence and indifference, a general rule for conditioning can be derived. We then apply this rule to conditioning on measurement outcomes, and show how it generalises to conditioning on a set of measurement outcomes.
[AI-125] Post-detection inference for sequential changepoint localization
链接: https://arxiv.org/abs/2502.06096
作者: Aytijhya Saha,Aaditya Ramdas
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:This paper addresses a fundamental but largely unexplored challenge in sequential changepoint analysis: conducting inference following a detected change. We study the problem of localizing the changepoint using only the data observed up to a data-dependent stopping time at which a sequential detection algorithm \mathcal{A} declares a change. We first construct confidence sets for the unknown changepoint when pre- and post-change distributions are assumed to be known. We then extend our framework to composite pre- and post-change scenarios. We impose no conditions on the observation space or on \mathcal{A} – we only need to be able to run \mathcal{A} on simulated data sequences. In summary, this work offers both theoretically sound and practically effective tools for sequential changepoint localization.
[AI-126] Multi-modal Data Fusion and Deep Ensemble Learning for Accurate Crop Yield Prediction
链接: https://arxiv.org/abs/2502.06062
作者: Akshay Dagadu Yewle,Laman Mirzayeva,Oktay Karakuş
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
*备注: 28 pages, 7 figures and 5 tables
点击查看摘要
Abstract:This study introduces RicEns-Net, a novel Deep Ensemble model designed to predict crop yields by integrating diverse data sources through multimodal data fusion techniques. The research focuses specifically on the use of synthetic aperture radar (SAR), optical remote sensing data from Sentinel 1, 2, and 3 satellites, and meteorological measurements such as surface temperature and rainfall. The initial field data for the study were acquired through Ernst & Young’s (EY) Open Science Challenge 2023. The primary objective is to enhance the precision of crop yield prediction by developing a machine-learning framework capable of handling complex environmental data. A comprehensive data engineering process was employed to select the most informative features from over 100 potential predictors, reducing the set to 15 features from 5 distinct modalities. This step mitigates the “curse of dimensionality” and enhances model performance. The RicEns-Net architecture combines multiple machine learning algorithms in a deep ensemble framework, integrating the strengths of each technique to improve predictive accuracy. Experimental results demonstrate that RicEns-Net achieves a mean absolute error (MAE) of 341 kg/Ha (roughly 5-6% of the lowest average yield in the region), significantly exceeding the performance of previous state-of-the-art models, including those developed during the EY challenge.
[AI-127] On the Convergence and Stability of Upside-Down Reinforcement Learning Goal-Conditioned Supervised Learning and Online Decision Transformers
链接: https://arxiv.org/abs/2502.05672
作者: Miroslav Štrupl,Oleg Szehr,Francesco Faccio,Dylan R. Ashley,Rupesh Kumar Srivastava,Jürgen Schmidhuber
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
*备注: 85 pages in main text + 4 pages of references + 26 pages of appendices, 12 figures in main text + 2 figures in appendices; source code available at this https URL
点击查看摘要
Abstract:This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks, but their theoretical understanding is limited to specific environmental conditions. This work initiates a theoretical foundation for algorithms that build on the broad paradigm of approaching reinforcement learning through supervised learning or sequence modeling. At the core of this investigation lies the analysis of conditions on the underlying environment, under which the algorithms can identify optimal solutions. We also assess whether emerging solutions remain stable in situations where the environment is subject to tiny levels of noise. Specifically, we study the continuity and asymptotic convergence of command-conditioned policies, values and the goal-reaching objective depending on the transition kernel of the underlying Markov Decision Process. We demonstrate that near-optimal behavior is achieved if the transition kernel is located in a sufficiently small neighborhood of a deterministic kernel. The mentioned quantities are continuous (with respect to a specific topology) at deterministic kernels, both asymptotically and after a finite number of learning cycles. The developed methods allow us to present the first explicit estimates on the convergence and stability of policies and values in terms of the underlying transition kernels. On the theoretical side we introduce a number of new concepts to reinforcement learning, like working in segment spaces, studying continuity in quotient topologies and the application of the fixed-point theory of dynamical systems. The theoretical study is accompanied by a detailed investigation of example environments and numerical experiments.
[AI-128] Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
链接: https://arxiv.org/abs/2502.05435
作者: Manh Luong,Khai Nguyen,Dinh Phung,Gholamreza Haffari,Lizhen Qu
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages, 9 tables, 2 figures
点击查看摘要
Abstract:Teacher-forcing training for audio captioning usually leads to exposure bias due to training and inference mismatch. Prior works propose the contrastive method to deal with caption degeneration. However, the contrastive method ignores the temporal information when measuring similarity across acoustic and linguistic modalities, leading to inferior performance. In this work, we develop the temporal-similarity score by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel equipped with rotary positional embedding to account for temporal information across modalities. In contrast to the conventional sliced Wasserstein RBF kernel, we can form an unbiased estimation of USW-RBF kernel via Monte Carlo estimation. Therefore, it is well-suited to stochastic gradient optimization algorithms, and its approximation error decreases at a parametric rate of \mathcal{O}(L^{-1/2}) with L Monte Carlo samples. Additionally, we introduce an audio captioning framework based on the unbiased sliced Wasserstein kernel, incorporating stochastic decoding methods to mitigate caption degeneration during the generation process. We conduct extensive quantitative and qualitative experiments on two datasets, AudioCaps and Clotho, to illustrate the capability of generating high-quality audio captions. Experimental results show that our framework is able to increase caption length, lexical diversity, and text-to-audio self-retrieval accuracy.
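Stripped of the rotary positional embedding and the unbiasedness construction, a Monte Carlo sliced-Wasserstein RBF kernel between two equal-sized embedding sets looks roughly like this simplified sketch (not the paper's estimator):

```python
import numpy as np

def sliced_w2(x, y, n_proj=64, rng=None):
    """Monte Carlo sliced 2-Wasserstein distance between two point clouds
    x, y of shape (n, d); both clouds must have the same n here."""
    rng = rng or np.random.default_rng(0)
    thetas = rng.normal(size=(n_proj, x.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # random unit directions
    sw2 = 0.0
    for theta in thetas:
        px, py = np.sort(x @ theta), np.sort(y @ theta)      # 1-D projections
        sw2 += np.mean((px - py) ** 2)                        # closed-form 1-D W2^2
    return sw2 / n_proj

def sw_rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * sliced_w2(x, y))   # RBF on top of the sliced distance

a = np.random.default_rng(1).normal(size=(100, 16))   # e.g. audio embeddings
b = np.random.default_rng(2).normal(size=(100, 16))   # e.g. text embeddings
print(sw_rbf_kernel(a, b))
```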
[AI-129] Is attention all you need to solve the correlated electron problem?
链接: https://arxiv.org/abs/2502.05383
作者: Max Geier,Khachatur Nazaryan,Timothy Zaklama,Liang Fu
类目: Strongly Correlated Electrons (cond-mat.str-el); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI)
*备注: 10+5 pages, comments welcome
点击查看摘要
Abstract:The attention mechanism has transformed artificial intelligence research by its ability to learn relations between objects. In this work, we explore how a many-body wavefunction ansatz constructed from a large-parameter self-attention neural network can be used to solve the interacting electron problem in solids. By a systematic neural-network variational Monte Carlo study on a moiré quantum material, we demonstrate that the self-attention ansatz provides an accurate, efficient, and unbiased solution. Moreover, our numerical study finds that the required number of variational parameters scales roughly as N^2 with the number of electrons, which opens a path towards efficient large-scale simulations.
[AI-130] Quantum automated learning with provable and explainable trainability
链接: https://arxiv.org/abs/2502.05264
作者: Qi Ye,Shuangyue Geng,Zizhao Han,Weikang Li,L.-M. Duan,Dong-Ling Deng
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 21 pages, 7 figures
点击查看摘要
Abstract:Machine learning is widely believed to be one of the most promising practical applications of quantum computing. Existing quantum machine learning schemes typically employ a quantum-classical hybrid approach that relies crucially on gradients of model parameters. Such an approach lacks provable convergence to global minima and will become infeasible as quantum learning models scale up. Here, we introduce quantum automated learning, where no variational parameter is involved and the training process is converted to quantum state preparation. In particular, we encode training data into unitary operations and iteratively evolve a random initial state under these unitaries and their inverses, with a target-oriented perturbation towards higher prediction accuracy sandwiched in between. Under reasonable assumptions, we rigorously prove that the evolution converges exponentially to the desired state corresponding to the global minimum of the loss function. We show that such a training process can be understood from the perspective of preparing quantum states by imaginary time evolution, where the data-encoded unitaries together with target-oriented perturbations would train the quantum learning model in an automated fashion. We further prove that the quantum automated learning paradigm features good generalization ability with the generalization error upper bounded by the ratio between a logarithmic function of the Hilbert space dimension and the number of training samples. In addition, we carry out extensive numerical simulations on real-life images and quantum data to demonstrate the effectiveness of our approach and validate the assumptions. Our results establish an unconventional quantum learning strategy that is gradient-free with provable and explainable trainability, which would be crucial for large-scale practical applications of quantum computing in machine learning scenarios.
[AI-131] Thin ring wing as a means of flow improvement upstream of a propeller
链接: https://arxiv.org/abs/2502.05231
作者: Vladimir Sluchak
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
*备注: 7 pages, 7 figures, Propellers/Shafting 97 Symposium
点击查看摘要
Abstract:There are numerous devices currently known with the purpose of reducing the irregularity of the flow upstream of the propeller and to decrease by that means the propeller-induced vibration and noise. Many of these devices are wing-shaped vortex-generators that affect the flow with their induced (i.e. passive) longitudinal vortices. The paper’s subject is the use of a ring-shaped wing as a highly effective passive vortex-generator which allows to control the flow closer to the most charged sections of propeller blades. The problem of a thin ring-shaped wing with irregular (asymmetric) geometry in the irregular steady flow has been solved in a linear approach and the intensity of the induced longitudinal vortices as a function of the irregularity of the flow and the geometry of the ring wing has been estimated using that solution. Experiments in the towing tank showing good concordance with the theoretical model confirmed the effectiveness of such a device. Some additional advantages of a ring-shaped wing incorporated into the construction of stabilizers are considered.
[AI-132] DiffNMR2: NMR Guided Sampling Acquisition Through Diffusion Model Uncertainty
链接: https://arxiv.org/abs/2502.05230
作者: Etienne Goffinet,Sen Yan,Fabrizio Gabellieri,Laurence Jennings,Lydia Gkoura,Filippo Castiglione,Ryan Young,Idir Malki,Ankita Singh,Thomas Launey
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注: 11 pages, 10 figures
点击查看摘要
Abstract:Nuclear Magnetic Resonance (NMR) spectrometry uses electro-frequency pulses to probe the resonance of a compound’s nucleus, which is then analyzed to determine its structure. The acquisition time of high-resolution NMR spectra remains a significant bottleneck, especially for complex biological samples such as proteins. In this study, we propose a novel and efficient sub-sampling strategy based on a diffusion model trained on protein NMR data. Our method iteratively reconstructs under-sampled spectra while using model uncertainty to guide subsequent sampling, significantly reducing acquisition time. Compared to state-of-the-art strategies, our approach improves reconstruction accuracy by 52.9%, reduces hallucinated peaks by 55.6%, and requires 60% less time in complex NMR experiments. This advancement holds promise for many applications, from drug discovery to materials science, where rapid and high-resolution spectral analysis is critical.
[AI-133] Multi-Objective Mobile Damped Wave Algorithm (MOMDWA): A Novel Approach For Quantum System Control
链接: https://arxiv.org/abs/2502.05228
作者: Juntao Yu,Jiaquan Yu,Dedai Wei,Xinye Sha,Shengwei Fu,Miuyu Qiu,Yurun Jin,Kaichen Ouyang
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:In this paper, we introduce a novel multi-objective optimization algorithm, the Multi-Objective Mobile Damped Wave Algorithm (MOMDWA), specifically designed to address complex quantum control problems. Our approach extends the capabilities of the original Mobile Damped Wave Algorithm (MDWA) by incorporating multiple objectives, enabling a more comprehensive optimization process. We applied MOMDWA to three quantum control scenarios, focusing on optimizing the balance between control fidelity, energy consumption, and control smoothness. The results demonstrate that MOMDWA significantly enhances quantum control efficiency and robustness, achieving high fidelity while minimizing energy use and ensuring smooth control pulses. This advancement offers a valuable tool for quantum computing and other domains requiring precise, multi-objective control.
[AI-134] Blackout DIFUSCO
链接: https://arxiv.org/abs/2502.05221
作者: Jun Pyo Seo
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
*备注: 12 pages
点击查看摘要
Abstract:This study explores the integration of Blackout Diffusion into the DIFUSCO framework for combinatorial optimization, specifically targeting the Traveling Salesman Problem (TSP). Inspired by the success of discrete-time diffusion models (D3PM) in maintaining structural integrity, we extend the paradigm to a continuous-time framework, leveraging the unique properties of Blackout Diffusion. Continuous-time modeling introduces smoother transitions and refined control, hypothesizing enhanced solution quality over traditional discrete methods. We propose three key improvements to enhance the diffusion process. First, we transition from a discrete-time-based model to a continuous-time framework, providing a more refined and flexible formulation. Second, we refine the observation time scheduling to ensure a smooth and linear transformation throughout the diffusion process, allowing for a more natural progression of states. Finally, building upon the second improvement, we further enhance the reverse process by introducing finer time slices in regions that are particularly challenging for the model, thereby improving accuracy and stability in the reconstruction phase. Although the experimental results did not exceed the baseline performance, they demonstrate the effectiveness of these methods in balancing simplicity and complexity, offering new insights into diffusion-based combinatorial optimization. This work represents the first application of Blackout Diffusion to combinatorial optimization, providing a foundation for further advancements in this domain. * The code is available for review at this https URL.
[AI-135] FactorGCL: A Hypergraph-Based Factor Model with Temporal Residual Contrastive Learning for Stock Returns Prediction
链接: https://arxiv.org/abs/2502.05218
作者: Yitong Duan,Weiran Wang,Jian Li
类目: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As a fundamental method in economics and finance, the factor model has been extensively utilized in quantitative investment. In recent years, there has been a paradigm shift from traditional linear models with expert-designed factors to more flexible nonlinear machine learning-based models with data-driven factors, aiming to enhance the effectiveness of these factor models. However, due to the low signal-to-noise ratio in market data, mining effective factors in data-driven models remains challenging. In this work, we propose a hypergraph-based factor model with temporal residual contrastive learning (FactorGCL) that employs a hypergraph structure to better capture high-order nonlinear relationships among stock returns and factors. To mine hidden factors that supplement human-designed prior factors for predicting stock returns, we design a cascading residual hypergraph architecture, in which the hidden factors are extracted from the residual information after removing the influence of prior factors. Additionally, we propose a temporal residual contrastive learning method to guide the extraction of effective and comprehensive hidden factors by contrasting stock-specific residual information over different time periods. Our extensive experiments on real stock market data demonstrate that FactorGCL not only outperforms existing state-of-the-art methods but also mines effective hidden factors for predicting stock returns.
[AI-136] Multimodal Stock Price Prediction
链接: https://arxiv.org/abs/2502.05186
作者: Furkan Karadaş,Bahaeddin Eravcı,Ahmet Murat Özbayoğlu
类目: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 6 table
点击查看摘要
Abstract:In an era where financial markets are heavily influenced by many static and dynamic factors, it has become increasingly critical to carefully integrate diverse data sources with machine learning for accurate stock price prediction. This paper explores a multimodal machine learning approach for stock price prediction by combining data from diverse sources, including traditional financial metrics, tweets, and news articles. We capture real-time market dynamics and investor mood through sentiment analysis on these textual data using both ChatGPT-4o and FinBERT models. We look at how these integrated data streams augment predictions made with a standard Long Short-Term Memory (LSTM) model to illustrate the extent of performance gains. Our study’s results indicate that incorporating the mentioned data sources considerably increases the forecast effectiveness of the reference model by up to 5%. We also provide insights into the individual and combined predictive capacities of these modalities, highlighting the substantial impact of incorporating sentiment analysis from tweets and news articles. This research offers a systematic and effective framework for applying multimodal data analytics techniques in financial time series forecasting that provides a new view for investors to leverage data for decision-making.
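A minimal version of this fusion concatenates per-day price features with per-day sentiment scores before an LSTM; the dimensions and random data below are placeholders, not the paper's pipeline:

```python
import torch
import torch.nn as nn

class FusionLSTM(nn.Module):
    """Minimal sketch: concatenate price features with daily sentiment
    scores (e.g., from FinBERT) and regress the next-day close."""
    def __init__(self, n_price=5, n_sent=2, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_price + n_sent, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, price_seq, sent_seq):
        x = torch.cat([price_seq, sent_seq], dim=-1)  # fuse modalities per day
        out, _ = self.lstm(x)
        return self.head(out[:, -1])                  # predict from the last step

model = FusionLSTM()
price = torch.randn(4, 30, 5)    # batch of 30-day OHLCV windows
sent = torch.randn(4, 30, 2)     # e.g. tweet and news sentiment per day
print(model(price, sent).shape)  # torch.Size([4, 1])
```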
机器学习
[LG-0] DeepCrossAttention: Supercharging Transformer Residual Connections
链接: https://arxiv.org/abs/2502.06785
作者: Mike Heddes,Adel Javanmard,Kyriakos Axiotis,Gang Fu,MohammadHossein Bateni,Vahab Mirrokni
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters. Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.
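The first ingredient, input-dependent weights over previous layer outputs, can be sketched as a small gating module; this simplified module omits the depth-wise cross-attention that DCA additionally uses:

```python
import torch
import torch.nn as nn

class LearnedResidualMix(nn.Module):
    """Instead of summing all previous layer outputs equally, compute
    input-dependent mixing weights over them (simplified sketch)."""
    def __init__(self, d_model, n_prev):
        super().__init__()
        self.gate = nn.Linear(d_model, n_prev)

    def forward(self, prev_outputs):
        # prev_outputs: list of (batch, seq, d_model) tensors from earlier layers
        stacked = torch.stack(prev_outputs, dim=-2)          # (B, S, L, D)
        w = self.gate(prev_outputs[-1]).softmax(-1)          # (B, S, L), input-dependent
        return (w.unsqueeze(-1) * stacked).sum(dim=-2)       # weighted combination

mix = LearnedResidualMix(d_model=64, n_prev=3)
outs = [torch.randn(2, 10, 64) for _ in range(3)]
print(mix(outs).shape)  # torch.Size([2, 10, 64])
```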
[LG-1] Enhancing Performance of Explainable AI Models with Constrained Concept Refinement
链接: https://arxiv.org/abs/2502.06775
作者: Geyu Liang,Senne Michielssen,Salar Fattahi
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The trade-off between accuracy and interpretability has long been a challenge in machine learning (ML). This tension is particularly significant for emerging interpretable-by-design methods, which aim to redesign ML algorithms for trustworthy interpretability but often sacrifice accuracy in the process. In this paper, we address this gap by investigating the impact of deviations in concept representations-an essential component of interpretable models-on prediction performance and propose a novel framework to mitigate these effects. The framework builds on the principle of optimizing concept embeddings under constraints that preserve interpretability. Using a generative model as a test-bed, we rigorously prove that our algorithm achieves zero loss while progressively enhancing the interpretability of the resulting model. Additionally, we evaluate the practical performance of our proposed framework in generating explainable predictions for image classification tasks across various benchmarks. Compared to existing explainable methods, our approach not only improves prediction accuracy while preserving model interpretability across various large-scale benchmarks but also achieves this with significantly lower computational cost.
[LG-2] ENFORCE: Exact Nonlinear Constrained Learning with Adaptive-depth Neural Projection
链接: https://arxiv.org/abs/2502.06774
作者: Giacomo Lastrucci,Artur M. Schweidtmann
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Ensuring neural networks adhere to domain-specific constraints is crucial for addressing safety and ethical concerns while also enhancing prediction accuracy. Despite the nonlinear nature of most real-world tasks, existing methods are predominantly limited to affine or convex constraints. We introduce ENFORCE, a neural network architecture that guarantees predictions to satisfy nonlinear constraints exactly. ENFORCE is trained with standard unconstrained gradient-based optimizers (e.g., Adam) and leverages autodifferentiation and local neural projections to enforce any \mathcal{C}^1 constraint to arbitrary tolerance \epsilon . We build an adaptive-depth neural projection (AdaNP) module that dynamically adjusts its complexity to suit the specific problem and the required tolerance levels. ENFORCE guarantees satisfaction of equality constraints that are nonlinear in both inputs and outputs of the neural network with minimal (and adjustable) computational cost.
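A rough way to picture the projection idea is an iterated local linearization of the constraint until a tolerance is met. The sketch below uses a Gauss-Newton correction with autograd Jacobians; the interface and stopping rule are our assumptions, not the AdaNP module itself.

```python
import torch

def project_onto_constraint(y, h, eps=1e-6, max_steps=50):
    """Iteratively correct y until |h(y)| < eps, via local linearizations.

    y: (n,) raw prediction; h: callable returning (m,) constraint residuals.
    """
    y = y.clone()
    for _ in range(max_steps):
        r = h(y)
        if r.abs().max() < eps:
            break
        J = torch.autograd.functional.jacobian(h, y)   # (m, n)
        # Minimum-norm (Gauss-Newton) correction: y <- y - J^T (J J^T)^{-1} r
        y = y - J.T @ torch.linalg.solve(J @ J.T, r)
    return y

# Example: enforce a nonlinear equality y0^2 + y1^2 = 1 on a raw output
h = lambda y: (y[0] ** 2 + y[1] ** 2 - 1).unsqueeze(0)
y_proj = project_onto_constraint(torch.tensor([1.3, 0.4]), h)
```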
[LG-3] Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
链接: https://arxiv.org/abs/2502.06768
作者: Jaeyeon Kim,Kulin Shah,Vasilis Kontonis,Sham Kakade,Sitan Chen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from 7% to \approx 90%, even outperforming ARMs with 7\times as many parameters that were explicitly trained via teacher forcing to learn the right order of decoding.
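A minimal version of adaptive inference can be written as a confidence-greedy loop: at each step, the model scores every masked position and commits the one it is most sure about. `model` and `mask_id` below are placeholders, and the paper's actual order-selection strategy may differ.

```python
import torch

@torch.no_grad()
def adaptive_decode(model, tokens, mask_id):
    """Fill masked positions one at a time, easiest (most confident) first."""
    tokens = tokens.clone()                      # 1-D LongTensor of token ids
    while (tokens == mask_id).any():
        logits = model(tokens)                   # assumed shape: (T, vocab)
        probs = torch.softmax(logits, dim=-1)
        conf, choice = probs.max(dim=-1)         # per-position confidence and argmax
        conf[tokens != mask_id] = -1.0           # only consider still-masked slots
        pos = conf.argmax()                      # commit the most confident position
        tokens[pos] = choice[pos]
    return tokens
```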
[LG-4] When, Where and Why to Average Weights?
链接: https://arxiv.org/abs/2502.06761
作者: Niccolò Ajroldi,Antonio Orvieto,Jonas Geiping
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf \citep{dahl_benchmarking_2023}, a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while mildly improving generalization across all considered workloads. Finally, we explore the relationship between averaging and learning rate annealing and show how to optimally combine the two to achieve the best performances.
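For reference, checkpoint averaging itself is only a few lines; the sketch below keeps a running mean over saved state dicts so at most one checkpoint is held in memory at a time. File names and the uniform-mean scheme are illustrative, not the specific variants benchmarked in the paper.

```python
import torch

def average_checkpoints(paths):
    """Uniformly average the parameters of several saved checkpoints."""
    avg = None
    for i, p in enumerate(paths, start=1):
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k in avg:  # running mean: avg += (x_i - avg) / i
                avg[k] += (state[k].float() - avg[k]) / i
    return avg

# model.load_state_dict(average_checkpoints(["ckpt_10.pt", "ckpt_20.pt", "ckpt_30.pt"]))
```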
[LG-5] Incentivizing Desirable Effort Profiles in Strategic Classification: The Role of Causality and Uncertainty
链接: https://arxiv.org/abs/2502.06749
作者: Valia Efthymiou,Chara Podimata,Diptangshu Sen,Juba Ziani
类目: Computer Science and Game Theory (cs.GT); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study strategic classification in binary decision-making settings where agents can modify their features in order to improve their classification outcomes. Importantly, our work considers the causal structure across different features, acknowledging that effort in a given feature may affect other features. The main goal of our work is to understand \emph{when} and how much agent effort is invested towards desirable features, and how this is influenced by the deployed classifier, the causal structure of the agent’s features, their ability to modify them, and the information available to the agent about the classifier and the feature causal graph. In the complete information case, when agents know the classifier and the causal structure of the problem, we derive conditions ensuring that rational agents focus on features favored by the principal. We show that designing classifiers to induce desirable behavior is generally non-convex, though tractable in special cases. We also extend our analysis to settings where agents have incomplete information about the classifier or the causal graph. While optimal effort selection is again a non-convex problem under general uncertainty, we highlight special cases of partial uncertainty where this selection problem becomes tractable. Our results indicate that uncertainty drives agents to favor features with higher expected importance and lower variance, potentially misaligning with principal preferences. Finally, numerical experiments based on a cardiovascular disease risk study illustrate how to incentivize desirable modifications under uncertainty.
[LG-6] A note on the physical interpretation of neural PDEs
链接: https://arxiv.org/abs/2502.06739
作者: Sauro Succi
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computational Physics (physics.comp-ph)
*备注: 12 pages
点击查看摘要
Abstract:We highlight a formal and substantial analogy between Machine Learning (ML) algorithms and discrete dynamical systems (DDS) in relaxation form. The analogy offers a transparent interpretation of the weights in terms of physical information-propagation processes and identifies the model function of the forward ML step with the local attractor of the corresponding discrete dynamics. Besides improving the explainability of current ML applications, this analogy may also facilitate the development of a new class of ML algorithms with a reduced number of weights.
[LG-7] Resurrecting saturated LLM benchmarks with adversarial encoding
链接: https://arxiv.org/abs/2502.06738
作者: Igor Ivanov,Dmitrii Volkov
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent work showed that small changes in benchmark questions can reduce LLMs’ reasoning and recall. We explore two such changes: pairing questions and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models, these changes predictably reduce performance, effectively restoring headroom to a benchmark and unsaturating it again. We suggest this approach can resurrect old benchmarks.
[LG-8] VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
链接: https://arxiv.org/abs/2502.06737
作者: Thomas Zeng,Shuibai Zhang,Shutong Wu,Christian Classen,Daewon Chae,Ethan Ewer,Minjae Lee,Heeju Kim,Wonjun Kang,Jackson Kunde,Ying Fan,Jungtaek Kim,Hyung Il Koo,Kannan Ramchandran,Dimitris Papailiopoulos,Kangwook Lee
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM’s gain of 1.3%. We further contribute to the community by open-sourcing all data, code and models for VersaPRM.
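Weighted majority voting with a PRM reduces to scoring each sampled solution and accumulating votes per final answer. The sketch below uses the minimum per-step score as the weight, which is one common PRM aggregation choice and an assumption on our part, not necessarily the paper's.

```python
from collections import defaultdict

def weighted_majority(candidates):
    """candidates: list of (final_answer, [per-step PRM scores in [0, 1]])."""
    votes = defaultdict(float)
    for answer, step_scores in candidates:
        weight = min(step_scores)        # weight a solution by its weakest step
        votes[answer] += weight
    return max(votes, key=votes.get)

best = weighted_majority([
    ("42", [0.90, 0.80, 0.95]),
    ("41", [0.40, 0.70, 0.20]),
    ("42", [0.85, 0.90, 0.80]),
])  # -> "42"
```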
[LG-9] RSAttAE: An Information-Aware Attention-based Autoencoder Recommender System
链接: https://arxiv.org/abs/2502.06705
作者: Amirhossein Dadashzadeh Taromi,Sina Heydari,Mohsen Hooshmand,Majid Ramezani
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 6 pages, 4 figures
点击查看摘要
Abstract:Recommender systems play a crucial role in modern life, including information retrieval, the pharmaceutical industry, retail, and entertainment. The entertainment sector, in particular, attracts significant attention and generates substantial profits. This work proposes a new method for predicting unknown user-movie ratings to enhance customer satisfaction. To achieve this, we utilize the MovieLens 100K dataset. Our approach introduces an attention-based autoencoder to create meaningful representations and the XGBoost method for rating predictions. The results demonstrate that our proposal outperforms most of the existing state-of-the-art methods. Availability: this http URL
[LG-10] FairDropout: Using Example-Tied Dropout to Enhance Generalization of Minority Groups
链接: https://arxiv.org/abs/2502.06695
作者: Geraldin Nanfack,Eugene Belilovsky
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep learning models frequently exploit spurious features in training data to achieve low training error, often resulting in poor generalization when faced with shifted testing distributions. To address this issue, various methods from imbalanced learning, representation learning, and classifier recalibration have been proposed to enhance the robustness of deep neural networks against spurious correlations. In this paper, we observe that models trained with empirical risk minimization tend to generalize well for examples from the majority groups while memorizing instances from minority groups. Building on recent findings that show memorization can be localized to a limited number of neurons, we apply example-tied dropout as a method we term FairDropout, aimed at redirecting this memorization to specific neurons that we subsequently drop out during inference. We empirically evaluate FairDropout using the subpopulation benchmark suite encompassing vision, language, and healthcare tasks, demonstrating that it significantly reduces reliance on spurious correlations, and outperforms state-of-the-art methods.
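As we read it, the core mechanism can be approximated by reserving a block of units that are free to absorb memorization during training and silencing exactly that block at inference. The split ratio and wiring below are illustrative simplifications; the actual method ties the dropped units to individual examples.

```python
import torch
import torch.nn as nn

class MemorizationGatedLinear(nn.Module):
    """Sketch: designate units to soak up memorization, then mute them at test time."""
    def __init__(self, d_in, d_out, mem_fraction=0.5):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        n_mem = int(d_out * mem_fraction)
        mask = torch.ones(d_out)
        mask[:n_mem] = 0.0                  # units silenced at inference
        self.register_buffer("eval_mask", mask)

    def forward(self, x):
        h = self.fc(x)
        if not self.training:               # drop memorization units at test time
            h = h * self.eval_mask
        return h
```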
[LG-11] No Trick No Treat: Pursuits and Challenges Towards Simulation-free Training of Neural Samplers
链接: https://arxiv.org/abs/2502.06685
作者: Jiajun He,Yuanqi Du,Francisco Vargas,Dinghuai Zhang,Shreyas Padhy,RuiKang OuYang,Carla Gomes,José Miguel Hernández-Lobato
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 21 pages, 5 figures, 6 tables
点击查看摘要
Abstract:We consider the sampling problem, where the aim is to draw samples from a distribution whose density is known only up to a normalization constant. Recent breakthroughs in generative modeling to approximate a high-dimensional data distribution have sparked significant interest in developing neural network-based methods for this challenging problem. However, neural samplers typically incur heavy computational overhead due to simulating trajectories during training. This motivates the pursuit of simulation-free training procedures of neural samplers. In this work, we propose an elegant modification to previous methods, which allows simulation-free training with the help of a time-dependent normalizing flow. However, it ultimately suffers from severe mode collapse. On closer inspection, we find that nearly all successful neural samplers rely on Langevin preconditioning to avoid mode collapsing. We systematically analyze several popular methods with various objective functions and demonstrate that, in the absence of Langevin preconditioning, most of them fail to adequately cover even a simple target. Finally, we draw attention to a strong baseline by combining the state-of-the-art MCMC method, Parallel Tempering (PT), with an additional generative model to shed light on future explorations of neural samplers.
[LG-12] RAILS: Risk-Aware Iterated Local Search for Joint SLA Decomposition and Service Provider Management in Multi-Domain Networks
链接: https://arxiv.org/abs/2502.06674
作者: Cyril Shih-Huan Hsu,Chrysa Papagianni,Paola Grosso
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: The paper has been submitted to IEEE HPSR 2025
点击查看摘要
Abstract:The emergence of the fifth generation (5G) technology has transformed mobile networks into multi-service environments, necessitating efficient network slicing to meet diverse Service Level Agreements (SLAs). SLA decomposition across multiple network domains, each potentially managed by different service providers, poses a significant challenge due to limited visibility into real-time underlying domain conditions. This paper introduces Risk-Aware Iterated Local Search (RAILS), a novel risk model-driven meta-heuristic framework designed to jointly address SLA decomposition and service provider selection in multi-domain networks. By integrating online risk modeling with iterated local search principles, RAILS effectively navigates the complex optimization landscape, utilizing historical feedback from domain controllers. We formulate the joint problem as a Mixed-Integer Nonlinear Programming (MINLP) problem and prove its NP-hardness. Extensive simulations demonstrate that RAILS achieves near-optimal performance, offering an efficient, real-time solution for adaptive SLA management in modern multi-domain networks.
[LG-13] EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models
链接: https://arxiv.org/abs/2502.06663
作者: Xingrun Xing,Zheng Liu,Shitao Xiao,Boyan Gao,Yiming Liang,Wanpeng Zhang,Haokun Lin,Guoqi Li,Jiajun Zhang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Modern large language models (LLMs), driven by scaling laws, achieve intelligence emergence at large model sizes. Recently, increasing concerns about cloud costs, latency, and privacy have made it an urgent requirement to develop compact edge language models. Distinguished from direct pretraining, which is bounded by the scaling law, this work proposes pruning-aware pretraining, focusing on retaining the performance of much larger optimized models. It features the following characteristics: 1) Data-scalable: we introduce minimal parameter groups in the LLM and continuously optimize structural pruning, extending post-training pruning methods like LLM-Pruner and SparseGPT into the pretraining phase. 2) Architecture-agnostic: the LLM architecture is auto-designed using saliency-driven pruning, which is the first time SoTA human-designed LLMs have been exceeded in modern pretraining. We reveal that it achieves top-quality edge language models, termed EfficientLLM, by scaling up LLM compression and extending its boundary. EfficientLLM significantly outperforms SoTA baselines with 100M \sim 1B parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, and Llama3.2-1B, on common sense benchmarks. As a first attempt, EfficientLLM bridges the performance gap between traditional LLM compression and direct pretraining methods, and we will fully open source at this https URL.
[LG-14] Generating Samples to Question Trained Models
链接: https://arxiv.org/abs/2502.06658
作者: E. Mehmet Kıral,Nurşen Aydın,Ş. İlker Birbil
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:There is a growing need for investigating how machine learning models operate. With this work, we aim to understand trained machine learning models by questioning their data preferences. We propose a mathematical framework that allows us to probe trained models and identify their preferred samples in various scenarios including prediction-risky, parameter-sensitive, or model-contrastive samples. To showcase our framework, we pose these queries to a range of models trained on a range of classification and regression tasks, and receive answers in the form of generated data.
[LG-15] Koopman-Equivariant Gaussian Processes AISTATS
链接: https://arxiv.org/abs/2502.06645
作者: Petar Bevanda,Max Beier,Armin Lederer,Alexandre Capone,Stefan Sosnowski,Sandra Hirche
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: Accepted to the 28th International Conference on Artificial Intelligence and Statistics (AISTATS)
点击查看摘要
Abstract:Credible forecasting and representation learning of dynamical systems are of ever-increasing importance for reliable decision-making. To that end, we propose a family of Gaussian processes (GP) for dynamical systems with linear time-invariant responses, which are nonlinear only in initial conditions. This linearity allows us to tractably quantify forecasting and representational uncertainty, simultaneously alleviating the challenge of computing the distribution of trajectories from a GP-based dynamical system and enabling a new probabilistic treatment of learning Koopman operator representations. Using a trajectory-based equivariance, which we refer to as \textit{Koopman equivariance}, we obtain a GP model with enhanced generalization capabilities. To allow for large-scale regression, we equip our framework with variational inference based on suitable inducing points. Experiments demonstrate on-par and often better forecasting performance compared to kernel-based methods for learning dynamical systems.
[LG-16] MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing
链接: https://arxiv.org/abs/2502.06643
作者: Seokjin Go,Divya Mahajan
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently, offering sparse activation that reduces computational costs while increasing model capacity. However, as MoE models scale, they need to be distributed across GPU devices and thus face critical performance bottlenecks due to their large memory footprint. Expert parallelism distributes experts across GPUs; however, it faces key challenges, including unbalanced token routing and expert activation, resulting in communication tail latency and processing inefficiencies. While existing solutions address some of these issues, they fail to resolve the dual challenges of load imbalance and communication skew. The imbalance in token processing load across experts causes uneven processing times on different GPUs, while communication skew between GPUs leads to unbalanced inter-GPU data transfers. These factors degrade the performance of MoE models by increasing tail latency and reducing overall throughput. To address these limitations, we propose an Integer Linear Programming (ILP) formulation to optimize expert placement by jointly considering token load, communication, and computation costs. We exploit the property that there is a token routing dependency across layers, where tokens routed to a specific expert in one layer are likely to be routed to a limited set of experts in the subsequent layer. Our solution, MoETuner, offers an optimal expert-to-GPU assignment that minimizes inter-GPU token routing costs and balances token processing across devices, thereby reducing tail latency and end-to-end execution time. Experimental results demonstrate 9.3% and 17.5% end-to-end speedups for single-node and multi-node inference respectively, showcasing the potential of our ILP-based optimization for offering expert parallel solutions for next-generation MoEs.
[LG-17] Continual Release Moment Estimation with Differential Privacy
链接: https://arxiv.org/abs/2502.06597
作者: Nikita P. Kalinin,Jalaj Upadhyay,Christoph H. Lampert
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We propose Joint Moment Estimation (JME), a method for continually and privately estimating both the first and second moments of data with reduced noise compared to naive approaches. JME uses the matrix mechanism and a joint sensitivity analysis to allow the second moment estimation with no additional privacy cost, thereby improving accuracy while maintaining privacy. We demonstrate JME’s effectiveness in two applications: estimating the running mean and covariance matrix for Gaussian density estimation, and model training with DP-Adam on CIFAR-10.
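To make the mechanics concrete, the sketch below shows one-shot Gaussian-mechanism releases of the first and second moments after per-record clipping. JME's actual contribution, the matrix mechanism with a joint sensitivity analysis for continual release, is not reproduced here; this is only the naive baseline it improves on.

```python
import numpy as np

def dp_moments(X, clip=1.0, noise_mult=1.0, seed=0):
    """One-shot noisy first and second moments; X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Clip each row so a single record has bounded L2 influence
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    n, d = Xc.shape
    # Gaussian noise scaled to each statistic's L2 sensitivity
    mean = (Xc.sum(axis=0) + rng.normal(0, noise_mult * clip, d)) / n
    noise = rng.normal(0, noise_mult * clip ** 2, (d, d))
    second = (Xc.T @ Xc + (noise + noise.T) / 2) / n   # symmetrized noise
    return mean, second
```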
[LG-18] Diffeomorphic Temporal Alignment Nets for Time-series Joint Alignment and Averag ing ICML2023
链接: https://arxiv.org/abs/2502.06591
作者: Ron Shapira Weber,Oren Freifeld
类目: Machine Learning (cs.LG)
*备注: This manuscript covers and extends the papers: Diffeomorphic Temporal Alignment Nets (DTAN; NeurIPS 2019) and Regularization-free Diffeomorphic Temporal Alignment Nets (ICML 2023). Additional contributions: Multi-tasking DTAN, PCA-DTAN and more
点击查看摘要
Abstract:In time-series analysis, nonlinear temporal misalignment remains a pivotal challenge that forestalls even simple averaging. Since its introduction, the Diffeomorphic Temporal Alignment Net (DTAN), which we first introduced (Weber et al., 2019) and further developed in (Weber & Freifeld, 2023), has proven itself as an effective solution for this problem (these conference papers are earlier partial versions of the current manuscript). DTAN predicts and applies diffeomorphic transformations in an input-dependent manner, thus facilitating the joint alignment (JA) and averaging of time-series ensembles in an unsupervised or a weakly-supervised manner. The inherent challenges of the weakly/unsupervised setting, particularly the risk of trivial solutions through excessive signal distortion, are mitigated using either one of two distinct strategies: 1) a regularization term for warps; 2) using the Inverse Consistency Averaging Error (ICAE). The latter is a novel, regularization-free approach which also facilitates the JA of variable-length signals. We also further extend our framework to incorporate multi-task learning (MT-DTAN), enabling simultaneous time-series alignment and classification. Additionally, we conduct a comprehensive evaluation of different backbone architectures, demonstrating their efficacy in time-series alignment tasks. Finally, we showcase the utility of our approach in enabling Principal Component Analysis (PCA) for misaligned time-series data. Extensive experiments across 128 UCR datasets validate the superiority of our approach over contemporary averaging methods, including both traditional and learning-based approaches, marking a significant advancement in the field of time-series analysis.
[LG-19] Deep Reinforcement Learning based Triggering Function for Early Classifiers of Time Series
链接: https://arxiv.org/abs/2502.06584
作者: Aurélien Renault,Alexis Bondu,Antoine Cornuéjols,Vincent Lemaire
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Early Classification of Time Series (ECTS) has been recognized as an important problem in many areas where decisions have to be taken as soon as possible, before the full data availability, while time pressure increases. Numerous ECTS approaches have been proposed, based on different triggering functions, each taking into account various pieces of information related to the incoming time series and/or the output of a classifier. Although their performances have been empirically compared in the literature, no studies have been carried out on the optimality of these triggering functions that involve "man-tailored" decision rules. Based on the same information, could there be better triggering functions? This paper presents one way to investigate this question by showing first how to translate ECTS problems into Reinforcement Learning (RL) ones, where the very same information is used in the state space. A thorough comparison of the performance obtained by "handmade" approaches and their "RL-based" counterparts has been carried out. A second question investigated in this paper is whether a different combination of information, defining the state space in RL systems, can achieve even better performance. Experiments show that the system we describe, called \textsc{Alert}, significantly outperforms its state-of-the-art competitors on a large number of datasets.
[LG-20] Robust Scatter Matrix Estimation for Elliptical Distributions in Polynomial Time
链接: https://arxiv.org/abs/2502.06564
作者: Gleb Novikov
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the problem of computationally efficient robust estimation of scatter matrices of elliptical distributions under the strong contamination model. We design polynomial time algorithms that achieve dimension-independent error in Frobenius norm. Our first result is a sequence of efficient algorithms that approaches nearly optimal error. Specifically, under a mild assumption on the eigenvalues of the scatter matrix \Sigma , for every t \in \mathbb{N} , we design an estimator that, given n = d^{O(t)} samples, in time n^{O(t)} finds \hat{\Sigma} such that \Vert \Sigma^{-1/2} (\hat{\Sigma} - \Sigma) \Sigma^{-1/2} \Vert_{\text{F}} \le O(t \cdot \varepsilon^{1-\frac{1}{t}}) , where \varepsilon is the fraction of corruption. We do not require any assumptions on the moments of the distribution, while all previously known computationally efficient algorithms for robust covariance/scatter estimation with dimension-independent error rely on strong assumptions on the moments, such as sub-Gaussianity or (certifiable) hypercontractivity. Furthermore, under a stronger assumption on the eigenvalues of \Sigma (that, in particular, is satisfied by all matrices with constant condition number), we provide a fast (sub-quadratic in the input size) algorithm that, given a nearly optimal number of samples n = \tilde{O}(d^2/\varepsilon) , in time \tilde{O}(nd^2 \mathrm{poly}(1/\varepsilon)) finds \hat{\Sigma} such that \Vert\hat{\Sigma} - \Sigma\Vert_{\text{F}} \le O(\Vert\Sigma\Vert \cdot \sqrt{\varepsilon}) . Our approach is based on robust covariance estimation of the spatial sign (the projection onto the sphere of radius \sqrt{d} ) of elliptical distributions.
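The spatial-sign step the abstract refers to is simple to state: center the samples, project each onto the sphere of radius \sqrt{d}, and estimate the scatter of the projected points. The sketch below substitutes a plain empirical covariance for the robust estimation step, so it illustrates the transform only, not the paper's full algorithm.

```python
import numpy as np

def spatial_sign_scatter(X):
    """X: (n, d) samples from an elliptical distribution, assumed centered."""
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    S = np.sqrt(d) * X / np.maximum(norms, 1e-12)  # points on sphere of radius sqrt(d)
    # The paper applies *robust* covariance estimation to S; a plain
    # empirical covariance is used here purely for illustration.
    return S.T @ S / n
```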
[LG-21] Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data?
链接: https://arxiv.org/abs/2502.06555
作者: Marika Swanberg,Ryan McKenna,Edo Roth,Albert Cheu,Peter Kairouz
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. Recent advancements in large language models (LLMs) have inspired a number of algorithmic techniques for improving DP synthetic data generation. One family of approaches uses DP finetuning on the foundation model weights; however, the model weights for state-of-the-art models may not be public. In this work, we propose two DP synthetic tabular data algorithms that only require API access to the foundation model. We adapt the Private Evolution algorithm (Lin et al., 2023; Xie et al., 2024), which was designed for image and text data, to the tabular data domain. In our extension of Private Evolution, we define a query workload-based distance measure, which may be of independent interest. We propose a family of algorithms that use one-shot API access to LLMs, rather than adaptive queries to the LLM. Our findings reveal that API access to powerful LLMs does not always improve the quality of DP synthetic data compared to established baselines that operate without such access. We provide insights into the underlying reasons and propose improvements to LLMs that could make them more effective for this application.
[LG-22] Dimension-free Regret for Learning Asymmetric Linear Dynamical Systems
链接: https://arxiv.org/abs/2502.06545
作者: Annie Marsden,Elad Hazan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 19 pages
点击查看摘要
Abstract:Previously, methods for learning marginally stable linear dynamical systems either required the transition matrix to be symmetric or incurred regret bounds that scale polynomially with the system’s hidden dimension. In this work, we introduce a novel method that overcomes this trade-off, achieving dimension-free regret despite the presence of asymmetric matrices and marginal stability. Our method combines spectral filtering with linear predictors and employs Chebyshev polynomials in the complex plane to construct a novel spectral filtering basis. This construction guarantees sublinear regret in an online learning framework, without relying on any statistical or generative assumptions. Specifically, we prove that, as long as the transition matrix has eigenvalues with complex component bounded by 1/\mathrm{poly}\log T , our method achieves regret \tilde{O}(T^{9/10}) when compared to the best linear dynamical predictor in hindsight.
[LG-23] Logarithmic Regret of Exploration in Averag e Reward Markov Decision Processes
链接: https://arxiv.org/abs/2502.06480
作者: Victor Boone,Bruno Gaujal
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: They are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing (DT) by another simple rule, which we call the Vanishing Multiplicative (VM) rule. When managing episodes with (VM), the algorithm’s regret is, both in theory and in practice, as good as, if not better than, with (DT), while the one-shot behavior is greatly improved. More specifically, the management of bad episodes (when sub-optimal policies are being used) is much better under (VM) than (DT) by making the regret of exploration logarithmic rather than linear. These results are made possible by a new in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.
[LG-24] Low-dimensional Functions are Efficiently Learnable under Randomly Biased Distributions
链接: https://arxiv.org/abs/2502.06443
作者: Elisabetta Cornacchia,Dan Mikulincer,Elchanan Mossel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The problem of learning single index and multi index models has gained significant interest as a fundamental task in high-dimensional statistics. Many recent works have analysed gradient-based methods, particularly in the setting of isotropic data distributions, often in the context of neural network training. Such studies have uncovered precise characterisations of algorithmic sample complexity in terms of certain analytic properties of the target function, such as the leap, information, and generative exponents. These properties establish a quantitative separation between low and high complexity learning tasks. In this work, we show that high complexity cases are rare. Specifically, we prove that introducing a small random perturbation to the data distribution (via a random shift in the first moment) renders any Gaussian single index model as easy to learn as a linear function. We further extend this result to a class of multi index models, namely sparse Boolean functions, also known as Juntas.
[LG-25] An Automated Machine Learning Framework for Surgical Suturing Action Detection under Class Imbalance
链接: https://arxiv.org/abs/2502.06407
作者: Baobing Zhang,Paul Sullivan,Benjie Tang,Ghulam Nabi,Mustafa Suphi Erden
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:In laparoscopy surgical training and evaluation, real-time detection of surgical actions with interpretable outputs is crucial for automated, real-time instructional feedback and skill development. Such capability would enable the development of machine-guided training systems. This paper presents a rapid deployment approach utilizing automated machine learning methods, based on surgical action data collected from both experienced and trainee surgeons. The proposed approach effectively tackles the challenge of highly imbalanced class distributions, ensuring robust predictions across varying skill levels of surgeons. Additionally, our method partially incorporates model transparency, addressing the reliability requirements in medical applications. Compared to deep learning approaches, traditional machine learning models not only facilitate efficient rapid deployment but also offer significant advantages in interpretability. Through experiments, this study demonstrates the potential of this approach to provide quick, reliable and effective real-time detection in surgical training environments.
[LG-26] The AI off-switch problem as a signalling game: bounded rationality and incomparability
链接: https://arxiv.org/abs/2502.06403
作者: Alessio Benavoli,Alessandro Facchini,Marco Zaffalon
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The off-switch problem is a critical challenge in AI control: if an AI system resists being switched off, it poses a significant risk. In this paper, we model the off-switch problem as a signalling game, where a human decision-maker communicates its preferences about some underlying decision problem to an AI agent, which then selects actions to maximise the human’s utility. We assume that the human is a bounded rational agent and explore various bounded rationality mechanisms. Using real machine learning models, we reprove prior results and demonstrate that a necessary condition for an AI system to refrain from disabling its off-switch is its uncertainty about the human’s utility. We also analyse how message costs influence optimal strategies and extend the analysis to scenarios involving incomparability.
[LG-27] Habitizing Diffusion Planning for Efficient and Effective Decision Making
链接: https://arxiv.org/abs/2502.06401
作者: Haofei Lu,Yifei Shen,Dongsheng Li,Junliang Xing,Dongqi Han
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Diffusion models have shown great promise in decision-making, also known as diffusion planning. However, the slow inference speeds limit their potential for broader real-world applications. Here, we introduce Habi, a general framework that transforms powerful but slow diffusion planning models into fast decision-making models, which mimics the cognitive process in the brain that costly goal-directed behavior gradually transitions to efficient habitual behavior with repetitive practice. Even using a laptop CPU, the habitized model can achieve an average 800+ Hz decision-making frequency (faster than previous diffusion planners by orders of magnitude) on standard offline reinforcement learning benchmarks D4RL, while maintaining comparable or even higher performance compared to its corresponding diffusion planner. Our work proposes a fresh perspective of leveraging powerful diffusion models for real-world decision-making tasks. We also provide robust evaluations and analysis, offering insights from both biological and engineering perspectives for efficient and effective decision-making.
[LG-28] Learning Counterfactual Outcomes Under Rank Preservation
链接: https://arxiv.org/abs/2502.06398
作者: Peng Wu,Haoxuan Li,Chunyuan Zheng,Yan Zeng,Jiawei Chen,Yang Liu,Ruocheng Guo,Kun Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Counterfactual inference aims to estimate the counterfactual outcome at the individual level given knowledge of an observed treatment and the factual outcome, with broad applications in fields such as epidemiology, econometrics, and management science. Previous methods rely on a known structural causal model (SCM) or assume the homogeneity of the exogenous variable and strict monotonicity between the outcome and exogenous variable. In this paper, we propose a principled approach for identifying and estimating the counterfactual outcome. We first introduce a simple and intuitive rank preservation assumption to identify the counterfactual outcome without relying on a known structural causal model. Building on this, we propose a novel ideal loss for theoretically unbiased learning of the counterfactual outcome and further develop a kernel-based estimator for its empirical estimation. Our theoretical analysis shows that the rank preservation assumption is not stronger than the homogeneity and strict monotonicity assumptions, and shows that the proposed ideal loss is convex, and the proposed estimator is unbiased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed method.
[LG-29] How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators
链接: https://arxiv.org/abs/2502.06387
作者: Shang Liu,Hanzhao Wang,Zhongyao Ma,Xiaocheng Li
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
*备注:
点击查看摘要
Abstract:Human-annotated preference data play an important role in aligning large language models (LLMs). In this paper, we investigate the questions of assessing the performance of human annotators and incentivizing them to provide high-quality annotations. The quality assessment of language/text annotation faces two challenges: (i) the intrinsic heterogeneity among annotators, which prevents the classic methods that assume the underlying existence of a true label; and (ii) the unclear relationship between the annotation quality and the performance of downstream tasks, which excludes the possibility of inferring the annotators’ behavior based on the model performance trained from the annotation data. Then we formulate a principal-agent model to characterize the behaviors of and the interactions between the company and the human annotators. The model rationalizes a practical mechanism of a bonus scheme to incentivize annotators which benefits both parties and it underscores the importance of the joint presence of an assessment system and a proper contract scheme. From a technical perspective, our analysis extends the existing literature on the principal-agent model by considering a continuous action space for the agent. We show the gap between the first-best and the second-best solutions (under the continuous action space) is of \Theta(1/\sqrt{n \log n}) for the binary contracts and \Theta(1/n) for the linear contracts, where n is the number of samples used for performance assessment; this contrasts with the known result of \exp(-\Theta(n)) for the binary contracts when the action space is discrete. Throughout the paper, we use real preference annotation data to accompany our discussions.
[LG-30] Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset
链接: https://arxiv.org/abs/2502.06364
作者: Huw Cheston,Jan Van Balen,Simon Durand
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 17 pages, 6 figures
点击查看摘要
Abstract:Sampling, the practice of reusing recorded music or sounds from another source in a new work, is common in popular music genres like hip-hop and rap. Numerous services have emerged that allow users to identify connections between samples and the songs that incorporate them, with the goal of enhancing music discovery. Designing a system that can perform the same task automatically is challenging, as samples are commonly altered with audio effects like pitch- and time-stretching and may only be seconds long. Progress on this task has been minimal and is further blocked by the limited availability of training data. Here, we show that a convolutional neural network trained on an artificial dataset can identify real-world samples in commercial hip-hop music. We extract vocal, harmonic, and percussive elements from several databases of non-commercial music recordings using audio source separation, and train the model to fingerprint a subset of these elements in transformed versions of the original audio. We optimize the model using a joint classification and metric learning loss and show that it achieves 13% greater precision on real-world instances of sampling than a fingerprinting system using acoustic landmarks, and that it can recognize samples that have been both pitch shifted and time stretched. We also show that, for half of the commercial music recordings we tested, our model is capable of locating the position of a sample to within five seconds.
[LG-31] Improved Regret Analysis in Gaussian Process Bandits: Optimality for Noiseless Reward RKHS norm and Non-Stationary Variance
链接: https://arxiv.org/abs/2502.06363
作者: Shogo Iwazaki,Shion Takeno
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 35 pages
点击查看摘要
Abstract:We study the Gaussian process (GP) bandit problem, whose goal is to minimize regret under an unknown reward function lying in some reproducing kernel Hilbert space (RKHS). The maximum posterior variance analysis is vital in analyzing near-optimal GP bandit algorithms such as maximum variance reduction (MVR) and phased elimination (PE). Therefore, we first show a new upper bound on the maximum posterior variance, which improves the dependence on the noise variance parameters of the GP. By leveraging this result, we refine MVR and PE to obtain (i) a nearly optimal regret upper bound in the noiseless setting and (ii) regret upper bounds that are optimal with respect to the RKHS norm of the reward function. Furthermore, as another application of our proposed bound, we analyze the GP bandit under the time-varying noise variance setting, which is the kernelized extension of the linear bandit with heteroscedastic noise. For this problem, we show that MVR and PE-based algorithms achieve noise variance-dependent regret upper bounds, which match our regret lower bound.
[LG-32] Towards bandit-based prompt-tuning for in-the-wild foundation agents
链接: https://arxiv.org/abs/2502.06358
作者: Finn Rietz,Oleg Smirnov,Sara Karimi,Lele Cao
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Prompting has emerged as the dominant paradigm for adapting large, pre-trained transformer-based models to downstream tasks. The Prompting Decision Transformer (PDT) enables large-scale, multi-task offline reinforcement learning pre-training by leveraging stochastic trajectory prompts to identify the target task. However, these prompts are sampled uniformly from expert demonstrations, overlooking a critical limitation: Not all prompts are equally informative for differentiating between tasks. To address this, we propose an inference time bandit-based prompt-tuning framework that explores and optimizes trajectory prompt selection to enhance task performance. Our experiments indicate not only clear performance gains due to bandit-based prompt-tuning, but also better sample complexity, scalability, and prompt space exploration compared to prompt-tuning baselines.
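One way to realize inference-time prompt selection is a standard UCB loop over a fixed pool of trajectory prompts. The sketch below assumes a `run_episode(prompt)` callable that returns an episode return; the paper's bandit formulation and exploration strategy may differ in detail.

```python
import math

def ucb_prompt_tuning(prompts, run_episode, rounds=200, c=2.0):
    """Treat each candidate prompt as a bandit arm; pick arms by UCB."""
    counts = [0] * len(prompts)
    means = [0.0] * len(prompts)
    for t in range(1, rounds + 1):
        if t <= len(prompts):
            a = t - 1                      # play each arm once first
        else:
            a = max(range(len(prompts)),
                    key=lambda i: means[i] + c * math.sqrt(math.log(t) / counts[i]))
        r = run_episode(prompts[a])        # environment return with this prompt
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
    return prompts[max(range(len(prompts)), key=lambda i: means[i])]
```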
[LG-33] Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach IJCAI2025
链接: https://arxiv.org/abs/2502.06355
作者: Timo Fudala,Vasileios Tsouvalas,Nirvana Meratnia
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, submitted to IJCAI 2025
点击查看摘要
Abstract:Multimodal transformers integrate diverse data types like images, audio, and text, advancing tasks such as audio-visual understanding and image-text retrieval; yet their high parameterization limits deployment on resource-constrained edge devices. Split Learning (SL), which partitions models at a designated cut-layer to offload compute-intensive operations to the server, offers a promising approach for distributed training of multimodal transformers, though its application remains underexplored. We present MPSL, a parallel SL approach for computationally efficient fine-tuning of multimodal transformers in a distributed manner, while eliminating label sharing, client synchronization, and per-client sub-model management. MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs. Our evaluation across 7 multimodal datasets demonstrates that MPSL matches or outperforms Federated Learning, reduces client-side computations by 250x, and achieves superior scalability in communication cost with model growth. Through extensive analysis, we highlight task suitability, trade-offs, and scenarios where MPSL excels, inspiring further exploration.
[LG-34] Calibrating LLMs with Information-Theoretic Evidential Deep Learning ICLR2025
链接: https://arxiv.org/abs/2502.06351
作者: Yawei Li,David Rügamer,Bernd Bischl,Mina Rezaei
类目: Machine Learning (cs.LG)
*备注: 18 pages; 3 figures; accepted to ICLR 2025
点击查看摘要
Abstract:Fine-tuned large language models (LLMs) often exhibit overconfidence, particularly when trained on small datasets, resulting in poor calibration and inaccurate uncertainty estimates. Evidential Deep Learning (EDL), an uncertainty-aware approach, enables uncertainty estimation in a single forward pass, making it a promising method for calibrating fine-tuned LLMs. However, despite its computational efficiency, EDL is prone to overfitting, as its training objective can result in overly concentrated probability distributions. To mitigate this, we propose regularizing EDL by incorporating an information bottleneck (IB). Our approach IB-EDL suppresses spurious information in the evidence generated by the model and encourages truly predictive information to influence both the predictions and uncertainty estimates. Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration. Code is available at this https URL.
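For context, the EDL component is a single-forward-pass Dirichlet head: nonnegative evidence becomes concentration parameters, from which both class probabilities and a vacuity-style uncertainty fall out. The sketch below covers only this standard EDL part, not the information-bottleneck regularizer that IB-EDL adds.

```python
import torch
import torch.nn.functional as F

def evidential_head(logits):
    """Map raw logits to Dirichlet parameters, probabilities, and uncertainty."""
    evidence = F.softplus(logits)            # nonnegative evidence per class
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    strength = alpha.sum(-1, keepdim=True)   # total evidence
    probs = alpha / strength                 # expected class probabilities
    K = logits.shape[-1]
    uncertainty = K / strength               # vacuity: high when evidence is low
    return probs, uncertainty

probs, u = evidential_head(torch.randn(4, 10))  # batch of 4, 10 classes
```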
[LG-35] Provably Near-Optimal Federated Ensemble Distillation with Negligible Overhead
链接: https://arxiv.org/abs/2502.06349
作者: Won-Jun Jang,Hyeon-Seo Park,Si-Hyeon Lee
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated ensemble distillation addresses client heterogeneity by generating pseudo-labels for an unlabeled server dataset based on client predictions and training the server model using the pseudo-labeled dataset. The unlabeled server dataset can either be pre-existing or generated through a data-free approach. The effectiveness of this approach critically depends on the method of assigning weights to client predictions when creating pseudo-labels, especially in highly heterogeneous settings. Inspired by theoretical results from GANs, we propose a provably near-optimal weighting method that leverages client discriminators trained with a server-distributed generator and local datasets. Our experiments on various image classification tasks demonstrate that the proposed method significantly outperforms baselines. Furthermore, we show that the additional communication cost, client-side privacy leakage, and client-side computational overhead introduced by our method are negligible, both in scenarios with and without a pre-existing server dataset.
[LG-36] Causal Lifting of Neural Representations: Zero-Shot Generalization for Causal Inferences
链接: https://arxiv.org/abs/2502.06343
作者: Riccardo Cadei,Ilker Demirel,Piersilvio De Bartolomeis,Lukas Lindorfer,Sylvia Cremer,Cordelia Schmid,Francesco Locatello
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:A plethora of real-world scientific investigations is waiting to scale with the support of trustworthy predictive models that can reduce the need for costly data annotations. We focus on causal inferences on a target experiment with unlabeled factual outcomes, retrieved by a predictive model fine-tuned on a labeled similar experiment. First, we show that factual outcome estimation via Empirical Risk Minimization (ERM) may fail to yield valid causal inferences on the target population, even in a randomized controlled experiment and infinite training samples. Then, we propose to leverage the observed experimental settings during training to empower generalization to downstream interventional investigations, "Causal Lifting" the predictive model. We propose Deconfounded Empirical Risk Minimization (DERM), a new simple learning procedure minimizing the risk over a fictitious target population, preventing potential confounding effects. We validate our method on both synthetic and real-world scientific data. Notably, for the first time, we zero-shot generalize causal inferences on the ISTAnt dataset (without annotation) by causal lifting a predictive model on our experiment variant.
[LG-37] Microcanonical Langevin Ensembles: Advancing the Sampling of Bayesian Neural Networks
链接: https://arxiv.org/abs/2502.06335
作者: Emanuel Sommer,Jakob Robnik,Giorgi Nozadze,Uros Seljak,David Rügamer
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Despite recent advances, sampling-based inference for Bayesian Neural Networks (BNNs) remains a significant challenge in probabilistic deep learning. While sampling-based approaches do not require a variational distribution assumption, current state-of-the-art samplers still struggle to navigate the complex and highly multimodal posteriors of BNNs. As a consequence, sampling still requires considerably longer inference times than non-Bayesian methods even for small neural networks, despite recent advances in making software implementations more efficient. Besides the difficulty of finding high-probability regions, the time until samplers provide sufficient exploration of these areas remains unpredictable. To tackle these challenges, we introduce an ensembling approach that leverages strategies from optimization and a recently proposed sampler called Microcanonical Langevin Monte Carlo (MCLMC) for efficient, robust and predictable sampling performance. Compared to approaches based on the state-of-the-art No-U-Turn Sampler, our approach delivers substantial speedups up to an order of magnitude, while maintaining or improving predictive performance and uncertainty quantification across diverse tasks and data modalities. The suggested Microcanonical Langevin Ensembles and modifications to MCLMC additionally enhance the method’s predictability in resource requirements, facilitating easier parallelization. All in all, the proposed method offers a promising direction for practical, scalable inference for BNNs.
[LG-38] Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions
链接: https://arxiv.org/abs/2502.06309
作者: Zhaoxian Wu,Quan Xian,Tayfun Gokmen,Omobayode Fagbohungbe,Tianyi Chen
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, the training perspective, especially its training dynamics, is underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses. Among all the physical properties of resistive elements, the response to the pulses directly affects the training dynamics. This paper first provides a theoretical foundation for gradient-based training on AIMC hardware and studies the impact of response functions. We demonstrate that noisy updates and asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty term on the objective. To overcome the issue, Tiki-Taka, a residual learning algorithm, converges exactly to a critical point by optimizing a main array and a residual array in a bilevel manner. The conclusion is supported by simulations validating our theoretical insights.
[LG-39] Utilizing Novelty-based Evolution Strategies to Train Transformers in Reinforcement Learning
链接: https://arxiv.org/abs/2502.06301
作者: Matyáš Lorenc
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:In this paper, we experiment with novelty-based variants of OpenAI-ES, the NS-ES and NSR-ES algorithms, and evaluate their effectiveness in training complex, transformer-based architectures designed for reinforcement learning, such as Decision Transformers. We also test whether we can accelerate the novelty-based training of these larger models by seeding the training with a pretrained model. By this, we build on our previous work, where we tested the ability of evolution strategies, specifically the aforementioned OpenAI-ES, to train the Decision Transformer architecture. The results were mixed. NS-ES showed progress, but it would clearly need many more iterations to yield interesting results. NSR-ES, on the other hand, proved quite capable of being straightforwardly used on larger models, since its performance on the Decision Transformer appears similar to that on the feed-forward model, as was the case for OpenAI-ES in our previous work.
[LG-40] The impact of allocation strategies in subset learning on the expressive power of neural networks
链接: https://arxiv.org/abs/2502.06300
作者: Ofir Schlisselberg,Ran Darshan
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In traditional machine learning, models are defined by a set of parameters, which are optimized to perform specific tasks. In neural networks, these parameters correspond to the synaptic weights. However, in reality, it is often infeasible to control or update all weights. This challenge is not limited to artificial networks but extends to biological networks, such as the brain, where the extent of distributed synaptic weight modification during learning remains unclear. Motivated by these insights, we theoretically investigate how different allocations of a fixed number of learnable weights influence the capacity of neural networks. Using a teacher-student setup, we introduce a benchmark to quantify the expressivity associated with each allocation. We establish conditions under which allocations have maximal or minimal expressive power in linear recurrent neural networks and linear multi-layer feedforward networks. For suboptimal allocations, we propose heuristic principles to estimate their expressivity. These principles extend to shallow ReLU networks as well. Finally, we validate our theoretical findings with empirical experiments. Our results emphasize the critical role of strategically distributing learnable weights across the network, showing that a more widespread allocation generally enhances the network’s expressive power.
[LG-41] DVFS-Aware DNN Inference on GPUs: Latency Modeling and Performance Analysis
链接: https://arxiv.org/abs/2502.06295
作者: Yunchu Han,Zhaojun Nan,Sheng Zhou,Zhisheng Niu
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:The rapid development of deep neural networks (DNNs) is inherently accompanied by the problem of high computational costs. To tackle this challenge, dynamic voltage frequency scaling (DVFS) is emerging as a promising technology for balancing the latency and energy consumption of DNN inference by adjusting the computing frequency of processors. However, most existing models of DNN inference time are based on the CPU-DVFS technique, and directly applying the CPU-DVFS model to DNN inference on GPUs will lead to significant errors in optimizing latency and energy consumption. In this paper, we propose a DVFS-aware latency model to precisely characterize DNN inference time on GPUs. We first formulate the DNN inference time based on extensive experimental results across different devices and analyze the impact of the fitting parameters. We then further verify the proposed model by dividing DNNs into multiple blocks and measuring their actual inference times. Finally, we compare our proposed model with the CPU-DVFS model in two specific cases. Evaluation results demonstrate that local inference optimization with our proposed model achieves a reduction of no less than 66% and 69% in inference time and energy consumption respectively. In addition, cooperative inference with our proposed model can improve the partition policy and reduce the energy consumption compared to the CPU-DVFS model.
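As a rough illustration of what a DVFS-aware latency model looks like, the sketch below fits t(f) = a/f + b to synthetic measurements, where a/f captures frequency-scaled compute and b frequency-insensitive time. This common functional form is an assumption for illustration; the paper's GPU model is more detailed.

```python
import numpy as np
from scipy.optimize import curve_fit

def latency(f, a, b):
    return a / f + b                     # a/f: frequency-scaled compute; b: rest

rng = np.random.default_rng(0)
freqs = np.array([0.5, 0.8, 1.0, 1.2, 1.5])                  # GHz settings
times = 12.0 / freqs + 3.0 + rng.normal(0, 0.1, freqs.size)  # fake measurements (ms)

(a, b), _ = curve_fit(latency, freqs, times)
print(f"a = {a:.2f} (compute term), b = {b:.2f} (frequency-insensitive term)")
```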
[LG-42] On the Expressiveness of Rational ReLU Neural Networks With Bounded Depth ICLR2025
链接: https://arxiv.org/abs/2502.06283
作者: Gennadiy Averkov,Christopher Hojny,Maximilian Merkert
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注: ICLR 2025 conference paper
点击查看摘要
Abstract:To confirm that the expressive power of ReLU neural networks grows with their depth, the function F_n = \max\{0, x_1, \ldots, x_n\} has been considered in the literature. A conjecture by Hertrich, Basu, Di Summa, and Skutella [NeurIPS 2021] states that any ReLU network that exactly represents F_n has at least \lceil \log_2(n+1) \rceil hidden layers. The conjecture has recently been confirmed for networks with integer weights by Haase, Hertrich, and Loho [ICLR 2023]. We follow up on this line of research and show that, within ReLU networks whose weights are decimal fractions, F_n can only be represented by networks with at least \lceil \log_3(n+1) \rceil hidden layers. Moreover, if all weights are N-ary fractions, then F_n can only be represented by networks with at least \Omega(\frac{\ln n}{\ln \ln N}) layers. These results are a partial confirmation of the above conjecture for rational ReLU networks, and provide the first non-constant lower bound on the depth of practically relevant ReLU networks.
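The matching upper bound in the conjecture is easy to see constructively: a balanced tree of pairwise maxima, each realized with one ReLU layer via max(a, b) = b + relu(a - b), computes F_n in ceil(log2(n+1)) layers. The sketch below demonstrates this construction (not the paper's lower-bound argument, which concerns what cannot be done shallower).

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def max_pair(a, b):
    return b + relu(a - b)          # one hidden ReLU layer per pairwise max

def F(values):
    vals = [0.0] + list(values)     # F_n = max over n + 1 values
    while len(vals) > 1:            # each pass adds one layer of depth
        nxt = [max_pair(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

x = np.random.default_rng(0).normal(size=7)   # n = 7 -> depth ceil(log2(8)) = 3
print(F(x), max(0.0, x.max()))                # identical values
```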
[LG-43] IceBerg: Debiased Self-Training for Class-Imbalanced Node Classification WWW
链接: https://arxiv.org/abs/2502.06280
作者: Zhixun Li,Dingshuo Chen,Tong Zhao,Daixin Wang,Hongrui Liu,Zhiqiang Zhang,Jun Zhou,Jeffrey Xu Yu
类目: Machine Learning (cs.LG)
*备注: Accepted by TheWebConf (WWW) 2025
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have achieved great success in dealing with non-Euclidean graph-structured data and have been widely deployed in many real-world applications. However, their effectiveness is often jeopardized under class-imbalanced training sets. Most existing studies have analyzed class-imbalanced node classification from a supervised learning perspective, but they do not fully utilize the large number of unlabeled nodes in semi-supervised scenarios. We claim that the supervised signal is just the tip of the iceberg and a large number of unlabeled nodes have not yet been effectively utilized. In this work, we propose IceBerg, a debiased self-training framework to address the class-imbalanced and few-shot challenges for GNNs at the same time. First, to counteract the Matthew effect and label distribution shift in self-training, we propose Double Balancing, a simple plug-and-play module that can largely improve the performance of existing baselines with just a few lines of code. Second, to enhance the long-range propagation capability of GNNs, we disentangle the propagation and transformation operations of GNNs. Therefore, the weak supervision signals can propagate more effectively to address the few-shot issue. In summary, we find that leveraging unlabeled nodes can significantly enhance the performance of GNNs in class-imbalanced and few-shot scenarios, and even small, surgical modifications can lead to substantial performance improvements. Systematic experiments on benchmark datasets show that our method can deliver considerable performance gain over existing class-imbalanced node classification baselines. Additionally, due to IceBerg’s outstanding ability to leverage unsupervised signals, it also achieves state-of-the-art results in few-shot node classification scenarios. The code of IceBerg is available at: this https URL.
[LG-44] Beyond Batch Learning: Global Awareness Enhanced Domain Adaptation
链接: https://arxiv.org/abs/2502.06272
作者: Lingkun Luo,Shiqiang Hu,Liming Chen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In domain adaptation (DA), the effectiveness of deep learning-based models is often constrained by batch learning strategies that fail to fully apprehend the global statistical and geometric characteristics of data distributions. Addressing this gap, we introduce ‘Global Awareness Enhanced Domain Adaptation’ (GAN-DA), a novel approach that transcends traditional batch-based limitations. GAN-DA integrates a unique predefined feature representation (PFR) to facilitate the alignment of cross-domain distributions, thereby achieving a comprehensive global statistical awareness. This representation is innovatively expanded to encompass orthogonal and common feature aspects, which enhances the unification of global manifold structures and refines decision boundaries for more effective DA. Our extensive experiments, encompassing 27 diverse cross-domain image classification tasks, demonstrate GAN-DA’s remarkable superiority, outperforming 24 established DA methods by a significant margin. Furthermore, our in-depth analyses shed light on the decision-making processes, revealing insights into the adaptability and efficiency of GAN-DA. This approach not only addresses the limitations of existing DA methodologies but also sets a new benchmark in the realm of domain adaptation, offering broad implications for future research and applications in this field.
[LG-45] Reducing Variance Caused by Communication in Decentralized Multi-agent Deep Reinforcement Learning
链接: https://arxiv.org/abs/2502.06261
作者: Changxi Zhu,Mehdi Dastani,Shihan Wang
类目: Machine Learning (cs.LG)
*备注: 30 pages, 6 figures, 6 tables
点击查看摘要
Abstract:In decentralized multi-agent deep reinforcement learning (MADRL), communication can help agents to gain a better understanding of the environment to better coordinate their behaviors. Nevertheless, communication may involve uncertainty, which potentially introduces variance to the learning of decentralized agents. In this paper, we focus on a specific decentralized MADRL setting with communication and conduct a theoretical analysis to study the variance that is caused by communication in policy gradients. We propose modular techniques to reduce the variance in policy gradients during training. We adopt our modular techniques into two existing algorithms for decentralized MADRL with communication and evaluate them on multiple tasks in the StarCraft Multi-Agent Challenge and Traffic Junction domains. The results show that decentralized MADRL communication methods extended with our proposed techniques not only achieve high-performing agents but also reduce variance in policy gradients during training.
[LG-46] DGNO: A Novel Physics-aware Neural Operator for Solving Forward and Inverse PDE Problems based on Deep Generative Probabilistic Modeling
链接: https://arxiv.org/abs/2502.06250
作者: Yaohua Zang,Phaedon-Stelios Koutsourelakis
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:
点击查看摘要
Abstract:Solving parametric partial differential equations (PDEs) and associated PDE-based, inverse problems is a central task in engineering and physics, yet existing neural operator methods struggle with high-dimensional, discontinuous inputs and require large amounts of labeled training data. We propose the Deep Generative Neural Operator (DGNO), a physics-aware framework that addresses these challenges by leveraging a deep, generative, probabilistic model in combination with a set of lower-dimensional, latent variables that simultaneously encode PDE-inputs and PDE-outputs. This formulation can make use of unlabeled data and significantly improves inverse problem-solving, particularly for discontinuous or discrete-valued input functions. DGNO enforces physics constraints without labeled data by incorporating, as virtual observables, weak-form residuals based on compactly supported radial basis functions (CSRBFs). These relax regularity constraints and eliminate higher-order derivatives from the objective function. We also introduce MultiONet, a novel neural operator architecture, which is a more expressive generalization of the popular DeepONet that significantly enhances the approximating power of the proposed model. These innovations make DGNO particularly effective for challenging forward and inverse PDE-based problems, such as those involving multi-phase media. Numerical experiments demonstrate that DGNO achieves higher accuracy across multiple benchmarks while exhibiting robustness to noise and strong generalization to out-of-distribution cases. Its adaptability, and the ability to handle sparse, noisy data while providing probabilistic estimates, make DGNO a powerful tool for scientific and engineering applications.
[LG-47] PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts
链接: https://arxiv.org/abs/2502.06244
作者: Zeman Li,Yuan Deng,Peilin Zhong,Meisam Razaviyayn,Vahab Mirrokni
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Modern machine learning models are trained on diverse datasets and tasks to improve generalization. A key challenge in multitask learning is determining the optimal data mixing and sampling strategy across different data sources. Prior research in this multi-task learning setting has primarily focused on mitigating gradient conflicts between tasks. However, we observe that many real-world multitask learning scenarios-such as multilingual training and multi-domain learning in large foundation models-exhibit predominantly positive task interactions with minimal or no gradient conflict. Building on this insight, we introduce PiKE (Positive gradient interaction-based K-task weights Estimator), an adaptive data mixing algorithm that dynamically adjusts task contributions throughout training. PiKE optimizes task sampling to minimize overall loss, effectively leveraging positive gradient interactions with almost no additional computational overhead. We establish theoretical convergence guarantees for PiKE and demonstrate its superiority over static and non-adaptive mixing strategies. Additionally, we extend PiKE to promote fair learning across tasks, ensuring balanced progress and preventing task underrepresentation. Empirical evaluations on large-scale language model pretraining show that PiKE consistently outperforms existing heuristic and static mixing strategies, leading to faster convergence and improved downstream task performance.
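The abstract does not spell out PiKE's update rule, so the following is only a heavily hedged sketch of the general idea: derive task sampling weights from per-task gradient statistics, upweighting tasks whose gradients align positively with the others. The proportional rule and all names are placeholder assumptions, not the paper's algorithm.

```python
import numpy as np

def mixing_weights(task_grads):
    G = np.stack(task_grads)           # (K, d) mean gradient per task
    interaction = G @ G.T              # positive entries = aligned tasks
    score = np.clip(interaction.sum(axis=1), 1e-8, None)
    return score / score.sum()         # sampling distribution over tasks

rng = np.random.default_rng(0)
grads = [rng.normal(loc=0.2, size=100) for _ in range(4)]  # mostly aligned tasks
print(mixing_weights(grads))
```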
[LG-48] Position: Continual Learning Benefits from An Evolving Population over An Unified Model
链接: https://arxiv.org/abs/2502.06210
作者: Aojun Lu,Junchao Ke,Chunhui Ding,Jiahao Fan,Yanan Sun
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep neural networks have demonstrated remarkable success in machine learning; however, they remain fundamentally ill-suited for Continual Learning (CL). Recent research has increasingly focused on achieving CL without the need for rehearsal. Among these, parameter isolation-based methods have proven particularly effective in enhancing CL by optimizing model weights for each incremental task. Despite their success, they fall short in optimizing architectures tailored to distinct incremental tasks. To address this limitation, updating a group of models with different architectures offers a promising alternative to the traditional CL paradigm that relies on a single unified model. Building on this insight, this study introduces a novel Population-based Continual Learning (PCL) framework. PCL extends CL to the architectural level by maintaining and evolving a population of neural network architectures, which are continually refined for the current task through NAS. Importantly, the well-evolved population for the current incremental task is naturally inherited by the subsequent one, thereby facilitating forward transfer, a crucial objective in CL. Throughout the CL process, the population evolves, yielding task-specific architectures that collectively form a robust CL system. Experimental results demonstrate that PCL outperforms state-of-the-art rehearsal-free CL methods that employ a unified model, highlighting its potential as a new paradigm for CL.
[LG-49] On the query complexity of sampling from non-log-concave distributions
链接: https://arxiv.org/abs/2502.06200
作者: Yuchen He,Chihao Zhang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the problem of sampling from a d-dimensional distribution with density p(x) \propto e^{-f(x)}, which does not necessarily satisfy good isoperimetric conditions. Specifically, we show that for any L, M satisfying LM \ge d \ge 5, \epsilon \in \left(0, \frac{1}{32}\right), and any algorithm with query accesses to the value of f(x) and \nabla f(x), there exists an L-log-smooth distribution with second moment at most M such that the algorithm requires \left(\frac{LM}{d\epsilon}\right)^{\Omega(d)} queries to compute a sample whose distribution is within \epsilon in total variation distance to the target distribution. We complement the lower bound with an algorithm requiring \left(\frac{LM}{d\epsilon}\right)^{\mathcal{O}(d)} queries, thereby characterizing the tight (up to the constant in the exponent) query complexity for sampling from the family of non-log-concave distributions. Our results are in sharp contrast with the recent work of Huang et al. (COLT'24), where an algorithm with quasi-polynomial query complexity was proposed for sampling from a non-log-concave distribution when M = \mathtt{poly}(d). Their algorithm works under the stronger condition that all distributions along the trajectory of the Ornstein-Uhlenbeck process, starting from the target distribution, are \mathcal{O}(1)-log-smooth. We investigate this condition and prove that it is strictly stronger than requiring the target distribution to be \mathcal{O}(1)-log-smooth. Additionally, we study this condition in the context of mixtures of Gaussians. Finally, we place our results within the broader theme of ``sampling versus optimization'', as studied in Ma et al. (PNAS'19). We show that for a wide range of parameters, sampling is strictly easier than optimization by a super-exponential factor in the dimension d.
[LG-50] Generalized Temporal Tensor Decomposition with Rank-revealing Latent-ODE
链接: https://arxiv.org/abs/2502.06164
作者: Panqi Chen,Lei Cheng,Jianlong Li,Weichang Li,Weiqing Liu,Jiang Bian,Shikai Fang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Tensor decomposition is a fundamental tool for analyzing multi-dimensional data by learning low-rank factors to represent high-order interactions. While recent works on temporal tensor decomposition have made significant progress by incorporating continuous timestamps in latent factors, they still struggle with general tensor data with continuous indexes not only in the temporal mode but also in other modes, such as spatial coordinates in climate data. Additionally, the problem of determining the tensor rank remains largely unexplored in temporal tensor models. To address these limitations, we propose Generalized temporal tensor decomposition with Rank-rEvealing latenT-ODE (GRET). Our approach encodes continuous spatial indexes as learnable Fourier features and employs neural ODEs in latent space to learn the temporal trajectories of factors. To automatically reveal the rank of temporal tensors, we introduce a rank-revealing Gaussian-Gamma prior over the factor trajectories. We develop an efficient variational inference scheme with an analytical evidence lower bound, enabling sampling-free optimization. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that GRET not only reveals the underlying ranks of temporal tensors but also significantly outperforms existing methods in prediction performance and robustness against noise.
[LG-51] Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search
链接: https://arxiv.org/abs/2502.06163
作者: Jack Spalding-Jamieson,Eliot Wong Robson,Da Wei Zheng
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG); Machine Learning (stat.ML)
*备注: 29 pages, 8 figures
点击查看摘要
Abstract:For very large values of k, we consider methods for fast k-means clustering of massive datasets with 10^7 \sim 10^9 points in high dimensions (d \geq 100). All current practical methods for this problem have runtimes at least \Omega(k^2). We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd’s local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call “Seeded Approximate Nearest-Neighbor Search”, for which we propose “Seeded Search-Graph” methods as a solution.
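To illustrate what "seeded" search can look like inside Lloyd's reassignment step, the sketch below starts from each point's previous center and greedily descends a k-NN graph over centers rather than scanning all k. The graph construction and greedy rule are illustrative assumptions; the paper's Seeded Search-Graph methods are more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_center_knn(centers, k=5):
    d = np.linalg.norm(centers[:, None] - centers[None], axis=-1)
    return np.argsort(d, axis=1)[:, 1:k + 1]      # k nearest centers per center

def seeded_assign(x, centers, graph, seed):
    cur, cur_d = seed, np.linalg.norm(x - centers[seed])
    improved = True
    while improved:                                # greedy descent on the graph
        improved = False
        for nb in graph[cur]:
            nd = np.linalg.norm(x - centers[nb])
            if nd < cur_d:
                cur, cur_d, improved = nb, nd, True
    return cur

centers = rng.normal(size=(200, 16))
graph = build_center_knn(centers)
x, prev_center = rng.normal(size=16), 0            # seed with last iteration's label
print(seeded_assign(x, centers, graph, seed=prev_center))
```

Because points rarely change centers between Lloyd iterations, the seed is usually already near-optimal, so each reassignment touches only a handful of candidate centers.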
[LG-52] Graph Pseudotime Analysis and Neural Stochastic Differential Equations for Analyzing Retinal Degeneration Dynamics and Beyond
链接: https://arxiv.org/abs/2502.06126
作者: Dai Shi,Kuan Yan,Lequan Lin,Yue Zeng,Ting Zhang,Dmytro Matsypura,Mark C. Gillies,Ling Zhu,Junbin Gao
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Understanding disease progression at the molecular pathway level usually requires capturing both structural dependencies between pathways and the temporal dynamics of disease evolution. In this work, we solve the former challenge by developing a biologically informed graph-forming method to efficiently construct pathway graphs for subjects from our newly curated JR5558 mouse transcriptomics dataset. We then develop Graph-level Pseudotime Analysis (GPA) to infer graph-level trajectories that reveal how disease progresses at the population level, rather than in individual subjects. Based on the trajectories estimated by GPA, we identify the most sensitive pathways that drive disease stage transitions. In addition, we measure changes in pathway features using neural stochastic differential equations (SDEs), which enables us to formally define and compute pathway stability and disease bifurcation points (points of no return), two fundamental problems in disease progression research. We further extend our theory to the case when pathways can interact with each other, enabling a more comprehensive and multi-faceted characterization of disease phenotypes. The comprehensive experimental results demonstrate the effectiveness of our framework in reconstructing the dynamics of the pathway, identifying critical transitions, and providing novel insights into the mechanistic understanding of disease evolution.
[LG-53] Fine-Tuning Federated Learning-Based Intrusion Detection Systems for Transportation IoT
链接: https://arxiv.org/abs/2502.06099
作者: Robert Akinie,Nana Kankam Brym Gyimah,Mansi Bhavsar,John Kelly
类目: Machine Learning (cs.LG)
*备注: 7 pages, 4 figures. To be published in IEEE SouthEastCon 2025
点击查看摘要
Abstract:The rapid advancement of machine learning (ML) and on-device computing has revolutionized various industries, including transportation, through the development of Connected and Autonomous Vehicles (CAVs) and Intelligent Transportation Systems (ITS). These technologies improve traffic management and vehicle safety, but also introduce significant security and privacy concerns, such as cyberattacks and data breaches. Traditional Intrusion Detection Systems (IDS) are increasingly inadequate in detecting modern threats, leading to the adoption of ML-based IDS solutions. Federated Learning (FL) has emerged as a promising method for enabling the decentralized training of IDS models on distributed edge devices without sharing sensitive data. However, deploying FL-based IDS in CAV networks poses unique challenges, including limited computational and memory resources on edge devices, competing demands from critical applications such as navigation and safety systems, and the need to scale across diverse hardware and connectivity conditions. To address these issues, we propose a hybrid server-edge FL framework that offloads pre-training to a central server while enabling lightweight fine-tuning on edge devices. This approach reduces memory usage by up to 42%, decreases training times by up to 75%, and achieves competitive IDS accuracy of up to 99.2%. Scalability analyses further demonstrate minimal performance degradation as the number of clients increases, highlighting the framework’s feasibility for CAV networks and other IoT applications.
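A minimal sketch of the hybrid split described above, assuming a small feed-forward IDS model: the server pretrains the full network, while each edge device freezes the backbone and fine-tunes only a lightweight head. Dimensions, class count, and the omitted FL aggregation loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 5)                      # 5 traffic classes (assumed)

# --- server side: pretrain backbone + head on centralized data (omitted) ---

# --- edge device: freeze the backbone, fine-tune only the head locally ---
for p in backbone.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x, y = torch.randn(32, 40), torch.randint(0, 5, (32,))   # one local batch
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()
opt.step()
print(f"edge fine-tune loss: {loss.item():.3f}")
```

Freezing the backbone is what yields the memory and training-time savings the abstract reports, since gradients and optimizer states exist only for the small head.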
[LG-54] On the Computability of Multiclass PAC Learning
链接: https://arxiv.org/abs/2502.06089
作者: Pascale Gourdeau,Tosca Lechner,Ruth Urner
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the problem of computable multiclass learnability within the Probably Approximately Correct (PAC) learning framework of Valiant (1984). In the recently introduced computable PAC (CPAC) learning framework of Agarwal et al. (2020), both learners and the functions they output are required to be computable. We focus on the case of finite label space and start by proposing a computable version of the Natarajan dimension and showing that it characterizes CPAC learnability in this setting. We further generalize this result by establishing a meta-characterization of CPAC learnability for a certain family of dimensions: computable distinguishers. Distinguishers were defined by Ben-David et al. (1992) as a certain family of embeddings of the label space, with each embedding giving rise to a dimension. It was shown that the finiteness of each such dimension characterizes multiclass PAC learnability for finite label space in the non-computable setting. We show that the corresponding computable dimensions for distinguishers characterize CPAC learning. We conclude our analysis by proving that the DS dimension, which characterizes PAC learnability for infinite label space, cannot be expressed as a distinguisher (even in the case of finite label space).
[LG-55] Debiasing Guidance for Discrete Diffusion with Sequential Monte Carlo
链接: https://arxiv.org/abs/2502.06079
作者: Cheuk Kit Lee,Paul Jeha,Jes Frellsen,Pietro Lio,Michael Samuel Albergo,Francisco Vargas
类目: Machine Learning (cs.LG)
*备注: 29 pages, 14 figures
点击查看摘要
Abstract:Discrete diffusion models are a class of generative models that produce samples from an approximated data distribution within a discrete state space. Often, there is a need to target specific regions of the data distribution. Current guidance methods aim to sample from a distribution with mass proportional to p_0(x_0) p(\zeta|x_0)^\alpha but fail to achieve this in practice. We introduce a Sequential Monte Carlo algorithm that generates unbiasedly from this target distribution, utilising the learnt unconditional and guided process. We validate our approach on low-dimensional distributions, controlled images and text generations. For text generation, our method provides strong control while maintaining low perplexity compared to guidance-based approaches.
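On a toy discrete example, the core SMC correction looks as follows: particles proposed from the unconditional model are importance-weighted by p(zeta|x)^alpha and systematically resampled, yielding samples from the tilted target. The toy distributions below are stand-ins for a trained discrete diffusion model, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
states = np.arange(10)
p0 = np.full(10, 0.1)                        # unconditional model (toy)
lik = np.exp(-0.5 * (states - 7.0) ** 2)     # p(zeta | x), peaked at x = 7
alpha = 2.0

n = 1000
particles = rng.choice(states, size=n, p=p0)      # propose from the base model
w = lik[particles] ** alpha                        # reweight toward p0 * lik^alpha
w /= w.sum()

u = (rng.random() + np.arange(n)) / n              # systematic resampling
idx = np.minimum(np.searchsorted(np.cumsum(w), u), n - 1)
particles = particles[idx]
print(np.bincount(particles, minlength=10) / n)    # ~ normalized p0 * lik^alpha
```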
[LG-56] A Planning Framework for Adaptive Labeling
链接: https://arxiv.org/abs/2502.06076
作者: Daksh Mittal,Yuanzhe Ma,Shalmali Joshi,Hongseok Namkoong
类目: Machine Learning (cs.LG)
*备注: A conference version of this work appeared at 2024 Conference on Neural Information Processing Systems, titled "Adaptive Labeling for Efficient Out-of-distribution Model Evaluation’’
点击查看摘要
Abstract:Ground truth labels/outcomes are critical for advancing scientific and engineering applications, e.g., evaluating the treatment effect of an intervention or performance of a predictive model. Since randomly sampling inputs for labeling can be prohibitively expensive, we introduce an adaptive labeling framework where measurement effort can be reallocated in batches. We formulate this problem as a Markov decision process where posterior beliefs evolve over time as batches of labels are collected (state transition), and batches (actions) are chosen to minimize uncertainty at the end of data collection. We design a computational framework that is agnostic to different uncertainty quantification approaches including those based on deep learning, and allows a diverse array of policy gradient approaches by relying on continuous policy parameterizations. On real and synthetic datasets, we demonstrate that even a one-step lookahead policy can substantially outperform common adaptive labeling heuristics, highlighting the virtue of planning. On the methodological side, we note that standard REINFORCE-style policy gradient estimators can suffer high variance since they rely only on zeroth order information. We propose a direct backpropagation-based approach, Smoothed-Autodiff, based on a carefully smoothed version of the original non-differentiable MDP. Our method enjoys low variance at the price of introducing bias, and we theoretically and empirically show that this trade-off can be favorable.
[LG-57] ID policy (with reassignment) is asymptotically optimal for heterogeneous weakly-coupled MDPs
链接: https://arxiv.org/abs/2502.06072
作者: Xiangcheng Zhang,Yige Hong,Weina Wang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注: 37 pages
点击查看摘要
Abstract:Heterogeneity poses a fundamental challenge for many real-world large-scale decision-making problems but remains largely understudied. In this paper, we study the fully heterogeneous setting of a prominent class of such problems, known as weakly-coupled Markov decision processes (WCMDPs). Each WCMDP consists of N arms (or subproblems), which have distinct model parameters in the fully heterogeneous setting, leading to the curse of dimensionality when N is large. We show that, under mild assumptions, a natural adaptation of the ID policy, although originally proposed for a homogeneous special case of WCMDPs, in fact achieves an O(1/\sqrt{N}) optimality gap in long-run average reward per arm for fully heterogeneous WCMDPs as N becomes large. This is the first asymptotic optimality result for fully heterogeneous average-reward WCMDPs. Our techniques highlight the construction of a novel projection-based Lyapunov function, which witnesses the convergence of rewards and costs to an optimal region in the presence of heterogeneity.
[LG-58] Neural Shortest Path for Surface Reconstruction from Point Clouds
链接: https://arxiv.org/abs/2502.06047
作者: Yesom Park,Imseong Park,Jooyoung Hahn,Myungjoo Kang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this paper, we propose the neural shortest path (NSP), a vector-valued implicit neural representation (INR) that approximates a distance function and its gradient. The key feature of NSP is to learn the exact shortest path (ESP), which directs an arbitrary point to its nearest point on the target surface. The NSP is decomposed into its magnitude and direction, and a variable splitting method is used so that each decomposed component approximates the distance function and its gradient, respectively. Unlike existing methods that learn the distance function itself, the NSP ensures the simultaneous recovery of the distance function and its gradient. We mathematically prove that the decomposed representation of NSP guarantees the convergence of the magnitude of NSP in the H^1 norm. Furthermore, we devise a novel loss function that enforces the property of ESP, demonstrating that its global minimum is the ESP. We evaluate the performance of the NSP through comprehensive experiments on diverse datasets, validating its capacity to reconstruct high-quality surfaces with robustness to noise and data sparsity. The numerical results show substantial improvements over state-of-the-art methods, highlighting the importance of learning the ESP, the product of the distance function and its gradient, for representing a wide variety of complex surfaces.
[LG-59] Investigating Compositional Reasoning in Time Series Foundation Models
链接: https://arxiv.org/abs/2502.06037
作者: Willa Potosnak,Cristian Challu,Mononito Goswami,Kin G. Olivares,Michał Wiliński,Nina Żukowska,Artur Dubrawski
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large pre-trained time series foundation models (TSFMs) have demonstrated promising zero-shot performance across a wide range of domains. However, a question remains: Do TSFMs succeed solely by memorizing training patterns, or do they possess the ability to reason? While reasoning is a topic of great interest in the study of Large Language Models (LLMs), it is undefined and largely unexplored in the context of TSFMs. In this work, inspired by language modeling literature, we formally define compositional reasoning in forecasting and distinguish it from in-distribution generalization. We evaluate the reasoning and generalization capabilities of 23 popular deep learning forecasting models on multiple synthetic and real-world datasets. Additionally, through controlled studies, we systematically examine which design choices in TSFMs contribute to improved reasoning abilities. Our study yields key insights into the impact of TSFM architecture design on compositional reasoning and generalization. We find that patch-based Transformers have the best reasoning performance, closely followed by residualized MLP-based architectures, which are 97% less computationally complex in terms of FLOPs and 86% smaller in terms of the number of trainable parameters. Interestingly, in some zero-shot out-of-distribution scenarios, these models can outperform moving average and exponential smoothing statistical baselines trained on in-distribution data. Only a few design choices, such as the tokenization method, had a significant (negative) impact on Transformer model performance.
[LG-60] A Conditional Tabular GAN-Enhanced Intrusion Detection System for Rare Attacks in IoT Networks
链接: https://arxiv.org/abs/2502.06031
作者: Safaa Menssouri,El Mehdi Amhoud
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Internet of things (IoT) networks, boosted by 6G technology, are transforming various industries. However, their widespread adoption introduces significant security risks, particularly in detecting rare but potentially damaging cyber-attacks. This makes the development of robust IDS crucial for monitoring network traffic and ensuring their safety. Traditional IDS often struggle with detecting rare attacks due to severe class imbalances in IoT data. In this paper, we propose a novel two-stage system called conditional tabular generative synthetic minority data generation with deep neural network (CTGSM-DNN). In the first stage, a conditional tabular generative adversarial network (CTGAN) is employed to generate synthetic data for rare attack classes. In the second stage, the SMOTEENN method is applied to improve dataset quality. The full study was conducted using the CSE-CIC-IDS2018 dataset, and we assessed the performance of the proposed IDS using different evaluation metrics. The experimental results demonstrated the effectiveness of the proposed multiclass classifier, achieving an overall accuracy of 99.90% and 80% accuracy in detecting rare attacks.
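A hedged sketch of the two-stage pipeline on a toy table (not the CSE-CIC-IDS2018 schema): CTGAN synthesizes additional rows for the rare attack class, then SMOTEENN rebalances and cleans the augmented data before classifier training. Column names, sizes, and hyperparameters are placeholder assumptions.

```python
import numpy as np
import pandas as pd
from ctgan import CTGAN                      # pip install ctgan
from imblearn.combine import SMOTEENN       # pip install imbalanced-learn

rng = np.random.default_rng(0)
df = pd.DataFrame({                          # toy flow features, 10:1 imbalance
    "dur": rng.exponential(1.0, 5500),
    "bytes": rng.exponential(500.0, 5500),
    "label": ["benign"] * 5000 + ["rare_attack"] * 500,
})

rare = df[df["label"] == "rare_attack"]
gan = CTGAN(epochs=10)
gan.fit(rare, discrete_columns=["label"])    # stage 1: learn rare-class rows
aug = pd.concat([df, gan.sample(2000)], ignore_index=True)

X, y = aug[["dur", "bytes"]], aug["label"]
X_res, y_res = SMOTEENN().fit_resample(X, y) # stage 2: resample + clean
print(y_res.value_counts())                  # roughly balanced classes for the DNN
```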
[LG-61] Generating 3D Binding Molecules Using Shape-Conditioned Diffusion Models with Guidance
链接: https://arxiv.org/abs/2502.06027
作者: Ziqi Chen,Bo Peng,Tianhua Zhai,Daniel Adu-Ampratwum,Xia Ning
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by Nature Machine Intelligence
点击查看摘要
Abstract:Drug development is a critical but notoriously resource- and time-consuming process. In this manuscript, we develop a novel generative artificial intelligence (genAI) method DiffSMol to facilitate drug development. DiffSMol generates 3D binding molecules based on the shapes of known ligands. DiffSMol encapsulates geometric details of ligand shapes within pre-trained, expressive shape embeddings and then generates new binding molecules through a diffusion model. DiffSMol further modifies the generated 3D structures iteratively via shape guidance to better resemble the ligand shapes. It also tailors the generated molecules toward optimal binding affinities under the guidance of protein pockets. Here, we show that DiffSMol outperforms the state-of-the-art methods on benchmark datasets. When generating binding molecules resembling ligand shapes, DiffSMol with shape guidance achieves a success rate of 61.4%, substantially outperforming the best baseline (11.2%), meanwhile producing molecules with novel molecular graph structures. DiffSMol with pocket guidance also outperforms the best baseline in binding affinities by 13.2%, and even by 17.7% when combined with shape guidance. Case studies for two critical drug targets demonstrate very favorable physicochemical and pharmacokinetic properties of the generated molecules, thus, the potential of DiffSMol in developing promising drug candidates.
[LG-62] A Multimodal PDE Foundation Model for Prediction and Scientific Text Descriptions
链接: https://arxiv.org/abs/2502.06026
作者: Elisa Negrini,Yuxuan Liu,Liu Yang,Stanley J. Osher,Hayden Schaeffer
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:Neural networks are one tool for approximating non-linear differential equations used in scientific computing tasks such as surrogate modeling, real-time predictions, and optimal control. PDE foundation models utilize neural networks to train approximations to multiple differential equations simultaneously and are thus a general purpose solver that can be adapted to downstream tasks. Current PDE foundation models focus on either learning general solution operators and/or the governing system of equations, and thus only handle numerical or symbolic modalities. However, real-world applications may require more flexible data modalities, e.g. text analysis or descriptive outputs. To address this gap, we propose a novel multimodal deep learning approach that leverages a transformer-based architecture to approximate solution operators for a wide variety of ODEs and PDEs. Our method integrates numerical inputs, such as equation parameters and initial conditions, with text descriptions of physical processes or system dynamics. This enables our model to handle settings where symbolic representations may be incomplete or unavailable. In addition to providing accurate numerical predictions, our approach generates interpretable scientific text descriptions, offering deeper insights into the underlying dynamics and solution properties. The numerical experiments show that our model provides accurate solutions for in-distribution data (with average relative error less than 3.3%) and out-of-distribution data (average relative error less than 7.8%) together with precise text descriptions (with correct descriptions generated 100% of the time). In certain tests, the model is also shown to be capable of extrapolating solutions in time.
[LG-63] Decision Making in Hybrid Environments: A Model Aggregation Approach
链接: https://arxiv.org/abs/2502.05974
作者: Haolin Liu,Chen-Yu Wei,Julian Zimmert
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Recent work by Foster et al. (2021, 2022, 2023) and Xu and Zeevi (2023) developed the framework of decision estimation coefficient (DEC) that characterizes the complexity of general online decision making problems and provides a general algorithm design principle. These works, however, either focus on the pure stochastic regime where the world remains fixed over time, or the pure adversarial regime where the world arbitrarily changes over time. For the hybrid regime where the dynamics of the world is fixed while the reward arbitrarily changes, they only give pessimistic bounds on the decision complexity. In this work, we propose a general extension of DEC that more precisely characterizes this case. Besides applications in special cases, our framework leads to a flexible algorithm design where the learner learns over subsets of the hypothesis set, trading estimation complexity with decision complexity, which could be of independent interest. Our work covers model-based learning and model-free learning in the hybrid regime, with a newly proposed extension of the bilinear classes (Du et al., 2021) to the adversarial-reward case. We also recover some existing model-free learning results in the pure stochastic regime.
[LG-64] Known Unknowns: Out-of-Distribution Property Prediction in Materials and Molecules
链接: https://arxiv.org/abs/2502.05970
作者: Nofit Segal,Aviv Netanyahu,Kevin P. Greenman,Pulkit Agrawal,Rafael Gomez-Bombarelli
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Chemical Physics (physics.chem-ph)
*备注: 10 Pages, 5 figures, supporting information
点击查看摘要
Abstract:Discovery of high-performance materials and molecules requires identifying extremes with property values that fall outside the known distribution. Therefore, the ability to extrapolate to out-of-distribution (OOD) property values is critical for both solid-state materials and molecular design. Our objective is to train predictor models that extrapolate zero-shot to higher ranges than in the training data, given the chemical compositions of solids or molecular graphs and their property values. We propose using a transductive approach to OOD property prediction, achieving improvements in prediction accuracy. In particular, the True Positive Rate (TPR) of OOD classification of materials and molecules improved by 3x and 2.5x, respectively, and precision improved by 2x and 1.5x compared to non-transductive baselines. Our method leverages analogical input-target relations in the training and test sets, enabling generalization beyond the training target support, and can be applied to any other material and molecular tasks.
[LG-65] munit Scaling: Simple and Scalable FP8 LLM Training
链接: https://arxiv.org/abs/2502.05967
作者: Saaketh Narayan,Abhay Gupta,Mansheej Paul,Davis Blalock
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently possible to train in FP8 only if one is willing to tune various hyperparameters, reduce model scale, or accept the overhead of computing dynamic scale factors. We demonstrate simple, scalable FP8 training that requires no dynamic scaling factors or special hyperparameters, even at large model sizes. Our method, μnit Scaling (μS), also enables simple hyperparameter transfer across model widths, matched numerics across training and inference, and other desirable properties. μnit Scaling is straightforward to implement, consisting of a set of minimal interventions based on a first-principles analysis of common transformer operations. We validate our method by training models from 1B to 13B parameters, performing all hidden linear layer computations in FP8. We achieve quality equal to higher precision baselines while also training up to 33% faster.
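The following sketch shows the unit-scaling principle that μnit Scaling builds on, assuming a single linear layer: unit-variance weight initialization plus a fixed static output scale keeps activations near unit variance, which is what makes static FP8 casting safe. The exact μS rules and the μ-transfer machinery are not reproduced here.

```python
import torch
import torch.nn as nn

class UnitScaledLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in))  # unit-variance init
        self.scale = d_in ** -0.5                             # fixed static scale

    def forward(self, x):
        return nn.functional.linear(x, self.weight) * self.scale

x = torch.randn(1024, 512)
y = UnitScaledLinear(512, 512)(x)
print(x.std().item(), y.std().item())   # both near 1.0: no dynamic loss scaling needed
```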
[LG-66] Norm Augmented Graph AutoEncoders for Link Prediction ICASSP2025
链接: https://arxiv.org/abs/2502.05868
作者: Yunhui Liu,Huaisong Zhang,Xinyi Gao,Liuye Guo,Zhen Tao,Tieke He
类目: Machine Learning (cs.LG)
*备注: Accepted by ICASSP 2025
点击查看摘要
Abstract:Link Prediction (LP) is a crucial problem in graph-structured data. Graph Neural Networks (GNNs) have gained prominence in LP, with Graph AutoEncoders (GAEs) being a notable representation. However, our empirical findings reveal that GAEs’ LP performance suffers heavily from the long-tailed node degree distribution, i.e., low-degree nodes tend to exhibit inferior LP performance compared to high-degree nodes. What causes this degree-related bias, and how can it be mitigated? In this study, we demonstrate that the norm of node embeddings learned by GAEs exhibits variation among nodes with different degrees, underscoring its central significance in influencing the final performance of LP. Specifically, embeddings with larger norms tend to guide the decoder towards predicting higher scores for positive links and lower scores for negative links, thereby contributing to superior performance. This observation motivates us to improve GAEs’ LP performance on low-degree nodes by increasing their embedding norms, which can be implemented simply yet effectively by introducing additional self-loops into the training objective for low-degree nodes. This norm augmentation strategy can be seamlessly integrated into existing GAE methods with light computational cost. Extensive experiments on various datasets and GAE methods show the superior performance of norm-augmented GAEs.
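A minimal sketch of the proposed norm augmentation, under an assumed degree threshold and a standard inner-product decoder: self-loops of low-degree nodes are appended as extra positive edges in the reconstruction objective, which pushes their embedding norms up. The loss shown keeps only the positive part for brevity.

```python
import torch

def augment_with_self_loops(pos_edges, degrees, thresh=3):
    low = torch.nonzero(degrees < thresh).flatten()
    self_loops = torch.stack([low, low])              # (2, n_low) edges (v, v)
    return torch.cat([pos_edges, self_loops], dim=1)

def gae_loss(z, pos_edges):
    src, dst = pos_edges
    logits = (z[src] * z[dst]).sum(-1)                # inner-product decoder
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))              # positive-edge term only (sketch)

z = torch.randn(100, 16, requires_grad=True)          # embeddings from a GAE encoder
pos = torch.randint(0, 100, (2, 400))
deg = torch.bincount(pos.flatten(), minlength=100).float()
loss = gae_loss(z, augment_with_self_loops(pos, deg))
print(loss.item())
```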
[LG-67] Learning Accurate Efficient and Interpretable MLPs on Multiplex Graphs via Node-wise Multi-View Ensemble Distillation DASFAA2025
链接: https://arxiv.org/abs/2502.05864
作者: Yunhui Liu,Zhen Tao,Xiang Zhao,Jianhua Zhao,Tao Zheng,Tieke He
类目: Machine Learning (cs.LG)
*备注: Accepted by DASFAA 2025
点击查看摘要
Abstract:Multiplex graphs, with multiple edge types (graph views) among common nodes, provide richer structural semantics and better modeling capabilities. Multiplex Graph Neural Networks (MGNNs), typically comprising view-specific GNNs and a multi-view integration layer, have achieved advanced performance in various downstream tasks. However, their reliance on neighborhood aggregation poses challenges for deployment in latency-sensitive applications. Motivated by recent GNN-to-MLP knowledge distillation frameworks, we propose Multiplex Graph-Free Neural Networks (MGFNN and MGFNN+) to combine MGNNs’ superior performance and MLPs’ efficient inference via knowledge distillation. MGFNN directly trains student MLPs with node features as input and soft labels from teacher MGNNs as targets. MGFNN+ further employs a low-rank approximation-based reparameterization to learn node-wise coefficients, enabling adaptive knowledge ensemble from each view-specific GNN. This node-wise multi-view ensemble distillation strategy allows student MLPs to learn more informative multiplex semantic knowledge for different nodes. Experiments show that MGFNNs achieve average accuracy improvements of about 10% over vanilla MLPs and perform comparably or even better to teacher MGNNs (accurate); MGFNNs achieve a 35.40\times to 89.14\times speedup in inference over MGNNs (efficient); MGFNN+ adaptively assigns different coefficients for multi-view ensemble distillation regarding different nodes (interpretable).
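The distillation backbone of MGFNN can be sketched in a few lines: the student MLP consumes node features only and matches the teacher MGNN's soft labels with a KL objective. The node-wise multi-view coefficients of MGFNN+ are omitted, and the teacher logits below are random placeholders for illustration.

```python
import torch
import torch.nn.functional as F

n, d, c = 500, 64, 7
x = torch.randn(n, d)                        # node features (no graph needed)
teacher_logits = torch.randn(n, c)           # precomputed by the teacher MGNN

student = torch.nn.Sequential(
    torch.nn.Linear(d, 128), torch.nn.ReLU(), torch.nn.Linear(128, c))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):
    log_p = F.log_softmax(student(x), dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    loss = F.kl_div(log_p, q, reduction="batchmean")  # soft-label distillation
    opt.zero_grad(); loss.backward(); opt.step()
print(f"distillation loss: {loss.item():.4f}")
```

At inference the student needs no neighborhood aggregation at all, which is where the reported order-of-magnitude speedups come from.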
[LG-68] MetaML-Pro: Cross-Stage Design Flow Automation for Efficient Deep Learning Acceleration
链接: https://arxiv.org/abs/2502.05850
作者: Zhiqiang Que,Jose G. F. Coutinho,Ce Guo,Hongxiang Fan,Wayne Luk
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 25 pages, 19 figures
点击查看摘要
Abstract:This paper presents a unified framework for codifying and automating optimization strategies to efficiently deploy deep neural networks (DNNs) on resource-constrained hardware, such as FPGAs, while maintaining high performance, accuracy, and resource efficiency. Deploying DNNs on such platforms involves addressing the significant challenge of balancing performance, resource usage (e.g., DSPs and LUTs), and inference accuracy, which often requires extensive manual effort and domain expertise. Our novel approach addresses two key issues: cross-stage co-optimization and optimization search. By seamlessly integrating programmatic DNN optimization techniques with high-level synthesis (HLS)-based metaprogramming and leveraging advanced design space exploration (DSE) strategies like Bayesian optimization, the framework automates both top-down and bottom-up design flows, reducing the need for manual intervention and domain expertise. The proposed framework introduces customizable optimization, transformation, and control blocks to enhance DNN accelerator performance and resource efficiency. Experimental results demonstrate up to a 92% DSP and 89% LUT usage reduction for select networks, while preserving accuracy, along with a 15.6-fold reduction in optimization time compared to grid search. These results underscore the novelty and potential of the proposed framework for automated, resource-efficient DNN accelerator designs.
[LG-69] Devil is in the Details: Density Guidance for Detail-Aware Generation with Flow Models
链接: https://arxiv.org/abs/2502.05807
作者: Rafał Karczewski,Markus Heinonen,Vikas Garg
类目: Machine Learning (cs.LG)
*备注: 27 pages, 15 figures
点击查看摘要
Abstract:Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality images by mapping noise to a data distribution. However, recent findings suggest that image likelihood does not align with perceptual quality: high-likelihood samples tend to be smooth, while lower-likelihood ones are more detailed. Controlling sample density is thus crucial for balancing realism and detail. In this paper, we analyze an existing technique, Prior Guidance, which scales the latent code to influence image detail. We introduce score alignment, a condition that explains why this method works and show that it can be tractably checked for any continuous normalizing flow model. We then propose Density Guidance, a principled modification of the generative ODE that enables exact log-density control during sampling. Finally, we extend Density Guidance to stochastic sampling, ensuring precise log-density control while allowing controlled variation in structure or fine details. Our experiments demonstrate that these techniques provide fine-grained control over image detail without compromising sample quality.
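For intuition, here is a hedged sketch of Prior Guidance, the baseline technique the paper analyzes: scaling the initial latent by gamma < 1 steers sampling toward higher-likelihood, smoother outputs. The toy drift is a placeholder for a trained flow model; Density Guidance itself modifies the sampling ODE and is not reproduced here.

```python
import torch

def sample(model_ode, shape, gamma=0.9, steps=50):
    z = gamma * torch.randn(shape)        # gamma < 1: higher-likelihood, smoother
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor(i * dt)
        z = z + dt * model_ode(z, t)      # Euler integration of the flow ODE
    return z

# Toy "flow": a drift that contracts toward the origin (stand-in for a model).
toy_ode = lambda z, t: -z
img = sample(toy_ode, (1, 3, 8, 8))
print(img.abs().mean().item())
```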
[LG-70] I3S: Importance Sampling Subspace Selection for Low-Rank Optimization in LLM Pretraining
链接: https://arxiv.org/abs/2502.05790
作者: Haochen Zhang,Junze Yin,Guanchu Wang,Zirui Liu,Tianyi Zhang,Anshumali Shrivastava,Lin Yang,Vladimir Braverman
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs). Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing optimizer states. A key challenge in these methods is identifying suitable subspaces to ensure an effective optimization trajectory. Most existing approaches select the dominant subspace to preserve gradient information, as this intuitively provides the best approximation. However, we find that in practice, the dominant subspace stops changing during pretraining, thereby constraining weight updates to similar subspaces. In this paper, we propose importance sampling subspace selection (I3S) for low-rank optimization, which theoretically offers a comparable convergence rate to the dominant subspace approach. Empirically, we demonstrate that I3S significantly outperforms previous methods in LLM pretraining tasks.
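A hedged sketch contrasting I3S's selection idea with the dominant-subspace default: rather than always keeping the top-r singular vectors of a gradient, sample r directions with probability proportional to squared singular values, so the subspace keeps changing over training. The exact sampling distribution used by I3S is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subspace(grad, r):
    U, s, _ = np.linalg.svd(grad, full_matrices=False)
    probs = s**2 / np.sum(s**2)                      # assumed importance weights
    idx = rng.choice(len(s), size=r, replace=False, p=probs)
    return U[:, idx]                                 # (m, r) projection basis

G = rng.normal(size=(256, 128))                      # a weight-matrix gradient
P = sample_subspace(G, r=16)
low_rank_state = P.T @ G                             # optimizer states live in rank-16 space
print(P.shape, low_rank_state.shape)
```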
[LG-71] GOLD: Graph Out-of-Distribution Detection via Implicit Adversarial Latent Generation ICLR25
链接: https://arxiv.org/abs/2502.05780
作者: Danny Wang,Ruihong Qiu,Guangdong Bai,Zi Huang
类目: Machine Learning (cs.LG)
*备注: ICLR25
点击查看摘要
Abstract:Despite graph neural networks’ (GNNs) great success in modelling graph-structured data, out-of-distribution (OOD) test instances still pose a great challenge for current GNNs. One of the most effective techniques to detect OOD nodes is to expose the detector model with an additional OOD node-set, yet the extra OOD instances are often difficult to obtain in practice. Recent methods for image data address this problem using OOD data synthesis, typically relying on pre-trained generative models like Stable Diffusion. However, these approaches require vast amounts of additional data, as well as one-for-all pre-trained generative models, which are not available for graph data. Therefore, we propose the GOLD framework for graph OOD detection, an implicit adversarial learning pipeline with synthetic OOD exposure without pre-trained models. The implicit adversarial training process employs a novel alternating optimisation framework by training: (1) a latent generative model to regularly imitate the in-distribution (ID) embeddings from an evolving GNN, and (2) a GNN encoder and an OOD detector to accurately classify ID data while increasing the energy divergence between the ID embeddings and the generative model’s synthetic embeddings. This novel approach implicitly transforms the synthetic embeddings into pseudo-OOD instances relative to the ID data, effectively simulating exposure to OOD scenarios without auxiliary data. Extensive OOD detection experiments are conducted on five benchmark graph datasets, verifying the superior performance of GOLD without using real OOD data compared with the state-of-the-art OOD exposure and non-exposure baselines.
[LG-72] Privacy-Preserving Dataset Combination
链接: https://arxiv.org/abs/2502.05765
作者: Keren Fuentes,Mimee Xu,Irene Chen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*备注: 14 pages, 5 figures
点击查看摘要
Abstract:Access to diverse, high-quality datasets is crucial for machine learning model performance, yet data sharing remains limited by privacy concerns and competitive interests, particularly in regulated domains like healthcare. This dynamic especially disadvantages smaller organizations that lack resources to purchase data or negotiate favorable sharing agreements. We present SecureKL, a privacy-preserving framework that enables organizations to identify beneficial data partnerships without exposing sensitive information. Building on recent advances in dataset combination methods, we develop a secure multiparty computation protocol that maintains strong privacy guarantees while achieving 90% correlation with plaintext evaluations. In experiments with real-world hospital data, SecureKL successfully identifies beneficial data partnerships that improve model performance for intensive care unit mortality prediction while preserving data privacy. Our framework provides a practical solution for organizations seeking to leverage collective data resources while maintaining privacy and competitive advantages. These results demonstrate the potential for privacy-preserving data collaboration to advance machine learning applications in high-stakes domains while promoting more equitable access to data resources.
[LG-73] Filter Obstruct and Dilute: Defending Against Backdoor Attacks on Semi-Supervised Learning
链接: https://arxiv.org/abs/2502.05755
作者: Xinrui Wang,Chuanxing Geng,Wenhai Wan,Shao-yuan Li,Songcan Chen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent studies have verified that semi-supervised learning (SSL) is vulnerable to data poisoning backdoor attacks. Even a tiny fraction of contaminated training data is sufficient for adversaries to manipulate up to 90% of the test outputs in existing SSL methods. Given the emerging threat of backdoor attacks designed for SSL, this work aims to protect SSL against such risks, marking it as one of the few known efforts in this area. Specifically, we begin by identifying that the spurious correlations between the backdoor triggers and the target class implanted by adversaries are the primary cause of manipulated model predictions during the test phase. To disrupt these correlations, we utilize three key techniques: Gaussian Filter, complementary learning and trigger mix-up, which collectively filter, obstruct and dilute the influence of backdoor attacks in both data pre-processing and feature learning. Experimental results demonstrate that our proposed method, Backdoor Invalidator (BI), significantly reduces the average attack success rate from 84.7% to 1.8% across different state-of-the-art backdoor attacks. It is also worth mentioning that BI does not sacrifice accuracy on clean data and is supported by a theoretical guarantee of its generalization capability.
[LG-74] owards Autonomous Experimentation: Bayesian Optimization over Problem Formulation Space for Accelerated Alloy Development
链接: https://arxiv.org/abs/2502.05735
作者: Danial Khatamsaz,Joseph Wagner,Brent Vela,Raymundo Arroyave,Douglas L. Allaire
类目: ystems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Accelerated discovery in materials science demands autonomous systems capable of dynamically formulating and solving design problems. In this work, we introduce a novel framework that leverages Bayesian optimization over a problem formulation space to identify optimal design formulations in line with decision-maker preferences. By mapping various design scenarios to a multi-attribute utility function, our approach enables the system to balance conflicting objectives such as ductility, yield strength, density, and solidification range without requiring an exact problem definition at the outset. We demonstrate the efficacy of our method through an in silico case study on a Mo-Nb-Ti-V-W alloy system targeted for gas turbine engine blade applications. The framework converges on a sweet spot that satisfies critical performance thresholds, illustrating that integrating problem formulation discovery into the autonomous design loop can significantly streamline the experimental process. Future work will incorporate human feedback to further enhance the adaptability of the system in real-world experimental settings.
[LG-75] Impact of Data Poisoning Attacks on Feasibility and Optimality of Neural Power System Optimizers
链接: https://arxiv.org/abs/2502.05727
作者: Nora Agah,Meiyi Li,Javad Mohammadi
类目: Machine Learning (cs.LG)
*备注: 6 pages, 4 figures
点击查看摘要
Abstract:The increased integration of clean yet stochastic energy resources and the growing number of extreme weather events are narrowing the decision-making window of power grid operators. This time constraint is fueling a plethora of research on Machine Learning-, or ML-, based optimization proxies. While finding a fast solution is appealing, the inherent vulnerabilities of the learning-based methods are hindering their adoption. One of these vulnerabilities is data poisoning attacks, which add perturbations to ML training data, leading to incorrect decisions. The impact of poisoning attacks on learning-based power system optimizers has not been thoroughly studied, which creates a critical vulnerability. In this paper, we examine the impact of data poisoning attacks on ML-based optimization proxies that are used to solve the DC Optimal Power Flow problem. Specifically, we compare the resilience of three different methods (a penalty-based method, a post-repair approach, and a direct mapping approach) against the adverse effects of poisoning attacks. We use the optimality and feasibility of these proxies as performance metrics. The insights of this work will establish a foundation for enhancing the resilience of neural power system optimizers.
[LG-76] Improving Environment Novelty Quantification for Effective Unsupervised Environment Design
链接: https://arxiv.org/abs/2502.05726
作者: Jayden Teoh,Wenjun Li,Pradeep Varakantham
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Unsupervised Environment Design (UED) formalizes the problem of autocurricula through interactive training between a teacher agent and a student agent. The teacher generates new training environments with high learning potential, curating an adaptive curriculum that strengthens the student’s ability to handle unseen scenarios. Existing UED methods mainly rely on regret, a metric that measures the difference between the agent’s optimal and actual performance, to guide curriculum design. Regret-driven methods generate curricula that progressively increase environment complexity for the student but overlook environment novelty – a critical element for enhancing an agent’s generalizability. Measuring environment novelty is especially challenging due to the underspecified nature of environment parameters in UED, and existing approaches face significant limitations. To address this, this paper introduces the Coverage-based Evaluation of Novelty In Environment (CENIE) framework. CENIE proposes a scalable, domain-agnostic, and curriculum-aware approach to quantifying environment novelty by leveraging the student’s state-action space coverage from previous curriculum experiences. We then propose an implementation of CENIE that models this coverage and measures environment novelty using Gaussian Mixture Models. By integrating both regret and novelty as complementary objectives for curriculum design, CENIE facilitates effective exploration across the state-action space while progressively increasing curriculum complexity. Empirical evaluations demonstrate that augmenting existing regret-based UED algorithms with CENIE achieves state-of-the-art performance across multiple benchmarks, underscoring the effectiveness of novelty-driven autocurricula for robust generalization.
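CENIE 的核心是用学生智能体以往课程经验的状态-动作覆盖来量化环境新颖性。下面是按这一思路写的最小示意:用高斯混合模型(GMM)拟合历史 (state, action) 对,把候选环境轨迹的平均负对数似然作为新颖性分数;组件数、维度等均为假设的演示参数。

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class CoverageNovelty:
    """用 GMM 近似历史状态-动作覆盖,平均负对数似然越高表示环境越新颖。"""

    def __init__(self, n_components=8, seed=0):
        self.gmm = GaussianMixture(n_components=n_components, random_state=seed)

    def fit(self, past_state_actions):        # 形状: (N, state_dim + action_dim)
        self.gmm.fit(past_state_actions)
        return self

    def novelty(self, new_state_actions):     # 形状: (M, state_dim + action_dim)
        return -self.gmm.score_samples(new_state_actions).mean()

history = np.random.randn(5000, 6)            # 以往课程中收集的 (s, a)
candidate = np.random.randn(200, 6) + 2.0     # 候选新环境中采到的 (s, a)
scorer = CoverageNovelty().fit(history)
print(scorer.novelty(candidate))              # 分数越高 => 覆盖越少 => 越新颖
```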
[LG-77] Explainable and Class-Revealing Signal Feature Extraction via Scattering Transform and Constrained Zeroth-Order Optimization
链接: https://arxiv.org/abs/2502.05722
作者: Naoki Saito,David Weber
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 5 pages; 6 figures; submitted to 2025 IEEE Statistical Signal Processing Workshop
点击查看摘要
Abstract:We propose a new method to extract discriminant and explainable features from a particular machine learning model, i.e., a combination of the scattering transform and the multiclass logistic regression. Although this model is well-known for its ability to learn various signal classes with high classification rates, it remains elusive to understand why it generates such successful classifications, mainly due to the nonlinearity of the scattering transform. In order to uncover the meaning of the scattering transform coefficients selected by the multiclass logistic regression (with the Lasso penalty), we adopt zeroth-order optimization algorithms to search for an input pattern that maximizes the class probability of a class of interest given the learned model. In order to do so, it turns out that imposing sparsity and smoothness on input patterns is important. We demonstrate the effectiveness of our proposed method using a couple of synthetic time-series classification problems.
[LG-78] Using agent-based models and EXplainable Artificial Intelligence (XAI) to simulate social behaviors and policy intervention scenarios: A case study of private well users in Ireland
链接: https://arxiv.org/abs/2502.05718
作者: Rabia Asghar,Simon Mooney,Eoin O Neill,Paul Hynds
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Around 50 percent of Ireland's rural population relies on unregulated private wells vulnerable to agricultural runoff and untreated wastewater. High national rates of Shiga toxin-producing Escherichia coli (STEC) and other waterborne illnesses have been linked to well water exposure. Periodic well testing is essential for public health, yet the lack of government incentives places the financial burden on households. Understanding environmental, cognitive, and material factors influencing well-testing behavior is critical. This study employs Agent-Based Modeling (ABM) to simulate policy interventions based on national survey data. The ABM framework, designed for private well-testing behavior, integrates a Deep Q-network reinforcement learning model and Explainable AI (XAI) for decision-making insights. Key features were selected using Recursive Feature Elimination (RFE) with 10-fold cross-validation, while SHAP (Shapley Additive Explanations) provided further interpretability for policy recommendations. Fourteen policy scenarios were tested. The most effective, Free Well Testing plus Communication Campaign, increased participation to 435 out of 561 agents, from a baseline of approximately 5 percent, with rapid behavioral adaptation. Free Well Testing plus Regulation also performed well, with 433 out of 561 agents initiating well testing. Free testing alone raised participation to over 75 percent, with some agents testing multiple times annually. Scenarios with free well testing achieved faster learning efficiency, converging in 1000 episodes, while others took 2000 episodes, indicating slower adaptation. This research demonstrates the value of ABM and XAI in public health policy, providing a framework for evaluating behavioral interventions in environmental health.
[LG-79] Flow-based Conformal Prediction for Multi-dimensional Time Series
链接: https://arxiv.org/abs/2502.05709
作者: Junghwan Lee,Chen Xu,Yao Xie
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Conformal prediction for time series presents two key challenges: (1) leveraging sequential correlations in features and non-conformity scores and (2) handling multi-dimensional outcomes. We propose a novel conformal prediction method to address these two key challenges by integrating Transformer and Normalizing Flow. Specifically, the Transformer encodes the historical context of time series, and normalizing flow learns the transformation from the base distribution to the distribution of non-conformity scores conditioned on the encoded historical context. This enables the construction of prediction regions by transforming samples from the base distribution using the learned conditional flow. We ensure the marginal coverage by defining the prediction regions as sets in the transformed space that correspond to a predefined probability mass in the base distribution. The model is trained end-to-end by Flow Matching, avoiding the need for computationally intensive numerical solutions of ordinary differential equations. We demonstrate that our proposed method achieves smaller prediction regions compared to the baselines while satisfying the desired coverage through comprehensive experiments using simulated and real-world time series datasets.
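该方法的关键构造是:预测区域取为基分布中给定概率质量集合经已学 flow 的原像。下面用一个一维仿射 flow 的玩具例子示意"候选分数是否落入预测区域"的判定逻辑;flow 的形式与参数 mu、sigma 均为假设,真实方法中它们由 Transformer 编码的历史上下文给出。

```python
from scipy.stats import norm

def in_prediction_region(score, mu, sigma, coverage=0.9):
    """把分数映回基分布空间,检查是否落在覆盖率为 coverage 的中心区间内。"""
    z = (score - mu) / sigma                  # 仿射 flow 的逆变换
    z_max = norm.ppf(0.5 + coverage / 2)      # 90% 覆盖对应 |z| <= 1.645
    return abs(z) <= z_max

# 用法示意:上下文给出 mu=0.3、sigma=0.2,判断非一致性分数 0.5
print(in_prediction_region(0.5, mu=0.3, sigma=0.2))   # True
```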
[LG-80] GWRF: A Generalizable Wireless Radiance Field for Wireless Signal Propagation Modeling
链接: https://arxiv.org/abs/2502.05708
作者: Kang Yang,Yuning Chen,Wan Du
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present Generalizable Wireless Radiance Fields (GWRF), a framework for modeling wireless signal propagation at arbitrary 3D transmitter and receiver positions. Unlike previous methods that adapt vanilla Neural Radiance Fields (NeRF) from the optical to the wireless signal domain, requiring extensive per-scene training, GWRF generalizes effectively across scenes. First, a geometry-aware Transformer encoder-based wireless scene representation module incorporates information from geographically proximate transmitters to learn a generalizable wireless radiance field. Second, a neural-driven ray tracing algorithm operates on this field to automatically compute signal reception at the receiver. Experimental results demonstrate that GWRF outperforms existing methods on single scenes and achieves state-of-the-art performance on unseen scenes.
[LG-81] TOKON: TOKenization-Optimized Normalization for time series analysis with a large language model
链接: https://arxiv.org/abs/2502.05701
作者: Janghoon Yang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:While large language models have rapidly evolved towards general artificial intelligence, their versatility in analyzing time series data remains limited. To address this limitation, we propose a novel normalization technique that considers the inherent nature of tokenization. The proposed Tokenization-Optimized Normalization (TOKON) simplifies time series data by representing each element with a single token, effectively reducing the number of tokens by 2 to 3 times. Additionally, we introduce a novel prompt for time series forecasting, termed Time Series Forecasting with Care (TFSC), to further enhance forecasting performance. Experimental results demonstrate that TOKON improves root mean square error (RMSE) for multi-step forecasting by approximately 7% to 18%, depending on the dataset and prompting method. Furthermore, TFSC, when used in conjunction with TOKON, shows additional improvements in forecasting accuracy for certain datasets.
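TOKON 的要点是让时间序列的每个元素只占一个 token。下面是按这一思想写的极简示意:把数值线性缩放为 0 到 999 的整数后再拼入 prompt(取值范围与具体格式均为假设,并非论文的确切方案):

```python
def tokenize_series(values, vocab=1000):
    """把每个数值缩放为 [0, vocab-1] 的整数,使其更可能落入单个 token。"""
    lo, hi = min(values), max(values)
    scale = (vocab - 1) / (hi - lo) if hi > lo else 0.0
    return [round((v - lo) * scale) for v in values], lo, hi

def detokenize(tokens, lo, hi, vocab=1000):
    """把整数 token 近似还原回原始量纲。"""
    scale = (hi - lo) / (vocab - 1)
    return [t * scale + lo for t in tokens]

toks, lo, hi = tokenize_series([12.3, 15.8, 14.1, 19.6])
print(toks)                        # 例如 [0, 479, 246, 999]
print(detokenize(toks, lo, hi))    # 近似还原 [12.3, 15.8, ...]
```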
[LG-82] Federated Learning with Reservoir State Analysis for Time Series Anomaly Detection IJCNN2025
链接: https://arxiv.org/abs/2502.05679
作者: Keigo Nogami,Tamura Hiroto,Gouhei Tanaka
类目: Machine Learning (cs.LG)
*备注: 8 pages, 16 figures, submitted to IJCNN 2025
点击查看摘要
Abstract:With a growing data privacy concern, federated learning has emerged as a promising framework to train machine learning models without sharing locally distributed data. In federated learning, local model training by multiple clients and model integration by a server are repeated only through model parameter sharing. Most existing federated learning methods assume training deep learning models, which are often computationally demanding. To deal with this issue, we propose federated learning methods with reservoir state analysis to seek computational efficiency and data privacy protection simultaneously. Specifically, our method relies on Mahalanobis Distance of Reservoir States (MD-RS) method targeting time series anomaly detection, which learns a distribution of reservoir states for normal inputs and detects anomalies based on a deviation from the learned distribution. Iterative updating of statistical parameters in the MD-RS enables incremental federated learning (IncFed MD-RS). We evaluate the performance of IncFed MD-RS using benchmark datasets for time series anomaly detection. The results show that IncFed MD-RS outperforms other federated learning methods with deep learning and reservoir computing models particularly when clients’ data are relatively short and heterogeneous. We demonstrate that IncFed MD-RS is robust against reduced sample data compared to other methods. We also show that the computational cost of IncFed MD-RS can be reduced by subsampling from the reservoir states without performance degradation. The proposed method is beneficial especially in anomaly detection applications where computational efficiency, algorithm simplicity, and low communication cost are required.
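MD-RS 的做法是为正常输入学习 reservoir 状态的分布,再以马氏距离的偏离程度打异常分;由于均值与协方差都能增量更新,联邦场景下只需交换统计参数。下面给出一个按此思路的最小示意(Welford 式增量更新,维度与数据均为演示假设):

```python
import numpy as np

class IncrementalMahalanobis:
    """增量维护 reservoir 状态的均值/协方差,用马氏距离作异常分数。"""

    def __init__(self, dim):
        self.n, self.mean = 0, np.zeros(dim)
        self.M2 = np.zeros((dim, dim))              # 未归一化散布矩阵

    def update(self, x):                            # x: 正常输入对应的 reservoir 状态
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, x - self.mean)   # Welford 式更新

    def score(self, x):                             # 分数越大越异常
        cov = self.M2 / max(self.n - 1, 1) + 1e-6 * np.eye(len(x))
        diff = x - self.mean
        return float(diff @ np.linalg.inv(cov) @ diff)

det = IncrementalMahalanobis(dim=4)
for state in np.random.randn(1000, 4):              # 用正常数据拟合分布
    det.update(state)
print(det.score(np.ones(4) * 5))                    # 偏离分布的点得到高分
```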
[LG-83] Surprise Potential as a Measure of Interactivity in Driving Scenarios
链接: https://arxiv.org/abs/2502.05677
作者: Wenhao Ding,Sushant Veer,Karen Leung,Yulong Cao,Marco Pavone
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 10 pages, 8 figures
点击查看摘要
Abstract:Validating the safety and performance of an autonomous vehicle (AV) requires benchmarking on real-world driving logs. However, typical driving logs contain mostly uneventful scenarios with minimal interactions between road users. Identifying interactive scenarios in real-world driving logs enables the curation of datasets that amplify critical signals and provide a more accurate assessment of an AV’s performance. In this paper, we present a novel metric that identifies interactive scenarios by measuring an AV’s surprise potential on others. First, we identify three dimensions of the design space to describe a family of surprise potential measures. Second, we exhaustively evaluate and compare different instantiations of the surprise potential measure within this design space on the nuScenes dataset. To determine how well a surprise potential measure correctly identifies an interactive scenario, we use a reward model learned from human preferences to assess alignment with human intuition. Our proposed surprise potential, arising from this exhaustive comparative study, achieves a correlation of more than 0.82 with the human-aligned reward function, outperforming existing approaches. Lastly, we validate motion planners on curated interactive scenarios to demonstrate downstream applications.
[LG-84] The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks
链接: https://arxiv.org/abs/2502.05668
作者: Sholom Schechtman,Nicolas Schreuder
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We analyze the implicit bias of constant step stochastic subgradient descent (SGD). We consider the setting of binary classification with homogeneous neural networks - a large class of deep neural networks with ReLU-type activation functions such as MLPs and CNNs without biases. We interpret the dynamics of normalized SGD iterates as an Euler-like discretization of a conservative field flow that is naturally associated to the normalized classification margin. Owing to this interpretation, we show that normalized SGD iterates converge to the set of critical points of the normalized margin at late-stage training (i.e., assuming that the data is correctly classified with positive normalized margin). To our knowledge, this is the first extension of the analysis of Lyu and Li (2020) on the discrete dynamics of gradient descent to the nonsmooth and stochastic setting. Our main result applies to binary classification with exponential or logistic losses. We additionally discuss extensions to more general settings.
[LG-85] Flowing Through Layers: A Continuous Dynamical Systems Perspective on Transformers
链接: https://arxiv.org/abs/2502.05656
作者: Jacob Fein-Ashley
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
点击查看摘要
Abstract:We show that the standard discrete update rule of transformer layers can be naturally interpreted as a forward Euler discretization of a continuous dynamical system. Our Transformer Flow Approximation Theorem demonstrates that, under standard Lipschitz continuity assumptions, token representations converge uniformly to the unique solution of an ODE as the number of layers grows. Moreover, if the underlying mapping satisfies a one-sided Lipschitz condition with a negative constant, the resulting dynamics are contractive, causing perturbations to decay exponentially across layers. Beyond clarifying the empirical stability and expressivity of transformer models, these insights link transformer updates to a broader iterative reasoning framework, suggesting new avenues for accelerated convergence and architectural innovations inspired by dynamical systems theory.
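论文把 transformer 的逐层残差更新看作 ODE 的前向 Euler 离散化:x_{l+1} = x_l + h·f(x_l)。下面用一个数值小例子示意这一对应:层数越多(步长越小),离散迭代越接近连续解;其中 f 取一个满足"负常数单侧 Lipschitz"条件的收缩场,仅为演示假设。

```python
import numpy as np

def f(x):
    return -x + np.tanh(x)            # 收缩场:扰动沿层指数衰减

def run_layers(x0, num_layers, T=4.0):
    """把 num_layers 层残差更新视为步长 h = T/num_layers 的前向 Euler。"""
    h = T / num_layers
    x = x0.copy()
    for _ in range(num_layers):
        x = x + h * f(x)              # 对应 transformer 的残差更新
    return x

x0 = np.array([2.0, -1.5])
for L in (4, 16, 64, 256):            # 层数增加时输出趋于同一 ODE 解
    print(L, run_layers(x0, L))
```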
[LG-86] ETHEREAL: Energy-efficient and High-throughput Inference using Compressed Tsetlin Machine
链接: https://arxiv.org/abs/2502.05640
作者: Shengyu Duan,Rishad Shafik,Alex Yakovlev
类目: Machine Learning (cs.LG)
*备注: Accepted as a full paper by the 2025 EDGE AI FOUNDATION Austin
点击查看摘要
Abstract:The Tsetlin Machine (TM) is a novel alternative to deep neural networks (DNNs). Unlike DNNs, which rely on multi-path arithmetic operations, a TM learns propositional logic patterns from data literals using Tsetlin automata. This fundamental shift from arithmetic to logic underpinning makes TM suitable for empowering new applications with low-cost implementations. In TM, literals are often included by both positive and negative clauses within the same class, canceling out their impact on individual class definitions. This property can be exploited to develop compressed TM models, enabling energy-efficient and high-throughput inferences for machine learning (ML) applications. We introduce a training approach that incorporates excluded automata states to sparsify TM logic patterns in both positive and negative clauses. This exclusion is iterative, ensuring that highly class-correlated (and therefore significant) literals are retained in the compressed inference model, ETHEREAL, to maintain strong classification accuracy. Compared to standard TMs, ETHEREAL TM models can reduce model size by up to 87.54%, with only a minor accuracy compromise. We validate the impact of this compression on eight real-world Tiny machine learning (TinyML) datasets against standard TM, equivalent Random Forest (RF) and Binarized Neural Network (BNN) on the STM32F746G-DISCO platform. Our results show that ETHEREAL TM models achieve over an order of magnitude reduction in inference time (resulting in higher throughput) and energy consumption compared to BNNs, while maintaining a significantly smaller memory footprint compared to RFs.
[LG-87] Mol-MoE: Training Preference-Guided Routers for Molecule Generation
链接: https://arxiv.org/abs/2502.05633
作者: Diego Calanzone,Pierluca D’Oro,Pierre-Luc Bacon
类目: Machine Learning (cs.LG)
*备注: We release our code and data at: this https URL
点击查看摘要
Abstract:Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.
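Mol-MoE 的关键部件是一个以用户偏好为条件的 router:测试时把偏好向量映射成专家混合权重,直接加权组合各单目标专家的输出,而无需重训。下面是一个结构性最小示意,其中"专家"用随机线性头代替真实的分子生成模型,网络结构与维度均为假设:

```python
import torch
import torch.nn as nn

class PreferenceRouter(nn.Module):
    """把偏好向量映射为专家混合权重,再加权组合各专家输出的 logits。"""

    def __init__(self, num_experts, pref_dim, vocab_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pref_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_experts),
        )
        # 此处每个"专家"用随机线性头代替真实的单目标分子生成模型
        self.experts = nn.ModuleList(
            nn.Linear(16, vocab_size) for _ in range(num_experts)
        )

    def forward(self, hidden_state, preference):
        weights = torch.softmax(self.net(preference), dim=-1)              # (B, E)
        logits = torch.stack([e(hidden_state) for e in self.experts], 1)   # (B, E, V)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)                 # (B, V)

model = PreferenceRouter(num_experts=3, pref_dim=3, vocab_size=100)
h = torch.randn(2, 16)                                   # 共享主干的隐状态
pref = torch.tensor([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])  # 两组目标权衡
print(model(h, pref).shape)                              # torch.Size([2, 100])
```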
[LG-88] TrackDiffuser: Nearly Model-Free Bayesian Filtering with Diffusion Model
链接: https://arxiv.org/abs/2502.05629
作者: Yangguang He,Wenhao Li,Minzhe Li,Juan Zhang,Xiangfeng Wang,Bo Jin
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:State estimation remains a fundamental challenge across numerous domains, from autonomous driving and aircraft tracking to quantum system control. Although Bayesian filtering has been the cornerstone solution, its classical model-based paradigm faces two major limitations: it struggles with an inaccurate state space model (SSM) and requires extensive prior knowledge of noise characteristics. We present TrackDiffuser, a generative framework addressing both challenges by reformulating Bayesian filtering as a conditional diffusion model. Our approach implicitly learns system dynamics from data to mitigate the effects of inaccurate SSM, while simultaneously circumventing the need for explicit measurement models and noise priors by establishing a direct relationship between measurements and states. Through an implicit predict-and-update mechanism, TrackDiffuser preserves the interpretability advantage of traditional model-based filtering methods. Extensive experiments demonstrate that our framework substantially outperforms both classical and contemporary hybrid methods, especially in challenging non-linear scenarios involving non-Gaussian noises. Notably, TrackDiffuser exhibits remarkable robustness to SSM inaccuracies, offering a practical solution for real-world state estimation problems where perfect models and prior knowledge are unavailable.
[LG-89] Training-Free Constrained Generation With Stable Diffusion Models
链接: https://arxiv.org/abs/2502.05625
作者: Stefano Zampini,Jacob Christopher,Luca Oneto,Davide Anguita,Ferdinando Fioretto
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Stable diffusion models represent the state-of-the-art in data synthesis across diverse domains and hold transformative potential for applications in science and engineering, e.g., by facilitating the discovery of novel solutions and simulating systems that are computationally intractable to model explicitly. However, their current utility in these fields is severely limited by an inability to enforce strict adherence to physical laws and domain-specific constraints. Without this grounding, the deployment of such models in critical applications, ranging from material science to safety-critical systems, remains impractical. This paper addresses this fundamental limitation by proposing a novel approach to integrate stable diffusion models with constrained optimization frameworks, enabling them to generate outputs that satisfy stringent physical and functional requirements. We demonstrate the effectiveness of this approach through material science experiments requiring adherence to precise morphometric properties, inverse design problems involving the generation of stress-strain responses using video generation with a simulator in the loop, and safety settings where outputs must avoid copyright infringement.
[LG-90] Mixing Time of the Proximal Sampler in Relative Fisher Information via Strong Data Processing Inequality
链接: https://arxiv.org/abs/2502.05623
作者: Andre Wibisono
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:We study the mixing time guarantee for sampling in relative Fisher information via the Proximal Sampler algorithm, which is an approximate proximal discretization of the Langevin dynamics. We show that when the target probability distribution is strongly log-concave, the relative Fisher information converges exponentially fast along the Proximal Sampler; this matches the exponential convergence rate of the relative Fisher information along the continuous-time Langevin dynamics for strongly log-concave target. When combined with a standard implementation of the Proximal Sampler via rejection sampling, this exponential convergence rate provides a high-accuracy iteration complexity guarantee for the Proximal Sampler in relative Fisher information when the target distribution is strongly log-concave and log-smooth. Our proof proceeds by establishing a strong data processing inequality for relative Fisher information along the Gaussian channel under strong log-concavity, and a data processing inequality along the reverse Gaussian channel for a special distribution. The forward and reverse Gaussian channels compose to form the Proximal Sampler, and these data processing inequalities imply the exponential convergence rate of the relative Fisher information along the Proximal Sampler.
[LG-91] Online Bidding Algorithms with Strict Return on Spend (ROS) Constraint
链接: https://arxiv.org/abs/2502.05599
作者: Rahul Vaze,Abhishek Sinha
类目: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The auto-bidding problem under a strict return-on-spend constraint (ROSC) is considered, where an algorithm has to make decisions about how much to bid for an ad slot depending on the revealed value, and the hidden allocation and payment function that describes the probability of winning the ad slot depending on its bid. The objective of an algorithm is to maximize the expected utility (product of ad value and probability of winning the ad slot) summed across all time slots subject to the total expected payment being less than the total expected utility, called the ROSC. A (surprising) impossibility result is derived that shows that no online algorithm can achieve a sub-linear regret even when the value, allocation and payment function are drawn i.i.d. from an unknown distribution. The problem is non-trivial even when the revealed value remains constant across time slots, and an algorithm with a regret guarantee that is optimal up to a logarithmic factor is derived.
[LG-92] Democratic Training Against Universal Adversarial Perturbations
链接: https://arxiv.org/abs/2502.05542
作者: Bing Sun,Jun Sun,Wei Zhao
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Despite their advances and success, real-world deep neural networks are known to be vulnerable to adversarial attacks. Universal adversarial perturbation, an input-agnostic attack, poses a serious threat for them to be deployed in security-sensitive systems. In this case, a single universal adversarial perturbation deceives the model on a range of clean inputs without requiring input-specific optimization, which makes it particularly threatening. In this work, we observe that universal adversarial perturbations usually lead to abnormal entropy spectrum in hidden layers, which suggests that the prediction is dominated by a small number of "features" in such cases (rather than democratically by many features). Inspired by this, we propose an efficient yet effective defense method for mitigating UAPs called Democratic Training, which performs entropy-based model enhancement to suppress the effect of the universal adversarial perturbations in a given model. Democratic Training is evaluated with 7 neural networks trained on 5 benchmark datasets and 5 types of state-of-the-art universal adversarial attack methods. The results show that it effectively reduces the attack success rate, improves model robustness and preserves the model accuracy on clean samples.
[LG-93] Do Spikes Protect Privacy? Investigating Black-Box Model Inversion Attacks in Spiking Neural Networks
链接: https://arxiv.org/abs/2502.05509
作者: Hamed Poursiami,Ayana Moshruba,Maryam Parsa
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
*备注: 7 pages, 4 figures
点击查看摘要
Abstract:As machine learning models become integral to security-sensitive applications, concerns over data leakage from adversarial attacks continue to rise. Model Inversion (MI) attacks pose a significant privacy threat by enabling adversaries to reconstruct training data from model outputs. While MI attacks on Artificial Neural Networks (ANNs) have been widely studied, Spiking Neural Networks (SNNs) remain largely unexplored in this context. Due to their event-driven and discrete computations, SNNs introduce fundamental differences in information processing that may offer inherent resistance to such attacks. A critical yet underexplored aspect of this threat lies in black-box settings, where attackers operate through queries without direct access to model parameters or gradients, representing a more realistic adversarial scenario in deployed systems. This work presents the first study of black-box MI attacks on SNNs. We adapt a generative adversarial MI framework to the spiking domain by incorporating rate-based encoding for input transformation and decoding mechanisms for output interpretation. Our results show that SNNs exhibit significantly greater resistance to MI attacks than ANNs, as demonstrated by degraded reconstructions, increased instability in attack convergence, and overall reduced attack effectiveness across multiple evaluation metrics. Further analysis suggests that the discrete and temporally distributed nature of SNN decision boundaries disrupts surrogate modeling, limiting the attacker's ability to approximate the target model.
[LG-94] Feature Explosion: a generic optimization strategy for outlier detection algorithms
链接: https://arxiv.org/abs/2502.05496
作者: Qi Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Outlier detection tasks aim at discovering potential issues or opportunities and are widely used in cybersecurity, financial security, industrial inspection, etc. To date, thousands of outlier detection algorithms have been proposed. Clearly, in real-world scenarios, such a large number of algorithms is unnecessary. In other words, a large number of outlier detection algorithms are redundant. We believe the root cause of this redundancy lies in the current highly customized (i.e., non-generic) optimization strategies. Specifically, when researchers seek to improve the performance of existing outlier detection algorithms, they have to design separate optimized versions tailored to the principles of each algorithm, leading to an ever-growing number of outlier detection algorithms. To address this issue, in this paper, we introduce the concept of explosion from physics into the outlier detection task and propose a generic optimization strategy based on feature explosion, called OSD (Optimization Strategy for outlier Detection algorithms). In the future, when improving the performance of existing outlier detection algorithms, it will be sufficient to invoke the OSD plugin without the need to design customized optimized versions for them. We compared the performances of 14 outlier detection algorithms on 24 datasets before and after invoking the OSD plugin. The experimental results show that the performances of all outlier detection algorithms are improved on almost all datasets. In terms of average accuracy, OSD improves these outlier detection algorithms by 15% (AUC) and 63.7% (AP).
[LG-95] Modeling of Core Loss Based on Machine Learning and Deep Learning
链接: https://arxiv.org/abs/2502.05487
作者: Junqi He,Yifeng Wei,Daiguang Jin
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:This article proposes a Mix Neural Network (MNN) based on a CNN-FCNN architecture for predicting the magnetic core loss of different materials. In traditional core loss models, empirical equations must be regressed separately under each set of external conditions, and each new material or additional external factor requires its own model, making the modeling process extremely cumbersome. Moreover, traditional empirical equations suffer from low accuracy; although various correction equations have since been introduced, the accuracy has remained unsatisfactory. Introducing machine learning and deep learning makes it possible to address both the low accuracy of empirical equations and the complexity of operating conditions at once. Based on the MagNet database, training the newly proposed MNN shows that a single model is sufficient to make predictions for at least four different materials under varying temperatures, frequencies, and waveforms, with accuracy far exceeding that of traditional models. We also trained three other machine learning and deep learning models (Random Forest, XGBoost, MLP-LSTM), all of which achieved much higher accuracy than traditional models. Building on these results, a hybrid model combining MNN and XGBoost through weighted predictions was proposed, further improving accuracy. This provides a solution for modeling magnetic core loss across different materials and operating modes.
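摘要最后提到把 MNN 与 XGBoost 的预测加权混合可进一步提升精度。下面示意一种常见的加权方式:按验证集误差的倒数分配权重(论文的具体加权方案未在摘要中给出,此处仅供说明)。

```python
import numpy as np

def blend_weights(err_mnn, err_xgb):
    """按验证集误差(如 RMSE)的倒数为两个模型分配混合权重。"""
    inv = np.array([1.0 / err_mnn, 1.0 / err_xgb])
    return inv / inv.sum()

def hybrid_predict(pred_mnn, pred_xgb, w):
    """加权混合两路预测。"""
    return w[0] * pred_mnn + w[1] * pred_xgb

w = blend_weights(0.08, 0.12)                  # 验证 RMSE: MNN=0.08, XGBoost=0.12
print(w)                                       # => [0.6 0.4]
print(hybrid_predict(np.array([1.2, 0.9]), np.array([1.0, 1.1]), w))
```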
[LG-96] You Are What You Eat – AI Alignment Requires Understanding How Data Shapes Structure and Generalisation
链接: https://arxiv.org/abs/2502.05475
作者: Simon Pepin Lehalleur,Jesse Hoogland,Matthew Farrugia-Roberts,Susan Wei,Alexander Gietelink Oldenziel,George Wang,Liam Carroll,Daniel Murfet
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation.
[LG-97] Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making
链接: https://arxiv.org/abs/2502.05468
作者: Prince Zizhuang Wang,Jinhao Liang,Shuyi Chen,Ferdinando Fioretto,Shixiang Zhu
类目: Machine Learning (cs.LG)
*备注: 22 pages, 6 figures
点击查看摘要
Abstract:Decision-focused learning (DFL) integrates predictive models with downstream optimization, directly training machine learning models to minimize decision errors. While DFL has been shown to provide substantial advantages when compared to a counterpart that treats the predictive and prescriptive models separately, it has also been shown to struggle in high-dimensional and risk-sensitive settings, limiting its applicability in real-world settings. To address this limitation, this paper introduces decision-focused generative learning (Gen-DFL), a novel framework that leverages generative models to adaptively model uncertainty and improve decision quality. Instead of relying on fixed uncertainty sets, Gen-DFL learns a structured representation of the optimization parameters and samples from the tail regions of the learned distribution to enhance robustness against worst-case scenarios. This approach mitigates over-conservatism while capturing complex dependencies in the parameter space. The paper shows, theoretically, that Gen-DFL achieves improved worst-case performance bounds compared to traditional DFL. Empirically, it evaluates Gen-DFL on various scheduling and logistics problems, demonstrating its strong performance against existing DFL methods.
[LG-98] Learning Memory and Material Dependent Constitutive Laws
链接: https://arxiv.org/abs/2502.05463
作者: Kaushik Bhattacharya,Lianghao Cao,George Stepaniants,Andrew Stuart,Margaret Trautner
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 48 pages, 11 figures
点击查看摘要
Abstract:The theory of homogenization provides a systematic approach to the derivation of macroscale constitutive laws, obviating the need to repeatedly resolve complex microstructure. However, the unit cell problem that defines the constitutive model is typically not amenable to explicit evaluation. It is therefore of interest to learn constitutive models from data generated by the unit cell problem. Many viscoelastic and elastoviscoplastic materials are characterized by memory-dependent constitutive laws. In order to amortize the computational investment in finding such memory-dependent constitutive laws, it is desirable to learn their dependence on the material microstructure. While prior work has addressed learning memory dependence and material dependence separately, their joint learning has not been considered. This paper focuses on the joint learning problem and proposes a novel neural operator framework to address it. In order to provide firm foundations, the homogenization problem for linear Kelvin-Voigt viscoelastic materials is studied. The theoretical properties of the cell problem in this Kelvin-Voigt setting are used to motivate the proposed general neural operator framework; these theoretical properties are also used to prove a universal approximation theorem for the learned macroscale constitutive model. This formulation of learnable constitutive models is then deployed beyond the Kelvin-Voigt setting. Numerical experiments are presented showing that the resulting data-driven methodology accurately learns history- and microstructure-dependent linear viscoelastic and nonlinear elastoviscoplastic constitutive models, and numerical results also demonstrate that the resulting constitutive models can be deployed in macroscale simulation of material deformation.
[LG-99] Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following
链接: https://arxiv.org/abs/2502.05454
作者: Vivek Myers,Bill Chunyuan Zheng,Anca Dragan,Kuan Fang,Sergey Levine
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps together. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositionality. We show that learning to associate the representations of current and future states with a temporal alignment loss can improve compositional generalization, even in the absence of any explicit subtask planning or reinforcement learning. We evaluate our approach across diverse robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.
[LG-100] Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets
链接: https://arxiv.org/abs/2502.05446
作者: Haoye Lu,Qifan Wu,Yaoliang Yu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward-Backward Deconvolution (SFBD) method, we attain an FID of 6.31 on CIFAR-10 with just 4% clean images (and 3.58 with 10%). Theoretically, we prove that SFBD guides the model to learn the true data distribution. The result also highlights the importance of pretraining on limited but clean data or the alternative from similar datasets. Empirical studies further support these findings and offer additional insights.
[LG-101] Approximating the total variation distance between spin systems
链接: https://arxiv.org/abs/2502.05437
作者: Weiming Feng,Hongyang Liu,Minji Yang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR)
*备注:
点击查看摘要
Abstract:Spin systems form an important class of undirected graphical models. For two Gibbs distributions $\mu$ and $\nu$ induced by two spin systems on the same graph $G = (V, E)$, we study the problem of approximating the total variation distance $d_{TV}(\mu,\nu)$ with an $\epsilon$-relative error. We propose a new reduction that connects the problem of approximating the TV-distance to sampling and approximate counting. Our applications include the hardcore model and the antiferromagnetic Ising model in the uniqueness regime, the ferromagnetic Ising model, and the general Ising model satisfying the spectral condition. Additionally, we explore the computational complexity of approximating the total variation distance $d_{TV}(\mu_S,\nu_S)$ between two marginal distributions on an arbitrary subset $S \subseteq V$. We prove that this problem remains hard even when both $\mu$ and $\nu$ admit polynomial-time sampling and approximate counting algorithms.
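作为背景补充,离散分布之间全变差距离的标准定义如下(摘要研究的即是对该量做 $\epsilon$-相对误差的近似):

$$ d_{TV}(\mu, \nu) \;=\; \frac{1}{2} \sum_{x \in \Omega} \bigl|\mu(x) - \nu(x)\bigr| \;=\; \max_{A \subseteq \Omega} \bigl(\mu(A) - \nu(A)\bigr) $$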
[LG-102] Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling
链接: https://arxiv.org/abs/2502.05434
作者: Han Qi,Haochen Yang,Qiaosheng Zhang,Zhuoran Yang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term that encourages exploration of the unknown environment (which quantifies the information gained about the environment through observed human feedback data). To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified surrogate environment and introduce a novel distance measure (named the $\ell_g$-distance), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order $O(H^{3/2}\sqrt{\log(K(\epsilon))\,T})$, where $H$ is the episode length, $T$ is the number of episodes and $K(\epsilon)$ is related to the covering number of the environment. Specializing to the tabular setting, this regret bound is of order $\tilde{O}(H^2\sqrt{SAT})$, where $S$ and $A$ are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.
[LG-103] Deep Generative Models with Hard Linear Equality Constraints
链接: https://arxiv.org/abs/2502.05416
作者: Ruoyan Li,Dipti Ranjan Sahu,Guy Van den Broeck,Zhe Zeng
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:While deep generative models (DGMs) have demonstrated remarkable success in capturing complex data distributions, they consistently fail to learn constraints that encode domain knowledge and thus require constraint integration. Existing solutions to this challenge have primarily relied on heuristic methods and often ignore the underlying data distribution, harming the generative performance. In this work, we propose a probabilistically sound approach for enforcing the hard constraints into DGMs to generate constraint-compliant and realistic data. This is achieved by our proposed gradient estimators that allow the constrained distribution, the data distribution conditioned on constraints, to be differentiably learned. We carry out extensive experiments with various DGM model architectures over five image datasets and three scientific applications in which domain knowledge is governed by linear equality constraints. We validate that the standard DGMs almost surely generate data violating the constraints. Among all the constraint integration strategies, ours not only guarantees the satisfaction of constraints in generation but also achieves generative performance superior to that of the other methods across every benchmark.
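摘要未展开梯度估计器的细节;作为"硬线性等式约束如何被严格满足"的一个简单说明,下面给出到约束集 {x : Ax = b} 的正交投影,任何输出经投影后都精确满足约束。注意投影只是可行做法之一,并非论文方法本身。

```python
import numpy as np

def project_onto_constraints(x, A, b):
    """正交投影到仿射子空间 {x : Ax=b}: x - A^T (A A^T)^{-1} (Ax - b)。"""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

# 用法示意:3 维样本需满足 x1 + x2 + x3 = 1(如比例类领域约束)
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
x = np.random.randn(3)                      # 生成模型的原始输出
x_feasible = project_onto_constraints(x, A, b)
print(A @ x_feasible)                       # => [1.],严格满足约束
```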
[LG-104] Analyzing public sentiment to gauge key stock events and determine volatility in conjunction with time and options premiums
链接: https://arxiv.org/abs/2502.05403
作者: SriVarsha Mulakala,Umesh Vangapally,Benjamin Larkey,Aidan Henrichs,Corey Wojslaw
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Analyzing stocks and making accurate predictions about where prices are heading continues to become more challenging; we therefore designed a new financial algorithm that leverages social media sentiment analysis to enhance the prediction of key stock earnings and associated volatility. Our model integrates sentiment analysis and data retrieval techniques to extract critical information from social media, analyze company financials, and compare sentiments between Wall Street and the general public. This approach aims to provide investors with timely data to execute trades based on key events, rather than relying on long-term stock holding strategies. The stock market is characterized by rapid data flow and fluctuating community sentiments, which can significantly impact trading outcomes. Stock forecasting is complex given its stochastic dynamics. Standard traditional prediction methods often overlook key events and media engagement, focusing instead on long-term investment options. Our research seeks to turn this stochastic dynamic into a more predictable environment by examining the impact of media on stock volatility, understanding and identifying sentiment differences between Wall Street and retail investors, and evaluating the impact of various media networks in predicting earnings reports.
[LG-105] Imitation Learning from a Single Temporally Misaligned Video
链接: https://arxiv.org/abs/2502.05397
作者: William Huey,Huaxiaoyue Wang,Anne Wu,Yoav Artzi,Sanjiban Choudhury
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5\times$ improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6\times$ improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. Our code is available at this https URL
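ORCA 奖励度量"智能体按正确顺序覆盖示范帧"的概率。下面是按这一思想写的简化示意:用动态规划维护"已按序覆盖示范帧前缀"的概率,每个时刻输出覆盖完整示范的概率作为稠密奖励;其中的相似度函数(负欧氏距离的指数)为假设,并非论文原式。

```python
import numpy as np

def match_prob(agent_frame, demo_frame, temp=1.0):
    """以负欧氏距离的指数作为"覆盖该示范帧"的概率代理(假设的相似度)。"""
    return float(np.exp(-np.linalg.norm(agent_frame - demo_frame) / temp))

def orca_style_rewards(agent_traj, demo_traj):
    """动态规划维护按序覆盖示范帧前缀的概率,逐时刻给出稠密奖励。"""
    K = len(demo_traj)
    cover = np.zeros(K + 1)              # cover[j]: 已按序覆盖前 j 帧的概率
    cover[0] = 1.0
    rewards = []
    for frame in agent_traj:
        new = cover.copy()
        for j in range(1, K + 1):        # 覆盖完前 j-1 帧后才能推进到第 j 帧
            new[j] = max(cover[j], cover[j - 1] * match_prob(frame, demo_traj[j - 1]))
        cover = new
        rewards.append(cover[K])         # 按序覆盖全部示范帧的概率作为奖励
    return rewards

demo = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
agent = [np.array([0.1]), np.array([1.2]), np.array([1.9]), np.array([2.0])]
print([round(r, 3) for r in orca_style_rewards(agent, demo)])  # 随覆盖推进而增大
```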
[LG-106] Open Challenges in Time Series Anomaly Detection: An Industry Perspective
链接: https://arxiv.org/abs/2502.05392
作者: Andreas Mueller
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Current research in time-series anomaly detection is using definitions that miss critical aspects of how anomaly detection is commonly used in practice. We list several areas that are of practical relevance and that we believe are either under-investigated or missing entirely from the current discourse. Based on an investigation of systems deployed in a cloud environment, we motivate the areas of streaming algorithms, human-in-the-loop scenarios, point processes, conditional anomalies and populations analysis of time series. This paper serves as a motivation and call for action, including opportunities for theoretical and applied research, as well as for building new dataset and benchmarks.
[LG-107] BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference
链接: https://arxiv.org/abs/2502.05376
作者: Reena Elangovan,Charbel Sakr,Anand Raghunathan,Brucek Khailany
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-8-bits while maintaining activations at 8-bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ) wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded to W4A4 format (with 0.5-bits of overhead for storing scaling factors and codebook selectors), we advance the current state-of-the-art by demonstrating less than 1% loss in inference accuracy across several LLMs and downstream tasks.
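按摘要描述,BCQ 把张量切成连续标量块、按块统计量聚类、再为每个簇设计专属码本。下面是一个最小示意:以块内标准差作聚类特征做 k-means,每簇用分位数构造 16 档(4-bit)码本;聚类特征与码本构造方式为常见选择,并非论文 LO-BCQ 的迭代最优算法。

```python
import numpy as np
from sklearn.cluster import KMeans

def bcq_quantize(tensor, block=8, n_clusters=4, bits=4):
    """块聚类量化:分块 -> 按块内标准差聚类 -> 每簇用分位数码本量化。"""
    blocks = tensor.reshape(-1, block)
    stats = blocks.std(axis=1, keepdims=True)            # 每块的统计特征
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(stats)
    out = np.empty_like(blocks)
    levels = 2 ** bits                                   # 4-bit => 16 档
    for c in range(n_clusters):
        vals = blocks[labels == c].ravel()
        codebook = np.quantile(vals, np.linspace(0, 1, levels))  # 该簇专属码本
        idx = np.argmin(np.abs(vals[:, None] - codebook[None, :]), axis=1)
        out[labels == c] = codebook[idx].reshape(-1, block)
    return out.reshape(tensor.shape)

w = np.random.randn(64, 64)
w_q = bcq_quantize(w)
print("量化 MSE:", np.mean((w - w_q) ** 2))
```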
[LG-108] Active Learning of Model Discrepancy with Bayesian Experimental Design
链接: https://arxiv.org/abs/2502.05372
作者: Huchen Yang,Chuanqi Chen,Jin-Long Wu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Digital twins have been actively explored in many engineering applications, such as manufacturing and autonomous systems. However, model discrepancy is ubiquitous in most digital twin models and has significant impacts on the performance of those models. In recent years, data-driven modeling techniques have been demonstrated promising in characterizing the model discrepancy in existing models, while the training data for the learning of model discrepancy is often obtained in an empirical way and an active approach of gathering informative data can potentially benefit the learning of model discrepancy. On the other hand, Bayesian experimental design (BED) provides a systematic approach to gathering the most informative data, but its performance is often negatively impacted by the model discrepancy. In this work, we build on sequential BED and propose an efficient approach to iteratively learn the model discrepancy based on the data from the BED. The performance of the proposed method is validated by a classical numerical example governed by a convection-diffusion equation, for which full BED is still feasible. The proposed method is then further studied in the same numerical example with a high-dimensional model discrepancy, which serves as a demonstration for the scenarios where full BED is not practical anymore. An ensemble-based approximation of information gain is further utilized to assess the data informativeness and to enhance learning model discrepancy. The results show that the proposed method is efficient and robust to the active learning of high-dimensional model discrepancy, using data suggested by the sequential BED. We also demonstrate that the proposed method is compatible with both classical numerical solvers and modern auto-differentiable solvers.
[LG-109] DobLIX: A Dual-Objective Learned Index for Log-Structured Merge Trees
链接: https://arxiv.org/abs/2502.05369
作者: Alireza Heidari,Amirhossein Ahmadi,Wei Zhang
类目: Databases (cs.DB); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 14 pages, 15 figures
点击查看摘要
Abstract:In this paper, we introduce DobLIX, a dual-objective learned index specifically designed for Log-Structured Merge (LSM) tree-based key-value stores. Although traditional learned indexes focus exclusively on optimizing index lookups, they often overlook the impact of data access from storage, resulting in performance bottlenecks. DobLIX addresses this by incorporating a second objective, data access optimization, into the learned index training process. This dual-objective approach ensures that both index lookup efficiency and data access costs are minimized, leading to significant improvements in read performance while maintaining write efficiency in real-world LSM-tree systems. Additionally, DobLIX features a reinforcement learning agent that dynamically tunes the system parameters, allowing it to adapt to varying workloads in real-time. Experimental results using real-world datasets demonstrate that DobLIX reduces indexing overhead and improves throughput by 1.19 to 2.21 times compared to state-of-the-art methods within RocksDB, a widely used LSM-tree-based storage engine.
[LG-110] Otter: Generating Tests from Issues to Validate SWE Patches
链接: https://arxiv.org/abs/2502.05368
作者: Toufique Ahmed,Jatin Ganhotra,Rangeet Pan,Avraham Shinnar,Saurabh Sinha,Martin Hirzel
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. In this work, we focus on the scenario where the code patch does not exist yet. This approach supports two major use-cases. First, it supports TDD (test-driven development), the discipline of “test first, write code later” that has well-documented benefits for human software engineers. Second, it also validates SWE (software engineering) agents, which generate code patches for resolving issues. This paper introduces Otter, an LLM-based solution for generating tests from issues. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code.
[LG-111] Detecting APT Malware Command and Control over HTTP(S) Using Contextual Summaries
链接: https://arxiv.org/abs/2502.05367
作者: Almuthanna Alageel,Sergio Maffeis,Imperial College London
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 22 pages, 9 figures. In: Susilo, W., Chen, X., Guo, F., Zhang, Y., Intan, R. (eds) Information Security. ISC 2022
点击查看摘要
Abstract:Advanced Persistent Threats (APTs) are among the most sophisticated threats facing critical organizations worldwide. APTs employ specific tactics, techniques, and procedures (TTPs) which make them difficult to detect in comparison to frequent and aggressive attacks. In fact, current network intrusion detection systems struggle to detect APT communications, allowing such threats to persist unnoticed on victims’ machines for months or even years. In this paper, we present EarlyCrow, an approach to detect APT malware command and control over HTTP(S) using contextual summaries. The design of EarlyCrow is informed by a novel threat model focused on TTPs present in traffic generated by tools recently used as part of APT campaigns. The threat model highlights the importance of the context around the malicious connections, and suggests traffic attributes which help APT detection. EarlyCrow defines a novel multipurpose network flow format called PairFlow, which is leveraged to build the contextual summary of a PCAP capture, representing key behavioral, statistical and protocol information relevant to APT TTPs. We evaluate the effectiveness of EarlyCrow on unseen APTs, obtaining a headline macro average F1-score of 93.02% with FPR of 0.74%.
[LG-112] Hypencoder: Hypernetworks for Information Retrieval
链接: https://arxiv.org/abs/2502.05364
作者: Julian Killingback,Hansi Zeng,Hamed Zamani
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The vast majority of retrieval models depend on vector inner products to produce a relevance score between a query and a document. This naturally limits the expressiveness of the relevance score that can be employed. We propose a new paradigm: instead of producing a vector to represent the query, we produce a small neural network which acts as a learned relevance function. This small neural network takes in a representation of the document (in this paper, a single vector) and produces a scalar relevance score. To produce the small neural network we use a hypernetwork, a network that produces the weights of other networks, as our query encoder, which we call a Hypencoder. Experiments on in-domain search tasks show that Hypencoder significantly outperforms strong dense retrieval models and achieves higher metrics than reranking models and models an order of magnitude larger. Hypencoder is also shown to generalize well to out-of-domain search tasks. To assess the extent of Hypencoder’s capabilities, we evaluate on a set of hard retrieval tasks including tip-of-the-tongue retrieval and instruction-following retrieval tasks and find that the performance gap widens substantially compared to standard retrieval tasks. Furthermore, to demonstrate the practicality of our method we implement an approximate search algorithm and show that our model is able to search 8.8M documents in under 60ms.
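As a rough illustration of the core idea, a query encoder that emits the weights of a small document-scoring network, here is a minimal PyTorch sketch; all dimensions and the encoder architecture are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

D = 128  # assumed document/query embedding dim
H = 32   # assumed hidden width of the generated relevance network

class Hypencoder(nn.Module):
    """Maps a query vector to the weights of a tiny scoring MLP."""
    def __init__(self):
        super().__init__()
        n_params = D * H + H + H + 1  # W1, b1, W2, b2
        self.to_weights = nn.Sequential(nn.Linear(D, 256), nn.ReLU(),
                                        nn.Linear(256, n_params))

    def forward(self, query_vec, doc_vecs):
        w = self.to_weights(query_vec)             # flat parameter vector
        W1 = w[:D * H].view(H, D)
        b1 = w[D * H:D * H + H]
        W2 = w[D * H + H:D * H + 2 * H]
        b2 = w[-1]
        hidden = torch.relu(doc_vecs @ W1.T + b1)  # (N, H)
        return hidden @ W2 + b2                    # (N,) relevance scores

model = Hypencoder()
scores = model(torch.randn(D), torch.randn(5, D))  # score 5 documents
print(scores.shape)  # torch.Size([5])
```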
[LG-113] Curse of Dimensionality in Neural Network Optimization
链接: https://arxiv.org/abs/2502.05360
作者: Sanghoon Na,Haizhao Yang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The curse of dimensionality in neural network optimization under the mean-field regime is studied. It is demonstrated that when a shallow neural network with a Lipschitz continuous activation function is trained using either empirical or population risk to approximate a target function that is r times continuously differentiable on [0,1]^d, the population risk may not decay at a rate faster than t^{-\frac{4r}{d-2r}}, where t is an analog of the total number of optimization iterations. This result highlights the presence of the curse of dimensionality in the optimization computation required to achieve a desired accuracy. Instead of analyzing parameter evolution directly, the training dynamics are examined through the evolution of the parameter distribution under the 2-Wasserstein gradient flow. Furthermore, it is established that the curse of dimensionality persists when a locally Lipschitz continuous activation function is employed, where the Lipschitz constant on [-x,x] is bounded by O(x^{\delta}) for any x \in \mathbb{R}. In this scenario, the population risk is shown to decay at a rate no faster than t^{-\frac{(4+2\delta)r}{d-2r}}. To the best of our knowledge, this work is the first to analyze the impact of function smoothness on the curse of dimensionality in neural network optimization theory.
[LG-114] Towards Foundational Models for Dynamical System Reconstruction: Hierarchical Meta-Learning via Mixture of Experts
链接: https://arxiv.org/abs/2502.05335
作者: Roussel Desmond Nzoyem,David A.W. Barton,Tom Deakin
类目: Machine Learning (cs.LG)
*备注: 22 pages, 11 figures, 7 tables
点击查看摘要
Abstract:As foundational models reshape scientific discovery, a bottleneck persists in dynamical system reconstruction (DSR): the ability to learn across system hierarchies. Many meta-learning approaches have been applied successfully to single systems, but falter when confronted with sparse, loosely related datasets requiring multiple hierarchies to be learned. Mixture of Experts (MoE) offers a natural paradigm to address these challenges. Despite their potential, we demonstrate that naive MoEs are inadequate for the nuanced demands of hierarchical DSR, largely due to their gradient descent-based gating update mechanism which leads to slow updates and conflicted routing during training. To overcome this limitation, we introduce MixER: Mixture of Expert Reconstructors, a novel sparse top-1 MoE layer employing a custom gating update algorithm based on K-means and least squares. Extensive experiments validate MixER’s capabilities, demonstrating efficient training and scalability to systems of up to ten parametric ordinary differential equations. However, our layer underperforms state-of-the-art meta-learners in high-data regimes, particularly when each expert is constrained to process only a fraction of a dataset composed of highly related data points. Further analysis with synthetic and neuroscientific time series suggests that the quality of the contextual representations generated by MixER is closely linked to the presence of hierarchical structure in the data.
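A loose sketch of the gating idea follows: a K-means gate routes each context to exactly one expert, and each expert is refit in closed form by least squares, so no gradient-based routing is involved. The toy data and expert form are assumptions, not MixER's actual layer.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                          # context features
true_W = [rng.normal(size=(4, 2)) for _ in range(2)]   # two hidden "systems"
labels = (X[:, 0] > 0).astype(int)
Y = np.stack([X[i] @ true_W[labels[i]] for i in range(len(X))])

# Gate: hard top-1 assignment of contexts to experts via K-means.
gate = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Experts: closed-form least-squares fit on the samples routed to them.
experts = []
for k in range(2):
    idx = gate.labels_ == k
    W_k, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    experts.append(W_k)

pred = np.stack([X[i] @ experts[gate.labels_[i]] for i in range(len(X))])
print("train MSE:", np.mean((pred - Y) ** 2))
```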
[LG-115] Geometric Machine Learning on EEG Signals
链接: https://arxiv.org/abs/2502.05334
作者: Benjamin J. Choi
类目: Machine Learning (cs.LG)
*备注: Accepted in Proceedings of Machine Learning Research (PMLR), 2025
点击查看摘要
Abstract:Brain-computer interfaces (BCIs) offer transformative potential, but decoding neural signals presents significant challenges. The core premise of this paper is built around demonstrating methods to elucidate the underlying low-dimensional geometric structure present in high-dimensional brainwave data in order to assist in downstream BCI-related neural classification tasks. We demonstrate two pipelines related to electroencephalography (EEG) signal processing: (1) a preliminary pipeline removing noise from individual EEG channels, and (2) a downstream manifold learning pipeline uncovering geometric structure across networks of EEG channels. We conduct preliminary validation using two EEG datasets and situate our demonstration in the context of the BCI-relevant imagined digit decoding problem. Our preliminary pipeline uses an attention-based EEG filtration network to extract clean signal from individual EEG channels. Our primary pipeline uses a fast Fourier transform, a Laplacian eigenmap, a discrete analog of Ricci flow via Ollivier’s notion of Ricci curvature, and a graph convolutional network to perform dimensionality reduction on high-dimensional multi-channel EEG data in order to enable regularizable downstream classification. Our system achieves competitive performance with existing signal processing and classification benchmarks; we demonstrate a mean test correlation coefficient of 0.95 at 2 dB on semi-synthetic neural denoising and a downstream EEG-based classification accuracy of 0.97 on distinguishing digit- versus non-digit thoughts. Results are preliminary and our geometric machine learning pipeline should be validated by more extensive follow-up studies; generalizing these results to larger inter-subject sample sizes, different hardware systems, and broader use cases will be crucial.
[LG-116] A Tutorial On Intersectionality in Fair Rankings
链接: https://arxiv.org/abs/2502.05333
作者: Chiara Criscuolo,Davide Martinenghi,Giuseppe Piccirillo
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We address the critical issue of biased algorithms and unfair rankings, which have permeated various sectors, including search engines, recommendation systems, and workforce management. These biases can lead to discriminatory outcomes in a data-driven world, especially against marginalized and underrepresented groups. Efforts towards responsible data science and responsible artificial intelligence aim to mitigate these biases and promote fairness, diversity, and transparency. However, most fairness-aware ranking methods singularly focus on protected attributes such as race, gender, or socio-economic status, neglecting the intersectionality of these attributes, i.e., the interplay between multiple social identities. Understanding intersectionality is crucial to ensure that existing inequalities are not preserved by fair rankings. We offer a description of the main ways to incorporate intersectionality in fair ranking systems through practical examples and provide a comparative overview of existing literature and a synoptic table summarizing the various methodologies. Our analysis highlights the need for intersectionality to attain fairness, while also emphasizing that fairness, alone, does not necessarily imply intersectionality.
[LG-117] Removing Neural Signal Artifacts with Autoencoder-Targeted Adversarial Transformers (AT-AT)
链接: https://arxiv.org/abs/2502.05332
作者: Benjamin J. Choi
类目: Machine Learning (cs.LG)
*备注: Accepted at CNS 2025, Boston, MA, USA
点击查看摘要
Abstract:Electromyogenic (EMG) noise is a major contamination source in EEG data that can impede accurate analysis of brain-specific neural activity. Recent literature on EMG artifact removal has moved beyond traditional linear algorithms in favor of machine learning-based systems. However, existing deep learning-based filtration methods often have large compute footprints and prohibitively long training times. In this study, we present a new machine learning-based system for filtering EMG interference from EEG data using an autoencoder-targeted adversarial transformer (AT-AT). By leveraging the lightweight expressivity of an autoencoder to determine optimal time-series transformer application sites, our AT-AT architecture achieves a 90% model size reduction compared to published artifact removal models. The addition of adversarial training ensures that filtered signals adhere to the fundamental characteristics of EEG data. We trained AT-AT using published neural data from 67 subjects and found that the system was able to achieve comparable test performance to larger models; AT-AT posted a mean reconstructive correlation coefficient above 0.95 at an initial signal-to-noise ratio (SNR) of 2 dB and 0.70 at -7 dB SNR. Further research generalizing these results to broader sample sizes beyond these isolated test cases will be crucial; while outside the scope of this study, we also include results from a real-world deployment of AT-AT in the Appendix.
[LG-118] From Counterfactuals to Trees: Competitive Analysis of Model Extraction Attacks
链接: https://arxiv.org/abs/2502.05325
作者: Awa Khouna,Julien Ferry,Thibaut Vidal
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:The advent of Machine Learning as a Service (MLaaS) has heightened the trade-off between model explainability and security. In particular, explainability techniques, such as counterfactual explanations, inadvertently increase the risk of model extraction attacks, enabling unauthorized replication of proprietary models. In this paper, we formalize and characterize the risks and inherent complexity of model reconstruction, focusing on the “oracle” queries required for faithfully inferring the underlying prediction function. We present the first formal analysis of model extraction attacks through the lens of competitive analysis, establishing a foundational framework to evaluate their efficiency. Focusing on models based on additive decision trees (e.g., decision trees, gradient boosting, and random forests), we introduce novel reconstruction algorithms that achieve provably perfect fidelity while demonstrating strong anytime performance. Our framework provides theoretical bounds on the query complexity for extracting tree-based models, offering new insights into the security vulnerabilities of their deployment.
[LG-119] Using Federated Machine Learning in Predictive Maintenance of Jet Engines
链接: https://arxiv.org/abs/2502.05321
作者: Asaph Matheus Barbosa,Thao Vy Nhat Ngo,Elaheh Jafarigol,Theodore B. Trafalis,Emuobosa P. Ojoboh
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The goal of this paper is to predict the Remaining Useful Life (RUL) of turbine jet engines using a federated machine learning framework. Federated Learning enables multiple edge devices/nodes or servers to collaboratively train a shared model without sharing sensitive data, thus preserving data privacy and security. By implementing a nonlinear model, the system aims to capture complex relationships and patterns in the engine data to enhance the accuracy of RUL predictions. This approach leverages decentralized computation, allowing models to be trained locally at each device before aggregating the learned weights at a central server. By predicting the RUL of jet engines accurately, maintenance schedules can be optimized, downtime reduced, and operational efficiency improved, ultimately leading to cost savings and enhanced performance in the aviation industry. Computational results are provided using the C-MAPSS dataset, which is publicly available on the NASA website and is a valuable resource for studying and analyzing engine degradation behaviors in various operational scenarios.
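The federated pattern described here (local training, server-side weight aggregation, no raw data exchange) can be sketched in a few lines; the linear local model and random data below are stand-ins for the paper's nonlinear RUL model on C-MAPSS.

```python
import numpy as np

def local_update(w, X, y, lr=1e-3, epochs=5):
    """Client-side training: a few gradient steps on local data only."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(100, 8)), rng.normal(size=100)) for _ in range(4)]
w_global = np.zeros(8)

for _ in range(10):  # FedAvg rounds
    local_ws = [local_update(w_global.copy(), X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients])
    # Server aggregates: size-weighted average of client weights.
    w_global = np.average(local_ws, axis=0, weights=sizes)
print(w_global[:3])
```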
[LG-120] Diagonal Symmetrization of Neural Network Solvers for the Many-Electron Schr"odinger Equation
链接: https://arxiv.org/abs/2502.05318
作者: Kevin Han Huang,Ni Zhan,Elif Ertekin,Peter Orbanz,Ryan P. Adams
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
点击查看摘要
Abstract:Incorporating group symmetries into neural networks has been a cornerstone of success in many AI-for-science applications. Diagonal groups of isometries, which describe the invariance under a simultaneous movement of multiple objects, arise naturally in many-body quantum problems. Despite their importance, diagonal groups have received relatively little attention, as they lack a natural choice of invariant maps except in special cases. We study different ways of incorporating diagonal invariance in neural network ansätze trained via variational Monte Carlo methods, and consider specifically data augmentation, group averaging and canonicalization. We show that, contrary to standard ML setups, in-training symmetrization destabilizes training and can lead to worse performance. Our theoretical and numerical results indicate that this unexpected behavior may arise from a unique computational-statistical tradeoff not found in standard ML analyses of symmetrization. Meanwhile, we demonstrate that post hoc averaging is less sensitive to such tradeoffs and emerges as a simple, flexible and effective method for improving neural network solvers.
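To make the post hoc averaging concrete, here is a minimal PyTorch sketch under an assumed two-element symmetry group (a simultaneous sign flip of all coordinates); the paper's diagonal groups, ansätze, and variational Monte Carlo training are far richer.

```python
import torch

def group_averaged(net, x):
    """Average the network output over the orbit of the (assumed) group."""
    group = [lambda z: z, lambda z: -z]  # toy two-element diagonal group
    return torch.stack([net(g(x)) for g in group]).mean(dim=0)

net = torch.nn.Sequential(torch.nn.Linear(6, 16), torch.nn.Tanh(),
                          torch.nn.Linear(16, 1))
x = torch.randn(6)
# The averaged ansatz is exactly invariant: both prints agree.
print(group_averaged(net, x), group_averaged(net, -x))
```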
[LG-121] AI/ML-Based Automatic Modulation Recognition: Recent Trends and Future Possibilities
链接: https://arxiv.org/abs/2502.05315
作者: Elaheh Jafarigol,Behnoud Alaghband,Azadeh Gilanpour,Saeid Hosseinipoor,Mirhamed Mirmozafari
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present a review of high-performance automatic modulation recognition (AMR) models proposed in the literature to classify various Radio Frequency (RF) modulation schemes. We replicated these models and compared their performance in terms of accuracy across a range of signal-to-noise ratios. To ensure a fair comparison, we used the same dataset (RadioML-2016A), the same hardware, and a consistent definition of test accuracy as the evaluation metric, thereby providing a benchmark for future AMR studies. The hyperparameters were selected based on the authors’ suggestions in the associated references to achieve results as close as possible to the originals. The replicated models are publicly accessible for further analysis of AMR models. We also present the test accuracies of the selected models versus their number of parameters, indicating their complexities. Building on this comparative analysis, we identify strategies to enhance these models’ performance. Finally, we present potential opportunities for improvement, whether through novel architectures, data processing techniques, or training strategies, to further advance the capabilities of AMR models.
[LG-122] Training Set Reconstruction from Differentially Private Forests: How Effective is DP?
链接: https://arxiv.org/abs/2502.05307
作者: Alice Gorgé,Julien Ferry,Sébastien Gambs,Thibaut Vidal
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Recent research has shown that machine learning models are vulnerable to privacy attacks targeting their training data. Differential privacy (DP) has become a widely adopted countermeasure, as it offers rigorous privacy protections. In this paper, we introduce a reconstruction attack targeting state-of-the-art \varepsilon-DP random forests. By leveraging a constraint programming model that incorporates knowledge of the forest’s structure and DP mechanism characteristics, our approach formally reconstructs the most likely dataset that could have produced a given forest. Through extensive computational experiments, we examine the interplay between model utility, privacy guarantees, and reconstruction accuracy across various configurations. Our results reveal that random forests trained with meaningful DP guarantees can still leak substantial portions of their training data. Specifically, while DP reduces the success of reconstruction attacks, the only forests fully robust to our attack exhibit predictive performance no better than a constant classifier. Building on these insights, we provide practical recommendations for the construction of DP random forests that are more resilient to reconstruction attacks and maintain non-trivial predictive performance.
[LG-123] Decentralized Online Ensembles of Gaussian Processes for Multi-Agent Systems ICASSP2025
链接: https://arxiv.org/abs/2502.05301
作者: Fernando Llorente,Daniel Waxman,Petar M. Djurić
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: 5 pages, 2 figures. Accepted to ICASSP 2025
点击查看摘要
Abstract:Flexible and scalable decentralized learning solutions are fundamentally important in the application of multi-agent systems. While several recent approaches introduce (ensembles of) kernel machines in the distributed setting, Bayesian solutions are much more limited. We introduce a fully decentralized, asymptotically exact solution to computing the random feature approximation of Gaussian processes. We further address the choice of hyperparameters by introducing an ensembling scheme for Bayesian multiple kernel learning based on online Bayesian model averaging. The resulting algorithm is tested against Bayesian and frequentist methods on simulated and real-world datasets.
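The random feature approximation at the heart of the method can be shown centrally before any decentralization; lengthscale, feature count, and noise level below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, lengthscale, noise = 200, 100, 0.5, 0.1
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + noise * rng.normal(size=n)

# Random Fourier features for the RBF kernel: phi(x) = sqrt(2/D) cos(Wx + b).
W = rng.normal(scale=1.0 / lengthscale, size=(D, 1))
b = rng.uniform(0, 2 * np.pi, D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Bayesian linear regression in feature space approximates the GP posterior.
A = Phi.T @ Phi + noise**2 * np.eye(D)
mean_w = np.linalg.solve(A, Phi.T @ y)
phi_test = np.sqrt(2.0 / D) * np.cos(np.array([[1.0]]) @ W.T + b)
print("posterior mean at x=1:", phi_test @ mean_w, "truth:", np.sin(1.0))
```

Because the posterior reduces to sufficient statistics (Phi.T @ Phi and Phi.T @ y) that are sums over data, agents can in principle combine local statistics by consensus, which is the decentralization route the abstract alludes to.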
[LG-124] GST-UNet: Spatiotemporal Causal Inference with Time-Varying Confounders
链接: https://arxiv.org/abs/2502.05295
作者: Miruna Oprescu,David K. Park,Xihaier Luo,Shinjae Yoo,Nathan Kallus
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 17 pages, 6 figures, 2 tables
点击查看摘要
Abstract:Estimating causal effects from spatiotemporal data is a key challenge in fields such as public health, social policy, and environmental science, where controlled experiments are often infeasible. However, existing causal inference methods relying on observational data face significant limitations: they depend on strong structural assumptions to address spatiotemporal challenges (such as interference, spatial confounding, and temporal carryover effects) or fail to account for time-varying confounders. These confounders, influenced by past treatments and outcomes, can themselves shape future treatments and outcomes, creating feedback loops that complicate traditional adjustment strategies. To address these challenges, we introduce the GST-UNet (G-computation Spatio-Temporal UNet), a novel end-to-end neural network framework designed to estimate treatment effects in complex spatial and temporal settings. The GST-UNet leverages regression-based iterative G-computation to explicitly adjust for time-varying confounders, providing valid estimates of potential outcomes and treatment effects. To the best of our knowledge, the GST-UNet is the first neural model to account for complex, non-linear dynamics and time-varying confounders in spatiotemporal interventions. We demonstrate the effectiveness of the GST-UNet through extensive simulation studies and showcase its practical utility with a real-world analysis of the impact of wildfire smoke on respiratory hospitalizations during the 2018 California Camp Fire. Our results highlight the potential of GST-UNet to advance spatiotemporal causal inference across a wide range of policy-driven and scientific applications.
[LG-125] Fairness and Sparsity within Rashomon sets: Enumeration-Free Exploration and Characterization
链接: https://arxiv.org/abs/2502.05286
作者: Lucas Langlade,Julien Ferry,Gabriel Laberge,Thibaut Vidal
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce an enumeration-free method based on mathematical programming to precisely characterize various properties such as fairness or sparsity within the set of “good models”, known as the Rashomon set. This approach is generically applicable to any hypothesis class, provided that a mathematical formulation of the model learning task exists. It offers a structured framework to define the notion of business necessity and evaluate how fairness can be improved or degraded towards a specific protected group, while remaining within the Rashomon set and maintaining any desired sparsity level. We apply our approach to two hypothesis classes: scoring systems and decision diagrams, leveraging recent mathematical programming formulations for training such models. As seen in our experiments, the method comprehensively and certifiably quantifies trade-offs between predictive performance, sparsity, and fairness. We observe that a wide range of fairness values are attainable, ranging from highly favorable to significantly unfavorable for a protected group, while staying within less than 1% of the best possible training accuracy for the hypothesis class. Additionally, we observe that sparsity constraints limit these trade-offs and may disproportionately harm specific subgroups. As we evidenced, thoroughly characterizing the tensions between these key aspects is critical for an informed and accountable selection of models.
[LG-126] Principles and Components of Federated Learning Architectures
链接: https://arxiv.org/abs/2502.05273
作者: Sarwar Saif,MD Abdullah Al Nasim,Parag Biswas,Abdur Rashid,MD Mahim Anjum Haque,Md. Zihad Bin Jahangir
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated learning, also known as FL, is a machine learning framework in which a large number of clients (such as mobile devices or whole enterprises) collaborate to train a model while keeping the training data decentralized, all overseen by a central server (such as a service provider). This decentralized approach to model training offers advantages in terms of privacy, security, regulation, and economy. Despite its promise, FL is not impervious to the flaws that plague conventional machine learning models. This study offers a thorough analysis of the fundamental ideas and elements of federated learning architectures, emphasizing five important areas: communication architectures, machine learning models, data partitioning, privacy methods, and system heterogeneity. We additionally address the difficulties and potential paths for future study in the area. Furthermore, based on a comprehensive review of the literature, we present a collection of architectural patterns for federated learning systems. This analysis will help readers understand the basics of federated learning, its primary components, and several architectural details.
[LG-127] An Adaptable Budget Planner for Enhancing Budget-Constrained Auto-Bidding in Online Advertising KDD2025
链接: https://arxiv.org/abs/2502.05187
作者: Zhijian Duan,Yusen Huo,Tianyu Wang,Zhilin Zhang,Yeshu Li,Chuan Yu,Jian Xu,Bo Zheng,Xiaotie Deng
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: In KDD 2025 ADS Track August
点击查看摘要
Abstract:In online advertising, advertisers commonly utilize auto-bidding services to bid for impression opportunities. A typical objective of the auto-bidder is to optimize the advertiser’s cumulative value of winning impressions within specified budget constraints. However, such a problem is challenging due to the complex bidding environment faced by diverse advertisers. To address this challenge, we introduce ABPlanner, a few-shot adaptable budget planner designed to improve budget-constrained auto-bidding. ABPlanner is based on a hierarchical bidding framework that decomposes the bidding process into shorter, manageable stages. Within this framework, ABPlanner allocates the budget across all stages, allowing a low-level auto-bidder to bid based on the budget allocation plan. The adaptability of ABPlanner is achieved through a sequential decision-making approach, inspired by in-context reinforcement learning. For each advertiser, ABPlanner adjusts the budget allocation plan episode by episode, using data from previous episodes as a prompt for current decisions. This enables ABPlanner to quickly adapt to different advertisers with few-shot data, providing a sample-efficient solution. Extensive simulation experiments and real-world A/B testing validate the effectiveness of ABPlanner, demonstrating its capability to enhance the cumulative value achieved by auto-bidders.
[LG-128] O(√T) Static Regret and Instance Dependent Constraint Violation for Constrained Online Convex Optimization
链接: https://arxiv.org/abs/2502.05019
作者: Rahul Vaze,Abhishek Sinha
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
点击查看摘要
Abstract:The constrained version of the standard online convex optimization (OCO) framework, called COCO, is considered, where on every round, a convex cost function and a convex constraint function are revealed to the learner after it chooses the action for that round. The objective is to simultaneously minimize the static regret and cumulative constraint violation (CCV). An algorithm is proposed that guarantees a static regret of O(\sqrt{T}) and a CCV of \min\{\mathcal{V}, O(\sqrt{T}\log T)\}, where \mathcal{V} depends on the distance between the consecutively revealed constraint sets, the shape of the constraint sets, the dimension of the action space, and the diameter of the action space. For special cases of constraint sets, \mathcal{V}=O(1). Compared to the state-of-the-art results of O(\sqrt{T}) static regret and O(\sqrt{T}\log T) CCV, which were universal, the new result on CCV is instance-dependent and is derived by exploiting the geometric properties of the constraint sets.
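For reference, the two quantities traded off in COCO are standardly defined as below, with f_t the cost and g_t the constraint function revealed at round t, and the comparator ranging over the set \mathcal{X}^\star of actions feasible for all rounds; this is the usual COCO formulation, which the paper is assumed to follow.

```latex
\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x^\star \in \mathcal{X}^\star} \sum_{t=1}^{T} f_t(x^\star),
\qquad
\mathrm{CCV}_T = \sum_{t=1}^{T} \max\{g_t(x_t),\, 0\}.
```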
[LG-129] Learning an Optimal Assortment Policy under Observational Data
链接: https://arxiv.org/abs/2502.06777
作者: Yuxuan Han,Han Zhong,Miao Lu,Jose Blanchet,Zhengyuan Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:We study the fundamental problem of offline assortment optimization under the Multinomial Logit (MNL) model, where sellers must determine the optimal subset of the products to offer based solely on historical customer choice data. While most existing approaches to learning-based assortment optimization focus on the online learning of the optimal assortment through repeated interactions with customers, such exploration can be costly or even impractical in many real-world settings. In this paper, we consider the offline learning paradigm and investigate the minimal data requirements for efficient offline assortment optimization. To this end, we introduce Pessimistic Rank-Breaking (PRB), an algorithm that combines rank-breaking with pessimistic estimation. We prove that PRB is nearly minimax optimal by establishing the tight suboptimality upper bound and a nearly matching lower bound. This further shows that “optimal item coverage” - where each item in the optimal assortment appears sufficiently often in the historical data - is both sufficient and necessary for efficient offline learning. This significantly relaxes the previous requirement of observing the complete optimal assortment in the data. Our results provide fundamental insights into the data requirements for offline assortment optimization under the MNL model.
[LG-130] Unsupervised Particle Tracking with Neuromorphic Computing
链接: https://arxiv.org/abs/2502.06771
作者: Emanuele Coradin(1, 2),Fabio Cufino(3),Muhammad Awais(1, 2, 7),Tommaso Dorigo(1, 2, 4, 5, 7),Enrico Lupi(1, 2),Eleonora Porcu(3),Jinu Raj(6),Fredrik Sandin(4, 7),Mia Tosi(1, 2, 7) ((1) INFN, Sezione di Padova, Italy, (2) Università di Padova, Dipartimento di Fisica e Astronomia, Italy, (3) Università di Bologna, Dipartimento di Fisica, Italy, (4) Luleå University of Technology, Sweden, (5) Universal Scientific Education and Research Network, Italy, (6) Central University of Tamil Nadu, India, (7) MODE Collaboration)
类目: High Energy Physics - Experiment (hep-ex); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 24 pages, 21 figures, submitted to MDPI Particles
点击查看摘要
Abstract:We study the application of a neural network architecture for identifying charged particle trajectories via unsupervised learning of delays and synaptic weights using a spike-time-dependent plasticity rule. In the considered model, the neurons receive time-encoded information on the position of particle hits in a tracking detector for a particle collider, modeled according to the geometry of the Compact Muon Solenoid Phase II detector. We show how a spiking neural network is capable of successfully identifying in a completely unsupervised way the signal left by charged particles in the presence of conspicuous noise from accidental or combinatorial hits. These results open the way to applications of neuromorphic computing to particle tracking, motivating further studies into its potential for real-time, low-power particle tracking in future high-energy physics experiments.
[LG-131] Are all models wrong? Fundamental limits in distribution-free empirical model falsification
链接: https://arxiv.org/abs/2502.06765
作者: Manuel M. Müller,Yuetian Luo,Rina Foygel Barber
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In statistics and machine learning, when we train a fitted model on available data, we typically want to ensure that we are searching within a model class that contains at least one accurate model – that is, we would like to ensure an upper bound on the model class risk (the lowest possible risk that can be attained by any model in the class). However, it is also of interest to establish lower bounds on the model class risk, for instance so that we can determine whether our fitted model is at least approximately optimal within the class, or, so that we can decide whether the model class is unsuitable for the particular task at hand. Particularly in the setting of interpolation learning where machine learning models are trained to reach zero error on the training data, we might ask if, at the very least, a positive lower bound on the model class risk is possible – or are we unable to detect that “all models are wrong”? In this work, we answer these questions in a distribution-free setting by establishing a model-agnostic, fundamental hardness result for the problem of constructing a lower bound on the best test error achievable over a model class, and examine its implications on specific model classes such as tree-based methods and linear regression.
[LG-132] Case for a unified surrogate modelling framework in the age of AI
链接: https://arxiv.org/abs/2502.06753
作者: Elizaveta Semenova
类目: Computation (stat.CO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Surrogate models are widely used in natural sciences, engineering, and machine learning to approximate complex systems and reduce computational costs. However, the current landscape lacks standardisation across key stages of the pipeline, including data collection, sampling design, model class selection, evaluation metrics, and downstream task performance analysis. This fragmentation limits reproducibility, reliability, and cross-domain applicability. The issue has only been exacerbated by the AI revolution and a new suite of surrogate model classes that it offers. In this position paper, we argue for the urgent need for a unified framework to guide the development and evaluation of surrogate models. We outline essential steps for constructing a comprehensive pipeline and discuss alternative perspectives, such as the benefits of domain-specific frameworks. By advocating for a standardised approach, this paper seeks to improve the reliability of surrogate modelling, foster cross-disciplinary knowledge transfer, and, as a result, accelerate scientific progress.
[LG-133] Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent
链接: https://arxiv.org/abs/2502.06719
作者: Marina Sheshukova,Sergey Samsonov,Denis Belomestny,Eric Moulines,Qi-Man Shao,Zhuo-Song Zhang,Alexey Naumov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:In this paper, we establish non-asymptotic convergence rates in the central limit theorem for Polyak-Ruppert-averaged iterates of stochastic gradient descent (SGD). Our analysis builds on the result of the Gaussian approximation for nonlinear statistics of independent random variables of Shao and Zhang (2022). Using this result, we prove the non-asymptotic validity of the multiplier bootstrap for constructing the confidence sets for the optimal solution of an optimization problem. In particular, our approach avoids the need to approximate the limiting covariance of Polyak-Ruppert SGD iterates, which allows us to derive approximation rates in convex distance of order up to 1/\sqrt{n}.
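A toy sketch of the pipeline, Polyak-Ruppert averaging plus a multiplier-bootstrap-style resampling for a confidence interval, on a 1-D quadratic; the multiplier scheme below (reweighting iterates in the average) is a simplification for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, n, B = 2.0, 5000, 200

# SGD on f(x) = (x - theta_star)^2 / 2 with noisy gradients.
x, iterates = 0.0, []
for t in range(1, n + 1):
    grad = (x - theta_star) + rng.normal()
    x -= 0.5 * t ** -0.6 * grad          # decaying step size (assumed)
    iterates.append(x)
iterates = np.array(iterates)
theta_bar = iterates.mean()              # Polyak-Ruppert average

# Multiplier bootstrap: random mean-1 weights perturb the average.
W = rng.exponential(size=(B, n))
boot = (W * iterates).sum(axis=1) / W.sum(axis=1)
lo, hi = np.quantile(boot - theta_bar, [0.025, 0.975])
print(f"95% CI for theta*: [{theta_bar - hi:.3f}, {theta_bar - lo:.3f}]")
```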
[LG-134] Neumann eigenmaps for landmark embedding
链接: https://arxiv.org/abs/2502.06689
作者: Shashank Sule,Wojciech Czaja
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We present Neumann eigenmaps (NeuMaps), a novel approach for enhancing the standard diffusion map embedding using landmarks, i.e., distinguished samples within the dataset. By interpreting these landmarks as a subgraph of the larger data graph, NeuMaps are obtained via the eigendecomposition of a renormalized Neumann Laplacian. We show that NeuMaps offer two key advantages: (1) they provide a computationally efficient embedding that accurately recovers the diffusion distance associated with the reflecting random walk on the subgraph, and (2) they naturally incorporate the Nyström extension within the diffusion map framework through the discrete Neumann boundary condition. Through examples in digit classification and molecular dynamics, we demonstrate that NeuMaps not only improve upon existing landmark-based embedding methods but also enhance the stability of diffusion map embeddings to the removal of highly significant points.
[LG-135] Quantile Multi-Armed Bandits with 1-bit Feedback ALT2025
链接: https://arxiv.org/abs/2502.06678
作者: Ivan Lau,Jonathan Scarlett
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: ALT 2025
点击查看摘要
Abstract:In this paper, we study a variant of best-arm identification involving elements of risk sensitivity and communication constraints. Specifically, the goal of the learner is to identify the arm with the highest quantile reward, while the communication between an agent (who observes rewards) and the learner (who chooses actions) is restricted to only one bit of feedback per arm pull. We propose an algorithm that utilizes noisy binary search as a subroutine, allowing the learner to estimate quantile rewards through 1-bit feedback. We derive an instance-dependent upper bound on the sample complexity of our algorithm and provide an algorithm-independent lower bound for specific instances, with the two matching to within logarithmic factors under mild conditions, or even to within constant factors in certain low error probability scaling regimes. The lower bound is applicable even in the absence of communication constraints, and thus we conclude that restricting to 1-bit feedback has a minimal impact on the scaling of the sample complexity.
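The 1-bit subroutine can be illustrated with a simple noisy binary search for a quantile: per pull, the agent reports only whether the reward fell below a threshold, and the learner narrows a bracket. This sketch conveys the subroutine's spirit, not the paper's full algorithm or its sample-complexity machinery.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.5                        # target quantile level
pull = lambda: rng.normal(1.0)   # arm reward distribution (median = 1.0)

lo, hi = -5.0, 5.0
for _ in range(20):              # binary-search rounds
    mid = (lo + hi) / 2
    # Each pull yields one bit: 1{reward <= mid}.
    p_hat = np.mean([pull() <= mid for _ in range(200)])
    if p_hat < tau:
        lo = mid
    else:
        hi = mid
print("estimated tau-quantile:", (lo + hi) / 2)  # close to 1.0
```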
[LG-136] LOCO: Distribution-Free Inference for Feature Interactions
链接: https://arxiv.org/abs/2502.06661
作者: Camille Little,Lili Zheng,Genevera Allen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Feature importance measures are widely studied and are essential for understanding model behavior, guiding feature selection, and enhancing interpretability. However, many fitted machine learning models involve complex, higher-order interactions between features. Existing feature importance metrics fail to capture these higher-order effects, while existing interaction metrics often suffer from limited applicability or excessive computation; no methods exist to conduct statistical inference for feature interactions. To bridge this gap, we first propose a new model-agnostic metric, interaction Leave-One-Covariate-Out (iLOCO), for measuring the importance of higher-order feature interactions. Next, we leverage recent advances in LOCO inference to develop distribution-free and assumption-light confidence intervals for our iLOCO metric. To address computational challenges, we also introduce an ensemble learning method for calculating the iLOCO metric and confidence intervals that we show is both computationally and statistically efficient. We validate our iLOCO metric and our confidence intervals on both synthetic and real data sets, showing that our approach outperforms existing methods and provides the first inferential approach to detecting feature interactions.
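To convey the flavor of an iLOCO-style quantity, here is a toy non-additivity score built from leave-covariate-out refits: if removing features i and j together costs about as much as removing either alone, the pair carries a pure interaction. The paper's exact metric, and especially its distribution-free intervals, differ from this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def loco_error(drop, X_tr, y_tr, X_te, y_te):
    """Test MSE of a model refit without the dropped features."""
    keep = [c for c in range(X_tr.shape[1]) if c not in drop]
    m = RandomForestRegressor(n_estimators=100, random_state=0)
    m.fit(X_tr[:, keep], y_tr)
    return np.mean((m.predict(X_te[:, keep]) - y_te) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=600)  # pure interaction signal
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = loco_error([], X_tr, y_tr, X_te, y_te)
d0 = loco_error([0], X_tr, y_tr, X_te, y_te) - base
d1 = loco_error([1], X_tr, y_tr, X_te, y_te) - base
d01 = loco_error([0, 1], X_tr, y_tr, X_te, y_te) - base
# ~0 for additive effects; clearly positive for a pure interaction.
print("non-additivity score:", d0 + d1 - d01)
```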
[LG-137] Estimation of Food Intake Quantity Using Inertial Signals from Smartwatches
链接: https://arxiv.org/abs/2502.06649
作者: Ioannis Levi,Konstantinos Kyritsis,Vasileios Papapanagiotou,Georgios Tsakiridis,Anastasios Delopoulos
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Manuscript submitted for review to 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2025
点击查看摘要
Abstract:Accurate monitoring of eating behavior is crucial for managing obesity and eating disorders such as bulimia nervosa. At the same time, existing methods rely on multiple and/or specialized sensors, greatly harming adherence and ultimately, the quality and continuity of data. This paper introduces a novel approach for estimating the weight of a bite, from a commercial smartwatch. Our publicly-available dataset contains smartwatch inertial data from ten participants, with manually annotated start and end times of each bite along with their corresponding weights from a smart scale, under semi-controlled conditions. The proposed method combines extracted behavioral features such as the time required to load the utensil with food, with statistical features of inertial signals, that serve as input to a Support Vector Regression model to estimate bite weights. Under a leave-one-subject-out cross-validation scheme, our approach achieves a mean absolute error (MAE) of 3.99 grams per bite. To contextualize this performance, we introduce the improvement metric, that measures the relative MAE difference compared to a baseline model. Our method demonstrates a 17.41% improvement, while the adapted state-of-the-art method shows a -28.89% performance against that same baseline. The results presented in this work establish the feasibility of extracting meaningful bite weight estimates from commercial smartwatch inertial sensors alone, laying the groundwork for future accessible, non-invasive dietary monitoring systems.
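The modeling recipe (behavioral plus inertial statistics per bite, SVR on the weight, leave-one-subject-out evaluation) is compact enough to sketch; the feature names and data below are placeholders, and the paper's feature set is richer.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

rng = np.random.default_rng(0)
n_bites = 120
X = np.column_stack([
    rng.uniform(1, 6, n_bites),       # assumed: utensil-loading time (s)
    rng.normal(0, 1, (n_bites, 4)),   # assumed: inertial signal statistics
])
y = 5 + 2 * X[:, 0] + rng.normal(0, 1.5, n_bites)  # toy bite weights (g)
subjects = rng.integers(0, 10, n_bites)            # ten participants

model = make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.5))
pred = cross_val_predict(model, X, y, groups=subjects, cv=LeaveOneGroupOut())
print("LOSO MAE (g):", np.mean(np.abs(pred - y)))
```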
[LG-138] Membership Inference Risks in Quantized Models: A Theoretical and Empirical Study
链接: https://arxiv.org/abs/2502.06567
作者: Eric Aubinais,Philippe Formont,Pablo Piantanida,Elisabeth Gassiat
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Quantizing machine learning models has demonstrated its effectiveness in lowering memory and inference costs while maintaining performance levels comparable to the original models. In this work, we investigate the impact of quantization procedures on the privacy of data-driven models, specifically focusing on their vulnerability to membership inference attacks. We derive an asymptotic theoretical analysis of Membership Inference Security (MIS), characterizing the privacy implications of quantized algorithm weights against the most powerful (and possibly unknown) attacks. Building on these theoretical insights, we propose a novel methodology to empirically assess and rank the privacy levels of various quantization procedures. Using synthetic datasets, we demonstrate the effectiveness of our approach in assessing the MIS of different quantizers. Furthermore, we explore the trade-off between privacy and performance using real-world data and models in the context of molecular modeling.
[LG-139] Data Augmentation and Regularization for Learning Group Equivariance
链接: https://arxiv.org/abs/2502.06547
作者: Oskar Nordenfors,Axel Flinth
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:In many machine learning tasks, known symmetries can be used as an inductive bias to improve model performance. In this paper, we consider learning group equivariance through training with data augmentation. We summarize results from a previous paper of our own, and extend the results to show that equivariance of the trained model can be achieved through training on augmented data in tandem with regularization.
[LG-140] Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions
链接: https://arxiv.org/abs/2502.06536
作者: Hidde Fokkema,Tim van Erven,Sara Magliacane
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 47 pages, 16 figures, 9 Tables, Preprint
点击查看摘要
Abstract:Machine learning is a vital part of many real-world systems, but several concerns remain about the lack of interpretability, explainability and robustness of black-box AI systems. Concept-based models (CBM) address some of these challenges by learning interpretable concepts from high-dimensional data, e.g. images, which are used to predict labels. An important issue in CBMs is concept leakage, i.e., spurious information in the learned concepts, which effectively leads to learning “wrong” concepts. Current mitigating strategies are heuristic, have strong assumptions, e.g., they assume that the concepts are statistically independent of each other, or require substantial human interaction in terms of both interventions and labels provided by annotators. In this paper, we describe a framework that provides theoretical guarantees on the correctness of the learned concepts and on the number of required labels, without requiring any interventions. Our framework leverages causal representation learning (CRL) to learn high-level causal variables from low-level data, and learns to align these variables with interpretable concepts. We propose a linear and a non-parametric estimator for this mapping, providing a finite-sample high probability result in the linear case and an asymptotic consistency result for the non-parametric estimator. We implement our framework with state-of-the-art CRL methods, and show its efficacy in learning the correct concepts in synthetic and image benchmarks.
[LG-141] Properties of Wasserstein Gradient Flows for the Sliced-Wasserstein Distance
链接: https://arxiv.org/abs/2502.06525
作者: Christophe Vauthier,Quentin Mérigot,Anna Korba
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 32p
点击查看摘要
Abstract:In this paper, we investigate the properties of the Sliced Wasserstein Distance (SW) when employed as an objective functional. The SW metric has gained significant interest in the optimal transport and machine learning literature, due to its ability to capture intricate geometric properties of probability distributions while remaining computationally tractable, making it a valuable tool for various applications, including generative modeling and domain adaptation. Our study aims to provide a rigorous analysis of the critical points arising from the optimization of the SW objective. By computing explicit perturbations, we establish that stable critical points of SW cannot concentrate on segments. This stability analysis is crucial for understanding the behaviour of optimization algorithms for models trained using the SW objective. Furthermore, we investigate the properties of the SW objective, shedding light on the existence and convergence behavior of critical points. We illustrate our theoretical results through numerical experiments.
[LG-142] Conformal Prediction Regions are Imprecise Highest Density Regions
链接: https://arxiv.org/abs/2502.06331
作者: Michele Caprio,Yusuf Sale,Eyke Hüllermeier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
点击查看摘要
Abstract:Recently, Cella and Martin proved how, under an assumption called consonance, a credal set (i.e. a closed and convex set of probabilities) can be derived from the conformal transducer associated with transductive conformal prediction. We show that the Imprecise Highest Density Region (IHDR) associated with such a credal set corresponds to the classical Conformal Prediction Region. In proving this result, we relate the set of probability density/mass functions (pdf/pmf’s) associated with the elements of the credal set to the imprecise probabilistic concept of a cloud. As a result, we establish new relationships between Conformal Prediction and Imprecise Probability (IP) theories. A byproduct of our presentation is the discovery that consonant plausibility functions are monoid homomorphisms, a new algebraic property of an IP tool.
[LG-143] A physics-based data-driven model for CO_2 gas diffusion electrodes to drive automated laboratories ICLR2025
链接: https://arxiv.org/abs/2502.06323
作者: Ivan Grega,Félix Therrien,Abhishek Soni,Karry Ocean,Kevan Dettelbach,Ribwar Ahmadi,Mehrdad Mokhtari,Curtis P. Berlinguette,Yoshua Bengio
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 7 pages, 5 figures. Submitted to AI4Mat-ICLR2025 workshop
点击查看摘要
Abstract:The electrochemical reduction of atmospheric CO_2 into high-energy molecules with renewable energy is a promising avenue for energy storage that can take advantage of existing infrastructure, especially in areas where sustainable alternatives to fossil fuels do not exist. Automated laboratories are currently being developed and used to optimize the composition and operating conditions of gas diffusion electrodes (GDEs), the device in which this reaction takes place. Improving the efficiency of GDEs is crucial for this technology to become viable. Here we present a modeling framework to efficiently explore the high-dimensional parameter space of GDE designs in an active learning context. At the core of the framework is an uncertainty-aware physics model calibrated with experimental data. The model has the flexibility to capture various input parameter spaces and any carbon products which can be modeled with Tafel kinetics. It is interpretable, and a Gaussian process layer can capture deviations of real data from the function space of the physical model itself. We deploy the model in a simulated active learning setup with real electrochemical data gathered by the AdaCarbon automated laboratory and show that it can be used to efficiently traverse the multi-dimensional parameter space.
[LG-144] Application of quantum machine learning using quantum kernel algorithms on multiclass neuron M type classification
链接: https://arxiv.org/abs/2502.06281
作者: Xavier Vasques,Hanhee Paik,Laura Cif
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The functional characterization of different neuronal types has been a longstanding and crucial challenge. With the advent of physical quantum computers, it has become possible to apply quantum machine learning algorithms to translate theoretical research into practical solutions. Previous studies have shown the advantages of quantum algorithms on artificially generated datasets, and initial experiments with small binary classification problems have yielded comparable outcomes to classical algorithms. However, it is essential to investigate the potential quantum advantage using real-world data. To the best of our knowledge, this study is the first to propose the utilization of quantum systems to classify neuron morphologies, thereby enhancing our understanding of the performance of automatic multiclass neuron classification using quantum kernel methods. We examined the influence of feature engineering on classification accuracy and found that quantum kernel methods achieved similar performance to classical methods, with certain advantages observed in various configurations.
[LG-145] Spectral-factorized Positive-definite Curvature Learning for NN Training
链接: https://arxiv.org/abs/2502.06268
作者: Wu Lin,Felix Dangel,Runa Eschenhagen,Juhan Bae,Richard E. Turner,Roger B. Grosse
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: technical report
点击查看摘要
Abstract:Many training methods, such as Adam(W) and Shampoo, learn a positive-definite curvature matrix and apply an inverse root before preconditioning. Recently, non-diagonal training methods, such as Shampoo, have gained significant attention; however, they remain computationally inefficient and are limited to specific types of curvature information due to the costly matrix root computation via matrix decomposition. To address this, we propose a Riemannian optimization approach that dynamically adapts spectral-factorized positive-definite curvature estimates, enabling the efficient application of arbitrary matrix roots and generic curvature learning. We demonstrate the efficacy and versatility of our approach in positive-definite matrix optimization and covariance adaptation for gradient-free optimization, as well as its efficiency in curvature learning for neural net training.
[LG-146] Falsification of Unconfoundedness by Testing Independence of Causal Mechanisms
链接: https://arxiv.org/abs/2502.06231
作者: Rickard K.A. Karlsson,Jesse H. Krijthe
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, including 5 figures, 2 tables, and appendices
点击查看摘要
Abstract:A major challenge in estimating treatment effects in observational studies is the reliance on untestable conditions such as the assumption of no unmeasured confounding. In this work, we propose an algorithm that can falsify the assumption of no unmeasured confounding in a setting with observational data from multiple heterogeneous sources, which we refer to as environments. Our proposed falsification strategy leverages a key observation that unmeasured confounding can cause observed causal mechanisms to appear dependent. Building on this observation, we develop a novel two-stage procedure that detects these dependencies with high statistical power while controlling false positives. The algorithm does not require access to randomized data and, in contrast to other falsification approaches, functions even under transportability violations when the environment has a direct effect on the outcome of interest. To showcase the practical relevance of our approach, we show that our method is able to efficiently detect confounding on both simulated and real-world data.
[LG-147] Bayesian Optimization by Kernel Regression and Density-based Exploration
链接: https://arxiv.org/abs/2502.06178
作者: Tansheng Zhu,Hongyu Zhou,Ke Jin,Xusheng Xu,Qiufan Yuan,Lijie Ji
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Bayesian optimization is highly effective for optimizing expensive-to-evaluate black-box functions, but it faces significant computational challenges due to the high computational complexity of Gaussian processes, which results in a total time complexity that is quartic with respect to the number of iterations. To address this limitation, we propose the Bayesian Optimization by Kernel regression and density-based Exploration (BOKE) algorithm. BOKE uses kernel regression for efficient function approximation, kernel density for exploration, and the improved kernel regression upper confidence bound criteria to guide the optimization process, thus reducing computational costs to quadratic. Our theoretical analysis rigorously establishes the global convergence of BOKE and ensures its robustness. Through extensive numerical experiments on both synthetic and real-world optimization tasks, we demonstrate that BOKE not only performs competitively compared to Gaussian process-based methods but also exhibits superior computational efficiency. These results highlight BOKE’s effectiveness in resource-constrained environments, providing a practical approach for optimization problems in engineering applications.
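A toy sketch of the two BOKE ingredients, kernel regression for the surrogate mean and kernel density for the exploration bonus, combined in a UCB-style acquisition; the exact acquisition form, bandwidth, and constants here are assumptions, not the paper's.

```python
import numpy as np

def acquisition(x_grid, X, y, h=0.3, beta=1.0):
    K = np.exp(-((x_grid[:, None] - X[None, :]) ** 2) / (2 * h ** 2))
    mu = (K @ y) / (K.sum(axis=1) + 1e-12)       # kernel regression mean
    density = K.mean(axis=1)                     # kernel density estimate
    return mu + beta / np.sqrt(density + 1e-12)  # low density => explore

f = lambda x: -np.sin(3 * x) - x ** 2 + 0.7 * x  # toy objective to maximize
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 3)                        # initial design
y = f(X)
grid = np.linspace(-2, 2, 400)
for _ in range(15):
    x_next = grid[np.argmax(acquisition(grid, X, y))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
print("best x found:", X[np.argmax(y)])
```

Each acquisition evaluation here costs O(n) per grid point, which illustrates the kind of per-iteration saving, relative to repeatedly refitting a Gaussian process, that motivates the method.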
[LG-148] Dynamic Pricing with Adversarially-Censored Demands
链接: https://arxiv.org/abs/2502.06168
作者: Jianyu Xu,Yining Wang,Xi Chen,Yu-Xiang Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Optimization and Control (math.OC)
*备注: 33 pages, 1 figure
点击查看摘要
Abstract:We study an online dynamic pricing problem where the potential demand at each time period t=1,2,\ldots,T is stochastic and dependent on the price. However, a perishable inventory is imposed at the beginning of each time t, censoring the potential demand if it exceeds the inventory level. To address this problem, we introduce a pricing algorithm based on the optimistic estimates of derivatives. We show that our algorithm achieves \tilde{O}(\sqrt{T}) optimal regret even with adversarial inventory series. Our findings advance the state-of-the-art in online decision-making problems with censored feedback, offering a theoretically optimal solution against adversarial observations.
[LG-149] Linear Bandits with Partially Observable Features
链接: https://arxiv.org/abs/2502.06142
作者: Wonyoung Kim,Sungwoo Park,Garud Iyengar,Assaf Zeevi,Min-hwan Oh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce a novel linear bandit problem with partially observable features, resulting in partial reward information and spurious estimates. Without properly addressing the latent part, regret can grow linearly in the decision horizon T, since the influence of the latent features on rewards is unknown. To tackle this, we propose a novel analysis to handle the latent features and an algorithm that achieves sublinear regret. The core of our algorithm involves (i) augmenting basis vectors orthogonal to the observed feature space, and (ii) introducing an efficient doubly robust estimator. Our approach achieves a regret bound of \tilde{O}(\sqrt{(d + d_h)T}), where d is the dimension of observed features, and d_h is the unknown dimension of the subspace of the unobserved features. Notably, our algorithm requires no prior knowledge of the unobserved feature space, which may expand as more features become hidden. Numerical experiments confirm that our algorithm outperforms both non-contextual multi-armed bandits and linear bandit algorithms that depend solely on observed features.
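As a rough illustration of step (i) only, the sketch below appends an orthonormal basis of the orthogonal complement of the observed arm-feature matrix, producing a full-rank augmented design; the bandit loop and the doubly robust estimator of step (ii) are omitted, and all dimensions are made up.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
K, d = 20, 3
Phi = rng.normal(size=(K, d))   # observed features for K arms

# step (i): basis vectors orthogonal to the observed feature space
U = null_space(Phi.T)           # orthonormal columns, shape (K, K - d)
Phi_aug = np.hstack([Phi, U])   # augmented design, generically full rank
print(Phi_aug.shape, np.linalg.matrix_rank(Phi_aug))
```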
[LG-150] Lipschitz-Driven Inference: Bias-corrected Confidence Intervals for Spatial Linear Models
链接: https://arxiv.org/abs/2502.06067
作者: David R. Burt,Renato Berlinghieri,Stephen Bates,Tamara Broderick
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 34 pages; 15 figures
点击查看摘要
Abstract:Linear models remain ubiquitous in modern spatial applications - including climate science, public health, and economics - due to their interpretability, speed, and reproducibility. While practitioners generally report a form of uncertainty, popular spatial uncertainty quantification methods do not jointly handle model misspecification and distribution shift - despite both being essentially always present in spatial problems. In the present paper, we show that existing methods for constructing confidence (or credible) intervals in spatial linear models fail to provide correct coverage due to unaccounted-for bias. In contrast to classical methods that rely on an i.i.d. assumption that is inappropriate in spatial problems, in the present work we instead make a spatial smoothness (Lipschitz) assumption. We are then able to propose a new confidence-interval construction that accounts for bias in the estimation procedure. We demonstrate that our new method achieves nominal coverage via both theory and experiments. Code to reproduce experiments is available at this https URL.
[LG-151] Scalable Differentially Private Bayesian Optimization
链接: https://arxiv.org/abs/2502.06044
作者: Getoar Sopa,Juraj Marusic,Marco Avella-Medina,John P. Cunningham
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 18 pages, 5 figures
点击查看摘要
Abstract:In recent years, there has been much work on scaling Bayesian Optimization to high-dimensional problems, for example hyperparameter tuning in large neural network models. These scalable methods have been successful, finding high objective values much more quickly than traditional global Bayesian Optimization or random search-based methods. At the same time, these large neural network models often use sensitive data, but preservation of Differential Privacy has not scaled alongside these modern Bayesian Optimization procedures. Here we develop a method to privately estimate potentially high-dimensional parameter spaces using Gradient Informative Bayesian Optimization. Our theoretical results prove that under suitable conditions, our method converges exponentially fast to a ball around the optimal parameter configuration. Moreover, regardless of whether the assumptions are satisfied, we show that our algorithm maintains privacy and empirically demonstrates superior performance to existing methods in the high-dimensional hyperparameter setting.
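The privacy side of such a method can be illustrated with the standard clip-and-noise recipe: each gradient estimate is clipped to bound its sensitivity and perturbed with Gaussian noise before it is used. The clipping norm, noise multiplier, and toy objective below are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(grad, clip=1.0, sigma=0.5):
    # clip to bound per-step sensitivity, then apply the Gaussian mechanism
    g = grad * min(1.0, clip / (np.linalg.norm(grad) + 1e-12))
    return g + rng.normal(scale=sigma * clip, size=g.shape)

x = np.zeros(5)
target = np.arange(5, dtype=float)
for _ in range(2000):
    grad = x - target                 # gradient of 0.5 * ||x - target||^2
    x -= 0.05 * noisy_grad(grad)
print(np.round(x, 2))                 # settles in a noisy ball around target
```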
[LG-152] Nested subspace learning with flags
链接: https://arxiv.org/abs/2502.06022
作者: Tom Szwagier,Xavier Pennec
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Many machine learning methods look for low-dimensional representations of the data. The underlying subspace can be estimated by first choosing a dimension q and then optimizing a certain objective function over the space of q-dimensional subspaces (the Grassmannian). In general, trying different values of q yields non-nested subspaces, which raises an important issue of consistency between the data representations. In this paper, we propose a simple trick to enforce nestedness in subspace learning methods. It consists in lifting Grassmannian optimization problems to flag manifolds (the space of nested subspaces of increasing dimension) via nested projectors. We apply the flag trick to several classical machine learning methods and show that it successfully addresses the nestedness issue.
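A crude way to see the flag trick is to parameterize all subspaces by one orthonormal matrix U and sum the per-dimension objective over its leading columns, so span(U[:, :1]) ⊂ span(U[:, :2]) ⊂ ... are nested by construction. The PCA-style objective and random-search optimizer below are illustrative stand-ins for the paper's Grassmannian methods.

```python
import numpy as np
from scipy.linalg import qr

def subspace_objective(X, U_q):
    # variance captured by projecting onto a q-dimensional subspace
    return np.linalg.norm(X @ U_q, "fro") ** 2

def flag_objective(X, U, dims=(1, 2, 3)):
    # sum the objective over nested subspaces span(U[:, :q])
    return sum(subspace_objective(X, U[:, :q]) for q in dims)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.3, 0.1])

best_val = -np.inf
for _ in range(2000):                       # crude random search over frames
    U, _ = qr(rng.normal(size=(5, 3)), mode="economic")
    best_val = max(best_val, flag_objective(X, U))
print("best flag objective:", round(best_val, 1))
```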
[LG-153] Uncertainty Quantification and Causal Considerations for Off-Policy Decision Making
链接: https://arxiv.org/abs/2502.06011
作者: Muhammad Faaiz Taufiq
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: PhD thesis
点击查看摘要
Abstract:Off-policy evaluation (OPE) is a critical challenge in robust decision-making that seeks to assess the performance of a new policy using data collected under a different policy. However, the existing OPE methodologies suffer from several limitations arising from statistical uncertainty as well as causal considerations. In this thesis, we address these limitations by presenting three different works. Firstly, we consider the problem of high variance in the importance-sampling-based OPE estimators. We introduce the Marginal Ratio (MR) estimator, a novel OPE method that reduces variance by focusing on the marginal distribution of outcomes rather than direct policy shifts, improving robustness in contextual bandits. Next, we propose Conformal Off-Policy Prediction (COPP), a principled approach for uncertainty quantification in OPE that provides finite-sample predictive intervals, ensuring robust decision-making in risk-sensitive applications. Finally, we address causal unidentifiability in off-policy decision-making by developing novel bounds for sequential decision settings, which remain valid under arbitrary unmeasured confounding. We apply these bounds to assess the reliability of digital twin models, introducing a falsification framework to identify scenarios where model predictions diverge from real-world behaviour. Our contributions provide new insights into robust decision-making under uncertainty and establish principled methods for evaluating policies in both static and dynamic settings.
[LG-154] ransformers versus the EM Algorithm in Multi-class Clustering
链接: https://arxiv.org/abs/2502.06007
作者: Yihan He,Hong-Yu Chen,Yuan Cao,Jianqing Fan,Han Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:LLMs demonstrate significant inference capacities in complicated machine learning tasks, using the Transformer model as their backbone. Motivated by the limited understanding of such models on unsupervised learning problems, we study the learning guarantees of Transformers in performing multi-class clustering of Gaussian Mixture Models. We develop a theory drawing strong connections between the Softmax Attention layers and the workflow of the EM algorithm on clustering the mixture of Gaussians. Our theory provides approximation bounds for the Expectation and Maximization steps by proving the universal approximation abilities of multivariate mappings by Softmax functions. In addition to the approximation guarantees, we also show that with a sufficient number of pre-training samples and an initialization, Transformers can achieve the minimax optimal rate for the problem considered. Our extensive simulations empirically verify our theory by revealing the strong learning capacities of Transformers even beyond the assumptions in the theory, shedding light on the powerful inference capacities of LLMs.
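For reference, the EM workflow that the paper connects to attention looks as follows for an isotropic Gaussian mixture: the E-step is exactly a softmax over log-densities, which is the structural link to Softmax Attention. This is textbook EM, not the authors' Transformer construction.

```python
import numpy as np

def em_gmm(X, K, iters=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities are a softmax over (isotropic) log-densities
        logits = -0.5 * ((X[:, None, :] - mu[None]) ** 2).sum(-1) + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        r = np.exp(logits); r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means and mixing proportions
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        pi = Nk / n
    return mu, pi

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (-2.0, 0.0, 2.0)])
mu, pi = em_gmm(X, K=3)
print(np.round(mu, 2), np.round(pi, 2))
```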
[LG-155] Diffusion Models for Inverse Problems in the Exponential Family
链接: https://arxiv.org/abs/2502.05994
作者: Alessandro Micheli,Mélodie Monod,Samir Bhatt
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Diffusion models have emerged as powerful tools for solving inverse problems, yet prior work has primarily focused on observations with Gaussian measurement noise, restricting their use in real-world scenarios. This limitation persists due to the intractability of the likelihood score, which until now has only been approximated in the simpler case of Gaussian likelihoods. In this work, we extend diffusion models to handle inverse problems where the observations follow a distribution from the exponential family, such as a Poisson or a Binomial distribution. By leveraging the conjugacy properties of exponential family distributions, we introduce the evidence trick, a method that provides a tractable approximation to the likelihood score. In our experiments, we demonstrate that our methodology effectively performs Bayesian inference on spatially inhomogeneous Poisson processes with intensities as intricate as ImageNet images. Furthermore, we demonstrate the real-world impact of our methodology by showing that it performs competitively with the current state-of-the-art in predicting malaria prevalence estimates in Sub-Saharan Africa.
[LG-156] Asymptotic FDR Control with Model-X Knockoffs: Is Moments Matching Sufficient?
链接: https://arxiv.org/abs/2502.05969
作者: Yingying Fan,Lan Gao,Jinchi Lv,Xiaocong Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 90 pages
点击查看摘要
Abstract:We propose a unified theoretical framework for studying the robustness of the model-X knockoffs framework by investigating the asymptotic false discovery rate (FDR) control of the practically implemented approximate knockoffs procedure. This procedure deviates from the model-X knockoffs framework by substituting the true covariate distribution with a user-specified distribution that can be learned using in-sample observations. By replacing the distributional exchangeability condition of the model-X knockoff variables with three conditions on the approximate knockoff statistics, we establish that the approximate knockoffs procedure achieves the asymptotic FDR control. Using our unified framework, we further prove that an arguably most popularly used knockoff variable generation method–the Gaussian knockoffs generator based on the first two moments matching–achieves the asymptotic FDR control when the two-moment-based knockoff statistics are employed in the knockoffs inference procedure. For the first time in the literature, our theoretical results justify formally the effectiveness and robustness of the Gaussian knockoffs generator. Simulation and real data examples are conducted to validate the theoretical findings.
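The two-moment-matching generator analyzed here has a closed form for Gaussian covariates. The sketch below implements the standard equi-correlated construction, assuming Sigma is a unit-diagonal covariance (correlation) matrix; it is a textbook illustration, not the paper's code.

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, seed=0):
    """Second-moment-matching Gaussian knockoffs for X ~ N(mu, Sigma)."""
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    lam_min = np.linalg.eigvalsh(Sigma).min()
    s = np.full(p, min(1.0, 2 * lam_min) * 0.999)  # equi-correlated choice
    S = np.diag(s)
    Sinv_S = np.linalg.solve(Sigma, S)             # Sigma^{-1} diag(s)
    cond_mean = X - (X - mu) @ Sinv_S
    V = 2 * S - S @ Sinv_S                         # conditional covariance
    L = np.linalg.cholesky((V + V.T) / 2 + 1e-10 * np.eye(p))
    return cond_mean + rng.normal(size=X.shape) @ L.T

rng = np.random.default_rng(0)
p = 5
Sigma = 0.4 * np.ones((p, p)) + 0.6 * np.eye(p)    # unit-diagonal covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=1000)
Xk = gaussian_knockoffs(X, np.zeros(p), Sigma)
print(np.round(np.cov(Xk.T), 2))                   # approximately reproduces Sigma
```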
[LG-157] Detection of Physiological Data Tampering Attacks with Quantum Machine Learning
链接: https://arxiv.org/abs/2502.05966
作者: Md. Saif Hassan Onim,Himanshu Thapliyal
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The widespread use of cloud-based medical devices and wearable sensors has made physiological data susceptible to tampering. These attacks can compromise the reliability of healthcare systems, which can be critical and life-threatening. Detecting such data tampering is an immediate need. Machine learning has been used to detect anomalies in datasets, but the performance of Quantum Machine Learning (QML) has yet to be evaluated on physiological sensor data. Thus, our study compares the effectiveness of QML for detecting physiological data tampering, focusing on two types of white-box attacks: data poisoning and adversarial perturbation. The results show that QML models are better at identifying label-flipping attacks, achieving accuracy rates of 75%-95% depending on the data and attack severity. This superior performance is due to the ability of quantum algorithms to handle complex and high-dimensional data. However, both QML and classical models struggle to detect more sophisticated adversarial perturbation attacks, which subtly alter data without changing its statistical properties. Although QML performed poorly against this attack, with around 45%-65% accuracy, it still outperformed classical algorithms in some cases.
[LG-158] Propagation of Chaos for Mean-Field Langevin Dynamics and its Application to Model Ensemble
链接: https://arxiv.org/abs/2502.05784
作者: Atsushi Nitanda,Anzelle Lee,Damian Tan Xing Kai,Mizuki Sakaguchi,Taiji Suzuki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 23 pages
点击查看摘要
Abstract:Mean-field Langevin dynamics (MFLD) is an optimization method derived by taking the mean-field limit of noisy gradient descent for two-layer neural networks in the mean-field regime. Recently, the propagation of chaos (PoC) for MFLD has gained attention as it provides a quantitative characterization of the optimization complexity in terms of the number of particles and iterations. Remarkable progress by Chen et al. (2022) showed that the approximation error due to finite particles remains uniform in time and diminishes as the number of particles increases. In this paper, by refining the defective log-Sobolev inequality – a key result from that earlier work – under the neural network training setting, we establish an improved PoC result for MFLD, which removes the exponential dependence on the regularization coefficient from the particle approximation term of the optimization complexity. As an application, we propose a PoC-based model ensemble strategy with theoretical guarantees.
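For intuition, MFLD on a toy two-layer network is just noisy gradient descent on the particles (neurons), with the noise scale set by a temperature parameter. The model, loss, and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 200, 3, 256                    # samples, input dim, particles
X = rng.normal(size=(n, d))
y = np.tanh(X @ np.array([1.0, -2.0, 0.5]))

W = rng.normal(size=(M, d))              # one particle per hidden neuron
eta, lam = 0.05, 1e-3                    # step size, temperature
for _ in range(500):
    act = np.tanh(X @ W.T)               # (n, M) hidden activations
    err = act.mean(axis=1) - y           # mean-field prediction error
    grad = ((err[:, None] * (1 - act**2)) / M).T @ X / n   # (M, d)
    # Langevin step: gradient descent plus injected Gaussian noise
    W += -eta * grad + np.sqrt(2 * eta * lam) * rng.normal(size=W.shape)
print("final mse:", round(float((err**2).mean()), 4))
```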
[LG-159] Dynamic Pricing in the Linear Valuation Model using Shape Constraints
链接: https://arxiv.org/abs/2502.05776
作者: Daniele Bracale,Moulinath Banerjee,Yuekai Sun,Kevin Stoll,Salam Turki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a shape-constrained approach to dynamic pricing for censored data in the linear valuation model that eliminates the need for tuning parameters commonly required in existing methods. Previous works have addressed the challenge of the unknown market noise distribution F_0 using strategies ranging from kernel methods to reinforcement learning algorithms, such as bandit techniques and upper confidence bounds (UCB), under the Lipschitz (and stronger) assumption(s) on F_0. In contrast, our method relies on isotonic regression under the weaker assumption that F_0 is \alpha-Hölder continuous for some \alpha \in (0, 1]. We obtain an upper bound on the asymptotic expected regret that matches existing bounds in the literature for \alpha = 1 (the Lipschitz case). Simulations and experiments with real-world data obtained by Welltower Inc (a major healthcare Real Estate Investment Trust) consistently demonstrate that our method attains better empirical regret in comparison to several existing methods in the literature while offering the advantage of being completely tuning-parameter free.
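The shape constraint makes the estimation step remarkably simple: the sale probability is monotone decreasing in the posted price, so it can be fit with isotonic regression and no bandwidth or other tuning parameter. The toy below drops the covariates and uses a fixed valuation plus logistic noise, purely for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
T = 5000
prices = rng.uniform(0, 2, T)
noise = rng.logistic(scale=0.3, size=T)
sales = (1.0 + noise >= prices).astype(float)      # buy iff valuation >= price

# monotone-decreasing fit of P(sale | price), i.e. the survival of the noise
iso = IsotonicRegression(increasing=False, out_of_bounds="clip")
iso.fit(prices, sales)

grid = np.linspace(0, 2, 201)
revenue = grid * iso.predict(grid)                 # expected revenue per price
print("estimated revenue-maximizing price:", grid[np.argmax(revenue)])
```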
[LG-160] TD(0) Learning converges for Polynomial mixing and non-linear functions
链接: https://arxiv.org/abs/2502.05706
作者: Anupama Sridhar,Alexander Johansen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 12 pages main text
点击查看摘要
Abstract:Theoretical work on Temporal Difference (TD) learning has provided finite-sample and high-probability guarantees for data generated from Markov chains. However, these bounds typically require linear function approximation, instance-dependent step sizes, algorithmic modifications, and restrictive mixing rates. We present theoretical findings for TD learning under more applicable assumptions, including instance-independent step sizes, full data utilization, and polynomial ergodicity, applicable to both linear and non-linear functions. To our knowledge, this is the first proof of TD(0) convergence on Markov data under universal and instance-independent step sizes. While each contribution is significant on its own, their combination allows these bounds to be effectively utilized in practical application settings. Our results include bounds for linear models and for non-linear models under generalized gradients and Hölder continuity.
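The object of study is the plain TD(0) update with a constant, instance-independent step size. The sketch below runs it in the simplest (tabular, i.e. one-hot linear) case on a synthetic Markov chain; the paper's results also cover non-linear function approximators.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 10, 0.9, 0.05
P = rng.dirichlet(np.ones(n_states), size=n_states)  # transition matrix
r = rng.normal(size=n_states)                        # state rewards

w = np.zeros(n_states)                               # tabular value estimates
s = 0
for _ in range(50_000):
    s_next = rng.choice(n_states, p=P[s])
    td_error = r[s] + gamma * w[s_next] - w[s]       # one-step bootstrap error
    w[s] += alpha * td_error                         # TD(0) update
    s = s_next
print("estimated values:", np.round(w, 2))
```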
[LG-161] Generalized Venn and Venn-Abers Calibration with Applications in Conformal Prediction
链接: https://arxiv.org/abs/2502.05676
作者: Lars van der Laan,Ahmed Alaa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Ensuring model calibration is critical for reliable predictions, yet popular distribution-free methods, such as histogram binning and isotonic regression, provide only asymptotic guarantees. We introduce a unified framework for Venn and Venn-Abers calibration, generalizing Vovk’s binary classification approach to arbitrary prediction tasks and loss functions. Venn calibration leverages binning calibrators to construct prediction sets that contain at least one marginally perfectly calibrated point prediction in finite samples, capturing epistemic uncertainty in the calibration process. The width of these sets shrinks asymptotically to zero, converging to a conditionally calibrated point prediction. Furthermore, we propose Venn multicalibration, a novel methodology for finite-sample calibration across subpopulations. For quantile loss, group-conditional and multicalibrated conformal prediction arise as special cases of Venn multicalibration, and Venn calibration produces novel conformal prediction intervals that achieve quantile-conditional coverage. As a separate contribution, we extend distribution-free conditional calibration guarantees of histogram binning and isotonic calibration to general losses.
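For the binary case that this framework generalizes, Vovk's Venn-Abers procedure can be sketched in a few lines: refit an isotonic calibrator twice, once with the test point hypothetically labeled 0 and once labeled 1, and report the interval spanned by the two fitted probabilities. This is the classical construction, simplified for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers(scores_cal, y_cal, score_test):
    p = []
    for y_hyp in (0.0, 1.0):                       # hypothesize both labels
        s = np.append(scores_cal, score_test)
        y = np.append(y_cal, y_hyp)
        iso = IsotonicRegression(y_min=0, y_max=1, out_of_bounds="clip")
        iso.fit(s, y)
        p.append(iso.predict([score_test])[0])
    return min(p), max(p)                          # calibrated interval [p0, p1]

rng = np.random.default_rng(0)
scores = rng.uniform(size=500)
labels = (rng.uniform(size=500) < scores**2).astype(float)  # miscalibrated scores
print(venn_abers(scores, labels, 0.7))
```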
[LG-162] dynoGP: Deep Gaussian Processes for dynamic system identification
链接: https://arxiv.org/abs/2502.05620
作者: Alessio Benavoli,Dario Piga,Marco Forgione,Marco Zaffalon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work, we present a novel approach to system identification for dynamical systems, based on a specific class of Deep Gaussian Processes (Deep GPs). These models are constructed by interconnecting linear dynamic GPs (equivalent to stochastic linear time-invariant dynamical systems) and static GPs (to model static nonlinearities). Our approach combines the strengths of data-driven methods, such as those based on neural network architectures, with the ability to output a probability distribution. This offers a more comprehensive framework for system identification that includes uncertainty quantification. Using both simulated and real-world data, we demonstrate the effectiveness of the proposed approach.
[LG-163] Physics-Conditioned Diffusion Models for Lattice Gauge Theory
链接: https://arxiv.org/abs/2502.05504
作者: Qianteng Zhu,Gert Aarts,Wei Wang,Kai Zhou,Lingxiao Wang
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
*备注: 25 pages, 10 figures, comments are welcome! Codes are available at: this https URL
点击查看摘要
Abstract:We develop diffusion models for simulating lattice gauge theories, where stochastic quantization is explicitly incorporated as a physical condition for sampling. We demonstrate the applicability of this novel sampler to U(1) gauge theory in two spacetime dimensions and find that a model trained at a small inverse coupling constant can be extrapolated to larger inverse coupling regions without encountering the topological freezing problem. Additionally, the trained model can be employed to sample configurations on different lattice sizes without requiring further training. The exactness of the generated samples is ensured by incorporating Metropolis-adjusted Langevin dynamics into the generation process. Furthermore, we demonstrate that this approach enables more efficient sampling of topological quantities compared to traditional algorithms such as Hybrid Monte Carlo and Langevin simulations.
[LG-164] Deep Generative model that uses physical quantities to generate and retrieve solar magnetic active regions
链接: https://arxiv.org/abs/2502.05351
作者: Subhamoy Chatterjee,Andres Munoz-Jaramillo,Anna Malanushenko
类目: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages, 6 figures
点击查看摘要
Abstract:Deep generative models have shown immense potential in generating unseen data that has properties of real data. These models learn complex data-generating distributions starting from a smaller set of latent dimensions. However, generative models have encountered great skepticism in scientific domains due to the disconnection between generative latent vectors and scientifically relevant quantities. In this study, we integrate three types of machine learning models to generate solar magnetic patches in a physically interpretable manner and use those as a query to find matching patches in real observations. We use the magnetic field measurements from Space-weather HMI Active Region Patches (SHARPs) to train a Generative Adversarial Network (GAN). We connect the physical properties of GAN-generated images with their latent vectors to train Support Vector Machines (SVMs) that do mapping between physical and latent spaces. These produce directions in the GAN latent space along which known physical parameters of the SHARPs change. We train a self-supervised learner (SSL) to make queries with generated images and find matches from real data. We find that the GAN-SVM combination enables users to produce high-quality patches that change smoothly only with a prescribed physical quantity, making generative models physically interpretable. We also show that GAN outputs can be used to retrieve real data that shares the same physical properties as the generated query. This elevates Generative Artificial Intelligence (AI) from a means-to-produce artificial data to a novel tool for scientific data interrogation, supporting its applicability beyond the domain of heliophysics.
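The SVM step that links physics to the latent space has a simple core: fit a linear classifier on latent vectors labeled by a physical property, and the normalized weight vector gives a direction along which that property changes. The sketch below uses random stand-ins for the GAN latents and the property, purely to show the mechanics.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 64))              # stand-in GAN latent vectors
flux = z @ rng.normal(size=64)               # stand-in physical property
labels = (flux > np.median(flux)).astype(int)

svm = LinearSVC(C=1.0, max_iter=5000).fit(z, labels)
direction = svm.coef_[0] / np.linalg.norm(svm.coef_[0])

z0 = rng.normal(size=64)
walk = np.stack([z0 + a * direction for a in np.linspace(-3, 3, 7)])
# decoding `walk` through the generator would yield patches whose property
# value changes smoothly along the learned direction
print(walk.shape)
```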
[LG-165] Contextual Scenario Generation for Two-Stage Stochastic Programming
链接: https://arxiv.org/abs/2502.05349
作者: David Islip,Roy H. Kwon,Sanghyeon Bae,Woo Chang Kim
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 47 pages, 10 figures
点击查看摘要
Abstract:Two-stage stochastic programs (2SPs) are important tools for making decisions under uncertainty. Decision-makers use contextual information to generate a set of scenarios to represent the true conditional distribution. However, the number of scenarios required is a barrier to implementing 2SPs, motivating the problem of generating a small set of surrogate scenarios that yield high-quality decisions when they represent uncertainty. Current scenario generation approaches do not leverage contextual information or do not address computational concerns. In response, we propose contextual scenario generation (CSG) to learn a mapping between the context and a set of surrogate scenarios of user-specified size. First, we propose a distributional approach that learns the mapping by minimizing a distributional distance between the predicted surrogate scenarios and the true contextual distribution. Second, we propose a task-based approach that aims to produce surrogate scenarios that yield high-quality decisions. The task-based approach uses neural architectures to approximate the downstream objective and leverages the approximation to search for the mapping. The proposed approaches apply to various problem structures and loosely only require efficient solving of the associated subproblems and 2SPs defined on the reduced scenario sets. Numerical experiments demonstrating the effectiveness of the proposed methods are presented.
[LG-166] Online Covariance Estimation in Nonsmooth Stochastic Approximation
链接: https://arxiv.org/abs/2502.05305
作者: Liwei Jiang,Abhishek Roy,Krishna Balasubramanian,Damek Davis,Dmitriy Drusvyatskiy,Sen Na
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 46 pages, 1 figure
点击查看摘要
Abstract:We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting. In this paper, we study an online batch-means covariance matrix estimator introduced in Zhu et al. (2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order O(\sqrt{d}\, n^{-1/8+\varepsilon}) for any \varepsilon > 0, where d and n denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.
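A simplified, fixed-batch version of the batch-means idea (not the recursive estimator from Zhu et al.) can be written in a few lines: partition the iterates into batches, take batch means, and rescale their sample covariance by the batch size.

```python
import numpy as np

def batch_means_cov(iterates, batch_size):
    """iterates: (n, d) array of SA iterates; returns a (d, d) estimate."""
    n, d = iterates.shape
    m = n // batch_size
    batches = iterates[: m * batch_size].reshape(m, batch_size, d)
    means = batches.mean(axis=1)                             # (m, d)
    centered = means - iterates[: m * batch_size].mean(axis=0)
    # rescale by the batch size so the batch-mean spread targets the limit
    return batch_size * (centered.T @ centered) / (m - 1)

rng = np.random.default_rng(0)
x, traj = np.zeros(2), []
for t in range(1, 100_001):                  # toy SA: noisy quadratic descent
    g = x - np.array([1.0, -1.0]) + rng.normal(size=2)
    x -= 0.5 / t**0.6 * g
    traj.append(x.copy())
print(np.round(batch_means_cov(np.array(traj), batch_size=1000), 2))
```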
[LG-167] Regression and Forecasting of U.S. Stock Returns Based on LSTM
链接: https://arxiv.org/abs/2502.05210
作者: Shicheng Zhou,Zizhou Zhang,Rong Zhang,Yuchen Yin,Chia Hong Chang,Qinyan Shen
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 5pages
点击查看摘要
Abstract:This paper analyses the investment returns of three stock sectors (Manuf, Hitec, and Other) in the U.S. stock market, based on the Fama-French three-factor model, the Carhart four-factor model, and the Fama-French five-factor model, in order to test the validity of these three models for the three sectors of the market. In addition, an LSTM model is used to explore additional factors affecting stock returns. The empirical results show that the Fama-French five-factor model has better validity for the three market segments under study, and that the LSTM model is able to capture the factors affecting the returns of certain industries and can better regress and predict the stock returns of the relevant industries. Keywords: Fama-French model; Carhart model; factor model; LSTM model.
[LG-168] Invariant Measures for Data-Driven Dynamical System Identification: Analysis and Application
链接: https://arxiv.org/abs/2502.05204
作者: Jonah Botvinick-Greenhouse
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*备注: This article draws heavily from arXiv:2301.05193 and arXiv:2412.00589
点击查看摘要
Abstract:We propose a novel approach for performing dynamical system identification, based upon the comparison of simulated and observed physical invariant measures. While standard methods adopt a Lagrangian perspective by directly treating time-trajectories as inference data, we take on an Eulerian perspective and instead seek models fitting the observed global time-invariant statistics. With this change in perspective, we gain robustness against pervasive challenges in system identification including noise, chaos, and slow sampling. In the first half of this paper, we pose the system identification task as a partial differential equation (PDE) constrained optimization problem, in which synthetic stationary solutions of the Fokker-Planck equation, obtained as fixed points of a finite-volume discretization, are compared to physical invariant measures extracted from observed trajectory data. In the latter half of the paper, we improve upon this approach in two crucial directions. First, we develop a Galerkin-inspired modification to the finite-volume surrogate model, based on data-adaptive unstructured meshes and Monte-Carlo integration, enabling the approach to efficiently scale to high-dimensional problems. Second, we leverage Takens’ seminal time-delay embedding theory to introduce a critical data-dependent coordinate transformation which can guarantee unique system identifiability from the invariant measure alone. This contribution resolves a major challenge of system identification through invariant measures, as systems exhibiting distinct transient behaviors may still share the same time-invariant statistics in their state-coordinates. Throughout, we present comprehensive numerical tests which highlight the effectiveness of our approach on a variety of challenging system identification tasks.
[LG-169] A finite element-based machine learning model for hydro-mechanical analysis of swelling behavior in clay-sulfate rocks
链接: https://arxiv.org/abs/2502.05198
作者: Reza Taherdangkoo,Mostafa Mollaali,Matthias Ehrhardt,Thomas Nagel,Lyesse Laloui,Alessio Ferrari,Christoph Butscher
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
点击查看摘要
Abstract:The hydro-mechanical behavior of clay-sulfate rocks, especially their swelling properties, poses significant challenges in geotechnical engineering. This study presents a hybrid constrained machine learning (ML) model developed using the categorical boosting algorithm (CatBoost) tuned with a Bayesian optimization algorithm to predict and analyze the swelling behavior of these complex geological materials. Initially, a coupled hydro-mechanical model based on the Richards’ equation coupled to a deformation process with linear kinematics implemented within the finite element framework OpenGeoSys was used to simulate the observed ground heave in Staufen, Germany, caused by water inflow into the clay-sulfate bearing Triassic Grabfeld Formation. A systematic parametric analysis using Gaussian distributions of key parameters, including Young’s modulus, Poisson’s ratio, maximum swelling pressure, permeability, and air entry pressure, was performed to construct a synthetic database. The ML model takes time, spatial coordinates, and these parameter values as inputs, while water saturation, porosity, and vertical displacement are outputs. In addition, penalty terms were incorporated into the CatBoost objective function to enforce physically meaningful predictions. Results show that the hybrid approach effectively captures the nonlinear and dynamic interactions that govern hydro-mechanical processes. The study demonstrates the ability of the model to predict the swelling behavior of clay-sulfate rocks, providing a robust tool for risk assessment and management in affected regions. The results highlight the potential of ML-driven models to address complex geotechnical challenges.
[LG-170] Physics-Trained Neural Network as Inverse Problem Solver for Potential Fields: An Example of Downward Continuation between Arbitrary Surfaces
链接: https://arxiv.org/abs/2502.05190
作者: Jing Sun,Lu Li,Liang Zhang
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Downward continuation is a critical task in potential field processing, including gravity and magnetic fields, which aims to transfer data from one observation surface to another that is closer to the source of the field. Its effectiveness directly impacts the success of detecting and highlighting subsurface anomalous sources. We treat downward continuation as an inverse problem that relies on solving a forward problem defined by the formula for upward continuation, and we propose a new physics-trained deep neural network (DNN)-based solution for this task. We hard-code the upward continuation process into the DNN’s learning framework, where the DNN itself learns to act as the inverse problem solver and can perform downward continuation without ever being shown any ground truth data. We test the proposed method on both synthetic magnetic data and real-world magnetic data from West Antarctica. The preliminary results demonstrate its effectiveness through comparison with selected benchmarks, opening future avenues for the combined use of DNNs and established geophysical theories to address broader potential field inverse problems, such as density and geometry modelling.
[LG-171] Physics-Driven Self-Supervised Deep Learning for Free-Surface Multiple Elimination
链接: https://arxiv.org/abs/2502.05189
作者: Jing Sun,Tiexing Wang,Eric Verschuur,Ivan Vasconcelos
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent years, deep learning (DL) has emerged as a promising alternative approach for various seismic processing tasks, including primary estimation (or multiple elimination), a crucial step for accurate subsurface imaging. In geophysics, DL methods are commonly based on supervised learning from large amounts of high-quality labelled data. Instead of relying on traditional supervised learning, in the context of free-surface multiple elimination, we propose a method in which the DL model learns to effectively parameterize the free-surface multiple-free wavefield from the full wavefield by incorporating the underlying physics into the loss computation. This, in turn, yields high-quality estimates without ever being shown any ground truth data. Currently, the network reparameterization is performed independently for each dataset. We demonstrate its effectiveness through tests on both synthetic and field data. We employ industry-standard Surface-Related Multiple Elimination (SRME) using, respectively, global least-squares adaptive subtraction and local least-squares adaptive subtraction as benchmarks. The comparison shows that the proposed method outperforms the benchmarks in estimation accuracy, achieving the most complete primary estimation and the least multiple energy leakage, but at the cost of a higher computational burden.
信息检索
[IR-0] LiveForesighter: Generating Future Information for Live-Streaming Recommendations at Kuaishou
链接: https://arxiv.org/abs/2502.06557
作者: Yucheng Lu,Jiangxia Cao,Xu Kuan,Wei Cheng,Wei Jiang,Jiaming Zhang,Yang Shuang,Liu Zhaojie,Liyin Hong
类目: Information Retrieval (cs.IR)
*备注: Work in progress
点击查看摘要
Abstract:Live-streaming, as a new-generation medium connecting users and authors, has attracted a lot of attention and experienced rapid growth in recent years. Compared with content-static short-video recommendation, live-streaming recommendation faces more challenges in giving users a satisfactory experience: (1) Live-streaming content is dynamically ever-changing over time. (2) Valuable behaviors (e.g., sending digital gifts, buying products) always require users to watch for a long time (10 min). Combining these two attributes raises a challenging question for live-streaming recommendation: how to discover the live-streamings whose content a user is interested in at the current moment, and further over a period in the future?
[IR-1] Progressive Collaborative and Semantic Knowledge Fusion for Generative Recommendation
链接: https://arxiv.org/abs/2502.06269
作者: Longtao Xiao,Haozhao Wang,Cheng Wang,Linfei Ji,Yifan Wang,Jieming Zhu,Zhenhua Dong,Rui Zhang,Ruixuan Li
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:With the recent surge in interest surrounding generative paradigms, generative recommendation has increasingly attracted the attention of researchers in the recommendation community. This paradigm generally consists of two stages. In the first stage, pretrained semantic embeddings or collaborative ID embeddings are quantized to create item codes, aiming to capture and preserve rich semantic or collaborative knowledge within these codes. The second stage involves utilizing these discrete codes to perform an autoregressive sequence generation task. Existing methods often either overlook collaborative or semantic knowledge, or combine the two roughly. In this paper, we observe that naively concatenating representations from the semantic and collaborative modalities leads to a semantic domination issue, where the resulting representation is overly influenced by semantic information, effectively overshadowing the collaborative representation. Consequently, downstream recommendation tasks fail to fully exploit the knowledge from both modalities, resulting in suboptimal performance. To address this, we propose a progressive collaborative and semantic knowledge fusion model for generative recommendation, named PRORec, which integrates semantic and collaborative knowledge into a unified code through a two-stage framework. Specifically, in the first stage, we propose a cross-modality knowledge alignment task, which integrates semantic knowledge into collaborative embeddings, enhancing their representational capability. In the second stage, we propose an in-modality knowledge distillation task, designed to effectively capture and integrate knowledge from both semantic and collaborative modalities. Extensive experiments on three widely used benchmarks validate the effectiveness of our approach, demonstrating its superiority compared to existing methods.
[IR-2] FactIR: A Real-World Zero-shot Open-Domain Retrieval Benchmark for Fact-Checking WWW2025
链接: https://arxiv.org/abs/2502.06006
作者: Venktesh V,Vinay Setty
类目: Information Retrieval (cs.IR)
*备注: Accepted to WWW 2025 resource track
点击查看摘要
Abstract:The field of automated fact-checking increasingly depends on retrieving web-based evidence to determine the veracity of claims in real-world scenarios. A significant challenge in this process is not only retrieving relevant information, but also identifying evidence that can both support and refute complex claims. Traditional retrieval methods may return documents that directly address claims or lean toward supporting them, but often struggle with more complex claims requiring indirect reasoning. While some existing benchmarks and methods target retrieval for fact-checking, a comprehensive real-world open-domain benchmark has been lacking. In this paper, we present a real-world retrieval benchmark FactIR, derived from Factiverse production logs, enhanced with human annotations. We rigorously evaluate state-of-the-art retrieval models in a zero-shot setup on FactIR and offer insights for developing practical retrieval systems for fact-checking. Code and data are available at this https URL.
[IR-3] HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads WWW2025
链接: https://arxiv.org/abs/2502.05822
作者: Guobing Gan,Kaiming Gao,Li Wang,Shen Jiang,Peng Jiang
类目: Information Retrieval (cs.IR)
*备注: Accepted by WWW 2025 (Industry Track)
点击查看摘要
Abstract:Search advertising is essential for merchants to reach the target users on short video platforms. Short video ads aligned with user search intents are displayed through relevance matching and bid ranking mechanisms. This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems. Recent vision-language pre-training models have demonstrated promise in various multimodal tasks. However, their contribution to downstream query-video relevance tasks is limited, as the alignment between the pair of visual signals and text differs from the modeling of the triplet of the query, visual signals, and video text. In addition, our previous relevance model provides limited ranking capabilities, largely due to the discrepancy between the binary cross-entropy fine-tuning objective and the ranking objective. To address these limitations, we design a high-consistency multimodal relevance model (HCMRM). It utilizes a simple yet effective method to enhance the consistency between pre-training and relevance tasks. Specifically, during the pre-training phase, along with aligning visual signals and video text, several keywords are extracted from the video text as pseudo-queries to perform the triplet relevance modeling. For the fine-tuning phase, we introduce a hierarchical softmax loss, which enables the model to learn the order within labels while maximizing the distinction between positive and negative samples. This promotes the fusion ranking of relevance and bidding in the subsequent ranking stage. The proposed method has been deployed in the Kuaishou search advertising system for over a year, contributing to a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue.
[IR-4] FlashCheck: Exploration of Efficient Evidence Retrieval for Fast Fact-Checking ECIR2024
链接: https://arxiv.org/abs/2502.05803
作者: Kevin Nanekhan,Venktesh V,Erik Martin,Henrik Vatndal,Vinay Setty,Avishek Anand
类目: Information Retrieval (cs.IR)
*备注: Accepted to ECIR 2024, 15 pages
点击查看摘要
Abstract:The advances in digital tools have led to the rampant spread of misinformation. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable. It is essential for automated fact-checking to be efficient, to aid in combating misinformation in real time and at the source. Fact-checking pipelines primarily comprise a knowledge retrieval component, which extracts relevant knowledge to fact-check a claim from large knowledge sources like Wikipedia, and a verification component. The existing works primarily focus on the fact-verification part rather than evidence retrieval from large data collections, which often faces scalability issues for practical applications such as live fact-checking. In this study, we address this gap by exploring various methods for indexing a succinct set of factual statements from large collections like Wikipedia to enhance the retrieval phase of the fact-checking pipeline. We also explore the impact of vector quantization to further improve the efficiency of pipelines that employ dense retrieval approaches for first-stage retrieval. We study the efficiency and effectiveness of these approaches on fact-checking datasets such as HoVer and WiCE, leveraging Wikipedia as the knowledge source. We also evaluate the real-world utility of the efficient retrieval approaches by fact-checking the 2024 presidential debate, and we open-source the collection of claims with corresponding labels identified in the debate. Through a combination of indexed facts together with dense retrieval and index compression, we achieve up to a 10.0x speedup on CPUs and more than a 20.0x speedup on GPUs compared to classical fact-checking pipelines over large collections.
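For a sense of what the quantized dense-retrieval stage looks like in practice, here is a minimal FAISS sketch with an IVF-PQ index, which compresses embeddings into short codes before first-stage search. All dimensions, list counts, and the random embeddings are placeholders, not the paper's configuration.

```python
import numpy as np
import faiss

d, n = 128, 20_000
rng = np.random.default_rng(0)
corpus = rng.normal(size=(n, d)).astype("float32")   # stand-in fact embeddings

quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, 256, 16, 8)   # 256 lists, 16 subvectors x 8 bits
index.train(corpus)
index.add(corpus)
index.nprobe = 8                                     # inverted lists scanned per query

query = rng.normal(size=(1, d)).astype("float32")    # stand-in claim embedding
dists, ids = index.search(query, 10)
print(ids)
```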
[IR-5] Graph-Based Vector Search: An Experimental Evaluation of the State-of-the-Art
链接: https://arxiv.org/abs/2502.05575
作者: Ilias Azizi,Karima Echihabi,Themis Palpanas
类目: Information Retrieval (cs.IR); Performance (cs.PF)
*备注:
点击查看摘要
Abstract:Vector data is prevalent across business and scientific applications, and its popularity is growing with the proliferation of learned embeddings. Vector data collections often reach billions of vectors with thousands of dimensions, thus, increasing the complexity of their analysis. Vector search is the backbone of many critical analytical tasks, and graph-based methods have become the best choice for analytical tasks that do not require guarantees on the quality of the answers. We briefly survey in-memory graph-based vector search, outline the chronology of the different methods and classify them according to five main design paradigms: seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer. We conduct an exhaustive experimental evaluation of twelve state-of-the-art methods on seven real data collections, with sizes up to 1 billion vectors. We share key insights about the strengths and limitations of these methods; e.g., the best approaches are typically based on incremental insertion and neighborhood diversification, and the choice of the base graph can hurt scalability. Finally, we discuss open research directions, such as the importance of devising more sophisticated data-adaptive seed selection and diversification strategies.
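As a concrete example of the incremental-insertion paradigm, the sketch below builds an HNSW graph index with hnswlib and queries it; every parameter value here is illustrative rather than a recommendation from the survey.

```python
import numpy as np
import hnswlib

d, n = 128, 10_000
rng = np.random.default_rng(0)
data = rng.normal(size=(n, d)).astype("float32")

index = hnswlib.Index(space="l2", dim=d)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))     # incremental insertion builds the graph
index.set_ef(64)                        # search-time beam width

labels, dists = index.knn_query(data[:5], k=10)
print(labels)
```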
[IR-6] Diffusion Model for Interest Refinement in Multi-Interest Recommendation
链接: https://arxiv.org/abs/2502.05561
作者: Yankun Le,Haoran Li,Baoyuan Ou,Yinjie Qing,Zhixuan Yang,Ruilong Su,Fu Zhang
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Multi-interest candidate matching plays a pivotal role in personalized recommender systems, as it captures diverse user interests from their historical behaviors. Most existing methods utilize attention mechanisms to generate interest representations by aggregating historical item embeddings. However, these methods only capture overall item-level relevance, leading to coarse-grained interest representations that include irrelevant information. To address this issue, we propose the Diffusion Multi-Interest model (DMI), a novel framework for refining user interest representations at the dimension level. Specifically, DMI first introduces controllable noise into coarse-grained interest representations at the dimensional level. Then, in the iterative reconstruction process, DMI combines a cross-attention mechanism and an item pruning strategy to reconstruct the personalized interest vectors with the guidance of tailored collaborative information. Extensive experiments demonstrate the effectiveness of DMI, surpassing state-of-the-art methods on offline evaluations and an online A/B test. Successfully deployed in the real-world recommender system, DMI effectively enhances user satisfaction and system performance at scale, serving the major traffic of hundreds of millions of daily active users. (The code will be released for reproducibility once the paper is accepted.)
[IR-7] Large Memory Network for Recommendation
链接: https://arxiv.org/abs/2502.05558
作者: Hui Lu,Zheng Chai,Yuchao Zheng,Zhe Chen,Deping Xie,Peng Xu,Xun Zhou
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Modeling user behavior sequences in recommender systems is essential for understanding user preferences over time, enabling personalized and accurate recommendations for improving user retention and enhancing business values. Despite its significance, there are two challenges for current sequential modeling approaches. From the spatial dimension, it is difficult to mutually perceive similar users’ interests for a generalized intention understanding; from the temporal dimension, current methods are generally prone to forgetting long-term interests due to the fixed-length input sequence. In this paper, we present Large Memory Network (LMN), providing a novel idea by compressing and storing user history behavior information in a large-scale memory block. With the elaborated online deployment strategy, the memory block can be easily scaled up to million-scale in the industry. Extensive offline comparison experiments, memory scaling up experiments, and online A/B test on Douyin E-Commerce Search (ECS) are performed, validating the superior performance of LMN. Currently, LMN has been fully deployed in Douyin ECS, serving millions of users each day.
[IR-8] Adaptive Domain Scaling for Personalized Sequential Modeling in Recommenders
链接: https://arxiv.org/abs/2502.05523
作者: Zheng Chai,Hui Lu,Di Chen,Qin Ren,Xun Zhou
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Users generally exhibit complex behavioral patterns and diverse intentions in multiple business scenarios of super applications like Douyin, presenting great challenges to current industrial multi-domain recommenders. To mitigate the discrepancies across diverse domains, research and industrial practice generally emphasize sophisticated network structures to accommodate diverse data distributions, while neglecting the inherent understanding of the user behavioral sequence from the multi-domain perspective. In this paper, we present the Adaptive Domain Scaling (ADS) model, which comprehensively enhances the personalization capability in target-aware sequence modeling across multiple domains. Specifically, ADS comprises two major modules: personalized sequence representation generation (PSRG) and personalized candidate representation generation (PCRG). The modules contribute to tailored multi-domain learning by dynamically learning both the user behavioral sequence item representation and the candidate target item representation under different domains, facilitating adaptive user intention understanding. Experiments are performed on both a public dataset and two billion-scale industrial datasets, and the extensive results verify the high effectiveness and compatibility of ADS. Besides, we conduct online experiments on two influential business scenarios, including the Douyin Advertisement Platform and the Douyin E-commerce Service Platform, both of which show substantial business improvements. Currently, ADS has been fully deployed in many recommendation services at ByteDance, serving billions of users.
附件下载
点击下载今日全部论文列表