本篇博文主要展示 2024-12-30 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分。若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从Arxiv.org获取,每天12:00左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱地址。

目录

概览 (2024-12-30)

今日共更新571篇论文,其中:

  • 自然语言处理67篇(Computation and Language (cs.CL))
  • 人工智能163篇(Artificial Intelligence (cs.AI))
  • 计算机视觉176篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习195篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] InfAlign: Inference-aware language model alignment

【速读】: 该论文旨在解决现代生成式语言模型(Generative Language Models)在训练过程中语言模型对齐(Language Model Alignment)的优化问题。传统对齐方法的目标是通过微调参考模型,使得对齐模型的样本在KL散度(KL Divergence)约束下相对于参考模型的样本具有较高的胜率(Win Rate)。然而,现有的对齐框架未能充分考虑推理时解码算法(Inference-Time Decoding Algorithms)的影响,导致其在这些方法下表现次优。为此,论文提出了一种推理感知对齐框架(Inference-Aware Alignment, IAPO),通过修改对齐目标,使其能够更好地适应推理时解码策略。关键解决方案包括:1)证明对于任何推理时解码算法,优化对齐策略相对于参考策略的推理时胜率的最优解,等同于对奖励进行变换后的典型RLHF(Reinforcement Learning from Human Feedback)问题;2)提出KL正则化的校准与变换强化学习算法(KL-Regularized Calibrate-and-Transform RL, CTRL),该算法包含奖励校准步骤和基于校准奖励的KL正则化奖励最大化步骤。论文特别针对两种重要的推理时策略(Best-of-N采样和Best-of-N越狱)提出了具体的变换方法,并实证表明该框架在Anthropic帮助性和无害性对话基准数据集上显著优于现有方法,推理时胜率分别提升了8-12%和4-9%。

链接: https://arxiv.org/abs/2412.19792
作者: Ananth Balashankar,Ziteng Sun,Jonathan Berant,Jacob Eisenstein,Michael Collins,Adrian Hutter,Jong Lee,Chirag Nagpal,Flavien Prost,Aradhana Sinha,Ananda Theertha Suresh,Ahmad Beirami
机构: Google DeepMind; Google Research
关键词: training modern generative, modern generative language, Language model alignment, generative language models, Language model
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Language model alignment has become a critical step in training modern generative language models. The goal of alignment is to finetune a reference model such that the win rate of a sample from the aligned model over a sample from the reference model is high, subject to a KL divergence constraint. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. However, the alignment objective does not capture such inference-time decoding procedures. We show that the existing alignment framework is sub-optimal in view of such inference-time methods. We then modify the alignment objective and propose a framework for inference-aware alignment (IAPO). We prove that for any inference-time decoding algorithm, the optimal solution that optimizes the inference-time win rate of the aligned policy against the reference policy is the solution to the typical RLHF problem with a transformation of the reward. This motivates us to provide the KL-regularized calibrate-and-transform RL (CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. We particularize our study to two important inference-time strategies: best-of-N sampling and best-of-N jailbreaking, where N responses are sampled from the model and the one with the highest or lowest reward is selected. We propose specific transformations for these strategies and demonstrate that our framework offers significant improvements over existing state-of-the-art methods for language model alignment. Empirically, we outperform baselines that are designed without taking inference-time decoding into consideration by 8-12% and 4-9% on inference-time win rates over the Anthropic helpfulness and harmlessness dialog benchmark datasets.
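
为便于理解摘要中 Best-of-N 采样与奖励校准的基本思路,下面给出一个极简的 Python 草图。其中候选回复、reward_fn 等均为占位假设,校准方式也只是"取经验分位数"的直观近似,并非论文 CTRL 算法的官方实现。

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrated_reward(r, ref_rewards):
    # 奖励校准的一种直观近似:取 r 在参考策略奖励经验分布中的分位数
    return float(np.mean(np.asarray(ref_rewards) <= r))

def best_of_n(candidates, reward_fn, ref_rewards, highest=True):
    # Best-of-N:对 n 个候选回复打分,选出(校准)奖励最高的一个;
    # highest=False 则取奖励最低者,对应摘要中提到的 best-of-N 越狱
    scores = [calibrated_reward(reward_fn(c), ref_rewards) for c in candidates]
    idx = int(np.argmax(scores)) if highest else int(np.argmin(scores))
    return candidates[idx], scores[idx]

# 玩具示例:用随机数代替真实的回复与奖励模型
ref = rng.normal(size=1000)                  # 参考策略样本的奖励分布
cands = [f"response_{i}" for i in range(8)]  # 假设已从模型采样得到 8 个候选回复
fake_reward = lambda c: float(rng.normal())  # 占位的奖励模型
print(best_of_n(cands, fake_reward, ref))
```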
zh

[NLP-1] Enhancing Whisper’s Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization ICASSP2025

【速读】: 该论文旨在解决Whisper等大型基础模型在低资源语言(如印度语言)中自动语音识别(Automatic Speech Recognition, ASR)性能不佳的问题。为此,论文提出了两种关键解决方案:首先,通过引入语言家族信息进行提示调优(prompt-tuning),以提升Whisper在语言相似性较高的语言中的识别准确率;其次,设计了一种新型分词器(tokenizer),通过减少生成的分词数量来加速Whisper的推理速度。实验表明,该分词器显著降低了推理时间,而提示调优则在不同规模的Whisper模型(包括Small、Medium和Large)中均提升了识别准确率。这两种技术共同实现了在最优词错误率(WER)和推理速度之间的平衡。

链接: https://arxiv.org/abs/2412.19785
作者: Kumud Tripathi,Raj Gothi,Pankaj Wasnik
机构: Media Analysis Group, Sony Research India
关键词: Automatic speech recognition, Automatic speech, significant advancement, Automatic, Whisper
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted at ICASSP 2025, 5 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Automatic speech recognition has recently seen a significant advancement with large foundational models such as Whisper. However, these models often struggle to perform well in low-resource languages, such as Indian languages. This paper explores two novel approaches to enhance Whisper’s multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper’s accuracy in linguistically similar languages. Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper’s inference speed. Our extensive experiments demonstrate that the tokenizer significantly reduces inference time, while prompt-tuning enhances accuracy across various Whisper model sizes, including Small, Medium, and Large. Together, these techniques achieve a balance between optimal WER and inference speed.
zh

[NLP-2] Machine Learning for Sentiment Analysis of Imported Food in Trinidad and Tobago

【速读】: 该论文旨在探讨不同机器学习算法(CNN、LSTM、VADER和RoBERTa)在特立尼达和多巴哥进口食品相关推特数据情感分析中的性能表现。研究主要解决三个核心问题:算法的准确性和效率比较、各模型的最优配置,以及优化模型在实时监测公众情感及其对进口账单影响中的应用潜力。研究通过2018年至2024年的推特数据集,分为不平衡、平衡和时间子集,评估数据平衡和COVID-19大流行对情感趋势的影响。通过十次实验评估不同配置下的模型性能,结果表明VADER在多类和二类情感分类中均表现最优。研究还揭示了COVID-19前后情感趋势的显著变化,对进口政策具有重要启示。解决方案的关键在于通过实验验证不同算法在不同数据集配置下的性能,并确定最优模型及其配置,以支持实时情感监测系统的应用。

链接: https://arxiv.org/abs/2412.19781
作者: Cassandra Daniels,Koffka Khan
机构: 未知
关键词: Twitter data related, imported food items, Trinidad and Tobago, machine learning algorithms, analysis of Twitter
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 27 pages

点击查看摘要

Abstract:This research investigates the performance of various machine learning algorithms (CNN, LSTM, VADER, and RoBERTa) for sentiment analysis of Twitter data related to imported food items in Trinidad and Tobago. The study addresses three primary research questions: the comparative accuracy and efficiency of the algorithms, the optimal configurations for each model, and the potential applications of the optimized models in a live system for monitoring public sentiment and its impact on the import bill. The dataset comprises tweets from 2018 to 2024, divided into imbalanced, balanced, and temporal subsets to assess the impact of data balancing and the COVID-19 pandemic on sentiment trends. Ten experiments were conducted to evaluate the models under various configurations. Results indicated that VADER outperformed the other models in both multi-class and binary sentiment classifications. The study highlights significant changes in sentiment trends pre- and post-COVID-19, with implications for import policies.
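
作为参考,下面是用 vaderSentiment 库对一条推文做三分类的最小示例;±0.05 的 compound 阈值是 VADER 常用的经验划分,论文的具体实验配置(数据划分、多分类设置等)以原文为准。

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def classify(tweet: str) -> str:
    # compound 取值范围 [-1, 1];±0.05 为常用的正/负/中性划分阈值
    score = analyzer.polarity_scores(tweet)["compound"]
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"

print(classify("The imported rice was surprisingly affordable and fresh!"))
```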
zh

[NLP-3] OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

【速读】: 该论文旨在解决基于视觉语言模型(Vision-Language Models, VLMs)的图形用户界面(Graphical User Interface, GUI)代理在训练过程中高质量轨迹数据收集的瓶颈问题。传统方法依赖于人工监督或通过执行预定义任务生成合成数据,这些方法不仅资源密集,且难以保证数据质量,同时存在数据多样性不足以及合成数据与真实环境之间的显著差距。为解决这些问题,论文提出了OS-Genesis,一种新颖的GUI数据合成管道,其关键创新在于反转传统的轨迹收集过程。OS-Genesis首先让代理感知环境并进行逐步交互,随后通过回顾性推导高质量任务来实现轨迹级探索,并利用轨迹奖励模型确保生成轨迹的质量。实验表明,使用OS-Genesis训练的GUI代理在高度挑战性的在线基准测试中表现显著提升,且其数据质量和多样性优于现有合成方法。

链接: https://arxiv.org/abs/2412.19723
作者: Qiushi Sun,Kanzhi Cheng,Zichen Ding,Chuanyang Jin,Yian Wang,Fangzhi Xu,Zhenyu Wu,Chengyou Jia,Liheng Chen,Zhoumianze Liu,Ben Kao,Guohao Li,Junxian He,Yu Qiao,Zhiyong Wu
机构: Shanghai AI Laboratory(上海人工智能实验室); The University of Hong Kong(香港大学); Johns Hopkins University(约翰霍普金斯大学); Shanghai Jiao Tong University(上海交通大学); University of Oxford(牛津大学); Hong Kong University of Science and Technology(香港科技大学)
关键词: Graphical User Interface, Graphical User, User Interface, computer control capability, demonstrated human-like computer
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Work in progress

点击查看摘要

Abstract:Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis’s efficiency and its superior data quality and diversity compared to existing synthesis methods. Our codes, data, and checkpoints are available at the OS-Genesis Homepage (this https URL).
zh

[NLP-4] Toward Adaptive Reasoning in Large Language Models with Thought Rollback ICML2024

【速读】: 该论文旨在解决大型语言模型(LLMs)在逐步推理过程中由于中间推理步骤(thoughts)结构僵化(如链式、树状或无环有向图)而导致的推理不灵活和单向性问题,尤其是在模型频繁产生错误响应(即“幻觉”)时难以应对复杂任务的问题。论文提出了一种新的推理框架,称为“思维回滚”(Thought Rollback, TR),其核心机制是通过回滚思维步骤,允许LLMs对错误思维进行分析,并回退到任何先前错误的思维步骤进行修正。通过将这种试错过程纳入提示(prompt)以引导LLM,每次回滚都能生成一条更可靠的推理路径。因此,TR框架使得LLM能够从简单的提示出发,无需人工标注,逐步自适应地探索思维路径,最终找到正确的解决方案。实验结果表明,TR在数学问题和多任务推理上的问题解决率和交互成本方面均达到了当前最优水平。

链接: https://arxiv.org/abs/2412.19707
作者: Sijia Chen,Baochun Li
机构: 未知
关键词: Large language models, Large language, language models, Large, reasoning
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICML 2024 camera-ready version with 24 pages and 12 figures. Code repo with all prompts: this https URL

点击查看摘要

Abstract:Large language models (LLMs) have been routinely used to solve various tasks using step-by-step reasoning. However, the structure of intermediate reasoning steps, or thoughts, is rigid and unidirectional, such as chains, trees, or acyclic-directed graphs. Consequently, the resulting inflexible and forward-only reasoning may not address challenging tasks and fail when the LLM frequently gives false responses, i.e., "hallucinations". This paper proposes a new reasoning framework, called Thought Rollback (TR), allowing LLMs to adaptively build thought structure while maintaining effective reasoning toward problem-solving under "hallucinations". The core mechanism of TR is rolling back thoughts, which allows LLMs to perform error analysis on thoughts, and thus roll back to any previously mistaken thought for revision. Subsequently, by including such trial-and-error in the prompt to guide the LLM, each rollback leads to one more reliable reasoning path. Therefore, starting with a simple prompt without human annotations, LLM with TR adaptively and gradually explores thoughts for a correct solution. Comprehensive experiments on mathematical problems and multi-task reasoning demonstrate the state-of-the-art performance of TR in terms of problem-solving rate and interaction cost. For instance, the solving rate of GPT-4 with TR outperforms the current best by 9% on the MATH dataset.
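
下面用一个极简的控制流程草图示意"思维回滚"的核心机制;generate_thought、analyze_error 均为假设的占位接口,仅用于说明"错误分析—回滚—记录试错"的逻辑,并非论文的官方实现。

```python
def thought_rollback_solve(generate_thought, analyze_error, problem, max_steps=20):
    """generate_thought(problem, thoughts, trials) -> (content, is_final)
    analyze_error(problem, thoughts) -> 出错步骤下标或 None(均为假设的 LLM 封装接口)。"""
    thoughts = []   # 当前推理路径
    trials = []     # 记录历次试错,拼入后续提示以引导模型
    for _ in range(max_steps):
        content, is_final = generate_thought(problem, thoughts, trials)
        if is_final:
            return content
        thoughts.append(content)
        bad = analyze_error(problem, thoughts)
        if bad is not None:
            trials.append(list(thoughts[bad:]))  # 保留错误片段作为教训
            thoughts = thoughts[:bad]            # 回滚到出错步骤之前重新探索
    return None
```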
zh

[NLP-5] Machine Generated Product Advertisements: Benchmarking LLMs Against Human Performance

【速读】: 该论文旨在比较AI生成与人工撰写的产品描述在多个维度上的表现差异,以评估当前AI在电子商务内容创作中的能力与局限性。研究通过分析四种AI模型(Gemma 2B、LLAMA、GPT2和ChatGPT 4)生成的产品描述,并与人工撰写的描述进行对比,评估了情感(sentiment)、可读性(readability)、说服力(persuasiveness)、搜索引擎优化(SEO)、清晰度(clarity)、情感吸引力(emotional appeal)以及行动号召有效性(call-to-action effectiveness)等指标。研究结果表明,ChatGPT 4表现最佳,而其他模型在逻辑结构、上下文相关性及信息传达方面存在显著不足。解决方案的关键在于采用多维度的评估模型,全面分析AI生成内容的优缺点,为电子商务领域的内容创作提供科学依据。

链接: https://arxiv.org/abs/2412.19610
作者: Sanjukta Ghosh
机构: University at Buffalo (纽约州立大学布法罗分校)
关键词: Search Engine Optimization, study compares, compares the performance, performance of AI-generated, human-written product descriptions
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study compares the performance of AI-generated and human-written product descriptions using a multifaceted evaluation model. We analyze descriptions for 100 products generated by four AI models (Gemma 2B, LLAMA, GPT2, and ChatGPT 4) with and without sample descriptions, against human-written descriptions. Our evaluation metrics include sentiment, readability, persuasiveness, Search Engine Optimization(SEO), clarity, emotional appeal, and call-to-action effectiveness. The results indicate that ChatGPT 4 performs the best. In contrast, other models demonstrate significant shortcomings, producing incoherent and illogical output that lacks logical structure and contextual relevance. These models struggle to maintain focus on the product being described, resulting in disjointed sentences that do not convey meaningful information. This research provides insights into the current capabilities and limitations of AI in the creation of content for e-Commerce.
zh

[NLP-6] A Comparative Study of Machine Unlearning Techniques for Image and Text Classification Models

【速读】: 该论文旨在解决机器遗忘(Machine Unlearning)这一关键问题,即在数据隐私法规的要求下,如何从机器学习模型中选择性移除已学习的数据。解决方案的核心在于对六种先进的遗忘技术进行全面比较分析,这些技术应用于图像和文本分类任务。论文通过评估这些技术的性能、效率以及对法规要求的合规性,揭示了它们在实际应用中的优势和局限性。通过系统分析这些方法,论文旨在为机器遗忘领域的适用性、挑战和权衡提供深入见解,从而推动伦理和适应性机器学习的发展。

链接: https://arxiv.org/abs/2412.19583
作者: Omar M. Safa,Mahmoud M. Abdelaziz,Mustafa Eltawy,Mohamed Mamdouh,Moamen Gharib,Salaheldin Eltenihy,Nagia M. Ghanem,Mohamed M. Ismail
机构: Computer and Systems Engineering Department, Faculty of Engineering, Alexandria University (亚历山大大学)
关键词: data privacy regulations, selectively remove learned, remove learned data, artificial intelligence, privacy regulations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Machine Unlearning has emerged as a critical area in artificial intelligence, addressing the need to selectively remove learned data from machine learning models in response to data privacy regulations. This paper provides a comprehensive comparative analysis of six state-of-the-art unlearning techniques applied to image and text classification tasks. We evaluate their performance, efficiency, and compliance with regulatory requirements, highlighting their strengths and limitations in practical scenarios. By systematically analyzing these methods, we aim to provide insights into their applicability, challenges, and tradeoffs, fostering advancements in the field of ethical and adaptable machine learning.
zh

[NLP-7] TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data

【速读】: 该论文旨在解决语义解析(Semantic Parsing)领域中存在的两个主要挑战:一是对大量手动标注数据集的依赖,二是对未见示例的泛化能力有限。为解决这些问题,作者提出了目标合成数据生成(Targeted Synthetic Data Generation, TARGA)框架,其关键创新在于动态生成高相关性的合成数据,而无需手动标注。具体而言,TARGA通过从给定问题的相关实体和关系出发,逐层扩展和跨层组合来探索潜在的相关查询,并生成相应的自然语言问题,作为上下文学习的合成示例。实验结果表明,TARGA在多个知识库问答(KBQA)数据集上显著优于现有未微调的方法,并在非独立同分布(non-I.I.D.)设置下表现出优异的样本效率、鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2412.19544
作者: Xiang Huang,Jiayu Shen,Shanshan Huang,Sitao Cheng,Xiaxia Wang,Yuzhong Qu
机构: State Key Laboratory for Novel Software Technology, Nanjing University, China(南京大学软件新技术国家重点实验室); University of California, Santa Barbara(加州大学圣塔芭芭拉分校); University of Oxford(牛津大学)
关键词: Semantic parsing, logic forms, plays a crucial, structured environments, crucial role
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semantic parsing, which converts natural language questions into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (TARGA), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entities and relations of a given question, we probe for the potential relevant queries through layer-wise expansion and cross-layer combination. Then we generate corresponding natural language questions for these constructed queries to jointly serve as the synthetic demonstrations for in-context learning. Experiments on multiple knowledge base question answering (KBQA) datasets demonstrate that TARGA, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that utilize close-sourced model, achieving notable improvements in F1 scores on GrailQA(+7.7) and KBQA-Agent(+12.2). Furthermore, TARGA also exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings.
zh

[NLP-8] Exploiting Domain-Specific Parallel Data on Multilingual Language Models for Low-resource Language Translation

【速读】: 该论文旨在解决低资源语言(Low-Resource Languages, LRLs)在领域特定的神经机器翻译(Neural Machine Translation, NMT)系统中表现不佳的问题,尤其是当这些语言的平行数据量有限且在多语言序列到序列语言模型(multilingual sequence-to-sequence Language Models, msLMs)中的表示不足时。为解决这一问题,论文提出了两种关键策略:一是利用辅助领域的平行数据对msLM进行微调(fine-tune),二是进一步预训练(pre-train)msLM。此外,论文还探讨了领域差异对NMT模型性能的影响,并推荐了在构建领域特定的LRL-NMT模型时有效利用辅助平行数据的若干策略。

链接: https://arxiv.org/abs/2412.19522
作者: Surangika Ranathungaa,Shravan Nayak,Shih-Ting Cindy Huang,Yanke Mao,Tong Su,Yun-Hsiang Ray Chan,Songchen Yuan,Anthony Rinaldi,Annie En-Shiun Lee
机构: Mila – Quebec AI Institutes; University of Toronto (多伦多大学); OntarioTech University (安大略理工大学)
关键词: Neural Machine Translation, Neural Machine, Machine Translation, deliver expected results, built on multilingual
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Neural Machine Translation (NMT) systems built on multilingual sequence-to-sequence Language Models (msLMs) fail to deliver expected results when the amount of parallel data for a language, as well as the language’s representation in the model are limited. This restricts the capabilities of domain-specific NMT systems for low-resource languages (LRLs). As a solution, parallel data from auxiliary domains can be used either to fine-tune or to further pre-train the msLM. We present an evaluation of the effectiveness of these two techniques in the context of domain-specific LRL-NMT. We also explore the impact of domain divergence on NMT model performance. We recommend several strategies for utilizing auxiliary parallel data in building domain-specific NMT models for LRLs.
zh

[NLP-9] Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

【速读】: 该论文旨在深入理解大语言模型(LLMs)的自我校正(self-correction)行为,并解决其在自我校正后准确性下降的问题。为此,作者将自我校正能力分解为两个关键维度:置信度(confidence,即对校正答案的自信程度)和批判能力(critique,即将错误答案转化为正确答案的能力),并提出了基于概率视角的度量指标来评估这两种能力以及整体自我校正能力。通过广泛的实验,作者发现不同模型在自我校正行为上表现出显著差异,并揭示了在通过提示(prompts)或上下文学习(in-context learning)操纵模型自我校正行为时,置信度和批判能力之间存在权衡关系。此外,作者提出了一种通过调整监督微调(Supervision Fine-Tuning, SFT)数据格式来提升自我校正能力的简单而有效的策略,该策略在两种能力上均优于传统的监督微调方法,并在自我校正后显著提高了准确性。

链接: https://arxiv.org/abs/2412.19513
作者: Zhe Yang,Yichang Zhang,Yudong Wang,Ziyao Xu,Junyang Lin,Zhifang Sui
机构: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学多媒体信息处理国家重点实验室, 计算机学院); Alibaba Group(阿里巴巴集团)
关键词: Large Language Models, Large Language, Language Models, self-correction, self-generated responses
类目: Computation and Language (cs.CL)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:Large Language Models (LLMs) can correct their self-generated responses, but a decline in accuracy after self-correction is also witnessed. To have a deeper understanding of self-correction, we endeavor to decompose, evaluate, and analyze the self-correction behaviors of LLMs. By enumerating and analyzing answer correctness before and after self-correction, we decompose the self-correction capability into confidence (being confident to correct answers) and critique (turning wrong answers to correct) capabilities, and propose two metrics from a probabilistic perspective to measure these 2 capabilities, along with another metric for overall self-correction capability evaluation. Based on our decomposition and evaluation metrics, we conduct extensive experiments and draw some empirical conclusions. For example, we find different models can exhibit distinct behaviors: some models are confident while others are more critical. We also find the trade-off between the two capabilities (i.e. improving one can lead to a decline in the other) when manipulating model self-correction behavior by prompts or in-context learning. Further, we find a simple yet efficient strategy to improve self-correction capability by transforming Supervision Fine-Tuning (SFT) data format, and our strategy outperforms vanilla SFT in both capabilities and achieves much higher accuracy after self-correction. Our code will be publicly available on GitHub.
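
按摘要中"置信度/批判能力"的概率视角,下面用一个小例子给出这两个指标的一种直观估计方式;这里的条件频率公式只是便于理解的近似,指标的精确定义以论文原文为准。

```python
def self_correction_metrics(before, after):
    # before/after:各样本自我校正前/后是否正确的布尔列表
    # 置信度 ≈ P(校正后仍正确 | 校正前正确);批判能力 ≈ P(校正后变正确 | 校正前错误)
    cc = sum(b and a for b, a in zip(before, after))
    wc = sum((not b) and a for b, a in zip(before, after))
    n_correct = sum(before)
    n_wrong = len(before) - n_correct
    confidence = cc / n_correct if n_correct else 0.0
    critique = wc / n_wrong if n_wrong else 0.0
    return confidence, critique

# 4 个样本:前两个校正前正确,后两个校正前错误
print(self_correction_metrics([True, True, False, False],
                              [True, False, True, False]))   # -> (0.5, 0.5)
```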
zh

[NLP-10] Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

【速读】: 该论文旨在解决在微调大语言模型(LLMs)以适配下游任务时,可能导致安全对齐的LLMs出现安全性退化的问题。当前许多解决方案通过引入额外的安全数据来应对这一问题,但在许多情况下这种方法并不实用。论文提出的关键解决方案是通过合并微调前和微调后的安全对齐模型的权重,从而在提升下游任务性能的同时,保持模型的固有安全性。实验结果表明,该方法在不同下游任务、模型和合并方法中均能有效缓解安全性退化,并为适配安全对齐的LLMs提供了一种实用的解决方案。

链接: https://arxiv.org/abs/2412.19512
作者: Hua Farn,Hsuan Su,Shachi H Kumar,Saurav Sahay,Shang-Tse Chen,Hung-yi Lee
机构: National Taiwan University(国立台湾大学); Intel Lab(英特尔实验室)
关键词: Fine-tuning large language, Fine-tuning large, large language models, widely adopted approach, downstream task performance
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.
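
权重合并最简单的形式是对微调前、后两份参数做逐项线性插值,下面给出一个示意草图;论文实际比较了多种合并方法,这里的插值系数 alpha 仅为假设取值。

```python
import torch

def merge_state_dicts(pre_sd, post_sd, alpha=0.5):
    # 对安全对齐的微调前模型与下游微调后模型做逐参数线性插值
    return {k: (1 - alpha) * pre_sd[k] + alpha * post_sd[k] for k in pre_sd}

# 玩具示例:合并两个同结构线性层的权重
pre = torch.nn.Linear(4, 2).state_dict()
post = torch.nn.Linear(4, 2).state_dict()
merged = merge_state_dicts(pre, post, alpha=0.7)
print({k: v.shape for k, v in merged.items()})
```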
zh

[NLP-11] User Willingness-aware Sales Talk Dataset COLING2025

【速读】: 该论文试图解决在自动销售对话系统(automated sales talk dialogue systems)开发中,如何有效考虑用户意愿(user willingness)的问题。尽管用户意愿在销售对话过程中至关重要,但现有研究尚未开发出能够明确考虑用户意愿的自动销售对话系统,主要障碍在于缺乏包含可靠用户意愿数据的销售对话数据集。为此,论文提出了一种基于生态效度(ecological validity)概念的用户意愿感知销售对话数据收集方法。解决方案的关键在于创建了一个高度模拟真实销售场景的对话环境,以自然引发用户的意愿表达,并通过多角度评估参与者在话语层面的意愿。通过对收集到的数据进行分析,论文深入探讨了实际应用中用户意愿感知的销售对话策略,并开发了一个旨在提升用户购买意图的销售对话系统作为数据集的实际应用。

链接: https://arxiv.org/abs/2412.19490
作者: Asahi Hentona,Jun Baba,Shiki Sato,Reina Akama
机构: CyberAgent; Tohoku University (东北大学)
关键词: User willingness, sales talk, sales system objectives, sales talk process, User
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 12 pages, Accepted to COLING2025

点击查看摘要

Abstract:User willingness is a crucial element in the sales talk process that affects the achievement of the salesperson’s or sales system’s objectives. Despite the importance of user willingness, to the best of our knowledge, no previous study has addressed the development of automated sales talk dialogue systems that explicitly consider user willingness. A major barrier is the lack of sales talk datasets with reliable user willingness data. Thus, in this study, we developed a user willingness-aware sales talk collection by leveraging the ecological validity concept, which is discussed in the field of human-computer interaction. Our approach focused on three types of user willingness essential in real sales interactions. We created a dialogue environment that closely resembles real-world scenarios to elicit natural user willingness, with participants evaluating their willingness at the utterance level from multiple perspectives. We analyzed the collected data to gain insights into practical user willingness-aware sales talk strategies. In addition, as a practical application of the constructed dataset, we developed and evaluated a sales dialogue system aimed at enhancing the user’s intent to purchase.
zh

[NLP-12] Pre-training, Fine-tuning and Re-ranking: A Three-Stage Framework for Legal Question Answering

【速读】: 该论文旨在解决法律问答(Legal QA)领域中因缺乏领域知识和足够标注训练数据而导致的双编码器架构(dual-encoder architecture)性能受限的问题。为此,论文提出了一种三阶段框架(PFR-LQA),包括预训练(pre-training)、微调(fine-tuning)和重排序(re-ranking)。其关键解决方案在于:首先,通过自监督学习目标对法律问题和答案进行领域特定的预训练,使模型适应法律领域;其次,利用监督学习目标对双编码器进行任务特定的微调,以提升其在特定下游问答任务中的表现;最后,通过上下文重排序目标进一步优化文档编码器生成的问题表示,利用上下文相似性增加锚点样本与困难负样本之间的差异,从而实现更好的问题重排序。实验结果表明,PFR-LQA在法律问答任务中优于现有强竞争方法。

链接: https://arxiv.org/abs/2412.19482
作者: Shiwen Ni,Hao Cheng,Min Yang
机构: 未知
关键词: attracted increasing attention, seeking legal advice, people seeking legal, attracted increasing, increasing attention
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Legal question answering (QA) has attracted increasing attention from people seeking legal advice, which aims to retrieve the most applicable answers from a large-scale database of question-answer pairs. Previous methods mainly use a dual-encoder architecture to learn dense representations of both questions and answers. However, these methods could suffer from lacking domain knowledge and sufficient labeled training data. In this paper, we propose a three-stage (pre-training, fine-tuning and re-ranking) framework for legal QA (called PFR-LQA), which promotes the fine-grained text representation learning and boosts the performance of dense retrieval with the dual-encoder architecture. Concretely, we first conduct domain-specific pre-training on legal questions and answers through a self-supervised training objective, allowing the pre-trained model to be adapted to the legal domain. Then, we perform task-specific fine-tuning of the dual-encoder on legal question-answer pairs by using the supervised learning objective, leading to a high-quality dual-encoder for the specific downstream QA task. Finally, we employ a contextual re-ranking objective to further refine the output representations of questions produced by the document encoder, which uses contextual similarity to increase the discrepancy between the anchor and hard negative samples for better question re-ranking. We conduct extensive experiments on a manually annotated legal QA dataset. Experimental results show that our PFR-LQA method achieves better performance than the strong competitors for legal question answering.
zh

[NLP-13] Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models

【速读】: 该论文旨在解决如何将大规模预训练模型(如GPT-4)的知识高效地迁移到轻量级学生模型中,以在保持高性能的同时降低计算成本。传统软标签蒸馏方法通常仅通过输出分布进行知识迁移,而本文提出的解决方案关键在于引入多层特征对齐策略,深度对齐教师模型和学生模型的中间特征及注意力机制,从而最大限度地保留教师模型的语义表达能力和上下文建模能力。此外,该方法通过构建多任务损失函数(包括特征匹配损失、注意力对齐损失和输出分布匹配损失),确保多层次信息的联合优化传递。实验结果表明,该方法在GLUE数据集及多种自然语言处理任务上表现优异,接近GPT-4的性能,同时显著优于DeBERTa、XLNet和GPT-3等基线模型,展示了其在计算效率和存储需求上的显著优势。

链接: https://arxiv.org/abs/2412.19449
作者: Shuo Wang,Chihang Wang,Jia Gao,Zhen Qi,Hongye Zheng,Xiaoxuan Liao
机构: 未知
关键词: knowledge distillation algorithm, distillation algorithm based, large pre-trained models, feature alignment, reducing computational costs
类目: Computation and Language (cs.CL)
备注: 4 pages

点击查看摘要

Abstract:This study proposes a knowledge distillation algorithm based on large language models and feature alignment, aiming to effectively transfer the knowledge of large pre-trained models into lightweight student models, thereby reducing computational costs while maintaining high model performance. Different from the traditional soft label distillation method, this method introduces a multi-layer feature alignment strategy to deeply align the intermediate features and attention mechanisms of the teacher model and the student model, maximally retaining the semantic expression ability and context modeling ability of the teacher model. In terms of method design, a multi-task loss function is constructed, including feature matching loss, attention alignment loss, and output distribution matching loss, to ensure multi-level information transfer through joint optimization. The experiments were comprehensively evaluated on the GLUE data set and various natural language processing tasks. The results show that the proposed model performs very close to the state-of-the-art GPT-4 model in terms of evaluation indicators such as perplexity, BLEU, ROUGE, and CER. At the same time, it far exceeds baseline models such as DeBERTa, XLNet, and GPT-3, showing significant performance improvements and computing efficiency advantages. Research results show that the feature alignment distillation strategy is an effective model compression method that can significantly reduce computational overhead and storage requirements while maintaining model capabilities. Future research can be further expanded in the directions of self-supervised learning, cross-modal feature alignment, and multi-task transfer learning to provide more flexible and efficient solutions for the deployment and optimization of deep learning models.
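
下面的 PyTorch 草图按摘要描述拼出"特征匹配 + 注意力对齐 + 输出分布匹配"的多任务蒸馏损失;形状对齐方式、温度与各项权重均为假设设置,并非论文的精确公式。

```python
import torch
import torch.nn.functional as F

def distillation_loss(s_feats, t_feats, s_attn, t_attn, s_logits, t_logits,
                      T=2.0, w_feat=1.0, w_attn=1.0, w_kd=1.0):
    # 假设学生/教师的中间特征与注意力图已被投影到相同形状
    feat_loss = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))   # 特征匹配损失
    attn_loss = sum(F.mse_loss(s, t) for s, t in zip(s_attn, t_attn))     # 注意力对齐损失
    kd_loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),               # 输出分布匹配损失
                       F.softmax(t_logits / T, dim=-1),
                       reduction="batchmean") * (T * T)
    return w_feat * feat_loss + w_attn * attn_loss + w_kd * kd_loss

# 玩具示例:用随机张量代替真实的特征、注意力与 logits
feats = [torch.randn(2, 8, 16) for _ in range(2)]
attns = [torch.randn(2, 4, 8, 8) for _ in range(2)]
logits = torch.randn(2, 8, 100)
print(distillation_loss(feats, feats, attns, attns, logits, logits).item())  # 输入相同则损失为 0
```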
zh

[NLP-14] DeepSeek-V3 Technical Report

【速读】: 该论文旨在解决大规模语言模型在推理效率和训练成本方面的挑战,同时提升模型性能。解决方案的关键在于采用了混合专家模型(Mixture-of-Experts, MoE)架构,并结合了多头潜在注意力机制(Multi-head Latent Attention, MLA)和DeepSeekMoE架构,这些技术在DeepSeek-V2中已得到充分验证。此外,DeepSeek-V3引入了无辅助损失的负载均衡策略和多令牌预测训练目标,以进一步增强模型性能。通过在大规模高质量数据集(14.8万亿令牌)上进行预训练,并结合监督微调和强化学习阶段,DeepSeek-V3在性能上超越了其他开源模型,并与领先的闭源模型相媲美,同时显著降低了训练成本(仅需2.788M H800 GPU小时)并保持了训练过程的稳定性。

链接: https://arxiv.org/abs/2412.19437
作者: DeepSeek-AI,Aixin Liu,Bei Feng,Bing Xue,Bingxuan Wang,Bochao Wu,Chengda Lu,Chenggang Zhao,Chengqi Deng,Chenyu Zhang,Chong Ruan,Damai Dai,Daya Guo,Dejian Yang,Deli Chen,Dongjie Ji,Erhang Li,Fangyun Lin,Fucong Dai,Fuli Luo,Guangbo Hao,Guanting Chen,Guowei Li,H. Zhang,Han Bao,Hanwei Xu,Haocheng Wang,Haowei Zhang,Honghui Ding,Huajian Xin,Huazuo Gao,Hui Li,Hui Qu,J.L. Cai,Jian Liang,Jianzhong Guo,Jiaqi Ni,Jiashi Li,Jiawei Wang,Jin Chen,Jingchang Chen,Jingyang Yuan,Junjie Qiu,Junlong Li,Junxiao Song,Kai Dong,Kai Hu,Kaige Gao,Kang Guan,Kexin Huang,Kuai Yu,Lean Wang,Lecong Zhang,Lei Xu,Leyi Xia,Liang Zhao,Litong Wang,Liyue Zhang,Meng Li,Miaojun Wang,Mingchuan Zhang,Minghua Zhang,Minghui Tang,Mingming Li,Ning Tian,Panpan Huang,Peiyi Wang,Peng Zhang,Qiancheng Wang,Qihao Zhu,Qinyu Chen,Qiushi Du,R.J. Chen,R.L. Jin,Ruiqi Ge,Ruisong Zhang,Ruizhe Pan,Runji Wang,Runxin Xu,Ruoyu Zhang,Ruyi Chen,S.S. Li,Shanghao Lu,Shangyan Zhou,Shanhuang Chen,Shaoqing Wu,Shengfeng Ye,Shengfeng Ye,Shirong Ma,Shiyu Wang,Shuang Zhou,Shuiping Yu,Shunfeng Zhou,Shuting Pan,T. Wang,Tao Yun,Tian Pei,Tianyu Sun,W.L. Xiao,Wangding Zeng
机构: 未知
关键词: Multi-head Latent Attention, adopts Multi-head Latent, total parameters, Latent Attention, Multi-head Latent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at this https URL.
zh

[NLP-15] Dynamic Skill Adaptation for Large Language Models

【速读】: 该论文旨在解决大语言模型(LLMs)在适应新技能和复杂技能时面临的挑战。传统方法依赖于从人类整理的静态数据中随机学习,而本文提出的动态技能适应(Dynamic Skill Adaptation, DSA)框架则通过模仿人类学习路径,自动生成和组织训练数据,并根据训练动态动态调整数据。具体而言,DSA框架首先通过将复杂技能分解为子技能并基于其在人类学习中的依赖关系构建技能图(skill graph)。对于每个技能,利用LLMs生成包含详细技能描述的教科书式数据用于预训练,以及针对明确运用技能解决问题的练习式数据用于指令微调(instruction-tuning)。在指令微调过程中,动态更新训练数据,降低易学样本的权重,生成更复杂的样本,并过滤掉错误数据。实验结果表明,该方法在数学推理技能和社会研究技能的适应上具有显著效果。

链接: https://arxiv.org/abs/2412.19361
作者: Jiaao Chen,Diyi Yang
机构: 未知
关键词: Dynamic Skill Adaptation, present Dynamic Skill, Large Language Models, Skill Adaptation, framework to adapt
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Dynamic Skill Adaptation (DSA), an adaptive and dynamic framework to adapt novel and complex skills to Large Language Models (LLMs). Compared with previous work which learns from human-curated and static data in random orders, we propose to first automatically generate and organize the training data by mimicking the learning pathways of human and then dynamically tailor the training data based on the training dynamics. Specifically, inspired by the learning structures and teaching strategies in the human education system, we first construct a skill graph by decomposing complex skills into sub-skills and arranging them based on their dependencies in human syllables. For every skill, we utilize LLMs to generate both textbook-like data which contains detailed descriptions of skills for pre-training and exercise-like data which targets at explicitly utilizing the skills to solve problems for instruction-tuning. Furthermore, during the instruction-tuning, we dynamically update the training data which down-weight easy-to-learn examples, generate more complex examples, and filter out data with errors. Experiments on large language models such as LLAMA and Mistral demonstrate the effectiveness of our proposed methods in adapting math reasoning skills and social study skills.
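
摘要中"按依赖关系组织技能图"的做法可以用拓扑排序直观理解,下面是一个极简示意;技能名称与依赖关系均为虚构示例,仅说明"先修子技能排在前、据此安排课程式训练数据顺序"的思路。

```python
from graphlib import TopologicalSorter

# 假设的技能依赖图:键为技能,值为其先修子技能集合(纯属虚构示例)
skill_graph = {
    "分数运算": {"整数四则运算"},
    "一元一次方程": {"分数运算"},
    "应用题建模": {"一元一次方程"},
}

order = list(TopologicalSorter(skill_graph).static_order())
print(order)  # 先修技能在前,可按此顺序生成教科书式/练习式训练数据
```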
zh

[NLP-16] ETTA: Elucidating the Design Space of Text-to-Audio Models

【速读】: 该论文旨在解决文本到音频(Text-To-Audio, TTA)合成领域中数据、模型架构、训练目标函数和采样策略对目标基准的影响尚未充分理解的问题。为此,论文通过大规模实证实验,重点研究了扩散模型(diffusion models)和流匹配模型(flow matching models),以提供对TTA模型设计空间的全面理解。解决方案的关键包括:1)构建了AF-Synthetic数据集,该数据集通过音频理解模型生成了高质量合成字幕;2)系统比较了不同架构、训练和推理设计选择对TTA模型的影响;3)分析了采样方法及其在生成质量和推理速度方面的帕累托曲线(Pareto curves)。基于这些分析,论文提出了最佳模型Elucidated Text-To-Audio (ETTA),该模型在AudioCaps和MusicCaps基准测试中优于使用公开数据训练的基线模型,并与使用专有数据训练的模型竞争,尤其在处理复杂和富有想象力的字幕生成创意音频任务中表现出色。

链接: https://arxiv.org/abs/2412.19351
作者: Sang-gil Lee,Zhifeng Kong,Arushi Goel,Sungwon Kim,Rafael Valle,Bryan Catanzaro
机构: 未知
关键词: natural language prompts, Recent years, enabling users, language prompts, users to enrich
类目: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA’s improved ability to generate creative audio following complex and imaginative captions – a task that is more challenging than current benchmarks.
zh

[NLP-17] On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages AAAI2025

【速读】: 该论文旨在探讨选择性状态空间模型(Selective State-Space Models, SSMs)在形式表达能力和长度泛化性能方面的不足,特别是在正则语言任务(如有限状态自动机模拟)中的表现。为了解决这些问题,论文提出了一种新的模型——选择性密集状态空间模型(Selective Dense State-Space Model, SD-SSM),该模型通过引入密集转移矩阵字典、基于softmax的选择机制以及由层归一化和线性映射组成的读出机制,首次实现了在单层结构下对多种正则语言任务的完美长度泛化。此外,论文还通过实验评估了对角选择性SSM在交换和非交换自动机上的表现,并结合理论分析解释了实验结果。

链接: https://arxiv.org/abs/2412.19350
作者: Aleksandar Terzić,Michael Hersche,Giacomo Camposampiero,Thomas Hofmann,Abu Sebastian,Abbas Rahimi
机构: 未知
关键词: offering the unique, sequential inference, Selective state-space models, emerging alternative, unique advantage
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 7 figures, to be published in AAAI 2025

点击查看摘要

Abstract:Selective state-space models (SSMs) are an emerging alternative to the Transformer, offering the unique advantage of parallel training and sequential inference. Although these models have shown promising performance on a variety of tasks, their formal expressiveness and length generalization properties remain underexplored. In this work, we provide insight into the workings of selective SSMs by analyzing their expressiveness and length generalization performance on regular language tasks, i.e., finite-state automaton (FSA) emulation. We address certain limitations of modern SSM-based architectures by introducing the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization on a set of various regular language tasks using a single layer. It utilizes a dictionary of dense transition matrices, a softmax selection mechanism that creates a convex combination of dictionary matrices at each time step, and a readout consisting of layer normalization followed by a linear map. We then proceed to evaluate variants of diagonal selective SSMs by considering their empirical performance on commutative and non-commutative automata. We explain the experimental results with theoretical considerations. Our code is available at this https URL.
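
按摘要描述,SD-SSM 的核心是"稠密转移矩阵字典 + softmax 凸组合选择 + LayerNorm 读出"。下面给出一个单步更新的 PyTorch 草图,其中输入注入等细节为本草图的假设,精确结构请以论文与官方代码为准。

```python
import torch
import torch.nn as nn

class SDSSMCell(nn.Module):
    def __init__(self, d_in, d_state, n_mats, d_out):
        super().__init__()
        # 稠密转移矩阵字典
        self.mats = nn.Parameter(torch.randn(n_mats, d_state, d_state) / d_state ** 0.5)
        self.select = nn.Linear(d_in, n_mats)   # softmax 选择机制
        self.inp = nn.Linear(d_in, d_state)     # 输入注入方式(本草图的假设)
        self.norm = nn.LayerNorm(d_state)
        self.readout = nn.Linear(d_state, d_out)

    def forward(self, h, x):
        w = torch.softmax(self.select(x), dim=-1)            # (B, n_mats)
        A = torch.einsum("bm,mij->bij", w, self.mats)         # 字典矩阵的凸组合
        h = torch.einsum("bij,bj->bi", A, h) + self.inp(x)    # 状态更新
        return h, self.readout(self.norm(h))

cell = SDSSMCell(d_in=8, d_state=16, n_mats=4, d_out=2)
h = torch.zeros(3, 16)
for t in range(5):                       # 顺序推理:逐步喂入序列
    h, y = cell(h, torch.randn(3, 8))
print(y.shape)                           # torch.Size([3, 2])
```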
zh

[NLP-18] Semi-Supervised Learning from Small Annotated Data and Large Unlabeled Data for Fine-grained PICO Entity Recognition

【速读】: 该论文旨在解决从临床试验文献中提取细粒度PICO(Participants, Intervention, Comparison, Outcomes)元素的问题,这对于临床证据的检索、评估和合成至关重要。现有方法未能区分PICO实体的属性,因此本研究开发了一种名为FinePICO的命名实体识别(NER)模型,以提取具有细粒度的PICO实体。解决方案的关键在于采用了一种半监督学习方法,结合有限的标注数据和大量未标注数据来训练NER模型。通过这种方法,FinePICO在仅使用少量标注样本的情况下,显著优于基线模型(F1值提高了16%以上),并展示了在不同PICO框架和语料库中的泛化能力。

链接: https://arxiv.org/abs/2412.19346
作者: Fangyi Chen,Gongbo Zhang,Yilu Fang,Yifan Peng,Chunhua Weng
机构: 未知
关键词: Extracting PICO elements, clinical evidence retrieval, clinical trial literature, Extracting PICO, PICO entities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objective: Extracting PICO elements – Participants, Intervention, Comparison, and Outcomes – from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entities with fine granularities. Materials and Methods: Using a corpus of 2,511 abstracts with PICO mentions from 4 public datasets, we developed a semi-supervised method to facilitate the training of a NER model, FinePICO, by combining limited annotated data of PICO entities and abundant unlabeled data. For evaluation, we divided the entire dataset into two subsets: a smaller group with annotations and a larger group without annotations. We then established the theoretical lower and upper performance bounds based on the performance of supervised learning models trained solely on the small, annotated subset and on the entire set with complete annotations, respectively. Finally, we evaluated FinePICO on both the smaller annotated subset and the larger, initially unannotated subset. We measured the performance of FinePICO using precision, recall, and F1. Results: Our method achieved precision/recall/F1 of 0.567/0.636/0.60, respectively, using a small set of annotated samples, outperforming the baseline model (F1: 0.437) by more than 16%. The model demonstrates generalizability to a different PICO framework and to another corpus, which consistently outperforms the benchmark in diverse experimental settings (p-value < 0.001). Conclusion: This study contributes a generalizable and effective semi-supervised approach to named entity recognition leveraging large unlabeled data together with small, annotated data. It also initially supports fine-grained PICO extraction.
zh

[NLP-19] ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning

【速读】: 该论文旨在解决当前轻量级图像描述(lightweight image captioning)模型中,仅依赖检索文本作为文本提示(text prompts)而忽视图像信息在视觉嵌入空间中的充分反映的问题。具体而言,现有方法主要依赖CLIP视觉嵌入(CLIP visual embedding)来获取视觉信息,导致图像描述在提示中的固有信息未能充分体现在视觉嵌入空间中。为解决这一问题,论文提出了ViPCap,一种基于检索文本的视觉提示(visual prompt)方法。ViPCap的关键在于将检索文本与图像信息结合,作为视觉提示,以增强模型捕捉相关视觉信息的能力。通过将文本提示映射到CLIP空间并生成多个随机高斯分布(Gaussian distributions),该方法利用采样技术探索随机增强的分布,有效检索包含图像信息的语义特征。这些检索到的特征被整合到图像中并指定为视觉提示,从而在COCO、Flickr30k和NoCaps等数据集上实现了性能提升。实验结果表明,ViPCap在效率和效果上显著优于先前的轻量级图像描述模型,展示了其作为即插即用解决方案的潜力。

链接: https://arxiv.org/abs/2412.19289
作者: Taewhan Kim,Soeun Lee,Si-Woo Kim,Dong-Jin Kim
机构: 未知
关键词: Recent lightweight image, Recent lightweight, CLIP visual embedding, lightweight image captioning, text prompts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent lightweight image captioning models using retrieved data mainly focus on text prompts. However, previous works only utilize the retrieved text as text prompts, and the visual information relies only on the CLIP visual embedding. Because of this issue, there is a limitation that the image descriptions inherent in the prompt are not sufficiently reflected in the visual embedding space. To tackle this issue, we propose ViPCap, a novel retrieval text-based visual prompt for lightweight image captioning. ViPCap leverages the retrieved text with image information as visual prompts to enhance the ability of the model to capture relevant visual information. By mapping text prompts into the CLIP space and generating multiple randomized Gaussian distributions, our method leverages sampling to explore randomly augmented distributions and effectively retrieves the semantic features that contain image information. These retrieved features are integrated into the image and designated as the visual prompt, leading to performance improvements on the datasets such as COCO, Flickr30k, and NoCaps. Experimental results demonstrate that ViPCap significantly outperforms prior lightweight captioning models in efficiency and effectiveness, demonstrating the potential for a plug-and-play solution.
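
摘要中"将文本提示映射到 CLIP 空间并生成多个随机高斯分布再采样"的思路,可以用下面几行代码直观示意;sigma 与采样数为假设取值,真实方法中的分布参数化与语义特征检索细节以论文为准。

```python
import torch

def sample_augmented_embeddings(text_emb, sigma=0.1, n_samples=5):
    # 以检索文本的嵌入为均值、sigma 为标准差,采样多个随机增强的嵌入
    noise = torch.randn(n_samples, *text_emb.shape) * sigma
    return text_emb.unsqueeze(0) + noise        # (n_samples, dim)

emb = torch.randn(512)                           # 假设的 CLIP 文本嵌入
prompts = sample_augmented_embeddings(emb)
print(prompts.shape)                             # torch.Size([5, 512])
```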
zh

[NLP-20] Optimizing Multi-Stage Language Models for Effective Text Retrieval

【速读】: 该论文旨在解决在特定领域(如日本法律系统)中文本检索效率低下的问题。现有检索方法在领域特定场景中表现不佳,因此需要定制化的解决方案。论文提出了一种新颖的两阶段文本检索流程,专门针对日本法律数据集进行优化。其关键解决方案在于利用先进的语言模型(language models)实现最先进的性能,显著提升检索效率和准确性。此外,为了进一步增强鲁棒性和适应性,论文引入了一个集成模型(ensemble model),该模型整合了多种检索策略,从而在多样化任务中取得了优异的结果。通过大量实验验证,该方法不仅在日本法律数据集上表现突出,还在广泛认可的基准测试(如MS-MARCO)中展现了强大的性能,为领域特定和通用场景下的文本检索设立了新标准。

链接: https://arxiv.org/abs/2412.19265
作者: Quang Hoang Trung,Le Trung Hoang,Nguyen Van Hoang Phuc
机构: 未知
关键词: Efficient text retrieval, legal document analysis, Japanese legal systems, Efficient text, document analysis
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Efficient text retrieval is critical for applications such as legal document analysis, particularly in specialized contexts like Japanese legal systems. Existing retrieval methods often underperform in such domain-specific scenarios, necessitating tailored approaches. In this paper, we introduce a novel two-phase text retrieval pipeline optimized for Japanese legal datasets. Our method leverages advanced language models to achieve state-of-the-art performance, significantly improving retrieval efficiency and accuracy. To further enhance robustness and adaptability, we incorporate an ensemble model that integrates multiple retrieval strategies, resulting in superior outcomes across diverse tasks. Extensive experiments validate the effectiveness of our approach, demonstrating strong performance on both Japanese legal datasets and widely recognized benchmarks like MS-MARCO. Our work establishes new standards for text retrieval in domain-specific and general contexts, providing a comprehensive solution for addressing complex queries in legal and multilingual environments.
zh

[NLP-21] MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes

【速读】: 该论文旨在解决大语言模型(LLMs)在验证和纠正医学文本中的错误方面的能力评估问题。尽管已有研究表明LLMs在某些医学考试中能够正确回答医学问题,甚至超过人类平均分数,但目前尚无研究系统评估LLMs在检测和纠正医学文本错误方面的表现。为此,论文引入了MEDEC(Medical Error Detection and Correction)基准,这是首个公开可用的用于临床笔记中医学错误检测和纠正的基准,涵盖了诊断(Diagnosis)、管理(Management)、治疗(Treatment)、药物治疗(Pharmacotherapy)和病原体(Causal Organism)五类错误。MEDEC包含3,848份临床文本,其中包括来自三个美国医院系统的488份临床笔记,这些笔记此前未被任何LLM接触过。该数据集已被用于MEDIQA-CORR共享任务,评估了17个参与系统的表现。论文还评估了多个最新LLMs(如o1-preview、GPT-4、Claude 3.5 Sonnet和Gemini 2.0 Flash)在检测和纠正医学错误任务中的表现,并与两名医学医生进行了对比研究。结果表明,MEDEC是一个具有足够挑战性的基准,能够有效评估模型在验证和纠正医学笔记中的错误方面的能力。尽管最新LLMs在错误检测和纠正方面表现良好,但仍不及医学医生。论文进一步讨论了这一差距背后的潜在因素、实验中的洞察、当前评估指标的局限性,并提出了未来研究的潜在方向。

链接: https://arxiv.org/abs/2412.19260
作者: Asma Ben Abacha,Wen-wai Yim,Yujuan Fu,Zhaoyi Sun,Meliha Yetisgen,Fei Xia,Thomas Lin
机构: 未知
关键词: Large Language Models, average human score, Large Language, Language Models, medical questions correctly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages

点击查看摘要

Abstract:Several studies showed that Large Language Models (LLMs) can answer medical questions correctly, even outperforming the average human score in some medical exams. However, to our knowledge, no study has been conducted to assess the ability of language models to validate existing or generated medical text for correctness and consistency. In this paper, we introduce MEDEC (this https URL), the first publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that were not previously seen by any LLM. The dataset has been used for the MEDIQA-CORR shared task to evaluate seventeen participating systems [Ben Abacha et al., 2024]. In this paper, we describe the data creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4, Claude 3.5 Sonnet, and Gemini 2.0 Flash) for the tasks of detecting and correcting medical errors requiring both medical knowledge and reasoning capabilities. We also conducted a comparative study where two medical doctors performed the same task on the MEDEC test set. The results showed that MEDEC is a sufficiently challenging benchmark to assess the ability of models to validate existing or generated notes and to correct medical errors. We also found that although recent LLMs have a good performance in error detection and correction, they are still outperformed by medical doctors in these tasks. We discuss the potential factors behind this gap, the insights from our experiments, the limitations of current evaluation metrics, and share potential pointers for future research.
zh

[NLP-22] Multi-matrix Factorization Attention

【速读】: 该论文旨在解决在严格的键值缓存(KV cache)约束下,现有多头注意力机制(MHA)及其变体(如MLA)性能下降的问题。论文提出了两种新颖的注意力架构:多矩阵分解注意力(Multi-matrix Factorization Attention, MFA)和MFA-键重用(MFA-Key-Reuse, MFA-KR)。MFA通过在查询-键(QK)电路中引入低秩矩阵分解,有效扩展了注意力头的数量和维度,从而增强了模型容量。MFA-KR则进一步通过值投影重参数化,将键缓存重新用作值缓存,显著降低了内存需求。这两种架构在严格的KV cache预算下均表现出色,其中MFA-KR在更苛刻的KV cache限制下仅需牺牲少量性能。实验表明,MFA和MFA-KR在减少KV cache使用量(分别高达56%和93.7%)的同时,性能优于MLA,并与MHA相当。

链接: https://arxiv.org/abs/2412.19255
作者: Jingcheng Hu,Houyi Li,Yinmin Zhang,Zili Wang,Shuigeng Zhou,Xiangyu Zhang,Heung-Yeung Shum
机构: 未知
关键词: Multi-matrix Factorization Attention, Multi-matrix Factorization, including SOTA methods, Multi-matrix, attention
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain as strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as value through value projection re-parameterization. MFA’s design enables strong model capacity when working under tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
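
下面的草图按摘要的思路演示"在 QK 电路中做低秩分解以扩展查询头、同时让所有头共享一份 K/V 以压缩 KV cache";这只是对 MFA 设计动机的粗略示意,并非论文的精确结构(例如未包含因果掩码与 MFA-KR 的键重用)。

```python
import torch
import torch.nn as nn

class LowRankQKAttention(nn.Module):
    def __init__(self, d_model, n_heads, d_head, rank):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        # 查询投影的低秩分解:d_model -> rank -> n_heads * d_head
        self.q_down = nn.Linear(d_model, rank, bias=False)
        self.q_up = nn.Linear(rank, n_heads * d_head, bias=False)
        self.k = nn.Linear(d_model, d_head, bias=False)   # 所有头共享的单份 K
        self.v = nn.Linear(d_model, d_head, bias=False)   # 所有头共享的单份 V
        self.out = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_up(self.q_down(x)).view(b, t, self.n_heads, self.d_head)
        k, v = self.k(x), self.v(x)               # 推理时 KV cache 只需缓存这两个张量
        att = torch.einsum("bthd,bsd->bhts", q, k) / self.d_head ** 0.5
        att = att.softmax(dim=-1)
        y = torch.einsum("bhts,bsd->bthd", att, v).reshape(b, t, -1)
        return self.out(y)

layer = LowRankQKAttention(d_model=64, n_heads=16, d_head=32, rank=32)
print(layer(torch.randn(2, 10, 64)).shape)        # torch.Size([2, 10, 64])
```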
zh

[NLP-23] Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching

【速读】: 该论文旨在解决图像-文本匹配任务中的复杂关系理解与表示问题。随着多模态学习(multimodal learning)的快速发展,图像-文本匹配任务作为连接视觉与语言的桥梁,其重要性日益凸显。论文提出的解决方案关键在于引入了一种创新的视觉语义嵌入模型——多头共识感知视觉语义嵌入模型(Multi-Headed Consensus-Aware Visual-Semantic Embedding, MH-CVSE)。该模型在共识感知视觉语义嵌入模型(CVSE)的基础上,引入了多头自注意力机制(multi-head self-attention mechanism),以并行捕捉多个子空间中的信息,从而显著增强了模型对图像与文本之间复杂关系的理解与表示能力。此外,模型采用了参数化特征融合策略,灵活整合不同层次的特征信息,进一步提升了模型的表达能力。在损失函数设计上,MH-CVSE模型通过动态权重调整策略,根据损失值本身动态调整权重,使模型在训练过程中更好地平衡不同损失项的贡献。同时,模型引入了余弦退火学习率策略,帮助模型在训练后期更稳定地收敛。通过在Flickr30k数据集上的广泛实验验证,MH-CVSE模型在双向图像和文本检索任务中均表现出优于现有方法的性能,充分证明了其有效性和优越性。

链接: https://arxiv.org/abs/2412.19184
作者: Wenjing Chen
机构: 未知
关键词: bridge connecting vision, semantic embedding model, image-text matching task, visual semantic embedding, vision and language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid development of multimodal learning, the image-text matching task, as a bridge connecting vision and language, has become increasingly important. Based on existing research, this study proposes an innovative visual semantic embedding model, Multi-Headed Consensus-Aware Visual-Semantic Embedding (MH-CVSE). This model introduces a multi-head self-attention mechanism based on the consensus-aware visual semantic embedding model (CVSE) to capture information in multiple subspaces in parallel, significantly enhancing the model’s ability to understand and represent the complex relationship between images and texts. In addition, we adopt a parameterized feature fusion strategy to flexibly integrate feature information at different levels, further improving the model’s expressive power. In terms of loss function design, the MH-CVSE model adopts a dynamic weight adjustment strategy to dynamically adjust the weight according to the loss value itself, so that the model can better balance the contribution of different loss terms during training. At the same time, we introduce a cosine annealing learning rate strategy to help the model converge more stably in the later stages of training. Extensive experimental verification on the Flickr30k dataset shows that the MH-CVSE model achieves better performance than previous methods in both bidirectional image and text retrieval tasks, fully demonstrating its effectiveness and superiority.
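
摘要提到的"按损失值动态调整权重"与"余弦退火学习率"可以用下面的小草图直观示意;两者的具体公式均为常见做法的假设版本,未必与论文实现一致。

```python
import math

def dynamic_weights(loss_values):
    # 一种常见做法:按各损失项当前数值占比分配权重(假设的实现方式)
    total = sum(loss_values)
    return [v / total for v in loss_values]

def cosine_annealing_lr(step, total_steps, lr_max=1e-4, lr_min=1e-6):
    # 余弦退火:学习率从 lr_max 平滑衰减到 lr_min
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

print(dynamic_weights([2.0, 1.0, 1.0]))          # [0.5, 0.25, 0.25]
print(cosine_annealing_lr(500, 1000))            # 约为 (lr_max + lr_min) / 2
```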
zh

[NLP-24] Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval

【速读】: 该论文旨在解决视频-文本检索(video-text retrieval)领域中现有基准数据集在全面评估模型能力,尤其是时间理解(temporal understanding)方面的不足。现有的大规模图像-文本预训练模型在零样本(zero-shot)性能上已经能够与视频-文本预训练模型相媲美,这表明现有视频-文本基准数据集未能充分体现视频检索的独特挑战。为此,论文提出了RTime数据集,该数据集通过选择具有显著时间性的动作或事件视频,并反转这些视频以生成更具挑战性的负样本(harder negative samples),从而增强对模型时间理解能力的评估。此外,RTime数据集包含21k个视频,每个视频配有10个描述,总计约122小时,并通过GPT-4扩展了基于人工撰写的描述。基于RTime,论文提出了三个检索基准任务:RTime-Origin、RTime-Hard和RTime-Binary,并进一步在模型训练中加强了对难负样本的使用。实验证明,RTime确实为视频-文本检索带来了新的更高挑战。

链接: https://arxiv.org/abs/2412.19178
作者: Yang Du,Yuqi Liu,Qin Jin
机构: 未知
关键词: vision-language understanding field, Cross-modal, video-text, retrieval, RTime
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: ACMMM 2024 poster

点击查看摘要

Abstract:Cross-modal (e.g. image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding field. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that the widely used video-text benchmarks have shortcomings in comprehensively assessing abilities of models, especially in temporal understanding, causing large-scale image-text pre-trained models can already achieve comparable zero-shot performance with video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We then recruit annotators to judge the significance and reversibility of candidate videos, and write captions for qualified videos. We further adopt GPT-4 to extend more captions based on human-written captions. Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours. Based on RTime, we propose three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We further enhance the use of harder-negatives in model training, and benchmark a variety of video-text models on RTime. Extensive experiment analysis proves that RTime indeed poses new and higher challenges to video-text retrieval. We release our RTime dataset (this https URL) to further advance video-text retrieval and multimodal understanding research.
zh

[NLP-25] GFG – Gender-Fair Generation: A CALAMITA Challenge

【速读】: 该论文旨在解决在性别标记明显的语言(如意大利语)中实现性别公平语言(Gender-fair Language)的挑战,以促进性别平等并避免强化性别刻板印象。为此,论文提出了“Gender-Fair Generation”挑战,旨在推动书面交流中的性别公平语言使用。该挑战包括三个关键任务:(1) 检测意大利语句子中的性别化表达,(2) 将性别化表达改写为性别公平的替代形式,(3) 在从英语到意大利语的自动翻译中生成性别公平语言。解决方案的核心在于利用三个标注数据集(GFL-it、GeNTE和Neo-GATE)来评估和监控性别公平语言的识别与生成,并通过特定指标(如BERTScore、性别中性分类器的准确率和覆盖率加权准确率)对任务进行量化评估。

链接: https://arxiv.org/abs/2412.19168
作者: Simona Frenda,Andrea Piergentili,Beatrice Savoldi,Marco Madeddu,Martina Rosola,Silvia Casola,Chiara Ferrando,Viviana Patti,Matteo Negri,Luisa Bentivogli
机构: 未知
关键词: reinforcing gender stereotypes, promoting gender equality, avoid reinforcing gender, Gender-fair language aims, Gender-fair language
类目: Computation and Language (cs.CL)
备注: To refer to this paper please cite the CEUR-ws publication available at this https URL

点击查看摘要

Abstract:Gender-fair language aims at promoting gender equality by using terms and expressions that include all identities and avoid reinforcing gender stereotypes. Implementing gender-fair strategies is particularly challenging in heavily gender-marked languages, such as Italian. To address this, the Gender-Fair Generation challenge intends to help shift toward gender-fair language in written communication. The challenge, designed to assess and monitor the recognition and generation of gender-fair language in both mono- and cross-lingual scenarios, includes three tasks: (1) the detection of gendered expressions in Italian sentences, (2) the reformulation of gendered expressions into gender-fair alternatives, and (3) the generation of gender-fair language in automatic translation from English to Italian. The challenge relies on three different annotated datasets: the GFL-it corpus, which contains Italian texts extracted from administrative documents provided by the University of Brescia; GeNTE, a bilingual test set for gender-neutral rewriting and translation built upon a subset of the Europarl dataset; and Neo-GATE, a bilingual test set designed to assess the use of non-binary neomorphemes in Italian for both fair formulation and translation tasks. Finally, each task is evaluated with specific metrics: average of F1-score obtained by means of BERTScore computed on each entry of the datasets for task 1, an accuracy measured with a gender-neutral classifier, and a coverage-weighted accuracy for tasks 2 and 3.
zh
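关于摘要最后提到的评测指标,下面给出“覆盖率加权准确率”的一个常见定义的 Python 示意实现(将覆盖率与已覆盖样本上的准确率相乘);具体计算方式请以该挑战的官方评测脚本为准,示例数据为虚构。

```python
def coverage_weighted_accuracy(predictions, references):
    """示意性实现:将"覆盖率"与"已覆盖样本上的准确率"相乘。
    predictions 中为 None 的条目视为系统未作答(未覆盖)。
    具体定义请以 GFG 挑战的官方评测脚本为准。"""
    covered = [(p, r) for p, r in zip(predictions, references) if p is not None]
    if not covered:
        return 0.0
    coverage = len(covered) / len(predictions)
    accuracy = sum(p == r for p, r in covered) / len(covered)
    return coverage * accuracy

# 用法示例(虚构数据)
preds = ["le persone", None, "chi studia"]
refs  = ["le persone", "la cittadinanza", "chi studia"]
print(coverage_weighted_accuracy(preds, refs))  # 2/3 覆盖率 × 100% 准确率 ≈ 0.667
```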

[NLP-26] Referencing Where to Focus: Improving Visual Grounding with Referential Query NIPS2024

【速读】: 该论文旨在解决视觉定位(Visual Grounding)任务中现有方法在查询生成和解码过程中存在的两个主要问题:一是传统的查询生成方法通常通过随机初始化或语言嵌入生成可学习的查询,缺乏目标相关的先验信息,增加了模型的学习难度;二是现有方法在查询学习过程中仅使用最深层的图像特征,忽略了其他层次特征的重要性。为解决这些问题,论文提出了一种名为RefFormer的新方法,其核心在于引入了查询适配模块(query adaption module),该模块能够无缝集成到CLIP中,生成具有先验上下文的参考查询,从而为解码器提供目标相关的信息。此外,RefFormer还包含一个任务特定的解码器,通过将参考查询引入解码过程,有效降低了解码器的学习难度,并更准确地聚焦于目标对象。查询适配模块还充当适配器,保留了CLIP中的丰富知识,而无需调整主干网络的参数。实验结果表明,该方法在五个视觉定位基准上均优于现有最先进的方法。

链接: https://arxiv.org/abs/2412.19155
作者: Yabing Wang,Zhuotao Tian,Qingpei Guo,Zheng Qin,Sanping Zhou,Ming Yang,Le Wang
机构: 未知
关键词: natural language expression, Visual Grounding aims, Visual Grounding, DETR-based visual grounding, visual grounding methods
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted by NIPS2024

点击查看摘要

Abstract:Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing stronger multi-modal decoder, which typically generates learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, they only use the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach, called RefFormer. It consists of the query adaption module that can be seamlessly integrated into CLIP and generate the referential query to provide the prior context for decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate the learning difficulty of the decoder, and accurately concentrate on the target object. Additionally, our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks.
zh

[NLP-27] SILC-EFSA: Self-aware In-context Learning Correction for Entity-level Financial Sentiment Analysis COLING2025

【速读】: 该论文旨在解决金融领域细粒度情感分析(fine-grained sentiment analysis)中实体级别数据集稀缺的关键问题。为此,作者构建了迄今为止最大的英文和中文金融实体级别情感分析数据集。在此基础上,提出了一种新颖的两阶段情感分析方法,称为自感知上下文学习校正(Self-aware In-context Learning Correction, SILC)。该方法的第一阶段通过微调基础大语言模型(large language model)生成任务特定的伪标签数据;第二阶段则利用基于图神经网络(Graph Neural Network, GNN)的示例检索器,结合伪标签数据训练校正模型。这一两阶段策略在新构建的数据集上实现了最先进的性能,推动了金融情感分析领域的发展。

链接: https://arxiv.org/abs/2412.19140
作者: Senbin Zhu,Chenyuan He,Hongde Liu,Pengcheng Dong,Hanjie Zhao,Yuchen Yan,Yuxiang Jia,Hongying Zan,Min Peng
机构: 未知
关键词: gained significant attention, fine-grained sentiment analysis, sentiment analysis, entity-level sentiment analysis, Self-aware In-context Learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: This paper is to be published in the Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025)

点击查看摘要

Abstract:In recent years, fine-grained sentiment analysis in finance has gained significant attention, but the scarcity of entity-level datasets remains a key challenge. To address this, we have constructed the largest English and Chinese financial entity-level sentiment analysis datasets to date. Building on this foundation, we propose a novel two-stage sentiment analysis approach called Self-aware In-context Learning Correction (SILC). The first stage involves fine-tuning a base large language model to generate pseudo-labeled data specific to our task. In the second stage, we train a correction model using a GNN-based example retriever, which is informed by the pseudo-labeled data. This two-stage strategy has allowed us to achieve state-of-the-art performance on the newly constructed datasets, advancing the field of financial sentiment analysis. In a case study, we demonstrate the enhanced practical utility of our data and methods in monitoring the cryptocurrency market. Our datasets and code are available at this https URL.
zh

[NLP-28] SketchFill: Sketch-Guided Code Generation for Imputing Derived Missing Values

【速读】: 该论文旨在解决缺失值填补(Missing Value Imputation, MVI)这一数据科学中的关键问题,特别是在处理表格数据时,现有的大语言模型(Large Language Models, LLMs)技术在复杂推理方面存在不足,尤其是在需要跨行跨列的数据关系和数学公式来填补衍生缺失值时。现有方法如上下文学习(in-context learning)和思维链(Chain-of-Thought, CoT)往往无法有效指导LLMs进行此类复杂推理。为解决这一问题,论文提出了一种名为SketchFill的基于草图(sketch-based)的新方法,通过引导LLMs生成准确的数学公式来填补缺失的数值。实验结果表明,SketchFill在准确性上显著优于现有方法,比基于CoT的方法提高了56.2%,比MetaGPT提高了78.8%,为自动化数据清洗和数值缺失值填补领域设定了新的标准。

链接: https://arxiv.org/abs/2412.19113
作者: Yunfan Zhang,Changlun Li,Yuyu Luo,Nan Tang
机构: 未知
关键词: analyses and predictions, critical issue, impacting the reliability, reliability of analyses, MVI
类目: Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Missing value is a critical issue in data science, significantly impacting the reliability of analyses and predictions. Missing value imputation (MVI) is a longstanding problem because it highly relies on domain knowledge. Large language models (LLMs) have emerged as a promising tool for data cleaning, including MVI for tabular data, offering advanced capabilities for understanding and generating content. However, despite their promise, existing LLM techniques such as in-context learning and Chain-of-Thought (CoT) often fall short in guiding LLMs to perform complex reasoning for MVI, particularly when imputing derived missing values, which require mathematical formulas and data relationships across rows and columns. This gap underscores the need for further advancements in LLM methodologies to enhance their reasoning capabilities for more reliable imputation outcomes. To fill this gap, we propose SketchFill, a novel sketch-based method to guide LLMs in generating accurate formulas to impute missing numerical values. Our experimental results demonstrate that SketchFill significantly outperforms state-of-the-art methods, achieving 56.2% higher accuracy than CoT-based methods and 78.8% higher accuracy than MetaGPT. This sets a new standard for automated data cleaning and advances the field of MVI for numerical values.
zh
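下面用 pandas 给出一个示意流程(非 SketchFill 官方实现):假设 LLM 已按论文的草图式提示返回了推导缺失列的公式,这里把该公式硬编码为字符串,再据此只填补缺失位置;列名与公式均为虚构。

```python
import pandas as pd

# 假设 LLM 依据"草图"提示返回了推导缺失列的公式(此处硬编码,
# 实际的提示流程与公式生成以论文方法为准)
llm_formula = "total = price * quantity"

df = pd.DataFrame({
    "price": [10.0, 3.5, 7.0],
    "quantity": [2, 4, 1],
    "total": [20.0, None, None],   # 待填补的派生缺失值
})

# 用公式右侧表达式计算派生列,再仅填充缺失位置
derived = df.eval(llm_formula.split("=", 1)[1])
df["total"] = df["total"].fillna(derived)
print(df)
```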

[NLP-29] “I’ve Heard of You!”: Generate Spoken Named Entity Recognition Data for Unseen Entities ICASSP2025

【速读】: 该论文旨在解决口语命名实体识别(Spoken Named Entity Recognition, Spoken NER)系统中处理未见过的命名实体时性能较差的问题。由于新命名实体不断出现,而标注口语NER数据的成本较高,现有系统在面对新实体时表现不佳。为解决这一挑战,论文提出了一种基于命名实体词典(Named Entity Dictionary, NED)生成口语NER数据的方法,以降低标注成本。具体而言,该方法首先利用大语言模型(Large Language Model, LLM)从采样的命名实体生成句子,然后通过文本转语音(Text-to-Speech, TTS)系统生成语音数据。此外,论文引入了一种噪声度量标准来过滤噪声数据。为评估该方法的有效性,作者发布了一个包含8,853个实体的新口语NER基准数据集及相应的NED。实验结果表明,该方法在域内、零样本域适应和完全零样本设置下均达到了最先进的性能。

链接: https://arxiv.org/abs/2412.19102
作者: Jiawei Yu,Xiang Geng,Yuang Li,Mengxin Ren,Wei Tang,Jiahuan Li,Zhibin Lan,Min Zhang,Hao Yang,Shujian Huang,Jinsong Su
机构: 未知
关键词: Spoken NER, Spoken NER data, Spoken NER systems, existing Spoken NER, NER
类目: Computation and Language (cs.CL)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Spoken named entity recognition (NER) aims to identify named entities from speech, playing an important role in speech processing. New named entities appear every day, however, annotating their Spoken NER data is costly. In this paper, we demonstrate that existing Spoken NER systems perform poorly when dealing with previously unseen named entities. To tackle this challenge, we propose a method for generating Spoken NER data based on a named entity dictionary (NED) to reduce costs. Specifically, we first use a large language model (LLM) to generate sentences from the sampled named entities and then use a text-to-speech (TTS) system to generate the speech. Furthermore, we introduce a noise metric to filter out noisy data. To evaluate our approach, we release a novel Spoken NER benchmark along with a corresponding NED containing 8,853 entities. Experiment results show that our method achieves state-of-the-art (SOTA) performance in the in-domain, zero-shot domain adaptation, and fully zero-shot settings. Our data will be available at this https URL.
zh

[NLP-30] MoPD: Mixture-of-Prompts Distillation for Vision-Language Models

【速读】: 该论文旨在解决现有软提示学习(soft prompt learning)方法在适应视觉-语言模型(VLMs)时,对已见类别(seen classes)过拟合且在未见类别(unseen classes)上表现不佳的问题。这一局限性源于训练数据对已见类别的固有偏差。为解决这一问题,论文提出了一种名为混合提示蒸馏(Mixture-of-Prompts Distillation, MoPD)的新方法,其关键是通过将手工设计的硬提示(hard prompts,即教师提示)中的有用知识有效传递到可学习的软提示(soft prompts,即学生提示)中,从而提升软提示在未见类别上的泛化能力。此外,MoPD方法还引入了一个门控网络(gating network),用于选择用于提示蒸馏的硬提示。实验结果表明,MoPD方法在未见类别上的表现显著优于现有最先进的基线方法。

链接: https://arxiv.org/abs/2412.19087
作者: Yang Chen,Shuai Fu,Yu Zhang
机构: 未知
关键词: adapting vision-language models, Soft prompt learning, vision-language models, downstream tasks, effective for adapting
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Soft prompt learning methods are effective for adapting vision-language models (VLMs) to downstream tasks. Nevertheless, empirical evidence reveals a tendency of existing methods to overfit seen classes and exhibit degraded performance on unseen classes. This limitation is due to the inherent bias in the training data towards the seen classes. To address this issue, we propose a novel soft prompt learning method, named Mixture-of-Prompts Distillation (MoPD), which can effectively transfer useful knowledge from manually hand-crafted hard prompts (a.k.a. teacher prompts) to the learnable soft prompt (a.k.a. student prompt), thereby enhancing the generalization ability of soft prompts on unseen classes. Moreover, the proposed MoPD method utilizes a gating network that learns to select hard prompts used for prompt distillation. Extensive experiments demonstrate that the proposed MoPD method outperforms state-of-the-art baselines, especially on unseen classes.
zh

[NLP-31] Advancing LLM detection in the ALTA 2024 Shared Task: Techniques and Analysis

【速读】: 该论文旨在解决AI生成内容(AI-generated content)的可靠检测问题,特别是在混合文章(hybrid articles)中识别AI生成文本的挑战。研究通过句子级评估(sentence-level evaluation)探索了相关技术,发现ChatGPT-3.5 Turbo在生成文本时表现出独特的、重复的概率模式(repetitive probability patterns),这些模式使得在特定领域内的检测具有一致性。实验表明,轻微的文本修改(如重写)对检测准确性的影响较小。这些发现为推进AI检测方法提供了重要见解,并为解决合成文本识别的复杂性提供了潜在的解决方案。

链接: https://arxiv.org/abs/2412.19076
作者: Dima Galat
机构: 未知
关键词: prompted significant interest, reliable detection methods, developing reliable detection, recent proliferation, content has prompted
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The recent proliferation of AI-generated content has prompted significant interest in developing reliable detection methods. This study explores techniques for identifying AI-generated text through sentence-level evaluation within hybrid articles. Our findings indicate that ChatGPT-3.5 Turbo exhibits distinct, repetitive probability patterns that enable consistent in-domain detection. Empirical tests show that minor textual modifications, such as rewording, have minimal impact on detection accuracy. These results provide valuable insights for advancing AI detection methodologies, offering a pathway toward robust solutions to address the complexities of synthetic text identification.
zh
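顺着“重复的概率模式”这一发现,下面给出一个计算句子平均 token 对数概率的最小草图(以 GPT-2 代替论文讨论的 ChatGPT-3.5 Turbo,仅演示特征提取思路;打分模型与判定阈值的选择均为假设):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_avg_logprob(sentence: str) -> float:
    """返回句子的平均 token 对数概率,可作为简单的句级 AI 文本检测特征。"""
    ids = tok(sentence, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]                 # 每个位置预测下一个 token
    targets = ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

for s in ["The cat sat on the mat.",
          "Quantum flibbertigibbet emanates cheese."]:
    print(s, sentence_avg_logprob(s))
```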

[NLP-32] Cross-Demographic Portability of Deep NLP-Based Depression Models

【速读】: 该论文旨在探讨深度学习模型在行为健康领域的实际应用中,如何在不同人群之间实现良好的泛化能力。具体而言,研究关注基于自然语言处理(NLP)的模型在年龄差异显著的两个语料库之间的可移植性。第一个语料库包含年轻说话者,用于训练预测抑郁的NLP模型,该模型在同一年龄分布的未见说话者上表现良好,AUC值为0.82。随后,该模型在第二个语料库(包含退休社区的老年人)上进行测试,尽管两个语料库在人口统计学上存在显著差异,模型在老年人群中的性能仅略有下降,AUC值为0.76。值得注意的是,在健康状况稳定的老年患者子集中,AUC值达到0.81。研究的关键在于通过实验验证NLP模型在不同年龄群体中的泛化能力,并探讨基于语音的应用在人口统计学上的可移植性。

链接: https://arxiv.org/abs/2412.19070
作者: Tomek Rutowski,Elizabeth Shriberg,Amir Harati,Yang Lu,Ricardo Oliveira,Piotr Chlebek
机构: 未知
关键词: Deep learning models, rapidly gaining interest, Deep learning, Natural Language Processing, rapidly gaining
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deep learning models are rapidly gaining interest for real-world applications in behavioral health. An important gap in current literature is how well such models generalize over different populations. We study Natural Language Processing (NLP) based models to explore portability over two different corpora highly mismatched in age. The first and larger corpus contains younger speakers. It is used to train an NLP model to predict depression. When testing on unseen speakers from the same age distribution, this model performs at AUC=0.82. We then test this model on the second corpus, which comprises seniors from a retirement community. Despite the large demographic differences in the two corpora, we saw only modest degradation in performance for the senior-corpus data, achieving AUC=0.76. Interestingly, in the senior population, we find AUC=0.81 for the subset of patients whose health state is consistent over time. Implications for demographic portability of speech-based applications are discussed.
zh

[NLP-33] Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID

【速读】: 该论文旨在解决多语言文本到语音(Text-to-Speech, TTS)系统中处理印度尼西亚语和英语之间的语码转换(code-switching)问题。语码转换在印度尼西亚尤为常见,但现有的多语言TTS系统尚未能有效处理这种语言混合现象。论文提出的解决方案关键在于在文本到音素转换过程中引入了一个语言识别组件,该组件使用微调的BERT模型进行逐词语言识别,并移除了基础模型中的语言嵌入(language embedding)。实验结果表明,该语码转换模型在自然度和语音清晰度方面均优于印度尼西亚语和英语的基线STEN-TTS模型。

链接: https://arxiv.org/abs/2412.19043
作者: Ahmad Alfani Handoyo,Chung Tran,Dessi Puji Lestari,Sakriani Sakti
机构: 未知
关键词: systems convert text, Indonesian and English, convert text, multilingual TTS system, systems convert
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at O-COCOSDA 2024

点击查看摘要

Abstract:Multilingual text-to-speech systems convert text into speech across multiple languages. In many cases, text sentences may contain segments in different languages, a phenomenon known as code-switching. This is particularly common in Indonesia, especially between Indonesian and English. Despite its significance, no research has yet developed a multilingual TTS system capable of handling code-switching between these two languages. This study addresses Indonesian-English code-switching in STEN-TTS. Key modifications include adding a language identification component to the text-to-phoneme conversion using finetuned BERT for per-word language identification, as well as removing language embedding from the base model. Experimental results demonstrate that the code-switching model achieves superior naturalness and improved speech intelligibility compared to the Indonesian and English baseline STEN-TTS models.
zh
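逐词语言识别部分可以用 Hugging Face 的 token 分类接口来示意。下面的草图里,mBERT 的分类头未经微调,标签 ID/EN 也是假设,仅用于说明“逐词取第一个子词预测”的流程;实际系统需按论文在印尼语-英语混用语料上微调。

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# 示意:用多语 BERT 做逐词语言识别(ID=印尼语, EN=英语)。
# 分类头未经微调,输出仅演示流程,不代表真实识别效果。
labels = ["ID", "EN"]
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

words = "Saya suka machine learning karena menarik".split()
enc = tok(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0]

word_ids = enc.word_ids()
for i, w in enumerate(words):
    first_sub = word_ids.index(i)       # 取该词第一个子词的预测作为整词标签
    print(w, "->", labels[pred[first_sub].item()])
# 随后可按词级语言标签选择对应语言的音素转换器,再送入 STEN-TTS
```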

[NLP-34] Let the Rule Speak: Enhancing In-context Learning Debiasing with Interpretability

【速读】: 该论文旨在解决大语言模型(LLMs)在多类文本分类任务中,通过上下文学习(In-context Learning, ICL)进行预测时出现的类别间准确率不平衡问题。具体而言,某些类别在ICL输出中始终获得较高的概率,导致其被更频繁地选择,从而表现出较高的准确率,而其他类别则因获得较低或混合范围的概率而准确率较低。这种不平衡不仅影响整体预测性能,还带来了核心的可解释性挑战,即为何某些类别需要修正,以及如何针对每个样本、每个类别的概率进行定制化修正。为解决这一问题,论文提出了FuRud(Fuzzy Rule Optimization based Debiasing method),其关键是通过模糊规则优化方法,检测哪些类别需要修正,并针对每个需要修正的类别,识别其概率范围,应用非对称的放大或缩小操作进行可解释的修正。实验结果表明,FuRud在七个基准数据集上显著减少了类别间准确率偏差(COBias),同时提升了整体准确率,且仅需少量优化样本即可优化下游任务。

链接: https://arxiv.org/abs/2412.19018
作者: Ruixi Lin,Yang You
机构: 未知
关键词: multi-class text classification, In-context learning, large language models, perform diverse tasks, imbalanced per-class prediction
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In-context learning, which allows large language models to perform diverse tasks with a few demonstrations, is found to have imbalanced per-class prediction accuracy on multi-class text classification. Although notable output correction methods have been developed to tackle the issue and simultaneously improve downstream prediction accuracy, they may fail to answer the core interpretability challenges: why and which certain classes need corrections, and more importantly, a tailored correction for per-sample, per-class’s probability. To address such interpretability gaps, we first find that the imbalance arises from certain classes consistently receiving high ICL output probabilities, whereas others receiving lower or mixed ranges, so the former is more frequently chosen, resulting in higher accuracy; more crucially, we find that these ranges have significantly varying degrees of influence on the accuracy bias, highlighting the need for precise, interpretable probability corrections by range. Motivated by this, we propose FuRud, a Fuzzy Rule Optimization based Debiasing method, that (1) detects which classes need corrections, and (2) for each correction-needed class, detects its probability ranges and applies asymmetric amplifications or reductions to correct them interpretably. Notably, across seven benchmark datasets, FuRud reduces the pairwise class accuracy bias (COBias) by more than half (56%), while achieving a relative increase of 21% in accuracy, outperforming state-of-the-art debiasing methods. Moreover, FuRud can optimize downstream tasks with as few as 10 optimization examples. Furthermore, FuRud can work for prompt formats that lead to highly skewed predictions. For example, FuRud greatly improves ICL outputs which use letter options, with 44% relative accuracy increase and 54% relative COBias reduction.
zh
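下面是“按概率区间做非对称放大/压缩”这一校正思想的 NumPy 草图(并非 FuRud 的模糊规则优化实现):区间阈值与各类别的校正系数均为手工假设,论文中它们由模糊规则优化自动学到。

```python
import numpy as np

def correct_probs(probs, amplify, reduce, low=0.2, high=0.6):
    """示意性的按区间非对称校正:低概率区间放大、高概率区间压缩,再归一化。
    amplify/reduce 为各类别的校正系数(此处为手工设定的假设值)。"""
    p = np.asarray(probs, dtype=float).copy()
    for c in range(len(p)):
        if p[c] < low:
            p[c] *= amplify[c]      # 放大经常被低估的类别
        elif p[c] > high:
            p[c] *= reduce[c]       # 压缩经常被高估的类别
    return p / p.sum()

icl_probs = [0.70, 0.15, 0.15]      # ICL 输出的类别概率(虚构)
amplify   = [1.0, 1.8, 1.3]         # 假设:类别 1、2 常被低估
reduce    = [0.6, 1.0, 1.0]         # 假设:类别 0 常被高估
print(correct_probs(icl_probs, amplify, reduce))
```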

[NLP-35] MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models AAAI-25

【速读】: 该论文旨在解决医学大语言模型(Medical Large Language Models, MLLMs)在医疗应用中存在的“幻觉”问题,即模型生成医学上不可信或不准确信息的倾向,这对患者护理构成了重大风险。论文提出的解决方案是引入MedHallBench,一个全面的基准框架,用于评估和减轻MLLMs中的幻觉现象。该框架的关键在于整合了专家验证的医学案例场景与已建立的医学数据库,以创建一个稳健的评估数据集。此外,框架采用了自动化的ACHMI(Automatic Caption Hallucination Measurement in Medical Imaging)评分系统,结合严格的临床专家评估,并利用强化学习方法实现自动标注。通过专门为医疗应用优化的基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)训练管道,MedHallBench能够在多样化的临床背景下对MLLMs进行全面评估,同时保持严格的准确性标准。

链接: https://arxiv.org/abs/2412.18947
作者: Kaiwen Zuo,Yirui Jiang
机构: 未知
关键词: generating medically implausible, Large Language Models, presents substantial risks, generating medically, inaccurate information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published to AAAI-25 Bridge Program

点击查看摘要

Abstract:Medical Large Language Models (MLLMs) have demonstrated potential in healthcare applications, yet their propensity for hallucinations – generating medically implausible or inaccurate information – presents substantial risks to patient care. This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs. Our methodology integrates expert-validated medical case scenarios with established medical databases to create a robust evaluation dataset. The framework employs a sophisticated measurement system that combines automated ACHMI (Automatic Caption Hallucination Measurement in Medical Imaging) scoring with rigorous clinical expert evaluations and utilizes reinforcement learning methods to achieve automatic annotation. Through an optimized reinforcement learning from human feedback (RLHF) training pipeline specifically designed for medical applications, MedHallBench enables thorough evaluation of MLLMs across diverse clinical contexts while maintaining stringent accuracy standards. We conducted comparative experiments involving various models, utilizing the benchmark to establish a baseline for widely adopted large language models (LLMs). Our findings indicate that ACHMI provides a more nuanced understanding of the effects of hallucinations compared to traditional metrics, thereby highlighting its advantages in hallucination assessment. This research establishes a foundational framework for enhancing MLLMs’ reliability in healthcare settings and presents actionable strategies for addressing the critical challenge of AI hallucinations in medical applications.
zh

[NLP-36] Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference

【速读】: 该论文旨在解决大型语言模型(LLMs)在消费级设备上部署时面临的高资源需求问题。由于消费级设备通常配备较弱的GPU和较强的CPU,而现有方法主要依赖GPU进行计算,导致硬件资源利用不均衡。为此,论文提出了Dovetail方法,其关键解决方案是将草稿模型(draft model)部署在GPU上生成草稿标记(draft tokens),同时允许目标模型(target model)在CPU上并行执行验证,从而充分利用所有可用硬件资源并减少设备间通信带宽占用。此外,论文还通过减少草稿标记数量、增加草稿模型深度以及引入动态门控融合(DGF, Dynamic Gating Fusion)等技术优化了草稿模型,以更好地适应异构硬件特性。实验结果表明,Dovetail在HumanEval基准测试中显著提升了推理速度。

链接: https://arxiv.org/abs/2412.18934
作者: Libo Zhang,Zhaoning Zhang,Baizhou Xu,Songzhu Mei,Dongsheng Li(1) ((1) National University of Defense Technology, Changsha, China)
机构: 未知
关键词: Large Language Models, Large Language, achieving widespread deployment, presents significant challenges, devices presents significant
类目: Computation and Language (cs.CL)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Due to the high resource demands of Large Language Models (LLMs), achieving widespread deployment on consumer-grade devices presents significant challenges. Typically, personal or consumer-grade devices, including servers configured prior to the era of large-scale models, generally have relatively weak GPUs and relatively strong CPUs. However, most current methods primarily depend on GPUs for computation. Therefore, we propose Dovetail, an approach that deploys the draft model on the GPU to generate draft tokens while allowing the target model to perform parallel verification on the CPU, thereby improving the utilization of all available hardware resources and occupying less inter-device communication bandwidth. Accordingly, we have redesigned the draft model to better align with heterogeneous hardware characteristics. To this end, we implemented several optimizations: reducing the number of draft tokens to mitigate latency in parallel verification, increasing the depth of the draft model to enhance its predictive capacity, and introducing DGF (Dynamic Gating Fusion) to improve the integration of features and token embeddings. In the HumanEval benchmark, Dovetail achieved an inference speed of 5.86 tokens per second for LLaMA2-Chat-7B using 3GB of VRAM, representing an approximately 2.77x improvement over CPU-only inference. Furthermore, the inference speed was increased to 8 tokens per second when utilizing 7GB of VRAM.
zh
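下面给出一个极简的“草稿-验证”贪心推测解码草图,把小草稿模型放在 GPU(若可用)、大目标模型放在 CPU,以示意 Dovetail 的异构分工;其中省略了论文中的草稿模型重设计与 DGF 等细节,模型选择(distilgpt2/gpt2)也只是演示用的假设。

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 草稿模型放 GPU(若可用),目标模型放 CPU;两者须共享同一词表
draft_device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").to(draft_device).eval()
target = AutoModelForCausalLM.from_pretrained("gpt2").to("cpu").eval()

@torch.no_grad()
def speculative_generate(prompt, max_new_tokens=32, k=4):
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1) 草稿模型贪心提议 k 个 token
        draft_ids = draft.generate(ids.to(draft_device), do_sample=False,
                                   max_new_tokens=k,
                                   pad_token_id=tok.eos_token_id).cpu()
        # 2) 目标模型在 CPU 上一次前向,并行验证这 k 个草稿 token
        greedy = target(draft_ids).logits.argmax(-1)
        n_ctx, accepted = ids.shape[1], 0
        for i in range(k):
            # 目标模型在位置 n_ctx+i-1 处的贪心预测应等于第 i 个草稿 token
            if greedy[0, n_ctx + i - 1] != draft_ids[0, n_ctx + i]:
                break
            accepted += 1
        # 3) 接受匹配的前缀,并由目标模型补上一个修正 token
        next_tok = greedy[0, n_ctx + accepted - 1].view(1, 1)
        ids = torch.cat([draft_ids[:, :n_ctx + accepted], next_tok], dim=1)
    return tok.decode(ids[0])

print(speculative_generate("Speculative decoding works by"))
```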

[NLP-37] HuatuoGPT-o1 Towards Medical Complex Reasoning with LLMs

【速读】: 该论文旨在解决大语言模型(LLM)在医学领域中的推理能力不足的问题。尽管现有研究在数学任务上的推理能力提升取得了显著进展,但医学领域的推理能力仍未被充分探索。医学领域与数学不同,其推理过程需要更高的可靠性,且验证医学推理的正确性更具挑战性。为此,论文提出了一种可验证的医学问题框架,并引入医学验证器(medical verifier)来检查模型输出的正确性。解决方案的关键在于采用两阶段方法:首先,利用验证器指导复杂推理轨迹的搜索,以微调LLM;其次,通过基于验证器的奖励机制应用强化学习(RL),进一步提升复杂推理能力。最终,论文介绍了HuatuoGPT-o1,一个在仅使用40K可验证问题的情况下,能够进行复杂推理的医学LLM,其在实验中表现优于通用和医学专用基线模型。实验结果表明,复杂推理显著提升了医学问题解决能力,并且RL的应用进一步增强了这一效果。该研究为医学及其他专业领域的推理能力提升提供了新的思路。

链接: https://arxiv.org/abs/2412.18925
作者: Junying Chen,Zhenyang Cai,Ke Ji,Xidong Wang,Wanlong Liu,Rongsheng Wang,Jianye Hou,Benyou Wang
机构: 未知
关键词: reasoning, breakthrough of OpenAI, highlights the potential, potential of enhancing, medical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike those in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.
zh

[NLP-38] AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures

【速读】: 该论文旨在解决静态草稿结构在推测解码(Speculative Decoding, SD)框架中解码速度受限的问题。当前研究在自适应草稿结构(adaptive draft structures)的性能、建模方法和适用性方面存在局限。论文提出的解决方案是引入AdaEAGLE框架,该框架首次明确建模了自适应草稿结构。其关键在于采用了轻量级草稿长度预测器(Lightweight Draft Length Predictor, LDLP)模块,该模块在推理过程中显式预测最优的草稿令牌数量,从而指导草稿模型的生成。AdaEAGLE在不依赖手动阈值的情况下实现了可比的加速效果,并允许进行更深层次的专业优化。结合基于阈值的策略,AdaEAGLE在保持输出质量的同时,相比传统的自回归解码(AR decoding)实现了1.62倍的加速,并超越了固定长度的现有技术(SotA)基线。

链接: https://arxiv.org/abs/2412.18910
作者: Situo Zhang,Hankun Wang,Da Ma,Zichen Zhu,Lu Chen,Kunyao Lan,Kai Yu
机构: 未知
关键词: Large Language Models, Large Language, popular lossless technique, Speculative Decoding, adaptive draft structures
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speculative Decoding (SD) is a popular lossless technique for accelerating the inference of Large Language Models (LLMs). We show that the decoding speed of SD frameworks with static draft structures can be significantly improved by incorporating context-aware adaptive draft structures. However, current studies on adaptive draft structures are limited by their performance, modeling approaches, and applicability. In this paper, we introduce AdaEAGLE, the first SD framework that explicitly models adaptive draft structures. AdaEAGLE leverages the Lightweight Draft Length Predictor (LDLP) module to explicitly predict the optimal number of draft tokens during inference to guide the draft model. It achieves comparable speedup results without manual thresholds and allows for deeper, more specialized optimizations. Moreover, together with threshold-based strategies, AdaEAGLE achieves a 1.62× speedup over the vanilla AR decoding and outperforms the fixed-length SotA baseline while maintaining output quality.
zh

[NLP-39] Research Experiment on Multi-Model Comparison for Chinese Text Classification Tasks

【速读】: 该论文旨在解决中文文本分类任务中的模型性能评估与适用性问题。随着中文文本数据的爆炸式增长和自然语言处理(Natural Language Processing, NLP)技术的进步,中文文本分类在信息检索和情感分析等领域中成为关键技术。论文通过对比三种深度学习模型——TextCNN、TextRNN和BERT——在THUCNews数据集上的表现,评估了这些模型在不同场景中的适用性。解决方案的关键在于通过实验验证这些模型在中文文本分类任务中的性能,并探讨其在不同应用场景中的优势和局限性。

链接: https://arxiv.org/abs/2412.18908
作者: JiaCheng Li
机构: 未知
关键词: attracting increasing attention, Chinese text classification, language processing technologies, Chinese text data, natural language processing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the explosive growth of Chinese text data and advancements in natural language processing technologies, Chinese text classification has become one of the key techniques in fields such as information retrieval and sentiment analysis, attracting increasing attention. This paper conducts a comparative study on three deep learning models: TextCNN, TextRNN, and BERT for Chinese text classification tasks. By conducting experiments on the THUCNews dataset, the performance of these models is evaluated, and their applicability in different scenarios is discussed.
zh
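作为参考,下面给出 TextCNN 的一个极简 PyTorch 结构草图(多尺寸卷积 + 最大池化 + 全连接),超参数为常见默认值,并非论文实验的精确配置;TextRNN 与 BERT 可分别用 LSTM 与 Hugging Face 预训练模型替换主干。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """极简 TextCNN:多尺寸一维卷积 + 最大池化 + 全连接分类。"""
    def __init__(self, vocab_size, num_classes, emb_dim=128,
                 kernel_sizes=(2, 3, 4), num_filters=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                      # x: (batch, seq_len)
        e = self.emb(x).transpose(1, 2)        # (batch, emb_dim, seq_len)
        feats = [F.relu(conv(e)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))

model = TextCNN(vocab_size=5000, num_classes=10)
dummy = torch.randint(0, 5000, (8, 50))        # 8 条长度为 50 的句子
print(model(dummy).shape)                      # torch.Size([8, 10])
```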

[NLP-40] Overview of MWE history, challenges and horizons: standing at the 20th anniversary of the MWE workshop series via MWE-UD2024

【速读】: 该论文旨在回顾多词表达式(Multiword Expression, MWE)研究领域在过去近二十年中的发展历程,并总结研究人员在该领域所探讨的研究主题和方法论。通过梳理MWE工作坊系列的历史,论文进一步讨论了当前面临的挑战以及MWE研究在计算语言学(Computational Linguistics, CL)和自然语言处理(Natural Language Processing, NLP)领域中的广泛影响和协同效应。最后,论文提出了未来的研究方向,旨在为对MWE感兴趣的研究人员、学生和工业从业者提供一个简明且易于理解的历史、现状及未来展望。解决方案的关键在于通过对MWE研究历史的系统性回顾,识别当前研究中的瓶颈,并基于此提出未来可能的研究路径,以推动该领域的进一步发展。

链接: https://arxiv.org/abs/2412.18868
作者: Lifeng Han,Kilian Evang,Archna Bhatia,Gosse Bouma,A. Seza Doğruöz,Marcos Garcia,Voula Giouli,Joakim Nivre,Alexandre Rademacher
机构: 未知
关键词: ACL in Sapporo, MWE workshop events, held with ACL, MWE workshop, conference marked
类目: Computation and Language (cs.CL)
备注: ongoing work, position paper, 6 pages

点击查看摘要

Abstract:Starting in 2003 when the first MWE workshop was held with ACL in Sapporo, Japan, this year, the joint workshop of MWE-UD co-located with the LREC-COLING 2024 conference marked the 20th anniversary of MWE workshop events over the past nearly two decades. Standing at this milestone, we look back to this workshop series and summarise the research topics and methodologies researchers have carried out over the years. We also discuss the current challenges that we are facing and the broader impacts/synergies of MWE research within the CL and NLP fields. Finally, we give future research perspectives. We hope this position paper can help researchers, students, and industrial practitioners interested in MWE get a brief but easy understanding of its history, current, and possible future.
zh

[NLP-41] Whose Morality Do They Speak? Unraveling Cultural Bias in Multilingual Language Models

【速读】: 该论文旨在探讨多语言大语言模型(LLMs)在不同文化和语言背景下的道德推理能力,特别是这些模型是否反映了特定文化的道德价值观,还是强加了以英语为主导的道德规范。研究通过使用更新版的道德基础问卷(MFQ-2)在八种语言(阿拉伯语、波斯语、英语、西班牙语、日语、中文、法语和俄语)中进行分析,评估了模型对六个核心道德基础(关怀、平等、比例、忠诚、权威和纯洁)的遵循情况。研究结果表明,LLMs在文化和语言上存在显著差异,挑战了其道德一致性的普遍假设。尽管部分模型展示了适应多样化语境的能力,但其他模型则表现出受训练数据构成影响的偏见。这些发现强调了在开发多语言AI系统时,需要注重文化包容性,以提高公平性和信任度。

链接: https://arxiv.org/abs/2412.18863
作者: Meltem Aksoy
机构: 未知
关键词: moral reasoning capabilities, Large language models, contexts remain underexplored, Large language, remain underexplored
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have become integral tools in diverse domains, yet their moral reasoning capabilities across cultural and linguistic contexts remain underexplored. This study investigates whether multilingual LLMs, such as GPT-3.5-Turbo, GPT-4o-mini, Llama 3.1, and MistralNeMo, reflect culturally specific moral values or impose dominant moral norms, particularly those rooted in English. Using the updated Moral Foundations Questionnaire (MFQ-2) in eight languages, Arabic, Farsi, English, Spanish, Japanese, Chinese, French, and Russian, the study analyzes the models’ adherence to six core moral foundations: care, equality, proportionality, loyalty, authority, and purity. The results reveal significant cultural and linguistic variability, challenging the assumption of universal moral consistency in LLMs. Although some models demonstrate adaptability to diverse contexts, others exhibit biases influenced by the composition of the training data. These findings underscore the need for culturally inclusive model development to improve fairness and trust in multilingual AI systems.
zh

[NLP-42] Bootstrap Your Own Context Length

【速读】: 该论文旨在解决长上下文语言模型(long-context language models)训练过程中对大量手动标注数据的依赖问题。传统方法通常需要耗费大量资源进行数据收集和标注,而本文提出了一种基于短上下文语言模型(short-context language models)的自举(bootstrapping)方法,通过合成多样化的长上下文指令调优数据(instruction tuning data),从而避免手动数据处理的繁琐过程。解决方案的关键在于利用一个简单的代理工作流(agent workflow),结合短上下文语言模型、文本检索器(text retriever)和文档集合(document collection),自动生成长上下文训练数据。随后,通过微调(fine-tuning)语言模型,将其短上下文能力有效迁移到长上下文场景中。实验结果表明,该方法能够成功将上下文长度扩展至100万 tokens(1M tokens),并在多个基准测试中表现出色。

链接: https://arxiv.org/abs/2412.18860
作者: Liang Wang,Nan Yang,Xingxing Zhang,Xiaolong Huang,Furu Wei
机构: 未知
关键词: train long-context language, approach to train, language models, train long-context, language
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 18 pages

点击查看摘要

Abstract:We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only. Our method utilizes a simple agent workflow to synthesize diverse long-context instruction tuning data, thereby eliminating the necessity for manual data collection and annotation. The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection, all of which are readily accessible within the open-source ecosystem. Subsequently, language models are fine-tuned using the synthesized data to extend their context lengths. In this manner, we effectively transfer the short-context capabilities of language models to long-context scenarios through a bootstrapping process. We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens, achieving superior performance across various benchmarks.
zh

[NLP-43] RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting

【速读】: 该论文旨在解决多模态大语言模型(MLLMs)在处理视觉-语言推理任务时更容易生成有害内容的问题。现有防御性提示技术依赖于静态、统一的安全指南,无法有效应对不同多模态场景中的特定风险。为解决这一局限性,论文提出了RapGuard框架,该框架通过多模态链式思维推理(multimodal chain-of-thought reasoning)动态生成针对特定场景的安全提示。RapGuard的关键在于其能够根据每个输入的唯一风险自适应地调整提示,从而在保持良性任务高性能的同时,有效减少有害内容的生成。实验结果表明,RapGuard在多个MLLM基准测试中实现了最先进的安全性能,显著降低了有害内容,且未降低响应质量。

链接: https://arxiv.org/abs/2412.18826
作者: Yilei Jiang,Yingshui Tan,Xiangyu Yue
机构: 未知
关键词: Large Language Models, Multimodal Large Language, Large Language, made remarkable progress, Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have made remarkable progress in vision-language reasoning, they are also more susceptible to producing harmful content compared to models that focus solely on text. Existing defensive prompting techniques rely on a static, unified safety guideline that fails to account for the specific risks inherent in different multimodal contexts. To address these limitations, we propose RapGuard, a novel framework that uses multimodal chain-of-thought reasoning to dynamically generate scenario-specific safety prompts. RapGuard enhances safety by adapting its prompts to the unique risks of each input, effectively mitigating harmful outputs while maintaining high performance on benign tasks. Our experimental results across multiple MLLM benchmarks demonstrate that RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
zh

[NLP-44] DCIS: Efficient Length Extrapolation of LLM s via Divide-and-Conquer Scaling Factor Search

【速读】: 该论文旨在解决基于Transformer架构的大语言模型(LLMs)在扩展上下文长度时面临的挑战,特别是由于训练成本高导致的上下文长度受限问题。传统的扩展方法通过调整旋转位置编码(RoPE)的缩放因子并进行微调,但次优的缩放因子初始化会导致微调成本增加且在目标长度下性能下降。为解决这一问题,论文提出了一种创新的基于RoPE的微调框架,摒弃了传统的缩放因子搜索方法,引入了分治增量搜索(DCIS)算法。该算法通过策略性地确定更优的缩放因子,有效扩展了LLMs的上下文窗口。实验结果表明,该方法不仅缓解了在扩展目标长度下的性能衰减,还允许模型在短上下文上进行微调并泛化到长上下文,从而降低了微调成本。此外,通过DCIS获得的缩放因子甚至可以在无需微调的情况下有效工作。进一步分析表明,DCIS的搜索效率是其他方法的两倍,并探讨了非严格递增缩放因子的影响及其在不同上下文长度下的泛化能力。

链接: https://arxiv.org/abs/2412.18811
作者: Lei Yang,Shaoyang Xu,Deyi Xiong
机构: 未知
关键词: Large language models, Large language, Transformer architecture, high training cost, scaling factors
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) based on the Transformer architecture usually have their context length limited due to the high training cost. Recent advancements extend the context window by adjusting the scaling factors of RoPE and fine-tuning. However, suboptimal initialization of these factors results in increased fine-tuning costs and reduced performance at target length. To address these challenges, we propose an innovative RoPE-based fine-tuning framework that diverges from conventional scaling factors search. Specifically, we present a Divide-and-Conquer Incremental Search (DCIS) algorithm that strategically determines the better scaling factors. Further fine-tuning with the identified scaling factors effectively extends the context window of LLMs. Empirical results demonstrate that our methodology not only mitigates performance decay at extended target lengths but also allows the model to fine-tune on short contexts and generalize to long contexts, thereby reducing the cost of fine-tuning. The scaling factors obtained through DCIS can even perform effectively without fine-tuning. Further analysis of the search space reveals that DCIS achieves twice the search efficiency compared to other methods. We also examine the impact of the non-strictly increasing scaling factors utilized in DCIS and evaluate the general capabilities of LLMs across various context lengths.
zh

[NLP-45] Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation ICASSP2025

【速读】: 该论文旨在解决开放域问答(Open-domain QA)中缺乏明确标签来有效结合检索到的段落和通过大语言模型(LLMs)生成的段落的问题。为了解决这一问题,作者提出了一种无监督的简单框架,称为“双向重排序以合并生成与检索知识”(BRMGR)。该框架的关键在于分别对检索到的段落和LLM生成的段落进行重排序,并通过贪心匹配(greedy matching)将两者结合。BRMGR在分配每个检索到的段落与对应的LLM生成段落时,等效于使用二分图匹配损失(bipartite matching loss)。实验结果表明,该框架在NQ和WebQ数据集上分别提升了1.7和1.6的性能,并在TriviaQA数据集上取得了与竞争基线相当的结果。

链接: https://arxiv.org/abs/2412.18800
作者: Xinkai Du,Quanjie Han,Chao Lv,Yan Liu,Yalin Sun,Hao Shu,Hongbo Shan,Maosong Sun
机构: 未知
关键词: Open-domain Question Answering, Large Language Models, Open-domain Question, Question Answering, Large Language
类目: Computation and Language (cs.CL)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Open-domain Question Answering (QA) has garnered substantial interest by combining the advantages of faithfully retrieved passages and relevant passages generated through Large Language Models (LLMs). However, there is a lack of definitive labels available to pair these sources of knowledge. In order to address this issue, we propose an unsupervised and simple framework called Bi-Reranking for Merging Generated and Retrieved Knowledge (BRMGR), which utilizes re-ranking methods for both retrieved passages and LLM-generated passages. We pair the two types of passages using two separate re-ranking methods and then combine them through greedy matching. We demonstrate that BRMGR is equivalent to employing a bipartite matching loss when assigning each retrieved passage with a corresponding LLM-generated passage. The application of our model yielded experimental results from three datasets, improving performance by +1.7 and +1.6 on the NQ and WebQ datasets, respectively, and obtaining a comparable result on the TriviaQA dataset when compared to competitive baselines.
zh
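其中“贪心匹配”一步可以如下示意(NumPy 实现):分数矩阵假定已由两个重排序器给出,每轮取当前最高分的检索-生成段落配对并移除所在行列;数值为虚构。

```python
import numpy as np

def greedy_match(score_matrix):
    """score_matrix[i][j]:第 i 个检索段落与第 j 个 LLM 生成段落
    经重排序得到的相关性分数(此处为虚构数值)。
    每轮取当前最高分的 (i, j) 配对,并移除其所在的行与列。"""
    s = np.array(score_matrix, dtype=float)
    pairs = []
    while np.isfinite(s).any():
        i, j = np.unravel_index(np.argmax(s), s.shape)
        pairs.append((int(i), int(j), float(s[i, j])))
        s[i, :] = -np.inf
        s[:, j] = -np.inf
    return pairs

scores = [[0.9, 0.2, 0.4],
          [0.3, 0.8, 0.1],
          [0.5, 0.6, 0.7]]
print(greedy_match(scores))   # [(0, 0, 0.9), (1, 1, 0.8), (2, 2, 0.7)]
```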

[NLP-46] Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction

【速读】: 该论文旨在解决自动视频配音(Automatic Video Dubbing, AVD)中多尺度韵律表达(multiscale prosody expression)和上下文交互(context interaction)对当前句子韵律表达影响的问题。具体而言,现有研究在增强韵律表达时忽略了两个关键问题:1)上下文中的多尺度韵律表达属性对当前句子韵律的影响;2)上下文中的韵律线索与当前句子的交互作用对最终韵律表达的影响。为解决这些问题,论文提出了M2CI-Dubber方案,该方案通过两个共享的M2CI编码器(Multiscale Multimodal Context Interaction encoders)来建模多尺度多模态上下文,并促进其与当前句子的深度交互。关键步骤包括提取上下文中每种模态的全局和局部特征,利用基于注意力机制(attention-based mechanisms)进行特征聚合和交互,并采用基于交互的图注意力网络(interaction-based graph attention network)进行特征融合,从而增强当前句子的韵律表达。实验结果表明,该模型在Chem数据集上的配音表达效果优于基线模型。

链接: https://arxiv.org/abs/2412.18748
作者: Yuan Zhao,Rui Liu,Gaoxiang Cong
机构: 未知
关键词: Automatic Video Dubbing, Automatic Video, Multiscale Multimodal Context, generates speech aligned, Video Dubbing
类目: Multimedia (cs.MM); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence’s prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at this https URL.
zh

[NLP-47] Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis ICASSP2025

【速读】: 该论文旨在解决对话语音合成(Conversational Speech Synthesis, CSS)中的关键挑战,即如何有效建模多模态对话历史(Multimodal Dialogue History, MDH)与目标话语之间的交互关系。具体而言,MDH中的文本和语音模态各自具有独特的影响,并且它们相互补充以对目标话语产生综合影响。然而,先前的研究并未明确建模这种模态内(intra-modal)和模态间(inter-modal)的交互。为解决这一问题,论文提出了一种基于模态内和模态间上下文交互的新CSS系统,称为III-CSS。其关键解决方案在于:在训练阶段,通过将MDH与目标话语的文本和语音模态结合,形成四种模态组合,并设计基于对比学习的模态内和模态间交互模块,以深入挖掘模态内和模态间的上下文交互;在推理阶段,利用训练好的交互模块,充分推断目标话语文本内容的语音韵律。实验结果表明,III-CSS在韵律表现力方面优于现有先进基线模型。

链接: https://arxiv.org/abs/2412.18733
作者: Zhenqi Jia,Rui Liu
机构: 未知
关键词: Conversational Speech Synthesis, multimodal dialogue history, target utterance, Speech Synthesis, Conversational Speech
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Conversational Speech Synthesis (CSS) aims to effectively take the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. Note that text and speech modalities in MDH have their own unique influences, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities in the target utterance to obtain four modal combinations, including Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. Then, we design two contrastive learning-based intra-modal and two inter-modal interaction modules to deeply learn the intra-modal and inter-modal context interaction. In the inference phase, we take MDH and adopt trained interaction modules to fully infer the speech prosody of the target utterance’s text content. Subjective and objective experiments on the DailyTalk dataset show that III-CSS outperforms the advanced baselines in terms of prosody expressiveness. Code and speech samples are available at this https URL.
zh

[NLP-48] Optimizing Large Language Models with an Enhanced LoRA Fine-Tuning Algorithm for Efficiency and Robustness in NLP Tasks

【速读】: 该论文试图解决在自然语言处理任务中,大语言模型(Large Language Model)在保持预训练模型强大能力的同时,如何提高其准确性和计算效率的问题。解决方案的关键在于提出了一种基于改进的低秩适应(LoRA, Low-Rank Adaptation)微调算法。该算法通过低秩适应策略对大语言模型进行微调,显著减少了计算资源的消耗,同时在QQP任务中的实验结果表明,改进的LoRA算法在准确性、F1分数和MCC(Matthews Correlation Coefficient)等指标上均优于传统模型如BERT、Roberta、T5和GPT-4。特别是在F1分数和MCC方面,该模型表现出更强的鲁棒性和判别能力,证明了改进的LoRA算法在微调大规模预训练模型中的潜力。此外,论文还探讨了改进的LoRA算法在其他自然语言处理任务中的应用前景,强调了其在多任务学习和计算资源有限场景中的优势。

链接: https://arxiv.org/abs/2412.18729
作者: Jiacheng Hu,Xiaoxuan Liao,Jia Gao,Zhen Qi,Hongye Zheng,Chihang Wang
机构: 未知
关键词: optimization method based, large language model, improved LoRA algorithm, model optimization method, language model optimization
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study proposes a large language model optimization method based on the improved LoRA fine-tuning algorithm, aiming to improve the accuracy and computational efficiency of the model in natural language processing tasks. We fine-tune the large language model through a low-rank adaptation strategy, which significantly reduces the consumption of computing resources while maintaining the powerful capabilities of the pre-trained model. The experiment uses the QQP task as the evaluation scenario. The results show that the improved LoRA algorithm shows significant improvements in accuracy, F1 score, and MCC compared with traditional models such as BERT, Roberta, T5, and GPT-4. In particular, in terms of F1 score and MCC, our model shows stronger robustness and discrimination ability, which proves the potential of the improved LoRA algorithm in fine-tuning large-scale pre-trained models. In addition, this paper also discusses the application prospects of the improved LoRA algorithm in other natural language processing tasks, emphasizing its advantages in multi-task learning and scenarios with limited computing resources. Future research can further optimize the LoRA fine-tuning strategy and expand its application in larger-scale pre-trained models to improve the generalization ability and task adaptability of the model.
zh
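作为背景,下面给出 LoRA 低秩旁路的最小 PyTorch 草图(冻结原线性层、只训练低秩矩阵 A/B);这只是标准 LoRA 的通用写法,并不包含该论文所做的具体改进。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """在冻结的线性层旁加上低秩增量 (alpha/r)·B·A 的最小示意实现。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
x = torch.randn(4, 768)
print(layer(x).shape)                           # torch.Size([4, 768])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("可训练参数量:", trainable)               # 仅 A、B 两个低秩矩阵
```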

[NLP-49] Using Large Language Models for Automated Grading of Student Writing about Science

【速读】: 该论文旨在解决大规模课程中学生写作评估的挑战,特别是在科学类课程中,传统上依赖客观评估工具(如选择题)而难以有效评估写作能力的问题。解决方案的关键在于利用大语言模型(LLMs),特别是GPT-4,来评估学生的短篇写作作业。实验通过将GPT-4与教师评分进行对比,验证了LLMs在评估天文学相关写作作业中的可靠性。实验结果表明,GPT-4在评分一致性上优于同伴评分,并与教师评分相当,表明LLMs可以用于自动化、可靠且可扩展的学生科学写作评估。

链接: https://arxiv.org/abs/2412.18719
作者: Chris Impey,Matthew Wenger,Nikhil Garuda,Shahriar Golchin,Sarah Stamer
机构: 未知
关键词: informal learners presents, Assessing writing, significant challenge, large classes, formal or informal
类目: Computation and Language (cs.CL)
备注: Accepted at IJAIE

点击查看摘要

Abstract:Assessing writing in large classes for formal or informal learners presents a significant challenge. Consequently, most large classes, particularly in science, rely on objective assessment tools such as multiple-choice quizzes, which have a single correct answer. The rapid development of AI has introduced the possibility of using large language models (LLMs) to evaluate student writing. An experiment was conducted using GPT-4 to determine if machine learning methods based on LLMs can match or exceed the reliability of instructor grading in evaluating short writing assignments on topics in astronomy. The audience consisted of adult learners in three massive open online courses (MOOCs) offered through Coursera. One course was on astronomy, the second was on astrobiology, and the third was on the history and philosophy of astronomy. The results should also be applicable to non-science majors in university settings, where the content and modes of evaluation are similar. The data comprised answers from 120 students to 12 questions across the three courses. GPT-4 was provided with total grades, model answers, and rubrics from an instructor for all three courses. In addition to evaluating how reliably the LLM reproduced instructor grades, the LLM was also tasked with generating its own rubrics. Overall, the LLM was more reliable than peer grading, both in aggregate and by individual student, and approximately matched instructor grades for all three online courses. The implication is that LLMs may soon be used for automated, reliable, and scalable grading of student science writing.
zh

[NLP-50] Multiple References with Meaningful Variations Improve Literary Machine Translation

【速读】: 该论文旨在解决机器翻译(MT)模型训练中单一参考翻译(single reference)的局限性问题。尽管源句子可以有多种翻译方式,但大多数MT模型仅使用单一参考进行训练。论文通过分析Par3数据集中世界文学的不同英文翻译之间的语义相似性,探讨了使用多参考翻译(multiple references)的最佳实践。解决方案的关键在于将语义相似性分为低、中、高三类,并基于此对两种不同的LLMs(mT5-large和LLaMA-2-7B)进行微调,用于下游MT任务。实验结果表明,在保持总训练实例不变的情况下,使用中等和高语义相似性的多参考翻译优于未过滤的数据集(+BLEU 0.3-0.5, +COMET 0.2-0.9, +chrF++ 0.25-0.32)。

链接: https://arxiv.org/abs/2412.18707
作者: Si Wu,John Wieting,David A. Smith
机构: 未知
关键词: semantic similarity, machine translation, single reference, source sentence, Abstract
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While a source sentence can be translated in many ways, most machine translation (MT) models are trained with only a single reference. Previous work has shown that using synthetic paraphrases can improve MT. This paper investigates best practices for employing multiple references by analyzing the semantic similarity among different English translations of world literature in the Par3 dataset. We classify the semantic similarity between paraphrases into three groups: low, medium, and high, and fine-tune two different LLMs (mT5-large and LLaMA-2-7B) for downstream MT tasks. Across different models, holding the total training instances constant, single-reference but more source texts only marginally outperforms multiple-reference with half of the source texts. Moreover, using paraphrases of medium and high semantic similarity outperforms an unfiltered dataset (+BLEU 0.3-0.5, +COMET 0.2-0.9, +chrF++ 0.25-0.32). Our code is publicly available on GitHub.
zh
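论文按语义相似度把多参考译文分为低/中/高三组,下面用 sentence-transformers 给出一个分组函数的草图;其中 0.6/0.8 两个阈值与所用句向量模型(all-MiniLM-L6-v2)均为示意性假设,论文的具体划分标准以原文为准。

```python
from sentence_transformers import SentenceTransformer, util

# 用句向量余弦相似度把同一源句的两条译文划入低/中/高相似度组
model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_group(ref_a, ref_b, low=0.6, high=0.8):
    emb = model.encode([ref_a, ref_b], convert_to_tensor=True)
    sim = util.cos_sim(emb[0], emb[1]).item()
    if sim < low:
        return sim, "low"
    return sim, "medium" if sim < high else "high"

print(similarity_group(
    "He walked slowly into the dark forest.",
    "Slowly, he entered the gloomy woods."))
```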

[NLP-51] CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era

【速读】: 该论文旨在解决从现代百科知识图谱(如Wikidata)中检索信息以增强大语言模型(LLM)能力时面临的效率问题。具体而言,现有的RDF知识图谱由于模式过大、资源标识符的使用、关系类型重叠以及缺乏规范化,导致其难以在LLM的典型上下文窗口内高效检索。为此,论文提出了一种解决方案,即在底层RDF图谱之上构建属性图谱视图(property graph views),并通过Cypher查询语言实现高效检索。该方案的关键在于开发了RDF到属性图谱的转换引擎,建立了系统化的文本到Cypher任务生成管道,并设计了新的评估指标。通过在Wikidata上实例化这一方法,论文还引入了CypherBench,这是首个包含11个大规模、多领域属性图谱及超过10,000个问题的基准测试集。

链接: https://arxiv.org/abs/2412.18702
作者: Yanlin Feng,Simone Papicchio,Sajjadur Rahman
机构: 未知
关键词: private enterprise data, recent GraphRAG system, large language models, augmenting large language, enterprise data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Retrieval from graph data is crucial for augmenting large language models (LLM) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (edge et al., 2024). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (e.g. Langchain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (e.g. Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, use of resource identifiers, overlapping relation types and lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiated this idea on Wikidata and introduced CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackled several key challenges, including developing an RDF-to-property graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.
zh

[NLP-52] Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

【速读】: 该论文旨在解决自动化红队测试(Automated Red Teaming)中生成多样且有效的攻击样本的核心挑战。以往的方法通常只能在多样性和有效性之间优化其一,而难以同时兼顾两者。论文提出的解决方案将任务分解为两个步骤:首先,通过自动化方法生成多样化的攻击目标;其次,针对这些目标生成有效的攻击。关键贡献在于训练了一个强化学习(Reinforcement Learning, RL)攻击模型,该模型不仅遵循这些目标,还能为目标生成多样化的攻击。具体而言,论文展示了如何利用大语言模型(Large Language Model, LLM)通过目标特定的提示和奖励(包括基于规则的奖励,Rule-Based Rewards, RBRs)来生成多样化的攻击目标,并通过多步强化学习训练攻击模型,使其在生成与以往尝试不同的攻击时获得奖励,从而在保持有效性的同时进一步提升多样性。该方法在生成提示注入攻击(Prompt Injection Attacks)和引发不安全响应的提示(Prompts that Elicit Unsafe Responses)方面表现出色,生成的攻击样本在有效性和多样性上均优于以往的红队测试方法。

链接: https://arxiv.org/abs/2412.18693
作者: Alex Beutel,Kai Xiao,Johannes Heidecke,Lilian Weng
机构: 未知
关键词: Automated red teaming, red teaming, Automated red, discover rare model, rare model failures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated red teaming can discover rare model failures and generate challenging examples that can be used for training or evaluation. However, a core challenge in automated red teaming is ensuring that the attacks are both diverse and effective. Prior methods typically succeed in optimizing either for diversity or for effectiveness, but rarely both. In this paper, we provide methods that enable automated red teaming to generate a large number of diverse and successful attacks. Our approach decomposes the task into two steps: (1) automated methods for generating diverse attack goals and (2) generating effective attacks for those goals. While we provide multiple straightforward methods for generating diverse goals, our key contributions are to train an RL attacker that both follows those goals and generates diverse attacks for those goals. First, we demonstrate that it is easy to use a large language model (LLM) to generate diverse attacker goals with per-goal prompts and rewards, including rule-based rewards (RBRs) to grade whether the attacks are successful for the particular goal. Second, we demonstrate how training the attacker model with multi-step RL, where the model is rewarded for generating attacks that are different from past attempts further increases diversity while remaining effective. We use our approach to generate both prompt injection attacks and prompts that elicit unsafe responses. In both cases, we find that our approach is able to generate highly-effective and considerably more diverse attacks than past general red-teaming approaches.
zh

[NLP-53] AgreeMate: Teaching LLM s to Haggle

【速读】: 该论文旨在解决大型语言模型(LLMs)在战略价格谈判中的有效性问题,特别是在自然语言环境下进行商品议价的应用场景。论文提出的解决方案关键包括以下几个方面:首先,采用解耦(模块化)的议价架构,使得两个代理(即买方或卖方)能够通过自然语言进行议价;其次,通过提示工程(prompt engineering)、微调(fine-tuning)和链式思维提示(chain-of-thought prompting)等技术手段,显著提升了模型在谈判任务中的表现;最后,利用注意力探测(attention probing)技术,展示了模型在谈判过程中对语义关系的关注,进一步验证了其有效性。这些方法共同构成了AgreeMate框架的核心,旨在优化LLMs在战略谈判中的应用。

链接: https://arxiv.org/abs/2412.18690
作者: Ainesh Chatterjee,Samuel Miller,Nithin Parepally
机构: 未知
关键词: perform strategic price, training Large Language, Large Language Models, strategic price negotiations, Large Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 22 figures, 6 tables

点击查看摘要

Abstract:We introduce AgreeMate, a framework for training Large Language Models (LLMs) to perform strategic price negotiations through natural language. We apply recent advances to a negotiation setting where two agents (i.e. buyer or seller) use natural language to bargain on goods using coarse actions. Specifically, we present the performance of Large Language Models when used as agents within a decoupled (modular) bargaining architecture. We demonstrate that using prompt engineering, fine-tuning, and chain-of-thought prompting enhances model performance, as defined by novel metrics. We use attention probing to show model attention to semantic relationships between tokens during negotiations.
zh

[NLP-54] From Hallucinations to Facts: Enhancing Language Models with Curated Knowledge Graphs

【速读】: 该论文旨在解决语言模型(Language Models)中的幻觉(Hallucination)问题,即模型生成偏离事实准确性或连贯性的回答,从而影响其在自然语言处理任务中的有效性和可信度。解决方案的关键在于通过整合精心筛选的知识图谱(Knowledge Graph, KG)三元组,将模型的回答锚定在实证数据上。具体而言,论文构建了一个基于维基百科的全面知识图谱库,并精炼数据以突出模型训练所需的关键信息。通过为语言模型提供对这些精选知识的访问,论文旨在生成既语言流畅又基于事实准确性和上下文相关性的回答。这种集成通过为模型提供坚实的信息基础,减少了幻觉现象,使模型在生成回答时能够利用丰富的知识储备。实验评估表明,该方法在减少幻觉回答方面具有显著效果,凸显了知识图谱在提升语言模型输出可靠性和可信度中的重要作用。

链接: https://arxiv.org/abs/2412.18672
作者: Ratnesh Kumar Joshi,Sagnik Sengupta,Asif Ekbal
机构: 未知
关键词: persistent challenge plaguing, challenge plaguing language, natural language processing, language processing endeavors, undermines their efficacy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 Pages, 5 Tables, 2 figures

点击查看摘要

Abstract:Hallucination, a persistent challenge plaguing language models, undermines their efficacy and trustworthiness in various natural language processing endeavors by generating responses that deviate from factual accuracy or coherence. This paper addresses language model hallucination by integrating curated knowledge graph (KG) triples to anchor responses in empirical data. We meticulously select and integrate relevant KG triples tailored to specific contexts, enhancing factual grounding and alignment with input. Our contribution involves constructing a comprehensive KG repository from Wikipedia and refining data to spotlight essential information for model training. By imbuing language models with access to this curated knowledge, we aim to generate both linguistically fluent responses and deeply rooted in factual accuracy and context relevance. This integration mitigates hallucinations by providing a robust foundation of information, enabling models to draw upon a rich reservoir of factual data during response generation. Experimental evaluations demonstrate the effectiveness of multiple approaches in reducing hallucinatory responses, underscoring the role of curated knowledge graphs in improving the reliability and trustworthiness of language model outputs.
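As a rough illustration of the grounding step, the sketch below retrieves knowledge-graph triples by simple word overlap and prepends them to the prompt. A curated-KG system like the one described would use entity linking and much stronger retrieval, so the ranking function here is only a placeholder.

```python
def retrieve_triples(question, kg_triples, top_k=3):
    # Rank (subject, relation, object) triples by word overlap with the question.
    q_words = set(question.lower().split())
    overlap = lambda t: len(q_words & set(" ".join(t).lower().split()))
    return sorted(kg_triples, key=overlap, reverse=True)[:top_k]

def build_grounded_prompt(question, kg_triples):
    facts = retrieve_triples(question, kg_triples)
    fact_lines = "\n".join(f"- {s} {r} {o}" for s, r, o in facts)
    return f"Known facts:\n{fact_lines}\n\nQuestion: {question}\nAnswer:"

# Tiny example with a two-triple "knowledge graph".
kg = [("Paris", "is the capital of", "France"),
      ("Berlin", "is the capital of", "Germany")]
print(build_grounded_prompt("What is the capital of France?", kg))
```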
zh

[NLP-55] Advancing Explainability in Neural Machine Translation: Analytical Metrics for Attention and Alignment Consistency DATE

【速读】: 该论文旨在解决神经机器翻译(Neural Machine Translation, NMT)模型在决策过程中的不透明性问题,特别是其内部注意力机制(attention mechanisms)的可解释性。为了增强对这些模型的信任并验证其行为是否符合预期,作者提出了一种系统性框架,通过将NMT模型的注意力模式与统计对齐(statistical alignments)进行比较,并将其与标准机器翻译质量指标相关联,来定量评估其可解释性。该框架的关键在于引入了一组新的度量指标,包括注意力熵(attention entropy)和对齐一致性(alignment agreement),并在WMT14的英德测试子集上使用预训练的mT5模型进行了验证。研究结果表明,更集中的注意力分布与更高的可解释性相关,但并不总是保证更好的翻译质量。这些发现为理解NMT的可解释性提供了新的视角,并为未来构建更透明和可靠的机器翻译系统提供了指导。

链接: https://arxiv.org/abs/2412.18669
作者: Anurag Mishra
机构: 未知
关键词: decision making processes, Neural Machine Translation, shown remarkable performance, remain largely opaque, Neural Machine
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 4 pages, 3 figures, research paper from the Rochester Institute of Technology, focused on explainability in Neural Machine Translation. Validated metrics using English-German data subset from WMT14 and mT5 model. Results connect attention entropy and alignment agreement with translation quality

点击查看摘要

Abstract:Neural Machine Translation (NMT) models have shown remarkable performance but remain largely opaque in their decision making processes. The interpretability of these models, especially their internal attention mechanisms, is critical for building trust and verifying that these systems behave as intended. In this work, we introduce a systematic framework to quantitatively evaluate the explainability of an NMT model attention patterns by comparing them against statistical alignments and correlating them with standard machine translation quality metrics. We present a set of metrics attention entropy and alignment agreement and validate them on an English-German test subset from WMT14 using a pre trained mT5 model. Our results indicate that sharper attention distributions correlate with improved interpretability but do not always guarantee better translation quality. These findings advance our understanding of NMT explainability and guide future efforts toward building more transparent and reliable machine translation systems.
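The two metrics named in the abstract can be read roughly as follows; the sketch is one plausible formulation under our own conventions (row-normalized attention, argmax matching against reference alignment links), not the paper's exact definitions.

```python
import numpy as np

def attention_entropy(attn_row, eps=1e-12):
    """Entropy of one target token's attention over source tokens.
    Lower entropy means a sharper, more focused distribution."""
    p = attn_row / (attn_row.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def alignment_agreement(attn_matrix, alignments):
    """Fraction of target positions whose argmax attention falls on a source position
    licensed by a reference (e.g., statistical) alignment.
    `alignments` maps target index to a set of source indices."""
    hits = sum(int(attn_matrix[t].argmax()) in alignments.get(t, set())
               for t in range(attn_matrix.shape[0]))
    return hits / attn_matrix.shape[0]
```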
zh

[NLP-56] Simple is not Enough: Document-level Text Simplification using Readability and Coherence

【速读】: 该论文旨在解决文本简化(Text Simplification, TS)领域中长期存在的局限性,即现有研究主要集中在句子级别的简化,而忽略了段落或文档级别的简化,后者对大多数TS受众更为有益。为此,论文提出了SimDoc系统,该系统在简化过程中综合考虑了简洁性(simplicity)、可读性(readability)和语篇连贯性(coherence)等多个方面。解决方案的关键在于:首先,系统通过专业创建的语料库进行微调;其次,在训练过程中引入了多目标优化,同时考虑简洁性、可读性和连贯性;此外,论文还扩展了专业标注的简化语料库,将现有注释关联为(复杂文本、简化文本、可读性标签)三元组,以便在训练中利用可读性信息。最后,论文通过零样本、少样本和微调设置下的对比分析,评估了所提出模型在文档级别TS语料库上的表现,展示了文档级别简化的新方法。

链接: https://arxiv.org/abs/2412.18655
作者: Laura Vásquez-Rodríguez,Nhung T.H. Nguyen,Piotr Przybyła,Matthew Shardlow,Sophia Ananiadou
机构: 未知
关键词: discourse aspects, simplification, SimDoc system, readability, Text
类目: Computation and Language (cs.CL)
备注: 16 pages, 3 figures, 8 tables

点击查看摘要

Abstract:In this paper, we present the SimDoc system, a simplification model considering simplicity, readability, and discourse aspects, such as coherence. In the past decade, the progress of the Text Simplification (TS) field has been mostly shown at a sentence level, rather than considering paragraphs or documents, a setting from which most TS audiences would benefit. We propose a simplification system that is initially fine-tuned with professionally created corpora. Further, we include multiple objectives during training, considering simplicity, readability, and coherence altogether. Our contributions include the extension of professionally annotated simplification corpora by the association of existing annotations into (complex text, simple text, readability label) triples to benefit from readability during training. Also, we present a comparative analysis in which we evaluate our proposed models in a zero-shot, few-shot, and fine-tuning setting using document-level TS corpora, demonstrating novel methods for simplification. Finally, we show a detailed analysis of outputs, highlighting the difficulties of simplification at a document level.
zh

[NLP-57] DynaGRAG: Improving Language Understanding and Generation through Dynamic Subgraph Representation in Graph Retrieval-Augmented Generation

【速读】: 该论文旨在解决在利用外部知识增强语言理解和生成过程中,如何有效捕捉和整合文本与结构化数据中丰富语义信息的挑战。为此,论文提出了一种新颖的图检索增强生成(Graph Retrieval-Augmented Generation, GRAG)框架,其核心在于提升知识图谱中子图的表示和多样性。具体解决方案包括:通过去重过程、两步均值池化嵌入、考虑独特节点的查询感知检索,以及动态相似性感知广度优先搜索(Dynamic Similarity-Aware BFS, DSA-BFS)遍历算法,来增强图密度、更有效地捕捉实体和关系信息,并动态优先选择相关且多样的子图。此外,通过硬提示(hard prompting)将图卷积网络(Graph Convolutional Networks, GCNs)与大语言模型(Large Language Models, LLMs)结合,进一步丰富了节点和边的表示,同时保持了层次化的子图结构。实验结果表明,该框架在多个基准数据集上表现出色,验证了增强子图表示和多样性对于提升语言理解和生成的重要性。

链接: https://arxiv.org/abs/2412.18644
作者: Karishma Thakrar
机构: 未知
关键词: leveraging external knowledge, Graph RAG, Graph Retrieval-Augmented Generation, architectures aim, leveraging external
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Graph Retrieval-Augmented Generation (GRAG or Graph RAG) architectures aim to enhance language understanding and generation by leveraging external knowledge. However, effectively capturing and integrating the rich semantic information present in textual and structured data remains a challenge. To address this, a novel GRAG framework is proposed to focus on enhancing subgraph representation and diversity within the knowledge graph. By improving graph density, capturing entity and relation information more effectively, and dynamically prioritizing relevant and diverse subgraphs, the proposed approach enables a more comprehensive understanding of the underlying semantic structure. This is achieved through a combination of de-duplication processes, two-step mean pooling of embeddings, query-aware retrieval considering unique nodes, and a Dynamic Similarity-Aware BFS (DSA-BFS) traversal algorithm. Integrating Graph Convolutional Networks (GCNs) and Large Language Models (LLMs) through hard prompting further enhances the learning of rich node and edge representations while preserving the hierarchical subgraph structure. Experimental results on multiple benchmark datasets demonstrate the effectiveness of the proposed GRAG framework, showcasing the significance of enhanced subgraph representation and diversity for improved language understanding and generation.
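A minimal sketch of a similarity-aware traversal in the spirit of the DSA-BFS component: neighbors are expanded in order of their similarity to the query, yielding a query-relevant subgraph. The function names and the `max_nodes` cutoff are our own assumptions, not the paper's algorithm.

```python
import heapq
from itertools import count

def dsa_bfs(graph, node_sim, start, query, max_nodes=10):
    """Similarity-aware BFS: `graph` maps node -> iterable of neighbors,
    `node_sim(node, query)` -> float. Returns nodes of a query-relevant subgraph."""
    tie = count()                       # tie-breaker so the heap never compares nodes
    visited = {start}
    frontier = [(-node_sim(start, query), next(tie), start)]
    selected = []
    while frontier and len(selected) < max_nodes:
        _, _, node = heapq.heappop(frontier)
        selected.append(node)
        for nb in graph.get(node, []):
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (-node_sim(nb, query), next(tie), nb))
    return selected
```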
zh

[NLP-58] KRAIL: A Knowledge-Driven Framework for Base Human Reliability Analysis Integrating IDHEAS and Large Language Models

【速读】: 该论文旨在解决现有的人类可靠性分析(Human Reliability Analysis, HRA)方法在估计人类错误概率(Human Error Probability, HEP)时过度依赖专家知识、主观性强且耗时的问题。为此,论文提出了一种创新的两阶段框架,即知识驱动的可靠性分析(Knowledge-driven Reliability Analysis, KRAIL),该框架结合了IDHEAS和大语言模型(Large Language Models, LLMs),实现了半自动化的基础HEP值计算。解决方案的关键在于利用知识图谱作为检索增强生成(Retrieval-Augmented Generation, RAG)的一种形式,从而提升框架在检索和处理相关数据时的效率。通过系统化的实验验证,该框架在部分信息条件下进行可靠性评估时,表现出优越的基础HEP估计性能。

链接: https://arxiv.org/abs/2412.18627
作者: Xingyu Xiao,Peng Chen,Ben Qi,Hongru Zhao,Jingang Liang,Jiejuan Tong,Haitao Wang
机构: 未知
关键词: complex systems, Human reliability analysis, crucial for evaluating, evaluating and improving, improving the safety
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Human reliability analysis (HRA) is crucial for evaluating and improving the safety of complex systems. Recent efforts have focused on estimating human error probability (HEP), but existing methods often rely heavily on expert knowledge, which can be subjective and time-consuming. Inspired by the success of large language models (LLMs) in natural language processing, this paper introduces a novel two-stage framework for knowledge-driven reliability analysis, integrating IDHEAS and LLMs (KRAIL). This innovative framework enables the semi-automated computation of base HEP values. Additionally, knowledge graphs are utilized as a form of retrieval-augmented generation (RAG) for enhancing the framework’s capability to retrieve and process relevant data efficiently. Experiments are systematically conducted and evaluated on authoritative datasets of human reliability. The experimental results of the proposed methodology demonstrate its superior performance on base HEP estimation under partial information for reliability assessment.
zh

[NLP-59] Why Do Large Language Models (LLMs) Struggle to Count Letters?

【速读】: 该论文旨在探讨大语言模型(LLMs)在统计单词中字母出现次数时表现不佳的问题,并分析其背后的原因。论文通过实验研究,评估了LLMs在字母计数任务中的错误与以下两个因素的关系:1)单词及其组成部分在训练数据集中的频率;2)计数操作的复杂性。研究结果表明,LLMs能够识别字母但无法准确计数,且单词和词元的频率对错误率影响不大。关键发现包括:字母频率与错误率呈正相关,高频字母的计数错误更多;错误率与单词中的字母或词元数量密切相关,尤其是当字母出现次数超过两次时,大多数模型无法正确计数。解决方案的关键在于深入分析这些错误模式,为进一步优化LLMs的字符级处理能力提供依据。

链接: https://arxiv.org/abs/2412.18626
作者: Tairan Fu,Raquel Ferrando,Javier Conde,Carlos Arriaga,Pedro Reviriego
机构: 未知
关键词: Large Language Models, achieved unprecedented performance, Large Language, Language Models, letters
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved unprecedented performance on many complex tasks, being able, for example, to answer questions on almost any topic. However, they struggle with other simple tasks, such as counting the occurrences of letters in a word, as illustrated by the inability of many LLMs to count the number of “r” letters in “strawberry”. Several works have studied this problem and linked it to the tokenization used by LLMs, to the intrinsic limitations of the attention mechanism, or to the lack of character-level training data. In this paper, we conduct an experimental study to evaluate the relations between the LLM errors when counting letters with 1) the frequency of the word and its components in the training dataset and 2) the complexity of the counting operation. We present a comprehensive analysis of the errors of LLMs when counting letter occurrences by evaluating a representative group of models over a large number of words. The results show a number of consistent trends in the models evaluated: 1) models are capable of recognizing the letters but not counting them; 2) the frequency of the word and tokens in the word does not have a significant impact on the LLM errors; 3) there is a positive correlation of letter frequency with errors, more frequent letters tend to have more counting errors, 4) the errors show a strong correlation with the number of letters or tokens in a word and 5) the strongest correlation occurs with the number of letters with counts larger than one, with most models being unable to correctly count words in which letters appear more than twice.
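For reference, the ground-truth computation the models are asked to reproduce is trivial in code, which is what makes the failure mode interesting; the snippet also isolates the repeated-letter case the abstract identifies as hardest.

```python
from collections import Counter

def letter_counts(word):
    return Counter(word.lower())

def repeated_letters(word):
    # Letters appearing more than once: the case most evaluated models reportedly fail on.
    return {c: n for c, n in letter_counts(word).items() if n > 1}

print(letter_counts("strawberry")["r"])   # 3, the classic failure example
print(repeated_letters("strawberry"))     # {'r': 3}
```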
zh

[NLP-60] Investigating the Feasibility of Mitigating Potential Copyright Infringement via Large Language Model Unlearning

【速读】: 该论文旨在解决预训练大语言模型(LLMs)在生成受版权保护内容时引发的法律和伦理问题。具体而言,模型所有者需要在不同时间点应对内容删除请求,以避免版权侵权。为此,论文提出了一种称为“稳定序列遗忘”(Stable Sequential Unlearning, SSU)的新框架,用于在多时间步骤中从LLMs中移除受版权保护的内容。SSU的关键在于通过任务向量(task vectors)识别并移除模型中与受版权保护内容相关的特定权重更新,同时通过引入随机标签损失(random labeling loss)和基于梯度的权重显著性(gradient-based weight saliency)调整目标参数,确保模型在移除特定内容的同时保留其通用语言能力。实验结果表明,SSU在遗忘效果与通用语言能力之间实现了有效平衡,但并非解决所有版权遗忘问题的万能方案。

链接: https://arxiv.org/abs/2412.18621
作者: Guangyao Dou
机构: 未知
关键词: Pre-trained Large Language, Pre-trained Large, demonstrated remarkable capabilities, Large Language Models, Large Language
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2406.10952

点击查看摘要

Abstract:Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but also pose risks by learning and generating copyrighted material, leading to significant legal and ethical concerns. In a potential real-world scenario, model owners may need to continuously address copyright infringement in order to address requests for content removal that emerge at different time points. One potential way of addressing this is via sequential unlearning, where copyrighted content is removed sequentially as new requests arise. Despite its practical relevance, sequential unlearning in the context of copyright infringement has not been rigorously explored in existing literature. To address this gap, we propose Stable Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted content from LLMs over multiple time steps. Our approach works by identifying and removing specific weight updates in the model’s parameters that correspond to copyrighted content using task vectors. We improve unlearning efficacy by introducing random labeling loss and ensuring the model retains its general-purpose knowledge by adjusting targeted parameters with gradient-based weight saliency. Extensive experimental results show that SSU sometimes achieves an effective trade-off between unlearning efficacy and general-purpose language abilities, outperforming existing baselines, but it’s not a cure-all for unlearning copyrighted material.
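The task-vector step the abstract refers to can be sketched with generic task arithmetic: fine-tune a copy of the model on the content to forget, treat the resulting weight update as a task vector, and subtract it from the deployed weights. This shows only the generic idea under our own naming; SSU additionally uses random-labeling loss and gradient-based weight saliency, which are not shown here.

```python
def forget_via_task_vector(current_state, tuned_on_forget_state, scale=1.0):
    """Generic task-vector negation over two state dicts of arrays/tensors."""
    new_state = {}
    for name, w in current_state.items():
        task_vector = tuned_on_forget_state[name] - w   # update tied to the forget set
        new_state[name] = w - scale * task_vector       # move the weights away from it
    return new_state
```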
zh

[NLP-61] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

【速读】: 该论文旨在解决多模态学习(multimodal learning)中理解和生成任务的统一问题,特别是在不同模态(如文本、图像、音频等)中如何有效地应用下一个标记预测(Next Token Prediction, NTP)框架。解决方案的关键在于提出了一种全面的分类法(taxonomy),该分类法通过NTP的视角将多模态学习中的理解和生成任务统一起来。具体而言,该分类法涵盖了五个关键方面:多模态标记化(Multimodal tokenization)、多模态NTP模型架构(MMNTP model architectures)、统一任务表示(unified task representation)、数据集与评估(datasets & evaluation)以及开放挑战(open challenges)。这一分类法旨在帮助研究人员更好地探索多模态智能(multimodal intelligence),并提供了一个GitHub仓库以收集最新的相关论文和代码库。

链接: https://arxiv.org/abs/2412.18619
作者: Liang Chen,Zekun Wang,Shuhuai Ren,Lei Li,Haozhe Zhao,Yunshui Li,Zefan Cai,Hongcheng Guo,Lei Zhang,Yizhe Xiong,Yichi Zhang,Ruoyu Wu,Qingxiu Dong,Ge Zhang,Jian Yang,Lingwei Meng,Shujie Hu,Yulong Chen,Junyang Lin,Shuai Bai,Andreas Vlachos,Xu Tan,Minjia Zhang,Wen Xiao,Aaron Yee,Tianyu Liu,Baobao Chang
机构: 未知
关键词: achieving considerable success, natural language processing, versatile training objective, Token Prediction, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: 69 pages, 18 figures, repo at this https URL

点击查看摘要

Abstract:Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming the multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: Multimodal tokenization, MMNTP model architectures, unified task representation, datasets & evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at this https URL
zh

[NLP-62] Exploring Text Representations for Online Misinformation

【速读】: 该论文旨在解决虚假新闻(fake news)在社会中的传播问题,特别是在文本形式(如社交媒体帖子和博客文章)中的传播。虚假新闻在政治和医疗等领域的影响尤为显著,且随着技术进步,其形式不断演变。论文的核心贡献在于开发了一种用于检测虚假新闻的文本特征提取方法,该方法利用真实新闻和虚假新闻在主题连贯性(thematic coherence)上的差异。具体而言,真实新闻和虚假新闻在故事发展过程中所讨论的主题构成存在显著不同。此外,论文通过分类和聚类(clustering)方法验证了主题特征在虚假新闻检测中的有效性,其中聚类方法尤其具有优势,因为它减少了对标注数据集的依赖,而标注数据集的获取通常耗时且费力。总体而言,该研究通过机器学习和自然语言处理技术,为更好地理解虚假新闻及其检测方法提供了新的视角。

链接: https://arxiv.org/abs/2412.18618
作者: Martins Samuel Dogo
机构: 未知
关键词: commonly collectively called, collectively called fake, commonly collectively, menace society, collectively called
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Masters Thesis, 106 pages, 11 figures

点击查看摘要

Abstract:Mis- and disinformation, commonly collectively called fake news, continue to menace society. Perhaps, the impact of this age-old problem is presently most plain in politics and healthcare. However, fake news is affecting an increasing number of domains. It takes many different forms and continues to shapeshift as technology advances. Though it arguably most widely spreads in textual form, e.g., through social media posts and blog articles. Thus, it is imperative to thwart the spread of textual misinformation, which necessitates its initial detection. This thesis contributes to the creation of representations that are useful for detecting misinformation. Firstly, it develops a novel method for extracting textual features from news articles for misinformation detection. These features harness the disparity between the thematic coherence of authentic and false news stories. In other words, the composition of themes discussed in both groups significantly differs as the story progresses. Secondly, it demonstrates the effectiveness of topic features for fake news detection, using classification and clustering. Clustering is particularly useful because it alleviates the need for a labelled dataset, which can be labour-intensive and time-consuming to amass. More generally, it contributes towards a better understanding of misinformation and ways of detecting it using Machine Learning and Natural Language Processing.
zh

[NLP-63] HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios

【速读】: 该论文旨在解决在人类与大型语言模型(LLMs)交互过程中,评估LLMs功能调用能力的挑战。由于对话过程的复杂性和开放性,现有的评估方法难以全面反映LLMs的实际表现。为此,论文提出了HammerBench这一新型基准测试框架,通过模拟移动设备上的多种真实用户场景(包括不完善的指令、多样化的问答轨迹、意图/参数转移以及通过代词使用外部个人信息),来更有效地评估LLMs的功能调用能力。解决方案的关键在于构建高质量的数据集,该数据集通过LLM生成数据并结合多轮人工验证来确保数据的准确性。此外,论文将对话分解为功能调用快照,从而实现对每个对话回合的细粒度评估。通过HammerBench对多个流行LLMs进行评估,研究发现参数命名错误是导致不同数据类型对话失败的主要因素。

链接: https://arxiv.org/abs/2412.16516
作者: Jun Wang,Jiamu Zhou,Muning Wen,Xiaoyun Mo,Haoyu Zhang,Qiqiang Lin,Cheng Jin,Xihuai Wang,Weinan Zhang,Qiuying Peng,Jun Wang
机构: 未知
关键词: remains challenging due, human-LLM interactions remains, interactions remains challenging, large language models, Evaluating the capabilities
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Evaluating the capabilities of large language models (LLMs) in human-LLM interactions remains challenging due to the inherent complexity and openness of dialogue processes. This paper introduces HammerBench, a novel benchmarking framework designed to assess the function-calling ability of LLMs more effectively in such interactions. We model a wide range of real-world user scenarios on mobile devices, encompassing imperfect instructions, diverse question-answer trajectories, intent/argument shifts, and the use of external individual information through pronouns. To construct the corresponding datasets, we propose a comprehensive pipeline that involves LLM-generated data and multiple rounds of human validation, ensuring high data quality. Additionally, we decompose the conversations into function-calling snapshots, enabling a fine-grained evaluation of each turn. We evaluate several popular LLMs using HammerBench and highlight different performance aspects. Our empirical findings reveal that errors in parameter naming constitute the primary factor behind conversation failures across different data types.
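A simplified stand-in for the per-turn snapshot evaluation described above: check the function name and separate argument-naming errors (which the authors report as the dominant failure) from value errors. The dictionary schema is assumed for illustration and is not the benchmark's format.

```python
def score_snapshot(predicted_call, reference_call):
    """Fine-grained check for one function-calling snapshot (one dialogue turn)."""
    name_ok = predicted_call.get("name") == reference_call.get("name")
    ref_args = reference_call.get("arguments", {})
    pred_args = predicted_call.get("arguments", {})
    correct_names = set(pred_args) & set(ref_args)
    wrong_names = set(pred_args) - set(ref_args)        # parameter-naming errors
    value_hits = sum(1 for k in correct_names if pred_args[k] == ref_args[k])
    return {
        "function_correct": name_ok,
        "param_name_errors": sorted(wrong_names),
        "value_accuracy": value_hits / max(len(ref_args), 1),
    }

# Example: the model invents a parameter name "city_name" instead of "city".
print(score_snapshot(
    {"name": "get_weather", "arguments": {"city_name": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
))
```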
zh

[NLP-64] Robust Speech and Natural Language Processing Models for Depression Screening

【速读】: 该论文旨在解决抑郁症(Depression)筛查的全球性健康问题,特别是通过远程技术提高患者筛查的效率和覆盖面。论文提出了两种基于深度学习(Deep Learning)的模型,分别利用声学特征(Acoustics)和自然语言处理(Natural Language Processing, NLP)技术,并通过迁移学习(Transfer Learning)进行优化。这两种模型在抑郁症的二元分类任务中表现出色,未见数据的曲线下面积(AUC)均达到或超过0.80,且在不同说话者和会话变量下表现出较强的鲁棒性。论文的核心解决方案在于利用深度学习模型结合语音技术,实现对抑郁症的自动化筛查,具有广泛应用的潜力。

链接: https://arxiv.org/abs/2412.19072
作者: Y. Lu,A. Harati,T. Rutowski,R. Oliveira,P. Chlebek,E. Shriberg
机构: 未知
关键词: global health concern, increased patient screening, global health, health concern, increased patient
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Depression is a global health concern with a critical need for increased patient screening. Speech technology offers advantages for remote screening but must perform robustly across patients. We have described two deep learning models developed for this purpose. One model is based on acoustics; the other is based on natural language processing. Both models employ transfer learning. Data from a depression-labeled corpus in which 11,000 unique users interacted with a human-machine application using conversational speech is used. Results on binary depression classification have shown that both models perform at or above AUC=0.80 on unseen data with no speaker overlap. Performance is further analyzed as a function of test subset characteristics, finding that the models are generally robust over speaker and session variables. We conclude that models based on these approaches offer promise for generalized automated depression screening.
zh

[NLP-65] Investigating Acoustic-Textual Emotional Inconsistency Information for Automatic Depression Detection

【速读】: 该论文旨在解决如何利用情感表达不一致性(Emotional Expression Inconsistency)来提升抑郁症检测的准确性问题。根据情感背景不敏感理论(Emotion Context-Insensitivity theory)和初步研究,抑郁症患者在自然对话中可能以异常平静的方式表达负面情感,表现出情感表达的高度不一致性。然而,现有研究很少识别并利用这种不一致性进行抑郁症检测。论文提出了一种多模态交叉注意力方法(multimodal cross-attention method),通过分析声学和文本领域中情感表达的复杂局部和长期依赖关系,以及两者之间的情感内容不匹配,来捕捉声学-文本情感不一致性(Acoustic-Textual Emotional Inconsistency, ATEI)。随后,论文提出了一种基于Transformer的模型,结合多种融合策略,将ATEI信息整合到抑郁症检测中。此外,论文还采用了一种缩放技术(scaling technique)来调整融合过程中的ATEI特征程度,从而增强模型在不同严重程度抑郁症患者中的辨别能力。该研究首次将情感表达不一致性信息纳入抑郁症检测,实验结果表明该方法在心理咨询对话数据集上具有显著效果。

链接: https://arxiv.org/abs/2412.18614
作者: Rongfeng Su,Changqing Xu,Xinyi Wu,Feng Xu,Xie Chen,Lan Wangt,Nan Yan
机构: 未知
关键词: depression diagnosis accuracy, enhance depression diagnosis, Previous studies, single acoustic sentiment, acoustic sentiment label
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Previous studies have demonstrated that emotional features from a single acoustic sentiment label can enhance depression diagnosis accuracy. Additionally, according to the Emotion Context-Insensitivity theory and our pilot study, individuals with depression might convey negative emotional content in an unexpectedly calm manner, showing a high degree of inconsistency in emotional expressions during natural conversations. So far, few studies have recognized and leveraged the emotional expression inconsistency for depression detection. In this paper, a multimodal cross-attention method is presented to capture the Acoustic-Textual Emotional Inconsistency (ATEI) information. This is achieved by analyzing the intricate local and long-term dependencies of emotional expressions across acoustic and textual domains, as well as the mismatch between the emotional content within both domains. A Transformer-based model is then proposed to integrate this ATEI information with various fusion strategies for detecting depression. Furthermore, a scaling technique is employed to adjust the ATEI feature degree during the fusion process, thereby enhancing the model’s ability to discern patients with depression across varying levels of severity. To the best of our knowledge, this work is the first to incorporate emotional expression inconsistency information into depression detection. Experimental results on a counseling conversational dataset illustrate the effectiveness of our method.
zh

[NLP-66] The Illusion-Illusion: Vision Language Models See Illusions Where There are None

【速读】: 该论文探讨了当前视觉语言模型在处理感知错觉(perceptual illusions)时出现的基本处理错误。传统上,感知错觉用于揭示人类感知系统与真实世界之间的差异,而本文则通过引入“错觉的错觉”(illusory-illusions)——即那些本不应引发处理错误的合理图像(如真实的歪线、大小不同的圆等)——来测试视觉语言模型是否会将它们误认为错觉。研究发现,许多现有模型在处理这些“错觉的错觉”时仍然错误地将其识别为错觉,这表明这些模型存在更广泛的处理缺陷。本文的关键在于通过这种反向测试方法,揭示了当前视觉语言模型在处理感知信息时的局限性,并为进一步改进模型提供了诊断依据。

链接: https://arxiv.org/abs/2412.18613
作者: Tomer Ullman
机构: 未知
关键词: cognitive science, diagnostic tool, tool in cognitive, Illusions, Abstract
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Illusions are entertaining, but they are also a useful diagnostic tool in cognitive science, philosophy, and neuroscience. A typical illusion shows a gap between how something “really is” and how something “appears to be”, and this gap helps us understand the mental processing that lead to how something appears to be. Illusions are also useful for investigating artificial systems, and much research has examined whether computational models of perceptions fall prey to the same illusions as people. Here, I invert the standard use of perceptual illusions to examine basic processing errors in current vision language models. I present these models with illusory-illusions, neighbors of common illusions that should not elicit processing errors. These include such things as perfectly reasonable ducks, crooked lines that truly are crooked, circles that seem to have different sizes because they are, in fact, of different sizes, and so on. I show that many current vision language systems mistakenly see these illusion-illusions as illusions. I suggest that such failures are part of broader failures already discussed in the literature.
zh

计算机视觉

[CV-0] MVTamperBench: Evaluating Robustness of Vision-Language Models

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在复杂视频理解任务中对现实世界篡改操作的鲁棒性问题。尽管VLMs在视频理解方面取得了显著进展,但其在面对视频篡改(如旋转、丢帧、掩码、替换和重复)时的可靠性尚未得到充分探索,这限制了其在关键应用中的可信度。为解决这一问题,作者提出了MVTamperBench,一个全面的基准测试工具,用于系统评估VLMs在面对各种视频篡改操作时的鲁棒性。通过集成到VLMEvalKit这一模块化评估工具包中,MVTamperBench不仅能够简化测试流程,还促进了模型鲁棒性的进一步研究。该基准测试的引入是开发抗篡改VLMs的关键一步,确保其在现实场景中的可靠性。

链接: https://arxiv.org/abs/2412.19794
作者: Amit Agarwal,Srikant Panda,Angeline Charles,Bhargava Kumar,Hitesh Patel,Priyanranjan Pattnayak,Taki Hasan Rafi,Tejaswini Kumar,Dong-Kyu Chae
机构: 未知
关键词: enabled significant progress, video understanding tasks, complex video understanding, Recent advancements, understanding tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in Vision-Language Models (VLMs) have enabled significant progress in complex video understanding tasks. However, their robustness to real-world manipulations remains underexplored, limiting their reliability in critical applications. To address this gap, we introduce MVTamperBench, a comprehensive benchmark designed to evaluate VLM’s resilience to video tampering effects, including rotation, dropping, masking, substitution, and repetition. By systematically assessing state-of-the-art models, MVTamperBench reveals substantial variability in robustness, with models like InternVL2-8B achieving high performance, while others, such as Llama-VILA1.5-8B, exhibit severe vulnerabilities. To foster broader adoption and reproducibility, MVTamperBench is integrated into VLMEvalKit, a modular evaluation toolkit, enabling streamlined testing and facilitating advancements in model robustness. Our benchmark represents a critical step towards developing tamper-resilient VLMs, ensuring their dependability in real-world scenarios. Project Page: this https URL
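The five tampering families listed in the abstract are easy to mock up on a frame list; the sketch below uses our own parameter conventions and is only meant to make the effect types concrete, not to reproduce the benchmark's construction.

```python
import numpy as np

def tamper(frames, kind, start, length, donor=None):
    """Apply one tampering effect to a video given as a list of HxWx3 arrays."""
    out = list(frames)
    seg = slice(start, start + length)
    if kind == "dropping":                       # remove a segment of frames
        del out[seg]
    elif kind == "repetition":                   # repeat the segment back to back
        out[seg] = list(frames[start:start + length]) * 2
    elif kind == "substitution" and donor is not None:
        out[seg] = donor[:length]                # splice in frames from another clip
    elif kind == "masking":                      # blank out the segment
        out[seg] = [np.zeros_like(f) for f in frames[start:start + length]]
    elif kind == "rotation":                     # rotate the segment by 90 degrees
        out[seg] = [np.rot90(f) for f in frames[start:start + length]]
    return out
```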
zh

[CV-1] Improved image display by identifying the RGB family color space

【速读】: 该论文旨在解决图像显示过程中颜色空间(color space)未知的问题。由于在实际应用中,图像的颜色空间通常未被明确标识,导致显示效果不准确。论文提出了一种基于像素嵌入(pixel embedding)和高斯过程(Gaussian process)的方法,用于自动识别给定彩色图像的颜色空间。该方法支持五种常见的颜色空间,包括Adobe RGB、Apple RGB、ColorMatch RGB、ProPhoto RGB和sRGB。研究结果表明,这一问题仍需进一步深入探索和优化。解决方案的关键在于利用像素嵌入和高斯过程来有效区分不同颜色空间的特征,从而实现准确的识别。

链接: https://arxiv.org/abs/2412.19775
作者: Elvis Togban,Djemel Ziou
机构: 未知
关键词: encoded is assumed, RGB, color, color space, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To display an image, the color space in which the image is encoded is assumed to be known. Unfortunately, this assumption is rarely realistic. In this paper, we propose to identify the color space of a given color image using pixel embedding and the Gaussian process. Five color spaces are supported, namely Adobe RGB, Apple RGB, ColorMatch RGB, ProPhoto RGB and sRGB. The results obtained show that this problem deserves more efforts.
zh

[CV-2] Generative Video Propagation

【速读】: 该论文旨在解决大规模视频生成模型在多种视频任务中的统一应用问题。通过设计一个生成式视频传播框架(GenProp),论文提出了一种能够处理视频编辑、插入、移除和跟踪等任务的统一方法。解决方案的关键在于利用选择性内容编码器对原始视频进行编码,并通过图像到视频生成模型传播对第一帧的修改。此外,论文提出了一种基于实例级视频分割数据集的数据生成方案,并结合掩码预测解码器头和区域感知损失进行模型训练,以确保在生成模型传播修改区域的同时,编码器能够保留原始内容。这一设计使得GenProp在视频编辑中能够显著改变物体形状,在插入任务中使插入物体具有独立运动,在移除任务中有效消除阴影和反射等效果,并在跟踪任务中能够同时跟踪物体及其相关效果。实验结果表明,该模型在多种视频任务中表现出领先性能。

链接: https://arxiv.org/abs/2412.19761
作者: Shaoteng Liu,Tianyu Wang,Jui-Hsien Wang,Qing Liu,Zhifei Zhang,Joon-Young Lee,Yijun Li,Bei Yu,Zhe Lin,Soo Ye Kim,Jiaya Jia
机构: 未知
关键词: Large-scale video generation, model natural scenes, realistically model natural, Large-scale video, natural scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 18 figures

点击查看摘要

Abstract:Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model. We propose a data generation scheme to cover multiple video tasks based on instance-level video segmentation datasets. Our model is trained by incorporating a mask prediction decoder head and optimizing a region-aware loss to aid the encoder to preserve the original content while the generation model propagates the modified region. This novel design opens up new possibilities: In editing scenarios, GenProp allows substantial changes to an object’s shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects like shadows and reflections from the whole video; for tracking, GenProp is capable of tracking objects and their associated effects together. Experiment results demonstrate the leading performance of our model in various video tasks, and we further provide in-depth analyses of the proposed framework.
zh

[CV-3] Sharpening Neural Implicit Functions with Frequency Consolidation Priors AAAI2025

【速读】: 该论文旨在解决基于神经网络的隐式表示(如符号距离函数,SDFs)在表示高频几何结构(如尖锐结构)时的不足。现有方法在从符号距离、3D点云或多视角图像中学习SDF时,由于神经网络对低频内容的偏好、3D采样不敏感、点云稀疏性或图像分辨率低等问题,难以准确捕捉高频几何细节。为解决这一问题,论文提出了一种通过恢复低频SDF观测的高频成分来锐化表面的方法。其核心思想是数据驱动地学习从低频观测到全频覆盖的映射,从而在频域中形成形状整合的先验知识,称为频率整合先验(frequency consolidation priors)。为了更好地将学习到的先验推广到未见过的形状,论文提出将频率成分表示为嵌入,并将低频成分的嵌入与全频成分的嵌入解耦。这种解耦使得先验能够通过测试时的自重建恢复全频嵌入,从而在未见过的低频观测上实现泛化。实验结果表明,该方法能够有效恢复高频成分,并生成比现有方法更精确的表面。

链接: https://arxiv.org/abs/2412.19720
作者: Chao Chen,Yu-Shen Liu,Zhizhong Han
机构: 未知
关键词: Signed Distance Functions, Distance Functions, including signed distances, frequency, Signed Distance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Signed Distance Functions (SDFs) are vital implicit representations to represent high fidelity 3D surfaces. Current methods mainly leverage a neural network to learn an SDF from various supervisions including signed distances, 3D point clouds, or multi-view images. However, due to various reasons including the bias of neural network on low frequency content, 3D unaware sampling, sparsity in point clouds, or low resolutions of images, neural implicit representations still struggle to represent geometries with high frequency components like sharp structures, especially for the ones learned from images or point clouds. To overcome this challenge, we introduce a method to sharpen a low frequency SDF observation by recovering its high frequency components, pursuing a sharper and more complete surface. Our key idea is to learn a mapping from a low frequency observation to a full frequency coverage in a data-driven manner, leading to a prior knowledge of shape consolidation in the frequency domain, dubbed frequency consolidation priors. To better generalize a learned prior to unseen shapes, we introduce to represent frequency components as embeddings and disentangle the embedding of the low frequency component from the embedding of the full frequency component. This disentanglement allows the prior to generalize on an unseen low frequency observation by simply recovering its full frequency embedding through a test-time self-reconstruction. Our evaluations under widely used benchmarks or real scenes show that our method can recover high frequency component and produce more accurate surfaces than the latest methods. The code, data, and pre-trained models are available at this https URL.
zh

[CV-4] From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

【速读】: 该论文旨在解决自动设计组合(automatic design composition)中存在的两个主要问题:现有生成模型通常仅关注特定子任务,未能实现完整的设计组合任务;且在生成过程中未充分考虑图形设计的层次信息(hierarchical information)。为解决这些问题,论文提出了一种名为LaDeCo的新方法,其关键创新在于将分层设计原则(layered design principle)引入大型多模态模型(Large Multimodal Models, LMMs)。具体而言,LaDeCo首先对给定元素集进行分层规划(layer planning),根据内容将输入元素划分为不同的语义层;随后基于规划结果,以分层方式预测控制设计组合的元素属性,并将先前生成层的渲染图像纳入上下文。通过这种设计,LaDeCo将复杂任务分解为更易管理的步骤,使生成过程更加流畅和清晰。实验结果表明,LaDeCo在设计组合任务中表现出色,并在某些设计子任务中超越了专用模型,且无需特定任务训练。

链接: https://arxiv.org/abs/2412.19712
作者: Jiawei Lin,Shizhao Sun,Danqing Huang,Ting Liu,Ji Li,Jiang Bian
机构: 未知
关键词: investigate automatic design, Large Multimodal Models, automatic design composition, design composition, multimodal graphic elements
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: [elements2design](https://elements2design.github.io/)

点击查看摘要

Abstract:In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they only focus on certain subtasks and are far from achieving the design composition task; they do not consider the hierarchical information of graphic designs during the generation process. To tackle these issues, we introduce the layered design principle into Large Multimodal Models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their contents. Based on the planning results, it subsequently predicts element attributes that control the design composition in a layer-wise manner, and includes the rendered image of previously generated layers into the context. With this insightful design, LaDeCo decomposes the difficult task into smaller manageable steps, making the generation process smoother and clearer. The experimental results demonstrate the effectiveness of LaDeCo in design composition. Furthermore, we show that LaDeCo enables some interesting applications in graphic design, such as resolution adjustment, element filling, design variation, etc. In addition, it even outperforms the specialized models in some design subtasks without any task-specific training.
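The layer-wise loop described above can be summarized in pseudocode-style Python: plan layers, then predict attributes one layer at a time while feeding back the render of what has been composed so far. All four callables are placeholders for the model components, so this is a structural sketch only.

```python
def layered_compose(elements, plan_layers, predict_attributes, render):
    """Layer-wise composition loop in the spirit of the described approach."""
    layers = plan_layers(elements)                 # e.g. background, images, text overlays
    composed, canvas = [], None
    for layer in layers:
        # Predict placement/size/etc. for this layer, conditioned on what is already rendered.
        attrs = predict_attributes(layer, context_image=canvas)
        composed.append((layer, attrs))
        canvas = render(composed)                  # context for the next layer
    return composed, canvas
```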
zh

[CV-5] A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization

【速读】: 该论文旨在解决图像伪造定位(Image Forgery Localization)中传统方法仅生成二值伪造掩码(binary forgery mask)的局限性问题。传统方法将伪造区域进行二值分割,但无法解释模型为何定位特定区域,且将所有伪造像素同等对待,难以识别最明显的伪造部分。为解决这一问题,论文提出了一种基于显著区域的解释方法,并构建了一个多模态伪造追踪(Multi-Modal Tramper Tracing, MMTT)数据集,包含通过深度伪造技术(deepfake)处理的人脸图像及其手工标注的可解释文本。通过该数据集,论文开发了ForgeryTalker架构,该架构结合伪造提示网络(forgery prompter network)和多模态大语言模型(multimodal large language model),实现了伪造定位与解释的双重目标。实验结果表明,该模型在MMTT数据集上表现优异,且数据集、代码和预训练模型将公开以促进进一步研究。

链接: https://arxiv.org/abs/2412.19685
作者: Jingchun Lian,Lingyu Liu,Yaxiong Wang,Yujiao Wu,Li Zhu,Zhedong Zheng
机构: 未知
关键词: identifying tampered pixels, significant advancements, centers on identifying, identifying tampered, forgery
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Image forgery localization, which centers on identifying tampered pixels within an image, has seen significant advancements. Traditional approaches often model this challenge as a variant of image segmentation, treating the binary segmentation of forged areas as the end product. We argue that the basic binary forgery mask is inadequate for explaining model predictions. It doesn’t clarify why the model pinpoints certain areas and treats all forged pixels the same, making it hard to spot the most fake-looking parts. In this study, we mitigate the aforementioned limitations by generating salient region-focused interpretation for the forgery images. To support this, we craft a Multi-Modal Tramper Tracing (MMTT) dataset, comprising facial images manipulated using deepfake techniques and paired with manual, interpretable textual annotations. To harvest high-quality annotation, annotators are instructed to meticulously observe the manipulated images and articulate the typical characteristics of the forgery regions. Subsequently, we collect a dataset of 128,303 image-text pairs. Leveraging the MMTT dataset, we develop ForgeryTalker, an architecture designed for concurrent forgery localization and interpretation. ForgeryTalker first trains a forgery prompter network to identify the pivotal clues within the explanatory text. Subsequently, the region prompter is incorporated into multimodal large language model for finetuning to achieve the dual goals of localization and interpretation. Extensive experiments conducted on the MMTT dataset verify the superior performance of our proposed model. The dataset, code as well as pretrained checkpoints will be made publicly available to facilitate further research and ensure the reproducibility of our results.
zh

[CV-6] A Hybrid Technique for Plant Disease Identification and Localisation in Real-time

【速读】: 该论文旨在解决植物病害实时检测中的性能问题,特别是传统图像处理和基于深度神经网络(DNN)方法在计算限制和广泛植物病害特征下的表现不佳。论文提出了一种基于图像四叉树分解(Quad-Tree decomposition)和特征学习同步进行的新技术,用于植物病害的识别和定位。该算法的关键创新在于结合了传统图像处理方法和DNN模型的优势,在不同尺度上实现更快的推理速度,同时在高分辨率图像上显著提高了准确性并降低了计算负载。这使得该算法非常适合部署在远程操作的图像采集和病害检测系统中,如无人机和机器人,用于大规模农田的病害监测。实验结果表明,该算法在土豆和番茄作物的四种病害类别上取得了约0.80的F1分数。

链接: https://arxiv.org/abs/2412.19682
作者: Mahendra Kumar Gohil,Anirudha Bhattacharjee,Rwik Rana,Kishan Lal,Samir Kumar Biswas,Nachiketa Tiwari,Bishakh Bhattacharya
机构: 未知
关键词: Deep Neural Networks, past decade, visual data, Deep Neural, Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Over the past decade, several image-processing methods and algorithms have been proposed for identifying plant diseases based on visual data. DNN (Deep Neural Networks) have recently become popular for this task. Both traditional image processing and DNN-based methods encounter significant performance issues in real-time detection owing to computational limitations and a broad spectrum of plant disease features. This article proposes a novel technique for identifying and localising plant disease based on the Quad-Tree decomposition of an image and feature learning simultaneously. The proposed algorithm significantly improves accuracy and faster convergence in high-resolution images with relatively low computational load. Hence it is ideal for deploying the algorithm in a standalone processor in a remotely operated image acquisition and disease detection system, ideally mounted on drones and robots working on large agricultural fields. The technique proposed in this article is hybrid as it exploits the advantages of traditional image processing methods and DNN-based models at different scales, resulting in faster inference. The F1 score is approximately 0.80 for four disease classes corresponding to potato and tomato crops.
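A generic quad-tree split of the kind the abstract relies on: recursively divide the image until blocks are small or homogeneous, so a heavier classifier only needs to run on the "interesting" blocks. The variance criterion and thresholds are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def quadtree_regions(img, min_size=32, var_thresh=400.0):
    """Return (y, x, h, w) boxes from a recursive quad-tree split of a grayscale image."""
    boxes = []

    def split(y, x, h, w):
        block = img[y:y + h, x:x + w]
        # Stop splitting when the block is small or visually homogeneous.
        if min(h, w) <= min_size or block.var() < var_thresh:
            boxes.append((y, x, h, w))
            return
        h2, w2 = h // 2, w // 2
        split(y, x, h2, w2)
        split(y, x + w2, h2, w - w2)
        split(y + h2, x, h - h2, w2)
        split(y + h2, x + w2, h - h2, w - w2)

    split(0, 0, img.shape[0], img.shape[1])
    return boxes
```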
zh

[CV-7] Optimizing Local-Global Dependencies for Accurate 3D Human Pose Estimation

【速读】: 该论文旨在解决基于Transformer的3D人体姿态估计方法在捕捉细粒度局部细节方面的不足。尽管Transformer在建模长程依赖关系方面表现出色,但其全局注意力机制难以有效捕捉对精确姿态估计至关重要的局部细节。为此,论文提出了SSR-STF,一种双流模型,通过将局部特征与全局依赖关系有效整合来提升3D人体姿态估计的精度。其关键解决方案是引入了SSRFormer模块,该模块采用骨架选择性精炼注意力(SSRA)机制,专门用于捕捉人体姿态序列中的细粒度局部依赖关系,从而补充了Transformer建模的全局依赖关系。通过自适应地融合这两种特征流,SSR-STF能够更好地学习人体姿态的底层结构,克服了传统方法在局部特征提取方面的局限性。实验结果表明,SSR-STF在Human3.6M和MPI-INF-3DHP数据集上均达到了最先进的性能,并在下游任务如人体网格恢复中也表现出色。

链接: https://arxiv.org/abs/2412.19676
作者: Guangsheng Xu,Guoyi Zhang,Lejia Ye,Shuwei Gan,Xiaohu Zhang,Xia Yang
机构: 未知
关键词: recently achieved significant, achieved significant success, human pose estimation, pose estimation, accurate pose estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformer-based methods have recently achieved significant success in 3D human pose estimation, owing to their strong ability to model long-range dependencies. However, relying solely on the global attention mechanism is insufficient for capturing the fine-grained local details, which are crucial for accurate pose estimation. To address this, we propose SSR-STF, a dual-stream model that effectively integrates local features with global dependencies to enhance 3D human pose estimation. Specifically, we introduce SSRFormer, a simple yet effective module that employs the skeleton selective refine attention (SSRA) mechanism to capture fine-grained local dependencies in human pose sequences, complementing the global dependencies modeled by the Transformer. By adaptively fusing these two feature streams, SSR-STF can better learn the underlying structure of human poses, overcoming the limitations of traditional methods in local feature extraction. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that SSR-STF achieves state-of-the-art performance, with P1 errors of 37.4 mm and 13.2 mm respectively, outperforming existing methods in both accuracy and generalization. Furthermore, the motion representations learned by our model prove effective in downstream tasks such as human mesh recovery. Codes are available at this https URL.
zh

[CV-8] CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLM s

【速读】: 该论文旨在解决现有计算机辅助设计(CAD)模型构建方法在推断准确的三维空间位置和方向时存在的困难,这些问题导致在确定几何构造的空间起始点和挤出方向时出现不准确性。现有的方法依赖于难以获取且存储成本高的潜在向量或点云数据,而最近的多模态大语言模型(MLLMs)虽然能够通过自然语言指令和图像进行CAD模型构建,但在空间推理方面仍存在不足。论文提出的解决方案是CAD-GPT,这是一种结合了空间推理增强的MLLM的CAD合成方法,能够以单张图像或文本描述作为输入。其关键创新在于引入了三维建模空间机制(3D Modeling Spatial Mechanism),该机制通过专门的空间展开机制将三维空间位置和三维草图平面旋转角度映射到一维语言特征空间,同时将二维草图坐标离散化到适当的平面空间,从而实现空间起始位置、草图方向和二维草图坐标平移的精确确定。实验结果表明,CAD-GPT在CAD模型合成方面在定量和定性上均优于现有最先进的方法。

链接: https://arxiv.org/abs/2412.19663
作者: Siyu Wang,Cailian Chen,Xinyi Le,Qimin Xu,Lei Xu,Yanzhou Zhang,Jie Yang
机构: 未知
关键词: Computer-aided design, design processes, Multimodal Large Language, CAD model, significantly enhances
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Computer-aided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain and costly to store. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle with inferring accurate 3D spatial location and orientation, leading to inaccuracies in determining the spatial 3D starting points and extrusion directions for constructing geometries. This work introduces CAD-GPT, a CAD synthesis method with spatial reasoning-enhanced MLLM that takes either a single image or a textual description as input. To achieve precise spatial inference, our approach introduces a 3D Modeling Spatial Mechanism. This method maps 3D spatial positions and 3D sketch plane rotation angles into a 1D linguistic feature space using a specialized spatial unfolding mechanism, while discretizing 2D sketch coordinates into an appropriate planar space to enable precise determination of spatial starting position, sketch orientation, and 2D sketch coordinate translations. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.
zh

[CV-9] Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP

【速读】: 该论文旨在解决弱监督语义分割(Weakly Supervised Semantic Segmentation, WSSS)中由于文本和视觉模态之间的模态差异(modality gap)导致的文本与区域特征不对齐的问题。现有方法通过优化输入文本提示(text prompts)来改善图像与文本的对齐,但这些方法未能有效建立文本原型(text prototypes)与像素级视觉特征之间的紧密对应关系。论文的理论分析表明,模态差异导致文本和区域特征的错位,且仅通过最小化对比损失(contrast loss)无法充分减少这种差异。为解决这一问题,论文提出了视觉原型学习(Vision Prototype Learning, VPL)框架,其核心在于借助文本原型在视觉空间中学习类别特定的视觉原型(vision prototypes),以捕捉高质量的定位图(localization maps)。此外,论文还提出了区域语义对比模块(regional semantic contrast module),通过对比区域嵌入(regions embedding)与相应原型,实现更全面和鲁棒的特征学习。实验结果表明,该框架在两个基准数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2412.19650
作者: Zhongxing Xu,Feilong Tang,Zhe Chen,Yingxue Su,Zhiyi Zhao,Ge Zhang,Jionglong Su,Zongyuan Ge
机构: 未知
关键词: Contrastive Language-Image Pre-training, Supervised Semantic Segmentation, Weakly Supervised Semantic, Weakly Supervised, research powerful cross-modal
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) leverages its powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text, by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between text and vision spaces, the text prototypes employed by these methods have not effectively established a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing contrast loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework, by introducing more representative vision prototypes. The core of this framework is to learn class-specific vision prototypes in vision space with the help of text prototypes, for capturing high-quality localization maps. Moreover, we propose a regional semantic contrast module that contrasts region embeddings with corresponding prototypes, leading to more comprehensive and robust feature learning. Experimental results show that our proposed framework achieves state-of-the-art performance on two benchmark datasets.
zh

[CV-10] Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues ICASSP’25

【速读】: 该论文旨在解决视觉-语言跟踪(Vision-Language Tracking, VLT)中文本与图像数据不平衡的问题,这一问题限制了VLT方法在有效对齐两种模态(modalities)方面的能力。为解决这一问题,论文提出了一种名为CTVLT的即插即用方法,该方法利用基础定位模型(foundation grounding models)强大的文本-图像对齐能力,将文本线索转换为可解释的视觉热图(visual heatmaps),从而更易于跟踪器处理。具体而言,CTVLT通过文本线索映射模块将文本线索转换为目标分布热图,直观地表示文本描述的位置,并通过热图引导模块将这些热图与搜索图像融合,以更有效地指导跟踪。实验结果表明,该方法在主流基准测试中达到了最先进的性能,验证了其在增强VLT中的实用性。

链接: https://arxiv.org/abs/2412.19648
作者: X. Feng,D. Zhang,S. Hu,X. Li,M. Wu,J. Zhang,X. Chen,K. Huang
机构: 未知
关键词: aims to localize, language description, video sequences, template and language, Vision-Language Tracking
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by ICASSP '25! Code: this https URL

点击查看摘要

Abstract:Vision-Language Tracking (VLT) aims to localize a target in video sequences using a visual template and language description. While textual cues enhance tracking potential, current datasets typically contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively. To address this imbalance, we propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models. CTVLT converts textual cues into interpretable visual heatmaps, which are easier for trackers to process. Specifically, we design a textual cue mapping module that transforms textual cues into target distribution heatmaps, visually representing the location described by the text. Additionally, the heatmap guidance module fuses these heatmaps with the search image to guide tracking more effectively. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our approach, achieving state-of-the-art performance and validating the utility of our method for enhanced VLT.
zh

[CV-11] Chimera: A Block-Based Neural Architecture Search Framework for Event-Based Object Detection

【速读】: 该论文旨在解决事件相机(Event-based cameras)在目标检测任务中的数据处理问题,特别是如何将RGB域的处理方法有效迁移到事件域。事件相机具有高速鲁棒性和低功耗等优势,但其数据特性与传统RGB图像不同,因此需要专门的处理方法。论文提出的解决方案是Chimera,一个基于块(Block-Based)的神经架构搜索(Neural Architecture Search, NAS)框架,专门为事件目标检测设计。Chimera通过构建包含注意力块(Attention blocks)、卷积(Convolutions)、状态空间模型(State Space Models)和基于MLP-mixer架构的宏块(macroblocks)设计空间,实现了局部与全局处理能力之间的平衡,并提供了不同复杂度的选择。实验结果表明,Chimera在PErson Detection in Robotics (PEDRo)数据集上达到了与当前最先进模型相当的性能,同时平均参数减少了1.6倍。

链接: https://arxiv.org/abs/2412.19646
作者: Diego A. Silva,Ahmed Elsheikh,Kamilya Smagulova,Mohammed E. Fouda,Ahmed M. Eltawil
机构: 未知
关键词: low power consumption, Established Deep Learning, human eye, offering advantages, power consumption
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Event-based cameras are sensors that simulate the human eye, offering advantages such as high-speed robustness and low power consumption. Established Deep Learning techniques have shown effectiveness in processing event data. Chimera is a Block-Based Neural Architecture Search (NAS) framework specifically designed for Event-Based Object Detection, aiming to create a systematic approach for adapting RGB-domain processing methods to the event domain. The Chimera design space is constructed from various macroblocks, including Attention blocks, Convolutions, State Space Models, and MLP-mixer-based architectures, which provide a valuable trade-off between local and global processing capabilities, as well as varying levels of complexity. The results on the PErson Detection in Robotics (PEDRo) dataset demonstrated performance levels comparable to leading state-of-the-art models, alongside an average parameter reduction of 1.6 times.
zh

[CV-12] VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

【速读】: 该论文旨在解决零样本定制视频生成(zero-shot customized video generation)中现有方法在保持主题外观一致性方面的不足。现有方法通常依赖额外的模型来提取和注入参考主题特征,假设仅靠视频扩散模型(Video Diffusion Model, VDM)无法实现高质量的零样本定制视频生成。然而,这些方法由于特征提取和注入技术的次优性,往往难以保持主题外观的一致性。论文的关键解决方案在于揭示并利用VDM固有的特征提取和注入能力,提出了一种新颖的框架。具体而言,该框架通过直接将参考图像输入VDM并利用其内在的特征提取过程,获得细粒度特征并与VDM的预训练知识高度对齐;同时,通过VDM内的空间自注意力机制,设计了主题特征与生成内容之间的双向交互,确保VDM在保持生成内容多样性的同时,具有更好的主题保真度。实验结果表明,该框架在定制人类和物体视频生成任务中均表现出显著的有效性。

链接: https://arxiv.org/abs/2412.19645
作者: Tao Wu,Yong Zhang,Xiaodong Cun,Zhongang Qi,Junfu Pu,Huanzhang Dou,Guangcong Zheng,Ying Shan,Xi Li
机构: 未知
关键词: substantial application potential, Zero-shot customized video, gained significant attention, customized video generation, significant attention due
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM’s inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also significantly aligns with VDM’s pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring that VDM has better subject fidelity while maintaining the diversity of the generated content. Experiments on both customized human and object video generation validate the effectiveness of our framework.
zh

[CV-13] ReNeg: Learning Negative Embedding with Reward Guidance

【速读】: 该论文旨在解决文本到图像(Text-to-Image, T2I)生成应用中负嵌入(negative embeddings)优化的问题。传统方法依赖于用户定义的负提示(negative prompts),虽然功能上可行,但并非最优。论文提出了一种名为ReNeg的端到端方法,通过学习改进的负嵌入来提升生成质量。其关键解决方案包括:1)引入奖励反馈学习框架,通过奖励模型(Reward model)指导负嵌入的学习;2)将无分类器引导(Classifier-Free Guidance, CFG)从推理阶段扩展到训练过程中,从而有效学习负嵌入;3)提出了两种策略,分别用于学习全局和样本级别的负嵌入。实验表明,该方法显著优于空文本和手工设计的负嵌入,在人类偏好对齐方面取得了显著提升,并且在同一文本嵌入空间中学到的负嵌入展示了强大的泛化能力,能够无缝迁移到其他模型如ControlNet、ZeroScope和VideoCrafter2中,实现一致的性能提升。

链接: https://arxiv.org/abs/2412.19637
作者: Xiaomin Li,Yixuan Liu,Takashi Isobe,Xu Jia,Qinpeng Cui,Dong Zhou,Dong Li,You He,Huchuan Lu,Zhongdao Wang,Emad Barsoum
机构: 未知
关键词: enhancing generation quality, negative embeddings, generation applications, generation quality, negative
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In text-to-image (T2I) generation applications, negative embeddings have proven to be a simple yet effective approach for enhancing generation quality. Typically, these negative embeddings are derived from user-defined negative prompts, which, while being functional, are not necessarily optimal. In this paper, we introduce ReNeg, an end-to-end method designed to learn improved Negative embeddings guided by a Reward model. We employ a reward feedback learning framework and integrate classifier-free guidance (CFG) into the training process, which was previously utilized only during inference, thus enabling the effective learning of negative embeddings. We also propose two strategies for learning both global and per-sample negative embeddings. Extensive experiments show that the learned negative embedding significantly outperforms null-text and handcrafted counterparts, achieving substantial improvements in human preference alignment. Additionally, the negative embedding learned within the same text embedding space exhibits strong generalization capabilities. For example, using the same CLIP text encoder, the negative embedding learned on SD1.5 can be seamlessly transferred to text-to-image or even text-to-video models such as ControlNet, ZeroScope, and VideoCrafter2, resulting in consistent performance improvements across the board.
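For context, the classifier-free guidance step that the learned negative embedding plugs into is a one-liner; in this reading, ReNeg replaces the prediction conditioned on a hand-written negative prompt (or null text) with one conditioned on a reward-optimized embedding. The function below is a generic CFG sketch, not the paper's training code.

```python
def cfg_noise_prediction(eps_cond, eps_neg, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate away from the negative-conditioned prediction
    toward the prompt-conditioned one. `eps_cond` and `eps_neg` are the denoiser outputs
    conditioned on the text embedding and on the (learned) negative embedding respectively."""
    return eps_neg + guidance_scale * (eps_cond - eps_neg)
```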
zh

[CV-14] RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations

【速读】: 该论文旨在解决视觉变换器(Vision Transformers, ViTs)中全局建模能力提升所带来的参数数量和计算复杂度(FLOPs)随卷积核尺寸呈二次方增长的问题。这种增长导致了显著的效率和优化挑战。为解决这一问题,论文提出了RecConv,一种递归分解策略,通过使用小卷积核高效构建多频率表示。RecConv的关键在于建立了参数增长与分解级别之间的线性关系,从而在保持计算复杂度恒定的情况下,实现了有效感受野(Effective Receptive Field, ERF)的扩展。具体而言,RecConv仅需参数扩展 (\ell+2) 倍,最大计算复杂度增加 (5/3) 倍,而传统的深度卷积和标准卷积则呈指数增长((4^\ell))。这一创新为设计跨模态高效紧凑网络提供了新的途径。
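
下面是按“递归分解 + 小核卷积”思路写的一个极简示意模块(假设实现,非官方代码):每一级先下采样得到更低频的表示,用小核深度卷积处理,再逐级上采样融合,从而以随层数线性增长的参数扩大有效感受野。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecConvSketch(nn.Module):
    """递归小核卷积的极简示意:基核大小 k,分解层数 levels(仅供理解)。"""
    def __init__(self, channels, k=3, levels=2):
        super().__init__()
        self.levels = levels
        # 每一级一个小核深度卷积:参数量随 levels 线性增长
        self.convs = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for _ in range(levels + 1)
        ])

    def forward(self, x):
        feats = [x]
        for _ in range(self.levels):                      # 逐级下采样得到多频率表示
            feats.append(F.avg_pool2d(feats[-1], 2))
        out = self.convs[-1](feats[-1])                   # 先处理最低分辨率(最低频)分量
        for i in range(self.levels - 1, -1, -1):          # 逐级上采样并与当前尺度融合
            out = F.interpolate(out, size=feats[i].shape[-2:], mode="nearest")
            out = self.convs[i](feats[i]) + out
        return out

x = torch.randn(1, 32, 64, 64)
print(RecConvSketch(32, k=3, levels=2)(x).shape)          # torch.Size([1, 32, 64, 64])
```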

链接: https://arxiv.org/abs/2412.19628
作者: Mingshu Zhao,Yi Luo,Yong Ouyang
机构: 未知
关键词: global modeling capabilities, prompting widespread integration, Recent advances, effective receptive field, vision transformers
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report;

点击查看摘要

Abstract:Recent advances in vision transformers (ViTs) have demonstrated the advantage of global modeling capabilities, prompting widespread integration of large-kernel convolutions for enlarging the effective receptive field (ERF). However, the quadratic scaling of parameter count and computational complexity (FLOPs) with respect to kernel size poses significant efficiency and optimization challenges. This paper introduces RecConv, a recursive decomposition strategy that efficiently constructs multi-frequency representations using small-kernel convolutions. RecConv establishes a linear relationship between parameter growth and decomposing levels, which determines the effective kernel size (k\times 2^\ell) for a base kernel (k) and (\ell) levels of decomposition, while maintaining constant FLOPs regardless of the ERF expansion. Specifically, RecConv achieves a parameter expansion of only (\ell+2) times and a maximum FLOPs increase of (5/3) times, compared to the exponential growth ((4^\ell)) of standard and depthwise convolutions. RecNeXt-M3 outperforms RepViT-M1.1 by 1.9 AP^box on COCO with similar FLOPs. This innovation provides a promising avenue towards designing efficient and compact networks across various modalities. Codes and models can be found at this https URL.
zh

[CV-15] Enhancing Fine-grained Image Classification through Attentive Batch Training

【速读】: 该论文旨在解决细粒度图像分类(Fine-grained image classification)中的挑战,即如何精确区分视觉上相似的对象类别。为了解决这一问题,论文提出了三个关键创新:1) 残差关系注意力模块(Residual Relationship Attention, RRA),该模块利用训练批次中图像之间的关系,有效整合批次图像的视觉特征向量;2) 关系位置编码(Relationship Position Encoding, RPE),该技术编码批次中原始图像之间的关系位置,有效保留批次内图像之间的关系信息;3) 关系批次集成框架(Relationship Batch Integration, RBI),该框架结合RRA和RPE,能够识别在单一图像中可能难以捕捉的关键视觉特征。通过大量实验,该方法在CUB200-2011和Stanford Dog数据集上分别实现了平均2.78%和3.83%的准确率提升,并在Stanford Dog数据集上达到了95.79%的state-of-the-art结果。此外,该方法在Tiny-Imagenet数据集上也取得了93.71%的state-of-the-art结果,展示了其在通用图像分类中的潜力。该方法可作为插件式优化模块,易于集成到不同网络中。
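
下面用 PyTorch 自带的多头注意力给出“批内图像关系注意力 + 残差”的一个极简草图(假设的简化实现,省略了论文中的关系位置编码 RPE):

```python
import torch
import torch.nn as nn

class BatchRelationAttention(nn.Module):
    """把一个 batch 的图像特征向量视作一条序列,在批维度上做自注意力并加残差(示意)。"""
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                      # feats: (B, dim),骨干网络输出的全局特征
        seq = feats.unsqueeze(0)                   # (1, B, dim):整个 batch 作为一条序列
        rel, _ = self.attn(seq, seq, seq)          # 批内图像两两交互
        return self.norm(feats + rel.squeeze(0))   # 残差连接,保留单图特征

feats = torch.randn(16, 512)                       # 假设 batch 内 16 张图的特征
print(BatchRelationAttention()(feats).shape)       # torch.Size([16, 512])
```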

链接: https://arxiv.org/abs/2412.19606
作者: Duy M. Le,Bao Q. Bui,Anh Tran,Cong Tran,Cuong Pham
机构: 未知
关键词: requires precise differentiation, similar object categories, visually similar object, Residual Relationship Attention, Relationship Position Encoding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained image classification, which is a challenging task in computer vision, requires precise differentiation among visually similar object categories. In this paper, we propose 1) a novel module called Residual Relationship Attention (RRA) that leverages the relationships between images within each training batch to effectively integrate visual feature vectors of batch images and 2) a novel technique called Relationship Position Encoding (RPE), which encodes the positions of relationships between original images in a batch and effectively preserves the relationship information between images within the batch. Additionally, we design a novel framework, namely Relationship Batch Integration (RBI), which utilizes RRA in conjunction with RPE, allowing the discernment of vital visual features that may remain elusive when examining a singular image representative of a particular class. Through extensive experiments, our proposed method demonstrates significant improvements in the accuracy of different fine-grained classifiers, with an average increase of (+2.78%) and (+3.83%) on the CUB200-2011 and Stanford Dog datasets, respectively, while achieving a state-of-the-art result (95.79%) on the Stanford Dog dataset. Despite not achieving the same level of improvement as in fine-grained image classification, our method still demonstrates its prowess on general image classification by attaining a state-of-the-art result of (93.71%) on the Tiny-Imagenet dataset. Furthermore, our method serves as a plug-in refinement module and can be easily integrated into different networks.
zh

[CV-16] DAS3R: Dynamics-Aware Gaussian Splatting for Static Scene Reconstruction

【速读】: 该论文旨在解决从日常视频中进行场景分解和静态背景重建的问题。现有的方法在处理动态物体占据场景大部分区域时表现不佳,且通常依赖于相机姿态输入或基于SLAM(Simultaneous Localization and Mapping)方法的点云数据。论文提出的解决方案DAS3R(Dynamics-Aware Gaussian Splatting for Static Scene Reconstruction)通过整合训练的运动掩码(motion masks)并将静态场景建模为高斯泼溅(Gaussian splats),结合动态感知优化(dynamics-aware optimization),实现了更精确的背景重建。DAS3R在复杂运动场景中表现出更强的鲁棒性,且无需相机姿态输入或SLAM生成的点云数据。实验表明,DAS3R在DAVIS和Sintel数据集上相比现有方法在PSNR(Peak Signal-to-Noise Ratio)指标上提升了超过2 dB,展示了其优越的性能和鲁棒性。

链接: https://arxiv.org/abs/2412.19584
作者: Kai Xu,Tze Ho Elden Tse,Jizong Peng,Angela Yao
机构: 未知
关键词: static background reconstruction, Static Scene Reconstruction, background reconstruction, Dynamics-Aware Gaussian Splatting, accurate background reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a novel framework for scene decomposition and static background reconstruction from everyday videos. By integrating the trained motion masks and modeling the static scene as Gaussian splats with dynamics-aware optimization, our method achieves more accurate background reconstruction results than previous works. Our proposed method is termed DAS3R, an abbreviation for Dynamics-Aware Gaussian Splatting for Static Scene Reconstruction. Compared to existing methods, DAS3R is more robust in complex motion scenarios, capable of handling videos where dynamic objects occupy a significant portion of the scene, and does not require camera pose inputs or point cloud data from SLAM-based methods. We compared DAS3R against recent distractor-free approaches on the DAVIS and Sintel datasets; DAS3R demonstrates enhanced performance and robustness with a margin of more than 2 dB in PSNR. The project’s webpage can be accessed via this https URL.
zh

[CV-17] Reinforced Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

【速读】: 该论文旨在解决音视频解析(Audio-Visual Video Parsing, AVVP)任务中的标签去噪问题,特别是在音频或视觉模态中可能仅包含单一事件标签且仅有整体视频标签可用的情况下,如何精确识别音视频事件标签及其时间边界。现有方法通常将标签去噪作为独立的预处理步骤,导致去噪过程与AVVP任务脱节。为解决这一问题,论文提出了一种基于联合强化学习的标签去噪方法(Reinforcement Learning-based Label Denoising, RLLD),通过联合优化策略同时训练标签去噪和视频解析模型。该方案的关键在于引入了一种新颖的AVVP验证机制和软互奖励反馈机制,直接指导标签去噪策略的学习,从而显著提升了AVVP任务的性能。

链接: https://arxiv.org/abs/2412.19563
作者: Yongbiao Gao,Xiangcheng Sun,Guohua Lv,Deng Yu,Sijiu Niu
机构: 未知
关键词: precise temporal boundaries, visual event labels, Audio-visual video parsing, label denoising, Audio-visual video
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-visual video parsing (AVVP) aims to recognize audio and visual event labels with precise temporal boundaries, which is quite challenging since audio or visual modality might include only one event label with only the overall video labels available. Existing label denoising models often treat the denoising process as a separate preprocessing step, leading to a disconnect between label denoising and AVVP tasks. To bridge this gap, we present a novel joint reinforcement learning-based label denoising approach (RLLD). This approach enables simultaneous training of both label denoising and video parsing models through a joint optimization strategy. We introduce a novel AVVP-validation and soft inter-reward feedback mechanism that directly guides the learning of label denoising policy. Extensive experiments on AVVP tasks demonstrate the superior performance of our proposed method compared to label denoising techniques. Furthermore, by incorporating our label denoising method into other AVVP models, we find that it can further enhance parsing results.
zh

[CV-18] Structural Similarity in Deep Features: Image Quality Assessment Robust to Geometrically Disparate Reference

【速读】: 该论文旨在解决图像质量评估(Image Quality Assessment, IQA)中存在的几何形变问题。传统的基于对齐参考图像的IQA方法(Aligned-Reference IQA, AR-IQA)假设参考图像和测试图像的像素完全对齐,无法有效处理实际应用中存在的各种几何形变问题。尽管已有研究尝试解决几何形变参考图像的IQA问题(Geometrically-Disparate-Reference IQA, GDR-IQA),但这些方法通常是任务依赖的,例如针对图像超分辨率或重定向的专门设计,或假设几何形变较小,可以通过平移鲁棒滤波器或显式图像配准来应对。本文提出了一种统一的、无需训练的深度结构相似性(Deep Structural Similarity, DeepSSIM)方法,通过评估深度特征的结构相似性,并结合注意力校准策略来缓解注意力偏差,从而在单一框架内解决上述问题。该方法在AR-IQA数据集上达到了最先进的性能,并在各种GDR-IQA测试案例中表现出强大的鲁棒性。此外,DeepSSIM还被证明可作为图像超分辨率、增强和修复训练的优化工具,显示出更广泛的通用性。
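
下面给出“在深度特征上度量结构相似性”这一思路的极简草图(做了大量简化,并非论文的 DeepSSIM 原始定义):先用卷积骨干提取特征,再以逐通道去均值后的归一化相关衡量两幅图的结构一致性;示例中的骨干为随机初始化的 VGG16 片段,实际使用时应加载预训练权重。

```python
import torch
import torchvision.models as models

# 仅作演示:随机初始化的 VGG16 前几层,实际应加载预训练权重
backbone = models.vgg16(weights=None).features[:16].eval()

@torch.no_grad()
def deep_structural_similarity(img_a, img_b, eps=1e-6):
    """img_*: (1, 3, H, W)。返回两幅图深度特征的平均结构相似度(示意实现)。"""
    fa, fb = backbone(img_a), backbone(img_b)            # (1, C, h, w)
    fa, fb = fa.flatten(2), fb.flatten(2)                # (1, C, h*w)
    fa = fa - fa.mean(-1, keepdim=True)                  # 逐通道去均值,只保留“结构”
    fb = fb - fb.mean(-1, keepdim=True)
    cov = (fa * fb).mean(-1)
    denom = fa.pow(2).mean(-1).sqrt() * fb.pow(2).mean(-1).sqrt() + eps
    return (cov / denom).mean().item()                   # 逐通道相关系数取平均

a, b = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
print(deep_structural_similarity(a, a))   # 同一幅图:相似度最高(理想情况下为 1)
print(deep_structural_similarity(a, b))   # 不相关的图:相似度明显更低
```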

链接: https://arxiv.org/abs/2412.19553
作者: Keke Zhang,Weiling Chen,Tiesong Zhao,Zhou Wang
机构: 未知
关键词: Image Quality Assessment, Quality Assessment, computer vision tasks, evaluating computer vision, Image Quality
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Image Quality Assessment (IQA) with references plays an important role in optimizing and evaluating computer vision tasks. Traditional methods assume that all pixels of the reference and test images are fully aligned. Such Aligned-Reference IQA (AR-IQA) approaches fail to address many real-world problems with various geometric deformations between the two images. Although significant effort has been made to attack the Geometrically-Disparate-Reference IQA (GDR-IQA) problem, it has been addressed in a task-dependent fashion, for example, by dedicated designs for image super-resolution and retargeting, or by assuming the geometric distortions to be small enough that they can be countered by translation-robust filters or by explicit image registration. Here we rethink this problem and propose a unified, non-training-based Deep Structural Similarity (DeepSSIM) approach to address the above problems in a single framework, which assesses structural similarity of deep features in a simple but efficient way and uses an attention calibration strategy to alleviate attention deviation. The proposed method, without application-specific design, achieves state-of-the-art performance on AR-IQA datasets and meanwhile shows strong robustness to various GDR-IQA test cases. Interestingly, our test also shows the effectiveness of DeepSSIM as an optimization tool for training image super-resolution, enhancement and restoration, implying an even wider generalizability. (Source code will be made public after the review is completed.)
zh

[CV-19] Unprejudiced Training Auxiliary Tasks Makes Primary Better: A Multi-Task Learning Perspective

【速读】: 该论文旨在解决多任务学习(Multi-task Learning)中辅助任务(auxiliary tasks)训练不足的问题。传统方法通常将辅助任务视为次要任务,在训练过程中赋予其较小的损失权重,导致辅助任务未能充分训练,从而无法有效支持主任务(primary task)的性能提升。为解决这一问题,论文提出了一种基于不确定性(uncertainty-based)的公平学习方法,确保所有任务在训练过程中得到平衡的优化。此外,该方法在反向传播(backpropagation)过程中同时考虑梯度和不确定性信息,以进一步提升主任务的性能。实验结果表明,该方法在性能上达到或超越了现有最先进方法,且其权重策略在增强主任务性能方面具有有效性和鲁棒性,即使辅助任务的伪标签(pseudo labels)存在噪声。
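
论文所述“基于不确定性的公平多任务加权”与经典的同方差不确定性加权思路一致;下面给出该类加权的一个常见形式作为参考草图(假设实现,非论文原式):为每个任务学习一个 log 方差 s_i,总损失为 Σ exp(-s_i)·L_i + s_i。

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """为每个任务学习一个 log 方差 s_i,总损失 = Σ exp(-s_i)·L_i + s_i(示意)。"""
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):                 # task_losses: 各任务的标量损失列表
        total = 0.0
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

weighter = UncertaintyWeighting(num_tasks=3)        # 例如 1 个主任务 + 2 个辅助任务
primary, aux1, aux2 = torch.tensor(1.2), torch.tensor(0.7), torch.tensor(2.1)
print(weighter([primary, aux1, aux2]))              # 可反向传播,自动平衡各任务权重
```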

链接: https://arxiv.org/abs/2412.19547
作者: Yuanze Li,Chun-Mei Feng,Qilong Wang,Guanglei Yang,Wangmeng Zuo
机构: 未知
关键词: primary task, leverage knowledge, knowledge from relative, Human, primary
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human beings can leverage knowledge from related tasks to improve learning on a primary task. Similarly, multi-task learning methods suggest using auxiliary tasks to enhance a neural network’s performance on a specific primary task. However, previous methods often select auxiliary tasks carefully but treat them as secondary during training. The weights assigned to auxiliary losses are typically smaller than the primary loss weight, leading to insufficient training on auxiliary tasks and ultimately failing to support the main task effectively. To address this issue, we propose an uncertainty-based impartial learning method that ensures balanced training across all tasks. Additionally, we consider both gradients and uncertainty information during backpropagation to further improve performance on the primary task. Extensive experiments show that our method achieves performance comparable to or better than state-of-the-art approaches. Moreover, our weighting strategy is effective and robust in enhancing the performance of the primary task regardless of noise in the auxiliary tasks’ pseudo labels.
zh

[CV-20] Diverse Rare Sample Generation with Pretrained GANs

【速读】: 该论文旨在解决深度生成模型(Deep Generative Models)在生成低密度区域中的稀有样本时面临的挑战,这些问题主要源于训练数据集的稀缺性和模式崩溃(Mode Collapse)现象。尽管现有方法在提高生成样本的保真度方面取得了一定进展,但它们往往忽视了稀有和新颖样本,导致生成样本的多样性和覆盖范围下降。为此,论文提出了一种基于预训练生成对抗网络(GANs)的高分辨率图像数据集中生成多样化稀有样本的新方法。该解决方案的关键在于采用多目标框架下的梯度优化策略对潜在向量进行优化,并利用归一化流(Normalizing Flows)在特征空间上进行密度估计。通过这些技术手段,该方法能够生成具有可控稀有度、多样性和与参考图像相似性的多样化稀有图像,且无需对预训练的GANs进行重新训练或微调。
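
下面给出“在预训练 GAN 潜空间中做基于梯度的多目标优化”这一思路的极简草图:生成器 G、特征提取器 feat 与 log_density 均为假设的占位组件,真实实现需替换为预训练 GAN 以及在特征空间上训练好的归一化流。

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 512))  # 占位“生成器”
feat = nn.Linear(512, 64)                                               # 占位特征提取器

def log_density(f):
    # 占位密度估计:真实实现应为特征空间上训练好的归一化流给出的 log p(f)
    return -0.5 * (f ** 2).sum(dim=-1)

z = torch.randn(8, 128, requires_grad=True)        # 一组待优化的潜变量
z_ref = torch.randn(1, 128)                         # 参考样本的潜变量(假设已知)
opt = torch.optim.Adam([z], lr=0.05)
w_rare, w_div, w_sim = 1.0, 0.5, 0.1                # 稀有度 / 多样性 / 相似度的可控权重

for it in range(200):
    f = feat(G(z))
    rarity = log_density(f).mean()                           # 最小化 log p → 推向低密度(稀有)区域
    pair_sq = ((f.unsqueeze(0) - f.unsqueeze(1)) ** 2).sum(-1)
    diversity = -pair_sq.mean()                              # 最大化样本间距离
    similarity = (z - z_ref).pow(2).mean()                   # 与参考潜变量保持接近
    loss = w_rare * rarity + w_div * diversity + w_sim * similarity
    opt.zero_grad(); loss.backward(); opt.step()
print(z.detach().norm(dim=-1))                               # 优化后的潜变量可送入 G 生成稀有样本
```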

链接: https://arxiv.org/abs/2412.19543
作者: Subeen Lee,Jiyeon Han,Soyeon Kim,Jaesik Choi
机构: 未知
关键词: Deep generative models, mode collapse problem, Deep generative, generating realistic data, low density regions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep generative models are proficient in generating realistic data but struggle with producing rare samples in low density regions due to their scarcity of training datasets and the mode collapse problem. While recent methods aim to improve the fidelity of generated samples, they often reduce diversity and coverage by ignoring rare and novel samples. This study proposes a novel approach for generating diverse rare samples from high-resolution image datasets with pretrained GANs. Our method employs gradient-based optimization of latent vectors within a multi-objective framework and utilizes normalizing flows for density estimation on the feature space. This enables the generation of diverse rare images, with controllable parameters for rarity, diversity, and similarity to a reference image. We demonstrate the effectiveness of our approach both qualitatively and quantitatively across various datasets and GANs without retraining or fine-tuning the pretrained GANs.
zh

[CV-21] Interacted Object Grounding in Spatio-Temporal Human-Object Interactions AAAI2025

【速读】: 该论文旨在解决时空人-物交互(Spatio-temporal Human-Object Interaction, ST-HOI)理解中的开放世界物体多样性问题。现有的人-物交互视频基准测试通常局限于预定义的物体类别,无法充分反映现实世界中物体的多样性。为此,论文引入了一个新的开放世界基准测试:Grounding Interacted Objects (GIO),包含1,098个交互物体类别和290K个交互物体框标注,并提出了物体定位任务,要求视觉系统能够发现交互物体。尽管现有的检测器和定位方法在常规任务中表现良好,但在GIO中定位多样且稀有的物体时表现不佳,揭示了当前视觉系统的局限性。为解决这一问题,论文提出利用时空线索进行物体定位,并设计了一个4D问答框架(4D-QA),从多样化的视频中发现交互物体。该方法在广泛的实验中表现出显著优势,数据与代码将公开提供。

链接: https://arxiv.org/abs/2412.19542
作者: Xiaoyang Liu,Boran Wen,Xinpeng Liu,Zizheng Zhou,Hongwei Fan,Cewu Lu,Lizhuang Ma,Yulong Chen,Yong-Lu Li
机构: 未知
关键词: Spatio-temporal Human-Object Interaction, Interacted Objects, understanding aims, activity understanding, interaction video benchmarks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To be published in the Proceedings of AAAI 2025. The first three authors contributed equally. Project: this https URL

点击查看摘要

Abstract:Spatio-temporal Human-Object Interaction (ST-HOI) understanding aims at detecting HOIs from videos, which is crucial for activity understanding. However, existing whole-body-object interaction video benchmarks overlook the fact that open-world objects are diverse, that is, they usually provide limited and predefined object classes. Therefore, we introduce a new open-world benchmark: Grounding Interacted Objects (GIO), including 1,098 interacted object classes and 290K interacted object box annotations. Accordingly, an object grounding task is proposed, expecting vision systems to discover interacted objects. Even though today’s detectors and grounding methods have succeeded greatly, they perform unsatisfactorily in localizing diverse and rare objects in GIO. This profoundly reveals the limitations of current vision systems and poses a great challenge. Thus, we explore leveraging spatio-temporal cues to address object grounding and propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos. Our method demonstrates significant superiority in extensive experiments compared to current baselines. Data and code will be publicly available at this https URL.
zh

[CV-22] Finger in Camera Speaks Everything: Unconstrained Air-Writing for Real-World

【速读】: 该论文旨在解决空中书写(air-writing)领域中的两个主要挑战:一是现有方法依赖复杂传感器(如雷达、脑电图等)来捕捉精确的手写轨迹,二是缺乏覆盖广泛词汇范围的基于视频的空中书写数据集。这些限制影响了空中书写技术在实际场景中的实用性,特别是在iPhone和笔记本电脑等设备上的应用。为解决这些问题,论文提出了一个开创性的基于视频的空中书写中文字符数据集(AWCV-100K-UCAS2024),该数据集使用常见的RGB摄像头捕捉各种实际场景中的手写轨迹,无需复杂传感器。此外,论文还提出了一种基线方法——基于视频的字符识别器(VCRec),该方法能够从稀疏的视觉线索中提取指尖特征,并利用时空序列模块进行分析。实验结果表明,VCRec在识别空中书写字符方面优于现有模型,为增强现实世界中人机交互提供了新的可能性。

链接: https://arxiv.org/abs/2412.19537
作者: Meiqi Wu,Kaiqi Huang,Yuanqiang Cai,Shiyu Hu,Yuzhong Zhao,Weiqiang Wang
机构: 未知
关键词: natural language processing, vision and natural, natural language, intuitive and natural, language processing
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Air-writing is a challenging task that combines the fields of computer vision and natural language processing, offering an intuitive and natural approach for human-computer interaction. However, current air-writing solutions face two primary challenges: (1) their dependency on complex sensors (e.g., Radar, EEGs and others) for capturing precise handwritten trajectories, and (2) the absence of a video-based air-writing dataset that covers a comprehensive vocabulary range. These limitations impede their practicality in various real-world scenarios, including the use on devices like iPhones and laptops. To tackle these challenges, we present the groundbreaking air-writing Chinese character video dataset (AWCV-100K-UCAS2024), serving as a pioneering benchmark for video-based air-writing. This dataset captures handwritten trajectories in various real-world scenarios using commonly accessible RGB cameras, eliminating the need for complex sensors. AWCV-100K-UCAS2024 includes 8.8 million video frames, encompassing the complete set of 3,755 characters from the GB2312-80 level-1 set (GB1). Furthermore, we introduce our baseline approach, the video-based character recognizer (VCRec). VCRec adeptly extracts fingertip features from sparse visual cues and employs a spatio-temporal sequence module for analysis. Experimental results showcase the superior performance of VCRec compared to existing models in recognizing air-written characters, both quantitatively and qualitatively. This breakthrough paves the way for enhanced human-computer interaction in real-world contexts. Moreover, our approach leverages affordable RGB cameras, enabling its applicability in a diverse range of scenarios. The code and data examples will be made public at this https URL.
zh

[CV-23] StyleRWKV: High-Quality and High-Efficiency Style Transfer with RWKV-like Architecture

【速读】: 该论文旨在解决现有风格迁移(Style Transfer)方法在计算复杂度和推理时间上的问题,特别是基于Transformer或扩散模型(Diffusion Models)的方法所面临的二次计算复杂度和高推理时间的挑战。为了解决这些问题,论文提出了一种名为StyleRWKV的新框架,其关键创新点包括:1)引入了一种称为Recurrent WKV(Re-WKV)的注意力机制,通过双向注意力建立全局感受野,从而在保持线性时间复杂度的同时实现高质量的风格迁移;2)开发了Deformable Shifting(Deform-Shifting)层,通过在卷积核的采样网格中引入可学习的偏移量,使模型能够灵活且自适应地从感兴趣区域进行采样,从而增强局部依赖关系的捕捉能力;3)提出了Skip Scanning(S-Scanning)方法,有效建立全局上下文依赖关系。实验结果表明,该方法在风格化质量、模型复杂度和推理效率方面均优于现有最先进的方法。

链接: https://arxiv.org/abs/2412.19535
作者: Miaomiao Dai,Qianyu Zhou,Lizhuang Ma
机构: 未知
关键词: Style transfer aims, aims to generate, image preserving, preserving the content, artistic representation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Style transfer aims to generate a new image preserving the content but with the artistic representation of the style source. Most of the existing methods are based on Transformers or diffusion models; however, they suffer from quadratic computational complexity and high inference time. RWKV, as an emerging deep sequence model, has shown immense potential for long-context sequence modeling in NLP tasks. In this work, we present a novel framework, StyleRWKV, to achieve high-quality style transfer with limited memory usage and linear time complexity. Specifically, we propose a Recurrent WKV (Re-WKV) attention mechanism, which incorporates bidirectional attention to establish a global receptive field. Additionally, we develop a Deformable Shifting (Deform-Shifting) layer that introduces learnable offsets to the sampling grid of the convolution kernel, allowing tokens to shift flexibly and adaptively from the region of interest, thereby enhancing the model’s ability to capture local dependencies. Finally, we propose a Skip Scanning (S-Scanning) method that effectively establishes global contextual dependencies. Extensive experiments with qualitative and quantitative evaluations demonstrate that our approach outperforms state-of-the-art methods in terms of stylization quality, model complexity, and inference efficiency.
zh

[CV-24] P3S-Diffusion:A Selective Subject-driven Generation Framework via Point Supervision

【速读】: 该论文旨在解决在主题驱动生成(subject-driven generation)中,如何准确选择给定参考图像中的特定内容,尤其是在图像中存在相似主题(如两只不同的狗)时的挑战。现有方法通常依赖于文本提示(text prompts)或像素掩码(pixel masks)来隔离特定元素,但文本提示往往无法精确描述特定内容,而像素掩码则成本较高。为此,论文提出了一种名为P3S-Diffusion的新架构,通过点监督(point supervision)实现上下文选择的主体驱动生成。P3S-Diffusion的关键在于利用低成本标签(如点)生成主体驱动图像,并在微调过程中从这些点生成扩展的基础掩码,从而避免了对额外分割模型的需求。该掩码用于图像修复(inpainting)并与主体表示对齐。此外,P3S-Diffusion通过多层条件注入(Multi-layers Condition Injection)保留主体的精细特征,并通过注意力一致性损失(Attention Consistency Loss)增强训练效果,最终实现了出色的特征保留和图像生成能力。

链接: https://arxiv.org/abs/2412.19533
作者: Junjie Hu(1),Shuyong Gao(1),Lingyi Hong(1),Qishan Wang(1),Yuzhou Zhao(1),Yan Wang(1),Wenqiang Zhang(1) ((1) Fudan university)
机构: 未知
关键词: Recent research, generation increasingly emphasizes, increasingly emphasizes, emphasizes the importance, importance of selective
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent research in subject-driven generation increasingly emphasizes the importance of selective subject features. Nevertheless, accurately selecting the content in a given reference image still poses challenges, especially when selecting similar subjects in an image (e.g., two different dogs). Some methods attempt to use text prompts or pixel masks to isolate specific elements. However, text prompts often fall short in precisely describing specific content, and pixel masks are often expensive. To address this, we introduce P3S-Diffusion, a novel architecture designed for context-selected subject-driven generation via point supervision. P3S-Diffusion leverages minimal-cost labels (e.g., points) to generate subject-driven images. During fine-tuning, it can generate an expanded base mask from these points, obviating the need for additional segmentation models. The mask is employed for inpainting and aligning with the subject representation. P3S-Diffusion preserves fine features of the subjects through Multi-layers Condition Injection, and its training is further enhanced by an Attention Consistency Loss. Extensive experiments demonstrate its excellent feature preservation and image generation capabilities.
zh

[CV-25] Is Your Text-to-Image Model Robust to Caption Noise?

【速读】: 该论文探讨了在文本到图像(Text-to-Image, T2I)生成任务中,视觉语言模型(Vision Language Models, VLMs)生成的图像描述(caption)中的幻觉(hallucination)现象对生成性能的影响。具体而言,论文通过实证研究,分析了描述幻觉在模型微调过程中对输出质量的持续影响,并发现VLMs的置信度评分(confidence scores)可以作为检测和表征数据分布中噪声相关模式的可靠指标。此外,研究还表明,描述保真度的细微变化对学习表示的质量具有显著影响。为解决这一问题,论文提出了一种利用VLM置信度评分来减轻描述噪声的方法,从而增强T2I模型对描述幻觉的鲁棒性。解决方案的关键在于通过VLM置信度评分来识别和过滤噪声数据,进而提升模型的训练效果和生成质量。
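
下面是“利用 VLM 置信度分数抑制描述噪声”的一个极简示意(假设的做法,并非论文的具体算法):低置信度的描述被直接丢弃,其余按置信度对逐样本损失加权。

```python
import torch

def confidence_weighted_loss(per_sample_loss, caption_conf, threshold=0.2):
    """per_sample_loss: (B,) 逐样本训练损失;caption_conf: (B,) VLM 给出的描述置信度 [0, 1]。
    低于阈值的疑似幻觉描述直接丢弃,其余按置信度加权(示意)。"""
    keep = (caption_conf >= threshold).float()
    weights = keep * caption_conf
    weights = weights / (weights.sum() + 1e-8)
    return (weights * per_sample_loss).sum()

loss = torch.rand(16)     # 假设的逐样本损失
conf = torch.rand(16)     # 假设的 VLM 置信度
print(confidence_weighted_loss(loss, conf))
```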

链接: https://arxiv.org/abs/2412.19531
作者: Weichen Yu,Ziyan Yang,Shanchuan Lin,Qi Zhao,Jianyi Wang,Liangke Gui,Matt Fredrikson,Lu Jiang
机构: 未知
关键词: utilizing Vision Language, Vision Language Models, involves utilizing Vision, Vision Language, technique involves utilizing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In text-to-image (T2I) generation, a prevalent training technique involves utilizing Vision Language Models (VLMs) for image re-captioning. Even though VLMs are known to exhibit hallucination, generating descriptive content that deviates from the visual reality, the ramifications of such caption hallucinations on T2I generation performance remain under-explored. Through our empirical investigation, we first establish a comprehensive dataset comprising VLM-generated captions, and then systematically analyze how caption hallucination influences generation outcomes. Our findings reveal that (1) the disparities in caption quality persistently impact model outputs during fine-tuning, (2) VLMs’ confidence scores serve as reliable indicators for detecting and characterizing noise-related patterns in the data distribution, and (3) even subtle variations in caption fidelity have significant effects on the quality of learned representations. These findings collectively emphasize the profound impact of caption quality on model performance and highlight the need for more sophisticated and robust training algorithms in T2I. In response to these observations, we propose an approach that leverages VLM confidence scores to mitigate caption noise, thereby enhancing the robustness of T2I models against caption hallucination.
zh

[CV-26] Attribution for Enhanced Explanation with Transferable Adversarial eXploration

【速读】: 该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在计算机视觉等应用中的可解释性问题,特别是如何提高模型解释的准确性和鲁棒性。论文提出的解决方案AttEXplore++框架,基于AttEXplore,通过引入可迁移的对抗攻击方法(如MIG和GRA)来增强归因(attribution)效果。实验表明,AttEXplore++在五个模型(包括CNNs和视觉变换器)上平均性能提升了7.57%和32.62%,优于其他最先进的可解释性算法。关键点在于利用对抗迁移性(adversarial transferability)来提升归因结果的准确性,并通过插入和删除分数(insertion and deletion scores)作为评估指标,验证了该方法的有效性。此外,论文还探讨了随机性、扰动率、噪声幅度和多样性概率对归因性能的影响,证明了AttEXplore++在不同模型上提供更稳定和可靠的解释。
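
摘要中以插入/删除分数评估归因质量;下面给出删除分数(deletion score)的一个简化实现草图(假设实现):按归因值从大到小逐步删除像素,记录目标类概率的变化并取平均作为曲线下面积的近似,数值越低说明归因越准确。

```python
import torch

@torch.no_grad()
def deletion_score(model, image, attribution, target, steps=20):
    """image: (1, 3, H, W);attribution: (H, W) 归因图;返回删除曲线的近似 AUC(越低越好)。"""
    H, W = attribution.shape
    order = attribution.flatten().argsort(descending=True)     # 先删归因值最大的像素
    per_step = (H * W) // steps
    probs, x = [], image.clone()
    for s in range(steps + 1):
        p = torch.softmax(model(x), dim=-1)[0, target].item()
        probs.append(p)
        mask = torch.ones(H * W, device=image.device)
        mask[order[:(s + 1) * per_step]] = 0.0                  # 累计删除
        x = image * mask.view(1, 1, H, W)
    return sum(probs) / len(probs)

# 随机的假设分类器与归因图,仅演示函数可运行
dummy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(deletion_score(dummy_model, torch.randn(1, 3, 32, 32), torch.rand(32, 32), target=3))
```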

链接: https://arxiv.org/abs/2412.19523
作者: Zhiyu Zhu,Jiayu Zhang,Zhibo Jin,Huaming Chen,Jianlong Zhou,Fang Chen
机构: 未知
关键词: deep neural networks, understanding model decisions, including computer vision, deep neural, neural networks
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The interpretability of deep neural networks is crucial for understanding model decisions in various applications, including computer vision. AttEXplore++, an advanced framework built upon AttEXplore, enhances attribution by incorporating transferable adversarial attack methods such as MIG and GRA, significantly improving the accuracy and robustness of model explanations. We conduct extensive experiments on five models, including CNNs (Inception-v3, ResNet-50, VGG16) and vision transformers (MaxViT-T, ViT-B/16), using the ImageNet dataset. Our method achieves an average performance improvement of 7.57% over AttEXplore and 32.62% compared to other state-of-the-art interpretability algorithms. Using insertion and deletion scores as evaluation metrics, we show that adversarial transferability plays a vital role in enhancing attribution results. Furthermore, we explore the impact of randomness, perturbation rate, noise amplitude, and diversity probability on attribution performance, demonstrating that AttEXplore++ provides more stable and reliable explanations across various models. We release our code at: this https URL
zh

[CV-27] Dust to Tower: Coarse-to-Fine Photo-Realistic Scene Reconstruction from Sparse Uncalibrated Images

【速读】: 该论文旨在解决从稀疏视角(sparse-view)且未校准(uncalibrated)图像中进行逼真场景重建的问题。现有方法要么需要精确的相机参数(如内参和外参),要么需要密集采集的图像,无法同时满足稀疏视角和无校准的需求。为此,论文提出了Dust to Tower (D2T)框架,通过从稀疏且未校准的图像中同时优化3D高斯分布(3D Gaussian Splatting, 3DGS)和图像姿态,实现从粗到细的重建。其关键解决方案包括:首先,通过快速多视角立体(Multi-View Stereo)模型初始化3DGS并恢复初始相机姿态;其次,利用置信度感知深度对齐(Confidence Aware Depth Alignment, CADA)模块,通过单目深度模型(Mono-depth model)估计的深度对齐粗深度图的置信部分,进一步优化深度图;最后,提出基于变形图像的引导修复(Warped Image-Guided Inpainting, WIGI)模块,将训练图像变形到新视角,并通过修复填补因视角变化产生的“空洞”,为3D模型和相机姿态的进一步优化提供高质量监督。实验表明,D2T在新视角合成和姿态估计任务中均达到了最先进的性能,同时保持了高效性。
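
其中置信度感知深度对齐(CADA)的核心思想,可以理解为只用粗深度图中置信度较高的像素,以最小二乘拟合尺度和偏移,把单目深度对齐到粗深度的尺度上;下面是该思路的极简草图(假设的简化实现)。

```python
import torch

def align_mono_depth(mono_depth, coarse_depth, confidence, conf_thresh=0.5):
    """mono_depth / coarse_depth / confidence: (H, W)。
    仅用高置信像素最小二乘拟合 scale、shift,把单目深度对齐到粗深度的尺度(示意)。"""
    mask = confidence > conf_thresh
    x = mono_depth[mask].reshape(-1, 1)
    y = coarse_depth[mask].reshape(-1, 1)
    A = torch.cat([x, torch.ones_like(x)], dim=1)          # [d_mono, 1]
    sol = torch.linalg.lstsq(A, y).solution                # 最小二乘解 [scale, shift]
    scale, shift = sol[0, 0], sol[1, 0]
    return scale * mono_depth + shift                      # 对齐后的深度,可用于向新视角 warp

mono = torch.rand(64, 64) * 2
coarse = 3.0 * mono + 0.5 + 0.01 * torch.randn(64, 64)     # 构造带噪的线性关系作演示
conf = torch.rand(64, 64)
print((align_mono_depth(mono, coarse, conf) - coarse).abs().mean())   # 应接近 0
```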

链接: https://arxiv.org/abs/2412.19518
作者: Xudong Cai,Yongcai Wang,Zhaoxin Fan,Deng Haoran,Shuo Wang,Wanting Li,Deying Li,Lun Luo,Minhang Wang,Jintao Xu
机构: 未知
关键词: Photo-realistic scene reconstruction, Photo-realistic scene, required in practice, scene reconstruction, highly required
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Photo-realistic scene reconstruction from sparse-view, uncalibrated images is highly required in practice. Although some successes have been made, existing methods are either Sparse-View but require accurate camera parameters (i.e., intrinsic and extrinsic), or SfM-free but need densely captured images. To combine the advantages of both methods while addressing their respective weaknesses, we propose Dust to Tower (D2T), an accurate and efficient coarse-to-fine framework to optimize 3DGS and image poses simultaneously from sparse and uncalibrated images. Our key idea is to first construct a coarse model efficiently and subsequently refine it using warped and inpainted images at novel viewpoints. To do this, we first introduce a Coarse Construction Module (CCM) which exploits a fast Multi-View Stereo model to initialize a 3D Gaussian Splatting (3DGS) and recover initial camera poses. To refine the 3D model at novel viewpoints, we propose a Confidence Aware Depth Alignment (CADA) module to refine the coarse depth maps by aligning their confident parts with estimated depths by a Mono-depth model. Then, a Warped Image-Guided Inpainting (WIGI) module is proposed to warp the training images to novel viewpoints by the refined depth maps, and inpainting is applied to fill the "holes" in the warped images caused by view-direction changes, providing high-quality supervision to further optimize the 3D model and the camera poses. Extensive experiments and ablation studies demonstrate the validity of D2T and its design choices, achieving state-of-the-art performance in both tasks of novel view synthesis and pose estimation while keeping high efficiency. Codes will be publicly available.
zh

[CV-28] MBQ: Modality-Balanced Quantization for Large Vision-Language Models

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在部署过程中由于参数量大而导致的内存和计算开销问题。现有的后训练量化(Post-Training Quantization, PTQ)方法主要针对大语言模型(Large Language Models, LLMs),未充分考虑多模态(如视觉和语言)之间的差异,导致在处理不同模态的token时可能过度强调不敏感的模态,从而造成显著的精度损失。为解决这一问题,论文提出了一种简单而有效的方法,称为模态平衡量化(Modality-Balanced Quantization, MBQ)。MBQ的关键在于在校准过程中结合不同模态的敏感性差异,以最小化重建损失,从而获得更好的量化参数。实验表明,MBQ在W3和W4A8量化下,相较于现有技术,能够显著提升7B到70B VLMs的任务精度,最高分别提升4.4%和11.6%。此外,论文还实现了一个W3 GPU内核,融合了反量化和GEMV操作,在RTX 4090上对LLaVA-onevision-7B实现了1.4倍的加速。
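
MBQ 的要点是在校准阶段按模态敏感度对重建误差加权;下面给出该目标函数的一个极简草图(假设实现):量化函数与敏感度权重 s_vision/s_text 均为示意,实际应由校准数据统计得到。

```python
import torch

def fake_quantize(w, bits=3):
    """极简的对称均匀量化(示意),非实际部署使用的量化核。"""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

def modality_balanced_loss(layer_w, vision_x, text_x, s_vision=2.0, s_text=1.0, bits=3):
    """layer_w: (out, in) 线性层权重;vision_x / text_x: 两种模态的校准激活 (N, in)。
    分别计算两种模态 token 的输出重建误差,并按敏感度加权(示意)。"""
    w_q = fake_quantize(layer_w, bits)
    err_v = ((vision_x @ layer_w.T - vision_x @ w_q.T) ** 2).mean()
    err_t = ((text_x @ layer_w.T - text_x @ w_q.T) ** 2).mean()
    return s_vision * err_v + s_text * err_t    # 敏感模态权重更大,用于指导量化参数搜索

w = torch.randn(256, 128)
print(modality_balanced_loss(w, torch.randn(32, 128), torch.randn(32, 128)))
```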

链接: https://arxiv.org/abs/2412.19509
作者: Shiyao Li,Yingchun Hu,Xuefei Ning,Xihui Liu,Ke Hong,Xiaotao Jia,Xiuhong Li,Yaqi Yan,Pei Ran,Guohao Dai,Shengen Yan,Huazhong Yang,Yu Wang
机构: 未知
关键词: real-world applications, enabled a variety, variety of real-world, Existing PTQ methods, Vision-Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have enabled a variety of real-world applications. The large parameter size of VLMs brings large memory and computation overhead which poses significant challenges for deployment. Post-Training Quantization (PTQ) is an effective technique to reduce the memory and computation overhead. Existing PTQ methods mainly focus on large language models (LLMs), without considering the differences across other modalities. In this paper, we discover that there is a significant difference in sensitivity between language and vision tokens in large VLMs. Therefore, treating tokens from different modalities equally, as in existing PTQ methods, may over-emphasize the insensitive modalities, leading to significant accuracy loss. To deal with the above issue, we propose a simple yet effective method, Modality-Balanced Quantization (MBQ), for large VLMs. Specifically, MBQ incorporates the different sensitivities across modalities during the calibration process to minimize the reconstruction loss for better quantization parameters. Extensive experiments show that MBQ can significantly improve task accuracy by up to 4.4% and 11.6% under W3 and W4A8 quantization for 7B to 70B VLMs, compared to SOTA baselines. Additionally, we implement a W3 GPU kernel that fuses the dequantization and GEMV operators, achieving a 1.4x speedup on LLaVA-onevision-7B on the RTX 4090. The code is available at this https URL.
zh

[CV-29] DrivingWorld: ConstructingWorld Model for Autonomous Driving via Video GPT

【速读】: 该论文旨在解决现有自回归生成模型(如GPT系列)在视觉任务中,特别是自动驾驶领域,生成高质量、长时间视频序列时面临的挑战。传统GPT框架设计用于处理一维上下文信息(如文本),缺乏对视频生成所需的空间和时间动态建模的能力,导致生成结果不理想。为此,论文提出了DrivingWorld,一种基于GPT风格的世界模型,通过引入多种时空融合机制,有效建模空间和时间动态,从而实现高保真、长时间的视频生成。关键解决方案包括:1)采用下一状态预测策略(next-state prediction strategy)建模连续帧之间的时间一致性;2)应用下一令牌预测策略(next-token prediction strategy)捕捉每帧内的空间信息;3)提出新颖的掩码策略和重加权策略,以缓解长期漂移问题并实现精确控制。实验表明,该方法能够生成超过40秒的高保真视频,显著优于现有技术。

链接: https://arxiv.org/abs/2412.19505
作者: Xiaotao Hu,Wei Yin,Mingkai Jia,Junyuan Deng,Xiaoyang Guo,Qian Zhang,Xiaoxiao Long,Ping Tan
机构: 未知
关键词: natural language processing, Recent successes, successes in autoregressive, language processing, GPT series
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video generation. Specifically, we propose a next-state prediction strategy to model temporal coherence between consecutive frames and apply a next-token prediction strategy to capture spatial information within each frame. To further enhance generalization ability, we propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues and enable precise control. Our work demonstrates the ability to produce high-fidelity and consistent video clips of over 40 seconds in duration, which is over 2 times longer than state-of-the-art driving world models. Experiments show that, in contrast to prior works, our method achieves superior visual quality and significantly more accurate controllable future video generation. Our code is available at this https URL.
zh

[CV-30] Hear the Scene: Audio-Enhanced Text Spotting

【速读】: 该论文旨在解决场景文本检测(scene text spotting)中依赖精确位置标注(precise location annotations)的高成本和高劳动强度问题。为此,作者提出了一种仅利用转录标注(transcription annotations)进行训练的创新方法,显著减少了对复杂标注过程的依赖。解决方案的关键在于采用基于查询的范式(query-based paradigm),通过文本查询与图像嵌入(image embeddings)的交互学习隐式位置特征(implicit location features),并在文本识别阶段通过注意力激活图(attention activation map)进一步优化这些特征。此外,作者引入了循环课程学习策略(circular curriculum learning strategy)以增强模型收敛性,并提出了从粗到细的交叉注意力定位机制(coarse-to-fine cross-attention localization mechanism)以提高文本实例定位的准确性。该框架还支持基于音频的标注,显著缩短了标注时间,并为残障人士提供了包容性选择。实验结果表明,该方法在不依赖大量位置标注的情况下,仍能达到与现有基准相媲美的性能。

链接: https://arxiv.org/abs/2412.19504
作者: Jing Li,Bo Wang
机构: 未知
关键词: Recent advancements, methodologies that heavily, labor-intensive to procure, advancements in scene, heavily rely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in scene text spotting have focused on end-to-end methodologies that heavily rely on precise location annotations, which are often costly and labor-intensive to procure. In this study, we introduce an innovative approach that leverages only transcription annotations for training text spotting models, substantially reducing the dependency on elaborate annotation processes. Our methodology employs a query-based paradigm that facilitates the learning of implicit location features through the interaction between text queries and image embeddings. These features are later refined during the text recognition phase using an attention activation map. Addressing the challenges associated with training a weakly-supervised model from scratch, we implement a circular curriculum learning strategy to enhance model convergence. Additionally, we introduce a coarse-to-fine cross-attention localization mechanism for more accurate text instance localization. Notably, our framework supports audio-based annotation, which significantly diminishes annotation time and provides an inclusive alternative for individuals with disabilities. Our approach achieves competitive performance against existing benchmarks, demonstrating that high accuracy in text spotting can be attained without extensive location annotations.
zh

[CV-31] owards Open-Vocabulary Remote Sensing Image Semantic Segmentation AAAI2025

【速读】: 该论文旨在解决遥感图像语义分割中存在的局限性,即现有深度学习方法通常依赖于预定义的语义类别集,无法适应新类别且需要额外的图像标注和模型训练。为此,论文提出了开放词汇遥感图像语义分割(Open-Vocabulary Remote Sensing Image Semantic Segmentation, OVRSISS),旨在实现对任意语义类别的分割。解决方案的关键在于提出了一个名为GSNet的新框架,该框架集成了专用遥感模型的领域先验知识和通用视觉-语言模型的多功能能力。GSNet由双流图像编码器(Dual-Stream Image Encoder, DSIE)、查询引导特征融合(Query-Guided Feature Fusion, QGFF)和残差信息保留解码器(Residual Information Preservation Decoder, RIPD)组成,通过双流特征提取、特征融合和多源特征聚合,实现了更精确的分割效果。此外,论文还开发了LandDiscover50K数据集,包含51,846张图像和40个多样化的语义类别,为OVRSISS方法提供了数据支持。

链接: https://arxiv.org/abs/2412.19492
作者: Chengyang Ye,Yunzhi Zhuge,Pingping Zhang
机构: 未知
关键词: deep learning based, remote sensing, remote sensing image, revolutionized remote sensing, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by AAAI2025

点击查看摘要

Abstract:Recently, deep learning based methods have revolutionized remote sensing image segmentation. However, these methods usually rely on a pre-defined semantic class set, thus needing additional image annotation and model training when adapting to new classes. More importantly, they are unable to segment arbitrary semantic classes. In this work, we introduce Open-Vocabulary Remote Sensing Image Semantic Segmentation (OVRSISS), which aims to segment arbitrary semantic classes in remote sensing images. To address the lack of OVRSISS datasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images covering 40 diverse semantic classes. In addition, we propose a novel framework named GSNet that integrates domain priors from special remote sensing models and versatile capabilities of general vision-language models. Technically, GSNet consists of a Dual-Stream Image Encoder (DSIE), a Query-Guided Feature Fusion (QGFF), and a Residual Information Preservation Decoder (RIPD). DSIE first captures comprehensive features from both special models and general models in dual streams. Then, with the guidance of variable vocabularies, QGFF integrates specialist and generalist features, enabling them to complement each other. Finally, RIPD is proposed to aggregate multi-source features for more accurate mask predictions. Experiments show that our method outperforms other methods by a large margin, and our proposed LandDiscover50K improves the performance of OVRSISS methods. The proposed dataset and method will be made publicly available at this https URL.
zh

[CV-32] Multi-label Classification using Deep Multi-order Context-aware Kernel Networks

【速读】: 该论文旨在解决多标签分类(Multi-label classification)任务中现有深度学习方法忽视上下文信息的问题。上下文信息(context)如图像的几何结构(geometrical structure)可以为模型提供额外的线索,从而显著提升分类性能。论文的关键解决方案是提出了一种深度多阶上下文感知核网络(Deep Multi-order Context-aware Kernel Network, DMCKN),该网络通过充分利用上下文信息来学习更好的上下文感知相似性(context-aware similarities),即核(kernels)。具体而言,作者将上下文感知核设计重新表述为一个前馈网络(feed-forward network),该网络输出显式的核映射特征,并进一步利用不同距离内的多阶邻域信息(multiple orders of patch neighbors),从而构建了一个更具判别力的多标签分类模型。通过在Corel5K和NUS-WIDE基准数据集上的实验验证,该方法在定量和定性分析中均表现出与现有最先进方法相竞争的性能,证明了其有效性和优越性。

链接: https://arxiv.org/abs/2412.19491
作者: Mingyuan Jiu,Hailong Zhu,Hichem Sahbi
机构: 未知
关键词: pattern recognition, task in pattern, classification, context-aware kernel, kernel
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-label classification is a challenging task in pattern recognition. Many deep learning methods have been proposed and largely enhanced classification performance. However, most of the existing sophisticated methods ignore context in the models’ learning process. Since context may provide additional cues to the learned models, it may significantly boost classification performances. In this work, we make full use of context information (namely geometrical structure of images) in order to learn better context-aware similarities (a.k.a. kernels) between images. We reformulate context-aware kernel design as a feed-forward network that outputs explicit kernel mapping features. Our obtained context-aware kernel network further leverages multiple orders of patch neighbors within different distances, resulting in a more discriminating Deep Multi-order Context-aware Kernel Network (DMCKN) for multi-label classification. We evaluate the proposed method on the challenging Corel5K and NUS-WIDE benchmarks, and empirical results show that our method obtains competitive performance against the related state-of-the-art, and both quantitative and qualitative performances corroborate its effectiveness and superiority for multi-label image classification.
zh

[CV-33] RAIN: Real-time Animation of Infinite Video Stream

【速读】: 该论文旨在解决在消费级 GPU(如 RTX 4090)上实现高质量、实时且稳定的动画生成问题,特别是在使用扩散模型(diffusion models)时,现有方法难以高效生成长时间、一致性的视频流,且常受限于延迟问题和长时间运行后视觉质量的下降。论文提出的解决方案 RAIN 是一种管道式方法,其核心在于通过高效计算不同噪声水平和长时间间隔下的帧-令牌注意力(frame-token attention),同时对比以往基于流的方法显著更多的帧-令牌进行去噪。这一设计使得 RAIN 能够在更短的延迟和更快的速度下生成视频帧,同时保持对长时间视频流的远距离注意力,从而增强连续性和一致性。RAIN 仅引入了少量额外的 1D 注意力块,几乎不增加额外负担,使得经过少量微调的 Stable Diffusion 模型能够以低延迟实时生成无限长的视频流,且质量和一致性几乎不受影响。

链接: https://arxiv.org/abs/2412.19489
作者: Zhilei Shu,Ruili Feng,Yang Cao,Zheng-Jun Zha
机构: 未知
关键词: enhancing online engagement, gained immense popularity, Live animation, models remains challenging, online engagement
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Live animation has gained immense popularity for enhancing online engagement, yet achieving high-quality, real-time, and stable animation with diffusion models remains challenging, especially on consumer-grade GPUs. Existing methods struggle with generating long, consistent video streams efficiently, often being limited by latency issues and degraded visual quality over extended periods. In this paper, we introduce RAIN, a pipeline solution capable of animating infinite video streams in real-time with low latency using a single RTX 4090 GPU. The core idea of RAIN is to efficiently compute frame-token attention across different noise levels and long time-intervals while simultaneously denoising a significantly larger number of frame-tokens than previous stream-based methods. This design allows RAIN to generate video frames with much shorter latency and faster speed, while maintaining long-range attention over extended video streams, resulting in enhanced continuity and consistency. Consequently, a Stable Diffusion model fine-tuned with RAIN in just a few epochs can produce video streams in real-time with low latency without much compromise in quality or consistency, up to infinite length. Despite its advanced capabilities, RAIN only introduces a few additional 1D attention blocks, imposing minimal additional burden. Experiments on benchmark datasets and super-long video generation demonstrate that RAIN can animate characters in real-time with much better quality, accuracy, and consistency than competitors while incurring lower latency. All code and models will be made publicly available.
zh

[CV-34] UniBrain: A Unified Model for Cross-Subject Brain Decoding

【速读】: 该论文旨在解决脑解码(brain decoding)领域中模型泛化能力不足的问题。当前方法主要依赖于个体特异性模型,由于大脑处理机制的复杂性和个体间fMRI信号的差异,这些方法难以捕捉跨被试的共性,从而限制了模型的泛化能力。为解决这一问题,论文提出了UniBrain,一种无需个体特异性参数的统一脑解码模型。其关键解决方案包括:1)基于群体的提取器(group-based extractor),用于处理不同长度的fMRI信号;2)互协助嵌入器(mutual assistance embedder),用于捕捉跨被试的共性;3)双层特征对齐方案(bilevel feature alignment scheme),用于提取被试不变的特征。通过在脑解码基准测试中的验证,UniBrain在极少数参数的情况下实现了与当前最先进的个体特异性模型相当的性能,并提出了泛化基准测试以推动社区关注跨被试共性的研究。

链接: https://arxiv.org/abs/2412.19487
作者: Zicheng Wang,Zhen Zhao,Luping Zhou,Parashkev Nachev
机构: 未知
关键词: interpreting mental content, reconstruct original stimuli, providing insights, mental content, Brain decoding aims
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures, 9 tables

点击查看摘要

Abstract:Brain decoding aims to reconstruct original stimuli from fMRI signals, providing insights into interpreting mental content. Current approaches rely heavily on subject-specific models due to the complex brain processing mechanisms and the variations in fMRI signals across individuals. Therefore, these methods greatly limit the generalization of models and fail to capture cross-subject commonalities. To address this, we present UniBrain, a unified brain decoding model that requires no subject-specific parameters. Our approach includes a group-based extractor to handle variable fMRI signal lengths, a mutual assistance embedder to capture cross-subject commonalities, and a bilevel feature alignment scheme for extracting subject-invariant features. We validate our UniBrain on the brain decoding benchmark, achieving comparable performance to current state-of-the-art subject-specific models with far fewer parameters. We also propose a generalization benchmark to encourage the community to emphasize cross-subject commonalities for more general brain decoding. Our code is available at this https URL.
zh

[CV-35] Learning Radiance Fields from a Single Snapshot Compressive Image

【速读】: 该论文旨在解决从单张时间压缩图像中恢复底层三维场景结构的问题。传统的高维数据(如高光谱或时间信息)记录通常需要昂贵的设备和高存储需求,而快照压缩成像(Snapshot Compressive Imaging, SCI)技术通过使用低成本二维成像传感器和一系列特殊设计的二维掩码,能够将高维数据压缩到单张图像中,从而降低存储和传输需求,并提供潜在的隐私保护。论文的关键解决方案是结合神经辐射场(Neural Radiance Fields, NeRF)的强大三维场景表示能力,提出了SCINeRF方法,将SCI的物理成像过程作为NeRF训练的一部分,以捕捉复杂场景结构。此外,论文进一步集成了三维高斯溅射(3D Gaussian Splatting, 3DGS)框架,提出了SCISplat方法,通过显式优化点云为三维高斯表示,提升了三维场景重建质量和训练/渲染速度。实验结果表明,该方法在图像重建和新视角合成方面优于现有技术,并能够利用SCI和3DGS的渲染能力实时生成高帧率的多视角一致图像。
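
快照压缩成像(SCI)的物理前向模型可写作对时间帧的掩码调制求和 y = Σ_t M_t ⊙ x_t,SCINeRF/SCISplat 正是把这一前向过程嵌入训练;下面给出该测量模型与测量域监督的极简示意(假设实现)。

```python
import torch

def sci_forward(frames, masks):
    """frames: (T, H, W) 连续 T 帧;masks: (T, H, W) 对应的二值调制掩码。
    返回单张压缩测量图 y = Σ_t M_t ⊙ x_t(示意)。"""
    return (frames * masks).sum(dim=0)

T, H, W = 8, 64, 64
frames = torch.rand(T, H, W)                  # 假设由 NeRF / 3DGS 在各时刻渲染得到
masks = (torch.rand(T, H, W) > 0.5).float()   # 假设的掩码序列
measurement = sci_forward(frames, masks)

target_measurement = torch.rand(H, W)         # 真实系统中为相机拍到的压缩测量
loss = ((measurement - target_measurement) ** 2).mean()   # 测量域的重建损失即训练监督
print(loss)
```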

链接: https://arxiv.org/abs/2412.19483
作者: Yunhao Li,Xiang Liu,Xiaodong Wang,Xin Yuan,Peidong Liu
机构: 未知
关键词: Snapshot Compressive Imaging, Snapshot Compressive, single temporal compressed, Compressive Imaging, temporal compressed image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we explore the potential of Snapshot Compressive Imaging (SCI) technique for recovering the underlying 3D scene structure from a single temporal compressed image. SCI is a cost-effective method that enables the recording of high-dimensional data, such as hyperspectral or temporal information, into a single image using low-cost 2D imaging sensors. To achieve this, a series of specially designed 2D masks are usually employed, reducing storage and transmission requirements and offering potential privacy protection. Inspired by this, we take one step further to recover the encoded 3D scene information leveraging powerful 3D scene representation capabilities of neural radiance fields (NeRF). Specifically, we propose SCINeRF, in which we formulate the physical imaging process of SCI as part of the training of NeRF, allowing us to exploit its impressive performance in capturing complex scene structures. In addition, we further integrate the popular 3D Gaussian Splatting (3DGS) framework and propose SCISplat to improve 3D scene reconstruction quality and training/rendering speed by explicitly optimizing point clouds into 3D Gaussian representations. To assess the effectiveness of our method, we conduct extensive evaluations using both synthetic data and real data captured by our SCI system. Experimental results demonstrate that our proposed approach surpasses the state-of-the-art methods in terms of image reconstruction and novel view synthesis. Moreover, our method also exhibits the ability to render high frame-rate multi-view consistent images in real time by leveraging SCI and the rendering capabilities of 3DGS. Codes will be available at: this https URL CVGL/SCISplat.
zh

[CV-36] Generative Adversarial Network on Motion-Blur Image Restoration

【速读】: 该论文旨在解决日常生活中因手持相机抖动或突发运动导致的图像运动模糊问题,这种模糊现象显著降低了图像质量。论文提出了一种基于生成式对抗网络(Generative Adversarial Networks, GANs)的深度学习模型,通过对抗训练过程来恢复模糊像素的清晰度。解决方案的关键在于利用GAN的生成器(Generator)和判别器(Discriminator)之间的对抗机制,逐步生成更加逼真的图像。模型在GoPro数据集上进行训练和评估,该数据集包含清晰和模糊的街景图像对。通过峰值信噪比(Peak Signal-to-Noise Ratio, PSNR)和结构相似性指数(Structural Similarity Index Measure, SSIM)这两个评价指标,定量评估了去模糊过程的有效性。实验结果表明,该模型在去模糊时间和图像恢复效果上均表现出色,具有实际应用价值。
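
摘要使用 PSNR 与 SSIM 作为去模糊质量指标;下面给出 PSNR 的直接实现,并调用 scikit-image 计算 SSIM(假设已安装 numpy 与 scikit-image,示例数据为随机生成)。

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def psnr(img_a, img_b, max_val=1.0):
    """img_*: 取值范围 [0, max_val] 的同尺寸图像数组。"""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

sharp = np.random.rand(128, 128, 3)
deblurred = np.clip(sharp + 0.02 * np.random.randn(128, 128, 3), 0, 1)   # 模拟去模糊输出
print("PSNR:", psnr(sharp, deblurred))
print("SSIM:", ssim(sharp, deblurred, channel_axis=-1, data_range=1.0))
```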

链接: https://arxiv.org/abs/2412.19479
作者: Zhengdong Li
机构: 未知
关键词: motion blur due, everyday life, sudden movements, Generative Adversarial Networks, camera often suffer
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:In everyday life, photographs taken with a camera often suffer from motion blur due to hand vibrations or sudden movements. This phenomenon can significantly detract from the quality of the images captured, making it an interesting challenge to develop a deep learning model that utilizes the principles of adversarial networks to restore clarity to these blurred pixels. In this project, we focus on leveraging Generative Adversarial Networks (GANs) to effectively deblur images affected by motion blur. A GAN-based TensorFlow model is defined, trained, and evaluated on the GoPro dataset, which comprises paired street-view images featuring both clear and blurred versions. This adversarial training process between the Discriminator and the Generator helps to produce increasingly realistic images over time. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) are the two evaluation metrics used to provide quantitative measures of image quality, allowing us to evaluate the effectiveness of the deblurring process. A mean PSNR of 29.1644 and a mean SSIM of 0.7459, with an average deblurring time of 4.6921 seconds, are achieved in this project. The blurry pixels are sharper in the output of the GAN model, showing a good image restoration effect for real-world applications.
zh

[CV-37] Optimizing Helmet Detection with Hybrid YOLO Pipelines: A Detailed Analysis

【速读】: 该论文旨在解决公共道路交通动态中头盔检测(Helmet Detection)的问题,以提高安全防护水平。这一问题被转化为目标检测任务,论文通过比较最新的YOLO(You Only Look Once)模型(包括YOLOv8、YOLOv9和新发布的YOLOv11)在头盔检测中的可靠性和计算负载,提出了一种改进的架构管道。该混合YOLO模型(h-YOLO)在性能上显著优于独立的YOLO模型,特别是在召回率(Recall)、精确率(Precision)和平均精度均值(mAP, Mean Average Precision)等标准目标检测指标上表现更优。此外,论文还记录了训练和测试时间,以评估模型在实时检测场景中的整体适用性。解决方案的关键在于通过混合架构优化YOLO模型,从而在头盔检测任务中实现更高的检测精度和更低的计算负载。
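
若想复现这类基于 YOLO 的头盔检测实验,可以参考 ultralytics 包的常规用法,下面是一个示意(假设已安装 ultralytics;helmet.yaml 为自行准备的假设数据集配置文件;论文中的 h-YOLO 混合结构不在此示例范围内):

```python
from ultralytics import YOLO

# helmet.yaml 为假设的数据集配置文件(类别与路径需自行准备);超参数仅为示例取值
model = YOLO("yolov8n.pt")                               # 也可换成 YOLOv9 / YOLO11 权重做对比
model.train(data="helmet.yaml", epochs=50, imgsz=640)    # 训练头盔检测模型
metrics = model.val()                                    # 输出 precision / recall / mAP 等指标
print(metrics.box.map)                                   # mAP50-95
```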

链接: https://arxiv.org/abs/2412.19467
作者: Vaikunth M,Dejey D,Vishaal C,Balamurali S
机构: 未知
关键词: road traffic dynamics, advancing protection levels, public road traffic, traffic dynamics, crucial for advancing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Helmet detection is crucial for advancing protection levels in public road traffic dynamics. This problem statement translates to an object detection task. Therefore, this paper compares recent You Only Look Once (YOLO) models in the context of helmet detection in terms of reliability and computational load. Specifically, YOLOv8, YOLOv9, and the newly released YOLOv11 have been used. Besides, a modified architectural pipeline that remarkably improves the overall performance has been proposed in this manuscript. This hybridized YOLO model (h-YOLO) has been pitted against the independent models, and the analysis shows that h-YOLO is preferable for helmet detection over plain YOLO models. The models were tested using a range of standard object detection metrics such as recall, precision, and mAP (Mean Average Precision). In addition, training and testing times were recorded to provide the overall scope of the models in a real-time detection scenario.
zh

[CV-38] MNet-SAt: A Multiscale Network with Spatial-enhanced Attention for Segmentation of Polyps in Colonoscopy

【速读】: 该论文旨在解决结肠镜图像中结肠息肉自动分割的现有方法在保留精确息肉边界、整合多尺度特征以及准确反映息肉复杂多样形态的空间依赖性方面的局限性。为此,作者提出了一种新颖的多尺度网络与空间增强注意力机制(MNet-SAt)框架。该框架的核心模块包括:边缘引导特征增强(EGFE),用于保留边缘信息以提升边界质量;多尺度特征聚合器(MSFA),用于提取和聚合跨通道空间维度的多尺度特征,并聚焦于显著区域;空间增强注意力(SEAt),用于捕捉多尺度聚合特征中的空间感知全局依赖性,突出感兴趣区域;以及通道增强的空洞空间金字塔池化(CE-ASPP),用于跨尺度重采样和重新校准注意力特征。通过这一系列关键模块,MNet-SAt在Kvasir-SEG和CVC-ClinicDB数据集上分别达到了96.61%和98.60%的Dice相似系数,显著优于现有方法,展示了其在息肉分割中的高精度和泛化能力,有望改善早期息肉检测和治疗的临床工作流程,从而降低结直肠癌的死亡率。
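
结果部分以 Dice 相似系数(DSC)评估分割质量;下面给出 DSC 的一个常见实现草图(阈值 0.5 为示例取值):

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    """pred: (B, H, W) 预测概率;target: (B, H, W) 0/1 标注。返回 batch 平均 Dice。"""
    pred_bin = (pred > 0.5).float()
    inter = (pred_bin * target).sum(dim=(1, 2))
    union = pred_bin.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return ((2 * inter + eps) / (union + eps)).mean()

pred = torch.rand(4, 256, 256)
target = (torch.rand(4, 256, 256) > 0.5).float()
print(dice_coefficient(pred, target))    # 随机预测时约为 0.5
```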

链接: https://arxiv.org/abs/2412.19464
作者: Chandravardhan Singh Raghaw,Aryan Yadav,Jasmer Singh Sanjotra,Shalini Dangi,Nagendra Kumar
机构: 未知
关键词: precise polyp boundaries, incorporating multi-scale features, preserving precise polyp, deep learning framework, Multi-Scale Feature Aggregator
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Objective: To develop a novel deep learning framework for the automated segmentation of colonic polyps in colonoscopy images, overcoming the limitations of current approaches in preserving precise polyp boundaries, incorporating multi-scale features, and modeling spatial dependencies that accurately reflect the intricate and diverse morphology of polyps. Methods: To address these limitations, we propose a novel Multiscale Network with Spatial-enhanced Attention (MNet-SAt) for polyp segmentation in colonoscopy images. This framework incorporates four key modules: Edge-Guided Feature Enrichment (EGFE) preserves edge information for improved boundary quality; Multi-Scale Feature Aggregator (MSFA) extracts and aggregates multi-scale features across channel and spatial dimensions, focusing on salient regions; Spatial-Enhanced Attention (SEAt) captures spatial-aware global dependencies within the multi-scale aggregated features, emphasizing the region of interest; and Channel-Enhanced Atrous Spatial Pyramid Pooling (CE-ASPP) resamples and recalibrates attentive features across scales. Results: We evaluated MNet-SAt on the Kvasir-SEG and CVC-ClinicDB datasets, achieving Dice Similarity Coefficients of 96.61% and 98.60%, respectively. Conclusion: Both quantitative (DSC) and qualitative assessments highlight MNet-SAt’s superior performance and generalization capabilities compared to existing methods. Significance: MNet-SAt’s high accuracy in polyp segmentation holds promise for improving clinical workflows in early polyp detection and more effective treatment, contributing to reduced colorectal cancer mortality rates.
zh

[CV-39] A Prototype Unit for Image De-raining using Time-Lapse Data BMVC2024

【速读】: 该论文致力于解决单幅图像去雨(single-image de-raining)的挑战,即从单幅含雨图像中恢复无雨背景信息。尽管现有方法利用真实世界的延时数据(time-lapse data)进行训练,能够估计一致的背景和逼真的雨纹(rain streaks),但这些方法通常面临计算和内存消耗过高的问题,限制了其在实际场景中的应用。论文提出了一种新颖的解决方案:雨纹原型单元(Rain Streak Prototype Unit, RsPU)。RsPU通过从延时数据中实时提取雨纹相关特征的原型,高效地编码这些特征,从而避免了对过多内存资源的需求。该去雨网络结合了编码器-解码器网络和RsPU,采用基于注意力机制的方法,学习并封装多样化的雨纹相关特征为简洁的原型。为确保方法的有效性,论文提出了一种特征原型损失(feature prototype loss),该损失函数包含凝聚性(cohesion)和发散性(divergence)两个部分,旨在捕捉RsPU中原型雨纹特征的紧凑性和多样性。通过在各种去雨基准上的评估和全面的消融研究,论文展示了该方法在多种雨图像中能够取得与现有最先进方法相竞争的结果。
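
特征原型损失由“凝聚性”(特征靠近所属原型)与“发散性”(原型之间相互远离)两部分组成;下面给出这一损失的一个极简草图(假设实现,margin 等取值仅为示例):

```python
import torch

def feature_prototype_loss(features, prototypes, assignment, margin=1.0):
    """features: (N, D) 雨纹相关特征;prototypes: (K, D) 原型;assignment: (N,) 所属原型索引。"""
    cohesion = ((features - prototypes[assignment]) ** 2).sum(dim=-1).mean()   # 靠近所属原型
    dist = torch.cdist(prototypes, prototypes)                                 # (K, K) 原型间距离
    off_diag = dist[~torch.eye(prototypes.shape[0], dtype=torch.bool)]
    divergence = torch.clamp(margin - off_diag, min=0).mean()                  # 原型间距过小则惩罚
    return cohesion + divergence

feats = torch.randn(32, 64)
protos = torch.randn(4, 64)
assign = torch.randint(0, 4, (32,))
print(feature_prototype_loss(feats, protos, assign))
```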

链接: https://arxiv.org/abs/2412.19459
作者: Jaehoon Cho,Minjung Yoo,Jini Yang,Sunok Kim
机构: 未知
关键词: involves recovering rain-free, recovering rain-free background, rain-free background information, Streak Prototype Unit, address the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by BMVC 2024

点击查看摘要

Abstract:We address the challenge of single-image de-raining, a task that involves recovering rain-free background information from a single rain image. While recent advancements have utilized real-world time-lapse data for training, enabling the estimation of consistent backgrounds and realistic rain streaks, these methods often suffer from high computational and memory consumption, limiting their applicability in real-world scenarios. In this paper, we introduce a novel solution: the Rain Streak Prototype Unit (RsPU). The RsPU efficiently encodes rain streak-relevant features as real-time prototypes derived from time-lapse data, eliminating the need for excessive memory resources. Our de-raining network combines encoder-decoder networks with the RsPU, allowing us to learn and encapsulate diverse rain streak-relevant features as concise prototypes, employing an attention-based approach. To ensure the effectiveness of our approach, we propose a feature prototype loss encompassing cohesion and divergence components. This loss function captures both the compactness and diversity aspects of the prototypical rain streak features within the RsPU. We evaluate our method on various de-raining benchmarks, accompanied by comprehensive ablation studies, and show that it achieves competitive results on various rain images compared to state-of-the-art methods.
zh
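
以下给出“凝聚性 + 发散性”特征原型损失的一个极简 PyTorch 示意:凝聚项把特征拉向其对应原型,发散项惩罚原型之间过高的相似度。其中函数名、权重系数 lam 以及具体的距离/相似度选择均为假设,仅帮助理解摘要描述,并非论文的官方实现。

```python
import torch
import torch.nn.functional as F

def feature_prototype_loss(features, prototypes, assign, lam=1.0):
    """示意性的特征原型损失(凝聚 + 发散),非论文原实现。
    features: (N, D) 雨纹相关特征;prototypes: (K, D);assign: (N,) 每个特征对应的原型索引。"""
    cohesion = F.mse_loss(features, prototypes[assign])            # 凝聚性:特征靠近自身原型
    proto_norm = F.normalize(prototypes, dim=1)
    sim = proto_norm @ proto_norm.t()                              # 原型两两余弦相似度
    off_diag = sim - torch.eye(len(prototypes), device=sim.device)
    divergence = off_diag.clamp(min=0).mean()                      # 发散性:惩罚原型之间过于相似
    return cohesion + lam * divergence
```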

[CV-40] DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes

【速读】: 该论文旨在解决在自动驾驶场景中,基于扩散模型(diffusion models)进行物体编辑时存在的精确位置控制和保持高保真外观的挑战。为了解决这些问题,作者提出了DriveEditor,一个基于扩散模型的框架,用于在驾驶视频中进行物体编辑。DriveEditor的关键创新在于其统一框架,能够实现包括重新定位、替换、删除和插入在内的多种物体编辑操作。其核心模块包括位置控制模块和外观维护模块。位置控制模块通过投影给定的3D边界框并保留深度信息,将其分层注入扩散过程,从而实现对物体位置和方向的精确控制。外观维护模块则通过低层次细节保留、高层次语义维护以及从新视角合成模型中整合3D先验信息,确保物体外观的一致性。通过在nuScenes数据集上的广泛定性和定量评估,DriveEditor展示了其在生成多样化驾驶场景编辑中的卓越保真度和可控性,以及其在促进下游任务中的显著能力。

链接: https://arxiv.org/abs/2412.19458
作者: Yiyuan Liang,Zhiying Yan,Liqun Chen,Jiahuan Zhou,Luxin Yan,Sheng Zhong,Xu Zou
机构: 未知
关键词: Vision-centric autonomous driving, Vision-centric autonomous, autonomous driving systems, driving systems require, existing scene captures
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-centric autonomous driving systems require diverse data for robust training and evaluation, which can be augmented by manipulating object positions and appearances within existing scene captures. While recent advancements in diffusion models have shown promise in video editing, their application to object manipulation in driving scenarios remains challenging due to imprecise positional control and difficulties in preserving high-fidelity object appearances. To address these challenges in position and appearance control, we introduce DriveEditor, a diffusion-based framework for object editing in driving videos. DriveEditor offers a unified framework for comprehensive object editing operations, including repositioning, replacement, deletion, and insertion. These diverse manipulations are all achieved through a shared set of varying inputs, processed by identical position control and appearance maintenance modules. The position control module projects the given 3D bounding box while preserving depth information and hierarchically injects it into the diffusion process, enabling precise control over object position and orientation. The appearance maintenance module preserves consistent attributes with a single reference image by employing a three-tiered approach: low-level detail preservation, high-level semantic maintenance, and the integration of 3D priors from a novel view synthesis model. Extensive qualitative and quantitative evaluations on the nuScenes dataset demonstrate DriveEditor’s exceptional fidelity and controllability in generating diverse driving scene edits, as well as its remarkable ability to facilitate downstream tasks.
zh

[CV-41] Focusing Image Generation to Mitigate Spurious Correlations

【速读】: 该论文旨在解决深度神经网络分类器在训练过程中因图像实例特征与背景特征之间的虚假相关性(spurious correlations)而导致的分类错误问题。具体而言,虚假相关性使得分类器对实例特征的关注不足,从而影响分类结果的准确性。为解决这一问题,论文提出了一种称为“虚假相关性引导合成”(Spurious Correlations Guided Synthesis, SCGS)的数据增强方法。该方法的关键在于通过图像生成模型生成新的训练数据,从而减少虚假相关性对分类器的影响。SCGS首先识别预训练分类器在训练图像上的错误关注区域,然后基于这些区域生成新的训练数据,以增加数据集的多样性和规模。实验结果表明,该方法有效降低了分类器对虚假相关性的依赖。

链接: https://arxiv.org/abs/2412.19457
作者: Xuewei Li,Zhenzhen Nie,Mei Yu,Zijian Zhang,Jie Gao,Tianyi Xu,Zhiqiang Liu
机构: 未知
关键词: Instance features, deep neural classifiers, spurious correlations, exhibit spurious correlations, Correlations Guided Synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instance features in images exhibit spurious correlations with background features, affecting the training process of deep neural classifiers. This leads to insufficient attention to instance features by the classifier, resulting in erroneous classification outcomes. In this paper, we propose a data augmentation method called Spurious Correlations Guided Synthesis (SCGS) that mitigates spurious correlations through an image generation model. This approach does not require expensive spurious attribute (group) labels for the training data and can be widely applied to other debiasing methods. Specifically, SCGS first identifies the incorrect attention regions of a pre-trained classifier on the training images, and then uses an image generation model to generate new training data based on these incorrectly attended regions. SCGS increases the diversity and scale of the dataset to reduce the impact of spurious correlations on classifiers. Changes in the classifier’s attention regions and experimental results on three different domain datasets demonstrate that this method is effective in reducing the classifier’s reliance on spurious correlations.
zh

[CV-42] NijiGAN: Transform What You See into Anime with Contrastive Semi-Supervised Learning and Neural Ordinary Differential Equations

【速读】: 该论文旨在解决生成式 AI 在动画行业中图像到图像翻译(image-to-image translation)的挑战,特别是将真实世界图像转换为动漫风格时面临的问题。现有方法如 Scenimefy 虽然通过对比学习(contrastive learning)和半监督训练(semi-supervised training)实现了高保真的动漫场景翻译,但其依赖从微调的 StyleGAN 中获取的低质量配对数据,且模型参数庞大,导致计算效率低下。论文提出的解决方案 NijiGAN 引入了神经常微分方程(Neural Ordinary Differential Equations, NeuralODEs),利用其在连续变换建模中的独特优势,显著减少了模型参数(仅为 Scenimefy 的一半),并通过 Scenimefy 生成的伪配对数据进行监督训练,避免了对低质量配对数据的依赖。实验结果表明,NijiGAN 在图像质量和计算效率上均优于现有模型,如 AnimeGAN 和 Scenimefy,具体体现在更高的平均意见得分(Mean Opinion Score, MOS)和更低的 Fréchet Inception 距离(Frechet Inception Distance, FID)得分。

链接: https://arxiv.org/abs/2412.19455
作者: Kevin Putra Santoso,Anny Yuniarti,Dwiyasa Nakula,Dimas Prihady Setyawan,Adam Haidar Azizi,Jeany Aurellia P. Dewati,Farah Dhia Fadhila,Maria T. Elvara Bumbungan
机构: 未知
关键词: animation industry, transformed the animation, Generative, Scenimefy, Ordinary Differential Equations
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative AI has transformed the animation industry. Several models have been developed for image-to-image translation, particularly focusing on converting real-world images into anime through unpaired translation. Scenimefy, a notable approach utilizing contrastive learning, achieves high fidelity anime scene translation by addressing limited paired data through semi-supervised training. However, it faces limitations due to its reliance on paired data from a fine-tuned StyleGAN in the anime domain, often producing low-quality datasets. Additionally, Scenimefy’s high parameter architecture presents opportunities for computational optimization. This research introduces NijiGAN, a novel model incorporating Neural Ordinary Differential Equations (NeuralODEs), which offer unique advantages in continuous transformation modeling compared to traditional residual networks. NijiGAN successfully transforms real-world scenes into high fidelity anime visuals using half of Scenimefy’s parameters. It employs pseudo-paired data generated through Scenimefy for supervised training, eliminating dependence on low-quality paired data and improving the training process. Our comprehensive evaluation includes ablation studies and qualitative and quantitative analyses comparing NijiGAN to similar models. The testing results demonstrate that NijiGAN produces higher-quality images than AnimeGAN, as evidenced by a Mean Opinion Score (MOS) of 2.192, surpassing AnimeGAN’s MOS of 2.160. Furthermore, our model achieved a Frechet Inception Distance (FID) score of 58.71, outperforming Scenimefy’s FID score of 60.32. These results demonstrate that NijiGAN achieves competitive performance against existing state-of-the-art methods, especially Scenimefy as the baseline model.
zh

[CV-43] Paleoinspired Vision: From Exploring Colour Vision Evolution to Inspiring Camera Design

【速读】: 该论文旨在解决视觉进化中的关键问题,特别是通过量化进化压力来重建视蛋白(opsin)的光谱敏感性,从而揭示物种在进化过程中如何适应环境以更有效地识别食物或捕食者。解决方案的关键在于引入了一个简化的视网膜视觉传导模型,并结合了一个新的视蛋白层。通过测量机器视觉在特定视蛋白影响的彩色图像上的识别精度,论文量化了进化压力,并开发了一种进化保守优化算法(evolutionary conservation optimisation algorithm),该算法能够在GPU上快速模拟数百万年的进化过程。这一模型不仅为测试进化生物学中的长期假设提供了实验框架,还为任务特定的相机滤光片设计提供了一种简约而有效的方法,优化了光谱响应函数以满足应用需求。

链接: https://arxiv.org/abs/2412.19439
作者: Junjie Zhang,Zhimin Zong,Lin Gu,Shenghan Su,Ziteng Cui,Yan Pu,Zirui Chen,Jing Lu,Daisuke Kojima,Tatsuya Harada,Ruogu Fang
机构: 未知
关键词: modern imaging technology, simultaneously inspiring innovations, imaging technology, reveals the adaptive, adaptive strategies
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:The evolution of colour vision is captivating, as it reveals the adaptive strategies of extinct species while simultaneously inspiring innovations in modern imaging technology. In this study, we present a simplified model of visual transduction in the retina, introducing a novel opsin layer. We quantify evolutionary pressures by measuring machine vision recognition accuracy on colour images shaped by specific opsins. Building on this, we develop an evolutionary conservation optimisation algorithm to reconstruct the spectral sensitivity of opsins, enabling mutation-driven adaptations to more effectively spot fruits or predators. This model condenses millions of years of evolution within seconds on GPU, providing an experimental framework to test long-standing hypotheses in evolutionary biology, such as vision of early mammals, primate trichromacy from gene duplication, retention of colour blindness, blue-shift of fish rod and multiple rod opsins with bioluminescence. Moreover, the model enables speculative explorations of hypothetical species, such as organisms with eyes adapted to the conditions on Mars. Our findings suggest a minimalist yet effective approach to task-specific camera filter design, optimising the spectral response function to meet application-driven demands. The code will be made publicly available upon acceptance.
zh

[CV-44] Residual Feature-Reutilization Inception Network for Image Classification

【速读】: 该论文旨在解决计算机视觉领域中特征信息有效提取的问题。随着卷积神经网络(CNNs)的发展,残差连接(residual connection)和多尺度(multiple scales)等概念在多种深度学习视觉任务中持续提升性能。论文提出了一种新颖的CNN架构,该架构由残差特征重用初始模块(ResFRI)或分割残差特征重用初始模块(Split-ResFRI)组成。该架构通过四种不同结构的卷积组合,并通过特别设计的信息交互通道连接,以提取多尺度特征信息并有效增加模型的感受野(receptive field)。此外,Split-ResFRI能够根据输入信息的分割比例进行调整,从而减少参数数量并保证模型性能。实验结果表明,在CIFAR10、CIFAR100和Tiny Imagenet等流行视觉数据集上,该模型在模型大小相近且未使用额外数据的前提下,取得了与现有先进模型相当的最优结果。

链接: https://arxiv.org/abs/2412.19433
作者: Yuanpeng He,Wenjie Song,Lijian Li,Tianxiang Zhan,Wenpin Jiao
机构: 未知
关键词: Capturing feature information, Capturing feature, great importance, computer vision, learning vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2301.00424

点击查看摘要

Abstract:Capturing feature information effectively is of great importance in the field of computer vision. With the development of convolutional neural networks (CNNs), concepts like residual connection and multiple scales promote continual performance gains in diverse deep learning vision tasks. In this paper, we propose a novel CNN architecture that consists of residual feature-reutilization inceptions (ResFRI) or split-residual feature-reutilization inceptions (Split-ResFRI). It is composed of four convolutional combinations of different structures connected by specially designed information interaction passages, which are utilized to extract multi-scale feature information and effectively increase the receptive field of the model. Moreover, according to the network structure designed above, Split-ResFRI can adjust the segmentation ratio of the input information, thereby reducing the number of parameters and guaranteeing the model performance. Specifically, in experiments based on popular vision datasets, such as CIFAR10 (97.94%), CIFAR100 (85.91%) and Tiny Imagenet (70.54%), we obtain state-of-the-art results compared with other modern models under the premise that the model size is approximate and no additional data is used.
zh
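
下面用 PyTorch 勾勒“按比例切分输入通道、送入不同结构的卷积分支后拼接并加残差”这一思路的最小示意。分支数量、卷积核大小与切分比例均为假设,并非论文中 ResFRI/Split-ResFRI 模块的真实结构。

```python
import torch
import torch.nn as nn

class SplitResFRIBlock(nn.Module):
    """示意:按比例切分输入通道,分别经过不同感受野的卷积分支,拼接后加残差。"""
    def __init__(self, channels, split_ratios=(0.25, 0.25, 0.25, 0.25)):
        super().__init__()
        self.splits = [int(channels * r) for r in split_ratios[:-1]]
        self.splits.append(channels - sum(self.splits))            # 保证总通道数吻合
        self.branches = nn.ModuleList([
            nn.Conv2d(c, c, kernel_size=k, padding=k // 2)
            for c, k in zip(self.splits, (1, 3, 5, 7))              # 四个不同结构的分支(假设)
        ])

    def forward(self, x):
        parts = torch.split(x, self.splits, dim=1)
        out = torch.cat([b(p) for b, p in zip(self.branches, parts)], dim=1)
        return out + x                                              # 残差连接
```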

[CV-45] Temporal Context Consistency Above All: Enhancing Long-Term Anticipation by Learning and Enforcing Temporal Constraints

【速读】: 该论文旨在解决长期动作预测(Long-Term Action Anticipation, LTA)问题,即在给定初始未修剪视频片段的情况下,预测视频中动作的标签及其持续时间。论文提出了基于编码器-解码器架构的并行解码方法,并做出了两个关键贡献。首先,在解码器顶部引入了双向动作上下文正则化模块(bi-directional action context regularizer module),以确保时间相邻片段之间的上下文一致性。其次,通过从已分类的片段中学习转移矩阵(transition matrix),该矩阵建模了从一个动作转移到另一个动作的概率,并在整个预测区间内全局优化序列。此外,论文还使用了专门的动作分割编码器,以提高推理时观察区间内预测的质量,从而更好地理解过去的行为。该方法在EpicKitchen-55、EGTEA+、50Salads和Breakfast四个LTA基准数据集上验证了其性能,展示了优于或与现有最先进方法(包括基于概率模型和大型语言模型的方法)相当的结果。

链接: https://arxiv.org/abs/2412.19424
作者: Alberto Maté,Mariella Dimiccoli
机构: 未知
关键词: long-term action anticipation, predicting action labels, initial untrimmed video, paper proposes, initial untrimmed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes a method for long-term action anticipation (LTA), the task of predicting action labels and their duration in a video given the observation of an initial untrimmed video interval. We build on an encoder-decoder architecture with parallel decoding and make two key contributions. First, we introduce a bi-directional action context regularizer module on the top of the decoder that ensures temporal context coherence in temporally adjacent segments. Second, we learn from classified segments a transition matrix that models the probability of transitioning from one action to another and the sequence is optimized globally over the full prediction interval. In addition, we use a specialized encoder for the task of action segmentation to increase the quality of the predictions in the observation interval at inference time, leading to a better understanding of the past. We validate our methods on four benchmark datasets for LTA, the EpicKitchen-55, EGTEA+, 50Salads and Breakfast demonstrating superior or comparable performance to state-of-the-art methods, including probabilistic models and also those based on Large Language Models, that assume trimmed video as input. The code will be released upon acceptance.
zh
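
针对摘要中“从已分类片段学习动作转移矩阵”这一步,下面给出一个按出现频次统计一阶转移概率并做拉普拉斯平滑的最小示意;函数名与平滑方式均为假设,不代表论文的具体实现。

```python
import numpy as np

def build_transition_matrix(action_sequences, num_actions, smoothing=1.0):
    """由若干条动作标签序列估计一阶转移概率 P(a_{t+1} | a_t)(示意)。"""
    counts = np.full((num_actions, num_actions), smoothing)   # 拉普拉斯平滑
    for seq in action_sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)         # 每行归一化为概率分布

# 用法示例:3 类动作、两条训练序列
T = build_transition_matrix([[0, 1, 1, 2], [0, 2, 1]], num_actions=3)
```

得到的转移矩阵可在推理时作为全局约束,对整个预测区间内的动作序列进行打分或重排序。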

[CV-46] Generalized Uncertainty-Based Evidential Fusion with Hybrid Multi-Head Attention for Weak-Supervised Temporal Action Localization

【速读】: 该论文旨在解决弱监督时序动作定位(WS-TAL)任务中的动作-背景模糊性问题,这一问题主要由背景噪声(background noise)和动作内部变化(intra-action variation)引起。为了解决这一问题,论文提出了两个关键模块:混合多头注意力(HMHA)模块和基于广义不确定性的证据融合(GUEF)模块。HMHA模块通过过滤冗余信息并调整特征分布,有效增强了RGB和光流特征,使其更符合WS-TAL任务的需求。GUEF模块则通过融合片段级证据来优化不确定性度量,并选择更优的前景特征信息,从而自适应地消除背景噪声的干扰,使模型能够专注于完整的动作实例,提升动作定位和分类的性能。实验结果表明,该方法在THUMOS14数据集上优于现有最先进的方法。

链接: https://arxiv.org/abs/2412.19418
作者: Yuanpeng He,Lijian Li,Tianxiang Zhan,Wenpin Jiao,Chi-Man Pun
机构: 未知
关键词: Weakly supervised temporal, Weakly supervised, supervised temporal action, localizing complete action, video-level labels
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Weakly supervised temporal action localization (WS-TAL) is the task of localizing complete action instances and categorizing them with video-level labels. Action-background ambiguity, primarily caused by background noise resulting from aggregation and intra-action variation, is a significant challenge for existing WS-TAL methods. In this paper, we introduce a hybrid multi-head attention (HMHA) module and a generalized uncertainty-based evidential fusion (GUEF) module to address the problem. The proposed HMHA effectively enhances RGB and optical flow features by filtering redundant information and adjusting their feature distribution to better align with the WS-TAL task. Additionally, the proposed GUEF adaptively eliminates the interference of background noise by fusing snippet-level evidences to refine uncertainty measurement and select superior foreground feature information, which enables the model to concentrate on integral action instances to achieve better action localization and classification performance. Experimental results conducted on the THUMOS14 dataset demonstrate that our method outperforms state-of-the-art methods. Our code is available at this https URL.
zh
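
摘要中的“基于不确定性的证据融合”通常建立在证据深度学习/主观逻辑的不确定性度量之上。下面给出这一类方法常用的不确定性定义,属于通用公式示意;论文中“广义”形式的具体改动未在摘要中给出:

```latex
% K 为类别数,e_k \ge 0 为第 k 类的证据,对应 Dirichlet 参数 \alpha_k = e_k + 1
S=\sum_{k=1}^{K}(e_k+1),\qquad
b_k=\frac{e_k}{S},\qquad
u=\frac{K}{S},\qquad
u+\sum_{k=1}^{K} b_k = 1
```

其中 b_k 为各类别的信念质量,u 为整体不确定性;证据越少则 u 越大,可据此在片段级融合中抑制背景噪声的干扰(与摘要描述的思路一致,具体融合方式以论文的 GUEF 为准)。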

[CV-47] KALAHash: Knowledge-Anchored Low-Resource Adaptation for Deep Hashing AAAI2025

【速读】: 该论文旨在解决在低资源(low-resource)场景下深度哈希(deep hashing)模型的适应性问题。现有深度哈希方法通常依赖于大量训练数据,而在训练样本极少的条件下,模型性能会因数据分布偏移(distribution shift)而显著下降。为此,论文提出了两种关键解决方案:首先,引入了类校准低秩适应(Class-Calibration LoRA, CLoRA),该方法通过利用类级别的文本知识嵌入(class-level textual knowledge embeddings)动态构建低秩适应矩阵,从而在保持原始数据分布的同时实现参数高效的微调(parameter-efficient fine-tuning)。其次,提出了知识引导的离散优化(Knowledge-Guided Discrete Optimization, KIDDO)框架,利用类知识补偿视觉信息的稀缺性,增强哈希码的区分性。最终,论文提出的知识锚定低资源适应哈希(Knowledge-Anchored Low-Resource Adaptation Hashing, KALAHash)方法显著提升了检索性能,并在低资源场景下实现了4倍的数据效率。

链接: https://arxiv.org/abs/2412.19417
作者: Shu Zhao,Tan Yu,Xiaoshuai Hao,Wenchao Ma,Vijaykrishnan Narayanan
机构: 未知
关键词: nearest neighbor search, large-scale approximate nearest, approximate nearest neighbor, neighbor search due, Deep hashing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2025

点击查看摘要

Abstract:Deep hashing has been widely used for large-scale approximate nearest neighbor search due to its storage and search efficiency. However, existing deep hashing methods predominantly rely on abundant training data, leaving the more challenging scenario of low-resource adaptation for deep hashing relatively underexplored. This setting involves adapting pre-trained models to downstream tasks with only an extremely small number of training samples available. Our preliminary benchmarks reveal that current methods suffer significant performance degradation due to the distribution shift caused by limited training samples. To address these challenges, we introduce Class-Calibration LoRA (CLoRA), a novel plug-and-play approach that dynamically constructs low-rank adaptation matrices by leveraging class-level textual knowledge embeddings. CLoRA effectively incorporates prior class knowledge as anchors, enabling parameter-efficient fine-tuning while maintaining the original data distribution. Furthermore, we propose Knowledge-Guided Discrete Optimization (KIDDO), a framework to utilize class knowledge to compensate for the scarcity of visual information and enhance the discriminability of hash codes. Extensive experiments demonstrate that our proposed method, Knowledge-Anchored Low-Resource Adaptation Hashing (KALAHash), significantly boosts retrieval performance and achieves a 4x data efficiency in low-resource scenarios.
zh
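
下面用 PyTorch 勾勒“由类级文本嵌入动态构造低秩适应矩阵”的一种可能做法:冻结原线性层权重,用文本嵌入经线性映射生成低秩因子 A,再与可学习的 B 相乘得到增量权重。类名、映射方式与形状约定均为假设,仅作示意,并非 CLoRA 的官方实现。

```python
import torch
import torch.nn as nn

class ClassCalibratedLoRA(nn.Module):
    """示意:由单个类级文本嵌入向量生成低秩因子 A,学习 delta_W = B @ A。"""
    def __init__(self, in_dim, out_dim, text_dim, rank=4):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)             # 冻结预训练权重
        self.to_A = nn.Linear(text_dim, rank * in_dim)     # 文本知识 -> 低秩因子 A
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # 零初始化,训练初期不改变原输出
        self.rank, self.in_dim = rank, in_dim

    def forward(self, x, class_text_emb):
        # x: (B, in_dim);class_text_emb: (text_dim,) 单个类级文本嵌入
        A = self.to_A(class_text_emb).view(self.rank, self.in_dim)
        delta_w = self.B @ A                               # (out_dim, in_dim) 低秩适应矩阵
        return self.base(x) + x @ delta_w.t()
```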

[CV-48] Multi-scale Latent Point Consistency Models for 3D Shape Generation

【速读】: 该论文旨在解决基于点云的3D形状生成中采样效率低下的问题,特别是在扩散模型(diffusion models)中采样过程耗时较长的情况下。为了解决这一问题,作者提出了一种新颖的多尺度潜在点一致性模型(Multi-scale Latent Point Consistency Model, MLPCM)。该模型的关键创新在于引入了多层次的潜在表示(latent representations),从点级别到超点级别,每个级别对应不同的空间分辨率。通过设计一个多尺度潜在集成模块并结合3D空间注意力机制,MLPCM能够有效地对点级别的潜在表示进行去噪,同时利用多个超点级别的信息进行条件生成。此外,作者还提出了一种通过一致性蒸馏(consistency distillation)学习的潜在一致性模型,将先验压缩为一步生成器,从而显著提高了采样效率,同时保持了原始教师模型的性能。实验结果表明,MLPCM在生成过程中实现了100倍的加速,并且在形状质量和多样性方面超越了现有的最先进扩散模型。

链接: https://arxiv.org/abs/2412.19413
作者: Bi’an Du,Wei Hu,Renjie Liao
机构: 未知
关键词: synthesizing high-resolution images, yielding impressive results, Latent Point Consistency, Point Consistency Model, yielding impressive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Consistency Models (CMs) have significantly accelerated the sampling process in diffusion models, yielding impressive results in synthesizing high-resolution images. To explore and extend these advancements to point-cloud-based 3D shape generation, we propose a novel Multi-scale Latent Point Consistency Model (MLPCM). Our MLPCM follows a latent diffusion framework and introduces hierarchical levels of latent representations, ranging from point-level to super-point levels, each corresponding to a different spatial resolution. We design a multi-scale latent integration module along with 3D spatial attention to effectively denoise the point-level latent representations conditioned on those from multiple super-point levels. Additionally, we propose a latent consistency model, learned through consistency distillation, that compresses the prior into a one-step generator. This significantly improves sampling efficiency while preserving the performance of the original teacher model. Extensive experiments on standard benchmarks ShapeNet and ShapeNet-Vol demonstrate that MLPCM achieves a 100x speedup in the generation process, while surpassing state-of-the-art diffusion models in terms of both shape quality and diversity.
zh

[CV-49] MINIMA: Modality Invariant Image Matching

【速读】: 该论文旨在解决跨视角和跨模态图像匹配中的模态差异问题,这一问题由不同成像系统或风格引起,导致现有方法在有限数据集上训练的泛化能力较差。论文提出的解决方案是MINIMA,一个统一的多跨模态图像匹配框架。其关键在于通过数据扩展提升通用性能,具体采用了一个简单但有效的数据引擎,能够自由生成包含多种模态、丰富场景和精确匹配标签的大规模数据集。通过生成模型从廉价但丰富的RGB匹配数据中扩展模态,生成的跨模态数据继承了RGB数据集的匹配标签和多样性,从而构建了MD-syn数据集。该数据集填补了通用跨模态图像匹配的数据空白,使得可以直接在任意模态对上训练先进的匹配管道,获得跨模态能力。实验结果表明,MINIMA在多种跨模态任务中显著优于基线方法,甚至超越了特定模态的方法。

链接: https://arxiv.org/abs/2412.19412
作者: Xingyu Jiang,Jiangwei Ren,Zizhuo Li,Xin Zhou,Dingkang Liang,Xiang Bai
机构: 未知
关键词: cross-view and cross-modality, cross-modality plays, plays a critical, critical role, matching
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The dataset and code are available at this https URL

点击查看摘要

Abstract:Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, our MINIMA aims to enhance universal performance from the perspective of data scaling up. For such purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate matching labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching data, by means of generative models. Under this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multimodal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on in-domain and zero-shot matching tasks, including 19 cross-modal cases, demonstrate that our MINIMA can significantly outperform the baselines and even surpass modality-specific methods. The dataset and code are available at this https URL .
zh

[CV-50] MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios

【速读】: 该论文旨在解决自动驾驶任务中的联合语义场景理解(joint semantic scene understanding)和风险定位(risk localization)问题,仅依赖前视图像。解决方案的关键在于提出了一个名为MLLM-SUL的多模态大语言模型(Multimodal Large Language Models, MLLMs)框架。该框架首先设计了一个双分支视觉编码器(dual-branch visual encoder),用于从两种分辨率中提取特征,丰富的视觉信息有助于语言模型准确描述不同大小的风险对象。接着,通过微调LLaMA模型生成场景描述,包括驾驶场景类型、风险对象的动作以及自车的驾驶意图和建议。最后,训练了一个基于Transformer的网络,结合回归标记(regression token)来定位风险对象。实验结果表明,该方法在现有DRAMA-ROLISP数据集和扩展的DRAMA-SRIS数据集上表现优异,超越了多种基于图像和视频的最先进方法。

链接: https://arxiv.org/abs/2412.19406
作者: Jiaqi Fan,Jianhua Wu,Jincheng Gao,Jianhao Yu,Yafei Wang,Hongqing Chu,Bingzhao Gao
机构: 未知
关键词: Multimodal large language, shown satisfactory effects, Multimodal large, autonomous driving tasks, large language models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown satisfactory effects in many autonomous driving tasks. In this paper, MLLMs are utilized to solve joint semantic scene understanding and risk localization tasks, while only relying on front-view images. In the proposed MLLM-SUL framework, a dual-branch visual encoder is first designed to extract features from two resolutions, and rich visual information is conducive to the language model describing risk objects of different sizes accurately. Then for the language generation, LLaMA model is fine-tuned to predict scene descriptions, containing the type of driving scenario, actions of risk objects, and driving intentions and suggestions of ego-vehicle. Ultimately, a transformer-based network incorporating a regression token is trained to locate the risk objects. Extensive experiments on the existing DRAMA-ROLISP dataset and the extended DRAMA-SRIS dataset demonstrate that our method is efficient, surpassing many state-of-the-art image-based and video-based methods. Specifically, our method achieves 80.1% BLEU-1 score and 298.5% CIDEr score in the scene understanding task, and 59.6% accuracy in the localization task. Codes and datasets are available at this https URL.
zh

[CV-51] Spectral-Temporal Fusion Representation for Person-in-Bed Detection

【速读】: 该论文旨在解决基于加速度计信号(accelerometer signals)的床上人员检测问题,具体任务包括“在床”和“不在床”的分段检测以及流式检测。面临的挑战包括个体差异、姿势变化和外部干扰。论文提出的解决方案关键在于采用了一种基于频谱-时域融合的特征表示方法,并结合了混合数据增强(mixup data augmentation)技术,同时使用交并比损失(Intersection over Union, IoU)来优化检测精度。该方法在两个任务中分别取得了100.00%和95.55%的检测得分,分别获得了第一名和第三名的成绩。

链接: https://arxiv.org/abs/2412.19404
作者: Xuefeng Yang,Shiheng Zhang,Jian Guan,Feiyang Xiao,Wei Lu,Qiaoxi Zhu
机构: 未知
关键词: Signal Processing Grand, Processing Grand Challenge, Grand Challenge Accelerometer-Based, Processing Grand, Signal Processing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study is based on the ICASSP 2025 Signal Processing Grand Challenge’s Accelerometer-Based Person-in-Bed Detection Challenge, which aims to determine bed occupancy using accelerometer signals. The task is divided into two tracks: “in bed” and “not in bed” segmented detection, and streaming detection, facing challenges such as individual differences, posture variations, and external disturbances. We propose a spectral-temporal fusion-based feature representation method with mixup data augmentation, and adopt Intersection over Union (IoU) loss to optimize detection accuracy. In the two tracks, our method achieved outstanding results of 100.00% and 95.55% in detection scores, securing first place and third place, respectively.
zh
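
摘要提到的 mixup 数据增强与 IoU 损失都是通用组件,下面给出它们在一维时间区间(如“在床”时间段)场景下的极简 PyTorch 示意;实现细节与比赛代码无关,属于假设性写法。

```python
import torch

def mixup(x, y, alpha=0.2):
    """经典 mixup:按 Beta 分布采样的系数线性混合样本与(浮点型)标签。"""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]

def interval_iou_loss(pred, target, eps=1e-6):
    """一维区间 (start, end) 的 IoU 损失,可用于时间段回归;pred/target 形状为 (N, 2)。"""
    inter = (torch.min(pred[:, 1], target[:, 1]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    return (1 - inter / (union + eps)).mean()
```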

[CV-52] An In-Depth Analysis of Adversarial Discriminative Domain Adaptation for Digit Classification

【速读】: 该论文旨在解决领域自适应(Domain Adaptation)问题,特别是在图像分类任务中,如何通过对抗学习(Adversarial Learning)提升深度神经网络(DNNs)在真实世界数据上的泛化能力。论文的核心解决方案是采用了一种特定的对抗学习技术,即对抗判别领域自适应(Adversarial Discriminative Domain Adaptation, ADDA),并通过复现原始ADDA论文中的数字分类实验,进一步扩展了研究范围,考察了更广泛的领域偏移(Domain Shifts)。研究结果表明,ADDA在特定领域偏移下显著提高了分类准确率,同时对域内分类性能影响较小。此外,论文还提供了定性分析,并提出了ADDA在某些领域偏移中表现不佳的潜在原因。

链接: https://arxiv.org/abs/2412.19391
作者: Eugene Choi,Julian Rodriguez,Edmund Young
机构: 未知
关键词: robust machine learning, machine learning models, real-world data, Discriminative Domain Adaptation, active area
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Domain adaptation is an active area of research driven by the growing demand for robust machine learning models that perform well on real-world data. Adversarial learning for deep neural networks (DNNs) has emerged as a promising approach to improving generalization ability, particularly for image classification. In this paper, we implement a specific adversarial learning technique known as Adversarial Discriminative Domain Adaptation (ADDA) and replicate digit classification experiments from the original ADDA paper. We extend their findings by examining a broader range of domain shifts and provide a detailed analysis of in-domain classification accuracy post-ADDA. Our results demonstrate that ADDA significantly improves accuracy across certain domain shifts with minimal impact on in-domain performance. Furthermore, we provide qualitative analysis and propose potential explanations for ADDA’s limitations in less successful domain shifts. Code is at this https URL.
zh
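
ADDA 的核心流程是:先在源域上预训练编码器与分类器,然后冻结源编码器,用域判别器对抗地训练目标域编码器,使目标特征与源特征分布对齐。下面是省略模型定义与数据加载的单步更新骨架,变量名与超参均为示意性假设。

```python
import torch
import torch.nn.functional as F

def adda_step(src_enc, tgt_enc, disc, opt_tgt, opt_disc, x_src, x_tgt):
    """ADDA 单步对抗更新的示意(src_enc 已预训练并冻结)。"""
    with torch.no_grad():
        f_src = src_enc(x_src)                      # 源域特征(固定)
    f_tgt = tgt_enc(x_tgt)

    # 1) 更新判别器:区分源特征(标签 1)与目标特征(标签 0)
    d_loss = F.binary_cross_entropy_with_logits(disc(f_src), torch.ones(f_src.size(0), 1)) \
           + F.binary_cross_entropy_with_logits(disc(f_tgt.detach()), torch.zeros(f_tgt.size(0), 1))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) 更新目标编码器:以相反标签欺骗判别器
    g_loss = F.binary_cross_entropy_with_logits(disc(tgt_enc(x_tgt)), torch.ones(x_tgt.size(0), 1))
    opt_tgt.zero_grad(); g_loss.backward(); opt_tgt.step()
    return d_loss.item(), g_loss.item()
```

推理时直接用训练好的目标编码器接上源域分类器即可完成目标域分类。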

[CV-53] BeSplat – Gaussian Splatting from a Single Blurry Image and Event Stream WACV-25

【速读】: 该论文旨在解决从单张运动模糊图像及其对应的事件流(event stream)中恢复出清晰的辐射场(radiance field)这一高度不适定问题(ill-posed problem)。传统方法如神经辐射场(NeRF)在训练时间和渲染速度上存在显著挑战,而3D高斯泼溅(3D Gaussian Splatting, 3DGS)则有效解决了这些问题。本文提出的方法(BeSplat)通过高斯泼溅联合学习场景表示,并利用贝塞尔SE(3)公式(Bezier SE(3) formulation)恢复相机运动,从而最小化合成图像与真实世界测量(包括模糊图像和事件流)之间的差异。该方法的关键在于有效结合了事件流捕获的时间信息,首次在高斯泼溅框架下解决了这一复杂问题,并展示了从学习的辐射场和估计的相机轨迹中渲染出视角一致且清晰图像的能力。

链接: https://arxiv.org/abs/2412.19370
作者: Gopi Raju Matta,Reddypalli Trisha,Kaushik Mitra
机构: 未知
关键词: Neural Radiance Fields, Gaussian Splatting, radiance field, radiance field methods, sharp radiance field
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at EVGEN2025, WACV-25 Workshop

点击查看摘要

Abstract:Novel view synthesis has been greatly enhanced by the development of radiance field methods. The introduction of 3D Gaussian Splatting (3DGS) has effectively addressed key challenges, such as long training times and slow rendering speeds, typically associated with Neural Radiance Fields (NeRF), while maintaining high-quality reconstructions. In this work (BeSplat), we demonstrate the recovery of sharp radiance field (Gaussian splats) from a single motion-blurred image and its corresponding event stream. Our method jointly learns the scene representation via Gaussian Splatting and recovers the camera motion through Bezier SE(3) formulation effectively, minimizing discrepancies between synthesized and real-world measurements of both blurry image and corresponding event stream. We evaluate our approach on both synthetic and real datasets, showcasing its ability to render view-consistent, sharp images from the learned radiance field and the estimated camera trajectory. To the best of our knowledge, ours is the first work to address this highly challenging ill-posed problem in a Gaussian Splatting framework with the effective incorporation of temporal information captured using the event stream.
zh

[CV-54] Improving the network traffic classification using the Packet Vision approach

【速读】: 该论文旨在解决网络流量分类(network traffic classification)问题,以提升网络管理和服务提供的效率,特别是在未来移动网络架构中实现应用感知的网络需求。解决方案的关键在于开发一种基于卷积神经网络(CNN, Convolutional Neural Networks)的方法,将网络数据包的内容(包括头部和有效载荷)转换为适合CNN处理的图像格式。论文提出的Packet Vision方法通过将原始数据包转换为图像,不仅提升了分类性能,还确保了数据的安全性和隐私性。通过构建包含四类网络流量的数据集,并评估AlexNet、ResNet-18和SqueezeNet三种CNN架构的性能,实验结果表明Packet Vision结合CNN在网络流量分类中表现出色,具有广泛的应用前景。

链接: https://arxiv.org/abs/2412.19360
作者: Rodrigo Moreira,Larissa Ferreira Rodrigues,Pedro Frosi Rosa,Flávio de Oliveira Silva
机构: 未知
关键词: services offer taking, network services offer, network traffic classification, improving the management, kind of application
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:Network traffic classification allows improving network management and service offering by taking the kind of application into account. Future network architectures, mainly mobile networks, foresee intelligent mechanisms in their architectural frameworks to deliver application-aware network requirements. The capabilities of convolutional neural networks, widely exploited in several contexts, can also be used for network traffic classification. Thus, it is necessary to develop methods that transform packet content into a suitable input for CNN technologies. Hence, we implemented and evaluated Packet Vision, a method capable of building images from raw packet data, considering both header and payload. Our approach surpasses those found in the state of the art by delivering security and privacy, since the raw packet data is transformed into images. We built a dataset with four traffic classes and evaluated the performance of three CNN architectures: AlexNet, ResNet-18, and SqueezeNet. Experiments showcase the applicability and suitability of Packet Vision combined with CNNs as a promising approach to deliver outstanding performance in classifying network traffic.
zh
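
Packet Vision 的关键步骤是把数据包原始字节(头部 + 载荷)转换成 CNN 可处理的图像,这一思路可以用几行 Python 表达;图像尺寸、截断/补零与归一化策略均为假设,仅示意流程,并非论文的确切做法。

```python
import numpy as np

def packet_to_image(raw_bytes: bytes, side: int = 32) -> np.ndarray:
    """将单个数据包的原始字节填充/截断为 side x side 的灰度图(示意)。"""
    arr = np.frombuffer(raw_bytes, dtype=np.uint8)
    arr = arr[: side * side]                               # 截断过长的包
    arr = np.pad(arr, (0, side * side - len(arr)))         # 过短则补零
    return arr.reshape(side, side)                         # 作为 CNN 的输入图像

img = packet_to_image(b"\x45\x00\x00\x3c" + b"\x00" * 60)  # 假设的若干字节,仅演示调用
```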

[CV-55] Federated Hybrid Training and Self-Adversarial Distillation: Towards Robust Edge Networks

【速读】: 该论文旨在解决联邦学习(Federated Learning, FL)在移动边缘网络中面临的数据异构性和对抗性攻击问题,这些问题导致难以开发出无偏且鲁棒的全局模型用于边缘部署。为解决这些问题,论文提出了联邦混合对抗训练和自对抗蒸馏(Federated hyBrid Adversarial training and self-adversarial disTillation, FedBAT)框架。FedBAT的关键在于从数据增强和特征蒸馏两个角度,将混合对抗训练和自对抗蒸馏无缝集成到传统联邦学习框架中。具体而言,混合对抗训练通过加权结合标准训练和对抗训练,在保持模型准确性的同时提升其鲁棒性;而自对抗蒸馏则通过一种新颖的增强不变对抗蒸馏方法,将增强图像的局部对抗特征与其对应的无偏全局干净特征对齐,从而有效缓解数据异构性带来的偏差,并增强全局模型的鲁棒性和泛化能力。实验结果表明,FedBAT在多个数据集上均表现出优于或可比于现有基线的性能提升。

链接: https://arxiv.org/abs/2412.19354
作者: Yu Qiao,Apurba Adhikary,Kitae Kim,Eui-Nam Huh,Zhu Han,Choong Seon Hong
机构: 未知
关键词: mobile edge networks, distributed training technology, enhances data privacy, allowing data owners, transmitting raw data
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Federated learning (FL) is a distributed training technology that enhances data privacy in mobile edge networks by allowing data owners to collaborate without transmitting raw data to the edge server. However, data heterogeneity and adversarial attacks pose challenges to develop an unbiased and robust global model for edge deployment. To address this, we propose Federated hyBrid Adversarial training and self-adversarial disTillation (FedBAT), a new framework designed to improve both robustness and generalization of the global model. FedBAT seamlessly integrates hybrid adversarial training and self-adversarial distillation into the conventional FL framework from data augmentation and feature distillation perspectives. From a data augmentation perspective, we propose hybrid adversarial training to defend against adversarial attacks by balancing accuracy and robustness through a weighted combination of standard and adversarial training. From a feature distillation perspective, we introduce a novel augmentation-invariant adversarial distillation method that aligns local adversarial features of augmented images with their corresponding unbiased global clean features. This alignment can effectively mitigate bias from data heterogeneity while enhancing both the robustness and generalization of the global model. Extensive experimental results across multiple datasets demonstrate that FedBAT yields comparable or superior performance gains in improving robustness while maintaining accuracy compared to several baselines.
zh

[CV-56] CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

【速读】: 该论文旨在解决多图像中基于对象部分(object parts)的语义共分割(semantic co-segmentation)问题,特别是在更细粒度上识别和分割跨图像中的共同和独特对象及其部分。现有的大规模视觉-语言模型(LVLMs)虽然在单图像中能够生成与自然语言描述对齐的分割掩码,但在多图像中进行基于分割的比较时表现不佳,尤其是在对象部分级别。为此,论文提出了CALICO模型,这是首个能够在多图像中进行分割和推理的LVLM,支持基于对象组成部分的比较。CALICO的关键创新在于其两个核心模块:对应提取模块(Correspondence Extraction Module),用于捕捉语义丰富的信息以识别对象之间的部分级对应关系;以及对应适应模块(Correspondence Adaptation Module),以参数高效的方式将这些信息嵌入LVLM,从而促进多图像理解。此外,论文还构建了MixedParts数据集,包含约2.4M样本和44K图像,涵盖多样化的对象和部分类别,用于训练和评估。实验结果表明,CALICO仅在其架构的0.3%上进行微调,即可在部分聚焦的语义共分割任务中表现出色。

链接: https://arxiv.org/abs/2412.19331
作者: Kiet A. Nguyen,Adheesh Juvekar,Tianjiao Yu,Muntasir Wahed,Ismini Lourentzou
机构: 未知
关键词: Large Vision-Language Models, visual instruction tuning, sparked significant progress, Recent advances, Vision-Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances in Large Vision-Language Models (LVLMs) have sparked significant progress in general-purpose vision tasks through visual instruction tuning. While some works have demonstrated the capability of LVLMs to generate segmentation masks that align phrases with natural language descriptions in a single image, they struggle with segmentation-grounded comparisons across multiple images, particularly at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which seeks to identify and segment common and unique objects and parts across images. To address this task, we present CALICO, the first LVLM that can segment and reason over multiple masks across images, enabling object comparison based on their constituent parts. CALICO features two proposed components, a novel Correspondence Extraction Module, which captures semantic-rich information to identify part-level correspondences between objects, and a Correspondence Adaptation Module, which embeds this information into the LVLM to facilitate multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a comprehensive multi-image segmentation dataset containing ~2.4M samples across ~44K images with diverse object and part categories. Experimental results show CALICO, finetuned on only 0.3% of its architecture, achieves robust performance in part-focused semantic co-segmentation.
zh

[CV-57] Resolving the Ambiguity of Complete-to-Partial Point Cloud Registration for Image-Guided Liver Surgery with Patches-to-Partial Matching

【速读】: 该论文旨在解决图像引导肝脏手术中术前与术中数据(通常表示为点云)之间的初始刚性配准问题,特别是在术中表面可见性有限的情况下(如腹腔镜手术中常见的完全到部分模糊性)。当前的半自动配准方法虽然在一定程度上有效,但容易产生误差,需要手动校正。论文提出了一种基于点云对应关系的配准方法,并引入了一种“patches-to-partial matching”策略作为即插即用模块,以解决完全到部分模糊性问题。该模块能够无缝集成到基于学习的配准方法中,且不破坏其端到端结构,从而在术中可见性有限的情况下显著提升配准性能。这一解决方案的关键在于通过构建的基准数据集和提出的模块,为点云对应关系配准方法在图像引导肝脏手术中的应用奠定了坚实基础。

链接: https://arxiv.org/abs/2412.19328
作者: Zixin Yang,Jon S. Heiselman,Cheng Han,Kelly Merrell,Richard Simon,Cristian. A. Linte
机构: 未知
关键词: providing sub-surface information, MRI images, initial rigid alignment, image-guided liver surgery, initial rigid
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In image-guided liver surgery, the initial rigid alignment between preoperative and intraoperative data, often represented as point clouds, is crucial for providing sub-surface information from preoperative CT/MRI images to the surgeon during the procedure. Currently, this alignment is typically performed using semi-automatic methods, which, while effective to some extent, are prone to errors that demand manual correction. Point cloud correspondence-based registration methods are promising to serve as a fully automatic solution. However, they may struggle in scenarios with limited intraoperative surface visibility, a common challenge in liver surgery, particularly in laparoscopic procedures, which we refer to as complete-to-partial ambiguity. We first illustrate this ambiguity by evaluating the performance of state-of-the-art learning-based point cloud registration methods on our carefully constructed in silico and in vitro datasets. Then, we propose a patches-to-partial matching strategy as a plug-and-play module to resolve the ambiguity, which can be seamlessly integrated into learning-based registration methods without disrupting their end-to-end structure. It has proven effective and efficient in improving registration performance for cases with limited intraoperative visibility. The constructed benchmark and the proposed module establish a solid foundation for advancing applications of point cloud correspondence-based registration methods in image-guided liver surgery.
zh

[CV-58] Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉任务中难以实现细粒度或精确理解的问题。尽管MLLMs在广泛的视觉应用中展现出全面的感知和推理能力,但在处理精细视觉任务时仍存在不足。为解决这一问题,论文提出了一种名为任务偏好优化(Task Preference Optimization, TPO)的新方法。TPO的关键在于利用从典型细粒度视觉任务中导出的可微分任务偏好,通过引入可学习的任务标记(task tokens)来建立多个任务特定头(task-specific heads)与MLLM之间的连接。通过在训练过程中利用丰富的视觉标签,TPO显著提升了MLLM的多模态能力和任务特定性能。此外,TPO通过多任务协同训练实现了协同效应,使得单个任务的性能超越了单任务训练方法。实验结果表明,基于TPO的VideoChat和LLaVA模型在多模态性能上相比基线模型整体提升了14.6%,并在多种任务中展现出强大的零样本能力,与当前最先进的监督模型表现相当。

链接: https://arxiv.org/abs/2412.19326
作者: Ziang Yan,Zhilin Li,Yinan He,Chenting Wang,Kunchang Li,Xinhao Li,Xiangyu Zeng,Zilei Wang,Yali Wang,Yu Qiao,Limin Wang,Yi Wang
机构: 未知
关键词: Current multimodal large, give comprehensive perception, multimodal large language, large language models, Current multimodal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report

点击查看摘要

Abstract:Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals though they give comprehensive perception and reasoning in a spectrum of vision applications. Recent studies either develop tool-using or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM’s multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training methodologies. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at this https URL
zh

[CV-59] Perceive Query Reason: Enhancing Video QA with Question-Guided Temporal Queries WACV2025

【速读】: 该论文旨在解决视频问答(Video QA)任务中的时空对齐(space-time alignment)问题,即在多帧视频中提取与问题相关的信息。视频问答要求模型能够理解整个视频内容,并根据问题的上下文线索识别最相关的信息,进而进行准确推理以提供答案。尽管多模态大语言模型(MLLMs)在常识推理方面表现出色,并有效对齐了视觉数据和语言空间,但在视频问答中,跨帧的时空对齐仍然是一个重大挑战。为此,论文提出了一种名为T-Former的新型时序建模方法,该方法在帧级视觉感知与大语言模型的推理能力之间建立了一个问题引导的时序桥梁。T-Former的关键在于结合了预训练的视觉和文本对齐,并通过问题引导的时序建模技术,实现了更有效的跨帧信息提取和推理。实验结果表明,T-Former在多个视频问答基准测试中表现优异,与现有的时序建模方法相比具有竞争力,并与视频问答领域的最新进展保持一致。

链接: https://arxiv.org/abs/2412.19304
作者: Roberto Amoroso,Gengyuan Zhang,Rajat Koner,Lorenzo Baraldi,Rita Cucchiara,Volker Tresp
机构: 未知
关键词: Video Question Answering, Question Answering, Large Language Models, challenging video understanding, video understanding task
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2025

点击查看摘要

Abstract:Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of MLLMs. However, for video QA, an additional space-time alignment poses a considerable challenge for extracting question-relevant information across frames. In this work, we investigate diverse temporal modeling techniques to integrate with MLLMs, aiming to achieve question-guided temporal modeling that leverages pre-trained visual and textual alignment in MLLMs. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across multiple video QA benchmarks demonstrates that T-Former competes favorably with existing temporal modeling approaches and aligns with recent advancements in video QA.
zh

[CV-60] Manga Generation via Layout-controllable Diffusion

【速读】: 该论文旨在解决基于纯文本生成多面板日本漫画(Manga)的挑战。日本漫画具有故事连贯性、合理且多样化的页面布局、角色一致性以及面板绘图与面板脚本之间的语义对应等特点,因此生成漫画面临较大难度。论文提出了漫画生成任务,并构建了Manga109Story数据集,用于研究仅基于纯文本的漫画生成。关键解决方案是提出了MangaDiffusion方法,该方法在漫画生成过程中促进了面板内和面板间的信息交互。实验结果表明,该方法特别确保了面板数量、合理且多样化的页面布局,展示了将大量文本故事转化为更具吸引力的漫画阅读的潜在应用前景。

链接: https://arxiv.org/abs/2412.19303
作者: Siyu Chen,Dengjie Li,Zenghao Bao,Yao Zhou,Lingfeng Tan,Yujie Zhong,Zheng Zhao
机构: 未知
关键词: widely studied, Manga, Japanese comics, diverse page layouts, Generating
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating comics from text is widely studied. However, there are few studies on generating multi-panel Manga (Japanese comics) solely based on plain text. Japanese manga contains multiple panels on a single page, with characteristics such as coherence in storytelling, reasonable and diverse page layouts, consistency in characters, and semantic correspondence between panel drawings and panel scripts. Therefore, generating manga poses a significant challenge. This paper presents the manga generation task and constructs the Manga109Story dataset for studying manga generation solely from plain text. Additionally, we propose MangaDiffusion to facilitate the intra-panel and inter-panel information interaction during the manga generation process. The results show that our method in particular ensures a correct number of panels as well as reasonable and diverse page layouts. Based on our approach, there is potential to convert a large number of textual stories into more engaging manga readings, leading to significant application prospects.
zh

[CV-61] When SAM2 Meets Video Shadow and Mirror Detection

【速读】: 该论文旨在评估Segment Anything Model 2 (SAM2)在视频分割任务中的表现,特别是针对视频中罕见物体的分割效果。研究聚焦于两个具体任务:视频阴影检测(Video Shadow Detection, VSD)和视频镜面检测(Video Mirror Detection, VMD)。解决方案的关键在于使用地面真实点或掩码提示(ground truth point or mask prompts)来初始化第一帧,并预测后续帧的相应掩码。实验结果表明,SAM2在这些任务中的表现欠佳,尤其是在使用点提示时,无论是定量还是定性分析均显示出其局限性。

链接: https://arxiv.org/abs/2412.19293
作者: Leiping Jie
机构: 未知
关键词: Segment Anything Model, Model, Segment, Video Shadow Detection, Video Mirror Detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:As the successor to the Segment Anything Model (SAM), the Segment Anything Model 2 (SAM2) not only improves performance in image segmentation but also extends its capabilities to video segmentation. However, its effectiveness in segmenting rare objects that seldom appear in videos remains underexplored. In this study, we evaluate SAM2 on two distinct video segmentation tasks: Video Shadow Detection (VSD) and Video Mirror Detection (VMD). Specifically, we use ground truth point or mask prompts to initialize the first frame and then predict corresponding masks for subsequent frames. Experimental results show that SAM2’s performance on these tasks is suboptimal, especially when point prompts are used, both quantitatively and qualitatively. Code is available at this https URL
zh

[CV-62] Reflective Gaussian Splatting

【速读】: 该论文旨在解决反射物体重建(reflective object reconstruction)中的挑战,特别是在实现实时、高质量渲染的同时处理物体间的相互反射(inter-reflection)问题。现有的基于NeRF(Neural Radiance Fields)和3DGS(3D Gaussian Splatting)的方法在此类场景中表现不足。为此,作者提出了一种名为Reflective Gaussian Splatting(Ref-Gaussian)的框架,其关键创新包括两个核心组件:(I) 基于物理的延迟渲染(Physically based deferred rendering),通过拆分求和近似(split-sum approximation)在像素级别赋予渲染方程材料属性;(II) 基于高斯分布的相互反射(Gaussian-grounded inter-reflection),首次在高斯分布框架内实现了所需的相互反射功能。此外,作者还引入了材料感知的法线传播(material-aware normal propagation)和初始的每高斯着色阶段(per-Gaussian shading stage),以及2D高斯基元(2D Gaussian primitives),以增强几何建模。实验表明,Ref-Gaussian在定量指标、视觉质量和计算效率上均优于现有方法,并且能够统一处理反射和非反射场景,支持重光照(relighting)和编辑等更多应用。

链接: https://arxiv.org/abs/2412.19282
作者: Yuxuan Yao,Zixuan Zeng,Chun Gu,Xiatian Zhu,Li Zhang
机构: 未知
关键词: increasingly capable NeRF, experienced significant advancements, significant advancements owing, capable NeRF, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 14 figures

点击查看摘要

Abstract:Novel view synthesis has experienced significant advancements owing to increasingly capable NeRF- and 3DGS-based methods. However, reflective object reconstruction remains challenging, lacking a proper solution to achieve real-time, high-quality rendering while accommodating inter-reflection. To fill this gap, we introduce a Reflective Gaussian splatting (Ref-Gaussian) framework characterized by two components: (I) Physically based deferred rendering, which empowers the rendering equation with pixel-level material properties via a split-sum approximation; (II) Gaussian-grounded inter-reflection, which realizes the desired inter-reflection function within a Gaussian splatting paradigm for the first time. To enhance geometry modeling, we further introduce material-aware normal propagation and an initial per-Gaussian shading stage, along with 2D Gaussian primitives. Extensive experiments on standard datasets demonstrate that Ref-Gaussian surpasses existing approaches in terms of quantitative metrics, visual quality, and compute efficiency. Further, we show that our method serves as a unified solution for both reflective and non-reflective scenes, going beyond the previous alternatives focusing on only reflective scenes. Also, we illustrate that Ref-Gaussian supports more applications such as relighting and editing.
zh
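
摘要中提到的 split-sum 近似源自基于物理渲染中对镜面反射积分的经典拆分(Karis, 2013),其常见的蒙特卡洛形式如下;这里给出的是通用写法,符号为惯用约定,并非论文中的原始公式:

```latex
% l_k 为按法线分布重要性采样的入射方向,v 为视线方向,p 为采样概率密度
\frac{1}{N}\sum_{k=1}^{N}
\frac{L_i(\mathbf{l}_k)\, f(\mathbf{l}_k,\mathbf{v})\, \cos\theta_{\mathbf{l}_k}}{p(\mathbf{l}_k,\mathbf{v})}
\;\approx\;
\Bigg(\frac{1}{N}\sum_{k=1}^{N} L_i(\mathbf{l}_k)\Bigg)
\Bigg(\frac{1}{N}\sum_{k=1}^{N}
\frac{f(\mathbf{l}_k,\mathbf{v})\, \cos\theta_{\mathbf{l}_k}}{p(\mathbf{l}_k,\mathbf{v})}\Bigg)
```

第一项对应预滤波的环境光照,第二项只依赖粗糙度与视角,可预计算成查找表,从而支撑逐像素的延迟着色。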

[CV-63] FineVQ: Fine-Grained User Generated Content Video Quality Assessment

【速读】: 该论文旨在解决用户生成内容(UGC)视频质量评估(VQA)中缺乏细粒度标签的问题,以更好地支持视频处理和推荐应用。当前VQA模型通常仅提供UGC视频的整体评分,无法满足细粒度需求。为此,作者建立了首个大规模细粒度视频质量评估数据库(FineVD),包含6104个UGC视频,并提供了多维度细粒度质量评分和描述。基于此数据库,作者提出了一种细粒度视频质量评估模型(FineVQ),具备质量评级、评分和归因能力。实验结果表明,FineVQ在FineVD及其他常用UGC-VQA数据集上实现了最先进的性能。FineVD和FineVQ的公开将促进UGC视频质量评估的进一步发展。

链接: https://arxiv.org/abs/2412.19238
作者: Huiyu Duan,Qiang Hu,Jiarui Wang,Liu Yang,Zitong Xu,Lu Liu,Xiongkuo Min,Chunlei Cai,Tianxiao Ye,Xiaoyun Zhang,Guangtao Zhai
机构: 未知
关键词: UGC videos, video quality assessment, Fine-grained Video quality, effective video quality, monitor video quality
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that our proposed FineVQ can produce fine-grained video-quality results and achieve state-of-the-art performance on FineVD and other commonly used UGC-VQA datasets. Both FineVD and FineVQ will be made publicly available.
zh

[CV-64] SeaMo: A Multi-Seasonal and Multimodal Remote Sensing Foundation Model

【速读】: 该论文旨在解决当前遥感(Remote Sensing, RS)领域中视觉基础模型(Visual Foundation Models, VFMs)在预训练和微调过程中未能充分利用遥感数据多维度特性的问题。现有的VFMs主要针对遥感影像的特定特征进行预训练,忽略了遥感数据在时间、空间和多模态(multimodal)等方面的丰富信息。为此,论文提出了SeaMo模型,其关键创新在于整合了多季节(multi-seasonal)和多模态信息,通过掩码图像建模(masked image modeling)框架,采用非对齐裁剪技术提取空间特性,利用多源输入实现多模态融合,并通过时间-多模态融合块(temporal-multimodal fusion blocks)有效整合多季节数据。SeaMo显式地建模了遥感数据的多维度特性,使其在多个下游地球科学任务中表现出卓越的性能,并通过大量消融实验验证了其优越性。

链接: https://arxiv.org/abs/2412.19237
作者: Xuyang Li,Danfeng Hong,Chenyu Li,Jocelyn Chanussot
机构: 未知
关键词: Remote Sensing, Earth observation, crucial for Earth, data, Visual Foundation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Remote Sensing (RS) data contains a wealth of multi-dimensional information crucial for Earth observation. Owing to its vast volume, diverse sources, and temporal properties, RS data is highly suitable for the development of large Visual Foundation Models (VFMs). VFMs act as robust feature extractors, learning from extensive RS data, and are subsequently fine-tuned for deployment in various geoscientific tasks. However, current VFMs in the RS domain are predominantly pretrained and tailored exclusively for specific characteristics of RS imagery, neglecting the potential of utilizing the multi-dimensional properties of RS data. Therefore, in this work, we propose SeaMo, a pioneering visual foundation model that integrates multi-seasonal and multimodal information in the RS field. SeaMo is designed to harness multiple properties of RS data. Within the masked image modeling framework, we employ non-aligned cropping techniques to extract spatial properties, use multi-source inputs for multimodal integration, and incorporate temporal-multimodal fusion blocks for effective assimilation of multi-seasonal data. SeaMo explicitly models the multi-dimensional properties of RS data, making the model more comprehensive, robust, and versatile. We applied SeaMo to several downstream geoscience tasks, which demonstrated exceptional performance. Extensive ablation studies were conducted to validate the model’s superiority.
zh

[CV-65] VINEVI: A Virtualized Network Vision Architecture for Smart Monitoring of Heterogeneous Applications and Infrastructures

【速读】: 该论文旨在解决异构基础设施(heterogeneous infrastructures)和应用程序监控中的不足,特别是现有方法和工具无法无缝监控裸金属(bare-metal)、低成本基础设施以及托管或虚拟化服务的细粒度细节问题。论文提出的解决方案是VIrtualized NEtwork VIsion架构(VINEVI),其关键创新在于引入了一个节点嵌入的流量分类代理(node-embedded traffic classification agent),该代理能够在物理和虚拟化基础设施中实现实时流量分类。VINEVI通过结合这一实时流量分类技术与Prometheus和Victoria Metrics等现有工具,实现了从硬件到虚拟化应用程序的全面监控。实验结果表明,VINEVI架构能够以更高的细节水平无缝监控异构基础设施,超越了现有文献中的方法。

链接: https://arxiv.org/abs/2412.19226
作者: Rodrigo Moreira,Hugo G. V. O. da Cunha,Larissa F. Rodrigues Moreira,Flávio de Oliveira Silva
机构: 未知
关键词: user requirements properly, Monitoring heterogeneous infrastructures, requirements properly, lacks enhancements, Monitoring heterogeneous
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages

点击查看摘要

Abstract:Monitoring heterogeneous infrastructures and applications is essential to cope properly with user requirements, but existing approaches still lack enhancements. The well-known state-of-the-art methods and tools do not support seamless monitoring of bare-metal, low-cost infrastructures, nor of hosted or virtualized services, with fine-grained details. This work proposes the VIrtualized NEtwork VIsion architecture (VINEVI), an intelligent method for seamlessly monitoring heterogeneous infrastructures and applications. The VINEVI architecture advances the state of the art with a node-embedded traffic classification agent placed on physical and virtualized infrastructures, enabling real-time traffic classification. VINEVI combines this real-time traffic classification with well-known tools such as Prometheus and Victoria Metrics to monitor the entire stack from the hardware to the virtualized applications. Experimental results showcased that the VINEVI architecture allows seamless heterogeneous infrastructure monitoring with a level of detail beyond the literature. Also, our node-embedded real-time Internet traffic classifier flexibly extends existing methods for monitoring heterogeneous infrastructures seamlessly.
zh
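
摘要提到将节点内嵌的实时流量分类器与 Prometheus、Victoria Metrics 等工具组合使用。下面用 prometheus_client 库给出一个把分类结果暴露为可抓取指标的最小示意;指标名 vinevi_packets_total、端口号以及占位的 classify() 函数均为假设,与 VINEVI 的真实实现无关。

```python
from prometheus_client import Counter, start_http_server

# 按流量类别累计计数,供 Prometheus 周期性抓取
PACKETS = Counter("vinevi_packets_total", "Classified packets", ["traffic_class"])

def classify(packet) -> str:
    """占位函数:实际系统中应调用节点内嵌的 CNN 流量分类器(此处为假设)。"""
    return "video"

def run(packet_stream, port=9100):
    start_http_server(port)                     # 暴露 /metrics 端点
    for pkt in packet_stream:
        PACKETS.labels(traffic_class=classify(pkt)).inc()
```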

[CV-66] Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion

【速读】: 该论文旨在解决深度补全(depth completion)任务中由于稀疏数据直接卷积导致的不匹配和模糊性问题。传统的卷积神经网络(CNNs)在处理不规则采样的稀疏深度数据时,往往难以有效捕捉高频信息,导致补全结果不理想。为此,论文提出了一种新颖的退化感知框架——选择性图像引导网络(Selective Image Guided Network, SigNet),首次将深度补全任务转化为深度增强任务。SigNet的核心解决方案包括两个关键步骤:首先,通过非CNN的稠密化工具对稀疏深度数据进行初步稠密化,生成粗糙但稠密的深度图,从而避免直接卷积带来的问题;其次,将补全任务重新定义为增强任务,通过在粗糙深度与目标稠密深度之间建立自监督的退化桥梁,实现有效的RGB-D融合。SigNet进一步利用隐式退化机制,自适应地选择RGB数据的高频成分(如边缘)来补偿粗糙深度,并将这一退化机制集成到多模态条件Mamba中,动态生成状态参数,以实现高效的全局高频信息交互。通过这一系列创新,SigNet在多个数据集上展示了最先进的性能。

链接: https://arxiv.org/abs/2412.19225
作者: Zhiqiang Yan,Zhengxue Wang,Kun Wang,Jun Li,Jian Yang
机构: 未知
关键词: Selective Image Guided, Image Guided Network, Selective Image, Image Guided, Guided Network
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:In this paper, we introduce the Selective Image Guided Network (SigNet), a novel degradation-aware framework that transforms depth completion into depth enhancement for the first time. Moving beyond direct completion using convolutional neural networks (CNNs), SigNet initially densifies sparse depth data through non-CNN densification tools to obtain coarse yet dense depth. This approach eliminates the mismatch and ambiguity caused by direct convolution over irregularly sampled sparse data. Subsequently, SigNet redefines completion as enhancement, establishing a self-supervised degradation bridge between the coarse depth and the targeted dense depth for effective RGB-D fusion. To achieve this, SigNet leverages the implicit degradation to adaptively select high-frequency components (e.g., edges) of RGB data to compensate for the coarse depth. This degradation is further integrated into a multi-modal conditional Mamba, dynamically generating the state parameters to enable efficient global high-frequency information interaction. We conduct extensive experiments on the NYUv2, DIML, SUN RGBD, and TOFDC datasets, demonstrating the state-of-the-art (SOTA) performance of SigNet.
zh

[CV-67] Transformer-Based Wireless Capsule Endoscopy Bleeding Tissue Detection and Classification

【速读】: 该论文旨在解决无线胶囊内窥镜(Wireless Capsule Endoscopy, WCE)视频中出血与非出血帧的自动检测与分类问题。解决方案的关键在于设计了一个端到端可训练的模型,该模型基于DETR(Detection Transformer)架构,结合了ResNet50进行特征提取,使用Transformer编码器-解码器进行出血与非出血区域的检测,并通过前馈神经网络进行分类。该模型在Auto-WCEBleedGen Version 1挑战赛的训练集上进行端到端训练,能够同时完成检测与分类任务。实验结果表明,该模型在验证集上取得了较高的分类准确率(98.28%)、召回率(96.79%)和F1分数(98.37%),并在检测任务中获得了0.7447的平均精度(AP @ 0.5)和0.7328的均值平均精度(mAP),最终在挑战赛中获得了第三名。

链接: https://arxiv.org/abs/2412.19218
作者: Basit Alawode,Shibani Hamza,Adarsh Ghimire,Divya Velayudhan
机构: 未知
关键词: Wireless Capsule Endoscopy, Capsule Endoscopy, Wireless Capsule, non-bleeding frames extracted, extracted from Wireless
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Informed by the success of the transformer model in various computer vision tasks, we design an end-to-end trainable model for the automatic detection and classification of bleeding and non-bleeding frames extracted from Wireless Capsule Endoscopy (WCE) videos. Based on the DETR model, our model uses the ResNet50 for feature extraction, the transformer encoder-decoder for bleeding and non-bleeding region detection, and a feedforward neural network for classification. Trained end-to-end on the Auto-WCEBleedGen Version 1 challenge training set, our model performs both detection and classification tasks as a single unit. Our model achieves a classification accuracy, recall, and F1-score of 98.28%, 96.79%, and 98.37%, respectively, on the Auto-WCEBleedGen Version 1 validation set. Further, we record an average precision (AP @ 0.5) of 0.7447 and a mean average precision (mAP) of 0.7328 for detection. This earned us a 3rd place position in the challenge. Our code is publicly available via this https URL.
zh
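下面给出一个与上述描述对应的极简 PyTorch 结构草图(非论文官方实现;查询个数、Transformer 层数、隐藏维度等超参数均为演示用的假设值),用于说明 "ResNet50 特征提取 → Transformer 编解码 → 前馈网络做检测与分类" 的整体数据流:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class WCEBleedDETR(nn.Module):
    """极简示意:ResNet50 特征提取 + Transformer 编解码 + 前馈检测/分类头。"""
    def __init__(self, num_queries=10, d_model=256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # 输出 2048 通道特征图
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(d_model=d_model, num_encoder_layers=3,
                                          num_decoder_layers=3, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)
        self.bbox_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                       nn.Linear(d_model, 4))   # 归一化边界框 (cx, cy, w, h)
        self.cls_head = nn.Linear(d_model, 2)                    # 出血 / 非出血

    def forward(self, images):
        feats = self.input_proj(self.backbone(images))           # (B, d, H, W)
        src = feats.flatten(2).transpose(1, 2)                   # (B, H*W, d)
        tgt = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, tgt)                           # (B, num_queries, d)
        return self.bbox_head(hs).sigmoid(), self.cls_head(hs)

# 用随机张量做一次前向,检查输出形状
model = WCEBleedDETR()
boxes, logits = model(torch.randn(2, 3, 224, 224))
print(boxes.shape, logits.shape)  # torch.Size([2, 10, 4]) torch.Size([2, 10, 2])
```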

[CV-68] NADER: Neural Architecture Design via Multi-Agent Collaboration

【速读】: 该论文旨在解决深度学习中设计有效神经架构(Neural Architecture Design, NAD)的挑战,特别是现有神经架构搜索(Neural Architecture Search, NAS)方法受限于预定义的搜索空间,可能遗漏关键架构的问题。为此,论文提出了NADER(Neural Architecture Design via multi-agEnt collaboRation)框架,将神经架构设计视为基于大语言模型(LLM)的多智能体协作问题。NADER通过一组专门化的智能体对基础架构进行迭代修改,克服了现有LLM-based方法独立操作、无法从过去经验中学习的缺陷。关键创新包括引入Reflector机制,使智能体能够从即时反馈和长期经验中有效学习,以及采用基于图的表示方法替代传统的代码表示,使智能体能够专注于设计本身,避免编码干扰。通过广泛的基准任务实验,NADER展示了其在超越预定义搜索空间发现高性能架构方面的优势。

链接: https://arxiv.org/abs/2412.19206
作者: Zekang Yang,Wang Zeng,Sheng Jin,Chen Qian,Ping Luo,Wentao Liu
机构: 未知
关键词: Designing effective neural, Designing effective, effective neural architectures, neural architectures poses, Neural Architecture Design
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Designing effective neural architectures poses a significant challenge in deep learning. While Neural Architecture Search (NAS) automates the search for optimal architectures, existing methods are often constrained by predetermined search spaces and may miss critical neural architectures. In this paper, we introduce NADER (Neural Architecture Design via multi-agEnt collaboRation), a novel framework that formulates neural architecture design (NAD) as an LLM-based multi-agent collaboration problem. NADER employs a team of specialized agents to enhance a base architecture through iterative modification. Current LLM-based NAD methods typically operate independently, lacking the ability to learn from past experiences, which results in repeated mistakes and inefficient exploration. To address this issue, we propose the Reflector, which effectively learns from immediate feedback and long-term experiences. Additionally, unlike previous LLM-based methods that use code to represent neural architectures, we utilize a graph-based representation. This approach allows agents to focus on design aspects without being distracted by coding. We demonstrate the effectiveness of NADER in discovering high-performing architectures beyond predetermined search spaces through extensive experiments on benchmark tasks, showcasing its advantages over state-of-the-art methods. The codes will be released soon.
zh

[CV-69] An End-to-End Depth-Based Pipeline for Selfie Image Rectification

【速读】: 该论文旨在解决近距离拍摄的肖像或自拍图像中常见的透视失真(perspective distortion)问题。其解决方案的关键在于提出了一种端到端的深度学习校正管道,通过训练深度卷积神经网络(CNN)来预测面部深度,并利用估计的深度调整相机到主体的距离,包括将相机移远、增加相机焦距以及将3D图像特征重新投影到新的视角。重新投影后的特征被输入到修复模块中以填补缺失的像素。此外,论文引入了一个辅助模块来预测相机的水平移动,从而减少需要修复的具有挑战性的面部区域(如耳朵)的面积。与以往工作不同,该方法一次性处理全帧输入图像,无需裁剪面部并单独处理,避免了复杂的后处理步骤。为了训练网络,论文利用Unreal Engine生成包含多样化主体、头部姿态、表情、眼镜、服装和光照条件的大规模合成人脸数据集。实验结果表明,该校正管道在效果上优于现有方法,并与耗时的3D GAN方法相当,但速度提高了260倍以上。

链接: https://arxiv.org/abs/2412.19189
作者: Ahmed Alhawwary,Phong Nguyen-Ha,Janne Mustaniemi,Janne Heikkilä
机构: 未知
关键词: close distance typically, distance typically suffer, Portraits or selfie, perspective distortion, typically suffer
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Portraits or selfie images taken from a close distance typically suffer from perspective distortion. In this paper, we propose an end-to-end deep learning-based rectification pipeline to mitigate the effects of perspective distortion. We learn to predict the facial depth by training a deep CNN. The estimated depth is utilized to adjust the camera-to-subject distance by moving the camera farther, increasing the camera focal length, and reprojecting the 3D image features to the new perspective. The reprojected features are then fed to an inpainting module to fill in the missing pixels. We leverage a differentiable renderer to enable end-to-end training of our depth estimation and feature extraction nets to improve the rectified outputs. To boost the results of the inpainting module, we incorporate an auxiliary module to predict the horizontal movement of the camera which decreases the area that requires hallucination of challenging face parts such as ears. Unlike previous works, we process the full-frame input image at once without cropping the subject’s face and processing it separately from the rest of the body, eliminating the need for complex post-processing steps to attach the face back to the subject’s body. To train our network, we utilize the popular game engine Unreal Engine to generate a large synthetic face dataset containing various subjects, head poses, expressions, eyewear, clothes, and lighting. Quantitative and qualitative results show that our rectification pipeline outperforms previous methods, and produces comparable results with a time-consuming 3D GAN-based method while being more than 260 times faster.
zh
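下面是一个简化的 NumPy 草图,示意利用逐像素深度进行 "相机后移 + 焦距放大 + 重投影" 这一核心几何步骤(函数名、以中位深度为参考等均为示意性假设,并非论文的完整管线;重投影留下的空洞对应论文中交给修复模块补全的区域):

```python
import numpy as np

def rectify_reprojection(img, depth, f, move_back, cx=None, cy=None):
    """极简示意:利用逐像素深度把近距离透视图重投影到"相机后移 + 焦距变大"的新视角。
    未被映射到的像素保留为 0,对应需要修复模块补全的空洞。"""
    H, W = depth.shape
    cx = W / 2 if cx is None else cx
    cy = H / 2 if cy is None else cy

    # 1) 反投影到相机坐标系
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) * depth / f
    Y = (v - cy) * depth / f
    Z = depth

    # 2) 相机沿光轴后移 move_back,同时放大焦距以保持主体在画面中的大小
    z_ref = np.median(depth)                    # 以主体的中位深度作为参考(假设)
    Z_new = Z + move_back
    f_new = f * (z_ref + move_back) / z_ref

    # 3) 重投影到新视角并做最近邻前向映射
    u_new = np.round(f_new * X / Z_new + cx).astype(int)
    v_new = np.round(f_new * Y / Z_new + cy).astype(int)
    valid = (u_new >= 0) & (u_new < W) & (v_new >= 0) & (v_new < H)
    out = np.zeros_like(img)
    out[v_new[valid], u_new[valid]] = img[v[valid], u[valid]]
    return out

# 随机图像与"近大远小"的假设深度,仅用于演示流程
img = np.random.rand(240, 320, 3)
depth = np.full((240, 320), 0.4) + 0.001 * np.arange(320)[None, :]
warped = rectify_reprojection(img, depth, f=300.0, move_back=1.0)
print(warped.shape)
```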

[CV-70] Mask Approximation Net: Merging Feature Extraction and Distribution Learning for Remote Sensing Change Captioning

【速读】: 该论文旨在解决遥感图像变化描述任务中传统方法存在的泛化性和鲁棒性不足的问题。传统方法主要依赖卷积神经网络(CNN)架构提取双时相图像特征,导致过度关注特定网络架构设计,且特征分布局限于当前数据集,难以推广到其他数据集或实际场景。为解决这一问题,论文提出了一种集成扩散模型(diffusion models)的新方法,将重点从传统的特征学习范式转向数据分布学习。该方法的核心包括一个简单的多尺度变化检测模块,其输出特征通过扩散模型进行细化。此外,论文还引入了一个频率引导的复杂滤波模块,以处理扩散过程中的高频噪声,从而保持模型性能。实验结果表明,该方法在多个遥感变化检测描述数据集上表现出优越的性能。

链接: https://arxiv.org/abs/2412.19179
作者: Dongwei Sun,Xiangyong Cao
机构: 未知
关键词: enhancing human interpretability, Convolutional Neural Network, Remote sensing image, employed Convolutional Neural, remote sensing processing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Remote sensing image change description, as a novel multimodal task in the field of remote sensing processing, not only enables the detection of changes in surface conditions but also provides detailed descriptions of these changes, thereby enhancing human interpretability and interactivity. However, previous methods mainly employed Convolutional Neural Network (CNN) architectures to extract bitemporal image features. This approach often leads to an overemphasis on designing specific network architectures and limits the captured feature distributions to the current dataset, resulting in poor generalizability and robustness when applied to other datasets or real-world scenarios. To address these limitations, this paper proposes a novel approach for remote sensing image change detection and description that integrates diffusion models, aiming to shift the focus from conventional feature learning paradigms to data distribution learning. The proposed method primarily includes a simple multi-scale change detection module, whose output features are subsequently refined using a diffusion model. Additionally, we introduce a frequency-guided complex filter module to handle high-frequency noise during the diffusion process, which helps to maintain model performance. Finally, we validate the effectiveness of our proposed method on several remote sensing change detection description datasets, demonstrating its superior performance. The code is available at MaskApproxNet.
zh

[CV-71] Revisiting Monocular 3D Object Detection from Scene-Level Depth Retargeting to Instance-Level Spatial Refinement

【速读】: 该论文旨在解决单目3D目标检测(Monocular 3D Object Detection)中由于深度信息不准确导致的性能瓶颈问题。现有深度辅助解决方案的性能普遍不佳,主要原因在于单目深度估计模型(Monocular Depth Estimation Models)的精度不足,以及现有深度表示(如深度独热编码或深度分布)在3D结构感知能力上的局限性。为解决这一问题,论文提出了一种新颖的深度适应单目3D目标检测网络,称为RD3D。该网络的核心创新包括两个关键模块:场景级深度重定向模块(Scene-Level Depth Retargeting, SDR)和实例级空间细化模块(Instance-Level Spatial Refinement, ISR)。SDR模块通过引入场景级3D结构感知,将传统深度表示重定向为一种新的形式——深度厚度场(Depth Thickness Field);ISR模块则在实例的指导下细化体素空间表示,消除3D占位的模糊性,从而提升检测精度。通过在KITTI和Waymo数据集上的广泛实验,论文验证了该方法的优越性和与不同深度估计模型的通用性。

链接: https://arxiv.org/abs/2412.19165
作者: Qiude Zhang,Chunyu Lin,Zhijie Shen,Nie Lang,Yao Zhao
机构: 未知
关键词: depth, object detection, challenging due, lack of accurate, depth estimation models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular 3D object detection is challenging due to the lack of accurate depth. However, existing depth-assisted solutions still exhibit inferior performance, whose reason is universally acknowledged as the unsatisfactory accuracy of monocular depth estimation models. In this paper, we revisit monocular 3D object detection from the depth perspective and formulate an additional issue as the limited 3D structure-aware capability of existing depth representations (e.g., depth one-hot encoding or depth distribution). To address this issue, we propose a novel depth-adapted monocular 3D object detection network, termed RD3D, that mainly comprises a Scene-Level Depth Retargeting (SDR) module and an Instance-Level Spatial Refinement (ISR) module. The former incorporates the scene-level perception of 3D structures, retargeting traditional depth representations to a new formulation: Depth Thickness Field. The latter refines the voxel spatial representation with the guidance of instances, eliminating the ambiguity of 3D occupation and thus improving detection accuracy. Extensive experiments on the KITTI and Waymo datasets demonstrate our superiority to existing state-of-the-art (SoTA) methods and the universality when equipped with different depth estimation models. The code will be available.
zh

[CV-72] Dual Channel Multi-Attention in ViT for Biometric Authentication using Forehead Subcutaneous Vein Pattern and Periocular Pattern

【速读】: 该论文旨在解决传统生物识别系统(如人脸识别和指纹识别)在佩戴口罩和卫生问题上面临的挑战。为了解决因佩戴口罩导致的面部部分遮挡以及指纹识别中的卫生问题,论文提出了一种基于双通道多注意力视觉 Transformer (Vision Transformer, ViT) 框架的生物认证方法,利用额头皮下静脉模式(forehead subcutaneous vein patterns)和眼周模式(periocular patterns)作为替代方案。该框架的关键在于其双通道 ViT 架构,能够分别处理两种不同的生物特征,并捕捉静脉和眼周模式的独立特征之间的长程依赖关系。通过设计一个自定义分类器,将独立提取的特征进行整合,最终生成类别预测。实验结果表明,该算法在结合静脉和眼周模式的情况下,分类准确率达到了 99.3 ± 0.02%,显著优于现有技术。

链接: https://arxiv.org/abs/2412.19160
作者: Arun K. Sharma,Shubhobrata Bhattacharya,Motahar Reza
机构: 未知
关键词: encountered significant setbacks, wearing face masks, Traditional biometric systems, significant setbacks due, face masks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Traditional biometric systems, like face and fingerprint recognition, have encountered significant setbacks due to wearing face masks and hygiene concerns. To meet the challenges of the partially covered face due to face masks and hygiene concerns of fingerprint recognition, this paper proposes a novel dual-channel multi-attention Vision Transformer (ViT) framework for biometric authentication using forehead subcutaneous vein patterns and periocular patterns, offering a promising alternative to traditional methods, capable of performing well even with face masks and without any physical touch. The proposed framework leverages a dual-channel ViT architecture, designed to handle two distinct biometric traits. It can capture long-range dependencies of independent features from the vein and periocular patterns. A custom classifier is then designed to integrate the independently extracted features, producing a final class prediction. The performance of the proposed algorithm was rigorously evaluated using the Forehead Subcutaneous Vein Pattern and Periocular Biometric Pattern (FSVP-PBP) database. The results demonstrated the superiority of the algorithm over state-of-the-art methods, achieving remarkable classification accuracy of 99.3 ± 0.02% with the combined vein and periocular patterns.
zh

[CV-73] Generating Editable Head Avatars with 3D Gaussian GANs

【速读】: 该论文旨在解决传统基于隐式场(如Neural Radiance Fields, NeRF)的3D生成对抗网络(GANs)在生成可动画化和可编辑的3D头部虚拟形象时面临的形变灵活性和可编辑性不足的问题。传统方法虽然在生成逼真且视角一致的3D头部合成方面表现出色,但在实现逼真且易于修改的3D头部时存在局限。论文提出了一种新颖的方法,通过引入3D高斯溅射(3D Gaussian Splatting, 3DGS)作为显式3D表示,增强了3D头部虚拟形象的可编辑性和动画控制能力。该方法的核心理念是Editable Gaussian Head (EG-Head)模型,该模型结合了3D Morphable Model (3DMM)和纹理贴图,实现了精确的表情控制和灵活的纹理编辑,同时保留了身份特征。此外,为了捕捉复杂的非面部几何结构(如头发),论文还使用了辅助的3DGS和三平面特征。实验结果表明,该方法在3D感知合成方面具有高质量的生成效果,并实现了最先进的可控性。

链接: https://arxiv.org/abs/2412.19149
作者: Guohao Li,Hongyu Yang,Yifang Men,Di Huang,Weixin Li,Ruijie Yang,Yunhong Wang
机构: 未知
关键词: Generating animatable, Neural Radiance Fields, vision and graphics, applications in computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating animatable and editable 3D head avatars is essential for various applications in computer vision and graphics. Traditional 3D-aware generative adversarial networks (GANs), often using implicit fields like Neural Radiance Fields (NeRF), achieve photorealistic and view-consistent 3D head synthesis. However, these methods face limitations in deformation flexibility and editability, hindering the creation of lifelike and easily modifiable 3D heads. We propose a novel approach that enhances the editability and animation control of 3D head avatars by incorporating 3D Gaussian Splatting (3DGS) as an explicit 3D representation. This method enables easier illumination control and improved editability. Central to our approach is the Editable Gaussian Head (EG-Head) model, which combines a 3D Morphable Model (3DMM) with texture maps, allowing precise expression control and flexible texture editing for accurate animation while preserving identity. To capture complex non-facial geometries like hair, we use an auxiliary set of 3DGS and tri-plane features. Extensive experiments demonstrate that our approach delivers high-quality 3D-aware synthesis with state-of-the-art controllability. Our code and models are available at this https URL.
zh

[CV-74] AskChart: Universal Chart Understanding through Textual Enhancement

【速读】: 该论文旨在解决图表理解任务(Chart Understanding Tasks)中现有方法主要依赖视觉线索而未能充分利用图表中嵌入的丰富文本信息(如数据标签和轴标签)的问题。现有模型通常规模庞大且计算密集,限制了其实际应用。为此,论文提出了AskChart模型,通过混合专家(Mixture of Experts, MoE)架构显式整合图表中的文本和视觉线索,从而增强视觉-文本表示,有效处理多种图表理解任务,同时保持较小的模型规模。解决方案的关键在于:1)引入大规模数据集ChartBank(约750万数据样本),以对齐文本和视觉信息并促进视觉实体和文本的提取;2)设计三阶段训练策略,优化视觉和文本模态的对齐以及MoE层的学习。实验结果表明,AskChart在多个数据集上显著优于现有模型,尤其在开放式ChartQA和Chart-to-Text任务中表现尤为突出。

链接: https://arxiv.org/abs/2412.19146
作者: Xudong Yang,Yifan Wu,Yizhang Zhu,Nan Tang,Yuyu Luo
机构: 未知
关键词: involve automatically extracting, interpreting key information, Chart understanding tasks, convert visual data, involve automatically
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 12 figures, 14 tables

点击查看摘要

Abstract:Chart understanding tasks such as ChartQA and Chart-to-Text involve automatically extracting and interpreting key information from charts, enabling users to query or convert visual data into structured formats. State-of-the-art approaches primarily focus on visual cues from chart images, failing to explicitly incorporate rich textual information (e.g., data labels and axis labels) embedded within the charts. This textual information is vital for intuitive human comprehension and interpretation of charts. Moreover, existing models are often large and computationally intensive, limiting their practical applicability. In this paper, we introduce AskChart, a universal model that explicitly integrates both textual and visual cues from charts using a Mixture of Experts (MoE) architecture. AskChart facilitates the learning of enhanced visual-textual representations of charts for effectively handling multiple chart understanding tasks, while maintaining a smaller model size. To capture the synergy between visual and textual modalities, we curate a large-scale dataset named ChartBank with about 7.5M data samples, which helps align textual and visual information and facilitates the extraction of visual entities and text. To effectively train AskChart, we design a three-stage training strategy to align visual and textual modalities for learning robust visual-textual representations and optimizing the learning of the MoE layer. Extensive experiments across five datasets demonstrate the significant performance gains of AskChart in four chart understanding tasks. Remarkably, AskChart with 4.6B parameters outperforms state-of-the-art models with 13B parameters by 68.3% in Open-ended ChartQA and 49.2% in Chart-to-Text tasks, while achieving comparable performance in ChartQA and Chart-to-Table tasks.
zh
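下面给出一个极简的稀疏 Mixture-of-Experts 层的 PyTorch 草图,用于说明 AskChart 所采用的 MoE 架构中 "门控网络为每个 token 路由到若干专家" 的基本机制(专家数量、top-k、维度等均为假设,与 AskChart 的具体实现无关):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """极简的稀疏 MoE 层示意:门控网络为每个 token 选取 top-k 个专家,
    输出为被选专家结果的加权和(专家数、top-k 均为假设超参数)。"""
    def __init__(self, dim=512, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, tokens):                       # tokens: (B, N, dim)
        gate_logits = self.gate(tokens)              # (B, N, E)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):               # 逐个槽位累加被选专家的加权输出
            for e, expert in enumerate(self.experts):
                mask = (indices[..., slot] == e)     # 本槽位路由到专家 e 的 token
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(tokens[mask])
        return out

# 混合的视觉 + 文本 token 序列(随机张量,仅演示形状)
layer = SimpleMoELayer()
print(layer(torch.randn(2, 196 + 32, 512)).shape)    # torch.Size([2, 228, 512])
```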

[CV-75] Impact of color and mixing proportion of synthetic point clouds on semantic segmentation

【速读】: 该论文旨在解决点云语义分割(Semantic Segmentation of Point Clouds)中高质量数据不足的问题,特别是如何有效利用合成点云(Synthetic Point Clouds, SPC)来弥补真实数据的短缺。论文的核心解决方案包括提出了一种基于建筑信息模型(Building Information Modeling, BIM)的扫描过程模拟方法,生成两种合成数据集:一种是具有一致BIM颜色的合成点云(UniSPC),另一种是具有真实颜色的合成点云(RealSPC)。通过将这些合成数据与S3DIS数据集结合,论文进一步在PointNet、PointNet++和DGCNN模型上进行了实验,并引入了新的评估指标来更好地衡量模型性能。实验结果表明,合成颜色对模型性能有显著影响,使用纯RealSPC训练的模型在常见组件上的性能与使用真实数据训练的模型相当,且RealSPC在整体准确率(Overall Accuracy)和平均交并比(mIoU)上分别比UniSPC提高了14.1%和7.3%。此外,合成点云的比例也对性能有显著影响,在混合训练实验中,添加超过70%的SPC在三个模型上分别比基准模型在整体准确率和mIoU上平均提高了3.9%和3.4%。研究还发现,对于大面积平面元素(如地板、天花板和墙壁),合成点云甚至可以替代真实点云而不影响模型性能。

链接: https://arxiv.org/abs/2412.19145
作者: Shaojie Zhou,Jia-Rui Lin,Peng Pan,Yuandong Pan,Ioannis Brilakis
机构: 未知
关键词: deep learning models, point clouds, training deep learning, SPC, built environment
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation of point clouds is essential for understanding the built environment, and a large amount of high-quality data is required for training deep learning models. Despite synthetic point clouds (SPC) having the potential to compensate for the shortage of real data, how to exploit the benefits of SPC is still open. Therefore, this study systematically investigates how the color and mixing proportion of SPC impact semantic segmentation for the first time. First, a new method to mimic the scanning process and generate SPC based on BIM is proposed, to create a synthetic dataset with consistent colors of BIM (UniSPC) and a synthetic dataset with real colors (RealSPC), respectively. Subsequently, by integrating with the S3DIS dataset, further experiments on PointNet, PointNet++, and DGCNN are conducted. Meanwhile, benchmark experiments and new evaluation metrics are introduced to better evaluate the performance of different models. Experiments show that synthetic color significantly impacts model performance; the performance on common components of models trained with pure RealSPC is comparable to that of models trained with real data, and RealSPC contributes average improvements of 14.1% in overall accuracy and 7.3% in mIoU over UniSPC. Furthermore, the proportion of SPC also has a significant impact on the performance. In mixed training experiments, adding more than 70% SPC achieves average improvements of 3.9% in overall accuracy and 3.4% in mIoU over the benchmark across the three models. It is also revealed that for large flat elements such as floors, ceilings, and walls, the SPC can even replace real point clouds without compromising model performance.
zh

[CV-76] CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

【速读】: 该论文旨在解决基于点云(point cloud)的3D多模态表示学习在重建能力上的局限性,特别是点云无法有效捕捉3D物体的纹理信息的问题。为此,论文提出了CLIP-GS框架,其关键解决方案是引入3D高斯泼溅(3D Gaussian Splatting, 3DGS)作为新的3D表示技术,并通过GS Tokenizer生成序列化的高斯令牌(gaussian tokens),这些令牌经过预初始化权重的Transformer层处理,生成3DGS嵌入。CLIP-GS利用3DGS与CLIP的视觉-文本嵌入之间的对比损失(contrastive loss),并引入图像投票损失(image voting loss)来指导梯度优化的方向和收敛。此外,论文还开发了一种高效的方法生成3DGS、图像和文本的三元组,从而促进CLIP-GS学习统一的多模态表示。通过这些创新,CLIP-GS在多种3D任务中表现出色,超越了基于点云的模型。

链接: https://arxiv.org/abs/2412.19142
作者: Siyu Jiao,Haoye Dong,Yuyang Yin,Zequn Jie,Yinlong Qian,Yao Zhao,Humphrey Shi,Yunchao Wei
机构: 未知
关键词: made remarkable progress, Recent works, remarkable progress, made remarkable, Recent
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent works in 3D multimodal learning have made remarkable progress. However, typically 3D multimodal models are only capable of handling point clouds. Compared to the emerging 3D representation technique, 3D Gaussian Splatting (3DGS), the spatially sparse point cloud cannot depict the texture information of 3D objects, resulting in inferior reconstruction capabilities. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized gaussian tokens, which are then processed through transformer layers pre-initialized with weights from point cloud models, resulting in the 3DGS embeddings. CLIP-GS leverages contrastive loss between 3DGS and the visual-text embeddings of CLIP, and we introduce an image voting loss to guide the directionality and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, facilitating CLIP-GS in learning unified multimodal representations. Leveraging the well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on various 3D tasks, including multimodal retrieval, zero-shot, and few-shot classification.
zh
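以下是一个示意性的 PyTorch 片段,说明 "3DGS 嵌入与 CLIP 图像/文本嵌入做对比对齐" 的基本损失形式(对称 InfoNCE;温度系数等为假设值,论文中的 image voting loss 等细节未包含,非官方实现):

```python
import torch
import torch.nn.functional as F

def clip_gs_contrastive_loss(gs_emb, clip_img_emb, clip_txt_emb, temperature=0.07):
    """极简示意:3DGS 嵌入与 CLIP 图像/文本嵌入之间的对称 InfoNCE 对比损失,
    批内对角线位置视为同一物体的正样本对(温度系数为假设值)。"""
    gs = F.normalize(gs_emb, dim=-1)
    losses = []
    for other in (F.normalize(clip_img_emb, dim=-1), F.normalize(clip_txt_emb, dim=-1)):
        logits = gs @ other.t() / temperature              # (B, B) 相似度矩阵
        targets = torch.arange(gs.size(0), device=gs.device)
        # 对称对比损失:3DGS->CLIP 与 CLIP->3DGS 各算一次交叉熵
        losses.append((F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets)) / 2)
    return sum(losses) / len(losses)

# 假设批内有 8 个 "3DGS-图像-文本" 三元组,嵌入维度 512(随机张量仅演示)
loss = clip_gs_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```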

[CV-77] How Panel Layouts Define Manga: Insights from Visual Ablation Experiments

【速读】: 该论文旨在探讨漫画(manga)中各种元素(如角色、文本和分镜布局)如何反映并定义特定作品的独特性,尤其是分镜布局(panel layout)的视觉特征。研究通过定量和定性分析,使用深度学习模型对漫画作品进行分类预测,以评估分镜布局在作品独特性中的作用。关键解决方案包括利用Manga109数据集中的104部作品、12种类型和10,122对跨页图像作为输入,训练深度学习模型进行漫画标题预测,并通过消融实验(ablation studies)限制页面图像信息仅包含分镜框架,以分析分镜布局的特征。此外,研究还使用Grad-CAM进行定性分析,进一步验证了分镜布局在漫画作品独特性中的显著影响。

链接: https://arxiv.org/abs/2412.19141
作者: Siyuan Feng,Teruya Yoshinaga,Katsuhiko Hayashi,Koki Washio,Hidetaka Kamigaito
机构: 未知
关键词: gained worldwide popularity, worldwide popularity, gained worldwide, Today, manga
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, under review

点击查看摘要

Abstract:Today, manga has gained worldwide popularity. However, the question of how various elements of manga, such as characters, text, and panel layouts, reflect the uniqueness of a particular work, or even define it, remains an unexplored area. In this paper, we aim to quantitatively and qualitatively analyze the visual characteristics of manga works, with a particular focus on panel layout features. As a research method, we used facing page images of manga as input to train a deep learning model for predicting manga titles, examining classification accuracy to quantitatively analyze these features. Specifically, we conducted ablation studies by limiting page image information to panel frames to analyze the characteristics of panel layouts. Through a series of quantitative experiments using all 104 works, 12 genres, and 10,122 facing page images from the Manga109 dataset, as well as qualitative analysis using Grad-CAM, our study demonstrates that the uniqueness of manga works is strongly reflected in their panel layouts.
zh

[CV-78] PlanLLM: Video Procedure Planning with Refinable Large Language Models AAAI2025

【速读】: 该论文旨在解决视频过程规划(Video Procedure Planning)中的两个主要问题:一是现有方法将动作步骤解码为封闭集(closed-set)的独热向量(one-hot vectors),限制了模型在新步骤或任务中的泛化能力;二是基于世界级常识(world-level commonsense)的固定动作步骤描述在特定视觉状态实例中可能包含噪声。为解决这些问题,论文提出了PlanLLM,一个结合大语言模型(LLMs)的跨模态联合学习框架。其关键解决方案包括:1)引入LLM增强规划模块(LLM-Enhanced Planning module),充分利用LLMs的泛化能力生成自由形式的规划输出,并增强动作步骤解码;2)提出互信息最大化模块(Mutual Information Maximization module),将步骤描述的世界级常识与视觉状态的样本特定信息相结合,使LLMs能够运用推理能力生成步骤序列。通过这些设计,PlanLLM能够在封闭集和开放词汇(open vocabulary)过程规划任务中均表现出色,并在三个基准测试中取得了优异的性能。

链接: https://arxiv.org/abs/2412.19139
作者: Dejie Yang,Zijing Zhao,Yang Liu
机构: 未知
关键词: Large Language Models, utilize Large Language, action step, action step decoding, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: accepted to AAAI2025

点击查看摘要

Abstract:Video procedure planning, i.e., planning a sequence of action steps given the video frames of start and goal states, is an essential ability for embodied AI. Recent works utilize Large Language Models (LLMs) to generate enriched action step description texts to guide action step decoding. Although LLMs are introduced, these methods decode the action steps into a closed-set of one-hot vectors, limiting the model's capability of generalizing to new steps or tasks. Additionally, fixed action step descriptions based on world-level commonsense may contain noise in specific instances of visual states. In this paper, we propose PlanLLM, a cross-modal joint learning framework with LLMs for video procedure planning. We propose an LLM-Enhanced Planning module which fully uses the generalization ability of LLMs to produce free-form planning output and to enhance action step decoding. We also propose a Mutual Information Maximization module to connect world-level commonsense of step descriptions and sample-specific information of visual states, enabling LLMs to employ their reasoning ability to generate step sequences. With the assistance of LLMs, our method can handle both closed-set and open vocabulary procedure planning tasks. Our PlanLLM achieves superior performance on three benchmarks, demonstrating the effectiveness of our designs.
zh

[CV-79] SUTrack: Towards Simple and Unified Single Object Tracking AAAI2025

【速读】: 该论文旨在解决单目标跟踪(Single Object Tracking, SOT)领域中不同任务(如基于RGB、RGB-深度、RGB-热成像、RGB-事件、RGB-语言跟踪)因数据特性差异而导致的模型碎片化问题。当前方法通常为每个任务设计独立的架构并分别训练模型,导致训练过程冗余、技术创新重复以及跨模态知识共享受限。论文提出的解决方案SUTrack通过构建一个统一的模型,能够在单次训练中处理多种常见的SOT任务,从而消除了任务特定设计和单独训练的需求。其关键在于引入统一的输入表示、任务识别辅助训练策略以及软令牌类型嵌入(soft token type embedding),这些创新显著提升了模型性能,同时保持了较低的计算开销。实验结果表明,SUTrack在涵盖五个SOT任务的11个数据集上均优于之前的任务特定模型,并为边缘设备和高性能GPU提供了多种模型,实现了速度与精度的良好平衡。

链接: https://arxiv.org/abs/2412.19138
作者: Xin Chen,Ben Kang,Wanting Geng,Jiawen Zhu,Yi Liu,Dong Wang,Huchuan Lu
机构: 未知
关键词: SOT tasks, single object tracking, unified single object, propose a simple, common SOT tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:In this paper, we propose a simple yet unified single object tracking (SOT) framework, dubbed SUTrack. It consolidates five SOT tasks (RGB-based, RGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a unified model trained in a single session. Due to the distinct nature of the data, current methods typically design individual architectures and train separate models for each task. This fragmentation results in redundant training processes, repetitive technological innovations, and limited cross-modal knowledge sharing. In contrast, SUTrack demonstrates that a single model with a unified input representation can effectively handle various common SOT tasks, eliminating the need for task-specific designs and separate training sessions. Additionally, we introduce a task-recognition auxiliary training strategy and a soft token type embedding to further enhance SUTrack's performance with minimal overhead. Experiments show that SUTrack outperforms previous task-specific counterparts across 11 datasets spanning five SOT tasks. Moreover, we provide a range of models catering to edge devices as well as high-performance GPUs, striking a good trade-off between speed and accuracy. We hope SUTrack could serve as a strong foundation for further compelling research into unified tracking models. Code and models are available at this http URL.
zh

[CV-80] Extended Cross-Modality United Learning for Unsupervised Visible-Infrared Person Re-identification

【速读】: 该论文旨在解决无监督学习可见光-红外行人重识别(USL-VI-ReID)中的跨模态特征学习问题。现有方法在跨模态聚类方面存在不足,或过度追求聚类级别的关联,导致难以可靠地学习模态不变特征。为解决这一问题,论文提出了扩展跨模态联合学习(ECUL)框架,关键创新点包括扩展模态-相机聚类(EMCC)模块和两步记忆更新策略(TSMem)模块。ECUL框架通过自然整合模态内聚类、跨模态聚类和跨模态实例选择,建立紧凑且准确的跨模态关联,同时减少噪声标签的引入。EMCC模块通过扩展编码向量捕获并过滤邻域关系,进一步促进模态不变和相机不变知识的学习。TSMem模块通过分阶段更新记忆,为对比学习提供准确且泛化的代理点。实验结果表明,ECUL在SYSU-MM01和RegDB数据集上表现出色,甚至优于某些监督学习方法。

链接: https://arxiv.org/abs/2412.19134
作者: Ruixing Wu,Yiming Yang,Jiakai He,Haifeng Hu
机构: 未知
关键词: Unsupervised learning visible-infrared, visible-infrared person re-identification, learning visible-infrared person, learn modality-invariant features, Unsupervised learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and reduce the inter-modality gap. However, the existing methods lack cross-modality clustering or excessively pursue cluster-level association, which makes it difficult to perform reliable modality-invariant feature learning. To deal with this issue, we propose an Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, we design ECUL to naturally integrate intra-modality clustering, inter-modality clustering and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters the neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge in terms of the clustering algorithm. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experimental results on the SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.
zh

[CV-81] MVS-GS: High-Quality 3D Gaussian Splatting Mapping via Online Multi-View Stereo ICRA2025

【速读】: 该论文旨在解决基于RGB图像流在线生成高质量3D模型用于神经渲染(neural rendering)的挑战。现有研究通常通过将神经辐射场(Neural Radiance Fields, NeRF)或3D高斯泼溅(3D Gaussian Splatting, 3DGS)作为场景表示方法集成到密集SLAM(Simultaneous Localization and Mapping)中,但这些方法主要关注粗略的3D场景估计,难以实现细节重建,且仅基于图像的深度估计往往存在模糊性,导致生成的3D模型质量较低,渲染结果不准确。为解决这些问题,论文提出了一种新颖的高质量3DGS建模框架,其关键创新在于采用在线多视图立体(multi-view stereo, MVS)方法,利用局部时间窗口内的序列帧进行MVS深度估计,并通过深度细化技术过滤异常值,从而精确初始化3DGS中的高斯分布。此外,论文还引入了一个并行化的后端模块,高效优化3DGS模型,确保每帧关键帧的及时更新。实验结果表明,该方法在复杂户外环境中显著优于现有的密集SLAM方法。

链接: https://arxiv.org/abs/2412.19130
作者: Byeonggwon Lee,Junkyu Park,Khang Truong Giang,Sungho Jo,Soohwan Song
机构: 未知
关键词: RGB image stream, Neural Radiance Fields, incorporating Neural Radiance, RGB image, Radiance Fields
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 6 figures, submitted to IEEE ICRA 2025

点击查看摘要

Abstract:This study addresses the challenge of online 3D model generation for neural rendering using an RGB image stream. Previous research has tackled this issue by incorporating Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) as scene representations within dense SLAM methods. However, most studies focus primarily on estimating coarse 3D scenes rather than achieving detailed reconstructions. Moreover, depth estimation based solely on images is often ambiguous, resulting in low-quality 3D models that lead to inaccurate renderings. To overcome these limitations, we propose a novel framework for high-quality 3DGS modeling that leverages an online multi-view stereo (MVS) approach. Our method estimates MVS depth using sequential frames from a local time window and applies comprehensive depth refinement techniques to filter out outliers, enabling accurate initialization of Gaussians in 3DGS. Furthermore, we introduce a parallelized backend module that optimizes the 3DGS model efficiently, ensuring timely updates with each new keyframe. Experimental results demonstrate that our method outperforms state-of-the-art dense SLAM methods, particularly excelling in challenging outdoor environments.
zh

[CV-82] Semantic Residual for Multimodal Unified Discrete Representation ICASSP2025

【速读】: 该论文旨在解决多模态统一表示(multimodal unified representations)领域中量化表示形式单一的问题。当前研究主要采用码本(codebook)作为表示形式,并使用向量量化(Vector Quantization, VQ)进行量化,但对其他量化表示形式的探索不足。为此,论文提出了一种新的框架——语义残差跨模态信息解耦(Semantic Residual Cross-modal Information Disentanglement, SRCID),该框架受残差向量量化(Residual Vector Quantization, RVQ)中数值残差概念的启发,通过语义残差进行多模态数据的信息解耦,以更好地处理不同模态之间的固有差异。该方法的创新之处在于其能够提升多模态统一表示的能力,并在跨模态泛化和跨模态零样本检索任务中表现出卓越性能,其平均结果显著超越了现有的最先进模型以及基于RVQ和有限标量量化(Finite Scalar Quantization, FSQ)的先前尝试。

链接: https://arxiv.org/abs/2412.19128
作者: Hai Huang,Shulei Wang,Yan Xia
机构: 未知
关键词: utilizing Vector Quantization, Residual Vector Quantization, Recent research, quantization representation forms, Vector Quantization
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICASSP 2025 Accepted

点击查看摘要

Abstract:Recent research in the domain of multimodal unified representations predominantly employs codebooks as the representation form, utilizing Vector Quantization (VQ) for quantization, yet there has been insufficient exploration of other quantization representation forms. Our work explores more precise quantization methods and introduces a new framework, Semantic Residual Cross-modal Information Disentanglement (SRCID), inspired by the numerical residual concept inherent to Residual Vector Quantization (RVQ). SRCID employs semantic residual-based information disentanglement for multimodal data to better handle the inherent discrepancies between different modalities. Our method enhances the capabilities of unified multimodal representations and demonstrates exceptional performance in cross-modal generalization and cross-modal zero-shot retrieval. Its average results significantly surpass existing state-of-the-art models, as well as previous attempts with RVQ and Finite Scalar Quantization (FSQ) based on these modalities.
zh
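为了直观理解 SRCID 所借鉴的 "残差量化" 概念,下面给出一个最小化的残差向量量化(RVQ)PyTorch 草图(码本为随机初始化、仅作演示;SRCID 实际使用的是语义残差而非这种数值残差):

```python
import torch

def residual_vector_quantize(x, codebooks):
    """极简的残差向量量化(RVQ)示意:每一级码本量化上一级留下的残差,
    各级量化结果逐级累加逼近原向量。"""
    residual = x
    quantized = torch.zeros_like(x)
    codes = []
    for codebook in codebooks:                         # codebook: (K, D)
        dist = torch.cdist(residual, codebook)         # (N, K) 残差到各码字的距离
        idx = dist.argmin(dim=-1)                      # 每个向量选最近的码字
        selected = codebook[idx]
        quantized = quantized + selected               # 累加本级量化结果
        residual = residual - selected                 # 把剩余误差交给下一级码本
        codes.append(idx)
    return quantized, codes

x = torch.randn(16, 64)                                # 16 个 64 维特征向量(随机示例)
codebooks = [torch.randn(256, 64) for _ in range(3)]   # 3 级码本,每级 256 个码字
quantized, codes = residual_vector_quantize(x, codebooks)
print((x - quantized).norm(dim=-1).mean())             # 打印量化误差(真实系统中码本需经训练)
```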

[CV-83] Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing

【速读】: 该论文旨在解决低比特量化(low-bit quantized, Q)模型在零样本量化(zero-shot quantization, ZSQ)领域中训练能力受限的问题。现有研究主要关注从全精度(full-precision, FP)模型生成高质量数据,但这些方法在低比特量化下由于信息容量有限,导致学习能力下降。为此,论文提出了一种名为AKT(Advanced Knowledge Transfer)的新方法,通过优化特征蒸馏过程中的特征图(feature maps)来有效传递知识,从而提升低比特量化模型的性能。AKT的关键创新在于首次在ZSQ中同时利用空间和通道注意力信息进行特征蒸馏,并解决了低比特量化模型中的梯度爆炸问题。实验结果表明,AKT在CIFAR-10和CIFAR-100数据集上显著提升了现有生成模型的性能,特别是在3位和5位量化场景下达到了最先进的精度水平。

链接: https://arxiv.org/abs/2412.19125
作者: Inpyo Hong,Youngwan Jo,Hyojeong Lee,Sunghyun Ahn,Sanghyun Park
机构: 未知
关键词: Advanced Knowledge Transfer, Advanced Knowledge, field of zero-shot, ZSQ, AKT
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ACM SAC 2025

点击查看摘要

Abstract:We introduce AKT (Advanced Knowledge Transfer), a novel method to enhance the training ability of low-bit quantized (Q) models in the field of zero-shot quantization (ZSQ). Existing research in ZSQ has focused on generating high-quality data from full-precision (FP) models. However, these approaches struggle with reduced learning ability in low-bit quantization due to its limited information capacity. To overcome this limitation, we propose an effective training strategy rather than focusing on data generation. In particular, we found that refining feature maps in the feature distillation process is an effective way to transfer knowledge to the Q model. Based on this analysis, AKT efficiently transfers core information from the FP model to the Q model. AKT is the first approach to utilize both spatial and channel attention information for feature distillation in ZSQ. Our method addresses the fundamental gradient exploding problem in low-bit Q models. Experiments on the CIFAR-10 and CIFAR-100 datasets demonstrate the effectiveness of AKT. Our method led to significant performance enhancement in existing generative models. Notably, AKT achieved significant accuracy improvements in low-bit Q models, achieving state-of-the-art results in the 3- and 5-bit scenarios on CIFAR-10. The code is available at this https URL.
zh
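下面的 PyTorch 片段示意 "同时利用空间注意力与通道注意力来精炼特征图再做蒸馏" 的一种常见写法(注意力的具体定义、平方聚合方式与权重 alpha/beta 均为假设,并非 AKT 的官方实现):

```python
import torch
import torch.nn.functional as F

def attention_guided_distill_loss(feat_fp, feat_q, alpha=1.0, beta=1.0):
    """极简示意:用空间注意力与通道注意力精炼特征图后做蒸馏。
    feat_fp / feat_q: 全精度教师与低比特量化学生的特征图,形状 (B, C, H, W)。"""
    def spatial_attention(f):                 # 跨通道聚合 -> (B, H*W),再归一化
        a = f.pow(2).mean(dim=1).flatten(1)
        return F.normalize(a, dim=-1)

    def channel_attention(f):                 # 跨空间聚合 -> (B, C),再归一化
        a = f.pow(2).mean(dim=(2, 3))
        return F.normalize(a, dim=-1)

    loss_spatial = F.mse_loss(spatial_attention(feat_q), spatial_attention(feat_fp))
    loss_channel = F.mse_loss(channel_attention(feat_q), channel_attention(feat_fp))
    return alpha * loss_spatial + beta * loss_channel

# 随机特征图演示:教师与学生特征同形状 (B=4, C=128, H=W=14)
loss = attention_guided_distill_loss(torch.randn(4, 128, 14, 14), torch.randn(4, 128, 14, 14))
print(loss.item())
```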

[CV-84] Evaluating Self-Supervised Learning in Medical Imaging: A Benchmark for Robustness, Generalizability, and Multi-Domain Impact

【速读】: 该论文旨在解决自监督学习(Self-supervised Learning, SSL)在医学影像领域中应用时面临的评估碎片化问题。现有研究通常局限于特定数据集或模态,且仅评估模型性能的孤立方面,这在实际医疗环境中可能导致模型缺乏鲁棒性和泛化能力。为解决这一问题,论文提出了对SSL方法在医学领域中的全面评估,重点关注模型的鲁棒性和泛化能力。研究采用MedMNIST数据集作为标准化基准,评估了8种主要SSL方法在11个不同医学数据集上的表现,涵盖了域内场景、分布外(Out-of-Distribution, OOD)样本检测、初始化策略、模型架构以及多域预训练的影响。此外,通过跨数据集评估和不同标签比例(1%、10%和100%)下的域内性能测试,进一步验证了SSL方法在有限监督条件下的泛化能力。该研究为医学应用中的SSL方法选择提供了全面的基准支持。

链接: https://arxiv.org/abs/2412.19124
作者: Valay Bundele,Oğuz Ata Çal,Bora Kargi,Karahan Sarıtaş,Kıvanç Tezören,Zohreh Ghaderi,Hendrik Lensch
机构: 未知
关键词: Self-supervised learning, limited labeled data, SSL methods, SSL, addressing the chronic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) has emerged as a promising paradigm in medical imaging, addressing the chronic challenge of limited labeled data in healthcare settings. While SSL has shown impressive results, existing studies in the medical domain are often limited in scope, focusing on specific datasets or modalities, or evaluating only isolated aspects of model performance. This fragmented evaluation approach poses a significant challenge, as models deployed in critical medical settings must not only achieve high accuracy but also demonstrate robust performance and generalizability across diverse datasets and varying conditions. To address this gap, we present a comprehensive evaluation of SSL methods within the medical domain, with a particular focus on robustness and generalizability. Using the MedMNIST dataset collection as a standardized benchmark, we evaluate 8 major SSL methods across 11 different medical datasets. Our study provides an in-depth analysis of model performance in both in-domain scenarios and the detection of out-of-distribution (OOD) samples, while exploring the effect of various initialization strategies, model architectures, and multi-domain pre-training. We further assess the generalizability of SSL methods through cross-dataset evaluations and the in-domain performance with varying label proportions (1%, 10%, and 100%) to simulate real-world scenarios with limited supervision. We hope this comprehensive benchmark helps practitioners and researchers make more informed decisions when applying SSL methods to medical applications.
zh

[CV-85] Task Success Prediction and Open-Vocabulary Object Manipulation

【速读】: 该论文旨在解决开放词汇物体操作(open-vocabulary object manipulation)任务中未来成功或失败的预测问题。传统方法通常在操作执行后才进行成功预测,这限制了整个任务序列的执行效率。论文提出了一种新颖的方法,通过将给定的轨迹和图像与自然语言指令对齐,实现对操作结果的提前预测。其关键解决方案是引入了轨迹编码器(Trajectory Encoder),该编码器对输入轨迹应用可学习的权重,使模型能够考虑时间动态以及物体与末端执行器之间的相互作用,从而提高了模型对操作结果预测的准确性。实验结果表明,该方法在基于RT-1数据集构建的评估数据集上,预测精度优于基线方法。

链接: https://arxiv.org/abs/2412.19112
作者: Motonari Kambara,Komei Sugiura
机构: 未知
关键词: natural language instructions, study addresses, open-vocabulary object manipulation, object manipulation tasks, task designed
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at LangRob @ CoRL 2024

点击查看摘要

Abstract:This study addresses a task designed to predict the future success or failure of open-vocabulary object manipulation. In this task, the model is required to make predictions based on natural language instructions, egocentric view images before manipulation, and the given end-effector trajectories. Conventional methods typically perform success prediction only after the manipulation is executed, limiting their efficiency in executing the entire task sequence. We propose a novel approach that enables the prediction of success or failure by aligning the given trajectories and images with natural language instructions. We introduce Trajectory Encoder to apply learnable weighting to the input trajectories, allowing the model to consider temporal dynamics and interactions between objects and the end effector, improving the model’s ability to predict manipulation outcomes accurately. We constructed a dataset based on the RT-1 dataset, a large-scale benchmark for open-vocabulary object manipulation tasks, to evaluate our method. The experimental results show that our method achieved a higher prediction accuracy than baseline approaches.
zh
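下面给出一个示意性的 Trajectory Encoder PyTorch 草图,说明 "对轨迹逐时刻编码并施加可学习权重,再与图像、语言特征融合预测成功/失败" 的思路(输入维度、融合方式均为假设,非论文官方实现):

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """极简示意:末端执行器轨迹逐时刻编码 + 可学习注意力加权汇聚,
    与图像/语言特征拼接后输出成功与否的二分类 logits(维度均为假设值)。"""
    def __init__(self, traj_dim=7, hidden=128, img_dim=512, txt_dim=512):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(traj_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        self.score = nn.Linear(hidden, 1)           # 为每个时刻打一个可学习的权重分数
        self.head = nn.Sequential(nn.Linear(hidden + img_dim + txt_dim, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 2))   # 成功 / 失败

    def forward(self, traj, img_feat, txt_feat):    # traj: (B, T, traj_dim)
        h = self.point_mlp(traj)                    # (B, T, hidden)
        w = torch.softmax(self.score(h), dim=1)     # (B, T, 1) 时序注意力权重
        traj_feat = (w * h).sum(dim=1)              # 加权汇聚为 (B, hidden)
        fused = torch.cat([traj_feat, img_feat, txt_feat], dim=-1)
        return self.head(fused)                     # (B, 2)

model = TrajectoryEncoder()
logits = model(torch.randn(2, 30, 7), torch.randn(2, 512), torch.randn(2, 512))
print(logits.shape)                                 # torch.Size([2, 2])
```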

[CV-86] Spectral Enhancement and Pseudo-Anchor Guidance for Infrared-Visible Person Re-Identification

【速读】: 该论文旨在解决可见光-红外行人重识别(VI-ReID)中的模态差异问题,特别是在跨模态匹配时由于光谱差异导致的性能限制。现有的方法依赖于无监督的模态转换和低效的嵌入约束,难以有效弥合红外与可见光图像之间的光谱差异。为解决这一问题,论文提出了一种名为SEPG-Net的简单而有效的网络,其核心包括两个关键创新:首先,基于频域信息和灰度空间的同质化光谱增强方案,避免了传统模态转换中的信息丢失;其次,引入了伪锚点引导的双向聚合(PABA)损失函数,旨在减少局部模态差异的同时更好地保留判别性身份嵌入。实验结果表明,SEPG-Net在两个公开基准数据集上均优于其他最先进的方法。

链接: https://arxiv.org/abs/2412.19111
作者: Yiyuan Ge,Zhihao Chen,Ziyang Wang,Jiaju Kang,Mingya Zhang
机构: 未知
关键词: Visible-infrared person re-identification, person re-identification, technology in intelligent, intelligent security, development of deep
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The development of deep learning has facilitated the application of person re-identification (ReID) technology in intelligent security. Visible-infrared person re-identification (VI-ReID) aims to match pedestrians across infrared and visible modality images, enabling 24-hour surveillance. However, current studies rely on unsupervised modality transformations and inefficient embedding constraints to bridge the spectral differences between infrared and visible images, which limits their potential performance. To tackle the limitations of the above approaches, this paper introduces a simple yet effective Spectral Enhancement and Pseudo-anchor Guidance Network, named SEPG-Net. Specifically, we propose a more homogeneous spectral enhancement scheme based on frequency domain information and greyscale space, which avoids the information loss typically caused by inefficient modality transformations. Further, a Pseudo Anchor-guided Bidirectional Aggregation (PABA) loss is introduced to bridge local modality discrepancies while better preserving discriminative identity embeddings. Experimental results on two public benchmark datasets demonstrate the superior performance of SEPG-Net against other state-of-the-art methods. The code is available at this https URL.
zh

[CV-87] Improving Generative Pre-Training: An In-depth Study of Masked Image Modeling and Denoising Models

【速读】: 该论文旨在探讨在预训练深度网络(deep networks)中引入加性噪声(additive noise)的影响,特别是在与掩码图像建模(masked image modeling)结合时,为何其效果在识别任务中表现有限。通过深入研究,论文提出了三个关键条件以有效结合这两种方法:首先,噪声的破坏与恢复必须在编码器(encoder)内部进行;其次,噪声应引入特征空间(feature space);最后,必须明确区分加噪和掩码的标记(tokens)。通过实施这些条件,论文展示了在多种识别任务中预训练性能的提升,尤其是那些需要细粒度、高频信息解决的任务。

链接: https://arxiv.org/abs/2412.19104
作者: Hyesong Choi,Daeun Kim,Sungmin Cha,Kwang Moo Yi,Dongbo Min
机构: 未知
关键词: pre-training deep networks, deep networks, dive deep, additive noise inspired, additive noise
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this work, we dive deep into the impact of additive noise in pre-training deep networks. While various methods have attempted to use additive noise inspired by the success of latent denoising diffusion models, when used in combination with masked image modeling, their gains have been marginal when it comes to recognition tasks. We thus investigate why this would be the case, in an attempt to find effective ways to combine the two ideas. Specifically, we find three critical conditions: corruption and restoration must be applied within the encoder, noise must be introduced in the feature space, and an explicit disentanglement between noised and masked tokens is necessary. By implementing these findings, we demonstrate improved pre-training performance for a wide range of recognition tasks, including those that require fine-grained, high-frequency information to solve.
zh

[CV-88] Reconstruction Target Matters in Masked Image Modeling for Cross-Domain Few-Shot Learning

【速读】: 该论文旨在解决跨域少样本学习(Cross-Domain Few-Shot Learning, CDFSL)中的挑战,即在源域数据丰富而目标域数据稀缺的情况下,模型如何有效地进行知识迁移以适应新领域。由于源域和目标域之间存在较大的领域差异,CDFSL成为一个极具挑战性的问题。论文发现,尽管掩码自编码器(Masked Autoencoder, MAE)在利用未标记数据和学习图像全局结构方面表现出色,但在CDFSL任务中,其性能甚至低于基线监督模型。通过深入分析,作者发现MAE在像素重建过程中倾向于关注低层次领域信息,而将重建目标改为令牌特征可以缓解这一问题。然而,并非所有特征都有益,重建高层次特征难以提升模型的迁移能力,这表明在过滤领域信息和保留图像全局结构之间存在权衡。基于这些发现,论文提出了领域无关的掩码图像建模(Domain-Agnostic Masked Image Modeling, DAMIM),其关键包括一个聚合特征重建模块,用于自动聚合特征以实现重建,平衡领域无关信息和图像全局结构的学习,以及一个轻量级解码器模块,进一步提升编码器的泛化能力。实验结果表明,该方法在四个CDFSL数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2412.19101
作者: Ran Ma,Yixiong Zou,Yuhua Li,Ruixuan Li
机构: 未知
关键词: Cross-Domain Few-Shot Learning, gap makes CDFSL, data-abundant source domain, large domain gap, domain gap makes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-Domain Few-Shot Learning (CDFSL) requires the model to transfer knowledge from the data-abundant source domain to data-scarce target domains for fast adaptation, where the large domain gap makes CDFSL a challenging problem. Masked Autoencoder (MAE) excels in effectively using unlabeled data and learning image’s global structures, enhancing model generalization and robustness. However, in the CDFSL task with significant domain shifts, we find MAE even shows lower performance than the baseline supervised models. In this paper, we first delve into this phenomenon for an interpretation. We find that MAE tends to focus on low-level domain information during reconstructing pixels while changing the reconstruction target to token features could mitigate this problem. However, not all features are beneficial, as we then find reconstructing high-level features can hardly improve the model’s transferability, indicating a trade-off between filtering domain information and preserving the image’s global structure. In all, the reconstruction target matters for the CDFSL task. Based on the above findings and interpretations, we further propose Domain-Agnostic Masked Image Modeling (DAMIM) for the CDFSL task. DAMIM includes an Aggregated Feature Reconstruction module to automatically aggregate features for reconstruction, with balanced learning of domain-agnostic information and images’ global structure, and a Lightweight Decoder module to further benefit the encoder’s generalizability. Experiments on four CDFSL datasets demonstrate that our method achieves state-of-the-art performance.
zh

[CV-89] From Coin to Data: The Impact of Object Detection on Digital Numismatics

【速读】: 该论文旨在解决数字钱币学(digital numismatics)中历史钱币分析与分类的挑战,特别是针对复杂图案和低质量图像的识别问题。研究通过应用先进的物体检测技术,尤其是对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP)模型,开发了一个灵活的框架,结合图像和文本描述来识别和分类钱币特征。研究的关键解决方案包括:1)利用CLIP模型在处理复杂图案时的优越性能;2)传统方法在识别简单几何图案中的有效性;3)提出一种统计校准机制,以提高低质量数据集中相似性评分的可靠性。这些方法为文化遗产研究、文物来源鉴定和赝品检测提供了新的方法论基础。

链接: https://arxiv.org/abs/2412.19091
作者: Rafael Cabral,Maria De Iorio,Andrew Harris
机构: 未知
关键词: Contrastive Language-Image Pre-training, Southeast Asian coins, investigate the application, application of advanced, advanced object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work we investigate the application of advanced object detection techniques to digital numismatics, focussing on the analysis of historical coins. Leveraging models such as Contrastive Language-Image Pre-training (CLIP), we develop a flexible framework for identifying and classifying specific coin features using both image and textual descriptions. By examining two distinct datasets, modern Russian coins featuring intricate “Saint George and the Dragon” designs and degraded 1st millennium AD Southeast Asian coins bearing Hindu-Buddhist symbols, we evaluate the efficacy of different detection algorithms in search and classification tasks. Our results demonstrate the superior performance of larger CLIP models in detecting complex imagery, while traditional methods excel in identifying simple geometric patterns. Additionally, we propose a statistical calibration mechanism to enhance the reliability of similarity scores in low-quality datasets. This work highlights the transformative potential of integrating state-of-the-art object detection into digital numismatics, enabling more scalable, precise, and efficient analysis of historical artifacts. These advancements pave the way for new methodologies in cultural heritage research, artefact provenance studies, and the detection of forgeries.
zh
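下面是一个基于 Hugging Face transformers 中公开 CLIP 模型的示意片段,说明 "图文相似度打分 + 简单统计校准" 的流程(模型名取常用的 openai/clip-vit-base-patch32,运行时需联网下载权重;此处用纯色占位图像代替真实硬币照片,z-score 校准也只是对论文所述校准机制的一种假设性简化):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# 预训练 CLIP 的图文相似度打分;硬币图像此处用灰色占位图代替
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a coin depicting Saint George slaying a dragon",
         "a coin bearing a Hindu-Buddhist symbol",
         "a plain coin without figures"]
images = [Image.new("RGB", (224, 224), color=(128, 128, 128)) for _ in range(4)]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image        # (4, 3) 图像-文本相似度

# "统计校准"的一个极简版本:对每个文本查询做 z-score 标准化,
# 缓解低质量数据集中绝对相似度不可比的问题(具体校准方式为假设)
calibrated = (sims - sims.mean(dim=0, keepdim=True)) / (sims.std(dim=0, keepdim=True) + 1e-6)
print(calibrated)
```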

[CV-90] Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos

【速读】: 该论文旨在解决在现实场景中,动态神经场重建(dynamic neural field reconstruction)面临的输入限制问题,即传统方法依赖于同步多视角视频和已知相机姿态,而这些条件在实际应用中往往难以满足。论文提出了一种解决方案,利用未同步且相机姿态未知的视频来生成动态神经场,前提是这些视频捕捉了人体运动。关键步骤包括:首先,通过先进方法估计人体的形状和姿态参数,尽管存在噪声,但这些参数为高度非凸且欠约束的动态神经表示训练提供了良好的初始化;其次,基于人体姿态和形状序列,估计视频间的时间偏移,并通过分析3D关节位置来估计相机姿态;最后,采用多分辨率网格(multiresolution grids)训练动态NeRF(Neural Radiance Fields),同时优化时间偏移和相机姿态。为了稳定这一涉及大量参数优化的过程,论文引入了一种鲁棒的渐进学习策略。实验结果表明,该方法在复杂条件下实现了精确的时空校准和高质量的场景重建。

链接: https://arxiv.org/abs/2412.19089
作者: Changwoon Choi(1),Jeongjun Kim(1),Geonho Cha(2),Minkwan Kim(1),Dongyoon Wee(2),Young Min Kim(1) ((1) Seoul National University, (2) NAVER Cloud)
机构: 未知
关键词: Recent works, synchronized multi-view videos, synchronized multi-view, reconstruction assume input, field reconstruction assume
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent works on dynamic neural field reconstruction assume input from synchronized multi-view videos with known poses. These input constraints are often unmet in real-world setups, making the approach impractical. We demonstrate that unsynchronized videos with unknown poses can generate dynamic neural fields if the videos capture human motion. Humans are one of the most common dynamic subjects whose poses can be estimated using state-of-the-art methods. While noisy, the estimated human shape and pose parameters provide a decent initialization for the highly non-convex and under-constrained problem of training a consistent dynamic neural representation. Given the sequences of pose and shape of humans, we estimate the time offsets between videos, followed by camera pose estimations by analyzing 3D joint locations. Then, we train dynamic NeRF employing multiresolution grids while simultaneously refining both time offsets and camera poses. The setup still involves optimizing many parameters, therefore, we introduce a robust progressive learning strategy to stabilize the process. Experiments show that our approach achieves accurate spatiotemporal calibration and high-quality scene reconstruction in challenging conditions.
zh
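下面用一个简化的 NumPy 草图示意 "利用两段视频中估计出的人体关节三维轨迹滑动对齐来估计时间偏移" 这一步(假设两段视频帧率相同、关节轨迹已在同一坐标系下;论文的实际做法更复杂,并会随后与相机位姿一起联合优化):

```python
import numpy as np

def estimate_time_offset(joints_a, joints_b, max_offset=60):
    """极简示意:在 ±max_offset 帧范围内滑动对齐两条关节轨迹,
    取平均欧氏距离最小的偏移作为两段视频的时间差。"""
    best_offset, best_err = 0, np.inf
    for offset in range(-max_offset, max_offset + 1):
        if offset >= 0:
            a, b = joints_a[offset:], joints_b[:len(joints_b) - offset]
        else:
            a, b = joints_a[:len(joints_a) + offset], joints_b[-offset:]
        n = min(len(a), len(b))
        if n < 10:                       # 重叠帧太少则跳过该偏移
            continue
        err = np.linalg.norm(a[:n] - b[:n], axis=-1).mean()
        if err < best_err:
            best_offset, best_err = offset, err
    return best_offset, best_err

# 合成一条关节轨迹,并构造真实偏移为 17 帧的第二条轨迹(加少量噪声)
t = np.linspace(0, 10, 400)
traj = np.stack([np.sin(t), np.cos(0.5 * t), 0.1 * t], axis=-1)     # (400, 3)
offset_gt = 17
traj_b = traj[offset_gt:] + 0.01 * np.random.randn(*traj[offset_gt:].shape)
print(estimate_time_offset(traj, traj_b))       # 估计出的偏移应接近 17
```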

[CV-91] Mask Factory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation

【速读】: 该论文旨在解决二值图像分割(Dichotomous Image Segmentation, DIS)任务中数据集创建的高成本、高劳动强度以及需要大量领域专业知识的问题。传统方法在生成精确标注时面临诸多挑战,而现有的生成模型和技术在处理场景偏差、噪声引起的错误以及训练样本多样性不足等问题时表现不佳。为此,论文提出了一种名为 Mask Factory 的新方法,通过结合刚性和非刚性编辑技术生成高质量合成掩码,显著减少了数据集准备的时间和成本。具体而言,刚性编辑利用扩散模型(diffusion models)的几何先验知识,在零样本条件下实现精确的视角变换;非刚性编辑则通过对抗训练(adversarial training)和自注意力机制(self-attention mechanisms)进行复杂且拓扑一致的修改。此外,采用多条件控制生成方法生成高分辨率图像和精确分割掩码对。实验结果表明,该方法在广泛使用的 DIS5K 数据集上表现出优于现有方法的质量和效率。

链接: https://arxiv.org/abs/2412.19080
作者: Haotian Qian,YD Chen,Shengtao Lou,Fahad Shahbaz Khan,Xiaogang Jin,Deng-Ping Fan
机构: 未知
关键词: tasks require highly, require extensive domain, extensive domain expertise, Dichotomous Image Segmentation, require highly precise
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dichotomous Image Segmentation (DIS) tasks require highly precise annotations, and traditional dataset creation methods are labor intensive, costly, and require extensive domain expertise. Although using synthetic data for DIS is a promising solution to these challenges, current generative models and techniques struggle with the issues of scene deviations, noise-induced errors, and limited training sample variability. To address these issues, we introduce a novel approach, Mask Factory, which provides a scalable solution for generating diverse and precise datasets, markedly reducing preparation time and costs. We first introduce a general mask editing method that combines rigid and non-rigid editing techniques to generate high-quality synthetic masks. Specifically, rigid editing leverages geometric priors from diffusion models to achieve precise viewpoint transformations under zero-shot conditions, while non-rigid editing employs adversarial training and self-attention mechanisms for complex, topologically consistent modifications. Then, we generate pairs of high-resolution images and accurate segmentation masks using a multi-conditional control generation method. Finally, our experiments on the widely-used DIS5K dataset benchmark demonstrate superior performance in quality and efficiency compared to existing methods. The code is available at this https URL.
zh

[CV-92] Learning Monocular Depth from Events via Egomotion Compensation

【速读】: 该论文旨在解决基于事件相机(event cameras)的单目深度估计(monocular depth estimation)中现有方法过度参数化且未能充分利用事件数据中丰富时间信息的问题。现有方法通常将事件流视为黑箱学习系统,忽略了先验物理原理,导致模型复杂且性能受限。为解决这一问题,论文提出了一种可解释的单目深度估计框架,其核心在于结合物理运动原理,通过运动补偿(motion compensation)显式确定不同深度假设的似然性。关键解决方案包括引入焦点成本判别(Focus Cost Discrimination, FCD)模块,通过测量边缘清晰度作为聚焦水平的关键指标,并结合空间上下文以优化成本估计;此外,论文还提出了跨假设成本聚合(Inter-Hypotheses Cost Aggregation, IHCA)模块,通过成本趋势预测和多尺度成本一致性约束来优化成本体积(cost volume)。实验结果表明,该框架在真实和合成数据集上的绝对相对误差(absolute relative error)指标上优于现有方法,最高提升达10%。

链接: https://arxiv.org/abs/2412.19067
作者: Haitao Meng,Chonghao Zhong,Sheng Tang,Lian JunJia,Wenwei Lin,Zhenshan Bing,Yi Chang,Gang Chen,Alois Knoll
机构: 未知
关键词: neuromorphically inspired sensors, asynchronously report brightness, neuromorphically inspired, inspired sensors, sensors that sparsely
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Event cameras are neuromorphically inspired sensors that sparsely and asynchronously report brightness changes. Their unique characteristics of high temporal resolution, high dynamic range, and low power consumption make them well-suited for addressing challenges in monocular depth estimation (e.g., high-speed or low-lighting conditions). However, current existing methods primarily treat event streams as black-box learning systems without incorporating prior physical principles, thus becoming over-parameterized and failing to fully exploit the rich temporal information inherent in event camera data. To address this limitation, we incorporate physical motion principles to propose an interpretable monocular depth estimation framework, where the likelihood of various depth hypotheses is explicitly determined by the effect of motion compensation. To achieve this, we propose a Focus Cost Discrimination (FCD) module that measures the clarity of edges as an essential indicator of focus level and integrates spatial surroundings to facilitate cost estimation. Furthermore, we analyze the noise patterns within our framework and improve it with the newly introduced Inter-Hypotheses Cost Aggregation (IHCA) module, where the cost volume is refined through cost trend prediction and multi-scale cost consistency constraints. Extensive experiments on real-world and synthetic datasets demonstrate that our proposed framework outperforms cutting-edge methods by up to 10% in terms of the absolute relative error metric, revealing superior prediction accuracy.
zh
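
顺着上面的思路,下面给出一段极简的数值示意(事件数据、分辨率与补偿模型均为笔者虚构,并非论文实现):当补偿速度(对应某个深度假设)越接近真实像素速度时,补偿后累积图像的边缘越清晰、梯度能量越大,这正是 FCD 模块"以边缘清晰度作为聚焦代价"的直观含义。

```python
import numpy as np

def warp_events(x, y, t, flow, t_ref=0.0):
    """按给定像素速度(由自运动与某个深度假设推出)把事件补偿回参考时刻。"""
    dt = t - t_ref
    return x - flow[0] * dt, y - flow[1] * dt

def focus_cost(x_w, y_w, img_size=(64, 64)):
    """把补偿后的事件累积成图像,以梯度能量衡量边缘清晰度(聚焦程度)。"""
    H, W = img_size
    img, _, _ = np.histogram2d(y_w, x_w, bins=[H, W], range=[[0, H], [0, W]])
    gy, gx = np.gradient(img)
    return float(np.mean(gx ** 2 + gy ** 2))

# 玩具数据:三条竖直边缘以 (80, 0) 像素/秒移动时触发的事件
rng = np.random.default_rng(0)
n = 4000
t = rng.uniform(0.0, 0.1, n)
x0 = rng.choice([20.0, 32.0, 44.0], n) + rng.normal(0, 0.2, n)
y0 = rng.uniform(5, 59, n)
x, y = x0 + 80.0 * t, y0

# 遍历不同深度假设对应的补偿速度:越接近真实速度,聚焦代价越高
for hyp in [20.0, 80.0, 160.0]:
    xw, yw = warp_events(x, y, t, (hyp, 0.0))
    print(hyp, round(focus_cost(xw, yw), 1))
```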

[CV-93] DAPoinTr: Domain Adaptive Point Transformer for Point Cloud Completion AAAI2025

【速读】: 该论文旨在解决点云补全(point cloud completion)中领域自适应(domain adaptation)的问题,特别是在点Transformer(Point Transformer, PoinTr)模型中如何有效提升目标领域的可迁移性。现有方法在点Transformer的CNN骨干网络上直接进行特征对齐(feature alignment)效果有限,因为无法保证Transformer中序列级别的领域不变特征。为此,作者提出了一种创新的领域自适应点Transformer框架(Domain Adaptive Point Transformer, DAPoinTr),其核心包括三个关键组件:基于领域查询的特征对齐(Domain Query-based Feature Alignment, DQFA)、点令牌级别的特征对齐(Point Token-wise Feature Alignment, PTFA)和投票预测一致性(Voted Prediction Consistency, VPC)。DQFA通过引入领域代理(domain proxy)和领域查询(domain query)分别在Transformer编码器和解码器中缩小全局领域差异;PTFA通过对齐点代理(point proxy)和动态查询(dynamic query)在编码器和解码器中减少局部领域偏移;VPC则将多个Transformer解码器视为专家集合(multiple of experts, MoE),通过集成预测投票和伪标签生成进一步提升模型性能。实验结果表明,DAPoinTr在多个领域自适应基准数据集上表现出显著的有效性和优越性。

链接: https://arxiv.org/abs/2412.19062
作者: Yinghui Li,Qianyu Zhou,Jingyu Gong,Ye Zhu,Richard Dazeley,Xinkui Zhao,Xuequan Lu
机构: 未知
关键词: shown great potential, point Transformer CNN, Adaptive Point Transformer, Transformer, point Transformer
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to AAAI 2025

点击查看摘要

Abstract:Point Transformers (PoinTr) have shown great potential in point cloud completion recently. Nevertheless, effective domain adaptation that improves transferability toward target domains remains unexplored. In this paper, we delve into this topic and empirically discover that direct feature alignment on point Transformer’s CNN backbone only brings limited improvements since it cannot guarantee sequence-wise domain-invariant features in the Transformer. To this end, we propose a pioneering Domain Adaptive Point Transformer (DAPoinTr) framework for point cloud completion. DAPoinTr consists of three key components: Domain Query-based Feature Alignment (DQFA), Point Token-wise Feature alignment (PTFA), and Voted Prediction Consistency (VPC). In particular, DQFA is presented to narrow the global domain gaps from the sequence via the presented domain proxy and domain query at the Transformer encoder and decoder, respectively. PTFA is proposed to close the local domain shifts by aligning the tokens, i.e., point proxy and dynamic query, at the Transformer encoder and decoder, respectively. VPC is designed to consider different Transformer decoders as multiple of experts (MoE) for ensembled prediction voting and pseudo-label generation. Extensive experiments with visualization on several domain adaptation benchmarks demonstrate the effectiveness and superiority of our DAPoinTr compared with state-of-the-art methods. Code will be publicly available at: this https URL
zh

[CV-94] SpectralKD: Understanding and Optimizing Vision Transformer Distillation through Spectral Analysis

【速读】: 该论文旨在解决知识蒸馏(Knowledge Distillation)过程中知识转移机制不明确的问题,并提出了一种优化蒸馏过程的方法。通过引入谱分析(Spectral Analysis)方法,论文揭示了CaiT模型在其前几层和最后几层集中信息的特点,从而为特征图蒸馏(Feature Map Distillation)的层选择提供了指导。此外,研究发现尽管Swin Transformer和CaiT在架构上存在差异,但它们表现出相似的谱编码模式,这增强了对Transformer架构的理解,并改进了特征图对齐策略。基于这些发现,论文提出了一种简单而有效的谱对齐方法,称为SpectralKD。实验结果表明,遵循这些指导原则,SpectralKD在ImageNet-1k Top-1准确率上实现了显著的性能提升(DeiT-Tiny: +5.2%,Swin-Tiny: +1.4%)。此外,通过对有蒸馏和无蒸馏训练的学生模型进行谱分析,研究发现蒸馏模型能够反映其教师模型的谱模式,这为解释知识蒸馏的动态过程提供了新的视角。

链接: https://arxiv.org/abs/2412.19055
作者: Huiyuan Tian,Bonan Xu,Shijian Li,Gang Pan
机构: 未知
关键词: remain poorly understood, mechanisms remain poorly, transfer mechanisms remain, underlying knowledge transfer, knowledge transfer mechanisms
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge distillation effectively reduces model complexity while improving performance, yet the underlying knowledge transfer mechanisms remain poorly understood. We propose novel spectral analysis methods and guidelines to optimize distillation, making the knowledge transfer process more interpretable. Our analysis reveals that CaiT models concentrate information in their first and last few layers, informing optimal layer selection for feature map distillation. Surprisingly, we discover that Swin Transformer and CaiT exhibit similar spectral encoding patterns despite their architectural differences, enhancing our understanding of transformer architectures and leading to improved feature map alignment strategies. Based on these insights, we introduce a simple yet effective spectral alignment method named SpectralKD. Experimental results demonstrate that following our guidelines enables SpectralKD to achieve state-of-the-art performance (DeiT-Tiny: +5.2% , Swin-Tiny: +1.4% in ImageNet-1k Top-1 accuracy). Furthermore, through spectral analysis of student models trained with and without distillation, we show that distilled models mirror spectral patterns of their teachers, providing a new lens for interpreting knowledge distillation dynamics. Our code, pre-trained models, and experimental logs will be made publicly available.
zh
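
作为参考,下面是笔者按上述描述推测的一个谱对齐损失最小示意(特征尺寸、层选择与归一化方式均为假设,非官方实现):对学生与教师特征图做二维 FFT,比较归一化后的对数幅度谱,并只在信息集中的首、末层上累加损失。

```python
import torch
import torch.nn.functional as F

def spectral_alignment_loss(feat_s, feat_t, eps=1e-8):
    """对齐学生与教师特征图的对数幅度谱,输入形状 [B, C, H, W]。"""
    def log_mag(x):
        spec = torch.fft.rfft2(x, norm="ortho")              # 在最后两维做 FFT
        mag = torch.log1p(spec.abs())                        # 对数幅度谱
        return mag / (mag.mean(dim=(-2, -1), keepdim=True) + eps)
    return F.l1_loss(log_mag(feat_s), log_mag(feat_t))

torch.manual_seed(0)
student_feats = [torch.randn(2, 192, 14, 14) for _ in range(4)]
teacher_feats = [torch.randn(2, 192, 14, 14) for _ in range(4)]
selected = [0, 3]                                            # 只蒸馏首层与末层(示意)
loss = sum(spectral_alignment_loss(student_feats[i], teacher_feats[i]) for i in selected)
print(loss.item())
```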

[CV-95] Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation AAAI-25

【速读】: 该论文旨在解决开放词汇场景图生成(Open-vocabulary Scene Graph Generation, OV-SGG)中现有方法受限于固定文本表示的问题,这些问题限制了图像-文本对齐的多样性和准确性。为解决这些挑战,论文提出了关系感知分层提示(Relation-Aware Hierarchical Prompting, RAHP)框架。该框架的关键在于通过整合主客体信息和区域特定关系信息来增强文本表示,利用实体聚类(entity clustering)处理关系三元组类别的复杂性,并借助大语言模型(Large Language Model, LLM)生成细粒度的区域感知提示,从而捕捉视觉交互的细节并提升视觉与文本模态的对齐效果。此外,RAHP框架引入了视觉-语言模型(Vision-Language Models, VLMs)中的动态选择机制,根据视觉内容自适应地选择相关文本提示,减少无关提示的噪声。通过在Visual Genome和Open Images v6数据集上的广泛实验,该框架展示了其在开放词汇场景图生成中的卓越性能。

链接: https://arxiv.org/abs/2412.19021
作者: Tao Liu,Rongjie Li,Chongyu Wang,Xuming He
机构: 未知
关键词: Scene Graph Generation, aligning visual relationship, Open-vocabulary Scene Graph, overcomes the limitations, closed-set assumption
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI-25

点击查看摘要

Abstract:Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information. Our approach utilizes entity clustering to address the complexity of relation triplet categories, enabling the effective integration of subject-object information. Additionally, we utilize a large language model (LLM) to generate detailed region-aware prompts, capturing fine-grained visual interactions and improving alignment between visual and textual modalities. RAHP also introduces a dynamic selection mechanism within Vision-Language Models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets demonstrate that our framework consistently achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of open-vocabulary scene graph generation.
zh

[CV-96] Imperceptible Adversarial Attacks on Point Clouds Guided by Point-to-Surface Field ICASSP2025

【速读】: 该论文旨在解决点云(point clouds)对抗攻击(adversarial attacks)中难以平衡攻击效果与不可感知性(imperceptibility)的问题。传统方法在攻击过程中严格限制点的位移,导致攻击效果与不可感知性之间的权衡困难。论文指出,点云对抗攻击的不可感知性不足主要源于点偏离其原始表面。为解决这一问题,论文提出了一种新颖的点到表面(P2S, point-to-surface)场,通过将点拉回其原始表面来调整对抗扰动方向。具体而言,论文利用去噪网络(denoising network)学习形状表面对数密度函数(logarithmic density function)的梯度场,并在攻击过程中应用距离感知调整(distance-aware adjustment)来优化扰动方向,从而显著提升了攻击的不可感知性。实验结果表明,基于P2S场的对抗攻击在不可感知性方面优于现有最先进方法。

链接: https://arxiv.org/abs/2412.19015
作者: Keke Tang,Weiyao Ke,Weilong Peng,Xiaofei Wang,Ziyong Du,Zhize Wu,Peican Zhu,Zhihong Tian
机构: 未知
关键词: deep learning models, deep learning, learning models, crucial for assessing, assessing and improving
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Adversarial attacks on point clouds are crucial for assessing and improving the adversarial robustness of 3D deep learning models. Traditional solutions strictly limit point displacement during attacks, making it challenging to balance imperceptibility with adversarial effectiveness. In this paper, we attribute the inadequate imperceptibility of adversarial attacks on point clouds to deviations from the underlying surface. To address this, we introduce a novel point-to-surface (P2S) field that adjusts adversarial perturbation directions by dragging points back to their original underlying surface. Specifically, we use a denoising network to learn the gradient field of the logarithmic density function encoding the shape’s surface, and apply a distance-aware adjustment to perturbation directions during attacks, thereby enhancing imperceptibility. Extensive experiments show that adversarial attacks guided by our P2S field are more imperceptible, outperforming state-of-the-art methods.
zh
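
下面的示意代码用一个球面的解析得分场代替论文中由去噪网络学到的梯度场(sphere_score 为笔者虚构的玩具替身),展示"按偏离程度把攻击方向与指向曲面的方向加权融合"这一核心操作:

```python
import torch
import torch.nn.functional as F

def p2s_adjust(points, adv_grad, score_fn, alpha=0.5):
    """把攻击梯度方向与"拉回曲面"的得分方向融合,偏离越远拉回权重越大。
    points: [N, 3];adv_grad: 攻击损失对点的梯度;score_fn: 近似 ∇log p(x)。"""
    score = score_fn(points)
    dist = score.norm(dim=-1, keepdim=True)            # 以 |score| 粗略度量偏离程度
    w = torch.sigmoid(alpha * dist)                     # 距离感知权重
    direction = (1 - w) * adv_grad + w * F.normalize(score, dim=-1)
    return F.normalize(direction, dim=-1)

def sphere_score(x):
    """玩具得分场:把点拉回单位球面(真实实现应由去噪网络给出)。"""
    r = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return (1.0 - r) * x / r

pts = torch.randn(1024, 3) * 1.3
adv_g = F.normalize(torch.randn(1024, 3), dim=-1)       # 假设已算好的攻击梯度方向
step = 0.01 * p2s_adjust(pts, adv_g, sphere_score)
print(step.shape, round(step.norm(dim=-1).mean().item(), 4))
```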

[CV-97] FACEMUG: A Multimodal Generative and Fusion Framework for Local Facial Editing

【速读】: 该论文旨在解决现有面部编辑方法在多模态条件局部面部编辑中的不足,特别是多次迭代增量编辑后输出图像质量显著下降的问题。现有方法通常不支持局部编辑,导致未编辑部分受到影响。为此,论文提出了一种新颖的多模态生成与融合框架(FACEMUG),用于实现全局一致的局部面部编辑。该框架能够处理多种输入模态(如草图、语义图、颜色图、示例图像、文本和属性标签),并在保持未编辑部分不变的同时,实现细粒度和语义化的编辑。解决方案的关键在于:1)将多模态信息整合到统一的生成潜在空间中,通过多模态特征融合机制(包括多模态聚合和风格融合模块)在潜在空间和特征空间中融合面部先验和多模态信息;2)引入自监督潜在扭曲算法,校正未对齐的面部特征,有效将编辑图像的姿态传递到给定的潜在代码中。实验结果表明,FACEMUG在编辑质量、灵活性和语义控制方面优于现有方法,为广泛的局部面部编辑任务提供了有效的解决方案。

链接: https://arxiv.org/abs/2412.19009
作者: Wanglong Lu,Jikai Wang,Xiaogang Jin,Xianta Jiang,Hanli Zhao
机构: 未知
关键词: Existing facial editing, supporting multimodal conditional, Existing facial, local facial editing, achieved remarkable results
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Published at IEEE Transactions on Visualization and Computer Graphics; 21 pages, 26 figures

点击查看摘要

Abstract:Existing facial editing methods have achieved remarkable results, yet they often fall short in supporting multimodal conditional local facial editing. One of the significant evidences is that their output image quality degrades dramatically after several iterations of incremental editing, as they do not support local editing. In this paper, we present a novel multimodal generative and fusion framework for globally-consistent local facial editing (FACEMUG) that can handle a wide range of input modalities and enable fine-grained and semantic manipulation while remaining unedited parts unchanged. Different modalities, including sketches, semantic maps, color maps, exemplar images, text, and attribute labels, are adept at conveying diverse conditioning details, and their combined synergy can provide more explicit guidance for the editing process. We thus integrate all modalities into a unified generative latent space to enable multimodal local facial edits. Specifically, a novel multimodal feature fusion mechanism is proposed by utilizing multimodal aggregation and style fusion blocks to fuse facial priors and multimodalities in both latent and feature spaces. We further introduce a novel self-supervised latent warping algorithm to rectify misaligned facial features, efficiently transferring the pose of the edited image to the given latent codes. We evaluate our FACEMUG through extensive experiments and comparisons to state-of-the-art (SOTA) methods. The results demonstrate the superiority of FACEMUG in terms of editing quality, flexibility, and semantic control, making it a promising solution for a wide range of local facial editing tasks.
zh

[CV-98] MGAN-CRCM: A Novel Multiple Generative Adversarial Network and Coarse-Refinement Based Cognizant Method for Image Inpainting

【速读】: 该论文旨在解决图像修复(Image Inpainting)中缺失或损坏像素的重建问题。传统方法在处理复杂图像时存在局限性,而生成对抗网络(Generative Adversarial Networks, GANs)和残差网络(Residual Networks, ResNet)的引入显著提升了修复效果。论文提出了一种结合GAN和ResNet的新架构,其关键解决方案包括三个核心组件:基于转置卷积的GAN(Transpose Convolution-based GAN)用于引导和盲修复,快速ResNet-卷积神经网络(Fast ResNet-Convolutional Neural Network, FR-CNN)用于对象移除,以及协同调制GAN(Co-Modulation GAN, Co-Mod GAN)用于精细修复。该架构在Image-Net、Places2和CelebA等基准数据集上表现出色,准确率分别达到96.59%、96.70%和96.16%,显著优于现有方法,验证了其在定性和定量评估中的有效性。

链接: https://arxiv.org/abs/2412.19000
作者: Nafiz Al Asad,Md. Appel Mahmud Pranto,Shbiruzzaman Shiam,Musaddeq Mahmud Akand,Mohammad Abu Yousuf,Khondokar Fida Hasan,Mohammad Ali Moni
机构: 未知
关键词: Generative Adversarial Networks, widely used technique, technique in computer, computer vision, vision for reconstructing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 34 pages

点击查看摘要

Abstract:Image inpainting is a widely used technique in computer vision for reconstructing missing or damaged pixels in images. Recent advancements with Generative Adversarial Networks (GANs) have demonstrated superior performance over traditional methods due to their deep learning capabilities and adaptability across diverse image domains. Residual Networks (ResNet) have also gained prominence for their ability to enhance feature representation and compatibility with other architectures. This paper introduces a novel architecture combining GAN and ResNet models to improve image inpainting outcomes. Our framework integrates three components: Transpose Convolution-based GAN for guided and blind inpainting, Fast ResNet-Convolutional Neural Network (FR-CNN) for object removal, and Co-Modulation GAN (Co-Mod GAN) for refinement. The model’s performance was evaluated on benchmark datasets, achieving accuracies of 96.59% on Image-Net, 96.70% on Places2, and 96.16% on CelebA. Comparative analyses demonstrate that the proposed architecture outperforms existing methods, highlighting its effectiveness in both qualitative and quantitative evaluations.
zh

[CV-99] MiTREE: Multi-input Transformer Ecoregion Encoder for Species Distribution Modelling

【速读】: 该论文旨在解决物种分布模型(Species Distribution Models, SDMs)在处理多源数据(如遥感影像和环境数据)时,难以有效利用空间关系及整合地理位置和生态特征的问题。传统SDMs依赖专家观察,耗时且效率低,而现有的机器学习方法在处理空间关系时往往需要上采样或扭曲原始输入数据,且未能充分整合地理位置和生态背景信息。论文提出的解决方案是MiTREE模型,该模型基于多输入视觉Transformer(Vision-Transformer)架构,并引入了生态区编码器(ecoregion encoder)。MiTREE的关键创新在于能够在不进行上采样的前提下计算空间跨模态关系,并有效整合地理位置和生态背景信息,从而提升物种分布预测的准确性。该模型在SatBird夏季和冬季数据集上进行了评估,结果表明其优于现有的最先进基线方法。

链接: https://arxiv.org/abs/2412.18995
作者: Theresa Chen,Yao-Yi Chiang
机构: 未知
关键词: Climate change poses, threat to biodiversity, making it imperative, Species Distribution Models, change poses
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 11 pages, GeoAI Workshop and SIGSPATIAL 2024

点击查看摘要

Abstract:Climate change poses an extreme threat to biodiversity, making it imperative to efficiently model the geographical range of different species. The availability of large-scale remote sensing images and environmental data has facilitated the use of machine learning in Species Distribution Models (SDMs), which aim to predict the presence of a species at any given location. Traditional SDMs, reliant on expert observation, are labor-intensive, but advancements in remote sensing and citizen science data have facilitated machine learning approaches to SDM development. However, these models often struggle with leveraging spatial relationships between different inputs – for instance, learning how climate data should inform the data present in satellite imagery – without upsampling or distorting the original inputs. Additionally, location information and ecological characteristics at a location play a crucial role in predicting species distribution models, but these aspects have not yet been incorporated into state-of-the-art approaches. In this work, we introduce MiTREE: a multi-input Vision-Transformer-based model with an ecoregion encoder. MiTREE computes spatial cross-modal relationships without upsampling as well as integrates location and ecological context. We evaluate our model on the SatBird Summer and Winter datasets, the goal of which is to predict bird species encounter rates, and we find that our approach improves upon state-of-the-art baselines.
zh

[CV-100] Geospatial Data Fusion: Combining Lidar SAR and Optical Imagery with AI for Enhanced Urban Mapping

【速读】: 该论文旨在解决单一传感器数据在城市测绘中的局限性问题,通过融合激光雷达(Lidar)、合成孔径雷达(SAR)和光学影像等多源地理空间数据,实现对城市环境的更全面表征。解决方案的关键在于采用全卷积网络(Fully Convolutional Networks, FCNs)作为深度学习模型,进行城市要素的像素级分类,并结合粒子群优化算法(Particle Swarm Optimization, PSO)进行超参数调优,从而显著提升模型精度。研究结果表明,FCN-PSO模型在像素精度和平均交并比(Intersection over Union, IoU)上均优于传统单一传感器方法,分别为92.3%和87.6%,验证了融合地理空间数据与人工智能方法在城市测绘中的潜力。

链接: https://arxiv.org/abs/2412.18994
作者: Sajjad Afroosheh,Mohammadreza Askari
机构: 未知
关键词: Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, advanced artificial intelligence, artificial intelligence techniques
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:This study explores the integration of Lidar, Synthetic Aperture Radar (SAR), and optical imagery through advanced artificial intelligence techniques for enhanced urban mapping. By fusing these diverse geospatial datasets, we aim to overcome the limitations associated with single-sensor data, achieving a more comprehensive representation of urban environments. The research employs Fully Convolutional Networks (FCNs) as the primary deep learning model for urban feature extraction, enabling precise pixel-wise classification of essential urban elements, including buildings, roads, and vegetation. To optimize the performance of the FCN model, we utilize Particle Swarm Optimization (PSO) for hyperparameter tuning, significantly enhancing model accuracy. Key findings indicate that the FCN-PSO model achieved a pixel accuracy of 92.3% and a mean Intersection over Union (IoU) of 87.6%, surpassing traditional single-sensor approaches. These results underscore the potential of fused geospatial data and AI-driven methodologies in urban mapping, providing valuable insights for urban planning and management. The implications of this research pave the way for future developments in real-time mapping and adaptive urban infrastructure planning.
zh
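
下面给出粒子群调参的一个最小示意(目标函数用一个虚构的光滑"验证误差"代替真实的 FCN 训练,搜索范围与粒子数均为假设),说明 PSO 如何在超参数空间内迭代逼近最优:

```python
import numpy as np

def pso(objective, bounds, n_particles=20, iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """极简粒子群优化:在 bounds 内搜索使 objective 最小的超参数向量。"""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    x = rng.uniform(lo, hi, size=(n_particles, len(lo)))
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), np.array([objective(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        val = np.array([objective(p) for p in x])
        better = val < pbest_val
        pbest[better], pbest_val[better] = x[better], val[better]
        g = pbest[pbest_val.argmin()].copy()
    return g, float(pbest_val.min())

def fake_val_error(p):
    """虚构的验证误差,关于 (log10 学习率, dropout) 的光滑函数,仅作演示。"""
    log_lr, dropout = p
    return (log_lr + 3.0) ** 2 + (dropout - 0.3) ** 2

best, err = pso(fake_val_error, np.array([[-5.0, -1.0], [0.0, 0.7]]))
print("best (log10 lr, dropout):", best, "val error:", round(err, 4))
```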

[CV-101] MTCAE-DFER: Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition

【速读】: 该论文旨在解决动态面部表情识别(Dynamic Facial Expression Recognition, DFER)中的两个主要问题:一是如何有效结合全局和局部动态特征以提升模型性能,二是如何缓解复杂大模型的过拟合问题。为此,论文提出了一种基于自编码器(autoencoder)的多任务学习框架,即多任务级联自编码器(Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition, MTCAE-DFER)。该框架的关键在于引入了一个即插即用的级联解码器模块,该模块基于视觉Transformer(Vision Transformer, ViT)架构,并利用Transformer的解码器概念重构了多头注意力(multi-head attention)模块。具体而言,前一个任务的解码器输出作为查询(Q),代表局部动态特征,而共享的Video Masked Autoencoder(VideoMAE)编码器输出则作为键(K)和值(V),代表全局动态特征。这种设计促进了相关任务中全局与局部动态特征的交互,从而增强了模型的泛化能力。通过广泛的消融实验和与现有最先进方法的对比,论文验证了MTCAE-DFER模型的鲁棒性以及全局-局部动态特征交互的有效性。

链接: https://arxiv.org/abs/2412.18988
作者: Peihao Xiang,Kaida Wu,Chaohao Lin,Ou Bai
机构: 未知
关键词: dynamic facial expression, facial expression recognition, cascaded network branch, Video Masked Autoencoder, Multi-Task Cascaded Autoencoder
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:This paper expands the cascaded network branch of the autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition, namely Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder module, which is based on the Vision Transformer (ViT) architecture and employs the decoder concept of Transformer to reconstruct the multi-head attention module. The decoder output from the previous task serves as the query (Q), representing local dynamic features, while the Video Masked Autoencoder (VideoMAE) shared encoder output acts as both the key (K) and value (V), representing global dynamic features. This setup facilitates interaction between global and local dynamic features across related tasks. Additionally, this proposal aims to alleviate overfitting of complex large model. We utilize autoencoder-based multi-task cascaded learning approach to explore the impact of dynamic face detection and dynamic face landmark on dynamic facial expression recognition, which enhances the model’s generalization ability. After we conduct extensive ablation experiments and comparison with state-of-the-art (SOTA) methods on various public datasets for dynamic facial expression recognition, the robustness of the MTCAE-DFER model and the effectiveness of global-local dynamic feature interaction among related tasks have been proven.
zh
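
"上一任务解码器输出作 Q、共享编码器输出作 K/V"的级联交叉注意力可以用如下示意代码表达(维度、任务顺序与层结构均为笔者假设,非官方实现):

```python
import torch
import torch.nn as nn

class CascadedTaskDecoder(nn.Module):
    """上一任务的解码器输出作为 Q(局部动态特征),共享编码器输出作为 K/V(全局动态特征)。"""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, prev_task_out, shared_enc_out):
        attn, _ = self.cross_attn(prev_task_out, shared_enc_out, shared_enc_out)
        x = self.n1(prev_task_out + attn)
        return self.n2(x + self.ffn(x))

enc = torch.randn(2, 196, 256)           # 共享 VideoMAE 编码器输出(序列长度为假设值)
det_out = torch.randn(2, 16, 256)        # 上一任务(人脸检测)解码器输出
landmark_dec, expr_dec = CascadedTaskDecoder(), CascadedTaskDecoder()
lm = landmark_dec(det_out, enc)          # 级联:检测 -> 关键点
expr = expr_dec(lm, enc)                 #        关键点 -> 表情
print(expr.shape)
```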

[CV-102] HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis

【速读】: 该论文旨在解决手写文档识别(Handwritten Document Recognition, HDR)中的两大核心挑战:手写文本识别和布局分析。传统方法通常将这两个任务分开处理,难以有效整合。为此,论文提出了一种名为HAND(Hierarchical Attention Network for Multi-Scale Document)的新型端到端、无需分割的架构,能够同时进行文本识别和布局分析。HAND的关键创新包括:1)采用集成了门控深度可分离卷积(Gated Depth-wise Separable Convolutions)和八度卷积(Octave Convolutions)的高级卷积编码器,以实现鲁棒的特征提取;2)引入多尺度自适应处理框架(Multi-Scale Adaptive Processing, MSAP),动态适应文档的复杂性;3)设计了具有记忆增强和稀疏注意力机制的分层注意力解码器。此外,HAND通过五个复杂度级别的课程学习(Curriculum Learning)进行训练,并集成了领域自适应预训练的mT5模型进行后处理优化,以提升复杂古籍手稿的识别精度。实验结果表明,HAND在READ 2016数据集上显著优于现有方法,并在模型参数仅为5.60M的情况下,实现了文本识别和布局分析的新基准。

链接: https://arxiv.org/abs/2412.18981
作者: Mohammed Hamdan,Abderrahmane Rahiche,Mohamed Cheriet
机构: 未知
关键词: handwritten text recognition, Handwritten document recognition, complex layouts inherent, layout analysis, text recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Handwritten document recognition (HDR) is one of the most challenging tasks in the field of computer vision, due to the various writing styles and complex layouts inherent in handwritten texts. Traditionally, this problem has been approached as two separate tasks, handwritten text recognition and layout analysis, and struggled to integrate the two processes effectively. This paper introduces HAND (Hierarchical Attention Network for Multi-Scale Document), a novel end-to-end and segmentation-free architecture for simultaneous text recognition and layout analysis tasks. Our model’s key components include an advanced convolutional encoder integrating Gated Depth-wise Separable and Octave Convolutions for robust feature extraction, a Multi-Scale Adaptive Processing (MSAP) framework that dynamically adjusts to document complexity, and a hierarchical attention decoder with memory-augmented and sparse attention mechanisms. These components enable our model to scale effectively from single-line to triple-column pages while maintaining computational efficiency. Additionally, HAND adopts curriculum learning across five complexity levels. To improve the recognition accuracy of complex ancient manuscripts, we fine-tune and integrate a Domain-Adaptive Pre-trained mT5 model for post-processing refinement. Extensive evaluations on the READ 2016 dataset demonstrate the superior performance of HAND, achieving up to 59.8% reduction in CER for line-level recognition and 31.2% for page-level recognition compared to state-of-the-art methods. The model also maintains a compact size of 5.60M parameters while establishing new benchmarks in both text recognition and layout analysis. Source code and pre-trained models are available at: this https URL.
zh

[CV-103] CGCOD: Class-Guided Camouflaged Object Detection

【速读】: 该论文旨在解决伪装目标检测(Camouflaged Object Detection, COD)中由于伪装目标与背景高度相似而导致语义线索模糊或丢失的问题。现有方法主要依赖视觉特征,但在多变的伪装环境中,这些特征不够稳定,容易导致误检和漏检,从而影响分割结果的完整性和准确性。为解决这一问题,论文提出了一种新任务——类别引导的伪装目标检测(Class-Guided Camouflaged Object Detection, CG-COD),通过引入目标类别知识,显著提升了模型在复杂环境中的鲁棒性和分割精度。解决方案的关键在于构建了一个包含真实场景中伪装目标及其对应类别标注的数据集 CamoClass,并提出了一个多阶段框架 CGNet。该框架包括一个即插即用的类别提示生成器和一个类别引导的检测器,通过文本信息的引导实现高效分割。此外,论文首次在现有 COD 基准数据集上扩展了目标类别标注,并引入了一个灵活的框架,以在文本引导下提升现有 COD 模型的性能。

链接: https://arxiv.org/abs/2412.18977
作者: Chenxi Zhang,Qing Zhang,Jiayun Wu,Youwei Pang
机构: 未知
关键词: Camouflaged Object Detection, COD, Object Detection, Existing COD, designed to identify
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Camouflaged Object Detection (COD) is designed to identify objects that blend seamlessly with their surroundings. Due to the complexity of camouflaged objects (such as shape, color, and texture), their semantic cues are often blurred or completely lost, posing a significant challenge for COD. Existing COD methods often rely on visual features, which are not stable enough in changeable camouflage environments. This instability leads to false positives and false negatives, resulting in incomplete or inaccurate segmentation results. In this paper, to solve this problem, we propose a new task, Class-Guided Camouflaged Object Detection (CG-COD), which extends the traditional COD task by introducing object class knowledge, significantly improving the robustness and segmentation accuracy of the model in complex environments. Toward this end, we construct a dataset, CamoClass, containing the camouflaged objects in the real scenes and their corresponding class annotation. Based on this, we propose a multi-stage framework CGNet which consists of a plug-and-play class prompt generator and a class-guided detector. Under the guidance of textual information, CGNet enables efficient segmentation. It is worth emphasizing that for the first time, we extend the object class annotations on existing COD benchmark datasets, and introduce a flexible framework to improve the performance of the existing COD model under text guidance.
zh

[CV-104] ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement

【速读】: 该论文旨在解决文本到视频(Text-to-video, T2V)生成模型在训练成本高、生成性能有限,尤其是在计算资源受限情况下的问题。论文提出了一种名为ModelGrow的持续通用预训练方法,通过基于预训练基础模型逐步扩展模型能力,类似于人类基于已有经验获取新知识的方式。解决方案的关键在于两个方面:首先,通过引入多种新技术来扩展模型容量,使其能够存储新知识并提升生成性能;其次,利用大语言模型(large language models)作为高级文本编码器,将其集成到T2V模型中,以增强语言理解能力,并根据详细提示引导生成结果,从而实现更好的语义对齐,特别是在应对复杂用户提示时。实验结果表明,该方法在多个指标上均表现出显著的有效性。

链接: https://arxiv.org/abs/2412.18966
作者: Zhefan Rao,Liya Ji,Yazhou Xing,Runtao Liu,Zhaoyang Liu,Jiaxin Xie,Ziqiao Peng,Yingqing He,Qifeng Chen
机构: 未知
关键词: significant attention recently, gained significant attention, attention recently, gained significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages

点击查看摘要

Abstract:Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to “grow” its abilities based on a pre-trained foundation, analogous to how humans acquire new knowledge based on past experiences. There is a lack of extensive study of the continual pre-training techniques in T2V generation. In this work, we take the initial step toward exploring this task systematically and propose ModelGrow. Specifically, we break this task into two key aspects: increasing model capacity and improving semantic understanding. For model capacity, we introduce several novel techniques to expand the model size, enabling it to store new knowledge and improve generation performance. For semantic understanding, we propose a method that leverages large language models as advanced text encoders, integrating them into T2V models to enhance language comprehension and guide generation results according to detailed prompts. This approach enables the model to achieve better semantic alignment, particularly in response to complex user prompts. Extensive experiments demonstrate the effectiveness of our method across various metrics. The source code and the model of ModelGrow will be publicly available.
zh

[CV-105] TopoBDA: Towards Bezier Deformable Attention for Road Topology Understanding

【速读】: 该论文旨在解决自动驾驶中道路拓扑结构理解的关键问题,特别是针对细长且复杂的车道中心线检测与表示。解决方案的核心是提出了TopoBDA(Topology with Bezier Deformable Attention)方法,该方法通过引入贝塞尔可变形注意力机制(Bezier Deformable Attention, BDA),利用贝塞尔控制点驱动可变形注意力机制,显著提升了对细长多段线结构(如车道中心线)的检测与表示能力。TopoBDA通过处理多摄像头360度图像生成鸟瞰图(Bird’s Eye View, BEV)特征,并采用基于BDA的Transformer解码器进行特征优化,从而在保持高精度的同时提升计算效率。此外,TopoBDA还引入了实例掩码公式化和一对多集合预测损失策略,进一步优化了中心线检测和道路拓扑理解。实验结果表明,TopoBDA在OpenLane-V2数据集上表现优异,达到了车道中心线检测和拓扑推理的先进水平。多模态数据(如激光雷达和雷达)的集成进一步增强了模型性能,凸显了其在自动驾驶应用中的重要性。

链接: https://arxiv.org/abs/2412.18951
作者: Muhammet Esat Kalfaoglu,Halil Ibrahim Ozturk,Ozsel Kilinc,Alptekin Temizel
机构: 未知
关键词: Bezier Deformable Attention, Deformable Attention, Bezier Deformable, leveraging Bezier Deformable, deformable attention mechanism
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted for consideration in the ACM Transactions on Intelligent Systems and Technology (TIST) Special Issue on Transformers

点击查看摘要

Abstract:Understanding road topology is crucial for autonomous driving. This paper introduces TopoBDA (Topology with Bezier Deformable Attention), a novel approach that enhances road topology understanding by leveraging Bezier Deformable Attention (BDA). BDA utilizes Bezier control points to drive the deformable attention mechanism, significantly improving the detection and representation of elongated and thin polyline structures, such as lane centerlines. TopoBDA processes multi-camera 360-degree imagery to generate Bird’s Eye View (BEV) features, which are refined through a transformer decoder employing BDA. This method enhances computational efficiency while maintaining high accuracy in centerline prediction. Additionally, TopoBDA incorporates an instance mask formulation and an auxiliary one-to-many set prediction loss strategy to further refine centerline detection and improve road topology understanding. Experimental evaluations on the OpenLane-V2 dataset demonstrate that TopoBDA outperforms existing methods, achieving state-of-the-art results in centerline detection and topology reasoning. The integration of multi-modal data, including lidar and radar, specifically for road topology understanding, further enhances the model’s performance, underscoring its importance in autonomous driving applications.
zh
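
"用贝塞尔控制点驱动可变形注意力的采样"可以拆成两步来理解:先由控制点生成曲线上的采样点,再在 BEV 特征图上对这些点做双线性采样。下面是笔者的一个示意实现(控制点个数、查询数与特征尺寸均为假设):

```python
import torch

def bezier_points(ctrl, n_samples=16):
    """由贝塞尔控制点生成曲线上均匀参数化的采样点。
    ctrl: [Q, K, 2](Q 条中心线查询,K 个控制点)。"""
    Q, K, _ = ctrl.shape
    t = torch.linspace(0, 1, n_samples, device=ctrl.device)               # [S]
    k = torch.arange(K, device=ctrl.device).float()
    binom = torch.exp(torch.lgamma(torch.tensor(K - 1.0) + 1)
                      - torch.lgamma(k + 1) - torch.lgamma(K - 1.0 - k + 1))
    basis = binom * t[:, None] ** k * (1 - t[:, None]) ** (K - 1 - k)     # [S, K] 伯恩斯坦基
    return torch.einsum("sk,qkc->qsc", basis, ctrl)                       # [Q, S, 2]

def sample_bev_features(bev, pts):
    """在 BEV 特征图上对曲线采样点做双线性采样(可变形注意力的采样一步)。
    bev: [C, H, W];pts: [Q, S, 2],坐标已归一化到 [-1, 1]。"""
    grid = pts.unsqueeze(0)                                               # [1, Q, S, 2]
    feat = torch.nn.functional.grid_sample(bev.unsqueeze(0), grid, align_corners=True)
    return feat.squeeze(0).permute(1, 2, 0)                               # [Q, S, C]

ctrl = torch.rand(4, 4, 2) * 2 - 1          # 4 条查询、4 个控制点(三次贝塞尔)
bev = torch.randn(64, 200, 200)
pts = bezier_points(ctrl)
print(sample_bev_features(bev, pts).shape)  # torch.Size([4, 16, 64])
```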

[CV-106] Single Trajectory Distillation for Accelerating Image and Video Style Transfer

【速读】: 该论文旨在解决基于扩散(Diffusion-based)的图像和视频风格化方法中存在的计算效率低下的问题。现有的多步扩散过程计算成本高,限制了其在实际应用中的广泛使用。论文提出的解决方案是通过轨迹蒸馏(trajectory distillation)来加速这一过程,但现有的方法仅强制学生模型和不完美教师模型的概率流常微分方程(PF-ODE)轨迹在初始步骤对齐,无法确保整个轨迹的一致性。为此,论文提出了单轨迹蒸馏(Single Trajectory Distillation, STD)方法,从特定的部分噪声状态开始,引入轨迹库(trajectory bank)来存储教师模型的轨迹状态,从而减少训练时间成本。此外,论文还采用非对称对抗损失(asymmetric adversarial loss)来提升生成图像的风格和质量。实验结果表明,该方法在风格相似性和美学评估方面优于现有的加速模型。

链接: https://arxiv.org/abs/2412.18945
作者: Sijie Xu,Runqi Wang,Wei Zhu,Dejia Song,Nemo Chen,Xu Tang,Yao Hu
机构: 未知
关键词: methods typically denoise, Diffusion-based stylization methods, Diffusion-based stylization, typically denoise, specific partial noise
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based stylization methods typically denoise from a specific partial noise state for image-to-image and video-to-video tasks. This multi-step diffusion process is computationally expensive and hinders real-world application. A promising solution to speed up the process is to obtain few-step consistency models through trajectory distillation. However, current consistency models only force the initial-step alignment between the probability flow ODE (PF-ODE) trajectories of the student and the imperfect teacher models. This training strategy can not ensure the consistency of whole trajectories. To address this issue, we propose single trajectory distillation (STD) starting from a specific partial noise state. We introduce a trajectory bank to store the teacher model’s trajectory states, mitigating the time cost during training. Besides, we use an asymmetric adversarial loss to enhance the style and quality of the generated images. Extensive experiments on image and video stylization demonstrate that our method surpasses existing acceleration models in terms of style similarity and aesthetic evaluations. Our code and results will be available on the project page: this https URL.
zh
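
其中"轨迹库"的作用相当于一个以(样本、时间步)为键的缓存,避免训练时反复展开教师轨迹。下面是一个极简的示意(fake_teacher_step 为占位函数,容量与淘汰策略均为假设):

```python
import torch

class TrajectoryBank:
    """缓存教师模型在特定部分噪声轨迹上的中间状态,供蒸馏时直接取用(示意)。"""
    def __init__(self, max_items=1024):
        self.bank, self.max_items = {}, max_items

    def get_or_compute(self, key, compute_fn):
        if key not in self.bank:
            if len(self.bank) >= self.max_items:          # 简单的先进先出淘汰
                self.bank.pop(next(iter(self.bank)))
            self.bank[key] = compute_fn().detach().cpu()
        return self.bank[key]

def fake_teacher_step(x, t):
    return x - 0.1 * t * x                                 # 占位:真实实现应为教师去噪一步

bank = TrajectoryBank()
x = torch.randn(1, 4, 8, 8)
for sample_id in [0, 0, 1]:                                # 第二次访问 sample 0 直接命中缓存
    state = bank.get_or_compute((sample_id, 5), lambda: fake_teacher_step(x, 5))
print(len(bank.bank), state.shape)
```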

[CV-107] TINQ: Temporal Inconsistency Guided Blind Video Quality Assessment

【速读】: 该论文旨在解决用户生成内容(UGC)视频和超分辨率(SR)视频中的盲视频质量评估(BVQA)问题。当前BVQA方法通常通过运动信息的统计量来建模UGC视频中的时间关系,但忽略了时间不一致性(temporal inconsistency)的影响。特别是在SR视频中,由于上采样算法的放大效应,时间不一致性更为显著。论文提出了一种基于时间不一致性引导的盲视频质量评估方法(Temporal Inconsistency Guided Blind Video Quality Assessment, TINQ),其关键在于通过不同的方式计算UGC和SR视频中的时间不一致性,并利用空间模块在粗粒度和细粒度上突出连续帧之间的不一致区域。此外,时间模块通过两个阶段聚合时间特征:第一阶段使用视觉记忆容量块(visual memory capacity block)基于估计的复杂度自适应地分割时间维度,第二阶段则专注于选择关键特征。这两个阶段通过一致性感知融合单元(Consistency-aware Fusion Units)协同工作,以回归跨时间尺度的视频质量。实验结果表明,该方法在UGC和SR视频质量数据集上优于现有的最先进BVQA方法。

链接: https://arxiv.org/abs/2412.18933
作者: Yixiao Li,Xiaoyuan Yang,Weide Liu,Xin Jin,Xu Jia,Yukun Lai,Haotao Liu,Paul L Rosin,Wei Zhou
机构: 未知
关键词: Blind video quality, UGC, video quality assessment, video quality, UGC videos
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Blind video quality assessment (BVQA) has been actively researched for user-generated content (UGC) videos. Recently, super-resolution (SR) techniques have been widely applied in UGC. Therefore, an effective BVQA method for both UGC and SR scenarios is essential. Temporal inconsistency, referring to irregularities between consecutive frames, is relevant to video quality. Current BVQA approaches typically model temporal relationships in UGC videos using statistics of motion information, but inconsistencies remain unexplored. Additionally, different from temporal inconsistency in UGC videos, such inconsistency in SR videos is amplified due to upscaling algorithms. In this paper, we introduce the Temporal Inconsistency Guided Blind Video Quality Assessment (TINQ) metric, demonstrating that exploring temporal inconsistency is crucial for effective BVQA. Since temporal inconsistencies vary between UGC and SR videos, they are calculated in different ways. Based on this, a spatial module highlights inconsistent areas across consecutive frames at coarse and fine granularities. In addition, a temporal module aggregates features over time in two stages. The first stage employs a visual memory capacity block to adaptively segment the time dimension based on estimated complexity, while the second stage focuses on selecting key features. The stages work together through Consistency-aware Fusion Units to regress cross-time-scale video quality. Extensive experiments on UGC and SR video quality datasets show that our method outperforms existing state-of-the-art BVQA methods. Code is available at this https URL.
zh
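
时间不一致性最朴素的度量就是相邻帧差分。下面的示意代码(帧数据为人工构造)给出细粒度与粗粒度两种不一致图,对应文中"在粗、细两种粒度上突出不一致区域"的做法:

```python
import numpy as np

def temporal_inconsistency(frames, pool=8):
    """以相邻帧差分衡量时间不一致性,返回细粒度与粗粒度两种不一致图。
    frames: [T, H, W],灰度,取值 [0, 1]。"""
    diff = np.abs(np.diff(frames, axis=0))                 # [T-1, H, W] 细粒度
    T1, H, W = diff.shape
    coarse = diff[:, : H // pool * pool, : W // pool * pool]
    coarse = coarse.reshape(T1, H // pool, pool, W // pool, pool).mean(axis=(2, 4))
    return diff, coarse

base = np.tile(np.linspace(0, 1, 64, dtype=np.float32), (64, 1))
frames = np.stack([np.clip(base + 0.01 * i, 0, 1) for i in range(8)])
frames[4] = np.clip(frames[4] + 0.4, 0, 1)                 # 人为制造一次时间跳变
fine, coarse = temporal_inconsistency(frames)
print(fine.shape, coarse.shape)
print("各帧间不一致性:", np.round(fine.mean(axis=(1, 2)), 3))  # 第 3→4、4→5 帧间明显偏高
```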

[CV-108] Graph Cut-guided Maximal Coding Rate Reduction for Learning Image Embedding and Clustering ACCV2024

【速读】: 该论文旨在解决图像聚类任务中特征提取与聚类分离导致的次优性能问题。传统方法通常将这两个阶段视为独立过程或采用不同的学习范式,导致聚类效果不佳。为此,论文提出了一种统一框架,称为图割引导的最大编码率缩减(CgMCR^2),用于联合学习结构化嵌入和聚类。该框架的关键在于将高效的聚类模块集成到学习结构化表示的原则性框架中,利用聚类模块提供的分区信息指导簇级压缩,同时通过学习到的嵌入与期望的几何结构对齐,从而生成更准确的分区。通过在多标准图像数据集和域外数据集上的广泛实验,验证了该方法的有效性。

链接: https://arxiv.org/abs/2412.18930
作者: W. He,Z. Huang,X. Meng,X. Qi,R. Xiao,C.-G. Li
机构: 未知
关键词: pre-trained vision models, pre-trained models, vision models, pre-trained features, image clustering task
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 9 figures, accepted in ACCV2024

点击查看摘要

Abstract:In the era of pre-trained models, image clustering task is usually addressed by two relevant stages: a) to produce features from pre-trained vision models; and b) to find clusters from the pre-trained features. However, these two stages are often considered separately or learned by different paradigms, leading to suboptimal clustering performance. In this paper, we propose a unified framework, termed graph Cut-guided Maximal Coding Rate Reduction (CgMCR^2), for jointly learning the structured embeddings and the clustering. To be specific, we attempt to integrate an efficient clustering module into the principled framework for learning structured representation, in which the clustering module is used to provide partition information to guide the cluster-wise compression, and the learned embeddings are aligned to the desired geometric structures in turn to help yield more accurate partitions. We conduct extensive experiments on both standard and out-of-domain image datasets and the experimental results validate the effectiveness of our approach.
zh
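
最大编码率缩减(MCR^2)的目标可以写成 ΔR = R(Z) − Σ_j (n_j/n)·R(Z_j):整体编码率尽量大、各簇内部尽量可压缩。下面用一个小例子示意该目标的计算(嵌入与分簇均为随机构造,仅用于对比"随机嵌入"与"结构化嵌入"的 ΔR 差异,并非论文实现):

```python
import torch

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 logdet(I + d/(n·eps^2) · Z Zᵀ),Z: [d, n],列为样本(建议先归一化)。"""
    d, n = Z.shape
    I = torch.eye(d, device=Z.device)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z @ Z.T)

def coding_rate_reduction(Z, labels, eps=0.5):
    """ΔR = R(Z) - Σ_j (n_j/n)·R(Z_j):整体扩张、簇内压缩。"""
    n = Z.shape[1]
    Rc = 0.0
    for c in labels.unique():
        Zc = Z[:, labels == c]
        Rc = Rc + (Zc.shape[1] / n) * coding_rate(Zc, eps)
    return coding_rate(Z, eps) - Rc

torch.manual_seed(0)
d, n = 32, 200
labels = torch.randint(0, 4, (n,))
Z = torch.nn.functional.normalize(torch.randn(d, n), dim=0)      # 随机嵌入
print("随机嵌入 ΔR:", coding_rate_reduction(Z, labels).item())

# 结构化嵌入:每个簇落在各自的低维子空间上,ΔR 应明显升高
basis = [torch.linalg.qr(torch.randn(d, 4))[0] for _ in range(4)]
Zs = torch.stack([basis[int(c)] @ torch.randn(4) for c in labels], dim=1)
Zs = torch.nn.functional.normalize(Zs, dim=0)
print("结构化嵌入 ΔR:", coding_rate_reduction(Zs, labels).item())
```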

[CV-109] UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation

【速读】: 该论文旨在解决当前文本到图像生成模型在仅使用文本提示时难以精确控制像素级布局、物体外观和全局风格的问题。为解决这一问题,论文提出了一种统一可控生成的新方法,即基于多模态扩散变换器架构的统一图像指令适配器(UNIC-Adapter)。该适配器的关键创新在于能够通过结合条件图像和任务指令,提取多模态指令信息,并通过增强的交叉注意力机制(Rotary Position Embedding)将这些信息注入图像生成过程中,从而在单一框架内实现跨多种条件的灵活可控生成,避免了为不同类型参考输入设计专门模型的需求。

链接: https://arxiv.org/abs/2412.18928
作者: Lunhao Duan,Shanshan Zhao,Wenjun Yan,Yinglun Li,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Mingming Gong,Gui-Song Xia
机构: 未知
关键词: achieved remarkable advancements, diffusion models facilitating, models facilitating high-quality, facilitating high-quality image, remarkable advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the need for multiple specialized models. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding. Experimental results across a variety of tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based image synthesis, demonstrate the effectiveness of our UNIC-Adapter in unified controllable image generation.
zh

[CV-110] Generative Face Parsing Map Guided 3D Face Reconstruction Under Occluded Scenes

【速读】: 该论文旨在解决单视角3D人脸重建(single-view 3D face reconstruction)在遮挡场景下的挑战。现有方法在输入无遮挡时能够生成高质量的3D模型,但在遮挡情况下表现不佳。论文提出了一种新的系统,通过解析面部特征(parsing facial features)生成完整的面部分割图(face parsing map),并估计遮挡区域的合理2D面部结构,从而用于3D重建。该方法的关键在于确保输出结果的真实性,特别是眼睛、鼻子和嘴之间的拓扑结构(topological structure)的准确性。通过大量实验,论文展示了该方法在遮挡场景下相较于现有方法的优越性,显著提升了3D人脸重建的质量和鲁棒性。

链接: https://arxiv.org/abs/2412.18920
作者: Dapeng Zhao,Yue Qi
机构: 未知
关键词: face reconstruction, face reconstruction methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Over the past few years, single-view 3D face reconstruction methods can produce beautiful 3D models. Nevertheless, the input of these works is unobstructed face images. We describe a system designed to reconstruct convincing face texture in the case of occlusion. By parsing facial features, we propose a complete face parsing map generation method, and we estimate the 2D face structure at the plausible position of the occluded area, which is then used for the 3D reconstruction. An excellent anti-occlusion face reconstruction method should ensure the authenticity of the output, including the topological structure between the eyes, nose, and mouth. We extensively tested our method and its components, qualitatively demonstrating the rationality of our estimated facial structure. We conduct extensive experiments on general 3D face reconstruction tasks as concrete examples to demonstrate the method's superior regulation ability over existing methods, which often break down under occlusion. We further provide numerous quantitative examples showing that our method advances both the quality and the robustness of 3D face reconstruction under occluded scenes.
zh

[CV-111] An Attentive Dual-Encoder Framework Leveraging Multimodal Visual and Semantic Information for Automatic OSAHS Diagnosis ICASSP2025

【速读】: 该论文旨在解决阻塞性睡眠呼吸暂停低通气综合征(OSAHS)诊断中传统多导睡眠图(PSG)方法昂贵、耗时且不适的问题,以及现有基于面部图像分析的深度学习方法因面部特征捕捉不足和样本量有限而导致的准确性不足问题。为此,作者提出了一种多模态双编码器模型,该模型通过整合视觉和语言输入来实现自动化OSAHS诊断。解决方案的关键在于:1)使用随机过采样器(randomOverSampler)平衡数据;2)通过注意力网格提取关键面部特征;3)将生理数据转化为有意义的文本;4)利用交叉注意力机制结合图像和文本数据进行更有效的特征提取;5)采用有序回归损失确保学习过程的稳定性。该模型在四类严重程度分类任务中达到了91.3%的Top-1准确率,展示了当前最先进的性能。

链接: https://arxiv.org/abs/2412.18919
作者: Yingchen Wei,Xihe Qiu,Xiaoyu Tan,Jingjing Huang,Wei Chu,Yinghui Xu,Yuan Qi
机构: 未知
关键词: Obstructive sleep apnea-hypopnea, sleep apnea-hypopnea syndrome, common sleep disorder, sleep disorder caused, upper airway blockage
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 2 figures, Published as a conference paper at ICASSP 2025

点击查看摘要

Abstract:Obstructive sleep apnea-hypopnea syndrome (OSAHS) is a common sleep disorder caused by upper airway blockage, leading to oxygen deprivation and disrupted sleep. Traditional diagnosis using polysomnography (PSG) is expensive, time-consuming, and uncomfortable. Existing deep learning methods using facial image analysis lack accuracy due to poor facial feature capture and limited sample sizes. To address this, we propose a multimodal dual encoder model that integrates visual and language inputs for automated OSAHS diagnosis. The model balances data using randomOverSampler, extracts key facial features with attention grids, and converts physiological data into meaningful text. Cross-attention combines image and text data for better feature extraction, and ordered regression loss ensures stable learning. Our approach improves diagnostic efficiency and accuracy, achieving 91.3% top-1 accuracy in a four-class severity classification task, demonstrating state-of-the-art performance. Code will be released upon acceptance.
zh
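
文中的有序回归损失可以按"把严重程度 ≥ k 拆成 K−1 个二分类"的思路来理解,下面是笔者的一个示意实现(类别数与解码阈值均为假设,并非论文采用的具体形式):

```python
import torch
import torch.nn.functional as F

def ordinal_regression_loss(logits, target, num_classes=4):
    """有序回归损失(示意):把"严重程度 >= k"拆成 K-1 个二分类问题。
    logits: [B, K-1];target: [B],取值 0..K-1。"""
    # 构造级联标签:target=2, K=4 -> [1, 1, 0]
    levels = (target.unsqueeze(1) > torch.arange(num_classes - 1, device=target.device)).float()
    return F.binary_cross_entropy_with_logits(logits, levels)

def decode_severity(logits, thresh=0.0):
    """预测等级 = 超过阈值的"累计二分类"个数。"""
    return (logits > thresh).sum(dim=1)

torch.manual_seed(0)
logits = torch.randn(8, 3)
target = torch.randint(0, 4, (8,))
print(ordinal_regression_loss(logits, target).item(), decode_severity(logits))
```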

[CV-112] BCR-Net: Boundary-Category Refinement Network for Weakly Semi-Supervised X-Ray Prohibited Item Detection with Points

【速读】: 该论文旨在解决X射线图像中违禁物品自动检测的难题,特别是在标注成本与检测性能之间寻求平衡。现有方法要么依赖昂贵的边界框标注以实现高性能,要么使用弱标注但精度有限。为此,论文提出了基于点标注的弱半监督X射线违禁物品检测方法(WSSPID-P),并设计了一种新颖的边界-类别优化网络(BCR-Net)。BCR-Net的关键在于其引入了边界优化模块(BR)和类别优化模块(CR)。BR模块通过双注意力机制聚焦于违禁物品的边界和显著特征,而CR模块则在RPN和ROI头部引入尺度与旋转感知的对比损失,增强特征空间中的类内一致性和类间可分性。通过这些设计,BCR-Net有效解决了定位不精确和分类不准确的问题,在有限标注条件下显著提升了检测性能。

链接: https://arxiv.org/abs/2412.18918
作者: Sanjoeng Wong
机构: 未知
关键词: Automatic prohibited item, prohibited item detection, Automatic prohibited, X-ray Prohibited Item, item detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatic prohibited item detection in X-ray images is crucial for public safety. However, most existing detection methods either rely on expensive box annotations to achieve high performance or use weak annotations but suffer from limited accuracy. To balance annotation cost and detection performance, we study Weakly Semi-Supervised X-ray Prohibited Item Detection with Points (WSSPID-P) and propose a novel Boundary-Category Refinement Network (BCR-Net) that requires only a few box annotations and a large number of point annotations. BCR-Net is built based on Group R-CNN and introduces a new Boundary Refinement (BR) module and a new Category Refinement (CR) module. The BR module develops a dual attention mechanism to focus on both the boundaries and salient features of prohibited items. Meanwhile, the CR module incorporates contrastive branches into the heads of RPN and ROI by introducing a scale- and rotation-aware contrastive loss, enhancing intra-class consistency and inter-class separability in the feature space. Based on the above designs, BCR-Net effectively addresses the closely related problems of imprecise localization and inaccurate classification. Experimental results on public X-ray datasets show the effectiveness of BCR-Net, achieving significant performance improvements to state-of-the-art methods under limited annotations.
zh

[CV-113] Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model ICIP2024

【速读】: 该论文致力于解决开放词汇全景分割(open-vocabulary panoptic segmentation)这一具有挑战性的问题,其核心难点在于如何利用有限的分类训练数据使模型能够泛化到无限数量的类别。为解决这一问题,论文提出了OMTSeg方法,其关键创新在于采用了另一种大规模视觉-语言预训练模型BEiT-3,并充分利用了BEiT-3中视觉与语言特征之间的跨模态注意力机制(cross-modal attention),从而提升了模型性能。实验结果表明,OMTSeg在性能上优于当前最先进的模型。

链接: https://arxiv.org/abs/2412.18917
作者: Yi-Chia Chen,Wei-Hua Li,Chu-Song Chen
机构: 未知
关键词: panoptic segmentation remains, challenging problem, remains a challenging, Open-vocabulary panoptic segmentation, Open-vocabulary panoptic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICIP 2024

点击查看摘要

Abstract:Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods involve large-scale vision-language pre-trained foundation models, such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation using another large-scale vision-language pre-trained model called BEiT-3 and leveraging the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experiments result demonstrates that OMTSeg performs favorably against state-of-the-art models.
zh

[CV-114] Accelerating Diffusion Transformers with Dual Feature Caching

【速读】: 该论文旨在解决扩散变换器(Diffusion Transformers, DiT)在图像和视频生成中计算成本过高的问题。现有的特征缓存方法虽然能够通过缓存前一时间步的特征并在后续时间步中重用它们来加速计算,但存在生成质量下降或加速效果有限的权衡。论文通过定量研究缓存特征引入的累积误差,发现激进缓存(aggressive caching)在缓存步骤中并未显著增加误差,而保守缓存(conservative caching)能够有效修复激进缓存引入的误差。基于这一发现,论文提出了一种双重缓存策略(dual caching strategy),交替采用激进缓存和保守缓存,从而在显著加速的同时保持高生成质量。此外,论文还引入了一种与闪存注意力(flash attention)兼容且无需训练和校准数据的V缓存策略(V-caching strategy),用于逐令牌的保守缓存。

链接: https://arxiv.org/abs/2412.18911
作者: Chang Zou,Evelyn Zhang,Runlin Guo,Haohang Xu,Conghui He,Xuming Hu,Linfeng Zhang
机构: 未知
关键词: Diffusion Transformers, substantial computational costs, suffer substantial computational, generation quality, computational costs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiT) have become the dominant methods in image and video generation yet still suffer substantial computational costs. As an effective approach for DiT acceleration, feature caching methods are designed to cache the features of DiT in previous timesteps and reuse them in the next timesteps, allowing us to skip the computation in the next timesteps. However, on the one hand, aggressively reusing all the features cached in previous timesteps leads to a severe drop in generation quality. On the other hand, conservatively caching only the features in the redundant layers or tokens but still computing the important ones successfully preserves the generation quality but results in reductions in acceleration ratios. Observing such a tradeoff between generation quality and acceleration performance, this paper begins by quantitatively studying the accumulated error from cached features. Surprisingly, we find that aggressive caching does not introduce significantly more caching errors in the caching step, and the conservative feature caching can fix the error introduced by aggressive caching. Thereby, we propose a dual caching strategy that adopts aggressive and conservative caching iteratively, leading to significant acceleration and high generation quality at the same time. Besides, we further introduce a V-caching strategy for token-wise conservative caching, which is compatible with flash attention and requires no training and calibration data. Our code has been released on GitHub: this https URL
zh
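
"激进缓存与保守缓存交替"的调度可以简化为:每隔若干个时间步完整重算一次并刷新缓存,其余时间步直接复用缓存。下面是一个与具体 DiT 无关的示意(period、层结构均为笔者假设,真实实现还包含逐 token 的 V-caching):

```python
import torch

class DualFeatureCache:
    """对扩散 Transformer 的某一层交替使用激进缓存与保守缓存(示意)。
    激进步:整层复用上一步缓存;保守步:重算并刷新缓存以修正累积误差。"""
    def __init__(self, layer, period=3):
        self.layer, self.period, self.cache, self.step = layer, period, None, 0

    def __call__(self, x):
        aggressive = self.cache is not None and (self.step % self.period != 0)
        if aggressive:
            out = self.cache                       # 激进:直接复用
        else:
            out = self.layer(x)                    # 保守:重算并刷新缓存
            self.cache = out.detach()
        self.step += 1
        return out

layer = torch.nn.Sequential(torch.nn.LayerNorm(64), torch.nn.Linear(64, 64))
cached_layer = DualFeatureCache(layer, period=3)
x = torch.randn(1, 16, 64)
for t in range(6):                                 # 6 个去噪时间步:算、省、省、算、省、省
    y = cached_layer(x + 0.01 * t)
print(y.shape)
```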

[CV-115] EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation

【速读】: 该论文旨在解决从高维观测中学习多物体操纵任务时面临的挑战,特别是在未见过的物体配置中实现组合泛化(compositional generalization)的问题。传统方法依赖于大规模离线数据和像素观测进行模型训练,但在网络和数据集规模受限的情况下,难以有效处理多物体环境中的组合复杂性。为此,论文提出了一种基于行为克隆(Behavioral Cloning, BC)的新方法,其核心在于利用物体中心表示(object-centric representations)和实体中心Transformer(entity-centric Transformer),并结合扩散模型(diffusion models)进行优化。该方法首先将观测分解为物体中心表示,然后通过实体中心Transformer在物体级别计算注意力,同时预测物体动态和智能体的动作。扩散模型能够捕捉多模态行为分布,从而显著提升多物体任务中的性能,并实现组合泛化。最终,该方法展示了在训练中未见过的物体组合和目标任务中实现零样本泛化(zero-shot generalization)的能力,包括处理比训练时更多物体的任务。

链接: https://arxiv.org/abs/2412.18907
作者: Carl Qi,Dan Haramati,Tal Daniel,Aviv Tamar,Amy Zhang
机构: 未知
关键词: common component, component of everyday, presents significant challenges, significant challenges, high-dimensional observations presents
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Object manipulation is a common component of everyday tasks, but learning to manipulate objects from high-dimensional observations presents significant challenges. These challenges are heightened in multi-object environments due to the combinatorial complexity of the state space as well as of the desired behaviors. While recent approaches have utilized large-scale offline data to train models from pixel observations, achieving performance gains through scaling, these methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes. To address these issues, we propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer with diffusion-based optimization, enabling efficient learning from offline image data. Our method first decomposes observations into an object-centric representation, which is then processed by our entity-centric Transformer that computes attention at the object level, simultaneously predicting object dynamics and the agent’s actions. Combined with the ability of diffusion models to capture multi-modal behavior distributions, this results in substantial performance improvements in multi-object tasks and, more importantly, enables compositional generalization. We present BC agents capable of zero-shot generalization to tasks with novel compositions of objects and goals, including larger numbers of objects than seen during training. We provide video rollouts on our webpage: this https URL.
zh

[CV-116] HV-BEV: Decoupling Horizontal and Vertical Feature Sampling for Multi-View 3D Object Detection

【速读】: 该论文旨在解决基于视觉的多视角环境感知系统在自动驾驶技术中的应用问题,特别是基于鸟瞰图(BEV)的模型。当前最先进的解决方案主要通过显式或隐式深度预测将每个相机视角的图像特征编码到BEV空间中,但这些方法往往专注于提高将2D特征投影到相应深度区域的准确性,而忽略了现实世界物体的高度结构化信息以及不同场景中物体高度分布的多样性。为此,论文提出了HV-BEV方法,其关键创新在于将BEV网格查询范式中的特征采样解耦为水平特征聚合和垂直自适应高度感知参考点采样。具体而言,该方法在水平面上构建了一个可学习的图结构,用于3D参考点,以增强不同BEV网格中同一实例的关联性,特别是在实例跨越车辆周围多个图像视角时。此外,引入了一个高度感知模块,结合历史信息,使参考点能够自适应地关注不同场景中物体出现的高度变化。实验结果表明,该方法在nuScenes数据集上显著优于基线模型,最佳模型在nuScenes测试集上达到了50.5%的mAP和59.8%的NDS。

链接: https://arxiv.org/abs/2412.18884
作者: Di Wu,Feng Yang,Benlian Xu,Pan Liao,Wenhui Zhao,Dingwen Zhang
机构: 未知
关键词: autonomous driving technology, vision-based multi-view environmental, multi-view environmental perception, environmental perception system, driving technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:The application of vision-based multi-view environmental perception system has been increasingly recognized in autonomous driving technology, especially the BEV-based models. Current state-of-the-art solutions primarily encode image features from each camera view into the BEV space through explicit or implicit depth prediction. However, these methods often focus on improving the accuracy of projecting 2D features into corresponding depth regions, while overlooking the highly structured information of real-world objects and the varying height distributions of objects across different scenes. In this work, we propose HV-BEV, a novel approach that decouples feature sampling in the BEV grid queries paradigm into horizontal feature aggregation and vertical adaptive height-aware reference point sampling, aiming to improve both the aggregation of objects’ complete information and generalization to diverse road environments. Specifically, we construct a learnable graph structure in the horizontal plane aligned with the ground for 3D reference points, reinforcing the association of the same instance across different BEV grids, especially when the instance spans multiple image views around the vehicle. Additionally, instead of relying on uniform sampling within a fixed height range, we introduce a height-aware module that incorporates historical information, enabling the reference points to adaptively focus on the varying heights at which objects appear in different scenes. Extensive experiments validate the effectiveness of our proposed method, demonstrating its superior performance over the baseline across the nuScenes dataset. Moreover, our best-performing model achieves a remarkable 50.5% mAP and 59.8% NDS on the nuScenes testing set.
zh

[CV-117] MotionMap: Representing Multimodality in Human Pose Forecasting

【速读】: 该论文旨在解决人体姿态预测(Human Pose Forecasting)中的多模态(multimodality)问题。由于观察到的姿态序列可能存在多种未来状态,因此该任务本质上是多模态的。然而,评估多模态性具有挑战性,因为该任务本身是不适定的(ill-posed)。为此,论文首先提出了一种替代范式,使任务变得适定。其次,尽管现有最先进的方法能够预测多模态性,但需要大量预测样本进行过采样,这引发了两个关键问题:(1) 是否可以通过高效采样少量预测来捕捉多模态性?(2) 对于观察到的姿态序列,哪些预测的未来状态更有可能发生?论文通过提出MotionMap来解决这些问题,MotionMap是一种基于热图(heatmap)的简单而有效的多模态表示方法。MotionMap将热图扩展到表示所有可能运动的空间分布,其中不同的局部最大值对应于给定观察的不同预测。MotionMap能够捕捉每个观察的多种模态,并为不同模态提供置信度度量。此外,MotionMap还引入了对预测姿态序列的不确定性和可控性概念,并能够捕捉难以评估但对安全至关重要的罕见模态。论文通过在Human3.6M和AMASS等流行的3D人体姿态数据集上进行定性和定量实验,验证了所提出方法的优势和局限性。

链接: https://arxiv.org/abs/2412.18883
作者: Reyhaneh Hosseininejad,Megh Shukla,Saeed Saadatnejad,Mathieu Salzmann,Alexandre Alahi
机构: 未知
关键词: forecasting is inherently, inherently multimodal, Human pose forecasting, observed pose sequence, pose sequence
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: TLDR: We propose a new representation for learning multimodality in human pose forecasting which does not depend on generative models

点击查看摘要

Abstract:Human pose forecasting is inherently multimodal since multiple futures exist for an observed pose sequence. However, evaluating multimodality is challenging since the task is ill-posed. Therefore, we first propose an alternative paradigm to make the task well-posed. Next, while state-of-the-art methods predict multimodality, this requires oversampling a large volume of predictions. This raises key questions: (1) Can we capture multimodality by efficiently sampling a smaller number of predictions? (2) Subsequently, which of the predicted futures is more likely for an observed pose sequence? We address these questions with MotionMap, a simple yet effective heatmap based representation for multimodality. We extend heatmaps to represent a spatial distribution over the space of all possible motions, where different local maxima correspond to different forecasts for a given observation. MotionMap can capture a variable number of modes per observation and provide confidence measures for different modes. Further, MotionMap allows us to introduce the notion of uncertainty and controllability over the forecasted pose sequence. Finally, MotionMap captures rare modes that are non-trivial to evaluate yet critical for safety. We support our claims through multiple qualitative and quantitative experiments using popular 3D human pose datasets: Human3.6M and AMASS, highlighting the strengths and limitations of our proposed method. Project Page: this https URL
zh

[CV-118] IUST_PersonReId: A New Domain in Person Re-Identification Datasets

【速读】: 该论文旨在解决行人重识别(Person Re-identification, ReID)模型在跨文化环境中的泛化问题,特别是在伊斯兰地区如伊朗,由于当地普遍穿着保守服饰,现有模型表现不佳。现有数据集主要基于西方和东亚的服饰风格,限制了其在这些文化背景下的适用性。为解决这一问题,论文提出了IUST_PersonReId数据集,该数据集专门设计用于反映伊朗等新文化环境中行人重识别的独特挑战,涵盖了市场、校园和清真寺等多种场景,并强调保守服饰的影响。通过在IUST_PersonReId上对Solider和CLIP-ReID等先进模型进行实验,发现其性能相较于Market1501和MSMT17等基准数据集显著下降,凸显了遮挡和特征有限性带来的挑战。序列化评估表明,利用时间上下文信息可以提升性能,强调了该数据集在推动文化敏感且鲁棒的行人重识别系统发展中的潜力。IUST_PersonReId为全球行人重识别研究中的公平性和偏见问题提供了关键资源。

链接: https://arxiv.org/abs/2412.18874
作者: Alireza Sedighi Moghaddam,Fatemeh Anvari,Mohammadjavad Mirshekari Haghighi,Mohammadali Fakhari,Mohammad Reza Mohammadi
机构: 未知
关键词: Person re-identification, modest clothing styles, Islamic regions, East Asian fashion, styles are prevalent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures. The dataset introduced in this paper, IUST_PersonReId, is publicly available at this https URL

点击查看摘要

Abstract:Person re-identification (ReID) models often struggle to generalize across diverse cultural contexts, particularly in Islamic regions like Iran, where modest clothing styles are prevalent. Existing datasets predominantly feature Western and East Asian fashion, limiting their applicability in these settings. To address this gap, we introduce IUST_PersonReId, a dataset designed to reflect the unique challenges of ReID in new cultural environments, emphasizing modest attire and diverse scenarios from Iran, including markets, campuses, and mosques. Experiments on IUST_PersonReId with state-of-the-art models, such as Solider and CLIP-ReID, reveal significant performance drops compared to benchmarks like Market1501 and MSMT17, highlighting the challenges posed by occlusion and limited distinctive features. Sequence-based evaluations show improvements by leveraging temporal context, emphasizing the dataset’s potential for advancing culturally sensitive and robust ReID systems. IUST_PersonReId offers a critical resource for addressing fairness and bias in ReID research globally. The dataset is publicly available at this https URL.
zh

[CV-119] Cross-PCR: A Robust Cross-Source Point Cloud Registration Framework AAAI2025

【速读】: 该论文旨在解决跨源点云(cross-source point clouds)配准中的密度不一致性和分布差异问题。传统方法在处理跨源点云时,由于不同来源的点云数据在密度和分布上存在显著差异,导致配准效果不佳。为此,论文提出了一种密度鲁棒的特征提取与匹配方案。其关键解决方案包括:首先,引入密度鲁棒编码器(density-robust encoder)来提取对密度变化不敏感的特征;其次,采用“宽松生成,严格筛选”(loose generation, strict selection)的匹配流程,通过一对多策略生成初始对应关系,随后通过稀疏匹配和密集匹配严格筛选高质量对应关系,从而实现鲁棒的配准。该方法在跨源3DCSR数据集的Kinect-LiDAR场景中显著提升了特征匹配召回率和配准召回率,并在3DMatch数据集上取得了最佳性能,同时在不同下采样密度下保持了鲁棒性。

链接: https://arxiv.org/abs/2412.18873
作者: Guiyu Zhao,Zhentao Guo,Zewen Du,Hongbin Ma
机构: 未知
关键词: previous methods fail, distribution difference, cross-source point clouds, Due, density inconsistency
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Due to the density inconsistency and distribution difference between cross-source point clouds, previous methods fail in cross-source point cloud registration. We propose a density-robust feature extraction and matching scheme to achieve robust and accurate cross-source registration. To address the density inconsistency between cross-source data, we introduce a density-robust encoder for extracting density-robust features. To tackle the issue of challenging feature matching and few correct correspondences, we adopt a loose-to-strict matching pipeline with a "loose generation, strict selection" idea. Under it, we employ a one-to-many strategy to loosely generate initial correspondences. Subsequently, high-quality correspondences are strictly selected to achieve robust registration through sparse matching and dense matching. On the challenging Kinect-LiDAR scene in the cross-source 3DCSR dataset, our method improves feature matching recall by 63.5 percentage points (pp) and registration recall by 57.6 pp. It also achieves the best performance on 3DMatch, while maintaining robustness under diverse downsampling densities.
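下面用一个简化的示意性片段说明“宽松生成、严格筛选”的基本流程(仅为理解性草图,非论文实现;这里用随机特征与互为最近邻的筛选规则代替论文中的稀疏/稠密匹配,k 值等参数为假设):

```python
import numpy as np

def loose_then_strict(feat_src, feat_tgt, k=3):
    """Sketch only: loosely generate one-to-many matches by k-NN in feature
    space, then strictly keep only mutual nearest-neighbour pairs."""
    # pairwise feature distances, shape (N_src, N_tgt)
    d = np.linalg.norm(feat_src[:, None, :] - feat_tgt[None, :, :], axis=-1)
    loose = [(i, j) for i in range(len(feat_src)) for j in np.argsort(d[i])[:k]]
    nn_s2t = d.argmin(axis=1)   # best target for each source point
    nn_t2s = d.argmin(axis=0)   # best source for each target point
    strict = [(i, j) for i, j in loose if nn_s2t[i] == j and nn_t2s[j] == i]
    return loose, strict

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 32))
tgt = src + 0.05 * rng.normal(size=(50, 32))   # toy "cross-source" features
loose, strict = loose_then_strict(src, tgt)
print(len(loose), len(strict))   # many loose candidates, few strict survivors
```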
zh

[CV-120] TSceneJAL: Joint Active Learning of Traffic Scenes for 3D Object Detection

【速读】: 该论文旨在解决自动驾驶(Autonomous Driving, AD)数据集中存在的数据质量低、冗余度高的问题,这些问题影响了模型的性能和效率。为了解决这一问题,论文提出了一种交通场景联合主动学习(Traffic Scene Joint Active Learning, TSceneJAL)框架,该框架能够从已标注和未标注数据中高效地采样出平衡、多样且复杂的交通场景。该框架的关键创新点包括:1)基于类别熵(category entropy)的场景采样方案,用于识别包含多类对象的场景,从而缓解主动学习中的类别不平衡问题;2)通过有向图表示和边缘化核算法(marginalize kernel algorithm)估计的相似性采样方案,用于选择稀疏且多样的场景;3)基于混合密度网络(mixture density network)预测的不确定性采样方案,用于选择回归结果最不明确或最复杂的实例。最终,通过将这三个方案整合到一个联合选择策略中,生成一个最优且有价值的子数据集。实验结果表明,该方法在3D目标检测任务中优于现有最先进方法,性能提升高达12%。

链接: https://arxiv.org/abs/2412.18870
作者: Chenyang Lei,Meiying Zhang,Weiyuan Peng,Qi Hao,Chengzhong Xu,Chunlin Ji,Guang Zhou
机构: 未知
关键词: incur substantial costs, datasets incur substantial, redundant data instances, autonomous driving, collection and labeling
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most autonomous driving (AD) datasets incur substantial costs for collection and labeling, inevitably yielding a plethora of low-quality and redundant data instances, thereby compromising performance and efficiency. Many applications in AD systems necessitate high-quality training datasets using both existing datasets and newly collected data. In this paper, we propose a traffic scene joint active learning (TSceneJAL) framework that can efficiently sample the balanced, diverse, and complex traffic scenes from both labeled and unlabeled data. The novelty of this framework is threefold: 1) a scene sampling scheme based on a category entropy, to identify scenes containing multiple object classes, thus mitigating class imbalance for the active learner; 2) a similarity sampling scheme, estimated through the directed graph representation and a marginalize kernel algorithm, to pick sparse and diverse scenes; 3) an uncertainty sampling scheme, predicted by a mixture density network, to select instances with the most unclear or complex regression outcomes for the learner. Finally, the integration of these three schemes in a joint selection strategy yields an optimal and valuable subdataset. Experiments on the KITTI, Lyft, nuScenes and SUScape datasets demonstrate that our approach outperforms existing state-of-the-art methods on 3D object detection tasks with up to 12% improvements.
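其中“基于类别熵的场景采样”可以用一个很小的示例说明(示意性草图,非论文实现;场景与类别计数均为虚构):对每个场景统计各类目标的数量并计算香农熵,熵越高说明类别分布越均衡、越有助于缓解类别不平衡。

```python
import numpy as np

def category_entropy(class_counts):
    """Shannon entropy of the object-class distribution inside one scene."""
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

# hypothetical scenes with per-class object counts, e.g. [car, pedestrian, cyclist]
scenes = {"a": [10, 0, 0], "b": [4, 3, 3], "c": [1, 1, 8]}
ranked = sorted(scenes, key=lambda s: category_entropy(scenes[s]), reverse=True)
print(ranked)   # scenes with more balanced class mixtures come first
```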
zh

[CV-121] WeatherGS: 3D Scene Reconstruction in Adverse Weather Conditions via Gaussian Splatting

【速读】: 该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在复杂户外环境,尤其是恶劣天气条件下进行场景重建时,由于将天气引起的伪影直接重建为场景的一部分,导致重建场景清晰度大幅下降的问题。为解决这一挑战,论文提出了WeatherGS框架,该框架基于3DGS,能够从多视角图像中重建出不同天气条件下的清晰场景。解决方案的关键在于将多天气伪影明确分类为具有不同特性的密集颗粒(如雪花和雨滴)和镜头遮挡(如相机镜头上的降水),并采用密集到稀疏的预处理策略,依次通过大气效应过滤器(Atmospheric Effect Filter, AEF)去除密集颗粒,再通过镜头效应检测器(Lens Effect Detector, LED)提取相对稀疏的遮挡掩码。最后,利用处理后的图像和生成的掩码训练一组3D高斯,通过高斯泼溅准确恢复出清晰的底层场景。

链接: https://arxiv.org/abs/2412.18862
作者: Chenghao Qian,Yuhu Guo,Wenjing Li,Gustav Markkula
机构: 未知
关键词: gained significant attention, complex outdoor environments, outdoor environments, Gaussian Splatting, gained significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has gained significant attention for 3D scene reconstruction, but still suffers from complex outdoor environments, especially under adverse weather. This is because 3DGS treats the artifacts caused by adverse weather as part of the scene and will directly reconstruct them, largely reducing the clarity of the reconstructed scene. To address this challenge, we propose WeatherGS, a 3DGS-based framework for reconstructing clear scenes from multi-view images under different weather conditions. Specifically, we explicitly categorize the multi-weather artifacts into the dense particles and lens occlusions that have very different characters, in which the former are caused by snowflakes and raindrops in the air, and the latter are raised by the precipitation on the camera lens. In light of this, we propose a dense-to-sparse preprocess strategy, which sequentially removes the dense particles by an Atmospheric Effect Filter (AEF) and then extracts the relatively sparse occlusion masks with a Lens Effect Detector (LED). Finally, we train a set of 3D Gaussians by the processed images and generated masks for excluding occluded areas, and accurately recover the underlying clear scene by Gaussian splatting. We conduct a diverse and challenging benchmark to facilitate the evaluation of 3D reconstruction under complex weather scenarios. Extensive experiments on this benchmark demonstrate that our WeatherGS consistently produces high-quality, clean scenes across various weather scenarios, outperforming existing state-of-the-art methods. See project page:this https URL.
zh

[CV-122] Few-shot Metric Domain Adaptation: Practical Learning Strategies for an Automated Plant Disease Diagnosis AAAI

【速读】: 该论文旨在解决基于图像的植物病害自动诊断系统在实际应用中因训练数据与目标环境(域)不一致而导致的诊断能力显著下降的问题。具体而言,由于训练数据集的规模有限、病害症状的多样性以及栽培环境和成像条件(如设备和构图)的显著差异,导致训练数据的多样性不足,从而限制了系统的鲁棒性和泛化能力。为解决这一问题,论文提出了Few-shot Metric Domain Adaptation (FMDA)方法,其关键是通过引入约束条件,最小化源数据(训练数据)与目标数据特征空间之间的“距离”,从而减少域差异。FMDA具有计算效率高、仅需基本特征距离计算和反向传播的特点,并可无缝集成到任何机器学习流程中。实验结果表明,FMDA在仅使用目标域中每种病害的10张图像的情况下,显著提升了诊断准确性,且优于利用相同数据进行微调的方法。

链接: https://arxiv.org/abs/2412.18859
作者: Shoma Kudo,Satoshi Kagiwada,Hitoshi Iyatomi
机构: 未知
关键词: explored image-based automated, Numerous studies, demonstrating impressive diagnostic, impressive diagnostic capabilities, plant disease diagnosis
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, 3 tables. Accepted at 4th Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE)

点击查看摘要

Abstract:Numerous studies have explored image-based automated systems for plant disease diagnosis, demonstrating impressive diagnostic capabilities. However, recent large-scale analyses have revealed a critical limitation: that the diagnostic capability suffers significantly when validated on images captured in environments (domains) differing from those used during training. This shortfall stems from the inherently limited dataset size and the diverse manifestation of disease symptoms, combined with substantial variations in cultivation environments and imaging conditions, such as equipment and composition. These factors lead to insufficient variety in training data, ultimately constraining the system’s robustness and generalization. To address these challenges, we propose Few-shot Metric Domain Adaptation (FMDA), a flexible and effective approach for enhancing diagnostic accuracy in practical systems, even when only limited target data is available. FMDA reduces domain discrepancies by introducing a constraint to the diagnostic model that minimizes the “distance” between feature spaces of source (training) data and target data with limited samples. FMDA is computationally efficient, requiring only basic feature distance calculations and backpropagation, and can be seamlessly integrated into any machine learning (ML) pipeline. In large-scale experiments, involving 223,015 leaf images across 20 fields and 3 crop species, FMDA achieved F1 score improvements of 11.1 to 29.3 points compared to cases without target data, using only 10 images per disease from the target domain. Moreover, FMDA consistently outperformed fine-tuning methods utilizing the same data, with an average improvement of 8.5 points.
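FMDA 的核心是在诊断模型上增加一个最小化源/目标特征空间“距离”的约束项。下面是一个示意性的 PyTorch 草图(非论文官方实现;这里用均值特征距离作为“距离”的简化代理,权重 lam、特征维度等均为假设):

```python
import torch
import torch.nn.functional as F

def feature_distance(feat_src, feat_tgt):
    """Squared distance between mean embeddings; a simplified proxy for the
    source/target feature-space 'distance' (illustrative, not the paper's)."""
    return ((feat_src.mean(dim=0) - feat_tgt.mean(dim=0)) ** 2).sum()

def fmda_style_loss(logits_src, labels_src, feat_src, feat_tgt, lam=0.1):
    # supervised loss on source data + feature-alignment penalty to the target
    return F.cross_entropy(logits_src, labels_src) + lam * feature_distance(feat_src, feat_tgt)

# toy tensors: 8 labeled source samples and 4 few-shot target samples
feat_src, feat_tgt = torch.randn(8, 128), torch.randn(4, 128)
logits_src, labels_src = torch.randn(8, 5), torch.randint(0, 5, (8,))
print(fmda_style_loss(logits_src, labels_src, feat_src, feat_tgt))
```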
zh

[CV-123] Cross-View Image Set Geo-Localization

【速读】: 该论文旨在解决跨视角地理定位(Cross-View Geo-Localization, CVGL)中视角多样性不足的问题。现有方法主要依赖单张图像或固定视角的图像序列作为查询,限制了视角的多样性,而人类在视觉定位时通常会通过移动获取多视角信息。为此,论文提出了一种新任务:跨视角图像集地理定位(Cross-View Image Set Geo-Localization, Set-CVGL),通过整合多视角图像作为查询集来提高定位的可靠性。为支持该任务,论文引入了SetVL-480K基准数据集,包含全球范围内捕获的480,000张地面图像及其对应的卫星图像,每张卫星图像平均对应40张不同视角和位置的地面图像。此外,论文提出了FlexGeo方法,该方法专为Set-CVGL设计,同时也能适应单张图像和图像序列输入。FlexGeo的核心模块包括相似性引导特征融合器(Similarity-guided Feature Fuser, SFF),用于自适应融合图像特征,以及个体级属性学习器(Individual-level Attributes Learner, IAL),利用每张图像的地理属性进行全面的场景感知。FlexGeo在SetVL-480K及两个公开数据集SeqGeo和KITTI-CVL上均显著优于现有方法,尤其在SetVL-480K上实现了超过22%的定位精度提升。

链接: https://arxiv.org/abs/2412.18852
作者: Qiong Wu,Panwang Xia,Lei Yu,Yi Liu,Mingtao Xiong,Liheng Zhong,Jingdong Chen,Ming Yang,Yongjun Zhang,Yi Wan
机构: 未知
关键词: augmented reality, widely applied, applied in fields, robotic navigation, navigation and augmented
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-view geo-localization (CVGL) has been widely applied in fields such as robotic navigation and augmented reality. Existing approaches primarily use single images or fixed-view image sequences as queries, which limits perspective diversity. In contrast, when humans determine their location visually, they typically move around to gather multiple perspectives. This behavior suggests that integrating diverse visual cues can improve geo-localization reliability. Therefore, we propose a novel task: Cross-View Image Set Geo-Localization (Set-CVGL), which gathers multiple images with diverse perspectives as a query set for localization. To support this task, we introduce SetVL-480K, a benchmark comprising 480,000 ground images captured worldwide and their corresponding satellite images, with each satellite image corresponding to an average of 40 ground images from varied perspectives and locations. Furthermore, we propose FlexGeo, a flexible method designed for Set-CVGL that can also adapt to single-image and image-sequence inputs. FlexGeo includes two key modules: the Similarity-guided Feature Fuser (SFF), which adaptively fuses image features without prior content dependency, and the Individual-level Attributes Learner (IAL), leveraging geo-attributes of each image for comprehensive scene perception. FlexGeo consistently outperforms existing methods on SetVL-480K and two public datasets, SeqGeo and KITTI-CVL, achieving a localization accuracy improvement of over 22% on SetVL-480K.
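对于“相似性引导的特征融合”这类集合级聚合,下面给出一个通用的替代性示意(并非论文的 SFF 实现,仅为说明思路;按与集合均值特征的余弦相似度对各视角加权求和,temperature 等参数为假设):

```python
import torch
import torch.nn.functional as F

def similarity_guided_fusion(set_feats, temperature=0.1):
    """Illustrative stand-in for set-level fusion: weight each image in the
    query set by its cosine similarity to the set mean, then sum."""
    mean = F.normalize(set_feats.mean(dim=0, keepdim=True), dim=-1)   # (1, D)
    sims = (F.normalize(set_feats, dim=-1) @ mean.t()).squeeze(-1)    # (N,)
    weights = F.softmax(sims / temperature, dim=0)
    return (weights[:, None] * set_feats).sum(dim=0)                  # fused (D,)

set_feats = torch.randn(5, 256)   # five ground-view images of one location
print(similarity_guided_fusion(set_feats).shape)
```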
zh

[CV-124] SWAG: Long-term Surgical Workflow Prediction with Generative-based Anticipation

【速读】: 该论文旨在解决现有手术阶段识别方法在预测未来手术步骤方面的局限性,特别是缺乏对手术工作流程动态性和序列性的长期预测能力。为此,作者提出了SWAG(Surgical Workflow Anticipative Generation)框架,该框架通过整合阶段识别和长期预测功能,提供了一种统一的解决方案。SWAG的关键创新在于采用了两种生成式解码方法——单次通过(SP)和自回归(AR)——来预测未来手术阶段的序列,并引入了一种新颖的先验知识嵌入机制以提高预测的准确性。此外,SWAG还通过回归到分类(R2C)方法将连续预测映射到离散时间片段,从而在分类和回归任务中均表现出色。该框架在Cholec80和AutoLaparo21数据集上的评估结果表明,其在15分钟预测和剩余时间预测任务中均优于现有方法,为实时手术工作流程预测提供了强有力的工具。

链接: https://arxiv.org/abs/2412.18849
作者: Maxence Boels,Yang Liu,Prokar Dasgupta,Alejandro Granados,Sebastien Ourselin
机构: 未知
关键词: future procedural steps, identifying current surgical, recognition approaches excel, Workflow Anticipative Generation, Surgical Workflow Anticipative
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to IJCARS, Demo website: this https URL

点击查看摘要

Abstract:While existing recognition approaches excel at identifying current surgical phases, they provide limited foresight into future procedural steps, restricting their intraoperative utility. Similarly, current anticipation methods are constrained to predicting short-term events or singular future occurrences, neglecting the dynamic and sequential nature of surgical workflows. To address these limitations, we propose SWAG (Surgical Workflow Anticipative Generation), a unified framework for phase recognition and long-term anticipation of surgical workflows. SWAG employs two generative decoding methods – single-pass (SP) and auto-regressive (AR) – to predict sequences of future surgical phases. A novel prior knowledge embedding mechanism enhances the accuracy of anticipatory predictions. The framework addresses future phase classification and remaining time regression tasks. Additionally, a regression-to-classification (R2C) method is introduced to map continuous predictions to discrete temporal segments. SWAG’s performance was evaluated on the Cholec80 and AutoLaparo21 datasets. The single-pass classification model with prior knowledge embeddings (SWAG-SP*) achieved 53.5% accuracy in 15-minute anticipation on AutoLaparo21, while the R2C model reached 60.8% accuracy on Cholec80. SWAG’s single-pass regression approach outperformed existing methods for remaining time prediction, achieving weighted mean absolute errors of 0.32 and 0.48 minutes for 2- and 3-minute horizons, respectively. SWAG demonstrates versatility across classification and regression tasks, offering robust tools for real-time surgical workflow anticipation. By unifying recognition and anticipatory capabilities, SWAG provides actionable predictions to enhance intraoperative decision-making.
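其中“回归到分类(R2C)”的映射可以用一个很小的示例说明(示意性草图,非论文实现;15 分钟预测窗口与等宽分箱方式均为假设):

```python
import numpy as np

def regression_to_classification(pred_minutes, horizon=15.0, n_bins=15):
    """Sketch of an R2C mapping: bucket a continuous remaining-time prediction
    into one of n_bins discrete temporal segments (bin layout is assumed)."""
    edges = np.linspace(0.0, horizon, n_bins + 1)
    return int(np.clip(np.digitize(pred_minutes, edges) - 1, 0, n_bins - 1))

# predictions beyond the horizon fall into the last segment
print([regression_to_classification(t) for t in (0.2, 7.3, 14.9, 20.0)])
```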
zh

[CV-125] Improving Integrated Gradient-based Transferable Adversarial Examples by Refining the Integration Path AAAI2025

【速读】: 该论文旨在解决现有基于集成梯度(Integrated Gradients, IG)的对抗样本攻击在迁移性(transferability)方面的局限性。尽管IG最初是为模型可解释性开发的,但其在对抗攻击中的直接应用效果有限。论文通过聚焦IG的积分路径,从多重性(multiplicity)、单调性(monotonicity)和多样性(diversity)三个方面对其进行优化,并提出了多重单调多样化集成梯度(Multiple Monotonic Diversified Integrated Gradients, MuMoDIG)攻击方法。实验表明,MuMoDIG在不同卷积神经网络(CNN)和视觉Transformer(ViT)模型及防御机制上生成的对抗样本具有更高的迁移性,显著优于现有IG攻击和其他先进攻击方法。该研究揭示了将已有技术迁移至提升迁移性领域时所需的非平凡努力。

链接: https://arxiv.org/abs/2412.18844
作者: Yuchen Ren,Zhengyu Zhao,Chenhao Lin,Bo Yang,Lu Zhou,Zhe Liu,Chao Shen
机构: 未知
关键词: black-box attack scenarios, threats in practical, Diversified Integrated Gradients, integrated gradients, Monotonic Diversified Integrated
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Transferable adversarial examples are known to cause threats in practical, black-box attack scenarios. A notable approach to improving transferability is using integrated gradients (IG), originally developed for model interpretability. In this paper, we find that existing IG-based attacks have limited transferability due to their naive adoption of IG in model interpretability. To address this limitation, we focus on the IG integration path and refine it in three aspects: multiplicity, monotonicity, and diversity, supported by theoretical analyses. We propose the Multiple Monotonic Diversified Integrated Gradients (MuMoDIG) attack, which can generate highly transferable adversarial examples on different CNN and ViT models and defenses. Experiments validate that MuMoDIG outperforms the latest IG-based attack by up to 37.3% and other state-of-the-art attacks by 8.4%. In general, our study reveals that migrating established techniques to improve transferability may require non-trivial efforts. Code is available at this https URL.
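MuMoDIG 建立在集成梯度(IG)之上。下面给出沿直线积分路径计算标准 IG 的最小 PyTorch 草图(仅说明 IG 本身,不包含论文对积分路径的多重性/单调性/多样性改造;toy 模型与步数均为假设):

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=20):
    """Plain integrated gradients along the straight-line path from `baseline`
    to `x` (the kind of path that IG-based attacks then refine)."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)
    path = baseline + alphas * (x - baseline)          # (steps, C, H, W)
    path.requires_grad_(True)
    out = model(path)[:, target].sum()
    grads = torch.autograd.grad(out, path)[0]
    return (x - baseline) * grads.mean(dim=0)          # attribution map

# toy classifier and input, purely for illustration
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(3, 8, 8)
attr = integrated_gradients(model, x, torch.zeros_like(x), target=3)
print(attr.shape)
```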
zh

[CV-126] Context-Based Semantic-Aware Alignment for Semi-Supervised Multi-Label Learning

【速读】: 该论文旨在解决半监督多标签学习(Semi-Supervised Multi-Label Learning, SSMLL)中由于缺乏精确标注的多标签数据而面临的挑战。现有的基于微调视觉语言模型(Vision-Language Models, VLMs)的方法在弱监督多标签学习中取得了一定进展,但未能充分利用标注数据来增强未标注数据的学习。为此,论文提出了一种基于上下文的语义感知对齐方法,通过利用VLMs的知识来解决SSMLL问题。其关键解决方案包括:1)引入一种新颖的框架设计,提取标签特定的图像特征,以实现文本特征与标签特定图像特征之间的紧凑对齐,从而生成高质量的伪标签;2)设计了一个半监督上下文识别辅助任务,通过捕捉共现信息来增强特征表示,从而提升模型对图像的全面理解。实验结果表明,该方法在多个基准数据集上具有显著的有效性。

链接: https://arxiv.org/abs/2412.18842
作者: Heng-Bo Fan,Ming-Kun Xie,Jia-Hao Xiao,Sheng-Jun Huang
机构: 未知
关键词: gradually gained attention, precisely-annotated multi-label data, extensive precisely-annotated multi-label, multi-label learning, weakly-supervised multi-label learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Due to the lack of extensive precisely-annotated multi-label data in the real world, semi-supervised multi-label learning (SSMLL) has gradually gained attention. Abundant knowledge embedded in vision-language models (VLMs) pre-trained on large-scale image-text pairs could alleviate the challenge of limited labeled data under SSMLL. While existing methods based on fine-tuning VLMs have achieved advances in weakly-supervised multi-label learning, they failed to fully leverage the information from labeled data to enhance the learning of unlabeled data. In this paper, we propose a context-based semantic-aware alignment method to solve the SSMLL problem by leveraging the knowledge of VLMs. To address the challenge of handling multiple semantics within an image, we introduce a novel framework design to extract label-specific image features. This design allows us to achieve a more compact alignment between text features and label-specific image features, leading the model to generate high-quality pseudo-labels. To incorporate the model with comprehensive understanding of image, we design a semi-supervised context identification auxiliary task to enhance the feature representation by capturing co-occurrence information. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our proposed method.
zh

[CV-127] DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering

【速读】: 该论文旨在解决细粒度聚类(fine-grained clustering)任务中的挑战,即如何捕捉不同类别实例之间的细微差异。这些细微差异容易受到数据增强(data augmentation)的干扰或被数据中的冗余信息所掩盖,导致现有聚类方法的性能显著下降。论文提出的解决方案DiFiC基于条件扩散模型(conditional diffusion model),其关键创新在于通过推断用于图像生成的文本条件(textual conditions)来提取更精确且有利于聚类的对象语义,而非直接从图像中提取判别特征。此外,DiFiC通过正则化扩散目标(diffusion target)并利用邻域相似性(neighborhood similarity)来引导蒸馏过程,从而进一步提升聚类效果。实验结果表明,DiFiC在四个细粒度图像聚类基准上均优于现有的判别性和生成性聚类方法。

链接: https://arxiv.org/abs/2412.18838
作者: Ruohong Yang,Peng Hu,Xi Peng,Xiting Liu,Yunfan Li
机构: 未知
关键词: subtle differences, practical yet challenging, essence lies, lies in capturing, fine-grained clustering method
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-grained clustering is a practical yet challenging task, whose essence lies in capturing the subtle differences between instances of different classes. Such subtle differences can be easily disrupted by data augmentation or be overwhelmed by redundant information in data, leading to significant performance degradation for existing clustering methods. In this work, we introduce DiFiC a fine-grained clustering method building upon the conditional diffusion model. Distinct from existing works that focus on extracting discriminative features from images, DiFiC resorts to deducing the textual conditions used for image generation. To distill more precise and clustering-favorable object semantics, DiFiC further regularizes the diffusion target and guides the distillation process utilizing neighborhood similarity. Extensive experiments demonstrate that DiFiC outperforms both state-of-the-art discriminative and generative clustering methods on four fine-grained image clustering benchmarks. We hope the success of DiFiC will inspire future research to unlock the potential of diffusion models in tasks beyond generation. The code will be released.
zh

[CV-128] Adaptive Rate Control for Deep Video Compression with Rate-Distortion Prediction

【速读】: 该论文旨在解决深度视频压缩(deep video compression)中速率控制(rate control)方案尚未得到充分研究的问题。传统视频压缩方法在速率控制方面已有成熟方案,但针对深度视频压缩的速率控制方案仍存在不足。论文提出了一种基于神经网络的λ域速率控制方案,通过直接从未压缩帧中学习率失真-λ(R-D-λ)关系,确定每帧的编码参数λ,从而实现高效的速率控制,且无需预编码。该方案的关键在于引入了两个基于神经网络的预测器,分别估计每帧的比特率与λ之间的关系以及失真与λ之间的关系,进而根据目标比特率确定每帧的编码参数λ。实验结果表明,该方法在mini-GOP级别实现了高精度的速率控制,且时间开销低,同时能够有效缓解不同分辨率视频内容中的帧间质量波动。

链接: https://arxiv.org/abs/2412.18834
作者: Bowen Gu,Hao Chen,Ming Lu,Jie Yao,Zhan Ma
机构: 未知
关键词: Deep video compression, made significant progress, Deep video, video compression methods, lambda
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Deep video compression has made significant progress in recent years, achieving rate-distortion performance that surpasses that of traditional video compression methods. However, rate control schemes tailored for deep video compression have not been well studied. In this paper, we propose a neural network-based λ-domain rate control scheme for deep video compression, which determines the coding parameter λ for each to-be-coded frame based on the rate-distortion-λ (R-D-λ) relationships directly learned from uncompressed frames, achieving high rate control accuracy efficiently without the need for pre-encoding. Moreover, this content-aware scheme is able to mitigate inter-frame quality fluctuations and adapt to abrupt changes in video content. Specifically, we introduce two neural network-based predictors to estimate the relationship between bitrate and λ, as well as the relationship between distortion and λ for each frame. Then we determine the coding parameter λ for each frame to achieve the target bitrate. Experimental results demonstrate that our approach achieves high rate control accuracy at the mini-GOP level with low time overhead and mitigates inter-frame quality fluctuations across video content of varying resolutions.
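作为背景,经典的 λ 域码率控制常假设 R 与 λ 近似满足幂律关系 R = a·λ^b,并据此反解出满足目标码率的 λ。下面的草图演示这种传统拟合方式(仅作对照说明,并非论文中基于神经网络预测器的方法;码率数据为虚构):

```python
import numpy as np

# classical lambda-domain model: R(lmbda) = a * lmbda**b, with b < 0
def fit_r_lambda(lmbdas, rates):
    """Least-squares fit of log R = log a + b * log lambda (illustrative only)."""
    b, log_a = np.polyfit(np.log(lmbdas), np.log(rates), 1)
    return np.exp(log_a), b

def lambda_for_target_rate(a, b, target_rate):
    return float((target_rate / a) ** (1.0 / b))

lmbdas = np.array([20.0, 40.0, 80.0, 160.0])
rates = np.array([4.0, 2.4, 1.5, 0.9])   # hypothetical bits-per-pixel measurements
a, b = fit_r_lambda(lmbdas, rates)
print(lambda_for_target_rate(a, b, target_rate=2.0))
```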
zh

[CV-129] Federated Learning with Partially Labeled Data: A Conditional Distillation Approach

【速读】: 该论文旨在解决医学影像领域中多器官和病变的通用分割模型开发所面临的挑战,特别是由于完全标注数据集的稀缺性和严格的隐私法规导致的数据共享障碍。现有的联邦学习(Federated Learning, FL)方法在处理部分标注数据时往往存在模型发散和灾难性遗忘的问题。为此,论文提出了ConDistFL,一种新颖的联邦学习框架,通过引入条件蒸馏(conditional distillation)来有效应对这些挑战。ConDistFL能够在部分标注的数据集上进行高效学习,显著提高分布式和非均匀数据集上的分割精度。此外,ConDistFL在计算和通信效率上表现出色,确保了其在实际应用中的可扩展性,并在联邦外测试中展现出卓越的泛化能力,甚至能够适应未见过的对比阶段(如非对比CT图像)。

链接: https://arxiv.org/abs/2412.18833
作者: Pochuan Wang,Chen Shen,Masahiro Oda,Chiou-Shann Fuh,Kensaku Mori,Weichung Wang,Holger R. Roth
机构: 未知
关键词: handle multiple organs, developing generalized segmentation, developing generalized, lesions is crucial, handle multiple
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In medical imaging, developing generalized segmentation models that can handle multiple organs and lesions is crucial. However, the scarcity of fully annotated datasets and strict privacy regulations present significant barriers to data sharing. Federated Learning (FL) allows decentralized model training, but existing FL methods often struggle with partial labeling, leading to model divergence and catastrophic forgetting. We propose ConDistFL, a novel FL framework incorporating conditional distillation to address these challenges. ConDistFL enables effective learning from partially labeled datasets, significantly improving segmentation accuracy across distributed and non-uniform datasets. In addition to its superior segmentation performance, ConDistFL maintains computational and communication efficiency, ensuring its scalability for real-world applications. Furthermore, ConDistFL demonstrates remarkable generalizability, significantly outperforming existing FL methods in out-of-federation tests, even adapting to unseen contrast phases (e.g., non-contrast CT images) in our experiments. Extensive evaluations on 3D CT and 2D chest X-ray datasets show that ConDistFL is an efficient, adaptable solution for collaborative medical image segmentation in privacy-constrained settings.
zh

[CV-130] Distortion-Aware Adversarial Attacks on Bounding Boxes of Object Detectors

【速读】: 该论文旨在解决深度学习(Deep Learning)目标检测模型在面对对抗性攻击(adversarial attacks)时的脆弱性问题。尽管目标检测模型在许多实际应用中表现出高精度,但它们容易受到对抗性样本的攻击,尤其是在分类器(classifiers)的背景下,这种攻击方式并不完全适用于实际的目标检测场景。为此,作者提出了一种新颖的方法,通过在训练过程中扰动目标置信度分数(object confidence scores)来生成对抗性图像,从而暴露当前最先进目标检测器的漏洞,并推动后续研究构建更具鲁棒性的检测器。该方法的关键在于利用检测对象的掩码(masks)和训练损失(training loss),通过迭代图像的梯度(gradient of iterative images)来控制原始图像的失真,从而嵌入加性噪声(additive noises)。实验验证了该方法在多种目标检测器(如YOLOv8、Faster R-CNN、RetinaNet和Swin Transformer)上的有效性,并在MS COCO 2017和PASCAL VOC 2012数据集上评估了攻击成功率与图像失真之间的权衡。结果表明,在白盒攻击(white-box attacks)和黑盒攻击(black-box attacks)中,攻击成功率分别可达100%和98%。

链接: https://arxiv.org/abs/2412.18815
作者: Pham Phuc,Son Vuong,Khang Nguyen,Tuan Dang
机构: 未知
关键词: Deep learning-based object, Deep learning-based, learning-based object detection, real-world applications, decade due
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning-based object detection has become ubiquitous in the last decade due to its high accuracy in many real-world applications. With this growing trend, these models are interested in being attacked by adversaries, with most of the results being on classifiers, which do not match the context of practical object detection. In this work, we propose a novel method to fool object detectors, expose the vulnerability of state-of-the-art detectors, and promote later works to build more robust detectors to adversarial examples. Our method aims to generate adversarial images by perturbing object confidence scores during training, which is crucial in predicting confidence for each class in the testing phase. Herein, we provide a more intuitive technique to embed additive noises based on detected objects’ masks and the training loss with distortion control over the original image by leveraging the gradient of iterative images. To verify the proposed method, we perform adversarial attacks against different object detectors, including the most recent state-of-the-art models like YOLOv8, Faster R-CNN, RetinaNet, and Swin Transformer. We also evaluate our technique on MS COCO 2017 and PASCAL VOC 2012 datasets and analyze the trade-off between success attack rate and image distortion. Our experiments show that the achievable success attack rate is up to 100 % and up to 98 % when performing white-box and black-box attacks, respectively. The source code and relevant documentation for this work are available at the following link: this https URL
zh

[CV-131] DebiasDiff: Debiasing Text-to-image Diffusion Models with Self-discovering Latent Attribute Directions

【速读】: 该论文旨在解决扩散模型(Diffusion Models, DM)在图像生成任务中反映出的训练集固有偏差问题,这些偏差可能导致对少数群体的不公平对待,并传播扭曲的世界观。现有方法通常需要重新训练模型,并依赖于人工构建的参考数据集或额外的分类器,这不仅带来高昂的标注成本,且去偏效果受限于参考数据集或分类器的质量。为解决这些问题,论文提出了DebiasDiff方法,其关键创新在于通过自发现的方式学习属性潜在方向,从而消除对参考数据集的依赖。DebiasDiff由一组属性适配器和一个分布指示器组成,适配器通过噪声组合进行优化,分布指示器则引导生成过程朝向预设分布。该方法能够同时去除多个属性的偏差,且无需重新训练,具有轻量化和易于集成的特点。实验表明,DebiasDiff在去除性别、种族及其交叉偏差方面显著优于现有方法。

链接: https://arxiv.org/abs/2412.18810
作者: Yilei Jiang,Weihong Li,Yiyuan Zhang,Minghong Cai,Xiangyu Yue
机构: 未知
关键词: image generative tasks, inherent bias presented, exhibit remarkable performance, Diffusion Models, exhibit remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Diffusion Models (DM) exhibit remarkable performance across various image generative tasks, they nonetheless reflect the inherent bias presented in the training set. As DMs are now widely used in real-world applications, these biases could perpetuate a distorted worldview and hinder opportunities for minority groups. Existing methods on debiasing DMs usually requires model re-training with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets causes expensive annotation cost; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose DebiasDiff, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference dataset. Specifically, DebiasDiff consists of two parts: a set of attribute adapters and a distribution indicator. Each adapter in the set aims to learn an attribute latent direction, and is optimized via noise composition through a self-discovering process. Then, the distribution indicator is multiplied by the set of adapters to guide the generation process towards the prescribed distribution. Our method enables debiasing multiple attributes in DMs simultaneously, while remaining lightweight and easily integrable with other DMs, eliminating the need for re-training. Extensive experiments on debiasing gender, racial, and their intersectional biases show that our method outperforms previous SOTA by a large margin.
zh

[CV-132] Provable Uncertainty Decomposition via Higher-Order Calibration ICLR2025

【速读】: 该论文旨在解决模型预测不确定性(predictive uncertainty)的分解问题,特别是如何将不确定性分解为偶然不确定性(aleatoric uncertainty)和认知不确定性(epistemic uncertainty),并确保这些分解与真实世界数据分布具有明确的语义关联。现有文献中的许多方法缺乏形式化的保证,而本文提出的方法基于高阶校准(higher-order calibration)这一新概念,将普通校准(ordinary calibration)推广到高阶预测器(higher-order predictors)的设定中,这些预测器在每个点上预测标签分布的混合。通过使用k-快照(k-snapshots),即每个点具有k个独立条件标签的示例,本文展示了如何测量和实现高阶校准。在高阶校准下,模型在某个点上估计的偶然不确定性保证与真实世界中所有预测点的平均偶然不确定性相匹配。这是首次在不假设真实世界数据分布的情况下提供此类形式化保证。此外,高阶校准也适用于现有的高阶预测器,如贝叶斯模型和集成模型,并为这些模型提供了自然的评估指标。实验表明,该方法在图像分类任务中产生了有意义的不确定性分解。

链接: https://arxiv.org/abs/2412.18808
作者: Gustaf Ahdritz,Aravind Gollakota,Parikshit Gopalan,Charlotte Peale,Udi Wieder
机构: 未知
关键词: explicit semantics relating, give a principled, decomposing the predictive, epistemic components, components with explicit
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Submitted to ICLR 2025

点击查看摘要

Abstract:We give a principled method for decomposing the predictive uncertainty of a model into aleatoric and epistemic components with explicit semantics relating them to the real-world data distribution. While many works in the literature have proposed such decompositions, they lack the type of formal guarantees we provide. Our method is based on the new notion of higher-order calibration, which generalizes ordinary calibration to the setting of higher-order predictors that predict mixtures over label distributions at every point. We show how to measure as well as achieve higher-order calibration using access to k-snapshots, namely examples where each point has k independent conditional labels. Under higher-order calibration, the estimated aleatoric uncertainty at a point is guaranteed to match the real-world aleatoric uncertainty averaged over all points where the prediction is made. To our knowledge, this is the first formal guarantee of this type that places no assumptions whatsoever on the real-world data distribution. Importantly, higher-order calibration is also applicable to existing higher-order predictors such as Bayesian and ensemble models and provides a natural evaluation metric for such models. We demonstrate through experiments that our method produces meaningful uncertainty decompositions for image classification.
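为帮助理解“偶然/认知不确定性分解”,下面给出集成模型上常用的基于熵的分解示例(这是通用做法的示意,并非论文的高阶校准方法本身;成员预测为虚构数据):总不确定性取平均预测分布的熵,偶然不确定性取各成员预测熵的均值,两者之差视为认知不确定性。

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def decompose_uncertainty(member_probs):
    """member_probs: (n_members, n_classes) predictions at one point.
    Returns (total, aleatoric, epistemic) via the usual entropy decomposition."""
    mean_p = member_probs.mean(axis=0)
    total = entropy(mean_p)
    aleatoric = entropy(member_probs, axis=-1).mean()   # expected conditional entropy
    return total, aleatoric, total - aleatoric          # epistemic = mutual information

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # disagreeing ensemble members
print(decompose_uncertainty(probs))
```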
zh

[CV-133] FOR: Finetuning for Object Level Open Vocabulary Image Retrieval WACV2025

【速读】: 该论文旨在解决在大规模数据集中通过开放集文本查询(open set textual query)准确检索包含目标对象的图像的问题。当前的主流方法依赖于预训练的CLIP模型,但未对目标域进行适应性调整,仅通过额外的后处理来平衡准确性和效率。论文提出的解决方案FOR(Finetuning for Object-centric Open-vocabulary Image Retrieval)通过在目标数据集上使用闭集标签(closed-set labels)进行微调,同时保持视觉-语言关联,以支持开放词汇检索。FOR的关键设计包括:1)针对任务定制的CLIP头部的专用解码器变体;2)将其嵌入多目标训练框架中。这些设计显著提升了检索精度,在三个数据集上比现有技术(SoTA)提高了最多8个mAP@50点。此外,FOR在半监督设置下也表现出色,即使仅使用少量标注数据也能取得优异结果。

链接: https://arxiv.org/abs/2412.18806
作者: Hila Levi,Guy Heller,Dan Levi
机构: 未知
关键词: gains practical importance, set textual query, textual query gains, query gains practical, accurately retrieving images
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: WACV 2025

点击查看摘要

Abstract:As working with large datasets becomes standard, the task of accurately retrieving images containing objects of interest by an open set textual query gains practical importance. The current leading approach utilizes a pre-trained CLIP model without any adaptation to the target domain, balancing accuracy and efficiency through additional post-processing. In this work, we propose FOR: Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows finetuning on a target dataset using closed-set labels while keeping the visual-language association crucial for open vocabulary retrieval. FOR is based on two design elements: a specialized decoder variant of the CLIP head customized for the intended task, and its coupling within a multi-objective training framework. Together, these design choices result in a significant increase in accuracy, showcasing improvements of up to 8 mAP@50 points over SoTA across three datasets. Additionally, we demonstrate that FOR is also effective in a semi-supervised setting, achieving impressive results even when only a small portion of the dataset is labeled.
zh

[CV-134] DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images

【速读】: 该论文旨在解决人物图像合成(Person Image Synthesis)中存在的细节缺失、肢体扭曲和服装风格偏差等问题,这些问题在虚拟试衣、图像编辑和视频制作等实际应用中尤为突出。为解决这些问题,论文提出了一种解耦表示扩散模型(Disentangled Representations Diffusion Model, DRDM),其关键创新点包括:首先,通过姿态编码器(Pose Encoder)将姿态特征编码到高维空间,以指导人物图像的生成;其次,引入身体部分子空间解耦模块(Body-Part Subspace Decoupling Block, BSDB),从源图像的不同身体部分解耦特征,并将其输入到噪声预测模块的各个层中,从而为网络提供丰富的解耦特征以生成逼真的目标图像。此外,在推理阶段,论文开发了一种基于解析图的解耦无分类器引导采样方法(Parsing Map-based Disentangled Classifier-Free Guided Sampling),以增强纹理和姿态的条件信号。通过在Deepfashion数据集上的广泛实验,验证了该方法在姿态迁移和外观控制方面的有效性。

链接: https://arxiv.org/abs/2412.18797
作者: Enbo Huang,Yuan Zhang,Faliang Huang,Guangyu Zhang,Yang Liu
机构: 未知
关键词: essential task owing, Person image synthesis, Representations Diffusion Model, virtual try-on, video production
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Person image synthesis with controllable body poses and appearances is an essential task owing to the practical needs in the context of virtual try-on, image editing and video production. However, existing methods face significant challenges with details missing, limbs distortion and the garment style deviation. To address these issues, we propose a Disentangled Representations Diffusion Model (DRDM) to generate photo-realistic images from source portraits in specific desired poses and appearances. First, a pose encoder is responsible for encoding pose features into a high-dimensional space to guide the generation of person images. Second, a body-part subspace decoupling block (BSDB) disentangles features from the different body parts of a source figure and feeds them to the various layers of the noise prediction block, thereby supplying the network with rich disentangled features for generating a realistic target image. Moreover, during inference, we develop a parsing map-based disentangled classifier-free guided sampling method, which amplifies the conditional signals of texture and pose. Extensive experimental results on the Deepfashion dataset demonstrate the effectiveness of our approach in achieving pose transfer and appearance control.
zh

[CV-135] Protective Perturbations against Unauthorized Data Usage in Diffusion-based Image Generation

【速读】: 该论文旨在解决基于扩散模型(Diffusion-based models)的文本到图像生成技术中,未经授权数据使用所带来的隐私和知识产权问题。现有的解决方案主要依赖于对抗性攻击(adversarial attacks)引入的保护性扰动(protective perturbations),这些扰动被应用于定制化样本中,以防止未经授权的数据使用。论文的关键在于系统化地梳理了现有的保护性扰动方法,建立了威胁模型,并对相关下游任务进行了分类,同时提出了一个完整的评估框架,以推动该领域的研究进展。

链接: https://arxiv.org/abs/2412.18791
作者: Sen Peng,Jijia Yang,Mingyue Wang,Jianfei He,Xiaohua Jia
机构: 未知
关键词: shown immense potential, shown immense, immense potential, Abstract, image-related tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based text-to-image models have shown immense potential for various image-related tasks. However, despite their prominence and popularity, customizing these models using unauthorized data also brings serious privacy and intellectual property issues. Existing methods introduce protective perturbations based on adversarial attacks, which are applied to the customization samples. In this systematization of knowledge, we present a comprehensive survey of protective perturbation methods designed to prevent unauthorized data usage in diffusion-based image generation. We establish the threat model and categorize the downstream tasks relevant to these methods, providing a detailed analysis of their designs. We also propose a completed evaluation framework for these perturbation techniques, aiming to advance research in this field.
zh

[CV-136] Simultaneously Recovering Multi-Person Meshes and Multi-View Cameras with Human Semantics

【速读】: 该论文旨在解决多视角动态多人网格重建(dynamic multi-person mesh recovery)中未校准相机(uncalibrated cameras)所面临的两个主要挑战:一是人与人之间的交互和遮挡导致相机校准和运动捕捉存在固有模糊性;二是动态多人场景中缺乏密集对应关系来约束稀疏相机几何。论文提出的关键解决方案是通过引入运动先验知识(motion prior knowledge),同时从噪声人体语义中估计相机参数和人体网格。具体步骤包括:首先利用2D图像中的人体信息初始化相机的内参和外参,从而避免依赖其他校准工具或背景特征;其次引入姿态-几何一致性(pose-geometry consistency)来关联不同视角下检测到的人体;最后提出潜在运动先验(latent motion prior)来优化相机参数和人体运动。实验结果表明,该方法能够通过一步重建获得准确的相机参数和人体运动。

链接: https://arxiv.org/abs/2412.18785
作者: Buzhen Huang,Jingyi Ju,Yuan Shu,Yangang Wang
机构: 未知
关键词: multi-person mesh recovery, Dynamic multi-person mesh, virtual reality, sports broadcasting, video games
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TCSVT. arXiv admin note: text overlap with arXiv:2110.10355

点击查看摘要

Abstract:Dynamic multi-person mesh recovery has broad applications in sports broadcasting, virtual reality, and video games. However, current multi-view frameworks rely on a time-consuming camera calibration procedure. In this work, we focus on multi-person motion capture with uncalibrated cameras, which mainly faces two challenges: one is that inter-person interactions and occlusions introduce inherent ambiguities for both camera calibration and motion capture; the other is that a lack of dense correspondences can be used to constrain sparse camera geometries in a dynamic multi-person scene. Our key idea is to incorporate motion prior knowledge to simultaneously estimate camera parameters and human meshes from noisy human semantics. We first utilize human information from 2D images to initialize intrinsic and extrinsic parameters. Thus, the approach does not rely on any other calibration tools or background features. Then, a pose-geometry consistency is introduced to associate the detected humans from different views. Finally, a latent motion prior is proposed to refine the camera parameters and human motions. Experimental results show that accurate camera parameters and human motions can be obtained through a one-step reconstruction. The code is publicly available at this https URL.
zh

[CV-137] ArtNVG: Content-Style Separated Artistic Neighboring-View Gaussian Stylization

【速读】: 该论文旨在解决当前3D场景风格化技术中局部颜色和纹理一致性不足的问题,这对于保持场景的美学连贯性至关重要。为解决这一问题,论文提出了ArtNVG框架,该框架基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)技术,能够高效生成风格化的3D场景。ArtNVG的关键创新在于引入了两种核心技术:内容-风格分离控制(Content-Style Separated Control)和基于注意力的邻域视图对齐(Attention-based Neighboring-View Alignment)。内容-风格分离控制通过CSGO模型和Tile ControlNet实现内容与风格的解耦,减少了信息泄露的风险;而基于注意力的邻域视图对齐则确保了相邻视图间局部颜色和纹理的一致性,显著提升了视觉质量。实验结果表明,ArtNVG在内容保留、风格对齐和局部一致性方面均优于现有方法。

链接: https://arxiv.org/abs/2412.18783
作者: Zixiao Gu,Mengtian Li,Ruhua Chen,Zhongxia Ji,Sichen Guo,Zhenye Zhang,Guangnan Ye,Zuo Hu
机构: 未知
关键词: target styles grows, Content-Style Separated Control, stylization techniques increases, importance of advanced, film and gaming
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As demand from the film and gaming industries for 3D scenes with target styles grows, the importance of advanced 3D stylization techniques increases. However, recent methods often struggle to maintain local consistency in color and texture throughout stylized scenes, which is essential for maintaining aesthetic coherence. To solve this problem, this paper introduces ArtNVG, an innovative 3D stylization framework that efficiently generates stylized 3D scenes by leveraging reference style images. Built on 3D Gaussian Splatting (3DGS), ArtNVG achieves rapid optimization and rendering while upholding high reconstruction quality. Our framework realizes high-quality 3D stylization by incorporating two pivotal techniques: Content-Style Separated Control and Attention-based Neighboring-View Alignment. Content-Style Separated Control uses the CSGO model and the Tile ControlNet to decouple the content and style control, reducing risks of information leakage. Concurrently, Attention-based Neighboring-View Alignment ensures consistency of local colors and textures across neighboring views, significantly improving visual quality. Extensive experiments validate that ArtNVG surpasses existing methods, delivering superior results in content preservation, style alignment, and local consistency.
zh

[CV-138] Skeleton-based Action Recognition with Non-linear Dependency Modeling and Hilbert-Schmidt Independence Criterion

【速读】: 该论文旨在解决基于人体骨架的动作识别(Human skeleton-based action recognition)中的两个关键问题:一是现有方法通常仅考虑相邻关节之间的依赖关系,难以捕捉物理距离较远的关节之间的非线性依赖关系;二是现有方法通过估计运动表示的概率密度来区分动作类别,但由于人体运动的高维特性,这种测量存在固有困难。为解决这些问题,论文提出了两个关键解决方案:首先,提出了一种新的依赖关系优化方法,显式地建模任意关节对之间的依赖关系,从而超越关节距离的限制;其次,提出了一个利用希尔伯特-施密特独立性准则(Hilbert-Schmidt Independence Criterion)的框架,能够在不受数据维度影响的情况下区分动作类别,并通过数学推导确保精确识别的学习目标。实验结果表明,该方法在NTU RGB+D、NTU RGB+D 120和Northwestern-UCLA数据集上达到了当前最优性能。

链接: https://arxiv.org/abs/2412.18780
作者: Yuheng Yang
机构: 未知
关键词: Human skeleton-based action, artificial intelligence, indispensable aspect, aspect of artificial, skeleton-based action recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human skeleton-based action recognition has long been an indispensable aspect of artificial intelligence. Current state-of-the-art methods tend to consider only the dependencies between connected skeletal joints, limiting their ability to capture non-linear dependencies between physically distant joints. Moreover, most existing approaches distinguish action classes by estimating the probability density of motion representations, yet the high-dimensional nature of human motions invokes inherent difficulties in accomplishing such measurements. In this paper, we seek to tackle these challenges from two directions: (1) We propose a novel dependency refinement approach that explicitly models dependencies between any pair of joints, effectively transcending the limitations imposed by joint distance. (2) We further propose a framework that utilizes the Hilbert-Schmidt Independence Criterion to differentiate action classes without being affected by data dimensionality, and we mathematically derive learning objectives guaranteeing precise recognition. Empirically, our approach sets the state-of-the-art performance on NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.
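论文所用的希尔伯特-施密特独立性准则(HSIC)有标准的有偏经验估计 HSIC = tr(KHLH)/(n-1)^2。下面是一个最小的 NumPy 草图(示意性实现,RBF 核带宽与样本均为假设,非论文代码):

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate between samples x and y (n x d arrays)."""
    n = x.shape[0]
    K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

rng = np.random.default_rng(0)
a = rng.normal(size=(100, 3))
print(hsic(a, a + 0.1 * rng.normal(size=(100, 3))),   # dependent -> larger value
      hsic(a, rng.normal(size=(100, 3))))             # independent -> near zero
```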
zh

[CV-139] Unified Local and Global Attention Interaction Modeling for Vision Transformers

【速读】: 该论文旨在解决视觉Transformer(ViT)在目标检测任务中自注意力机制(self-attention mechanism)的局限性。具体而言,传统的自注意力机制在处理视觉标记(visual tokens)时,未能在计算全局注意力之前允许这些标记与邻近特征交换局部或全局信息,导致标记在匹配其他标记时被孤立处理,忽视了有价值的空间关系。此外,点积相似度操作使得来自不同语义类别的标记在视觉上显得相似,进一步加剧了这一问题。为解决这些限制,论文提出了两项关键改进:首先,引入了一种新颖的激进卷积池化策略(aggressive convolution pooling strategy),用于局部特征混合;其次,提出了一种新的概念注意力变换(conceptual attention transformation),以促进语义概念之间的交互和特征交换。实验结果表明,在自注意力计算之前进行视觉特征的局部和全局信息交换,显著提升了在复杂目标检测任务中的性能,并在多个基准数据集和具有挑战性的医学数据集上表现出良好的泛化能力。

链接: https://arxiv.org/abs/2412.18778
作者: Tan Nguyen,Coy D. Heldermon,Corey Toler-Franklin
机构: 未知
关键词: accurate object detection, object detection, vision transformer, method that extends, accurate object
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 Pages, 24 figures

点击查看摘要

Abstract:We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets. ViTs show strong capability for image understanding tasks such as object detection, segmentation, and classification. This is due in part to their ability to leverage global information from interactions among visual tokens. However, the self-attention mechanism in ViTs is limited because it does not allow visual tokens to exchange local or global information with neighboring features before computing global attention. This is problematic because tokens are treated in isolation when attending (matching) to other tokens, and valuable spatial relationships are overlooked. This isolation is further compounded by dot-product similarity operations that make tokens from different semantic classes appear visually similar. To address these limitations, we introduce two modifications to the traditional self-attention framework: a novel aggressive convolution pooling strategy for local feature mixing, and a new conceptual attention transformation to facilitate interaction and feature exchange between semantic concepts. Experimental results demonstrate that local and global information exchange among visual features before self-attention significantly improves performance on challenging object detection tasks and generalizes across multiple benchmark datasets and challenging medical datasets. We publish source code and a novel dataset of cancerous tumors (chimeric cell clusters).
zh

[CV-140] ObitoNet: Multimodal High-Resolution Point Cloud Reconstruction

【速读】: 该论文旨在解决多模态输入下的高分辨率点云重建问题,特别是在数据稀疏或噪声较大的挑战性条件下。解决方案的关键在于采用了一种跨注意力机制(Cross Attention mechanism),通过视觉变换器(Vision Transformers, ViT)从图像中提取语义特征,同时利用最远点采样(Farthest Point Sampling, FPS)和K近邻(K Nearest Neighbors, KNN)的点云标记器处理几何信息,以捕捉空间结构。最终,学习到的多模态特征被输入到基于变换器的解码器中,实现高精度的点云重建。该方法结合了图像特征的丰富性和几何细节的精确性,确保了在复杂条件下的鲁棒性。

链接: https://arxiv.org/abs/2412.18775
作者: Apoorv Thapliyal,Vinay Lanka,Swathi Baskaran
机构: 未知
关键词: Cross Attention mechanism, Farthest Point Sampling, Vision Transformers, Nearest Neighbors, spatial structure capture
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:ObitoNet employs a Cross Attention mechanism to integrate multimodal inputs, where Vision Transformers (ViT) extract semantic features from images and a point cloud tokenizer processes geometric information using Farthest Point Sampling (FPS) and K Nearest Neighbors (KNN) for spatial structure capture. The learned multimodal features are fed into a transformer-based decoder for high-resolution point cloud reconstruction. This approach leverages the complementary strengths of both modalities, rich image features and precise geometric details, ensuring robust point cloud generation even in challenging conditions such as sparse or noisy data.
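其中最远点采样(FPS)与 K 近邻(KNN)分组是点云标记器的常见构件。下面给出两者的最小 NumPy 草图(示意性实现,采样点数、邻居数等为假设,非论文代码):

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: pick k points that are mutually far apart (N x 3 input)."""
    n = points.shape[0]
    chosen = [0]
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return points[chosen]

def knn_groups(points, centers, k=8):
    """Indices of the k nearest original points around each FPS centre."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

cloud = np.random.default_rng(0).normal(size=(2048, 3))
centers = farthest_point_sampling(cloud, 64)
print(centers.shape, knn_groups(cloud, centers).shape)
```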
zh

[CV-141] Embodied Image Quality Assessment for Robotic Intelligence ICME2025

【速读】: 该论文旨在解决机器人生成内容(RGC)的图像质量评估(IQA)问题,特别是其与人类生成内容(UGC)在图像质量评估上的差异。由于机器人作为具身代理(Embodied Agent)需要在环境中交互和感知,并执行特定任务,因此其视觉图像的质量直接影响下游任务的性能。论文提出了具身图像质量评估(EIQA)框架,通过建立基于机器人下游任务的图像评估指标,并构建了包含5000张参考图像和失真图像标注的具身偏好数据库(EPD)。该框架的关键在于将图像质量评估与机器人的具体任务需求相结合,而非仅依赖于人类主观评分。实验结果表明,具身图像的质量评估与人类评估存在显著差异,EPD的建立有望推动具身AI在图像质量评估领域的发展。

链接: https://arxiv.org/abs/2412.18774
作者: Jianbo Zhang,Chunyi Li,Liang Yuan,Guoquan Zheng,Jie Hao,Guangtao Zhai
机构: 未知
关键词: critical technique, Image quality assessment, Image quality, UGC, user-generated content
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures, ICME2025

点击查看摘要

Abstract:Image quality assessment (IQA) of user-generated content (UGC) is a critical technique for human quality of experience (QoE). However, for robot-generated content (RGC), will its image quality be consistent with the Moravec paradox and counter to human common sense? Human subjective scoring is more based on the attractiveness of the image. Embodied agents are required to interact and perceive in the environment, and finally perform specific tasks. Visual images as inputs directly influence downstream tasks. In this paper, we first propose an embodied image quality assessment (EIQA) frameworks. We establish assessment metrics for input images based on the downstream tasks of robot. In addition, we construct an Embodied Preference Database (EPD) containing 5,000 reference and distorted image annotations. The performance of mainstream IQA algorithms on EPD dataset is finally verified. The experiments demonstrate that quality assessment of embodied images is different from that of humans. We sincerely hope that the EPD can contribute to the development of embodied AI by focusing on image quality assessment. The benchmark is available at this https URL.
zh

[CV-142] Hierarchical Multi-Graphs Learning for Robust Group Re-Identification

【速读】: 该论文旨在解决群体重识别(Group Re-identification, G-ReID)中的复杂性问题,这些问题包括成员间的相互遮挡、动态交互以及群体结构的演变。现有的基于图的方法通常将群体建模为单一拓扑结构,难以泛化到多样化的群体组合,且无法充分表示群体内部的多面关系。为此,论文提出了一种分层多图学习(Hierarchical Multi-Graphs Learning, HMGL)框架,通过将群体建模为多关系图的集合,利用显式特征(如遮挡、外观和前景信息)和成员间的隐式依赖关系,来捕捉群体动态。该框架通过多图神经网络(Multi-Graphs Neural Network, MGNN)进行编码,能够有效解决成员关系中的模糊性,特别是在复杂、密集场景中。此外,论文还提出了一种多尺度匹配(Multi-Scale Matching, MSM)算法,以减少成员信息模糊性及对困难样本的敏感性,从而提升在挑战性场景中的鲁棒性。实验结果表明,该方法在CSG和RoadGroup两个标准基准上取得了当前最优的性能,Rank-1/mAP分别达到95.3%/94.4%和93.9%/95.4%,较现有方法在Rank-1准确率上分别提升了1.7%和2.5%。

链接: https://arxiv.org/abs/2412.18766
作者: Ruiqi Liu,Xingyu Liu,Xiaohao Xu,Yixuan Zhang,Yongxin Ge,Lubin Weng
机构: 未知
关键词: faces greater complexity, individual Re-identification, evolving group structures, Group Re-identification, dynamic member interactions
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Group Re-identification (G-ReID) faces greater complexity than individual Re-identification (ReID) due to challenges like mutual occlusion, dynamic member interactions, and evolving group structures. Prior graph-based approaches have aimed to capture these dynamics by modeling the group as a single topological structure. However, these methods struggle to generalize across diverse group compositions, as they fail to fully represent the multifaceted relationships within the group. In this study, we introduce a Hierarchical Multi-Graphs Learning (HMGL) framework to address these challenges. Our approach models the group as a collection of multi-relational graphs, leveraging both explicit features (such as occlusion, appearance, and foreground information) and implicit dependencies between members. This hierarchical representation, encoded via a Multi-Graphs Neural Network (MGNN), allows us to resolve ambiguities in member relationships, particularly in complex, densely populated scenes. To further enhance matching accuracy, we propose a Multi-Scale Matching (MSM) algorithm, which mitigates issues of member information ambiguity and sensitivity to hard samples, improving robustness in challenging scenarios. Our method achieves state-of-the-art performance on two standard benchmarks, CSG and RoadGroup, with Rank-1/mAP scores of 95.3%/94.4% and 93.9%/95.4%, respectively. These results mark notable improvements of 1.7% and 2.5% in Rank-1 accuracy over existing approaches.
zh

[CV-143] Successes and Limitations of Object-centric Models at Compositional Generalisation NEURIPS2024

【速读】: 该论文旨在解决标准解耦潜在变量模型(disentangled latent variable models)在视觉领域中缺乏稳健的组合学习能力的问题。尽管这些模型的设计目标是将数据集分解为其变化因素的组成部分,但它们在组合泛化能力方面表现极为有限。相比之下,以对象为中心的架构(object-centric architectures)展示了有前景的组合技能,但这些技能尚未在广泛的实验中验证,且实验主要局限于场景组合(scene composition),即模型需要泛化到视觉场景中对象的新组合,而非对象属性的新组合。本文通过实验证明,这些组合泛化技能可以扩展到对象属性的新组合场景,并进一步指出了这些技能的来源以及如何通过精细训练来提升它们。此外,论文还指出了一个仍然存在的重要限制,为未来的研究指明了新方向。解决方案的关键在于采用以对象为中心的架构,并通过精心设计的训练策略来增强其组合泛化能力。

链接: https://arxiv.org/abs/2412.18743
作者: Milton L. Montero,Jeffrey S. Bowers,Gaurav Malhotra
机构: 未知
关键词: standard disentangled latent, disentangled latent variable, latent variable models, support robust compositional, robust compositional learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: As it appeared in the Compositional Learning Workshop, NeurIPS 2024; 14 pages (5 main text, 7 appendices, 2 references); 9 figures

点击查看摘要

Abstract:In recent years, it has been shown empirically that standard disentangled latent variable models do not support robust compositional learning in the visual domain. Indeed, in spite of being designed with the goal of factorising datasets into their constituent factors of variations, disentangled models show extremely limited compositional generalisation capabilities. On the other hand, object-centric architectures have shown promising compositional skills, albeit these have 1) not been extensively tested and 2) experiments have been limited to scene composition – where models must generalise to novel combinations of objects in a visual scene instead of novel combinations of object properties. In this work, we show that these compositional generalisation skills extend to this later setting. Furthermore, we present evidence pointing to the source of these skills and how they can be improved through careful training. Finally, we point to one important limitation that still exists which suggests new directions of research.
zh

[CV-144] HELPNet: Hierarchical Perturbations Consistency and Entropy-guided Ensemble for Scribble Supervised Medical Image Segmentation

【速读】: 该论文旨在解决医学图像分割(Medical Image Segmentation)中全标注(fully annotated labels)成本高、耗时长的问题,提出了一种基于涂鸦标注(scribble annotations)的弱监督分割框架HELPNet,以在标注效率和分割性能之间取得平衡。解决方案的关键在于HELPNet框架中集成的三个模块:1)分层扰动一致性模块(Hierarchical Perturbations Consistency, HPC),通过全局、局部和焦点视图的密度控制拼图扰动(density-controlled jigsaw perturbations)增强多尺度结构特征的学习;2)熵引导伪标签模块(Entropy-guided Pseudo-label, EGPL),利用熵评估分割预测的置信度,生成高质量的伪标签;3)结构先验优化模块(Structural Prior Refinement, SPR),通过引入连通性和边界先验(connectivity and bounded priors)提高伪标签的精度和可靠性。实验结果表明,HELPNet在ACDC、MSCMRseg和CHAOS三个公开数据集上显著优于现有的涂鸦标注弱监督分割方法,并达到了与全监督方法相当的性能。

链接: https://arxiv.org/abs/2412.18738
作者: Xiao Zhang,Shaoxuan Wu,Peilin Zhang,Zhuo Jin,Xiaosong Xiong,Qirong Bu,Jingkun Chen,Jun Feng
机构: 未知
关键词: Creating fully annotated, medical image segmentation, Scribble annotations offer, fully annotated labels, time-intensive and costly
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creating fully annotated labels for medical image segmentation is prohibitively time-intensive and costly, emphasizing the necessity for innovative approaches that minimize reliance on detailed annotations. Scribble annotations offer a more cost-effective alternative, significantly reducing the expenses associated with full annotations. However, scribble annotations offer limited and imprecise information, failing to capture the detailed structural and boundary characteristics necessary for accurate organ delineation. To address these challenges, we propose HELPNet, a novel scribble-based weakly supervised segmentation framework, designed to bridge the gap between annotation efficiency and segmentation performance. HELPNet integrates three modules. The Hierarchical perturbations consistency (HPC) module enhances feature learning by employing density-controlled jigsaw perturbations across global, local, and focal views, enabling robust modeling of multi-scale structural representations. Building on this, the Entropy-guided pseudo-label (EGPL) module evaluates the confidence of segmentation predictions using entropy, generating high-quality pseudo-labels. Finally, the structural prior refinement (SPR) module incorporates connectivity and bounded priors to enhance the precision and reliability of pseudo-labels. Experimental results on three public datasets ACDC, MSCMRseg, and CHAOS show that HELPNet significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation and achieves performance comparable to fully supervised methods. The code is available at this https URL.
zh
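
下面给出上文 HELPNet 中熵引导伪标签(EGPL)思路的一个最小化示意实现:对分割网络的逐像素预测计算归一化熵,并以低熵作为高置信度来筛选伪标签。阈值 0.5、张量形状与变量名均为本文为演示而做的假设,并非论文官方代码。

```python
import torch
import torch.nn.functional as F

def entropy_guided_pseudo_labels(logits, threshold=0.5):
    """根据预测熵筛选高置信度伪标签(示意实现,阈值 0.5 为假设值)。

    logits: [B, C, H, W] 分割网络输出
    返回: pseudo_labels [B, H, W], confidence_mask [B, H, W] (bool)
    """
    probs = F.softmax(logits, dim=1)                          # 逐像素类别概率
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)   # 逐像素熵
    # 用 log(C) 归一化到 [0, 1],便于设定与类别数无关的阈值
    entropy = entropy / torch.log(torch.tensor(float(logits.shape[1])))
    pseudo_labels = probs.argmax(dim=1)                       # 伪标签取最大概率类别
    confidence_mask = entropy < threshold                     # 低熵 = 高置信度
    return pseudo_labels, confidence_mask

# 用法示例:仅在高置信度像素上施加伪标签监督
logits = torch.randn(2, 4, 64, 64)        # 假设 4 类分割
labels, mask = entropy_guided_pseudo_labels(logits)
print(labels.shape, mask.float().mean())
```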

[CV-145] Evaluating the Adversarial Robustness of Detection Transformers

【速读】: 该论文旨在解决目标检测变换器(DETRs)在对抗攻击下的鲁棒性问题,特别是在自动驾驶和移动机器人等安全关键应用中的表现。尽管DETRs在目标检测领域取得了显著进展,但其在面对对抗攻击时的脆弱性尚未得到充分研究。论文通过在白盒(white-box)和黑盒(black-box)攻击场景下对DETR及其变体进行全面评估,揭示了DETR模型与传统的基于卷积神经网络(CNN)的检测器类似,同样容易受到对抗攻击的影响。关键解决方案包括扩展现有的白盒攻击方法(如FGSM、PGD和CW)来评估DETR的脆弱性,并提出一种专门针对DETR的无目标攻击方法,利用其中间损失函数以最小扰动诱导误分类。此外,论文通过自注意力特征图的可视化,深入分析了对抗攻击如何影响DETR模型的内部表示。这些发现揭示了DETR在标准对抗攻击下的关键脆弱性,强调了未来研究在提升基于变换器的目标检测器鲁棒性方面的必要性。

链接: https://arxiv.org/abs/2412.18718
作者: Amirhossein Nazeri,Chunheng Zhao,Pierluigi Pisu
机构: 未知
关键词: Robust object detection, Robust object, mobile robotics, ensuring safety, DETR
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robust object detection is critical for autonomous driving and mobile robotics, where accurate detection of vehicles, pedestrians, and obstacles is essential for ensuring safety. Despite the advancements in object detection transformers (DETRs), their robustness against adversarial attacks remains underexplored. This paper presents a comprehensive evaluation of the DETR model and its variants under both white-box and black-box adversarial attacks, using the MS-COCO and KITTI datasets to cover general and autonomous driving scenarios. We extend prominent white-box attack methods (FGSM, PGD, and CW) to assess DETR vulnerability, demonstrating that DETR models are significantly susceptible to adversarial attacks, similar to traditional CNN-based detectors. Our extensive transferability analysis reveals high intra-network transferability among DETR variants, but limited cross-network transferability to CNN-based models. Additionally, we propose a novel untargeted attack designed specifically for DETR, exploiting its intermediate loss functions to induce misclassification with minimal perturbations. Visualizations of self-attention feature maps provide insights into how adversarial attacks affect the internal representations of DETR models. These findings reveal critical vulnerabilities in detection transformers under standard adversarial attacks, emphasizing the need for future research to enhance the robustness of transformer-based object detectors in safety-critical applications.
zh

[CV-146] Uncertainty Quantification in Stereo Matching

【速读】: 该论文旨在解决立体匹配(stereo matching)中不确定性(uncertainty)的估计与分析问题。现有研究通常对不确定性的解释有限,且难以有效将其分解为数据不确定性(aleatoric uncertainty)和模型不确定性(epistemic uncertainty)。这种分解对于理解误差的根源、提升预测置信度和决策过程至关重要。论文提出了一种新的框架,采用贝叶斯风险(Bayes risk)作为不确定性的度量,并分别估计数据和模型的不确定性。通过在四个立体匹配基准数据集上的实验,验证了该方法能够准确且高效地估计不确定性。此外,论文还通过选择不确定性较小的数据点来提升预测精度,进一步证明了所估计不确定性的准确性。

链接: https://arxiv.org/abs/2412.18703
作者: Wenxiao Cai,Dongting Hu,Ruoyan Yin,Jiankang Deng,Huan Fu,Wankou Yang,Mingming Gong
机构: 未知
关键词: Stereo matching plays, Stereo matching, safety and reliability, plays a crucial, crucial role
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stereo matching plays a crucial role in various applications, where understanding uncertainty can enhance both safety and reliability. Despite this, the estimation and analysis of uncertainty in stereo matching have been largely overlooked. Previous works often provide limited interpretations of uncertainty and struggle to separate it effectively into data (aleatoric) and model (epistemic) components. This disentanglement is essential, as it allows for a clearer understanding of the underlying sources of error, enhancing both prediction confidence and decision-making processes. In this paper, we propose a new framework for stereo matching and its uncertainty quantification. We adopt Bayes risk as a measure of uncertainty and estimate data and model uncertainty separately. Experiments are conducted on four stereo benchmarks, and the results demonstrate that our method can estimate uncertainty accurately and efficiently. Furthermore, we apply our uncertainty method to improve prediction accuracy by selecting data points with small uncertainties, which reflects the accuracy of our estimated uncertainty. The codes are publicly available at this https URL.
zh
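
作为参考,下面给出一个用深度集成(deep ensemble)近似分解数据不确定性与模型不确定性的通用示意;注意这只是常见的替代做法,并非该论文基于贝叶斯风险(Bayes risk)的原始公式,输入张量的形状与变量名均为假设。

```python
import torch

def decompose_uncertainty(pred_means, pred_vars):
    """用深度集成近似分解不确定性的通用示意(非论文的 Bayes risk 方法)。

    pred_means: [M, B, H, W] M 个模型预测的视差均值
    pred_vars:  [M, B, H, W] 各模型预测的视差方差
    """
    aleatoric = pred_vars.mean(dim=0)      # 数据不确定性 ≈ 平均预测方差
    epistemic = pred_means.var(dim=0)      # 模型不确定性 ≈ 预测均值的方差
    total = aleatoric + epistemic
    return aleatoric, epistemic, total

# 用法示例:用总不确定性筛选可信像素,呼应摘要中“选取低不确定性数据点提升精度”
means = torch.randn(5, 1, 32, 32).abs() * 10
vars_ = torch.rand(5, 1, 32, 32)
alea, epis, total = decompose_uncertainty(means, vars_)
reliable = total < total.flatten().quantile(0.5)   # 保留不确定性较小的一半像素
print(reliable.float().mean())
```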

[CV-147] STITCH: Surface reconstrucTion using Implicit neural representations with Topology Constraints and persistent Homology

【速读】: 该论文旨在解决稀疏且不规则分布的点云(point cloud)在神经隐式表面重建(neural implicit surface reconstruction)过程中难以保持拓扑约束(topological constraints)的问题,特别是确保重建对象具有单一连通分量(single connected component)。其解决方案的关键在于提出了一种基于持久同调(persistent homology)的可微分框架,通过引入拓扑损失项(topological loss terms)来强制重建对象符合单一2-流形(2-manifold)的先验条件。该方法通过随机(次)梯度下降(stochastic (sub)gradient descent)优化损失函数,确保了收敛性,并能够重建具有单一连通分量的形状。该研究展示了可微分拓扑数据分析工具在隐式表面重建中的有效集成。

链接: https://arxiv.org/abs/2412.18696
作者: Anushrut Jignasu,Ethan Herron,Zhanhong Jiang,Soumik Sarkar,Chinmay Hegde,Baskar Ganapathysubramanian,Aditya Balu,Adarsh Krishnamurthy
机构: 未知
关键词: irregularly spaced point, spaced point cloud, enforcing topological constraints, single connected component, present STITCH
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 19 pages, 12 figures, 29 tables

点击查看摘要

Abstract:We present STITCH, a novel approach for neural implicit surface reconstruction of a sparse and irregularly spaced point cloud while enforcing topological constraints (such as having a single connected component). We develop a new differentiable framework based on persistent homology to formulate topological loss terms that enforce the prior of a single 2-manifold object. Our method demonstrates excellent performance in preserving the topology of complex 3D geometries, evident through both visual and empirical comparisons. We supplement this with a theoretical analysis, and provably show that optimizing the loss with stochastic (sub)gradient descent leads to convergence and enables reconstructing shapes with a single connected component. Our approach showcases the integration of differentiable topological data analysis tools for implicit surface reconstruction.
zh

[CV-148] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation

【速读】: 该论文旨在探讨长视频生成(long video generation)领域面临的挑战及其解决方案。尽管多模态大语言模型(MLLMs)在多模态任务中取得了显著进展,生成长视频仍然是一个复杂的问题,主要难点在于需要处理规划、故事发展以及保持时空一致性等关键方面。当前最先进的系统如OpenAI的Sora,仍受限于生成一分钟以内的视频。论文提出,通过将生成式 AI(Generative AI)与分治法(divide-and-conquer approach)相结合,可以提高长视频生成的可扩展性,并提供更好的控制。论文还全面回顾了长视频生成的基础技术(如生成对抗网络 GANs 和扩散模型 diffusion models)、视频生成策略、大规模训练数据集、长视频质量评估指标以及未来研究方向,旨在为该领域的进一步研究和发展提供全面的基础。

链接: https://arxiv.org/abs/2412.18688
作者: Faraz Waseem,Muhammad Shahzad
机构: 未知
关键词: thousand words, long video generation, image frames, video generation, composed of hundreds
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 35 pages, 18 figures, Manuscript submitted to ACM

点击查看摘要

Abstract:An image may convey a thousand words, but a video composed of hundreds or thousands of image frames tells a more intricate story. Despite significant progress in multimodal large language models (MLLMs), generating extended videos remains a formidable challenge. As of this writing, OpenAI’s Sora, the current state-of-the-art system, is still limited to producing videos that are up to one minute in length. This limitation stems from the complexity of long video generation, which requires more than generative AI techniques for approximating density functions: essential aspects such as planning, story development, and maintaining spatial and temporal consistency present additional hurdles. Integrating generative AI with a divide-and-conquer approach could improve scalability for longer videos while offering greater control. In this survey, we examine the current landscape of long video generation, covering foundational techniques like GANs and diffusion models, video generation strategies, large-scale training datasets, quality metrics for evaluating long videos, and future research areas to address the limitations of existing video generation capabilities. We believe it would serve as a comprehensive foundation, offering extensive information to guide future advancements and research in the field of long video generation.
zh

[CV-149] TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models

【速读】: 该论文旨在解决多头自注意力机制(Multi-head Self-Attention, MHSA)在视觉语言模型(Vision-Language Model, VLM)中难以解释和干预的问题。MHSA虽然通过多个并行处理头增强了模型的表达能力,但也使得每个输入块对模型输出的贡献变得模糊,难以追踪。为此,作者提出了一种新颖的单头Transformer注意力瓶颈(Transformer Attention Bottleneck, TAB)层,将其插入传统的MHSA架构之后,作为注意力瓶颈以提高模型的可解释性和可干预性。TAB的关键在于将所有输入块的总注意力约束在[0, 1]范围内,当总注意力为0时,模型将不再传播视觉信息,转而生成与图像无关的通用响应。通过在图像差异描述任务中的实验,作者证明了TAB在定位变化和识别无变化情况方面优于基线模型,并且首次实现了通过编辑注意力进行用户干预的功能。

链接: https://arxiv.org/abs/2412.18675
作者: Pooyan Rahmanzadehgrevi,Hung Huy Nguyen,Rosanne Liu,Long Mai,Anh Totti Nguyen
机构: 未知
关键词: widely popular architecture, Multi-head self-attention, language and vision, key component, widely popular
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to lie in [0, 1]. That is, when the total attention is 0, no visual information is propagated further into the network and the vision-language model (VLM) would default to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning but the bottleneck is superior in localizing changes and in identifying when no changes occur. TAB is the first architecture to enable users to intervene by editing attention, which often produces expected outputs by VLMs.
zh
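
下面是对“所有 patch 的总注意力被约束在 [0, 1]”这一设计的一种可能实现示意:用 softmax 注意力乘以一个 sigmoid 门控,使注意力权重之和等于门控值。其中的门控分支等细节为本文假设,未必与论文的 TAB 结构一致。

```python
import torch
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """单头注意力瓶颈的示意实现:softmax 权重乘以 sigmoid 门控,
    使所有 patch 的注意力总和落在 (0, 1)。这只是对论文约束的一种
    可能实现方式,并非官方 TAB 结构。"""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)   # 假设的门控分支,控制总注意力强度

    def forward(self, query, patches):
        # query: [B, 1, D](如 CLS token),patches: [B, N, D]
        scores = self.q(query) @ self.k(patches).transpose(1, 2)   # [B, 1, N]
        scores = scores / patches.shape[-1] ** 0.5
        weights = scores.softmax(dim=-1)                # 权重和为 1
        g = torch.sigmoid(self.gate(query))             # [B, 1, 1],取值 (0, 1)
        weights = weights * g                           # 总注意力 = g
        out = weights @ self.v(patches)                 # g 趋近 0 时视觉信息被截断
        return out, weights

# 用法示例
tab = AttentionBottleneck(dim=64)
out, w = tab(torch.randn(2, 1, 64), torch.randn(2, 16, 64))
print(w.sum(dim=-1))   # 每个样本的总注意力,位于 (0, 1)
```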

[CV-150] 1.58-bit FLUX

【速读】: 该论文旨在解决在保持生成质量的前提下,显著降低文本到图像生成模型(FLUX.1-dev)的计算和存储开销的问题。具体而言,论文提出了一种名为1.58-bit FLUX的量化方法,将模型权重量化为仅包含-1、0、+1三个值的1.58位表示。这一方法的关键在于无需访问图像数据,仅通过FLUX.1-dev模型的自监督学习实现量化。此外,论文还开发了针对1.58位操作优化的自定义内核,显著减少了模型存储(7.7倍)、推理内存(5.1倍)并提升了推理速度。通过在GenEval和T2I Compbench基准上的广泛评估,证明了该方法在保持生成质量的同时,显著提升了计算效率。

链接: https://arxiv.org/abs/2412.18653
作者: Chenglin Yang,Celong Liu,Xueqing Deng,Dongwon Kim,Xing Mei,Xiaohui Shen,Liang-Chieh Chen
机构: 未知
关键词: maintaining comparable performance, performance for generating, successful approach, approach to quantizing, comparable performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I Compbench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.
zh
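
下面给出把权重量化到 {-1, 0, +1} 的一个常见做法(absmean 缩放的三值量化)作为示意;具体缩放方式与自监督校准流程以 1.58-bit FLUX 论文为准,此处仅演示数值形式。

```python
import torch

def ternary_quantize(w, eps=1e-8):
    """将权重量化为 {-1, 0, +1} 乘以逐张量缩放因子的示意实现
    (采用 absmean 缩放,这是常见做法,不一定与论文完全一致)。"""
    scale = w.abs().mean().clamp(min=eps)      # 逐张量缩放因子
    w_q = (w / scale).round().clamp(-1, 1)     # 取值限定在 {-1, 0, +1}
    return w_q, scale

def dequantize(w_q, scale):
    return w_q * scale                         # 推理时的等效权重

# 用法示例:量化一个线性层权重并观察平均量化误差
w = torch.randn(256, 256)
w_q, s = ternary_quantize(w)
print(sorted(w_q.unique().tolist()))           # [-1.0, 0.0, 1.0]
print((w - dequantize(w_q, s)).abs().mean())
```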

[CV-151] Dissecting CLIP: Decomposition with a Schur Complement-based Approach

【速读】: 该论文旨在解决现有CLIPScore(基于文本和图像嵌入的余弦相似度)在评估文本到图像生成模型时无法量化生成图像多样性的问题。为了解决这一问题,作者提出了一种基于CLIP嵌入的方法,通过将图像数据的核协方差矩阵分解为基于文本和非基于文本的组件,来量化和解释文本到图像模型的内在多样性。具体而言,作者利用联合图像-文本核协方差矩阵的Schur补(Schur complement)进行分解,并定义了基于分解组件的矩阵熵作为Schur补熵(SCE)评分,用于衡量模型在不同文本提示下生成图像的多样性。此外,该方法还通过Schur补分解消除给定提示对图像CLIP嵌入的影响,从而在下游任务中聚焦或忽略特定对象或属性。该解决方案的关键在于利用Schur补分解技术,实现了对文本到图像模型内在多样性的量化评估,并提供了修改CLIP图像嵌入的工具。

链接: https://arxiv.org/abs/2412.18645
作者: Azim Ospanov,Mohammad Jalali,Farzan Farnia
机构: 未知
关键词: CLIP image embeddings, CLIP embeddings, assess the alignment, alignment of samples, samples produced
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The use of CLIP embeddings to assess the alignment of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures the relevance of a generated image, it does not quantify the diversity of images generated by a text-to-image model. In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, which is responsible for generating diverse images from similar text prompts. To achieve this, we propose a decomposition of the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. Using the Schur complement of the joint image-text kernel covariance matrix, we perform this decomposition and define the matrix-based entropy of the decomposed component as the \textitSchur Complement Entropy (SCE) score, a measure of the intrinsic diversity of a text-to-image model based on data collected with varying text prompts. Additionally, we demonstrate the use of the Schur complement-based decomposition to nullify the influence of a given prompt in the CLIP embedding of an image, enabling focus or defocus of embeddings on specific objects or properties for downstream tasks. We present several numerical results that apply our Schur complement-based approach to evaluate text-to-image models and modify CLIP image embeddings. The codebase is available at this https URL
zh
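
下面按摘要的描述给出 Schur 补熵(SCE)分数的一个简化计算示意:对联合图像-文本核矩阵取 Schur 补,再用归一化特征值计算矩阵熵。核函数选择(此处用线性核)、正则项 ridge 等均为假设,细节以论文为准。

```python
import numpy as np

def schur_complement_entropy(K_xx, K_xt, K_tt, ridge=1e-6):
    """SCE 分数的示意计算:去除文本成分后的图像核矩阵的 von Neumann 熵。"""
    n = K_tt.shape[0]
    # Schur 补:K_xx - K_xt @ K_tt^{-1} @ K_xt^T
    S = K_xx - K_xt @ np.linalg.solve(K_tt + ridge * np.eye(n), K_xt.T)
    # 矩阵熵:特征值按迹归一化后计算熵
    eigvals = np.clip(np.linalg.eigvalsh(S), 0, None)
    p = eigvals / max(eigvals.sum(), 1e-12)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# 用法示例:用随机嵌入构造线性核,计算去除文本成分后的多样性分数
img = np.random.randn(100, 64); txt = np.random.randn(100, 64)
K_xx, K_tt, K_xt = img @ img.T, txt @ txt.T, img @ txt.T
print(schur_complement_entropy(K_xx, K_xt, K_tt))
```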

[CV-152] ZenSVI: An Open-Source Software for the Integrated Acquisition Processing and Analysis of Street View Imagery Towards Scalable Urban Science

【速读】: 该论文旨在解决街景图像(Street View Imagery, SVI)在多个研究领域中应用时缺乏标准化和可重复性的问题。尽管SVI在交通、健康、建筑、人类感知和基础设施等领域被广泛用于分析街道特征和建成环境,但现有的图像处理方法和应用往往孤立实施,导致研究难以复现和扩展。论文提出的解决方案是开发一个名为ZenSVI的免费开源Python包,该包集成了SVI分析的整个流程,包括从多个平台(如Mapillary和KartaView)高效下载SVI、预处理图像、应用计算机视觉模型进行特征提取、将SVI转换为不同投影(如鱼眼和透视)和格式(如深度图和点云)、可视化分析结果,并导出数据到其他软件工具。ZenSVI的关键在于其端到端的模块化设计,显著提高了研究的透明度、可重复性和可扩展性,支持研究人员高效进行城市分析。

链接: https://arxiv.org/abs/2412.18641
作者: Koichi Ito,Yihan Zhu,Mahmoud Abdelrahman,Xiucheng Liang,Zicheng Fan,Yujun Hou,Tianhong Zhao,Rui Ma,Kunihiko Fujiwara,Jiani Ouyang,Matias Quintana,Filip Biljecki
机构: 未知
关键词: Street view imagery, characterize street features, Street view, characterize street, SVI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Street view imagery (SVI) has been instrumental in many studies in the past decade to understand and characterize street features and the built environment. Researchers across a variety of domains, such as transportation, health, architecture, human perception, and infrastructure have employed different methods to analyze SVI. However, these applications and image-processing procedures have not been standardized, and solutions have been implemented in isolation, often making it difficult for others to reproduce existing work and carry out new research. Using SVI for research requires multiple technical steps: accessing APIs for scalable data collection, preprocessing images to standardize formats, implementing computer vision models for feature extraction, and conducting spatial analysis. These technical requirements create barriers for researchers in urban studies, particularly those without extensive programming experience. We develop ZenSVI, a free and open-source Python package that integrates and implements the entire process of SVI analysis, supporting a wide range of use cases. Its end-to-end pipeline includes downloading SVI from multiple platforms (e.g., Mapillary and KartaView) efficiently, analyzing metadata of SVI, applying computer vision models to extract target features, transforming SVI into different projections (e.g., fish-eye and perspective) and different formats (e.g., depth map and point cloud), visualizing analyses with maps and plots, and exporting outputs to other software tools. We demonstrate its use in Singapore through a case study of data quality assessment and clustering analysis in a streamlined manner. Our software improves the transparency, reproducibility, and scalability of research relying on SVI and supports researchers in conducting urban analyses efficiently. Its modular design facilitates extensions and unlocking new use cases.
zh

[CV-153] Edge-AI for Agriculture: Lightweight Vision Models for Disease Detection in Resource-Limited Settings

【速读】: 该论文旨在解决农民在资源有限的环境下检测橙子病害的问题。其核心解决方案是开发一种轻量级且高效的计算机视觉管道(computer vision pipeline),该管道集成了先进的目标检测(object detection)、分类(classification)和分割(segmentation)模型,并针对边缘设备(edge devices)进行了优化,以确保在资源受限的环境中仍能有效运行。研究评估了多种先进模型的性能,重点关注其准确性、计算效率和泛化能力。其中,Vision Transformer在橙子种类分类中达到了96%的准确率,而轻量级的YOLOv8-S模型在目标检测中表现出色且计算开销极小。研究强调了现代深度学习架构在解决关键农业挑战中的潜力,并探讨了模型复杂性与实际应用之间的平衡。未来的工作将扩展数据集、探索模型压缩技术和联邦学习(federated learning),以增强这些系统在不同农业环境中的适用性,从而推动更可持续的农业实践。

链接: https://arxiv.org/abs/2412.18635
作者: Harsh Joshi
机构: 未知
关键词: efficient computer vision, computer vision pipeline, vision pipeline aimed, detecting orange diseases, research paper presents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This research paper presents the development of a lightweight and efficient computer vision pipeline aimed at assisting farmers in detecting orange diseases using minimal resources. The proposed system integrates advanced object detection, classification, and segmentation models, optimized for deployment on edge devices, ensuring functionality in resource-limited environments. The study evaluates the performance of various state-of-the-art models, focusing on their accuracy, computational efficiency, and generalization capabilities. Notable findings include the Vision Transformer achieving 96% accuracy in orange species classification and the lightweight YOLOv8-S model demonstrating exceptional object detection performance with minimal computational overhead. The research highlights the potential of modern deep learning architectures to address critical agricultural challenges, emphasizing the importance of balancing model complexity against practical utility. Future work will explore expanding datasets, model compression techniques, and federated learning to enhance the applicability of these systems in diverse agricultural contexts, ultimately contributing to more sustainable farming practices.
zh

[CV-154] Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads

【速读】: 该论文旨在解决数据驱动的AI模型(如深度学习推理、训练、视觉变换器(Vision Transformers, ViTs)和其他高性能计算(HPC)应用)在运行时对多种非线性激活函数(Activation Functions, AFs)硬件支持的迫切需求。现有解决方案虽然支持多种精度或运行时AF可重构性,但无法同时满足这两点。为此,论文提出了一种灵活的SIMD多精度处理单元(FlexPE),该单元支持多种运行时可配置的AFs,包括sigmoid、tanh、ReLU和softmax,以及MAC操作。该设计在流水线模式下实现了高达16倍(FxP4)、8倍(FxP8)、4倍(FxP16)和1倍(FxP32)的吞吐量提升,并实现了100%的时间复用硬件。此外,论文还提出了一种面向边缘AI应用的高效多精度迭代模式,显著减少了VGG16中输入特征图和权重滤波器的DMA读取次数(分别高达62倍和371倍),并在精度损失不超过2%的情况下实现了8.42 GOPS/W的能效。该架构还支持新兴的4位计算,同时提升了FxP8/16模式下的吞吐量,适用于变换器和其他HPC应用,为未来边缘和云环境中的高效能AI加速器提供了可行方案。

链接: https://arxiv.org/abs/2412.11702
作者: Mukul Lokhande,Gopal Raut,Santosh Kumar Vishvakarma
机构: 未知
关键词: linear activation functions, deep learning inference, Vision Transformers, driven AI models, drives a strong
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
备注: 10 pages, 5 figures, Preprint, Submitted to TVLSI Regular papers

点击查看摘要

Abstract:The rapid adaptation of data-driven AI models, such as deep learning inference, training, Vision Transformers (ViTs), and other HPC applications, drives a strong need for hardware support for runtime-configurable precision and diverse non-linear activation functions (AFs). Existing solutions support diverse precision or runtime AF reconfigurability but fail to address both simultaneously. This work proposes a flexible and SIMD multiprecision processing element (FlexPE), which supports diverse runtime configurable AFs, including sigmoid, tanh, ReLU and softmax, as well as the MAC operation. The proposed design achieves an improved throughput of up to 16X FxP4, 8X FxP8, 4X FxP16 and 1X FxP32 in pipeline mode with 100% time multiplexed hardware. This work proposes an area efficient multiprecision iterative mode in the SIMD systolic arrays for edge AI use cases. The design delivers superior performance with up to 62X and 371X reductions in DMA reads for input feature maps and weight filters in VGG16, with an energy efficiency of 8.42 GOPS/W within an accuracy loss of 2%. The proposed architecture supports emerging 4-bit computations for DL inference while enhancing throughput in FxP8/16 modes for transformers and other HPC applications. The proposed approach enables future energy-efficient AI accelerators in edge and cloud environments.
zh

[CV-155] HOLa: HoloLens Object Labeling

【速读】: 该论文旨在解决医学增强现实(AR)应用中的物体跟踪(object tracking)问题,特别是在需要大量标注掩码(annotation masks)的场景下。现有的分割基础模型(segmentation foundation models)如Segment Anything Model (SAM)虽然能够实现零样本分割(zero-shot segmentation),但仍需一定的人工参与。为此,作者提出了一种基于SAM-Track算法的HoloLens-Object-Labeling (HOLa) Unity和Python应用程序,该应用能够在HoloLens 2上实现全自动的单物体标注,极大减少了人工干预。HOLa的关键优势在于其无需针对特定图像外观进行调整,因此可广泛应用于任何AR研究领域。通过在开放性肝脏手术和医学体模实验中对不同复杂度的图像进行评估,HOLa在标注速度上提升了500倍以上,且其Dice分数(0.875至0.982)与人工标注结果相当。

链接: https://arxiv.org/abs/2412.04945
作者: Michael Schwimmbeck,Serouj Khajarian,Konstantin Holzapfel,Johannes Schmidt,Stefanie Remmele
机构: 未知
关键词: medical Augmented Reality, Augmented Reality, medical Augmented, key challenge, significant amount
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: accepted by BMT 2024

点击查看摘要

Abstract:In the context of medical Augmented Reality (AR) applications, object tracking is a key challenge and requires a significant amount of annotation masks. As segmentation foundation models like the Segment Anything Model (SAM) begin to emerge, zero-shot segmentation requires only minimal human participation to obtain high-quality object masks. We introduce a HoloLens-Object-Labeling (HOLa) Unity and Python application based on the SAM-Track algorithm that offers fully automatic single object annotation for HoloLens 2 while requiring minimal human participation. HOLa does not have to be adjusted to a specific image appearance and could thus ease AR research in any application field. We evaluate HOLa for different degrees of image complexity in open liver surgery and in medical phantom experiments. Using HOLa for image annotation can increase the labeling speed by more than 500 times while providing Dice scores between 0.875 and 0.982, which are comparable to human annotators. Our code is publicly available at: this https URL
zh

[CV-156] HYDRA: Hybrid Data Multiplexing and Run-time Layer Configurable DNN Accelerator

【速读】: 该论文旨在解决深度神经网络(DNNs)在边缘节点(edge nodes)上执行高效计算时面临的硬件资源需求过大的问题。为解决这一问题,论文提出了HYDRA架构,该架构结合了混合数据复用(hybrid data multiplexing)和运行时层可配置的DNN加速器。其关键解决方案包括:采用层复用(layer-multiplexed)方法,在单层执行过程中复用单个激活函数,并通过改进的融合乘加运算(Fused-Multiply-Accumulate, FMA)提升计算效率;同时,HYDRA以迭代模式运行,复用相同的硬件资源,并以可配置的方式执行不同层。该架构在功耗和资源利用率方面实现了超过90%的优化,计算性能达到35.21 TOPS/W,并在带宽、激活函数(AF)和层架构方面显著减少了面积开销。HYDRA架构在资源受限的边缘设备上实现了最优的DNN计算性能。

链接: https://arxiv.org/abs/2409.04976
作者: Sonu Kumar,Komal Gupta,Gopal Raut,Mukul Lokhande,Santosh Kumar Vishvakarma
机构: 未知
关键词: Deep neural networks, Deep neural, executing efficient computation, neural networks, offer plenty
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) pose plenty of challenges in executing efficient computation at edge nodes, primarily due to the huge hardware resource demands. The article proposes HYDRA, hybrid data multiplexing, and runtime layer configurable DNN accelerators to overcome these drawbacks. The work proposes a layer-multiplexed approach, which further reuses a single activation function within the execution of a single layer with improved Fused-Multiply-Accumulate (FMA). The proposed approach works in iterative mode to reuse the same hardware and execute different layers in a configurable fashion. The proposed architectures achieve reductions of over 90% in power consumption and improvements in resource utilization over state-of-the-art works, delivering 35.21 TOPS/W. The proposed architecture reduces the (N-1)x area overhead required in bandwidth, AF, and layer architecture. This work shows that the HYDRA architecture supports optimal DNN computations while improving performance on resource-constrained edge devices.
zh

[CV-157] ProKAN: Progressive Stacking of Kolmogorov-Arnold Networks for Efficient Liver Segmentation

【速读】: 该论文旨在解决3D医学图像分割(特别是肝脏分割)中深度学习模型面临的过拟合(overfitting)和计算成本过高的问题。现有的架构虽然在性能上表现良好,但在时间效率和模型复杂度之间难以取得平衡。为此,论文提出了一种名为proKAN的渐进式堆叠方法,基于Kolmogorov-Arnold Networks (KANs) 的动态调整机制。proKAN的关键在于其能够根据过拟合行为在训练过程中逐步增加KAN模块,从而在检测到过拟合时停止网络增长,避免不必要的计算开销,同时保持高精度。此外,proKAN利用KAN的可学习激活函数(通过B样条建模),增强了处理复杂3D医学数据关系的灵活性。实验结果表明,proKAN在肝脏分割任务中实现了最先进的性能,显著提升了准确性、Dice分数和时间效率,同时提供了更好的可解释性。

链接: https://arxiv.org/abs/2412.19713
作者: Bhavesh Gyanchandani,Aditya Oza,Abhinav Roy
机构: 未知
关键词: spurred considerable research, identification of tumors, spurred considerable, considerable research, research into deep
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The growing need for accurate and efficient 3D identification of tumors, particularly in liver segmentation, has spurred considerable research into deep learning models. While many existing architectures offer strong performance, they often face challenges such as overfitting and excessive computational costs. An adjustable and flexible architecture that strikes a balance between time efficiency and model complexity remains an unmet requirement. In this paper, we introduce proKAN, a progressive stacking methodology for Kolmogorov-Arnold Networks (KANs) designed to address these challenges. Unlike traditional architectures, proKAN dynamically adjusts its complexity by progressively adding KAN blocks during training, based on overfitting behavior. This approach allows the network to stop growing when overfitting is detected, preventing unnecessary computational overhead while maintaining high accuracy. Additionally, proKAN utilizes KAN’s learnable activation functions modeled through B-splines, which provide enhanced flexibility in learning complex relationships in 3D medical data. Our proposed architecture achieves state-of-the-art performance in liver segmentation tasks, outperforming standard Multi-Layer Perceptrons (MLPs) and fixed KAN architectures. The dynamic nature of proKAN ensures efficient training times and high accuracy without the risk of overfitting. Furthermore, proKAN provides better interpretability by allowing insight into the decision-making process through its learnable coefficients. The experimental results demonstrate a significant improvement in accuracy, Dice score, and time efficiency, making proKAN a compelling solution for 3D medical image segmentation tasks.
zh
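
下面用一个占位模块演示 proKAN“检测到过拟合就停止加块”的渐进式堆叠控制逻辑;其中 PlaceholderBlock 用小型 MLP 代替真实的 KAN block(B 样条激活),过拟合判据(验证-训练损失差距阈值 gap_tol)也是本文的假设,仅供理解流程。

```python
import torch
import torch.nn as nn

class PlaceholderBlock(nn.Module):
    """占位模块:用小型 MLP 代替真实的 KAN block,仅演示堆叠控制逻辑。"""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, x):
        return x + self.net(x)

def progressive_stacking(train_step, eval_step, dim=32, max_blocks=8, gap_tol=0.05):
    """当验证损失与训练损失差距超过 gap_tol(过拟合迹象)时停止加块。
    gap_tol、max_blocks 均为假设的超参数。"""
    blocks = nn.ModuleList([PlaceholderBlock(dim)])
    for _ in range(max_blocks - 1):
        model = nn.Sequential(*blocks)
        train_loss = train_step(model)          # 由调用方完成一轮训练并返回训练损失
        val_loss = eval_step(model)             # 返回验证损失
        if val_loss - train_loss > gap_tol:     # 检测到过拟合,停止增长
            break
        blocks.append(PlaceholderBlock(dim))    # 否则继续加深
    return nn.Sequential(*blocks)

# 用法示例:用随机数据模拟训练/验证损失
x = torch.randn(64, 32); y = torch.randn(64, 32)
loss_fn = nn.MSELoss()
model = progressive_stacking(lambda m: loss_fn(m(x), y).item(),
                             lambda m: loss_fn(m(x), y).item() + 0.01)
print(len(list(model.children())), "blocks")
```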

[CV-158] A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation

【速读】: 该论文旨在探讨人工智能(AI)在体外受精(IVF)过程中卵巢刺激阶段的应用,特别是结合医学影像技术来优化决策和治疗方案。通过对13项相关研究的分析,论文指出当前AI算法在预测最佳激素剂量、触发时机和卵母细胞获取结果方面展现出显著潜力,但主要依赖二维(2D)超声影像数据,且仅限于基础量化指标(如卵泡大小和数量),缺乏直接特征提取或高级图像分析技术的应用。解决方案的关键在于整合先进的图像分析技术(如深度学习)和多样化的影像模态(如三维(3D)超声),以挖掘更深层次的洞察。此外,论文强调需要引入可解释AI(XAI)方法,以提高AI驱动决策的透明度和可追溯性,同时建议通过多中心合作和更大规模数据集的研究来提升结果的普适性。这些改进有望推动卵巢刺激管理的优化,实现高效、个性化和数据驱动的IVF治疗路径。

链接: https://arxiv.org/abs/2412.19688
作者: Jana Zakall,Birgit Pohn,Antonia Graf,Daniel Kovatchki,Arezoo Borji,Ragib Shahriar Islam,Hossam Haick,Heinz Strohmer,Sepideh Hatamikia
机构: 未知
关键词: Artificial intelligence, optimize treatment protocols, vitro fertilization, powerful tool, ovarian stimulation
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Artificial intelligence (AI) has emerged as a powerful tool to enhance decision-making and optimize treatment protocols in in vitro fertilization (IVF). In particular, AI shows significant promise in supporting decision-making during the ovarian stimulation phase of the IVF process. This review evaluates studies focused on the applications of AI combined with medical imaging in ovarian stimulation, examining methodologies, outcomes, and current limitations. Our analysis of 13 studies on this topic reveals that while AI algorithms demonstrated notable potential in predicting optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the medical imaging data utilized predominantly came from two-dimensional (2D) ultrasound, which mainly involved basic quantifications, such as follicle size and number, with limited use of direct feature extraction or advanced image analysis techniques. This points to an underexplored opportunity where advanced image analysis approaches, such as deep learning, and more diverse imaging modalities, like three-dimensional (3D) ultrasound, could unlock deeper insights. Additionally, the lack of explainable AI (XAI) in most studies raises concerns about the transparency and traceability of AI-driven decisions - key factors for clinical adoption and trust. Furthermore, many studies relied on single-center designs and small datasets, which limit the generalizability of their findings. This review highlights the need for integrating advanced imaging analysis techniques with explainable AI methodologies, as well as the importance of leveraging multicenter collaborations and larger datasets. Addressing these gaps has the potential to enhance ovarian stimulation management, paving the way for efficient, personalized, and data-driven treatment pathways that improve IVF outcomes.
zh

[CV-159] DLScanner: A parameter space scanner package assisted by deep learning methods

【速读】: 该论文旨在解决深度学习(DL)技术在扫描应用中的两个主要问题:高维扫描中的收敛速度缓慢以及DL网络在将随机点映射到目标空间时的泛化能力有限。针对第一个问题,论文提出了一种相似性学习网络(similarity learning network),该网络将采样点映射到一个表示空间,使得目标点聚集在一起而非目标点被有效分离,从而优化采样点的表示并加速扫描收敛。对于第二个问题,论文采用了动态采样策略,具体通过VEGAS映射(VEGAS mapping)自适应地建议新的采样点,并在收集更多点的同时改进映射效果。该框架在性能和效率上均显著优于其他扫描方法。

链接: https://arxiv.org/abs/2412.19675
作者: A. Hammad,Raymundo Ramos
机构: 未知
关键词: scanner package enhanced, introduce a scanner, enhanced by deep, scanner package, package enhanced
类目: High Energy Physics - Phenomenology (hep-ph); Computer Vision and Pattern Recognition (cs.CV); High Energy Physics - Experiment (hep-ex); High Energy Physics - Theory (hep-th)
备注: 34 pages, 6 figures and 2 tables

点击查看摘要

Abstract:In this paper, we introduce a scanner package enhanced by deep learning (DL) techniques. The proposed package addresses two significant challenges associated with previously developed DL-based methods: slow convergence in high-dimensional scans and the limited generalization of the DL network when mapping random points to the target space. To tackle the first issue, we utilize a similarity learning network that maps sampled points into a representation space. In this space, in-target points are grouped together while out-target points are effectively pushed apart. This approach enhances the scan convergence by refining the representation of sampled points. The second challenge is mitigated by integrating a dynamic sampling strategy. Specifically, we employ a VEGAS mapping to adaptively suggest new points for the DL network while also improving the mapping when more points are collected. Our proposed framework demonstrates substantial gains in both performance and efficiency compared to other scanning methods.
zh

[CV-160] Evaluating Convolutional Neural Networks for COVID-19 classification in chest X-ray images

【速读】: 该论文旨在解决COVID-19疫情中快速、准确筛查感染患者的问题,特别是在缺乏自动化工具的情况下。解决方案的关键在于利用胸部X光图像(chest X-ray images)结合机器学习技术,特别是通过四种卷积神经网络(Convolutional Neural Networks, CNNs)——AlexNet、VGG-11、SqueezeNet和DenseNet-121——来实现自动化的COVID-19检测。论文采用了十折交叉验证(ten-fold cross-validation)方法对训练集和测试集进行验证,并通过浅层微调(shallow fine-tuning)和数据增强(data augmentation)策略来处理公开可用的COVID-19阳性图像数量不足的问题。实验结果表明,所有CNN模型的准确率均超过97.00%,其中SqueezeNet模型表现最佳,准确率达到99.20%。

链接: https://arxiv.org/abs/2412.19362
作者: Leonardo Gabriel Ferreira Rodrigues,Danilo Ferreira da Silva,Larissa Ferreira Rodrigues,João Fernando Mari
机构: 未知
关键词: pandemic rapidly spread, rapidly spread globally, Coronavirus Disease, pandemic rapidly, impacting the lives
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages

点击查看摘要

Abstract:The Coronavirus Disease 2019 (COVID-19) pandemic spread rapidly across the globe, impacting the lives of billions of people. Effective screening of infected patients is a critical step in the fight against COVID-19, enabling timely treatment while avoiding rapid disease spread. The need for automated and scalable methods has increased due to the unavailability of accurate automated toolkits. Recent research using chest X-ray images suggests they contain relevant information about the COVID-19 virus. Hence, applying machine learning techniques combined with radiological imaging promises to identify this disease accurately. These images are also straightforward to collect, as they are widely shared and analyzed around the world. This paper presents a method for automatic COVID-19 detection using chest X-ray images through four convolutional neural networks, namely AlexNet, VGG-11, SqueezeNet, and DenseNet-121, providing accurate diagnostics for positive or negative COVID-19 classification. We validate our experiments using a ten-fold cross-validation procedure over the training and test sets. Our findings show that shallow fine-tuning and data augmentation strategies can help deal with the low number of publicly available positive COVID-19 images. The accuracy for all CNNs is higher than 97.00%, and the SqueezeNet model achieved the best result with 99.20%.
zh
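
下面给出摘要中“十折交叉验证 + 浅层微调”实验流程的一个骨架示意(以 SqueezeNet 为例):冻结特征提取层、替换分类头,并用 StratifiedKFold 划分十折。数据规模、是否加载预训练权重等设置为假设,具体训练循环从略。

```python
import numpy as np
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold
from torchvision import models

def build_squeezenet(num_classes=2):
    """示意:浅层微调 SqueezeNet,冻结特征提取层,只替换并训练分类头。
    权重与类别数等设置为假设,未必与论文实验配置一致。"""
    model = models.squeezenet1_1(weights=None)      # 实验中可改为加载预训练权重
    for p in model.features.parameters():
        p.requires_grad = False                     # 浅层微调:冻结骨干
    model.classifier[1] = nn.Conv2d(512, num_classes, kernel_size=1)
    model.num_classes = num_classes
    return model

# 十折交叉验证划分(labels 为每张胸片的 COVID-19 阳性/阴性标签,这里用随机标签占位)
labels = np.random.randint(0, 2, size=200)          # 假设 200 张图像
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
    model = build_squeezenet()
    # 此处省略:用 train_idx 子集训练分类头,用 test_idx 子集评估准确率
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```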

[CV-161] Modality-Projection Universal Model for Comprehensive Full-Body Medical Imaging Segmentation

【速读】: 该论文旨在解决深度学习在医学影像中应用时,由于不同成像模态(modalities)数据特性差异导致的通用模型难以跨模态优化的问题。为此,研究提出了一种模态投影通用模型(Modality Projection Universal Model, MPUM),其关键创新在于采用了一种新颖的模态投影策略,使模型能够动态调整参数以优化在不同成像模态中的性能。MPUM通过控制器卷积层(controller-based convolution layer)实现了网络各层显著性图(saliency maps)的可视化,显著提升了模型的可解释性,同时在解剖结构识别和脑-体轴代谢关联研究中表现出优越的准确性和临床应用潜力。

链接: https://arxiv.org/abs/2412.19026
作者: Yixin Chen,Lin Gao,Yajuan Gao,Rui Wang,Jingge Lian,Xiangxi Meng,Yanhua Duan,Leiying Chai,Hongbin Han,Zhaoping Cheng,Zhaoheng Xie
机构: 未知
关键词: shown great promise, Modality Projection Universal, integration of deep, deep learning, learning in medical
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The integration of deep learning in medical imaging has shown great promise for enhancing diagnostic, therapeutic, and research outcomes. However, applying universal models across multiple modalities remains challenging due to the inherent variability in data characteristics. This study aims to introduce and evaluate a Modality Projection Universal Model (MPUM). MPUM employs a novel modality-projection strategy, which allows the model to dynamically adjust its parameters to optimize performance across different imaging modalities. The MPUM demonstrated superior accuracy in identifying anatomical structures, enabling precise quantification for improved clinical decision-making. It also identifies metabolic associations within the brain-body axis, advancing research on brain-body physiological correlations. Furthermore, MPUM’s unique controller-based convolution layer enables visualization of saliency maps across all network layers, significantly enhancing the model’s interpretability.
zh

[CV-162] Brain Ageing Prediction using Isolation Forest Technique and Residual Neural Network (ResNet)

【速读】: 该论文旨在解决通过神经影像数据准确估计大脑年龄(brain age)的问题,以便早期检测神经退行性疾病的初始迹象。大脑衰老是一个复杂且动态的过程,会导致大脑功能和结构的变化,进而增加神经退行性疾病和认知衰退的风险。论文提出了一种基于深度学习的创新方法,利用残差神经网络101版本2(ResNet101V2)模型从MRI扫描中预测大脑年龄。解决方案的关键在于使用大规模数据集(来自国际脑图谱联盟ICBM的2102张图像)进行模型训练、验证和测试,并应用数据预处理技术,包括图像归一化和通过孤立森林(Isolation Forest)方法进行异常值检测。通过评估多种预训练模型(如MobileNetV2、ResNet50V2、ResNet101V2、Xception),研究发现ResNet101V2模型在性能上优于其他模型,在使用孤立森林处理前后分别达到了0.9136和0.8242年的平均绝对误差(MAE),从而实现了高精度的大脑年龄估计。

链接: https://arxiv.org/abs/2412.19017
作者: Saadat Behzadi,Danial Sharifrazi,Roohallah Alizadehsani,Mojtaba Lotfaliany,Mohammadreza Mohebbi
机构: 未知
关键词: leading to functional, complex and dynamic, functional and structural, Residual Neural Network, Brain
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Brain aging is a complex and dynamic process, leading to functional and structural changes in the brain. These changes could lead to an increased risk of neurodegenerative diseases and cognitive decline. Accurate brain-age estimation utilizing neuroimaging data has become necessary for detecting initial signs of neurodegeneration. Here, we propose a novel deep learning approach using the Residual Neural Network 101 Version 2 (ResNet101V2) model to predict brain age from MRI scans. To train, validate and test our proposed model, we used a large dataset of 2102 images which were selected randomly from the International Consortium for Brain Mapping (ICBM). Next, we applied data preprocessing techniques, including normalizing the images and using outlier detection via the Isolation Forest method. Then, we evaluated various pre-trained approaches (namely: MobileNetV2, ResNet50V2, ResNet101V2, Xception). The results demonstrated that the ResNet101V2 model has higher performance compared with the other models, attaining MAEs of 0.9136 and 0.8242 years before and after applying the Isolation Forest process, respectively. Our method achieved high accuracy in brain age estimation on the ICBM dataset and provides reliable brain age predictions.
zh
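
下面是“Isolation Forest 剔除异常样本 + ResNet101V2 回归脑龄”这一流程的最小化骨架示意;其中用随机统计特征代替真实 MRI 特征,contamination 比例与网络输入尺寸均为本文假设,仅展示两个步骤如何衔接。

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from tensorflow import keras

# 1) 用 Isolation Forest 剔除异常样本(此处以随机统计特征代替真实 MRI 特征,
#    特征选择与 contamination 比例均为假设)
features = np.random.randn(500, 4)                  # 假设 500 个样本的统计特征
mask = IsolationForest(contamination=0.05, random_state=0).fit_predict(features) == 1
print("保留样本数:", mask.sum())

# 2) 用 ResNet101V2 做脑龄回归(输出单个年龄值,以 MAE(年)为优化目标)
backbone = keras.applications.ResNet101V2(
    include_top=False, weights=None, input_shape=(224, 224, 3), pooling="avg")
model = keras.Sequential([backbone, keras.layers.Dense(1)])   # 回归头
model.compile(optimizer="adam", loss="mae")
model.summary()
```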

[CV-163] WaveDiffUR: A diffusion SDE-based solver for ultra magnification super-resolution in remote sensing images

【速读】: 该论文旨在解决高倍率超分辨率(Super-Resolution, SR)问题,特别是在极端放大倍数下,现有方法由于问题的病态性(ill-posedness)而表现受限。为此,论文将高倍率超分辨率重新定义为超分辨率(Ultra-Resolution, UR)问题,并提出了一种基于条件扩散随机微分方程(Conditional Diffusion Stochastic Differential Equation, SDE)的解决方案。其核心创新在于提出了WaveDiffUR,一种新颖的小波域扩散超分辨率求解器。WaveDiffUR通过将超分辨率过程分解为处理条件小波分量的序列子过程,迭代重建低频小波细节(确保全局一致性)和高频分量(增强局部保真度)。此外,论文引入了跨尺度金字塔(Cross-Scale Pyramid, CSP)约束,这一动态自适应框架指导WaveDiffUR生成精细的小波细节,确保即使在极端放大倍数下也能输出一致且高保真的结果。

链接: https://arxiv.org/abs/2412.18996
作者: Yue Shi,Liangxiu Han,Darren Dancy,Lianghao Han
机构: 未知
关键词: Deep neural networks, remote sensing superresolu-tion, recently achieved significant, achieved significant advancements, Deep neural
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep neural networks have recently achieved significant advancements in remote sensing super-resolution (SR). However, most existing methods are limited to low magnification rates (e.g., 2 or 4) due to the escalating ill-posedness at higher magnification scales. To tackle this challenge, we redefine high-magnification SR as the ultra-resolution (UR) problem, reframing it as solving a conditional diffusion stochastic differential equation (SDE). In this context, we propose WaveDiffUR, a novel wavelet-domain diffusion UR solver that decomposes the UR process into sequential sub-processes addressing conditional wavelet components. WaveDiffUR iteratively reconstructs low-frequency wavelet details (ensuring global consistency) and high-frequency components (enhancing local fidelity) by incorporating pre-trained SR models as plug-and-play modules. This modularity mitigates the ill-posedness of the SDE and ensures scalability across diverse applications. To address limitations in fixed boundary conditions at extreme magnifications, we introduce the cross-scale pyramid (CSP) constraint, a dynamic and adaptive framework that guides WaveDiffUR in generating fine-grained wavelet details, ensuring consistent and high-fidelity outputs even at extreme magnification rates.
zh

[CV-164] Comprehensive Study on Lumbar Disc Segmentation Techniques Using MRI Data

【速读】: 该论文旨在解决腰椎间盘(lumbar disk)分割问题,这对于通过医学影像精确检测椎间盘边界以诊断和治疗脊柱疾病至关重要。论文评估了多种先进的深度学习架构,包括ResUnext、Ef3 Net、UNet和TransUNet,在腰椎间盘分割中的效果,并重点关注了像素准确率(Pixel Accuracy)、平均交并比(Mean Intersection over Union, Mean IoU)和Dice系数(Dice Coefficient)等关键指标。研究结果表明,ResUnext在分割准确率上表现最佳,像素准确率为0.9492,Dice系数为0.8425,而TransUNet紧随其后。此外,滤波技术在一定程度上提升了大多数模型的性能,尤其是Dense UNet,增强了稳定性和分割质量。论文的核心解决方案在于通过对比不同深度学习架构的性能,筛选出最优模型,并探讨了滤波技术对模型性能的改进作用。

链接: https://arxiv.org/abs/2412.18894
作者: Serkan Salturk,Irem Sayin,Ibrahim Cem Balci,Taha Emre Pamukcu,Zafer Soydan,Huseyin Uvet
机构: 未知
关键词: curing spinal disorders, enabling precise detection, Lumbar disk segmentation, medical imaging, Lumbar disk
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:Lumbar disk segmentation is essential for diagnosing and curing spinal disorders by enabling precise detection of disk boundaries in medical imaging. The advent of deep learning has resulted in the development of many segmentation methods, offering differing levels of accuracy and effectiveness. This study assesses the effectiveness of several sophisticated deep learning architectures, including ResUnext, Ef3 Net, UNet, and TransUNet, for lumbar disk segmentation, highlighting key metrics such as Pixel Accuracy, Mean Intersection over Union (Mean IoU), and Dice Coefficient. The findings indicate that ResUnext achieved the highest segmentation accuracy, with a Pixel Accuracy of 0.9492 and a Dice Coefficient of 0.8425, with TransUNet following closely behind. Filtering techniques somewhat enhanced the performance of most models, particularly Dense UNet, improving stability and segmentation quality. The findings underscore the efficacy of these models in lumbar disk segmentation and highlight potential areas for improvement.
zh
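
摘要中的三个评价指标(像素准确率、IoU、Dice 系数)在二值分割情形下的通用定义可以写成下面的小函数,便于复现对比;多类别情形需按类别平均(即 Mean IoU),此处从简。

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-8):
    """二值分割的像素准确率、IoU 与 Dice 系数。pred、gt 为 0/1 数组。"""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()          # 交集
    union = np.logical_or(pred, gt).sum()        # 并集
    pixel_acc = (pred == gt).mean()
    iou = tp / (union + eps)
    dice = 2 * tp / (pred.sum() + gt.sum() + eps)
    return pixel_acc, iou, dice

# 用法示例
pred = np.random.rand(128, 128) > 0.5
gt = np.random.rand(128, 128) > 0.5
print(segmentation_metrics(pred, gt))
```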

[CV-165] MRI Reconstruction with Regularized 3D Diffusion Model (R3DM) WACV2025

【速读】: 该论文旨在解决快速三维磁共振成像(3D-MRI)重建算法在处理欠采样(under-sampled)k空间数据时面临的挑战,特别是在保持高质量成像的同时处理有限输入数据的需求。传统方法主要依赖于二维(2D)重建,而本文提出了一种基于正则化三维扩散模型(regularized 3D diffusion model)与优化方法相结合的三维重建方法。通过引入扩散先验(diffusion based priors),该方法显著提升了图像质量,降低了噪声,并增强了三维MRI重建的整体保真度。实验结果表明,该方法在临床和植物科学MRI数据集上均优于现有竞争算法,尤其是在处理不同欠采样模式和预训练数据时表现出色。

链接: https://arxiv.org/abs/2412.18723
作者: Arya Bangun,Zhuo Cao,Alessio Quercia,Hanno Scharr,Elisabeth Pfaehler
机构: 未知
关键词: Magnetic Resonance Imaging, Magnetic Resonance, powerful imaging technique, imaging technique widely, Resonance Imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to WACV 2025,17 pages, 8 figures

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) is a powerful imaging technique widely used for visualizing structures within the human body and in other fields such as plant sciences. However, there is a demand to develop fast 3D-MRI reconstruction algorithms to show the fine structure of objects from under-sampled acquisition data, i.e., k-space data. This emphasizes the need for efficient solutions that can handle limited input while maintaining high-quality imaging. In contrast to previous methods that only use 2D, we propose a 3D MRI reconstruction method that leverages a regularized 3D diffusion model combined with an optimization method. By incorporating diffusion-based priors, our method improves image quality, reduces noise, and enhances the overall fidelity of 3D MRI reconstructions. We conduct a comprehensive experimental analysis on clinical and plant science MRI datasets. To evaluate the algorithm's effectiveness for under-sampled k-space data, we also demonstrate its reconstruction performance with several undersampling patterns, as well as with in- and out-of-distribution pre-trained data. In experiments, we show that our method improves upon the tested competitors.
zh

[CV-166] Pruning Unrolled Networks (PUN) at Initialization for MRI Reconstruction Improves Generalization

【速读】: 该论文旨在解决深度学习模型在图像重建任务中,当测试时应用于不同的实验设置或存在分布偏移(distribution shifts)时,性能下降的问题。论文提出的关键解决方案是在训练时对深度图像重建网络进行剪枝(pruning),特别是针对加速磁共振成像(accelerated magnetic resonance imaging)中的展开式重建架构(unrolled reconstruction architectures)。作者引入了一种在初始化时对展开式网络进行剪枝的方法(PUN),实验结果表明,与传统密集网络相比,PUN在多种实验设置下表现出更好的泛化能力,甚至在分布内数据上也能带来轻微的性能提升。

链接: https://arxiv.org/abs/2412.18668
作者: Shijun Liang,Evan Bell,Avrajit Ghosh,Saiprasad Ravishankar
机构: 未知
关键词: highly effective, image reconstruction tasks, Deep learning methods, distribution shifts, Deep learning
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 4 figures, Asilomar Conference on Signals, Systems, and Computers 2024

点击查看摘要

Abstract:Deep learning methods are highly effective for many image reconstruction tasks. However, the performance of supervised learned models can degrade when applied to distinct experimental settings at test time or in the presence of distribution shifts. In this study, we demonstrate that pruning deep image reconstruction networks at training time can improve their robustness to distribution shifts. In particular, we consider unrolled reconstruction architectures for accelerated magnetic resonance imaging and introduce a method for pruning unrolled networks (PUN) at initialization. Our experiments demonstrate that when compared to traditional dense networks, PUN offers improved generalization across a variety of experimental settings and even slight performance gains on in-distribution data.
zh
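
下面给出“在初始化阶段对展开式网络剪枝”的一个通用示意:用 PyTorch 自带的全局 L1 幅值剪枝在训练开始前移除 50% 权重。论文 PUN 的具体剪枝准则摘要中并未给出,此处的 L1 准则、稀疏率与网络结构均为假设。

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

class UnrolledNet(nn.Module):
    """简化的“展开式”重建网络占位:K 次迭代,每次一个卷积块。"""
    def __init__(self, iters=4, ch=16):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=1) for _ in range(iters)])
    def forward(self, x):
        for blk in self.blocks:
            x = x + blk(x).relu()
        return x

model = UnrolledNet()
# 初始化阶段的全局非结构化剪枝(L1 幅值准则仅作通用示意,非论文 PUN 准则)
params = [(blk, "weight") for blk in model.blocks]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.5)
sparsity = sum((blk.weight == 0).float().mean().item()
               for blk in model.blocks) / len(model.blocks)
print(f"平均稀疏度: {sparsity:.2f}")
# 之后再按常规流程训练被剪枝后的网络
```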

人工智能

[AI-0] Can AI Help with Your Personal Finances?

链接: https://arxiv.org/abs/2412.19784
作者: Oudom Hean,Utsha Saha,Binita Saha
关键词: drawing significant attention, Large Language Models, Large Language, recent years, artificial intelligence
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have emerged as a transformative development in artificial intelligence (AI), drawing significant attention from industry and academia. Trained on vast datasets, these sophisticated AI systems exhibit impressive natural language processing and content generation capabilities. This paper explores the potential of LLMs to address key challenges in personal finance, focusing on the United States. We evaluate several leading LLMs, including OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and Meta’s Llama, to assess their effectiveness in providing accurate financial advice on topics such as mortgages, taxes, loans, and investments. Our findings show that while these models achieve an average accuracy rate of approximately 70%, they also display notable limitations in certain areas. Specifically, LLMs struggle to provide accurate responses for complex financial queries, with performance varying significantly across different topics. Despite these limitations, the analysis reveals notable improvements in newer versions of these models, highlighting their growing utility for individuals and financial advisors. As these AI systems continue to evolve, their potential for advancing AI-driven applications in personal finance becomes increasingly promising.

[AI-1] Enhancing Cognitive Diagnosis by Modeling Learner Cognitive Structure State

链接: https://arxiv.org/abs/2412.19759
作者: Zhifu Chen,Hengnian Gu,Jin Peng Zhou,Dongdai Zhou
关键词: fundamental research area, cognitive structure, cognitive structure state, Cognitive, learner cognitive structure
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cognitive diagnosis represents a fundamental research area within intelligent education, with the objective of measuring the cognitive status of individuals. Theoretically, an individual’s cognitive state is essentially equivalent to their cognitive structure state. Cognitive structure state comprises two key components: knowledge state (KS) and knowledge structure state (KUS). The knowledge state reflects the learner’s mastery of individual concepts, a widely studied focus within cognitive diagnosis. In contrast, the knowledge structure state, which represents the learner’s understanding of the relationships between concepts, remains inadequately modeled. A learner’s cognitive structure is essential for promoting meaningful learning and shaping academic performance. Although various methods have been proposed, most focus on assessing KS and fail to assess KUS. To bridge this gap, we propose an innovative and effective framework, CSCD (Cognitive Structure State-based Cognitive Diagnosis), which introduces a novel approach to modeling learners’ cognitive structures in diagnostic assessments, thereby offering new insights into cognitive structure modeling. Specifically, we employ an edge-feature-based graph attention network to represent the learner’s cognitive structure state, effectively integrating KS and KUS. Extensive experiments conducted on real datasets demonstrate the superior performance of this framework in terms of diagnostic accuracy and interpretability.

[AI-2] “Did my figure do justice to the answer?” : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

链接: https://arxiv.org/abs/2412.19755
作者: Pritam Sil,Bhaskaran Raman,Pushpak Bhattacharyya
关键词: Personalized feedback plays, Short Answer Grading, Automatic Short Answer, student learning process, Multimodal Short Answer
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Personalized feedback plays a vital role in a student’s learning process. While existing systems are adept at providing feedback over MCQ-based evaluation, this work focuses more on subjective and open-ended questions, which is similar to the problem of Automatic Short Answer Grading (ASAG) with feedback. Additionally, we introduce the Multimodal Short Answer grading with Feedback (MMSAF) problem over the traditional ASAG feedback problem to address the scenario where the student answer and reference answer might contain images. Moreover, we introduce the MMSAF dataset with 2197 data points along with an automated framework for generating such datasets. Our evaluations of existing LLMs on this dataset achieved an overall accuracy of 55% on Level of Correctness labels, 75% on Image Relevance labels, and a score of 4.27 out of 5 for the correctness of LLM-generated feedback as rated by experts. As per experts, Pixtral achieved a rating above 4 across all metrics, indicating that it is better aligned with human judgement and is the best solution for assisting students.

[AI-3] IMAGINE: An 8-to-1b 22nm FD-SOI Compute-In-Memory CNN Accelerator With an End-to-End Analog Charge-Based 0.15-8POPS/W Macro Featuring Distribution-Aware Data Reshaping

链接: https://arxiv.org/abs/2412.19750
作者: Adrian Kneip,Martin Lefebvre,Pol Maistriaux,David Bol
关键词: convolutional neural networks, SRAMs have recently, convolutional neural, neural networks, enticing compromise
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 14 pages, 23 figures, 1 table

点击查看摘要

Abstract:Charge-domain compute-in-memory (CIM) SRAMs have recently become an enticing compromise between computing efficiency and accuracy to process sub-8b convolutional neural networks (CNNs) at the edge. Yet, they commonly make use of a fixed dot-product (DP) voltage swing, which leads to a loss in effective ADC bits due to data-dependent clipping or truncation effects that waste precious conversion energy and computing accuracy. To overcome this, we present IMAGINE, a workload-adaptive 1-to-8b CIM-CNN accelerator in 22nm FD-SOI. It introduces a 1152x256 end-to-end charge-based macro with a multi-bit DP based on an input-serial, weight-parallel accumulation that avoids power-hungry DACs. An adaptive swing is achieved by combining a channel-wise DP array split with a linear in-ADC implementation of analog batch-normalization (ABN), obtaining a distribution-aware data reshaping. Critical design constraints are relaxed by including the post-silicon equivalent noise within a CIM-aware CNN training framework. Measurement results showcase an 8b system-level energy efficiency of 40TOPS/W at 0.3/0.6V, with competitive accuracies on MNIST and CIFAR-10. Moreover, the peak energy and area efficiencies of the 187kB/mm2 macro respectively reach up to 0.15-8POPS/W and 2.6-154TOPS/mm2, scaling with the 8-to-1b computing precision. These results exceed previous charge-based designs by 3-to-5x while being the first work to provide linear in-memory rescaling.

[AI-4] Enhancing Adversarial Robustness of Deep Neural Networks Through Supervised Contrastive Learning

链接: https://arxiv.org/abs/2412.19747
作者: Longwei Wang,Navid Nayyem,Abdullah Rakin
关键词: convolutional neural networks, introducing imperceptible perturbations, lead to misclassifications, exposing weaknesses, exploit the vulnerabilities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 11 figures

点击查看摘要

Abstract:Adversarial attacks exploit the vulnerabilities of convolutional neural networks by introducing imperceptible perturbations that lead to misclassifications, exposing weaknesses in feature representations and decision boundaries. This paper presents a novel framework combining supervised contrastive learning and margin-based contrastive loss to enhance adversarial robustness. Supervised contrastive learning improves the structure of the feature space by clustering embeddings of samples within the same class and separating those from different classes. Margin-based contrastive loss, inspired by support vector machines, enforces explicit constraints to create robust decision boundaries with well-defined margins. Experiments on the CIFAR-100 dataset with a ResNet-18 backbone demonstrate improvements in robustness, measured as adversarial accuracy under Fast Gradient Sign Method attacks.
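
下面给出监督对比损失(SupCon 风格)的一个简化实现,用于说明“同类样本聚拢、异类样本分离”的目标;它与论文中基于间隔(margin)的对比损失如何组合以原文为准,温度系数 0.1 为假设值。

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """监督对比损失的简化实现。features: [B, D];labels: [B]。"""
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature                                  # [B, B] 余弦相似度
    B = z.shape[0]
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))              # 分母中排除自身
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    masked_log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))
    pos_cnt = pos_mask.sum(dim=1)
    valid = pos_cnt > 0                                          # 跳过没有正样本的锚点
    loss = -masked_log_prob.sum(dim=1)[valid] / pos_cnt[valid]
    return loss.mean()

# 用法示例
feats = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 4, (16,))
print(supervised_contrastive_loss(feats, labels))
```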

[AI-5] Adaptive Context-Aware Multi-Path Transmission Control for VR/AR Content: A Deep Reinforcement Learning Approach

链接: https://arxiv.org/abs/2412.19737
作者: Shakil Ahmed,Saifur Rahman Sabuj,Ashfaq Khokhar
关键词: Transmission Control Protocol, Multi-Path Transmission Control, Control Protocol, Transmission Control, Context-Aware Multi-Path Transmission
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces the Adaptive Context-Aware Multi-Path Transmission Control Protocol (ACMPTCP), an efficient approach designed to optimize the performance of Multi-Path Transmission Control Protocol (MPTCP) for data-intensive applications such as augmented and virtual reality (AR/VR) streaming. ACMPTCP addresses the limitations of conventional MPTCP by leveraging deep reinforcement learning (DRL) for agile end-to-end path management and optimal bandwidth allocation, facilitating path realignment across diverse network environments.

[AI-6] Can Large Language Models Adapt to Other Agents In-Context?

链接: https://arxiv.org/abs/2412.19726
作者: Matthew Riemer,Zahra Ashktorab,Djallel Bouneffouf,Payel Das,Miao Liu,Justin D. Weisz,Murray Campbell
关键词: large language models, theory of mind, research community aims, language models, mind capabilities
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As the research community aims to build better AI assistants that are more dynamic and personalized to the diversity of humans that they interact with, there is increased interest in evaluating the theory of mind capabilities of large language models (LLMs). Indeed, several recent studies suggest that LLM theory of mind capabilities are quite impressive, approximating human-level performance. Our paper aims to rebuke this narrative and argues instead that past studies were not directly measuring agent performance, potentially leading to findings that are illusory in nature as a result. We draw a strong distinction between what we call literal theory of mind i.e. measuring the agent’s ability to predict the behavior of others and functional theory of mind i.e. adapting to agents in-context based on a rational response to predictions of their behavior. We find that top performing open source LLMs may display strong capabilities in literal theory of mind, depending on how they are prompted, but seem to struggle with functional theory of mind – even when partner policies are exceedingly simple. Our work serves to highlight the double sided nature of inductive bias in LLMs when adapting to new situations. While this bias can lead to strong performance over limited horizons, it often hinders convergence to optimal long-term behavior.

[AI-7] Text2Insight: Transform natural language text into insights seamlessly using multi-model architecture

链接: https://arxiv.org/abs/2412.19718
作者: Pradeep Sain
关键词: domains like healthcare, growing demand, evident across domains, user-centric data analysis, model
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing demand for dynamic, user-centric data analysis and visualization is evident across domains like healthcare, finance, and research. Traditional visualization tools often fail to meet individual user needs due to their static and predefined nature. To address this gap, Text2Insight is introduced as an innovative solution that delivers customized data analysis and visualizations based on user-defined natural language requirements. Leveraging a multi-model architecture, Text2Insight transforms user inputs into actionable insights and dynamic visualizations. The methodology begins with analyzing the input dataset to extract structural details such as columns and values. A pre-trained Llama3 model converts the user’s natural language query into an SQL query, which is further refined using a Named Entity Recognition (NER) model for accuracy. A chart predictor determines the most suitable visualization type, while the Llama3 model generates insights based on the SQL query’s results. The output is a user-friendly and visually informative chart. To enhance analysis capabilities, the system integrates a question-answering model and a predictive model using the BERT framework. These models provide insights into historical data and predict future trends. Performance evaluation of Text2Insight demonstrates its effectiveness, achieving high accuracy (99%), precision (100%), recall (99%), and F1-score (99%), with a BLEU score of 0.5. The question-answering model attained an accuracy of 89% and the predictive model achieved 70% accuracy. These results validate Text2Insight as a robust and viable solution for transforming natural language text into dynamic, user-specific data analysis and visualizations.
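
To make the multi-model pipeline above concrete, the sketch below mirrors its data flow (natural-language query → SQL → chart prediction → insight) with trivial stand-ins for the Llama3, NER, and chart-predictor components; all helper functions here are hypothetical placeholders, not the paper's actual models.

```python
# Illustrative pipeline sketch only: the paper uses Llama3 for NL->SQL and for
# insight generation, an NER model for query refinement, and a chart predictor.
# Here those components are replaced by trivial stand-ins to show the data flow.
import sqlite3

def nl_to_sql(question, table):
    # Stand-in for the Llama3 + NER refinement step described in the abstract.
    if "average" in question.lower():
        return f"SELECT region, AVG(revenue) FROM {table} GROUP BY region"
    return f"SELECT * FROM {table}"

def predict_chart(rows):
    # Stand-in chart predictor: two columns of (category, number) -> bar chart.
    return "bar" if rows and len(rows[0]) == 2 else "table"

def generate_insight(rows):
    # Stand-in for the LLM-generated narrative insight.
    best = max(rows, key=lambda r: r[1])
    return f"{best[0]} has the highest average revenue ({best[1]:.1f})."

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("north", 14.0), ("south", 9.0)])

sql = nl_to_sql("What is the average revenue per region?", "sales")
rows = conn.execute(sql).fetchall()
print(sql)
print(predict_chart(rows), generate_insight(rows))
```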

[AI-8] An Integrated Optimization and Deep Learning Pipeline for Predicting Live Birth Success in IVF Using Feature Optimization and Transformer-Based Models

链接: https://arxiv.org/abs/2412.19696
作者: Arezoo Borji,Hossam Haick,Birgit Pohn,Antonia Graf,Jana Zakall,S M Ragib Shahriar Islam,Gernot Kronreif,Daniel Kovatchki,Heinz Strohmer,Sepideh Hatamikia
关键词: assisted reproductive technology, widely utilized assisted, utilized assisted reproductive, remains challenging due, success remains challenging
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In vitro fertilization (IVF) is a widely utilized assisted reproductive technology, yet predicting its success remains challenging due to the multifaceted interplay of clinical, demographic, and procedural factors. This study develops a robust artificial intelligence (AI) pipeline aimed at predicting live birth outcomes in IVF treatments. The pipeline uses anonymized data from 2010 to 2018, obtained from the Human Fertilization and Embryology Authority (HFEA). We evaluated the prediction performance of live birth success as a binary outcome (success/failure) by integrating different feature selection methods, such as principal component analysis (PCA) and particle swarm optimization (PSO), with different traditional machine learning-based classifiers including random forest (RF) and decision tree, as well as deep learning-based classifiers including custom transformer-based model and a tab transformer model with an attention mechanism. Our research demonstrated that the best performance was achieved by combining PSO for feature selection with the TabTransformer-based deep learning model, yielding an accuracy of 99.50% and an AUC of 99.96%, highlighting its significant performance to predict live births. This study establishes a highly accurate AI pipeline for predicting live birth outcomes in IVF, demonstrating its potential to enhance personalized fertility treatments.

[AI-9] Boosting Private Domain Understanding of Efficient MLLM s: A Tuning-free Adaptive Universal Prompt Optimization Framework

链接: https://arxiv.org/abs/2412.19684
作者: Jiang Liu,Bolin Li,Haoyuan Li,Tianwei Lin,Wenqiao Zhang,Tao Zhong,Zhelun Yu,Jinghao Wei,Hao Cheng,Hao Jiang,Zheqi Lv,Juncheng Li,Siliang Tang,Yueting Zhuang
关键词: large language models, multimodal large language, Efficient multimodal large, large language, language models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficient multimodal large language models (EMLLMs), in contrast to multimodal large language models (MLLMs), reduce model size and computational costs and are often deployed on resource-constrained devices. However, due to data privacy concerns, existing open-source EMLLMs rarely have access to private domain-specific data during the pre-training process, making them difficult to directly apply in device-specific domains, such as certain business scenarios. To address this weakness, this paper focuses on the efficient adaptation of EMLLMs to private domains, specifically in two areas: 1) how to reduce data requirements, and 2) how to avoid parameter fine-tuning. Specifically, we propose a tunIng-free, aDaptivE, universAL Prompt Optimization Framework, which consists of two stages: 1) Predefined Prompt, based on the reinforcement searching strategy, generates a prompt optimization strategy tree to acquire optimization priors; 2) Prompt Reflection initializes the prompt based on the optimization priors, followed by self-reflection to further search and refine the prompt. By doing so, the framework elegantly generates the "ideal prompts" for processing private domain-specific data. Note that our method requires no parameter fine-tuning and only a small amount of data to quickly adapt to the data distribution of the private data. Extensive experiments across multiple tasks demonstrate that the proposed framework significantly improves both efficiency and performance compared to baselines.

[AI-10] Xmodel-2 Technical Report

链接: https://arxiv.org/abs/2412.19638
作者: Wang Qun,Liu Yang,Lin Qingquan,Qu Zhijiu,Jiang Ling
关键词: large language model, language model designed, model designed specifically, large language, designed specifically
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Xmodel-2 is a 1.2-billion-parameter large language model designed specifically for reasoning tasks. Its architecture enables different model scales to share a unified set of hyperparameters, allowing for extensive experimentation on smaller models and seamless transfer of optimal configurations to larger models. To maximize training efficiency and stability, Xmodel-2 employs the WSD learning rate scheduler from MiniCPM. Pretrained on 1.5 trillion tokens from diverse sources, Xmodel-2 achieves state-of-the-art performance in complex reasoning and agent-based tasks, while maintaining low training costs. These results highlight the potential of efficient model design and training strategies in advancing reasoning capabilities. Model checkpoints and code are publicly available on GitHub at this https URL

[AI-11] Gradient Weight-normalized Low-rank Projection for Efficient LLM Training AAAI AAAI-25

链接: https://arxiv.org/abs/2412.19616
作者: Jia-Hong Huang,Yixian Shen,Hongyi Zhu,Stevan Rudinac,Evangelos Kanoulas
关键词: pose significant challenges, computational resources pose, resources pose significant, Large Language Models, shown remarkable performance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25) [Main Technical Track]

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable performance across various tasks, but the escalating demands on computational resources pose significant challenges, particularly in the extensive utilization of full fine-tuning for downstream tasks. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform compared to full fine-tuning and struggle with memory efficiency. In this work, we introduce Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach that enhances both parameter and memory efficiency while maintaining comparable performance to full fine-tuning. GradNormLoRP normalizes the weight matrix to improve gradient conditioning, facilitating better convergence during optimization. Additionally, it applies low-rank approximations to the weight and gradient matrices, significantly reducing memory usage during training. Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer memory usage by up to 89.5% and enables the pre-training of large LLMs, such as LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods in fine-tuning tasks. For instance, when fine-tuning the RoBERTa model on all GLUE tasks with a rank of 8, GradNormLoRP achieves an average score of 80.65, surpassing LoRA’s score of 79.23. These results underscore GradNormLoRP as a promising alternative for efficient LLM pre-training and fine-tuning. Source code and Appendix: this https URL
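
A rough sketch of the two ingredients named in the abstract, weight normalization plus a rank-r projection of the gradient, follows. It is a generic NumPy illustration under assumed shapes, rank, and learning rate, not the paper's GradNormLoRP algorithm.

```python
# Generic sketch of weight normalization for gradient conditioning and a rank-r
# gradient projection that shrinks the optimizer state (assumed setup, not the
# paper's exact method).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))          # a weight matrix
rank = 8

def weight_normalize(W):
    # Re-parameterize columns to unit norm (a separate scale would be learned in practice).
    return W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-8)

def project_gradient(G, rank):
    # Keep only the top-`rank` left singular directions of the gradient, so the
    # optimizer state lives in a much smaller subspace.
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :rank]                       # (out_dim, rank) projector
    return P, P.T @ G                     # compact (rank, in_dim) gradient

G = rng.normal(size=W.shape)              # pretend this came from backprop
Wn = weight_normalize(W)
P, G_low = project_gradient(G, rank)

lr = 1e-2
W_updated = Wn - lr * (P @ G_low)         # project back before applying the step
print(G.size, "->", G_low.size, "gradient entries kept for optimizer state")
```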

[AI-12] Bidding Games on Markov Decision Processes with Quantitative Reachability Objectives AAMAS2025

链接: https://arxiv.org/abs/2412.19609
作者: Guy Avni,Martin Kurečka,Kaushik Mallik,Petr Novotný,Suman Sadhukhan
关键词: bidding games, Graph games, fundamental in strategic, strategic reasoning, reasoning of multi-agent
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注: To appear in AAMAS 2025

点击查看摘要

Abstract:Graph games are fundamental in strategic reasoning of multi-agent systems and their environments. We study a new family of graph games which combine stochastic environmental uncertainties and auction-based interactions among the agents, formalized as bidding games on (finite) Markov decision processes (MDP). Normally, on MDPs, a single decision-maker chooses a sequence of actions, producing a probability distribution over infinite paths. In bidding games on MDPs, two players – called the reachability and safety players – bid for the privilege of choosing the next action at each step. The reachability player’s goal is to maximize the probability of reaching a target vertex, whereas the safety player’s goal is to minimize it. These games generalize traditional bidding games on graphs, and the existing analysis techniques do not extend. For instance, the central property of traditional bidding games is the existence of a threshold budget, which is a necessary and sufficient budget to guarantee winning for the reachability player. For MDPs, the threshold becomes a relation between the budgets and probabilities of reaching the target. We devise value-iteration algorithms that approximate thresholds and optimal policies for general MDPs, and compute the exact solutions for acyclic MDPs, and show that finding thresholds is at least as hard as solving simple-stochastic games.

[AI-13] SocRATES: Towards Automated Scenario-based Testing of Social Navigation Algorithms

链接: https://arxiv.org/abs/2412.19595
作者: Shashank Rao Marpally,Pranav Goyal,Harold Soh
关键词: benchmarks primarily focus, Current social navigation, Current social, task efficiency, social navigation methods
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:Current social navigation methods and benchmarks primarily focus on proxemics and task efficiency. While these factors are important, qualitative aspects such as perceptions of a robot’s social competence are equally crucial for successful adoption and integration into human environments. We propose a more comprehensive evaluation of social navigation through scenario-based testing, where specific human-robot interaction scenarios can reveal key robot behaviors. However, creating such scenarios is often labor-intensive and complex. In this work, we address this challenge by introducing a pipeline that automates the generation of context-, and location-appropriate social navigation scenarios, ready for simulation. Our pipeline transforms simple scenario metadata into detailed textual scenarios, infers pedestrian and robot trajectories, and simulates pedestrian behaviors, which enables more controlled evaluation. We leverage the social reasoning and code-generation capabilities of Large Language Models (LLMs) to streamline scenario generation and translation. Our experiments show that our pipeline produces realistic scenarios and significantly improves scenario translation over naive LLM prompting. Additionally, we present initial feedback from a usability study with social navigation experts and a case-study demonstrating a scenario-based evaluation of three navigation algorithms.

[AI-14] ViDTA: Enhanced Drug-Target Affinity Prediction via Virtual Graph Nodes and Attention-based Feature Fusion

链接: https://arxiv.org/abs/2412.19589
作者: Minghui Li,Zikang Guo,Yang Wu,Peijin Guo,Yao Shi,Shengshan Hu,Wei Wan,Shengqing Hu
关键词: predicting drug-target affinity, accurately predicting drug-target, affect biological systems, drug-target affinity, predicting drug-target
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: Accepted by International Conference on Bioinformatics and Biomedicine (BIBM 24)

点击查看摘要

Abstract:Drug-target interaction is fundamental in understanding how drugs affect biological systems, and accurately predicting drug-target affinity (DTA) is vital for drug discovery. Recently, deep learning methods have emerged as a significant approach for estimating the binding strength between drugs and target proteins. However, existing methods simply utilize the drug’s local information from molecular topology rather than global information. Additionally, the features of drugs and proteins are usually fused with a simple concatenation operation, limiting their effectiveness. To address these challenges, we proposed ViDTA, an enhanced DTA prediction framework. We introduce virtual nodes into the Graph Neural Network (GNN)-based drug feature extraction network, which acts as a global memory to exchange messages more efficiently. By incorporating virtual graph nodes, we seamlessly integrate local and global features of drug molecular structures, expanding the GNN’s receptive field. Additionally, we propose an attention-based linear feature fusion network for better capturing the interaction information between drugs and proteins. Experimental results evaluated on various benchmarks including Davis, Metz, and KIBA demonstrate that our proposed ViDTA outperforms the state-of-the-art baselines.

[AI-15] Graph-attention-based Causal Discovery with Trust Region-navigated Clipping Policy Optimization

链接: https://arxiv.org/abs/2412.19578
作者: Shixuan Liu,Yanghe Feng,Keyu Wu,Guangquan Cheng,Jincai Huang,Zhong Liu
关键词: empirical sciences, indispensable task, domains of empirical, remains an indispensable, causal structure
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In many domains of empirical sciences, discovering the causal structure within variables remains an indispensable task. Recently, to tackle the unoriented edges or violated latent assumptions suffered by conventional methods, researchers formulated a reinforcement learning (RL) procedure for causal discovery and employed the REINFORCE algorithm to search for the best-rewarded directed acyclic graph. The two keys to the overall performance of the procedure are the robustness of the RL method and the efficient encoding of variables. However, on the one hand, REINFORCE is prone to local convergence and unstable performance during training. Neither trust region policy optimization, which is computationally expensive, nor proximal policy optimization (PPO), which suffers from aggregate constraint deviation, is a decent alternative for combinatorial optimization problems with considerable individual subactions. We propose a trust region-navigated clipping policy optimization method for causal discovery that guarantees both better search efficiency and steadiness in policy optimization, in comparison with REINFORCE, PPO and our prioritized sampling-guided REINFORCE implementation. On the other hand, to boost the efficient encoding of variables, we propose a refined graph attention encoder called SDGAT that can grasp more feature information without prior neighbourhood information. With these improvements, the proposed method outperforms the former RL method on both synthetic and benchmark datasets in terms of output results and optimization robustness.

[AI-16] Hindsight Planner: A Closed-Loop Few-Shot Planner for Embodied Instruction Following

链接: https://arxiv.org/abs/2412.19562
作者: Yuxiao Yang,Shenao Zhang,Zhihan Liu,Huaxiu Yao,Zhaoran Wang
关键词: Large Language Models, Language Models, Embodied Instruction, Large Language, focuses on building
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This work focuses on building a task planner for Embodied Instruction Following (EIF) using Large Language Models (LLMs). Previous works typically train a planner to imitate expert trajectories, treating this as a supervised task. While these methods achieve competitive performance, they often lack sufficient robustness. When a suboptimal action is taken, the planner may encounter an out-of-distribution state, which can lead to task failure. In contrast, we frame the task as a Partially Observable Markov Decision Process (POMDP) and aim to develop a robust planner under a few-shot assumption. Thus, we propose a closed-loop planner with an adaptation module and a novel hindsight method, aiming to use as much information as possible to assist the planner. Our experiments on the ALFRED dataset indicate that our planner achieves competitive performance under a few-shot assumption. For the first time, our few-shot agent’s performance approaches and even surpasses that of the full-shot supervised agent.

[AI-17] Learning states enhanced knowledge tracing: Simulating the diversity in real-world learning process

链接: https://arxiv.org/abs/2412.19550
作者: Shanshan Wang,Xueying Zhang,Keyang Wang,Xun Yang,Xingyi Zhang
关键词: future performance based, learning state, knowledge state, learner future performance, Enhanced Knowledge Tracing
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Knowledge Tracing (KT) task focuses on predicting a learner’s future performance based on historical interactions. The knowledge state plays a key role in the learning process, but it is influenced by various factors in the interaction process, such as exercise similarity, response reliability, and the learner’s learning state. Previous models still face two major limitations. First, because exercises differ for complex reasons and responses are unreliable due to guessing behavior, it is hard to locate the historical interaction that is most relevant to the currently answered exercise. Second, the learning state is also a key factor influencing the knowledge state, yet it is usually ignored by previous methods. To address these issues, we propose a new method named Learning State Enhanced Knowledge Tracing (LSKT). Firstly, to simulate the potential differences in interactions, inspired by the Item Response Theory (IRT) paradigm, we design three embedding methods ranging from coarse-grained to fine-grained views and conduct a comparative analysis of them. Secondly, we design a learning state extraction module to capture the changing learning state during the learner’s learning process. In turn, with the help of the extracted learning state, a more detailed knowledge state can be captured. Experimental results on four real-world datasets show that our LSKT method outperforms the current state-of-the-art methods.

[AI-18] Scalable Hierarchical Reinforcement Learning for Hyper Scale Multi-Robot Task Planning

链接: https://arxiv.org/abs/2412.19538
作者: Xuan Zhou,Xiang Shi,Lele Zhang,Chen Chen,Hongbo Li,Lin Ma,Fang Deng,Jie Chen
关键词: mobile fulfillment system, huge customer orders, meet huge customer, robotic mobile fulfillment, hyper scale MRTP
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:To improve the efficiency of warehousing systems and meet huge customer orders, we aim to solve the challenges of the curse of dimensionality and dynamic properties in hyper scale multi-robot task planning (MRTP) for robotic mobile fulfillment systems (RMFS). Existing research indicates that hierarchical reinforcement learning (HRL) is an effective method to reduce these challenges. Based on that, we construct an efficient multi-stage HRL-based multi-robot task planner for hyper scale MRTP in RMFS, and the planning process is represented with a special temporal graph topology. To ensure optimality, the planner is designed with a centralized architecture, but this also brings the challenges of scaling up and generalization, requiring policies to maintain performance on various unlearned scales and maps. To tackle these difficulties, we first construct a hierarchical temporal attention network (HTAN) to ensure a basic ability to handle inputs of variable length, and then design multi-stage curricula for hierarchical policy learning to further improve the scaling up and generalization ability while avoiding catastrophic forgetting. Additionally, we notice that policies with a hierarchical structure suffer from unfair credit assignment, similar to that in multi-agent reinforcement learning; inspired by this, we propose a hierarchical reinforcement learning algorithm with a counterfactual rollout baseline to improve learning performance. Experimental results demonstrate that our planner outperforms other state-of-the-art methods on various MRTP instances in both simulated and real-world RMFS. Also, our planner can successfully scale up to hyper scale MRTP instances in RMFS with up to 200 robots and 1000 retrieval racks on unlearned maps while keeping superior performance over other methods.

[AI-19] PLN and NARS Often Yield Similar strength times confidence Given Highly Uncertain Term Probabilities

链接: https://arxiv.org/abs/2412.19524
作者: Ben Goertzel
关键词: Probabilistic Logic Networks, reasoning frameworks aimed, Non-Axiomatic Reasoning System, Logic Networks, Probabilistic Logic
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We provide a comparative analysis of the deduction, induction, and abduction formulas used in Probabilistic Logic Networks (PLN) and the Non-Axiomatic Reasoning System (NARS), two uncertain reasoning frameworks aimed at AGI. One difference between the two systems is that, at the level of individual inference rules, PLN directly leverages both term and relationship probabilities, whereas NARS only leverages relationship frequencies and has no simple analogue of term probabilities. Thus we focus here on scenarios where there is high uncertainty about term probabilities, and explore how this uncertainty influences the comparative inferential conclusions of the two systems. We compare the product of strength and confidence (s × c) in PLN against the product of frequency and confidence (f × c) in NARS (quantities we refer to as measuring the “power” of an uncertain statement) in cases of high term probability uncertainty, using heuristic analyses and elementary numerical computations. We find that in many practical situations with high term probability uncertainty, PLN and NARS formulas give very similar results for the power of an inference conclusion, even though they sometimes come to these similar numbers in quite different ways.

[AI-20] Estimation of System Parameters Including Repeated Cross-Sectional Data through Emulator-Informed Deep Generative Model

链接: https://arxiv.org/abs/2412.19517
作者: Hyunwoo Cho,Sung Woong Cho,Hyeontae Jo,Hyung Ju Hwang
关键词: Differential equations, crucial for modeling, modeling the evolution, evolution of natural, natural or engineered
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Populations and Evolution (q-bio.PE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Differential equations (DEs) are crucial for modeling the evolution of natural or engineered systems. Traditionally, the parameters in DEs are adjusted to fit data from system observations. However, in fields such as politics, economics, and biology, available data are often independently collected at distinct time points from different subjects (i.e., repeated cross-sectional (RCS) data). Conventional optimization techniques struggle to accurately estimate DE parameters when RCS data exhibit various heterogeneities, leading to a significant loss of information. To address this issue, we propose a new estimation method called the emulator-informed deep-generative model (EIDGM), designed to handle RCS data. Specifically, EIDGM integrates a physics-informed neural network-based emulator that immediately generates DE solutions and a Wasserstein generative adversarial network-based parameter generator that can effectively mimic the RCS data. We evaluated EIDGM on exponential growth, logistic population models, and the Lorenz system, demonstrating its superior ability to accurately capture parameter distributions. Additionally, we applied EIDGM to an experimental dataset of Amyloid beta 40 and beta 42, successfully capturing diverse parameter distribution shapes. This shows that EIDGM can be applied to model a wide range of systems and extended to uncover the operating principles of systems based on limited data.

[AI-21] Hybrid Local Causal Discovery

链接: https://arxiv.org/abs/2412.19507
作者: Zhaolong Ling,Honghui Peng,Yiwen Zhang,Peng Zhou,Xingyu Wu,Kui Yu,Xindong Wu
关键词: Local causal discovery, Local causal, causal discovery, causal discovery methods, causal discovery aims
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Local causal discovery aims to learn and distinguish the direct causes and effects of a target variable from observed data. Existing constraint-based local causal discovery methods use AND or OR rules in constructing the local causal skeleton, but using either rule alone is prone to produce cascading errors in the learned local causal skeleton, and thus impacting the inference of local causal relationships. On the other hand, directly applying score-based global causal discovery methods to local causal discovery may randomly return incorrect results due to the existence of local equivalence classes. To address the above issues, we propose a Hybrid Local Causal Discovery algorithm, called HLCD. Specifically, HLCD initially utilizes a constraint-based approach combined with the OR rule to obtain a candidate skeleton and then employs a score-based method to eliminate redundant portions in the candidate skeleton. Furthermore, during the local causal orientation phase, HLCD distinguishes between V-structures and equivalence classes by comparing the local structure scores between the two, thereby avoiding orientation interference caused by local equivalence classes. We conducted extensive experiments with seven state-of-the-art competitors on 14 benchmark Bayesian network datasets, and the experimental results demonstrate that HLCD significantly outperforms existing local causal discovery algorithms.

[AI-22] Multi-P2A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

链接: https://arxiv.org/abs/2412.19496
作者: Jie Zhang,Xiangkui Cao,Zhouyu Han,Shiguang Shan,Xilin Chen
关键词: Large Vision-Language Models, exhibit impressive potential, Large Vision-Language, face significant privacy, privacy
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) exhibit impressive potential across various tasks but also face significant privacy risks, limiting their practical applications. Current research on privacy assessment for LVLMs is limited in scope, with gaps in both assessment dimensions and privacy categories. To bridge this gap, we propose Multi-P^2A, a comprehensive benchmark for evaluating the privacy preservation capabilities of LVLMs in terms of privacy awareness and leakage. Privacy awareness measures the model’s ability to recognize the privacy sensitivity of input data, while privacy leakage assesses the risk of the model unintentionally disclosing privacy information in its output. We design a range of sub-tasks to thoroughly evaluate the privacy protection offered by LVLMs. Multi-P^2A covers 26 categories of personal privacy, 15 categories of trade secrets, and 18 categories of state secrets, totaling 31,962 samples. Based on Multi-P^2A, we evaluate the privacy preservation capabilities of 21 open-source and 2 closed-source LVLMs. Our results reveal that current LVLMs generally pose a high risk of facilitating privacy breaches, with vulnerabilities varying across personal privacy, trade secrets, and state secrets.

[AI-23] Disparate Model Performance and Stability in Machine Learning Clinical Support for Diabetes and Heart Diseases

链接: https://arxiv.org/abs/2412.19495
作者: Ioannis Bilionis,Ricardo C. Berrios,Luis Fernandez-Luque,Carlos Castillo
关键词: Machine Learning, supporting clinical decision-making, algorithms are vital, biomedical informatics, vital for supporting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper will be presented in American Medical Informatics Association (AMIA) Informatics Summit Conference 2025 (Pittsburgh, PA). 10 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Machine Learning (ML) algorithms are vital for supporting clinical decision-making in biomedical informatics. However, their predictive performance can vary across demographic groups, often due to the underrepresentation of historically marginalized populations in training datasets. The investigation reveals widespread sex- and age-related inequities in chronic disease datasets and their derived ML models. Thus, a novel analytical framework is introduced, combining systematic arbitrariness with traditional metrics like accuracy and data complexity. The analysis of data from over 25,000 individuals with chronic diseases revealed mild sex-related disparities, favoring predictive accuracy for males, and significant age-related differences, with better accuracy for younger patients. Notably, older patients showed inconsistent predictive accuracy across seven datasets, linked to higher data complexity and lower model performance. This highlights that representativeness in training data alone does not guarantee equitable outcomes, and model arbitrariness must be addressed before deploying models in clinical settings.

[AI-24] Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models

链接: https://arxiv.org/abs/2412.19450
作者: Hyeonseok Moon,Jaehyung Seo,Seungyoon Lee,Chanjun Park,Heuiseok Lim
关键词: Large Language Models, Large Language, strengths of Large, Language Models, key strengths
类目: Artificial Intelligence (cs.AI)
*备注: 21 pages

点击查看摘要

Abstract:One of the key strengths of Large Language Models (LLMs) is their ability to interact with humans by generating appropriate responses to given instructions. This ability, known as instruction-following capability, has established a foundation for the use of LLMs across various fields and serves as a crucial metric for evaluating their performance. While numerous evaluation benchmarks have been developed, most focus solely on clear and coherent instructions. However, we have noted that LLMs can become easily distracted by instruction-formatted statements, which may lead to an oversight of their instruction comprehension skills. To address this issue, we introduce the Intention of Instruction (IoInst) benchmark. This benchmark evaluates LLMs’ capacity to remain focused and understand instructions without being misled by extraneous instructions. The primary objective of this benchmark is to identify the appropriate instruction that accurately guides the generation of a given context. Our findings suggest that even recently introduced state-of-the-art models still lack instruction understanding capability. Along with the proposition of IoInst in this study, we also present broad analyses of the several strategies potentially applicable to IoInst.

[AI-25] A Survey on Large Language Model Acceleration based on KV Cache Management

链接: https://arxiv.org/abs/2412.19442
作者: Haoyang Li,Yiming Li,Anxin Tian,Tianhao Tang,Zhanchao Xu,Xuejia Chen,Nicole Hu,Wei Dong,Qing Li,Lei Chen
关键词: Large Language Models, natural language processing, perform logical reasoning, Large Language, Language Models
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments. Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications. The curated paper list for KV cache management is in: this https URL.
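
For readers new to the topic, the toy NumPy sketch below shows the basic mechanism the survey builds on: caching per-token key/value projections so each decoding step only computes attention for the newest token. It is a generic single-head illustration with assumed dimensions, not any specific system from the survey.

```python
# Minimal illustration of a token-level KV cache for autoregressive decoding.
import numpy as np

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    def __init__(self):
        self.K, self.V = [], []          # one cached entry per generated token

    def step(self, x):                   # x: (d,) hidden state of the new token
        q = x @ Wq
        self.K.append(x @ Wk)            # append instead of recomputing history
        self.V.append(x @ Wv)
        K, V = np.stack(self.K), np.stack(self.V)
        attn = softmax(q @ K.T / np.sqrt(d))
        return attn @ V                  # attention output for the new token

cache = KVCache()
for t in range(5):                       # autoregressive decoding loop
    out = cache.step(rng.normal(size=d))
print("cached tokens:", len(cache.K), "output dim:", out.shape)
```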

[AI-26] A Self-Efficacy Theory-based Study on the Teachers Readiness to Teach Artificial Intelligence in Public Schools in Sri Lanka

链接: https://arxiv.org/abs/2412.19425
作者: Chathura Rajapakse,Wathsala Ariyarathna,Shanmugalingam Selvakan
关键词: Sri Lankan ICT, investigates Sri Lankan, Lankan ICT teachers’, Sri Lankan, Lankan ICT
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study investigates Sri Lankan ICT teachers’ readiness to teach AI in schools, focusing on self-efficacy. A survey of over 1,300 teachers assessed their self-efficacy using a scale developed based on Bandura’s theory. PLS-SEM analysis revealed that teachers’ self-efficacy was low, primarily influenced by emotional and physiological states and imaginary experiences related to AI instruction. Mastery experiences had a lesser impact, and vicarious experiences and verbal persuasion showed no significant effect. The study highlights the need for a systemic approach to teacher professional development, considering the limitations in teachers’ AI expertise and social capital. Further research is recommended to explore a socio-technical systems perspective for effective AI teacher training.

[AI-27] Revisiting PCA for time series reduction in temporal dimension

链接: https://arxiv.org/abs/2412.19423
作者: Jiaxin Gao,Wenbo Hu,Yuntian Chen
关键词: Jiaxin Gao, Yuntian Chen, Deep learning, significantly advanced time, Revisiting PCA
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: 13 pages, 5 figures, 7 tables

点击查看摘要

Abstract:Deep learning has significantly advanced time series analysis (TSA), enabling the extraction of complex patterns for tasks like classification, forecasting, and regression. Although dimensionality reduction has traditionally focused on the variable space, achieving notable success in minimizing data redundancy and computational complexity, less attention has been paid to reducing the temporal dimension. In this study, we revisit Principal Component Analysis (PCA), a classical dimensionality reduction technique, to explore its utility in temporal dimension reduction for time series data. It is generally thought that applying PCA to the temporal dimension would disrupt temporal dependencies, leading to limited exploration in this area. However, our theoretical analysis and extensive experiments demonstrate that applying PCA to sliding series windows not only maintains model performance, but also enhances computational efficiency. In auto-regressive forecasting, the temporal structure is partially preserved through windowing, and PCA is applied within these windows to denoise the time series while retaining their statistical information. By preprocessing time-series data with PCA, we reduce the temporal dimensionality before feeding it into TSA models such as Linear, Transformer, CNN, and RNN architectures. This approach accelerates training and inference and reduces resource consumption. Notably, PCA improves Informer training and inference speed by up to 40% and decreases GPU memory usage of TimesNet by 30%, without sacrificing model accuracy. Comparative analysis against other reduction methods further highlights the effectiveness of PCA in improving the efficiency of TSA models.
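
The core recipe in the abstract (window the series, apply PCA along the temporal dimension of the windows, then feed a forecaster) can be sketched in a few lines. The window length, number of components, and the linear model below are illustrative choices, not the paper's exact setup.

```python
# Sketch: PCA on sliding windows as temporal-dimension reduction for forecasting.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
series = np.sin(np.arange(1000) * 0.05) + 0.1 * rng.normal(size=1000)

L = 48                                            # look-back window length
X = np.stack([series[i:i + L] for i in range(len(series) - L)])
y = series[L:]                                    # one-step-ahead targets

pca = PCA(n_components=8)                         # 48 temporal features -> 8
X_red = pca.fit_transform(X)

model = LinearRegression().fit(X_red[:-100], y[:-100])
pred = model.predict(X_red[-100:])
mse = np.mean((pred - y[-100:]) ** 2)
print(f"window dim {L} -> {X_red.shape[1]}, test MSE {mse:.4f}")
```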

[AI-28] Gx2Mol: De Novo Generation of Hit-like Molecules from Gene Expression Profiles via Deep Learning

链接: https://arxiv.org/abs/2412.19422
作者: Chen Li,Yuki Matsukiyo,Yoshihiro Yamanishi
关键词: drug discovery process, gene expression profiles, discovery process, novo generation, generation of hit-like
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:De novo generation of hit-like molecules is a challenging task in the drug discovery process. Most methods in previous studies learn the semantics and syntax of molecular structures by analyzing molecular graphs or simplified molecular input line entry system (SMILES) strings; however, they do not take into account the drug responses of the biological systems consisting of genes and proteins. In this study we propose a deep generative model, Gx2Mol, which utilizes gene expression profiles to generate molecular structures with desirable phenotypes for arbitrary target proteins. In the algorithm, a variational autoencoder is employed as a feature extractor to learn the latent feature distribution of the gene expression profiles. Then, a long short-term memory is leveraged as the chemical generator to produce syntactically valid SMILES strings that satisfy the feature conditions of the gene expression profile extracted by the feature extractor. Experimental results and case studies demonstrate that the proposed Gx2Mol model can produce new molecules with potential bioactivities and drug-like properties.

[AI-29] Introduction to Graph Neural Networks: A Starting Point for Machine Learning Engineers

链接: https://arxiv.org/abs/2412.19419
作者: James H. Tanis,Chris Giannella,Adrian V. Mariano
关键词: Graph neural networks, deep neural networks, neural networks designed, neural networks, nodes or edges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph neural networks are deep neural networks designed for graphs with attributes attached to nodes or edges. The number of research papers in the literature concerning these models is growing rapidly due to their impressive performance on a broad range of tasks. This survey introduces graph neural networks through the encoder-decoder framework and provides examples of decoders for a range of graph analytic tasks. It uses theory and numerous experiments on homogeneous graphs to illustrate the behavior of graph neural networks for different training sizes and degrees of graph complexity.
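
A minimal example of the encoder-decoder framing used by the survey: a one-layer GCN-style encoder followed by a dot-product edge decoder for link prediction. Dimensions and weights are toy values chosen here for illustration.

```python
# Encoder-decoder sketch: GCN-style node encoder + dot-product edge decoder.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)          # adjacency of a 4-node graph
X = rng.normal(size=(4, 5))                        # node attributes
W = rng.normal(size=(5, 3))                        # encoder weights

A_hat = A + np.eye(4)                              # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
Z = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0)   # ReLU(norm. propagation)

def decode(u, v):                                  # decoder: score a candidate edge
    return float(Z[u] @ Z[v])

print("score(0,1) =", round(decode(0, 1), 3), " score(0,3) =", round(decode(0, 3), 3))
```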

[AI-30] Fully Data-driven but Interpretable Human Behavioural Modelling with Differentiable Discrete Choice Model

链接: https://arxiv.org/abs/2412.19403
作者: Fumiyasu Makinoshima,Tatsuya Mitomi,Fumiya Makihara,Eigo Segawa
关键词: complex human behaviours, human behaviours, decision-making processes, Discrete choice models, Discrete choice
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Discrete choice models are essential for modelling various decision-making processes in human behaviour. However, the specification of these models has depended heavily on domain knowledge from experts, and the fully automated but interpretable modelling of complex human behaviours has been a long-standing challenge. In this paper, we introduce the differentiable discrete choice model (Diff-DCM), a fully data-driven method for the interpretable modelling, learning, prediction, and control of complex human behaviours, which is realised by differentiable programming. Solely from input features and choice outcomes without any prior knowledge, Diff-DCM can estimate interpretable closed-form utility functions that reproduce observed behaviours. Comprehensive experiments with both synthetic and real-world data demonstrate that Diff-DCM can be applied to various types of data and requires only a small amount of computational resources for the estimations, which can be completed within tens of seconds on a laptop without any accelerators. In these experiments, we also demonstrate that, using its differentiability, Diff-DCM can provide useful insights into human behaviours, such as an optimal intervention path for effective behavioural changes. This study provides a strong basis for the fully automated and reliable modelling, prediction, and control of human behaviours.
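
As a small illustration of the softmax-choice building block underlying differentiable discrete choice models, the sketch below fits a linear-in-parameters multinomial logit by gradient ascent on synthetic data. Diff-DCM itself goes further and recovers closed-form utility functions via differentiable programming, which is not shown here; the data sizes and learning rate are assumptions.

```python
# Toy differentiable choice model: softmax choice probabilities over a linear utility.
import numpy as np

rng = np.random.default_rng(0)
n, d, alts = 2000, 3, 4
X = rng.normal(size=(n, alts, d))                 # features of each alternative

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

beta_true = np.array([1.5, -1.0, 0.5])
choices = np.array([rng.choice(alts, p=softmax(u)) for u in X @ beta_true])

beta = np.zeros(d)
for _ in range(500):                              # gradient ascent on the log-likelihood
    u = X @ beta
    p = np.exp(u - u.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)             # choice probabilities
    onehot = np.eye(alts)[choices]
    grad = ((onehot - p)[:, :, None] * X).sum(axis=(0, 1)) / n
    beta += 0.5 * grad

print("true:", beta_true, "estimated:", beta.round(2))
```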

[AI-31] Comparing Few to Rank Many: Active Human Preference Learning using Randomized Frank-Wolfe AISTATS2025

链接: https://arxiv.org/abs/2412.19396
作者: Kiran Koshy Thekumparampil,Gaurush Hiranandani,Kousha Kalantari,Shoham Sabach,Branislav Kveton
关键词: limited comparison feedback, comparison feedback, human preferences, study learning, limited comparison
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Submitted to AISTATS 2025 on October 10, 2024

点击查看摘要

Abstract:We study learning of human preferences from limited comparison feedback. This task is ubiquitous in machine learning, and its applications, such as reinforcement learning from human feedback, have been transformational. We formulate this problem as learning a Plackett-Luce model over a universe of N choices from K-way comparison feedback, where typically K ≪ N. Our solution is the D-optimal design for the Plackett-Luce objective. The design defines a data logging policy that elicits comparison feedback for a small collection of optimally chosen points from all N-choose-K feasible subsets. The main algorithmic challenge in this work is that even fast methods for solving D-optimal designs would have O(N-choose-K) time complexity. To address this issue, we propose a randomized Frank-Wolfe (FW) algorithm that solves the linear maximization sub-problems in the FW method on randomly chosen variables. We analyze the algorithm and evaluate it empirically on synthetic and open-source NLP datasets.
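
The randomized Frank-Wolfe idea can be illustrated on the simpler classical D-optimal design problem: each iteration scores only a random subsample of candidates when picking the FW vertex, rather than scanning all of them. This is a hedged toy sketch under that simplification, not the paper's Plackett-Luce formulation over K-way subsets.

```python
# Randomized Frank-Wolfe sketch for classical D-optimal design (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
N, d = 500, 6
X = rng.normal(size=(N, d))                       # candidate feature vectors

w = np.full(N, 1.0 / N)                           # design weights on candidates
for t in range(200):
    M = (X * w[:, None]).T @ X + 1e-6 * np.eye(d) # information matrix of the design
    Minv = np.linalg.inv(M)
    idx = rng.choice(N, size=50, replace=False)   # randomized step: score a subsample
    scores = np.einsum('ij,jk,ik->i', X[idx], Minv, X[idx])  # x_i^T M^{-1} x_i
    i_star = idx[np.argmax(scores)]               # best vertex found in the sample
    gamma = 2.0 / (t + 2)                         # standard FW step size
    w *= (1 - gamma)
    w[i_star] += gamma

print("log det of information matrix:", np.linalg.slogdet((X * w[:, None]).T @ X)[1])
print("support size of the design:", int((w > 1e-4).sum()))
```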

[AI-32] An Engorgio Prompt Makes Large Language Model Babble on

链接: https://arxiv.org/abs/2412.19394
作者: Jianshuo Dong,Ziyuan Zhang,Qingjie Zhang,Han Qiu,Tianwei Zhang,Hao Wang,Hewu Li,Qi Li,Chao Zhang,Ke Xu
关键词: large language models, yielded impressive performance, Auto-regressive large language, Engorgio prompts, language models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Auto-regressive large language models (LLMs) have yielded impressive performance in many real-world tasks. However, the new paradigm of these LLMs also exposes novel threats. In this paper, we explore their vulnerability to inference cost attacks, where a malicious user crafts Engorgio prompts to intentionally increase the computation cost and latency of the inference process. We design Engorgio, a novel methodology, to efficiently generate adversarial Engorgio prompts to affect the target LLM’s service availability. Engorgio has the following two technical contributions. (1) We employ a parameterized distribution to track LLMs’ prediction trajectory. (2) Targeting the auto-regressive nature of LLMs’ inference process, we propose novel loss functions to stably suppress the appearance of the EOS token, whose occurrence will interrupt the LLM’s generation process. We conduct extensive experiments on 13 open-sourced LLMs with parameters ranging from 125M to 30B. The results show that Engorgio prompts can successfully induce LLMs to generate abnormally long outputs (i.e., roughly 2-13× longer to reach 90%+ of the output length limit) in a white-box scenario, and our real-world experiment demonstrates Engorgio’s threat to LLM services with limited computing resources. The code is accessible at this https URL.

[AI-33] Large Language Models for Market Research: A Data-augmentation Approach

链接: https://arxiv.org/abs/2412.19363
作者: Mengxin Wang(Naveen Jindal School of Management, The University of Texas at Dallas),Dennis J. Zhang(Olin School of Business, Washington University in St. Louis),Heng Zhang(W. P. Carey School of Business, Arizona State University)
关键词: Large Language Models, language processing tasks, complex natural language, natural language processing, Large Language
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two. In this paper, we address this gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. Our method leverages transfer learning principles to debias the LLM-generated data using a small amount of human data. This results in statistically robust estimators with consistent and asymptotically normal properties, in contrast to naive approaches that simply substitute human data with LLM-generated data, which can exacerbate bias. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data due to the inherent biases in LLM-generated data compared to human data. Another empirical study on sports car choices validates the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.

[AI-34] A Reinforcement Learning-Based Task Mapping Method to Improve the Reliability of Clustered Manycores

链接: https://arxiv.org/abs/2412.19340
作者: Fatemeh Hossein-Khani,Omid Akbari
关键词: meeting performance demands, poses significant challenges, systems poses significant, manycore systems poses, performance demands
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing scale of manycore systems poses significant challenges in managing reliability while meeting performance demands. Simultaneously, these systems become more susceptible to different aging mechanisms such as negative-bias temperature instability (NBTI), hot carrier injection (HCI), and thermal cycling (TC), as well as the electromigration (EM) phenomenon. In this paper, we propose a reinforcement learning (RL)-based task mapping method to improve the reliability of manycore systems considering the aforementioned aging mechanisms, which consists of three steps: bin packing, task-to-bin mapping, and task-to-core mapping. In the initial step, a density-based spatial clustering of applications with noise (DBSCAN) method is employed to form clusters (bins) based on the core temperatures. Then, the Q-learning algorithm is used for the two latter steps to map each arriving task onto a core such that the thermal variation among all the bins is minimized. Compared to the state-of-the-art works, the proposed method is performed at runtime without requiring any parameter to be calculated offline. The effectiveness of the proposed technique is evaluated on 16-, 32-, and 64-core systems using SPLASH2 and PARSEC benchmark suite applications. The results demonstrate up to a 27% increase in the mean time to failure (MTTF) compared to the state-of-the-art task mapping techniques.
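
The task-to-bin step can be pictured with a tiny tabular Q-learning loop that rewards small thermal spread across bins; the bin temperatures, reward shape, and hyperparameters below are invented for illustration and are not taken from the paper.

```python
# Toy sketch: Q-learning assigns arriving tasks to temperature bins.
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_states = 4, 5                       # states = discretized thermal spread
Q = np.zeros((n_states, n_bins))
temps = np.array([50.0, 55.0, 60.0, 65.0])    # current mean temperature per bin

def state(temps):
    return min(int((temps.max() - temps.min()) // 5), n_states - 1)

alpha, gamma, eps = 0.1, 0.9, 0.1
for task in range(2000):
    s = state(temps)
    a = rng.integers(n_bins) if rng.random() < eps else int(Q[s].argmax())
    temps[a] += 1.0                           # assigning a task heats that bin
    temps -= 0.2                              # every bin cools a little each step
    r = -(temps.max() - temps.min())          # reward: small thermal variation
    s2 = state(temps)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

print("learned greedy bin choice per state:", Q.argmax(axis=1))
```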

[AI-35] Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones ICML2024

链接: https://arxiv.org/abs/2412.19325
作者: Mehrnaz Mofakhami,Reza Bayat,Ioannis Mitliagkas,Joao Monteiro,Valentina Zantedeschi
关键词: adaptively allocating compute, allocating compute resources, Early Exiting, Control Early Exiting, promising technique
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Appeared at ICML 2024 Workshop on Efficient Systems for Foundation Models (ES-FoMo-II)

点击查看摘要

Abstract:Early Exiting (EE) is a promising technique for speeding up inference by adaptively allocating compute resources to data points based on their difficulty. The approach enables predictions to exit at earlier layers for simpler samples while reserving more computation for challenging ones. In this study, we first present a novel perspective on the EE approach, showing that larger models deployed with EE can achieve higher performance than smaller models while maintaining similar computational costs. As existing EE approaches rely on confidence estimation at each exit point, we further study the impact of overconfidence on the controllability of the compute-performance trade-off. We introduce Performance Control Early Exiting (PCEE), a method that enables accuracy thresholding by basing decisions not on a data point’s confidence but on the average accuracy of samples with similar confidence levels from a held-out validation set. In our experiments, we show that PCEE offers a simple yet computationally efficient approach that provides better control over performance than standard confidence-based approaches, and allows us to scale up model sizes to yield performance gain while reducing the computational cost.
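
The decision rule described above (exit when the validation accuracy of similarly confident samples clears a target, rather than when the raw confidence does) can be sketched as follows; the bin width, target accuracy, and synthetic validation statistics are assumptions, not the paper's settings.

```python
# Sketch of a PCEE-style exit rule: threshold on per-bin validation accuracy.
import numpy as np

rng = np.random.default_rng(0)
n_bins, target_acc = 10, 0.9

# Hypothetical validation statistics for one exit head.
val_conf = rng.uniform(0.3, 1.0, size=5000)
val_correct = rng.random(5000) < val_conf          # toy data: accuracy tracks confidence

bins = np.minimum((val_conf * n_bins).astype(int), n_bins - 1)
bin_acc = np.array([val_correct[bins == b].mean() if (bins == b).any() else 0.0
                    for b in range(n_bins)])

def exit_early(confidence):
    b = min(int(confidence * n_bins), n_bins - 1)
    # The decision uses the bin's measured accuracy, not the raw confidence,
    # which is what keeps the accuracy threshold controllable under overconfidence.
    return bin_acc[b] >= target_acc

for c in (0.55, 0.8, 0.97):
    print(f"confidence {c:.2f} -> exit early: {exit_early(c)}")
```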

[AI-36] A novel framework for MCDM based on Z numbers and soft likelihood function

链接: https://arxiv.org/abs/2412.19321
作者: Yuanpeng He
关键词: soft likelihood function, structure of process, environment has attracted, attracted lots, lots of attention
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optimizing the structure of information-management processes under uncertain environments has attracted considerable attention from researchers around the world. Nevertheless, how to obtain accurate and rational evaluations from assessments produced by experts is still an open problem. In particular, the intuitionistic fuzzy set provides an effective solution for handling indeterminate information, and Yager recently proposed a novel method for fusing probabilistic evidence to handle uncertain and conflicting information, called the soft likelihood function. This paper devises a novel framework of the soft likelihood function, based on the information volume of fuzzy membership and a credibility measure, for extracting truly useful and valuable information from uncertainty. An application is provided to verify the validity and correctness of the proposed framework. Moreover, comparisons with other existing methods further demonstrate the superiority of the proposed soft likelihood function framework.

[AI-37] From Interests to Insights: An LLM Approach to Course Recommendations Using Natural Language Queries

链接: https://arxiv.org/abs/2412.19312
作者: Hugh Van Deventer,Mark Mills,August Evrard
关键词: United States encourage, acquire academic breadth, United States, States encourage, explore academic areas
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Most universities in the United States encourage their students to explore academic areas before declaring a major and to acquire academic breadth by satisfying a variety of requirements. Each term, students must choose a handful of courses to take from many thousands of offerings spanning dozens of subject areas. The curricular environment is also dynamic, and poor communication and search functions on campus can limit a student’s ability to discover new courses of interest. To support both students and their advisers in such a setting, we explore a novel Large Language Model (LLM) course recommendation system that applies a Retrieval Augmented Generation (RAG) method to the corpus of course descriptions. The system first generates an ‘ideal’ course description based on the user’s query. This description is converted into a search vector using embeddings, which is then used to find actual courses with similar content by comparing embedding similarities. We describe the method and assess the quality and fairness of some example prompts. Steps to deploy a pilot system on campus are discussed.
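
The retrieval step, embedding an LLM-written 'ideal' course description and ranking real descriptions by similarity, looks roughly like the sketch below. A TF-IDF vectorizer stands in for the embedding model, and the ideal description and course catalog are hard-coded illustrations rather than LLM-generated or real data.

```python
# Sketch of the "ideal description -> embedding -> nearest courses" retrieval step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

courses = {
    "ASTRO 101": "Introduction to the solar system, stars, and galaxies.",
    "CS 285":    "Reinforcement learning, policy gradients, and deep RL applications.",
    "HIST 210":  "European history from the Renaissance to the present.",
}
ideal = "A hands-on course on deep reinforcement learning and sequential decision making."

vec = TfidfVectorizer().fit(list(courses.values()) + [ideal])
course_mat = vec.transform(list(courses.values()))
scores = cosine_similarity(vec.transform([ideal]), course_mat)[0]

for name, score in sorted(zip(courses, scores), key=lambda t: -t[1]):
    print(f"{score:.2f}  {name}")
```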

[AI-38] xSRL: Safety-Aware Explainable Reinforcement Learning – Safety as a Product of Explainability AAMAS2025

链接: https://arxiv.org/abs/2412.19311
作者: Risal Shahriar Shefin,Md Asifur Rahman,Thai Le,Sarra Alqahtani
关键词: shown great promise, Reinforcement learning, simulated environments, shown great, great promise
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Accepted to 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)

点击查看摘要

Abstract:Reinforcement learning (RL) has shown great promise in simulated environments, such as games, where failures have minimal consequences. However, the deployment of RL agents in real-world systems such as autonomous vehicles, robotics, UAVs, and medical devices demands a higher level of safety and transparency, particularly when facing adversarial threats. Safe RL algorithms have been developed to address these concerns by optimizing both task performance and safety constraints. However, errors are inevitable, and when they occur, it is essential that the RL agents can also explain their actions to human operators. This makes trust in the safety mechanisms of RL systems crucial for effective deployment. Explainability plays a key role in building this trust by providing clear, actionable insights into the agent’s decision-making process, ensuring that safety-critical decisions are well understood. While machine learning (ML) has seen significant advances in interpretability and visualization, explainability methods for RL remain limited. Current tools fail to address the dynamic, sequential nature of RL and its need to balance task performance with safety constraints over time. The re-purposing of traditional ML methods, such as saliency maps, is inadequate for safety-critical RL applications where mistakes can result in severe consequences. To bridge this gap, we propose xSRL, a framework that integrates both local and global explanations to provide a comprehensive understanding of RL agents’ behavior. xSRL also enables developers to identify policy vulnerabilities through adversarial attacks, offering tools to debug and patch agents without retraining. Our experiments and user studies demonstrate xSRL’s effectiveness in increasing safety in RL systems, making them more reliable and trustworthy for real-world deployment. Code is available at this https URL.

[AI-39] RAG with Differential Privacy

链接: https://arxiv.org/abs/2412.19291
作者: Nicolas Grislain
关键词: Large Language Models, Language Models, Large Language, moving knowledge bases, fast moving knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as the dominant technique to provide Large Language Models (LLM) with fresh and relevant context, mitigating the risk of hallucinations and improving the overall quality of responses in environments with large and fast moving knowledge bases. However, the integration of external documents into the generation process raises significant privacy concerns. Indeed, when added to a prompt, it is not possible to guarantee a response will not inadvertently expose confidential data, leading to potential breaches of privacy and ethical dilemmas. This paper explores a practical solution to this problem suitable to general knowledge extraction from personal data. It shows differentially private token generation is a viable approach to private RAG.

[AI-40] Time Series Foundational Models: Their Role in Anomaly Detection and Prediction AAAI2025

链接: https://arxiv.org/abs/2412.19286
作者: Chathurangi Shyalika,Harleen Kaur Bagga,Ahan Bhatt,Renjith Prasad,Alaa Al Ghazo,Amit Sheth
关键词: time series forecasting, Time series foundational, Time series, series foundational models, series forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 6 figures, 5 tables. Accepted at AAAI2025 Anomaly Detection in Scientific Domains Workshop

点击查看摘要

Abstract:Time series foundational models (TSFM) have gained prominence in time series forecasting, promising state-of-the-art performance across various applications. However, their application in anomaly detection and prediction remains underexplored, with growing concerns regarding their black-box nature, lack of interpretability and applicability. This paper critically evaluates the efficacy of TSFM in anomaly detection and prediction tasks. We systematically analyze TSFM across multiple datasets, including those characterized by the absence of discernible patterns, trends and seasonality. Our analysis shows that while TSFMs can be extended for anomaly detection and prediction, traditional statistical and deep learning models often match or outperform TSFM in these tasks. Additionally, TSFMs require high computational resources but fail to capture sequential dependencies effectively or improve performance in few-shot or zero-shot scenarios. The preprocessed datasets, codes to reproduce the results and supplementary materials are available at this https URL.

[AI-41] PearSAN: A Machine Learning Method for Inverse Design using Pearson Correlated Surrogate Annealing

链接: https://arxiv.org/abs/2412.19284
作者: Michael Bezick,Blake A. Wilson,Vaishnavi Iyer,Yuheng Chen,Vladimir M. Shalaev,Sabre Kais,Alexander V. Kildishev,Alexandra Boltasseva,Brad Lackey
关键词: traditional optimizers struggle, machine learning-assisted optimization, learning-assisted optimization algorithm, optimization algorithm applicable, inverse design problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:PearSAN is a machine learning-assisted optimization algorithm applicable to inverse design problems with large design spaces, where traditional optimizers struggle. The algorithm leverages the latent space of a generative model for rapid sampling and employs a Pearson correlated surrogate model to predict the figure of merit of the true design metric. As a showcase example, PearSAN is applied to thermophotovoltaic (TPV) metasurface design by matching the working bands between a thermal radiator and a photovoltaic cell. PearSAN can work with any pretrained generative model with a discretized latent space, making it easy to integrate with VQ-VAEs and binary autoencoders. Its novel Pearson correlational loss can be used as both a latent regularization method, similar to batch and layer normalization, and as a surrogate training loss. We compare both to previous energy matching losses, which are shown to enforce poor regularization and performance, even with upgraded affine parameters. PearSAN achieves a state-of-the-art maximum design efficiency of 97%, and is at least an order of magnitude faster than previous methods, with an improved maximum figure-of-merit gain.
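
A minimal illustration of the Pearson-correlation surrogate objective (my own toy rendering under stated assumptions, not PearSAN's implementation): the surrogate is trained so that its predictions correlate with the true figure of merit, for example by minimizing one minus the Pearson coefficient.

```python
# Toy sketch of a Pearson-correlation loss between surrogate predictions and the
# true figure of merit; an assumption-level illustration, not PearSAN's code.
import numpy as np

def pearson_loss(pred, target, eps=1e-8):
    p = pred - pred.mean()
    t = target - target.mean()
    corr = (p * t).sum() / (np.sqrt((p**2).sum() * (t**2).sum()) + eps)
    return 1.0 - corr          # 0 when perfectly correlated, 2 when anti-correlated

rng = np.random.default_rng(1)
fom = rng.uniform(size=64)                                 # true figure of merit of sampled designs
good_surrogate = 2.0 * fom + 0.05 * rng.normal(size=64)    # strongly correlated predictions
bad_surrogate = rng.uniform(size=64)                       # uncorrelated predictions
print(pearson_loss(good_surrogate, fom), pearson_loss(bad_surrogate, fom))
```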

[AI-42] Leveraging Self-Training and Variational Autoencoder for Agitation Detection in People with Dementia Using Wearable Sensors

链接: https://arxiv.org/abs/2412.19254
作者: Abeer Badawi,Somayya Elmoghazy,Samira Choudhury,Khalid Elgazzar,Amer Burhan
关键词: past decades, neurodegenerative disorder, growing among elder, elder people, detect
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dementia is a neurodegenerative disorder that has been growing among elder people over the past decades. This growth profoundly impacts the quality of life for patients and caregivers due to the symptoms arising from it. Agitation and aggression (AA) are some of the symptoms of people with severe dementia (PwD) in long-term care or hospitals. AA not only causes discomfort but also puts the patients or others at potential risk. Existing monitoring solutions utilizing different wearable sensors integrated with Artificial Intelligence (AI) offer a way to detect AA early enough for timely and adequate medical intervention. However, most studies are limited by the availability of accurately labeled datasets, which significantly affects the efficacy of such solutions in real-world scenarios. This study presents a novel comprehensive approach to detect AA in PwD using physiological data from the Empatica E4 wristbands. The research creates a diverse dataset, consisting of three distinct datasets gathered from 14 participants across multiple hospitals in Canada. These datasets have not been extensively explored due to their limited labeling. We propose a novel approach employing self-training and a variational autoencoder (VAE) to detect AA in PwD effectively. The proposed approach aims to learn the representation of the features extracted using the VAE and then uses a semi-supervised block to generate labels, classify events, and detect AA. We demonstrate that combining Self-Training and Variational Autoencoder mechanism significantly improves model performance in classifying AA in PwD. Among the tested techniques, the XGBoost classifier achieved the highest accuracy of 90.16%. By effectively addressing the challenge of limited labeled data, the proposed system not only learns new labels but also proves its superiority in detecting AA.

[AI-43] Latenrgy: Model Agnostic Latency and Energy Consumption Prediction for Binary Classifiers

链接: https://arxiv.org/abs/2412.19241
作者: Jason M. Pittman
关键词: increasingly drive innovation, Machine learning systems, Machine learning, learning systems increasingly, systems increasingly drive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 2 tables

点击查看摘要

Abstract:Machine learning systems increasingly drive innovation across scientific fields and industry, yet challenges in compute overhead, specifically during inference, limit their scalability and sustainability. Responsible AI guardrails, essential for ensuring fairness, transparency, and privacy, further exacerbate these computational demands. This study addresses critical gaps in the literature, chiefly the lack of generalized predictive techniques for latency and energy consumption, limited cross-comparisons of classifiers, and unquantified impacts of RAI guardrails on inference performance. Using Theory Construction Methodology, this work constructed a model-agnostic theoretical framework for predicting latency and energy consumption in binary classification models during inference. The framework synthesizes classifier characteristics, dataset properties, and RAI guardrails into a unified analytical instrument. Two predictive equations are derived that capture the interplay between these factors while offering generalizability across diverse classifiers. The proposed framework provides foundational insights for designing efficient, responsible ML systems. It enables researchers to benchmark and optimize inference performance and assists practitioners in deploying scalable solutions. Finally, this work establishes a theoretical foundation for balancing computational efficiency with ethical AI principles, paving the way for future empirical validation and broader applications.

[AI-44] Are Two Hidden Layers Still Enough for the Physics-Informed Neural Networks?

链接: https://arxiv.org/abs/2412.19235
作者: Vasiliy A. Es’kin,Alexey O. Malkhanov,Mikhail E. Smorkalov
关键词: ordinary differential equations, partial differential equations, single hidden layer, differential equations, neural network
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 45 pages, 36 figures, 9 tables

点击查看摘要

Abstract:The article discusses the development of various methods and techniques for initializing and training neural networks with a single hidden layer, as well as training a separable physics-informed neural network consisting of neural networks with a single hidden layer to solve physical problems described by ordinary differential equations (ODEs) and partial differential equations (PDEs). A method for strictly deterministic initialization of a neural network with one hidden layer for solving physical problems described by an ODE is proposed. Modifications to existing methods for weighting the loss function are given, as well as new methods developed for training strictly deterministic-initialized neural networks to solve ODEs (detaching, additional weighting based on the second derivative, predicted solution-based weighting, relative residuals). An algorithm for physics-informed data-driven initialization of a neural network with one hidden layer is proposed. A neural network with pronounced generalizing properties is presented, whose generalizing abilities can be precisely controlled by adjusting network parameters. A metric for measuring the generalization of such a neural network has been introduced. A gradient-free neuron-by-neuron fitting method has been developed for adjusting the parameters of a single-hidden-layer neural network, which does not require the use of an optimizer or solver for its implementation. The proposed methods have been extended to 2D problems using the separable physics-informed neural networks approach. Numerous experiments have been carried out to develop the above methods and approaches. Experiments on physical problems, such as solving various ODEs and PDEs, have demonstrated that these methods for initializing and training neural networks with one or two hidden layers (SPINN) achieve competitive accuracy and, in some cases, state-of-the-art results.
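
For readers unfamiliar with single-hidden-layer PINNs, here is a generic toy sketch of the setup the paper studies (a plain tanh network fitted to an ODE residual; it does not reproduce the paper's deterministic initialization or neuron-by-neuron fitting):

```python
# A minimal single-hidden-layer physics-informed fit for the ODE y'(t) = -y(t), y(0) = 1.
# The network derivative is computed analytically, so no autodiff framework is needed;
# parameters are fitted with scipy.optimize.minimize. Illustrative sketch only.
import numpy as np
from scipy.optimize import minimize

H = 10                               # hidden neurons
t = np.linspace(0.0, 2.0, 50)        # collocation points

def unpack(p):
    return p[:H], p[H:2*H], p[2*H:3*H], p[3*H]   # w1, b1, w2, b2

def net(p, t):
    w1, b1, w2, b2 = unpack(p)
    z = np.tanh(np.outer(t, w1) + b1)            # (T, H)
    return z @ w2 + b2

def dnet_dt(p, t):
    w1, b1, w2, b2 = unpack(p)
    z = np.tanh(np.outer(t, w1) + b1)
    return (1.0 - z**2) @ (w1 * w2)              # analytic chain rule

def loss(p):
    residual = dnet_dt(p, t) + net(p, t)         # ODE residual y' + y = 0
    ic = net(p, np.array([0.0]))[0] - 1.0        # initial condition y(0) = 1
    return np.mean(residual**2) + ic**2

rng = np.random.default_rng(0)
p0 = rng.normal(scale=0.5, size=3 * H + 1)       # (random, not deterministic) initialization
res = minimize(loss, p0, method="L-BFGS-B")
print("max abs error vs exp(-t):", np.abs(net(res.x, t) - np.exp(-t)).max())
```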

[AI-45] Learning Cross-Domain Representations for Transferable Drug Perturbations on Single-Cell Transcriptional Responses

链接: https://arxiv.org/abs/2412.19228
作者: Hui Liu,Shikai Jin
关键词: identify bioactive molecules, attracted widespread attention, bioactive molecules, attracted widespread, widespread attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Phenotypic drug discovery has attracted widespread attention because of its potential to identify bioactive molecules. Transcriptomic profiling provides a comprehensive reflection of phenotypic changes in cellular responses to external perturbations. In this paper, we propose XTransferCDR, a novel generative framework designed for feature decoupling and transferable representation learning across domains. Given a pair of perturbed expression profiles, our approach decouples the perturbation representations from basal states through domain separation encoders and then cross-transfers them in the latent space. The transferred representations are then used to reconstruct the corresponding perturbed expression profiles via a shared decoder. This cross-transfer constraint effectively promotes the learning of transferable drug perturbation representations. We conducted extensive evaluations of our model on multiple datasets, including single-cell transcriptional responses to drugs and single- and combinatorial genetic perturbations. The experimental results show that XTransferCDR achieved better performance than current state-of-the-art methods, showcasing its potential to advance phenotypic drug discovery.

[AI-46] Optimizing Fantasy Sports Team Selection with Deep Reinforcement Learning

链接: https://arxiv.org/abs/2412.19215
作者: Shamik Bhattacharjee,Kamlesh Marathe,Hitesh Kapoor,Nilesh Patil
关键词: garnered immense popularity, popularity in India, India in recent, recent years, offering enthusiasts
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 Pages including references, Accepted to CODS-COMAD 2024 conference

点击查看摘要

Abstract:Fantasy sports, particularly fantasy cricket, have garnered immense popularity in India in recent years, offering enthusiasts the opportunity to engage in strategic team-building and compete based on the real-world performance of professional athletes. In this paper, we address the challenge of optimizing fantasy cricket team selection using reinforcement learning (RL) techniques. By framing the team creation process as a sequential decision-making problem, we aim to develop a model that can adaptively select players to maximize the team’s potential performance. Our approach leverages historical player data to train RL algorithms, which then predict future performance and optimize team composition. This not only represents a huge business opportunity by enabling more accurate predictions of high-performing teams but also enhances the overall user experience. Through empirical evaluation and comparison with traditional fantasy team drafting methods, we demonstrate the effectiveness of RL in constructing competitive fantasy teams. Our results show that RL-based strategies provide valuable insights into player selection in fantasy sports.

[AI-47] Multi-Attribute Constraint Satisfaction via Language Model Rewriting

链接: https://arxiv.org/abs/2412.19198
作者: Ashutosh Baheti,Debanjana Chakraborty,Faeze Brahman,Ronan Le Bras,Ximing Lu,Nouha Dziri,Yejin Choi,Mark Riedl,Maarten Sap
关键词: Obeying precise constraints, common computational problem, computational problem underlying, problem underlying seemingly, Obeying precise
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Obeying precise constraints on top of multiple external attributes is a common computational problem underlying seemingly different domains, from controlled text generation to protein engineering. Existing language model (LM) controllability methods for multi-attribute constraint satisfaction often rely on specialized architectures or gradient-based classifiers, limiting their flexibility to work with arbitrary black-box evaluators and pretrained models. Current general-purpose large language models, while capable, cannot achieve fine-grained multi-attribute control over external attributes. Thus, we create Multi-Attribute Constraint Satisfaction (MACS), a generalized method capable of finetuning language models on any sequential domain to satisfy user-specified constraints on multiple external real-value attributes. Our method trains LMs as editors by sampling diverse multi-attribute edit pairs from an initial set of paraphrased outputs. During inference, LM iteratively improves upon its previous solution to satisfy constraints for all attributes by leveraging our designed constraint satisfaction reward. We additionally experiment with reward-weighted behavior cloning to further improve the constraint satisfaction rate of LMs. To evaluate our approach, we present a new Fine-grained Constraint Satisfaction (FineCS) benchmark, featuring two challenging tasks: (1) Text Style Transfer, where the goal is to simultaneously modify the sentiment and complexity of reviews, and (2) Protein Design, focusing on modulating fluorescence and stability of Green Fluorescent Proteins (GFP). Our empirical results show that MACS achieves the highest threshold satisfaction in both FineCS tasks, outperforming strong domain-specific baselines. Our work opens new avenues for generalized and real-value multi-attribute control, with implications for diverse applications spanning NLP and bioinformatics.

[AI-48] Provably Efficient Exploration in Reward Machines with Low Regret

链接: https://arxiv.org/abs/2412.19194
作者: Hippolyte Bourel,Anders Jonsson,Odalric-Ambrym Maillard,Chenxiao Ma,Mohammad Sadegh Talebi
关键词: probabilistic reward machines, study reinforcement learning, reward machines, probabilistic reward, study reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 35 pages

点击查看摘要

Abstract:We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge of the task in the form of reward machines is available to the learner. We consider probabilistic reward machines with initially unknown dynamics, and investigate RL under the average-reward criterion, where the learning performance is assessed through the notion of regret. Our main algorithmic contribution is a model-based RL algorithm for decision processes involving probabilistic reward machines that is capable of exploiting the structure induced by such machines. We further derive high-probability and non-asymptotic bounds on its regret and demonstrate the gain in terms of regret over existing algorithms that could be applied, but obliviously to the structure. We also present a regret lower bound for the studied setting. To the best of our knowledge, the proposed algorithm constitutes the first attempt to tailor and analyze regret specifically for RL with probabilistic reward machines.

[AI-49] Mobile Robots through Task-Based Human Instructions using Incremental Curriculum Learning

链接: https://arxiv.org/abs/2412.19159
作者: Muhammad A. Muttaqien,Ayanori Yorozu,Akihisa Ohya
关键词: deep reinforcement learning, task-based human instruction, techniques to facilitate, integration of incremental, deep reinforcement
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores the integration of incremental curriculum learning (ICL) with deep reinforcement learning (DRL) techniques to facilitate mobile robot navigation through task-based human instruction. By adopting a curriculum that mirrors the progressive complexity encountered in human learning, our approach systematically enhances robots’ ability to interpret and execute complex instructions over time. We explore the principles of DRL and its synergy with ICL, demonstrating how this combination not only improves training efficiency but also equips mobile robots with the generalization capability required for navigating through dynamic indoor environments. Empirical results indicate that robots trained with our ICL-enhanced DRL framework outperform those trained without curriculum learning, highlighting the benefits of structured learning progressions in robotic training.

[AI-50] To Predict or Not To Predict? Proportionally Masked Autoencoders for Tabular Data Imputation

链接: https://arxiv.org/abs/2412.19152
作者: Jungkyu Kim,Kibok Lee,Taeyoung Park
关键词: Masked autoencoders, recently demonstrated effectiveness, tabular data imputation, proportional masking strategy, recently demonstrated
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Masked autoencoders (MAEs) have recently demonstrated effectiveness in tabular data imputation. However, due to the inherent heterogeneity of tabular data, the uniform random masking strategy commonly used in MAEs can disrupt the distribution of missingness, leading to suboptimal performance. To address this, we propose a proportional masking strategy for MAEs. Specifically, we first compute the statistics of missingness based on the observed proportions in the dataset, and then generate masks that align with these statistics, ensuring that the distribution of missingness is preserved after masking. Furthermore, we argue that simple MLP-based token mixing offers competitive or often superior performance compared to attention mechanisms while being more computationally efficient, especially in the tabular domain with the inherent heterogeneity. Experimental results validate the effectiveness of the proposed proportional masking strategy across various missing data patterns in tabular datasets. Code is available at: this https URL.
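
A small sketch of the proportional-masking idea (illustrative assumptions only, not the authors' implementation): estimate the observed per-column missingness and sample training masks that follow the same per-column proportions rather than a single uniform rate.

```python
# Sketch of proportional masking for tabular MAE training: masks follow the
# per-column missingness observed in the data instead of a uniform rate.
import numpy as np

def proportional_masks(X, rng):
    """X: 2-D float array with np.nan marking missing entries; returns a boolean mask."""
    col_missing_rate = np.isnan(X).mean(axis=0)            # observed missingness statistics
    # Bernoulli masks per column matching those proportions (True = masked).
    return rng.uniform(size=X.shape) < col_missing_rate

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[rng.uniform(size=X.shape) < [0.05, 0.30, 0.60]] = np.nan   # heterogeneous missingness
masks = proportional_masks(X, rng)
# The generated mask rates should roughly match the observed missingness per column.
print(np.isnan(X).mean(axis=0).round(2), masks.mean(axis=0).round(2))
```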

[AI-51] A Rhetorical Relations-Based Framework for Tailored Multimedia Document Summarization

链接: https://arxiv.org/abs/2412.19133
作者: Azze-Eddine Maredj,Madjid Sadallah
关键词: rapidly evolving landscape, presents intricate challenges, summarizing multimedia documents, encompass textual, auditory elements
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 10 pages, preprint

点击查看摘要

Abstract:In the rapidly evolving landscape of digital content, the task of summarizing multimedia documents, which encompass textual, visual, and auditory elements, presents intricate challenges. These challenges include extracting pertinent information from diverse formats, maintaining the structural integrity and semantic coherence of the original content, and generating concise yet informative summaries. This paper introduces a novel framework for multimedia document summarization that capitalizes on the inherent structure of the document to craft coherent and succinct summaries. Central to this framework is the incorporation of a rhetorical structure for structural analysis, augmented by a graph-based representation to facilitate the extraction of pivotal information. Weighting algorithms are employed to assign significance values to document units, thereby enabling effective ranking and selection of relevant content. Furthermore, the framework is designed to accommodate user preferences and time constraints, ensuring the production of personalized and contextually relevant summaries. The summarization process is elaborately delineated, encompassing document specification, graph construction, unit weighting, and summary extraction, supported by illustrative examples and algorithmic elucidation. This proposed framework represents a significant advancement in automatic summarization, with broad potential applications across multimedia document processing, promising transformative impacts in the field.

[AI-52] Discrete vs. Continuous Trade-offs for Generative Models

链接: https://arxiv.org/abs/2412.19114
作者: Jathin Korrapati,Tanish Baranwal,Rahul Shah
关键词: denoising diffusion probabilistic, complex data distributions, score-based generative models, leverage stochastic processes, Brownian motion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Numerical Analysis (math.NA)
*备注: 16 pages, 6 figures, includes theoretical analysis, experimental results, and proofs of key results

点击查看摘要

Abstract:This work explores the theoretical and practical foundations of denoising diffusion probabilistic models (DDPMs) and score-based generative models, which leverage stochastic processes and Brownian motion to model complex data distributions. These models employ forward and reverse diffusion processes defined through stochastic differential equations (SDEs) to iteratively add and remove noise, enabling high-quality data generation. By analyzing the performance bounds of these models, we demonstrate how score estimation errors propagate through the reverse process and bound the total variation distance using discrete Girsanov transformations, Pinsker’s inequality, and the data processing inequality (DPI) for an information theoretic lens.
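
For reference, the forward and reverse processes the abstract refers to are usually written as the following SDE pair (standard score-based formulation quoted as background, not taken verbatim from this paper):

```latex
% Forward (noising) SDE and its reverse-time counterpart, with the score
% \nabla_x \log p_t(x) estimated by a learned network in practice.
\begin{align}
  \mathrm{d}x &= f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t, \\
  \mathrm{d}x &= \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] \mathrm{d}t
                 + g(t)\,\mathrm{d}\bar{W}_t .
\end{align}
```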

[AI-53] Graph Mixture of Experts and Memory-augmented Routers for Multivariate Time Series Anomaly Detection AAAI2025

链接: https://arxiv.org/abs/2412.19108
作者: Xiaoyu Huang(1 and 2),Weidong Chen(1),Bo Hu(1),Zhendong Mao(1)
关键词: involves identifying abnormal, identifying abnormal patterns, multiple interrelated time, Multivariate time series, critical task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Multivariate time series (MTS) anomaly detection is a critical task that involves identifying abnormal patterns or events in data that consist of multiple interrelated time series. In order to better model the complex interdependence between entities and the various inherent characteristics of each entity, the GNN based methods are widely adopted by existing methods. In each layer of GNN, node features aggregate information from their neighboring nodes to update their information. In doing so, from shallow layer to deep layer in GNN, original individual node features continue to be weakened and more structural information,i.e., from short-distance neighborhood to long-distance neighborhood, continues to be enhanced. However, research to date has largely ignored the understanding of how hierarchical graph information is represented and their characteristics that can benefit anomaly detection. Existing methods simply leverage the output from the last layer of GNN for anomaly estimation while neglecting the essential information contained in the intermediate GNN layers. To address such limitations, in this paper, we propose a Graph Mixture of Experts (Graph-MoE) network for multivariate time series anomaly detection, which incorporates the mixture of experts (MoE) module to adaptively represent and integrate hierarchical multi-layer graph information into entity representations. It is worth noting that our Graph-MoE can be integrated into any GNN-based MTS anomaly detection method in a plug-and-play manner. In addition, the memory-augmented routers are proposed in this paper to capture the correlation temporal information in terms of the global historical features of MTS to adaptively weigh the obtained entity representations to achieve successful anomaly estimation. Extensive experiments on five challenging datasets prove the superiority of our approach and each proposed module.

[AI-54] TrajGEOS: Trajectory Graph Enhanced Orientation-based Sequential Network for Mobility Prediction

链接: https://arxiv.org/abs/2412.19092
作者: Zhaoping Hu,Zongyuan Huang,Jinming Yang,Tao Yang,Yaohui Jin,Yanyan Xu
关键词: Human mobility studies, Human mobility, human mobility modeling, location-based services, studies how people
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Human mobility studies how people move to access their needed resources and plays a significant role in urban planning and location-based services. As a paramount task of human mobility modeling, next location prediction is challenging because of the diversity of users’ historical trajectories that gives rise to complex mobility patterns and various contexts. Deep sequential models have been widely used to predict the next location by leveraging the inherent sequentiality of trajectory data. However, they do not fully leverage the relationship between locations and fail to capture users’ multi-level preferences. This work constructs a trajectory graph from users’ historical traces and proposes a Trajectory Graph Enhanced Orientation-based Sequential network (TrajGEOS) for next-location prediction tasks. TrajGEOS introduces hierarchical graph convolution to capture location and user embeddings. Such embeddings consider not only the contextual feature of locations but also the relation between them, and serve as additional features in downstream modules. In addition, we design an orientation-based module to learn users’ mid-term preferences from sequential modeling modules and their recent trajectories. Extensive experiments on three real-world LBSN datasets corroborate the value of graph and orientation-based modules and demonstrate that TrajGEOS outperforms the state-of-the-art methods on the next location prediction task.

[AI-55] Hierarchical Multi-agent Meta-Reinforcement Learning for Cross-channel Bidding

链接: https://arxiv.org/abs/2412.19064
作者: Shenghong He,Chao Yu
关键词: online advertising ecosystems, Real-time bidding, plays a pivotal, pivotal role, role in online
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-time bidding (RTB) plays a pivotal role in online advertising ecosystems. Advertisers employ strategic bidding to optimize their advertising impact while adhering to various financial constraints, such as the return-on-investment (ROI) and cost-per-click (CPC). Primarily focusing on bidding with fixed budget constraints, traditional approaches cannot effectively manage the dynamic budget allocation problem where the goal is to achieve global optimization of bidding performance across multiple channels with a shared budget. In this paper, we propose a hierarchical multi-agent reinforcement learning framework for multi-channel bidding optimization. In this framework, the top-level strategy applies a CPC constrained diffusion model to dynamically allocate budgets among the channels according to their distinct features and complex interdependencies, while the bottom-level strategy adopts a state-action decoupled actor-critic method to address the problem of extrapolation errors in offline learning caused by out-of-distribution actions and a context-based meta-channel knowledge learning method to improve the state representation capability of the policy based on the shared knowledge among different channels. Comprehensive experiments conducted on a large scale real-world industrial dataset from the Meituan ad bidding platform demonstrate that our method achieves a state-of-the-art performance.

[AI-56] CL-attack: Textual Backdoor Attacks via Cross-Lingual Triggers AAAI2025

链接: https://arxiv.org/abs/2412.19037
作者: Jingyi Zheng,Tianyi Hu,Tianshuo Cong,Xinlei He
关键词: attacks significantly compromise, controlled content, Backdoor attacks significantly, significantly compromise, compromise the security
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: The paper has been accepted to AAAI 2025

点击查看摘要

Abstract:Backdoor attacks significantly compromise the security of large language models by triggering them to output specific and controlled content. Currently, triggers for textual backdoor attacks fall into two categories: fixed-token triggers and sentence-pattern triggers. However, the former are typically easy to identify and filter, while the latter, such as syntax and style, do not apply to all original samples and may lead to semantic shifts. In this paper, inspired by cross-lingual (CL) prompts of LLMs in real-world scenarios, we propose a higher-dimensional trigger method at the paragraph level, namely CL-attack. CL-attack injects the backdoor by using texts with specific structures that incorporate multiple languages, thereby offering greater stealthiness and universality compared to existing backdoor attack techniques. Extensive experiments on different tasks and model architectures demonstrate that CL-attack can achieve nearly 100% attack success rate with a low poisoning rate in both classification and generation tasks. We also empirically show that the CL-attack is more robust against current major defense methods compared to baseline backdoor attacks. Additionally, to mitigate CL-attack, we further develop a new defense called TranslateDefense, which can partially mitigate the impact of CL-attack.

[AI-57] Repository Structure-Aware Training Makes SLMs Better Issue Resolver

链接: https://arxiv.org/abs/2412.19031
作者: Zexiong Ma,Shengnan An,Zeqi Lin,Yanzhen Zou,Bing Xie
关键词: Small Language Models, Language models, outperform Small Language, software development tasks, Large Language Models
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Language models have been applied to various software development tasks, but the performance varies according to the scale of the models. Large Language Models (LLMs) outperform Small Language Models (SLMs) in complex tasks like repository-level issue resolving, but raise concerns about privacy and cost. In contrast, SLMs are more accessible but under-perform in complex tasks. In this paper, we introduce ReSAT (Repository Structure-Aware Training), which constructs training data based on a large number of issues and corresponding pull requests from open-source communities to enhance the model’s understanding of repository structure and its issue-resolving ability. We construct two types of training data: (1) localization training data, multi-level progressive localization data to improve code understanding and localization capability; (2) code edit training data, which improves context-based code editing capability. The evaluation results on SWE-Bench-verified and RepoQA demonstrate that ReSAT effectively enhances SLMs’ issue-resolving and repository-level long-context understanding capabilities.

[AI-58] A theory of appropriateness with applications to generative artificial intelligence

链接: https://arxiv.org/abs/2412.19010
作者: Joel Z. Leibo,Alexander Sasha Vezhnevets,Manfred Diaz,John P. Agapiou,William A. Cunningham,Peter Sunehag,Julia Haas,Raphael Koster,Edgar A. Duéñez-Guzmán,William S. Isaac,Georgios Piliouras,Stanley M. Bileschi,Iyad Rahwan,Simon Osindero
关键词: Abstract, appropriateness, behavior, decision, making
类目: Artificial Intelligence (cs.AI)
*备注: 115 pages, 2 figures

点击查看摘要

Abstract:What is appropriateness? Humans navigate a multi-scale mosaic of interlocking notions of what is appropriate for different situations. We act one way with our friends, another with our family, and yet another in the office. Likewise for AI, appropriate behavior for a comedy-writing assistant is not the same as appropriate behavior for a customer-service representative. What determines which actions are appropriate in which contexts? And what causes these standards to change over time? Since all judgments of AI appropriateness are ultimately made by humans, we need to understand how appropriateness guides human decision making in order to properly evaluate AI decision making and improve it. This paper presents a theory of appropriateness: how it functions in human society, how it may be implemented in the brain, and what it means for responsible deployment of generative AI technology.

[AI-59] Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs DATE2025

链接: https://arxiv.org/abs/2412.19002
作者: Prabhu Vellaisamy,Harideep Nair,Thomas Kang,Yichen Ni,Haoyang Fan,Bin Qi,Jeff Chen,Shawn Blanton,John Paul Shen
关键词: poses significant challenges, deep neural networks, Tempus Core, Tempus Core PCU, neural networks
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: Accepted in DATE 2025

点击查看摘要

Abstract:The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLA) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with highly scalable unary-based PE array comprising of tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA’s open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core’s PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA’s CMAC unit. Considering a 16x16 PE array in Tempus Core, area and power improves by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place and route analysis of Tempus Core’s PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017 mm^2 die area and consumes only 6.2mW of total power. We demonstrate that area-power efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the path for efficient unary hardware for edge AI inference.

[AI-60] How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study

链接: https://arxiv.org/abs/2412.18989
作者: Alejandro Velasco,Daniel Rodriguez-Cardenas,David N. Palacio,Luftar Rahman Alif,Denys Poshyvanyk
关键词: Large Language Models, Large Language, shown significant potential, automating software engineering, Language Models
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant potential in automating software engineering tasks, particularly in code generation. However, current evaluation benchmarks, which primarily focus on accuracy, fall short in assessing the quality of the code generated by these models, specifically their tendency to produce code smells. To address this limitation, we introduce CodeSmellEval, a benchmark designed to evaluate the propensity of LLMs for generating code smells. Our benchmark includes a novel metric: Propensity Smelly Score (PSC), and a curated dataset of method-level code smells: CodeSmellData. To demonstrate the use of CodeSmellEval, we conducted a case study with two state-of-the-art LLMs, CodeLlama and Mistral. The results reveal that both models tend to generate code smells, such as simplifiable-condition and consider-merging-isinstance. These findings highlight the effectiveness of our benchmark in evaluating LLMs, providing valuable insights into their reliability and their propensity to introduce code smells in code generation tasks.

[AI-61] TravelAgent: Generative Agents in the Built Environment

链接: https://arxiv.org/abs/2412.18985
作者: Ariel Noyman,Kai Hu,Kent Larson
关键词: user centered urban, centered urban spaces, designing functional, user centered, critical for designing
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 21 pages 9 figs

点击查看摘要

Abstract:Understanding human behavior in built environments is critical for designing functional, user centered urban spaces. Traditional approaches, such as manual observations, surveys, and simplified simulations, often fail to capture the complexity and dynamics of real world behavior. To address these limitations, we introduce TravelAgent, a novel simulation platform that models pedestrian navigation and activity patterns across diverse indoor and outdoor environments under varying contextual and environmental conditions. TravelAgent leverages generative agents integrated into 3D virtual environments, enabling agents to process multimodal sensory inputs and exhibit human-like decision-making, behavior, and adaptation. Through experiments, including navigation, wayfinding, and free exploration, we analyze data from 100 simulations comprising 1898 agent steps across diverse spatial layouts and agent archetypes, achieving an overall task completion rate of 76%. Using spatial, linguistic, and sentiment analyses, we show how agents perceive, adapt to, or struggle with their surroundings and assigned tasks. Our findings highlight the potential of TravelAgent as a tool for urban design, spatial cognition research, and agent-based modeling. We discuss key challenges and opportunities in deploying generative agents for the evaluation and refinement of spatial designs, proposing TravelAgent as a new paradigm for simulating and understanding human experiences in built environments.

[AI-62] Injecting Bias into Text Classification Models using Backdoor Attacks

链接: https://arxiv.org/abs/2412.18975
作者: A. Dilara Yavuz,M. Emre Gursoy
关键词: natural language processing, enabled accurate text, pre-trained language models, language processing, accurate text classification
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid growth of natural language processing (NLP) and pre-trained language models have enabled accurate text classification in a variety of settings. However, text classification models are susceptible to backdoor attacks, where an attacker embeds a trigger into the victim model to make the model predict attacker-desired labels in targeted scenarios. In this paper, we propose to utilize backdoor attacks for a new purpose: bias injection. We develop a backdoor attack in which a subset of the training dataset is poisoned to associate strong male actors with negative sentiment. We execute our attack on two popular text classification datasets (IMDb and SST) and seven different models ranging from traditional Doc2Vec-based models to LSTM networks and modern transformer-based BERT and RoBERTa models. Our results show that the reduction in backdoored models’ benign classification accuracy is limited, implying that our attacks remain stealthy, whereas the models successfully learn to associate strong male actors with negative sentiment (100% attack success rate with = 3% poison rate). Attacks on BERT and RoBERTa are particularly more stealthy and effective, demonstrating an increased risk of using modern and larger models. We also measure the generalizability of our bias injection by proposing two metrics: (i) U-BBSR which uses previously unseen words when measuring attack success, and (ii) P-BBSR which measures attack success using paraphrased test samples. U-BBSR and P-BBSR results show that the bias injected by our attack can go beyond memorizing a trigger phrase.

[AI-63] Recommending Pre-Trained Models for IoT Devices

链接: https://arxiv.org/abs/2412.18972
作者: Parth V. Patil,Wenxin Jiang,Huiyun Peng,Daniel Lugo,Kelechi G. Kalu,Josh LeBlanc,Lawrence Smith,Hyeonwoo Heo,Nathanael Aou,James C. Davis
关键词: enabled faster deployment, extensive training, availability of pre-trained, enabled faster, faster deployment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: Accepted at SERP4IOT’25

点击查看摘要

Abstract:The availability of pre-trained models (PTMs) has enabled faster deployment of machine learning across applications by reducing the need for extensive training. Techniques like quantization and distillation have further expanded PTM applicability to resource-constrained IoT hardware. Given the many PTM options for any given task, engineers often find it too costly to evaluate each model’s suitability. Approaches such as LogME, LEEP, and ModelSpider help streamline model selection by estimating task relevance without exhaustive tuning. However, these methods largely leave hardware constraints as future work, a significant limitation in IoT settings. In this paper, we identify the limitations of current model recommendation approaches regarding hardware constraints and introduce a novel, hardware-aware method for PTM selection. We also propose a research agenda to guide the development of effective, hardware-conscious model recommendation systems for IoT applications.

[AI-64] Bridging Interpretability and Robustness Using LIME-Guided Model Refinement

链接: https://arxiv.org/abs/2412.18952
作者: Navid Nayyem,Abdullah Rakin,Longwei Wang
关键词: deep learning models, deep learning, paper explores, explores the intricate, intricate relationship
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 15 figures

点击查看摘要

Abstract:This paper explores the intricate relationship between interpretability and robustness in deep learning models. Despite their remarkable performance across various tasks, deep learning models often exhibit critical vulnerabilities, including susceptibility to adversarial attacks, over-reliance on spurious correlations, and a lack of transparency in their decision-making processes. To address these limitations, we propose a novel framework that leverages Local Interpretable Model-Agnostic Explanations (LIME) to systematically enhance model robustness. By identifying and mitigating the influence of irrelevant or misleading features, our approach iteratively refines the model, penalizing reliance on these features during training. Empirical evaluations on multiple benchmark datasets demonstrate that LIME-guided refinement not only improves interpretability but also significantly enhances resistance to adversarial perturbations and generalization to out-of-distribution data.

[AI-65] Constraint-Adaptive Policy Switching for Offline Safe Reinforcement Learning

链接: https://arxiv.org/abs/2412.18946
作者: Yassine Chemingui,Aryan Deshwal,Honghao Wei,Alan Fern,Janardhan Rao Doppa
关键词: safe reinforcement learning, Offline safe reinforcement, pre-defined safety constraints, involves learning, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Offline safe reinforcement learning (OSRL) involves learning a decision-making policy to maximize rewards from a fixed batch of training data while satisfying pre-defined safety constraints. However, adapting to varying safety constraints during deployment without retraining remains an under-explored challenge. To address this challenge, we introduce constraint-adaptive policy switching (CAPS), a wrapper framework around existing offline RL algorithms. During training, CAPS uses offline data to learn multiple policies with a shared representation that optimize different reward and cost trade-offs. During testing, CAPS switches between those policies by selecting at each state the policy that maximizes future rewards among those that satisfy the current cost constraint. Our experiments on 38 tasks from the DSRL benchmark demonstrate that CAPS consistently outperforms existing methods, establishing a strong wrapper-based baseline for OSRL. The code is publicly available at this https URL.
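
The test-time switching rule can be sketched in a few lines (hypothetical interfaces; the real CAPS policies and value estimates come from offline training):

```python
# Sketch of constraint-adaptive policy switching at test time: among the policies
# whose predicted future cost satisfies the current budget, act with the one that
# predicts the highest future reward. Policy/critic objects are hypothetical stubs.
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class CandidatePolicy:
    act: Callable[[Any], Any]             # state -> action
    reward_value: Callable[[Any], float]  # state -> predicted future reward
    cost_value: Callable[[Any], float]    # state -> predicted future cost

def caps_action(state, policies: Sequence[CandidatePolicy], cost_budget: float):
    feasible = [p for p in policies if p.cost_value(state) <= cost_budget]
    pool = feasible if feasible else policies          # fall back if nothing is feasible
    best = max(pool, key=lambda p: p.reward_value(state))
    return best.act(state)

# Toy usage with constant stubs standing in for learned policies and critics.
safe = CandidatePolicy(act=lambda s: "cautious", reward_value=lambda s: 1.0, cost_value=lambda s: 0.2)
bold = CandidatePolicy(act=lambda s: "aggressive", reward_value=lambda s: 3.0, cost_value=lambda s: 0.9)
print(caps_action(state=None, policies=[safe, bold], cost_budget=0.5))   # -> "cautious"
```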

[AI-66] Exemplar-condensed Federated Class-incremental Learning

链接: https://arxiv.org/abs/2412.18926
作者: Rui Sun,Yumin Zhang,Varun Ojha,Tejal Shah,Haoran Duan,Bo Wei,Rajiv Ranjan
关键词: propose Exemplar-Condensed federated, Exemplar-Condensed federated class-incremental, informative rehearsal exemplars, federated class-incremental learning, federated continual learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose Exemplar-Condensed federated class-incremental learning (ECoral) to distil the training characteristics of real images from streaming data into informative rehearsal exemplars. The proposed method eliminates the limitations of exemplar selection in replay-based approaches for mitigating catastrophic forgetting in federated continual learning (FCL). These limitations are particularly related to the heterogeneity of the information density of the summarized data. Our approach maintains the consistency of training gradients and the relationship to past tasks for the summarized exemplars to represent the streaming data compared to the original images effectively. Additionally, our approach reduces the information-level heterogeneity of the summarized data by inter-client sharing of the disentanglement generative model. Extensive experiments show that our ECoral outperforms several state-of-the-art methods and can be seamlessly integrated with many existing approaches to enhance performance.

[AI-67] Long-Range Tasks Using Short-Context LLMs: Incremental Reasoning With Structured Memories

链接: https://arxiv.org/abs/2412.18914
作者: Dulhan Jayalath,James Bradley Wendt,Nicholas Monath,Sandeep Tata,Beliz Gunel
关键词: Long-range tasks require, tasks require reasoning, long inputs, require reasoning, reasoning over long
类目: Artificial Intelligence (cs.AI)
*备注: 23 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Long-range tasks require reasoning over long inputs. Existing solutions either need large compute budgets, training data, access to model weights, or use complex, task-specific approaches. We present PRISM, which alleviates these concerns by processing information as a stream of chunks, maintaining a structured in-context memory specified by a typed hierarchy schema. This approach demonstrates superior performance to baselines on diverse tasks while using at least 4x smaller contexts than long-context models. Moreover, PRISM is token-efficient. By producing short outputs and efficiently leveraging key-value (KV) caches, it achieves up to 54% cost reduction when compared to alternative short-context approaches. The method also scales down to tiny information chunks (e.g., 500 tokens) without increasing the number of tokens encoded or sacrificing quality. Furthermore, we show that it is possible to generate schemas to generalize our approach to new tasks with minimal effort.
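
The chunk-streaming loop with a structured memory can be sketched roughly as follows (the llm() call and the memory schema are invented placeholders, not the PRISM implementation):

```python
# Rough sketch of incremental reasoning over chunks with a typed, structured memory.
# llm() is a hypothetical short-context model call; the JSON memory schema is made up
# for illustration and does not reproduce PRISM's schemas.
import json
from typing import List

def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a short-context LLM call")

def process_long_input(chunks: List[str], question: str) -> str:
    memory = {"entities": [], "notes": []}          # structured in-context memory
    for chunk in chunks:
        prompt = (
            "Current memory (JSON):\n" + json.dumps(memory) +
            "\n\nNew chunk:\n" + chunk +
            "\n\nUpdate the memory JSON with anything relevant to: " + question +
            "\nReturn only the updated JSON."
        )
        memory = json.loads(llm(prompt))            # memory stays short and structured
    final_prompt = ("Memory (JSON):\n" + json.dumps(memory) +
                    "\n\nAnswer the question: " + question)
    return llm(final_prompt)
```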

[AI-68] GAI: Generative Agents for Innovation

链接: https://arxiv.org/abs/2412.18899
作者: Masahiro Sato
关键词: examines whether collective, collective reasoning, thinking that leads, multiple generative agents, generative agents
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study examines whether collective reasoning among generative agents can facilitate novel and coherent thinking that leads to innovation. To achieve this, it proposes GAI, a new LLM-empowered framework designed for reflection and interaction among multiple generative agents to replicate the process of innovation. The core of the GAI framework lies in an architecture that dynamically processes the internal states of agents and a dialogue scheme specifically tailored to facilitate analogy-driven innovation. The framework’s functionality is evaluated using Dyson’s invention of the bladeless fan as a case study, assessing the extent to which the core ideas of the innovation can be replicated through a set of fictional technical documents. The experimental results demonstrate that models with internal states significantly outperformed those without, achieving higher average scores and lower variance. Notably, the model with five heterogeneous agents equipped with internal states successfully replicated the key ideas underlying the Dyson’s invention. This indicates that the internal state enables agents to refine their ideas, resulting in the construction and sharing of more coherent and comprehensive concepts.

[AI-69] CoEvo: Continual Evolution of Symbolic Solutions Using Large Language Models

链接: https://arxiv.org/abs/2412.18890
作者: Ping Guo,Qingfu Zhang,Xi Lin
关键词: Large Language Models, understanding extensive human, Large Language, Language Models, artificial intelligence
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence, capable of processing and understanding extensive human knowledge to enhance problem-solving across various domains. This paper explores the potential of LLMs to drive the discovery of symbolic solutions within scientific and engineering disciplines, where such solutions are crucial for advancing theoretical and practical applications. We propose a novel framework that utilizes LLMs in an evolutionary search methodology, augmented by a dynamic knowledge library that integrates and refines insights in an open-ended manner. This approach aims to tackle the dual challenges of efficiently navigating complex symbolic representation spaces and leveraging both existing and newly generated knowledge to foster open-ended innovation. By enabling LLMs to interact with and expand upon a knowledge library, we facilitate the continuous generation of novel solutions in diverse forms such as language, code, and mathematical expressions. Our experimental results demonstrate that this method not only enhances the efficiency of searching for symbolic solutions but also supports the ongoing discovery process, akin to human scientific endeavors. This study represents a first effort in conceptualizing the search for symbolic solutions as a lifelong, iterative process, marking a significant step towards harnessing AI in the perpetual pursuit of scientific and engineering breakthroughs. We have open-sourced our code and data, please visit this https URL for more information.

[AI-70] Computing Approximate Graph Edit Distance via Optimal Transport SIGMOD2025

链接: https://arxiv.org/abs/2412.18857
作者: Qihao Cheng,Da Yan,Tianhao Wu,Zhongyi Huang,Qin Zhang
关键词: edit operations converting, minimum number, GED, graph edit distance, optimal transport
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by SIGMOD2025. 26 pages, 21 figures

点击查看摘要

Abstract:Given a graph pair (G^1, G^2), graph edit distance (GED) is defined as the minimum number of edit operations converting G^1 to G^2. GED is a fundamental operation widely used in many applications, but its exact computation is NP-hard, so the approximation of GED has gained a lot of attention. Data-driven learning-based methods have been found to provide superior results compared to classical approximate algorithms, but they directly fit the coupling relationship between a pair of vertices from their vertex features. We argue that while pairwise vertex features can capture the coupling cost (discrepancy) of a pair of vertices, the vertex coupling matrix should be derived from the vertex-pair cost matrix through a more well-established method that is aware of the global context of the graph pair, such as optimal transport. In this paper, we propose an ensemble approach that integrates a supervised learning-based method and an unsupervised method, both based on optimal transport. Our learning method, GEDIOT, is based on inverse optimal transport that leverages a learnable Sinkhorn algorithm to generate the coupling matrix. Our unsupervised method, GEDGW, models GED computation as a linear combination of optimal transport and its variant, Gromov-Wasserstein discrepancy, for node and edge operations, respectively, which can be solved efficiently without needing the ground truth. Our ensemble method, GEDHOT, combines GEDIOT and GEDGW to further boost the performance. Extensive experiments demonstrate that our methods significantly outperform the existing methods in terms of the performance of GED computation, edit path generation, and model generalizability.
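
As background for the optimal-transport view (standard entropic OT formulation quoted here, not taken from the paper), the coupling over a vertex-pair cost matrix C and the Sinkhorn scaling updates the learnable variant builds on read:

```latex
% Entropic optimal transport over the vertex-pair cost matrix C with marginals a, b;
% K = \exp(-C/\varepsilon), H(T) is the entropy, and (u, v) are Sinkhorn scaling vectors.
\begin{align}
  T^\star &= \arg\min_{T \in U(a, b)} \ \langle T, C \rangle - \varepsilon \, H(T),
  \qquad U(a,b) = \{\, T \ge 0 : T\mathbf{1} = a,\; T^{\top}\mathbf{1} = b \,\}, \\
  u^{(k+1)} &= a \oslash \bigl(K v^{(k)}\bigr), \qquad
  v^{(k+1)} = b \oslash \bigl(K^{\top} u^{(k+1)}\bigr), \qquad
  T^{(k)} = \operatorname{diag}\bigl(u^{(k)}\bigr)\, K \,\operatorname{diag}\bigl(v^{(k)}\bigr).
\end{align}
```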

[AI-71] Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset ICASSP2025

链接: https://arxiv.org/abs/2412.18839
作者: Neil Shah,Shirish Karande,Vineet Gandhi
关键词: Current Non-Audible Murmur, to-speech techniques rely, Current Non-Audible, Non-Audible Murmur, to-speech techniques
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted at IEEE ICASSP 2025

点击查看摘要

Abstract:Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at this https URL

[AI-72] MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI ICASSP2025

链接: https://arxiv.org/abs/2412.18836
作者: Neil Shah,Ayan Kashyap,Shirish Karande,Vineet Gandhi
关键词: Previous real-time MRI, Previous real-time, synthesis models depend, models depend heavily, based speech synthesis
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted at IEEE ICASSP 2025

点击查看摘要

Abstract:Previous real-time MRI (rtMRI)-based speech synthesis models depend heavily on noisy ground-truth speech. Applying loss directly over ground truth mel-spectrograms entangles speech content with MRI noise, resulting in poor intelligibility. We introduce a novel approach that adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI and incorporates a new flow-based duration predictor for speaker-specific alignment. The predicted text and durations are then used by a speech decoder to synthesize aligned speech in any novel voice. We conduct thorough experiments on two datasets and demonstrate our method’s generalization ability to unseen speakers. We assess our framework’s performance by masking parts of the rtMRI video to evaluate the impact of different articulators on text prediction. Our method achieves a 15.18% Word Error Rate (WER) on the USC-TIMIT MRI corpus, marking a huge improvement over the current state-of-the-art. Speech samples are available at this https URL

[AI-73] LLM-assisted vector similarity search

链接: https://arxiv.org/abs/2412.18819
作者: Md Riyadh,Muqi Li,Felix Haryanto Lie,Jia Long Loh,Haotian Mi,Sayam Bohra
关键词: data retrieval demands, Vector similarity search, Vector similarity, demands become increasingly, fall short
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As data retrieval demands become increasingly complex, traditional search methods often fall short in addressing nuanced and conceptual queries. Vector similarity search has emerged as a promising technique for finding semantically similar information efficiently. However, its effectiveness diminishes when handling intricate queries with contextual nuances. This paper explores a hybrid approach combining vector similarity search with Large Language Models (LLMs) to enhance search accuracy and relevance. The proposed two-step solution first employs vector similarity search to shortlist potential matches, followed by an LLM for context-aware ranking of the results. Experiments on structured datasets demonstrate that while vector similarity search alone performs well for straightforward queries, the LLM-assisted approach excels in processing complex queries involving constraints, negations, or conceptual requirements. By leveraging the natural language understanding capabilities of LLMs, this method improves the accuracy of search results for complex tasks without sacrificing efficiency. We also discuss real-world applications and propose directions for future research to refine and scale this technique for diverse datasets and use cases. Original article: this https URL
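
The two-step flow described (vector shortlist, then LLM re-ranking) can be pictured with the minimal sketch below. This is not the paper's implementation: `llm_score` is a hypothetical callback standing in for whatever LLM client a deployment actually uses, and similarity is plain cosine over NumPy arrays.

```python
import numpy as np

def cosine_topk(query_vec, doc_vecs, k=20):
    """Step 1: shortlist candidate documents by cosine similarity."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return np.argsort(-sims)[:k]

def llm_rerank(query_text, candidate_texts, llm_score):
    """Step 2: score each shortlisted candidate against the full query
    (constraints, negations, conceptual requirements) and sort by that score.
    `llm_score(query, doc) -> float` is a placeholder, not a real API."""
    scored = [(llm_score(query_text, doc), doc) for doc in candidate_texts]
    return [doc for _, doc in sorted(scored, key=lambda s: s[0], reverse=True)]
```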

[AI-74] GSAVS: Gaussian Splatting-based Autonomous Vehicle Simulator

链接: https://arxiv.org/abs/2412.18816
作者: Rami Wilson
关键词: Modern autonomous vehicle, Modern autonomous, feature an ever-growing, ever-growing library, vehicle simulators feature
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern autonomous vehicle simulators feature an ever-growing library of assets, including vehicles, buildings, roads, pedestrians, and more. While this level of customization proves beneficial when creating virtual urban environments, this process becomes cumbersome when intending to train within a digital twin or a duplicate of a real scene. Gaussian splatting emerged as a powerful technique in scene reconstruction and novel view synthesis, boasting high fidelity and rendering speeds. In this paper, we introduce GSAVS, an autonomous vehicle simulator that supports the creation and development of autonomous vehicle models. Every asset within the simulator is a 3D Gaussian splat, including the vehicles and the environment. However, the simulator runs within a classical 3D engine, rendering 3D Gaussian splats in real-time. This allows the simulator to utilize the photorealism that 3D Gaussian splatting boasts while providing the customization and ease of use of a classical 3D engine.

[AI-75] Ister: Inverted Seasonal-Trend Decomposition Transformer for Explainable Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2412.18798
作者: Fanpu Cao,Shu Yang,Zhengjian Chen,Ye Liu,Laizhong Cui
关键词: achieved great success, capture long-range dependencies, great success, long-range dependencies, achieved great
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In long-term time series forecasting, Transformer-based models have achieved great success due to their ability to capture long-range dependencies. However, existing Transformer-based methods face challenges in accurately identifying which variables play a pivotal role in the prediction process and tend to overemphasize noisy channels, thereby limiting the interpretability and practical effectiveness of the models. They also face scalability issues due to the quadratic computational complexity of self-attention. In this paper, we propose a new model named Inverted Seasonal-Trend Decomposition Transformer (Ister), which addresses these challenges in long-term multivariate time series forecasting by designing an improved Transformer-based structure. Ister first decomposes the original time series into seasonal and trend components. Then we propose a new Dot-attention mechanism to process the seasonal component, which improves accuracy, computational efficiency, and interpretability. Upon completion of the training phase, it allows users to intuitively visualize the significance of each feature in the overall prediction. We conduct comprehensive experiments, and the results show that Ister achieves state-of-the-art (SOTA) performance on multiple datasets, surpassing existing models in long-term prediction tasks.
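
As a rough illustration of the decomposition step the abstract describes (not Ister's actual module), the snippet below splits a series into a moving-average trend and a seasonal remainder; the period and the synthetic series are assumptions chosen for the example.

```python
import numpy as np

def seasonal_trend_decompose(x, period=24):
    """Split a 1-D series into a trend (centered moving average over one
    period) and a seasonal remainder, the classic first step that
    decomposition-based forecasters build on."""
    kernel = np.ones(period) / period
    trend = np.convolve(x, kernel, mode="same")
    seasonal = x - trend
    return trend, seasonal

t = np.arange(500)
x = 0.01 * t + np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(500)
trend, seasonal = seasonal_trend_decompose(x, period=24)
```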

[AI-76] Torque-Aware Momentum

链接: https://arxiv.org/abs/2412.18790
作者: Pranshu Malviya,Goncalo Mordido,Aristide Baratin,Reza Babanezhad Harikandeh,Gintare Karolina Dziugaite,Razvan Pascanu,Sarath Chandar
关键词: Efficiently exploring complex, deep neural networks, exploring complex loss, complex loss landscapes, Efficiently exploring
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.
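
The abstract says TAM damps the new gradient based on its angle to the previous momentum. The sketch below is a minimal NumPy step in that spirit; the specific damping form (a shifted cosine) and all hyperparameters are assumptions for illustration, not the paper's exact formula.

```python
import numpy as np

def tam_step(params, grad, momentum, lr=0.01, beta=0.9, eps=1e-12):
    """One momentum step where the new gradient's contribution is damped by
    how well it aligns with the existing momentum direction."""
    cos = np.dot(grad, momentum) / (
        np.linalg.norm(grad) * np.linalg.norm(momentum) + eps)
    damping = 0.5 * (1.0 + cos)          # in [0, 1]: 1 if aligned, 0 if opposed
    momentum = beta * momentum + damping * grad
    return params - lr * momentum, momentum

# Toy usage on a 3-parameter vector.
w, m = np.zeros(3), np.zeros(3)
w, m = tam_step(w, np.array([1.0, -2.0, 0.5]), m)
```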

[AI-77] Data clustering: an essential technique in data science

链接: https://arxiv.org/abs/2412.18760
作者: Wong Hauchi,Daniil Lisik,Tai Dinh
关键词: emphasizing its methodologies, comprehensive exploration, clustering, Abstract, data
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper provides a comprehensive exploration of data clustering, emphasizing its methodologies and applications across different fields. Traditional techniques, including partitional and hierarchical clustering, are discussed alongside other approaches such as data stream, subspace and network clustering, highlighting their role in addressing complex, high-dimensional datasets. The paper also reviews the foundational principles of clustering, introduces common tools and methods, and examines its diverse applications in data science. Finally, the discussion concludes with insights into future directions, underscoring the centrality of clustering in driving innovation and enabling data-driven decision making.

[AI-78] The Impact of Input Order Bias on Large Language Models for Software Fault Localization

链接: https://arxiv.org/abs/2412.18750
作者: Md Nakhla Rafi,Dong Jae Kim,Tse-Hsun Chen,Shaowei Wang
关键词: Large Language Models, Automatic Program Repair, Large Language, Language Models, Fault Localization
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) show great promise in software engineering tasks like Fault Localization (FL) and Automatic Program Repair (APR). This study examines how input order and context size affect LLM performance in FL, a key step for many downstream software engineering tasks. We test different orders for methods using Kendall Tau distances, including “perfect” (where ground truths come first) and “worst” (where ground truths come last). Our results show a strong bias in order, with Top-1 accuracy falling from 57% to 20% when we reverse the code order. Breaking down inputs into smaller contexts helps reduce this bias, narrowing the performance gap between perfect and worst orders from 22% to just 1%. We also look at ordering methods based on traditional FL techniques and metrics. Ordering using DepGraph’s ranking achieves 48% Top-1 accuracy, better than more straightforward ordering approaches like CallGraph. These findings underscore the importance of how we structure inputs, manage contexts, and choose ordering methods to improve LLM performance in FL and other software engineering tasks.
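
To make the ordering experiments concrete, here is one way to measure how far a method ordering is from the "perfect" (ground-truth-first) ordering using a normalized Kendall Tau distance. The method names are hypothetical and the study's exact protocol may differ.

```python
from scipy.stats import kendalltau

def normalized_kendall_distance(order_a, order_b):
    """0.0 means identical orderings, 1.0 means fully reversed."""
    # Represent each ordering by the rank it assigns to every item.
    rank_a = {m: i for i, m in enumerate(order_a)}
    rank_b = {m: i for i, m in enumerate(order_b)}
    items = list(rank_a)
    tau, _ = kendalltau([rank_a[m] for m in items], [rank_b[m] for m in items])
    return (1.0 - tau) / 2.0

perfect = ["buggyMethod", "helperA", "helperB", "utilC"]  # ground truth first
worst = list(reversed(perfect))                           # ground truth last
print(normalized_kendall_distance(perfect, worst))        # 1.0
```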

[AI-79] Predicting Time Series of Networked Dynamical Systems without Knowing Topology

链接: https://arxiv.org/abs/2412.18734
作者: Yanna Ding,Zijie Huang,Malik Magdon-Ismail,Jianxi Gao
关键词: produce multivariate time, time series, multivariate time series, epidemic spreading networks, real-world complex systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Many real-world complex systems, such as epidemic spreading networks and ecosystems, can be modeled as networked dynamical systems that produce multivariate time series. Learning the intrinsic dynamics from observational data is pivotal for forecasting system behaviors and making informed decisions. However, existing methods for modeling networked time series often assume known topologies, whereas real-world networks are typically incomplete or inaccurate, with missing or spurious links that hinder precise predictions. Moreover, while networked time series often originate from diverse topologies, the ability of models to generalize across topologies has not been systematically evaluated. To address these gaps, we propose a novel framework for learning network dynamics directly from observed time-series data, when prior knowledge of graph topology or governing dynamical equations is absent. Our approach leverages continuous graph neural networks with an attention mechanism to construct a latent topology, enabling accurate reconstruction of future trajectories for network states. Extensive experiments on real and synthetic networks demonstrate that our model not only captures dynamics effectively without topology knowledge but also generalizes to unseen time series originating from diverse topologies.

[AI-80] SAFLITE: Fuzzing Autonomous Systems via Large Language Models

链接: https://arxiv.org/abs/2412.18727
作者: Taohong Zhu,Adrians Skapars,Fardeen Mackenzie,Declan Kehoe,William Newton,Suzanne Embury,Youcheng Sun
关键词: vast search spaces, uncovers software vulnerabilities, effectively uncovers software, complex state spaces, testing effectively uncovers
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Fuzz testing effectively uncovers software vulnerabilities; however, it faces challenges with Autonomous Systems (AS) due to their vast search spaces and complex state spaces, which reflect the unpredictability and complexity of real-world environments. This paper presents a universal framework aimed at improving the efficiency of fuzz testing for AS. At its core is SaFliTe, a predictive component that evaluates whether a test case meets predefined safety criteria. By leveraging the large language model (LLM) with information about the test objective and the AS state, SaFliTe assesses the relevance of each test case. We evaluated SaFliTe by instantiating it with various LLMs, including GPT-3.5, Mistral-7B, and Llama2-7B, and integrating it into four fuzz testing tools: PGFuzz, DeepHyperion-UAV, CAMBA, and TUMB. These tools are designed specifically for testing autonomous drone control systems, such as ArduPilot, PX4, and PX4-Avoidance. The experimental results demonstrate that, compared to PGFuzz, SaFliTe increased the likelihood of selecting operations that triggered bug occurrences in each fuzzing iteration by an average of 93.1%. Additionally, after integrating SaFliTe, the ability of DeepHyperion-UAV, CAMBA, and TUMB to generate test cases that caused system violations increased by 234.5%, 33.3%, and 17.8%, respectively. The benchmark for this evaluation was sourced from a UAV Testing Competition.

[AI-81] Optimization and Scalability of Collaborative Filtering Algorithms in Large Language Models

链接: https://arxiv.org/abs/2412.18715
作者: Haowei Yang,Longfei Yun,Jinghan Cao,Qingyi Lu,Yuming Tu
关键词: enhancing user experience, Collaborative filtering algorithms, Collaborative filtering, personalized content, driving engagement
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:With the rapid development of large language models (LLMs) and the growing demand for personalized content, recommendation systems have become critical in enhancing user experience and driving engagement. Collaborative filtering algorithms, being core to many recommendation systems, have garnered significant attention for their efficiency and interpretability. However, traditional collaborative filtering approaches face numerous challenges when integrated into large-scale LLM-based systems, including high computational costs, severe data sparsity, cold start problems, and lack of scalability. This paper investigates the optimization and scalability of collaborative filtering algorithms in large language models, addressing these limitations through advanced optimization strategies. Firstly, we analyze the fundamental principles of collaborative filtering algorithms and their limitations when applied in LLM-based contexts. Next, several optimization techniques such as matrix factorization, approximate nearest neighbor search, and parallel computing are proposed to enhance computational efficiency and model accuracy. Additionally, strategies such as distributed architecture and model compression are explored to facilitate dynamic updates and scalability in data-intensive environments.
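
Matrix factorization is one of the optimization techniques the abstract lists. The sketch below is a generic SGD factorizer over (user, item, rating) triples, not the paper's system; the toy data and hyperparameters are arbitrary.

```python
import numpy as np

def factorize(ratings, n_factors=16, lr=0.01, reg=0.05, epochs=20):
    """Plain SGD matrix factorization over observed (user, item, rating)
    triples, the classical collaborative-filtering building block."""
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    P = 0.1 * np.random.randn(n_users, n_factors)
    Q = 0.1 * np.random.randn(n_items, n_factors)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
P, Q = factorize(triples)
print(P[0] @ Q[1])   # predicted rating of item 1 for user 0
```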

[AI-82] Enhanced Recommendation Combining Collaborative Filtering and Large Language Models

链接: https://arxiv.org/abs/2412.18713
作者: Xueting Lin,Zhan Cheng,Longfei Yun,Qingyi Lu,Yuanshuai Luo
关键词: information explosion era, collaborative filtering, explosion era, increasingly significant, applications is increasingly
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:With the advent of the information explosion era, the importance of recommendation systems in various applications is increasingly significant. Traditional collaborative filtering algorithms are widely used due to their effectiveness in capturing user behavior patterns, but they encounter limitations when dealing with cold start problems and data sparsity. Large Language Models (LLMs), with their strong natural language understanding and generation capabilities, provide a new breakthrough for recommendation systems. This study proposes an enhanced recommendation method that combines collaborative filtering and LLMs, aiming to leverage collaborative filtering’s advantage in modeling user preferences while enhancing the understanding of textual information about users and items through LLMs to improve recommendation accuracy and diversity. This paper first introduces the fundamental theories of collaborative filtering and LLMs, then designs a recommendation system architecture that integrates both, and validates the system’s effectiveness through experiments. The results show that the hybrid model based on collaborative filtering and LLMs significantly improves precision, recall, and user satisfaction, demonstrating its potential in complex recommendation scenarios.

[AI-83] CAG: Chunked Augmented Generation for Google Chrome's Built-in Gemini Nano

链接: https://arxiv.org/abs/2412.18708
作者: Vivek Vellaiyappan Surulimuthu,Aditya Karnam Gururaj Rao
关键词: Chunked Augmented Generation, present Chunked Augmented, Google Chrome built-in, Augmented Generation, built-in Gemini Nano
类目: Artificial Intelligence (cs.AI)
*备注: 36 pages, 19 figures

点击查看摘要

Abstract:We present Chunked Augmented Generation (CAG), an architecture specifically designed to overcome the context window limitations of Google Chrome’s built-in Gemini Nano model. While Chrome’s integration of Gemini Nano represents a significant advancement in bringing AI capabilities directly to the browser, its restricted context window poses challenges for processing large inputs. CAG addresses this limitation through intelligent input chunking and processing strategies, enabling efficient handling of extensive content while maintaining the model’s performance within browser constraints. Our implementation demonstrates particular efficacy in processing large documents and datasets directly within Chrome, making sophisticated AI capabilities accessible through the browser without external API dependencies. Get started now at this https URL.
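
The core idea, as described, is to chunk inputs that exceed the context window and aggregate per-chunk outputs. The sketch below illustrates that flow in plain Python; `run_model` is a placeholder for the browser-side Gemini Nano call, and the character budget stands in for a token limit, both assumptions of this sketch.

```python
def chunk_text(text, max_chars=4000, overlap=200):
    """Split a long input into overlapping chunks that each fit the
    model's context window (character budget used as a stand-in for tokens)."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

def chunked_generate(text, prompt, run_model):
    """Run the per-chunk prompt, then aggregate the partial answers.
    `run_model(prompt) -> str` is a hypothetical stand-in for the real call."""
    partials = [run_model(f"{prompt}\n\n{c}") for c in chunk_text(text)]
    return run_model(prompt + "\n\nCombine these partial answers:\n" + "\n---\n".join(partials))

echo = lambda p: p[:60]  # trivial stand-in for a real model call
print(chunked_generate("long document " * 500, "Summarize:", echo))
```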

[AI-84] SurvAttack: Black-Box Attack On Survival Models through Ontology-Informed EHR Perturbation

链接: https://arxiv.org/abs/2412.18706
作者: Mohsen Nayebi Kerdabadi,Arya Hadizadeh Moghaddam,Bin Liu,Mei Liu,Zijun Yao
关键词: electronic health records, mining electronic health, prioritizing high-risk patients, health records, widely studied
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Survival analysis (SA) models have been widely studied in mining electronic health records (EHRs), particularly in forecasting the risk of critical conditions for prioritizing high-risk patients. However, their vulnerability to adversarial attacks is much less explored in the literature. Developing black-box perturbation algorithms and evaluating their impact on state-of-the-art survival models brings two benefits to medical applications. First, it can effectively evaluate the robustness of models in pre-deployment testing. Also, exploring how subtle perturbations would result in significantly different outcomes can provide counterfactual insights into the clinical interpretation of model prediction. In this work, we introduce SurvAttack, a novel black-box adversarial attack framework leveraging subtle clinically compatible, and semantically consistent perturbations on longitudinal EHRs to degrade survival models’ predictive performance. We specifically develop a greedy algorithm to manipulate medical codes with various adversarial actions throughout a patient’s medical history. Then, these adversarial actions are prioritized using a composite scoring strategy based on multi-aspect perturbation quality, including saliency, perturbation stealthiness, and clinical meaningfulness. The proposed adversarial EHR perturbation algorithm is then used in an efficient SA-specific strategy to attack a survival model when estimating the temporal ranking of survival urgency for patients. To demonstrate the significance of our work, we conduct extensive experiments, including baseline comparisons, explainability analysis, and case studies. The experimental results affirm our research’s effectiveness in illustrating the vulnerabilities of patient survival models, model interpretation, and ultimately contributing to healthcare quality.

[AI-85] Agents on the Bench: Large Language Model Based Multi Agent Framework for Trustworthy Digital Justice

链接: https://arxiv.org/abs/2412.18697
作者: Cong Jiang,Xiaolei Yang
关键词: uphold public trust, justice system, system has increasingly, increasingly employed, employed AI techniques
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Draft version; Under review

点击查看摘要

Abstract:The justice system has increasingly employed AI techniques to enhance efficiency, yet limitations remain in improving the quality of decision-making, particularly regarding transparency and explainability needed to uphold public trust in legal AI. To address these challenges, we propose a large language model based multi-agent framework named AgentsBench, which aims to simultaneously improve both efficiency and quality in judicial decision-making. Our approach leverages multiple LLM-driven agents that simulate the collaborative deliberation and decision making process of a judicial bench. We conducted experiments on legal judgment prediction task, and the results show that our framework outperforms existing LLM based methods in terms of performance and decision quality. By incorporating these elements, our framework reflects real-world judicial processes more closely, enhancing accuracy, fairness, and society consideration. AgentsBench provides a more nuanced and realistic methods of trustworthy AI decision-making, with strong potential for application across various case types and legal scenarios.

[AI-86] Map2Text: New Content Generation from Low-Dimensional Visualizations

链接: https://arxiv.org/abs/2412.18673
作者: Xingjian Zhang,Ziyang Xiong,Shixuan Liu,Yutong Xie,Tolga Ergen,Dongsub Shim,Hua Xu,Honglak Lee,Qiaozhu Mei
关键词: creative industries, industries as effective, effective tools, tools for interpreting, Low-dimensional visualizations
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Low-dimensional visualizations, or “projection maps” of datasets, are widely used across scientific research and creative industries as effective tools for interpreting large-scale and complex information. These visualizations not only support understanding existing knowledge spaces but are often used implicitly to guide exploration into unknown areas. While powerful methods like TSNE or UMAP can create such visual maps, there is currently no systematic way to leverage them for generating new content. To bridge this gap, we introduce Map2Text, a novel task that translates spatial coordinates within low-dimensional visualizations into new, coherent, and accurately aligned textual content. This allows users to explore and navigate undiscovered information embedded in these spatial layouts interactively and intuitively. To evaluate the performance of Map2Text methods, we propose Atometric, an evaluation metric that provides a granular assessment of logical coherence and alignment of the atomic statements in the generated texts. Experiments conducted across various datasets demonstrate the versatility of Map2Text in generating scientific research hypotheses, crafting synthetic personas, and devising strategies for testing large language models. Our findings highlight the potential of Map2Text to unlock new pathways for interacting with and navigating large-scale textual datasets, offering a novel framework for spatially guided content generation and discovery.

[AI-87] Interplay of ISMS and AIMS in context of the EU AI Act

链接: https://arxiv.org/abs/2412.18670
作者: Jordan Pötsch
关键词: risk management system, quality management system, management system, information security management, security management system
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The EU AI Act (AIA) mandates the implementation of a risk management system (RMS) and a quality management system (QMS) for high-risk AI systems. The ISO/IEC 42001 standard provides a foundation for fulfilling these requirements but does not cover all EU-specific regulatory stipulations. To enhance the implementation of the AIA in Germany, the Federal Office for Information Security (BSI) could introduce the national standard BSI 200-5, which specifies AIA requirements and integrates existing ISMS standards, such as ISO/IEC 27001. This paper examines the interfaces between an information security management system (ISMS) and an AI management system (AIMS), demonstrating that incorporating existing ISMS controls with specific AI extensions presents an effective strategy for complying with Article 15 of the AIA. Four new AI modules are introduced, proposed for inclusion in the BSI IT Grundschutz framework to comprehensively ensure the security of AI systems. Additionally, an approach for adapting BSI’s qualification and certification systems is outlined to ensure that expertise in secure AI handling is continuously developed. Finally, the paper discusses how the BSI could bridge international standards and the specific requirements of the AIA through the nationalization of ISO/IEC 42001, creating synergies and bolstering the competitiveness of the German AI landscape.

[AI-88] Nationality, Race, and Ethnicity Biases in and Consequences of Detecting AI-Generated Self-Presentations

链接: https://arxiv.org/abs/2412.18647
作者: Haoran Chu,Linjuan Rita Men,Sixiao Liu,Shupei Yuan,Yuan Sun
关键词: high-stakes self-presentation context, specifically race, college applications, theories to investigate, affect judgments
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This study builds on person perception and human AI interaction (HAII) theories to investigate how content and source cues, specifically race, ethnicity, and nationality, affect judgments of AI-generated content in a high-stakes self-presentation context: college applications. Results of a pre-registered experiment with a nationally representative U.S. sample (N = 644) show that content heuristics, such as linguistic style, played a dominant role in AI detection. Source heuristics, such as nationality, also emerged as a significant factor, with international students more likely to be perceived as using AI, especially when their statements included AI-sounding features. Interestingly, Asian and Hispanic applicants were more likely to be judged as AI users when labeled as domestic students, suggesting interactions between racial stereotypes and AI detection. AI attribution led to lower perceptions of personal statement quality and authenticity, as well as negative evaluations of the applicant’s competence, sociability, morality, and future success.

[AI-89] A Grounded Observer Framework for Establishing Guardrails for Foundation Models in Socially Sensitive Domains

链接: https://arxiv.org/abs/2412.18639
作者: Rebecca Ramnauth,Dražen Brščić,Brian Scassellati
关键词: increasingly permeate sensitive, permeate sensitive domains, meets desired outcomes, behavior meets desired, models increasingly permeate
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2412.18023

点击查看摘要

Abstract:As foundation models increasingly permeate sensitive domains such as healthcare, finance, and mental health, ensuring their behavior meets desired outcomes and social expectations becomes critical. Given the complexities of these high-dimensional models, traditional techniques for constraining agent behavior, which typically rely on low-dimensional, discrete state and action spaces, cannot be directly applied. Drawing inspiration from robotic action selection techniques, we propose the grounded observer framework for constraining foundation model behavior that offers both behavioral guarantees and real-time variability. This method leverages real-time assessment of low-level behavioral characteristics to dynamically adjust model actions and provide contextual feedback. To demonstrate this, we develop a system capable of sustaining contextually appropriate, casual conversations (“small talk”), which we then apply to a robot for novel, unscripted interactions with humans. Finally, we discuss potential applications of the framework for other social contexts and areas for further research.

[AI-90] Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

链接: https://arxiv.org/abs/2412.18106
作者: Mingcong Song,Xinru Tang,Fengfan Hou,Jing Li,Wei Wei,Yipeng Ma,Runqiu Xiao,Hongjie Si,Dingcheng Jiang,Shouyi Yin,Yang Hu,Guoping Long
关键词: Meeting growing demands, production-grade large language, requires integrating advanced, large language model, advanced optimization techniques
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLM, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta primitives. For attention, we propose a meta-kernel that computes the basic pattern of matmul-softmax-matmul with architectural-aware tile sizes. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes. XY-Serve sits harmoniously with vLLM. Experimental results show up to 89% end-to-end throughput improvement compared with current publicly available baselines on Ascend NPUs. Additionally, our approach outperforms existing GEMM (average 14.6% faster) and attention (average 21.5% faster) kernels relative to existing libraries. While the work is Ascend native, we believe the approach can be readily applicable to SIMT architectures as well.

[AI-91] Efficient Identification of Direct Causal Parents via Invariance and Minimum Error Testing

链接: https://arxiv.org/abs/2409.12797
作者: Minh Nguyen,Mert R. Sabuncu
关键词: Invariant causal prediction, exploiting distribution shifts, Invariant causal, invariance testing, distribution shifts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at TMLR

点击查看摘要

Abstract:Invariant causal prediction (ICP) is a popular technique for finding causal parents (direct causes) of a target via exploiting distribution shifts and invariance testing (Peters et al., 2016). However, since ICP needs to run an exponential number of tests and fails to identify parents when distribution shifts only affect a few variables, applying ICP to practical large scale problems is challenging. We propose MMSE-ICP and fastICP, two approaches which employ an error inequality to address the identifiability problem of ICP. The inequality states that the minimum prediction error of the predictor using causal parents is the smallest among all predictors which do not use descendants. fastICP is an efficient approximation tailored for large problems as it exploits the inequality and a heuristic to run fewer tests. MMSE-ICP and fastICP not only outperform competitive baselines in many simulations but also achieve state-of-the-art result on a large scale real data benchmark.
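
In notation of my own choosing (not necessarily the paper's), the error inequality the abstract relies on can be written as follows, where PA(Y) denotes the causal parents and ND(Y) the non-descendants of the target Y:

```latex
% Minimum squared prediction error using the causal parents is the smallest
% among all predictors restricted to non-descendant variable sets.
\min_{f}\; \mathbb{E}\!\left[\big(Y - f(X_{\mathrm{PA}(Y)})\big)^{2}\right]
\;\le\;
\min_{g}\; \mathbb{E}\!\left[\big(Y - g(X_{S})\big)^{2}\right]
\qquad \text{for every } S \subseteq \mathrm{ND}(Y).
```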

[AI-92] A Survey of NL2SQL with Large Language Models: Where are we and where are we going?

链接: https://arxiv.org/abs/2408.05109
作者: Xinyu Liu,Shuyu Shen,Boyan Li,Peixian Ma,Runzhi Jiang,Yuxin Zhang,Ju Fan,Guoliang Li,Nan Tang,Yuyu Luo
关键词: Translating users’ natural, natural language queries, users’ natural language, significantly reduce barriers, SQL queries
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: 20 pages, 11 figures, 2 tables

点击查看摘要

Abstract:Translating users’ natural language queries (NL) into SQL queries (i.e., NL2SQL, a.k.a., Text-to-SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly enhanced with the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs, covering its entire lifecycle from the following four aspects: (1) Model: NL2SQL translation techniques that tackle not only NL ambiguity and under-specification, but also properly map NL with database schema and instances; (2) Data: From the collection of training data, data synthesis due to training data scarcity, to NL2SQL benchmarks; (3) Evaluation: Evaluating NL2SQL methods from multiple angles using different metrics and granularities; and (4) Error Analysis: analyzing NL2SQL errors to find the root cause and guiding NL2SQL models to evolve. Moreover, we provide a rule of thumb for developing NL2SQL solutions. Finally, we discuss the research challenges and open problems of NL2SQL in the LLMs era.

[AI-93] Model-based Multi-agent Reinforcement Learning: Recent Progress and Prospects

链接: https://arxiv.org/abs/2203.10603
作者: Xihuai Wang,Zhicheng Zhang,Weinan Zhang
关键词: Multi-Agent Reinforcement Learning, Reinforcement Learning, involving multiple participants, tackles sequential decision-making, sequential decision-making problems
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Significant advances have recently been achieved in Multi-Agent Reinforcement Learning (MARL) which tackles sequential decision-making problems involving multiple participants. However, MARL requires a tremendous number of samples for effective training. On the other hand, model-based methods have been shown to achieve provable advantages of sample efficiency. However, the attempts of model-based methods to MARL have just started very recently. This paper presents a review of the existing research on model-based MARL, including theoretical analyses, algorithms, and applications, and analyzes the advantages and potential of model-based MARL. Specifically, we provide a detailed taxonomy of the algorithms and point out the pros and cons for each algorithm according to the challenges inherent to multi-agent scenarios. We also outline promising directions for future development of this field.

[AI-94] Complement or substitute? How AI increases the demand for human skills

链接: https://arxiv.org/abs/2412.19754
作者: Elina Mäkelä,Fabian Stephany
关键词: complements human work, human work, complements human, central to debates, skills
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注: 84

点击查看摘要

Abstract:The question of whether AI substitutes or complements human work is central to debates on the future of work. This paper examines the impact of AI on skill demand and compensation in the U.S. economy, analysing 12 million online job vacancies from 2018 to 2023. It investigates internal effects (within-job substitution and complementation) and external effects (across occupations, industries, and regions). Our findings reveal a significant increase in demand for AI-complementary skills, such as digital literacy, teamwork, and resilience, alongside rising wage premiums for these skills in AI roles like Data Scientist. Conversely, substitute skills, including customer service and text review, have declined in both demand and value within AI-related positions. Examining external effects, we find a notable rise in demand for complementary skills in non-AI roles linked to the growth of AI-related jobs in specific industries or regions. At the same time, there is a moderate decline in non-AI roles requiring substitute skills. Overall, AI’s complementary effect is up to 50% larger than its substitution effect, resulting in net positive demand for skills. These results, replicated for the UK and Australia, highlight AI’s transformative impact on workforce skill requirements. They suggest reskilling efforts should prioritise not only technical AI skills but also complementary skills like ethics and digital literacy.

[AI-95] Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models

链接: https://arxiv.org/abs/2412.19191
作者: Haonan He,Yuchen Ren,Yining Tang,Ziyang Xu,Junxian Li,Minghao Yang,Di Zhang,Dong Yuan,Tao Chen,Shufei Zhang,Yuqiang Li,Nanqing Dong,Wanli Ouyang,Dongzhan Zhou,Peng Ye
关键词: Large language models, general domains, revolutionary transformation, Large language, demonstrated their formidable
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models have already demonstrated their formidable capabilities in general domains, ushering in a revolutionary transformation. However, exploring and exploiting the extensive knowledge of these models to comprehend multi-omics biology remains underexplored. To fill this research gap, we first introduce Biology-Instructions, the first large-scale multi-omics biological sequences-related instruction-tuning dataset including DNA, RNA, proteins, and multi-molecules, designed to bridge the gap between large language models (LLMs) and complex biological sequences-related tasks. This dataset can enhance the versatility of LLMs by integrating diverse biological sequenced-based prediction tasks with advanced reasoning capabilities, while maintaining conversational fluency. Additionally, we reveal significant performance limitations in even state-of-the-art LLMs on biological sequence-related multi-omics tasks without specialized pre-training and instruction-tuning. We further develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline, demonstrating the powerful ability to understand biology by using Biology-Instructions. Biology-Instructions and ChatMultiOmics are publicly available and crucial resources for enabling more effective integration of LLMs with multi-omics sequence analysis.

[AI-96] Master Stability Functions in Complex Networks

链接: https://arxiv.org/abs/2412.19163
作者: Suman Acharyya,Priodyuti Pradhan,Chandrakala Meena
关键词: Master Stability Function, MSF, dynamical networks, complex dynamical networks, emergent phenomenon
类目: Adaptation and Self-Organizing Systems (nlin.AO); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)
*备注: 38 pages, 1 figure

点击查看摘要

Abstract:Synchronization is an emergent phenomenon in coupled dynamical networks. The Master Stability Function (MSF) is a highly elegant and powerful tool for characterizing the stability of synchronization states. However, a significant challenge lies in determining the MSF for complex dynamical networks driven by nonlinear interaction mechanisms. These mechanisms introduce additional complexity through the intricate connectivity of interacting elements within the network and the intrinsic dynamics, which are governed by nonlinear processes with diverse parameters and higher dimensionality of systems. Over the past 25 years, extensive research has focused on determining the MSF for pairwise coupled identical systems with diffusive coupling. Our literature survey highlights two significant advancements in recent years: the consideration of multilayer networks instead of single-layer networks and the extension of MSF analysis to incorporate higher-order interactions alongside pairwise interactions. In this review article, we revisit the analysis of the MSF for diffusively pairwise coupled dynamical systems and extend this framework to more general coupling schemes. Furthermore, we systematically derive the MSF for multilayer dynamical networks and single-layer coupled systems by incorporating higher-order interactions alongside pairwise interactions. The primary focus of our review is on the analytical derivation and numerical computation of the MSF for complex dynamical networks. Finally, we demonstrate the application of the MSF in data science, emphasizing its relevance and potential in this rapidly evolving field.
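
For readers new to the topic, the standard pairwise diffusive-coupling setup that the MSF formalism starts from is sketched below in textbook (Pecora-Carroll style) notation; this is background, not text taken from the paper.

```latex
% Network of N identical units with diffusive coupling through H and Laplacian L:
\dot{\mathbf{x}}_i = \mathbf{F}(\mathbf{x}_i) - \sigma \sum_{j=1}^{N} L_{ij}\,\mathbf{H}(\mathbf{x}_j),
\qquad i = 1,\dots,N .

% Linearizing around the synchronized state s(t) and projecting onto the
% Laplacian eigenmodes (eigenvalues \lambda_k) gives the variational equation
\dot{\boldsymbol{\eta}}_k = \big[ D\mathbf{F}(\mathbf{s}) - \sigma \lambda_k\, D\mathbf{H}(\mathbf{s}) \big]\,\boldsymbol{\eta}_k .

% The MSF \Lambda(\alpha) is the largest Lyapunov exponent of this equation with
% \alpha = \sigma\lambda_k; the synchronized state is stable when
% \Lambda(\sigma\lambda_k) < 0 for all k \ge 2.
```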

[AI-97] Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization AAAI2025

链接: https://arxiv.org/abs/2412.19005
作者: Yihan Wu,Yichen Lu,Yifan Peng,Xihua Wang,Ruihua Song,Shinji Watanabe
关键词: Audiovisual Automatic Speech, Automatic Speech Recognition, Speech Recognition, Audiovisual Automatic, Automatic Speech
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method to improve AV-ASR models by leveraging both input-side and output-side preference. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition.

[AI-98] Implicit factorized transformer approach to fast prediction of turbulent channel flows

链接: https://arxiv.org/abs/2412.18840
作者: Huiyu Yang,Yunpeng Wang,Jianchun Wang
关键词: partial differential equations, nonlinear systems governed, Fourier neural operator, Transformer neural operators, neural operator
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Transformer neural operators have recently become an effective approach for surrogate modeling of nonlinear systems governed by partial differential equations (PDEs). In this paper, we introduce a modified implicit factorized transformer (IFactFormer-m) model which replaces the original chained factorized attention with parallel factorized attention. The IFactFormer-m model successfully performs long-term predictions for turbulent channel flow, whereas the original IFactFormer (IFactFormer-o), Fourier neural operator (FNO), and implicit Fourier neural operator (IFNO) exhibit a poor performance. Turbulent channel flows are simulated by direct numerical simulation using fine grids at friction Reynolds numbers Re_τ ≈ 180, 395, 590, and filtered to coarse grids for training neural operator. The neural operator takes the current flow field as input and predicts the flow field at the next time step, and long-term prediction is achieved in the posterior through an autoregressive approach. The prediction results show that IFactFormer-m, compared to other neural operators and the traditional large eddy simulation (LES) methods including dynamic Smagorinsky model (DSM) and the wall-adapted local eddy-viscosity (WALE) model, reduces prediction errors in the short term, and achieves stable and accurate long-term prediction of various statistical properties and flow structures, including the energy spectrum, mean streamwise velocity, root mean square (rms) values of fluctuating velocities, Reynolds shear stress, and spatial structures of instantaneous velocity. Moreover, the trained IFactFormer-m is much faster than traditional LES methods.

[AI-99] PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation

链接: https://arxiv.org/abs/2412.18827
作者: ChenRui Duan,Zelin Zang,Siyuan Li,Yongjie Xu,Stan Z. Li
关键词: remains challenging due, Markov Chain Monte, Chain Monte Carlo, inference remains challenging, Traditional Markov Chain
类目: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Phylogenetic trees elucidate evolutionary relationships among species, but phylogenetic inference remains challenging due to the complexity of combining continuous (branch lengths) and discrete parameters (tree topology). Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens. Existing Variational Inference methods, which require pre-generated topologies and typically treat tree structures and branch lengths independently, may overlook critical sequence features, limiting their accuracy and flexibility. We propose PhyloGen, a novel method leveraging a pre-trained genomic language model to generate and optimize phylogenetic trees without dependence on evolutionary models or aligned sequence constraints. PhyloGen views phylogenetic inference as a conditionally constrained tree structure generation problem, jointly optimizing tree topology and branch lengths through three core modules: (i) Feature Extraction, (ii) PhyloTree Construction, and (iii) PhyloTree Structure Modeling. Meanwhile, we introduce a Scoring Function to guide the model towards a more stable gradient descent. We demonstrate the effectiveness and robustness of PhyloGen on eight real-world benchmark datasets. Visualization results confirm PhyloGen provides deeper insights into phylogenetic relationships.

机器学习

[LG-0] Tensor Network Estimation of Distribution Algorithms

链接: https://arxiv.org/abs/2412.19780
作者: John Gardiner,Javier Lopez-Piqueres
关键词: many-body quantum physics, Tensor networks, integrating tensor networks, computational sciences, machine learning
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Tensor networks are a tool first employed in the context of many-body quantum physics that now have a wide range of uses across the computational sciences, from numerical methods to machine learning. Methods integrating tensor networks into evolutionary optimization algorithms have appeared in the recent literature. In essence, these methods can be understood as replacing the traditional crossover operation of a genetic algorithm with a tensor network-based generative model. We investigate these methods from the point of view that they are Estimation of Distribution Algorithms (EDAs). We find that optimization performance of these methods is not related to the power of the generative model in a straightforward way. Generative models that are better (in the sense that they better model the distribution from which their training data is drawn) do not necessarily result in better performance of the optimization algorithm they form a part of. This raises the question of how best to incorporate powerful generative models into optimization routines. In light of this we find that adding an explicit mutation operator to the output of the generative model often improves optimization performance.

[LG-1] Analysis of Premature Death Rates in Texas Counties: The Impact of Air Quality Socioeconomic Factors and COPD Prevalence

链接: https://arxiv.org/abs/2412.19774
作者: Richard Rich,Ernesto Diaz
关键词: Understanding factors contributing, Understanding factors, mortality is critical, public health planning, utilizing EPA air
类目: Machine Learning (cs.LG)
*备注: 5 pages

点击查看摘要

Abstract:Understanding factors contributing to premature mortality is critical for public health planning. This study examines the relationships between premature death rates and multiple risk factors across several Texas counties, utilizing EPA air quality data, Census information, and county health records from recent years. We analyze the impact of air quality (PM2.5 levels), socioeconomic factors (median household income), and health conditions (COPD prevalence) through statistical analysis and modeling techniques. Results reveal COPD prevalence as a strong predictor of premature death rates, with higher prevalence associated with a substantial increase in years of potential life lost. While socioeconomic factors show a significant negative correlation, air quality demonstrates more complex indirect relationships. These findings emphasize the need for integrated public health interventions that prioritize key health conditions while addressing underlying socioeconomic disparities.

[LG-2] Fortran2CPP: Automating Fortran-to-C++ Migration using LLMs via Multi-Turn Dialogue and Dual-Agent Integration

链接: https://arxiv.org/abs/2412.19770
作者: Le Chen,Bin Lei,Dunzhi Zhou,Pei-Hung Lin,Chunhua Liao,Caiwen Ding,Ali Jannesari
关键词: Migrating Fortran code, Migrating Fortran, scientific computing teams, modern programming paradigms, leverage modern programming
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Migrating Fortran code to C++ is a common task for many scientific computing teams, driven by the need to leverage modern programming paradigms, enhance cross-platform compatibility, and improve maintainability. Automating this translation process using large language models (LLMs) has shown promise, but the lack of high-quality, specialized datasets has hindered their effectiveness. In this paper, we address this challenge by introducing a novel multi-turn dialogue dataset, Fortran2CPP, specifically designed for Fortran-to-C++ code migration. Our dataset, significantly larger than existing alternatives, is generated using a unique LLM-driven, dual-agent pipeline incorporating iterative compilation, execution, and code repair to ensure high quality and functional correctness. To demonstrate the effectiveness of our dataset, we fine-tuned several open-weight LLMs on Fortran2CPP and evaluated their performance on two independent benchmarks. Fine-tuning on our dataset led to remarkable gains, with models achieving up to a 3.31x increase in CodeBLEU score and a 92% improvement in compilation success rate. This highlights the dataset’s ability to enhance both the syntactic accuracy and compilability of the translated C++ code. Our dataset and model have been open-sourced and are available on our public GitHub repository (this https URL).
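
The generate-compile-repair loop described above can be pictured with the minimal sketch below. It is not the paper's pipeline: `translate` and `repair` are hypothetical stand-ins for the two LLM agents, and g++ is assumed to be on the PATH.

```python
import pathlib
import subprocess
import tempfile

def compiles(cpp_source: str):
    """Try to compile the candidate C++ with g++ and return (ok, diagnostics)."""
    src = pathlib.Path(tempfile.mkdtemp()) / "candidate.cpp"
    src.write_text(cpp_source)
    proc = subprocess.run(
        ["g++", "-std=c++17", "-c", str(src), "-o", str(src.with_suffix(".o"))],
        capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def translate_with_repair(fortran_code, translate, repair, max_rounds=3):
    """Generate a C++ translation, then iteratively feed compiler errors back.
    `translate(fortran) -> str` and `repair(cpp, errors) -> str` are placeholder
    agent calls, not a real API."""
    cpp = translate(fortran_code)
    for _ in range(max_rounds):
        ok, errors = compiles(cpp)
        if ok:
            return cpp
        cpp = repair(cpp, errors)
    return cpp
```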

[LG-3] From Ceilings to Walls: Universal Dynamic Perching of Small Aerial Robots on Surfaces with Variable Orientations

链接: https://arxiv.org/abs/2412.19765
作者: Bryan Habas,Aaron Brown,Donghyeon Lee,Mitchell Goldman,Bo Cheng
关键词: work demonstrates universal, demonstrates universal dynamic, universal dynamic perching, dynamic perching capabilities, work demonstrates
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 8 Figures

点击查看摘要

Abstract:This work demonstrates universal dynamic perching capabilities for quadrotors of various sizes and on surfaces with different orientations. By employing a non-dimensionalization framework and deep reinforcement learning, we systematically assessed how robot size and surface orientation affect landing capabilities. We hypothesized that maintaining geometric proportions across different robot scales ensures consistent perching behavior, which was validated in both simulation and experimental tests. Additionally, we investigated the effects of joint stiffness and damping in the landing gear on perching behaviors and performance. While joint stiffness had minimal impact, joint damping ratios influenced landing success under vertical approaching conditions. The study also identified a critical velocity threshold necessary for successful perching, determined by the robot’s maneuverability and leg geometry. Overall, this research advances robotic perching capabilities, offering insights into the role of mechanical design and scaling effects, and lays the groundwork for future drone autonomy and operational efficiency in unstructured environments.

[LG-4] Generative Pretrained Embedding and Hierarchical Irregular Time Series Representation for Daily Living Activity Recognition

链接: https://arxiv.org/abs/2412.19732
作者: Damien Bouchabou,Sao Mai Nguyen
关键词: data stands paramount, sensor data stands, ambient sensor data, daily living activities, smart homes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Within the evolving landscape of smart homes, the precise recognition of daily living activities using ambient sensor data stands paramount. This paper not only aims to bolster existing algorithms by evaluating two distinct pretrained embeddings suited for ambient sensor activations but also introduces a novel hierarchical architecture. We delve into an architecture anchored on Transformer Decoder-based pre-trained embeddings, reminiscent of the GPT design, and contrast it with the previously established state-of-the-art (SOTA) ELMo embeddings for ambient sensors. Our proposed hierarchical structure leverages the strengths of each pre-trained embedding, enabling the discernment of activity dependencies and sequence order, thereby enhancing classification precision. To further refine recognition, we incorporate into our proposed architecture an hour-of-the-day embedding. Empirical evaluations underscore the preeminence of the Transformer Decoder embedding in classification endeavors. Additionally, our innovative hierarchical design significantly bolsters the efficacy of both pre-trained embeddings, notably in capturing inter-activity nuances. The integration of temporal aspects subtly but distinctively augments classification, especially for time-sensitive activities. In conclusion, our GPT-inspired hierarchical approach, infused with temporal insights, outshines the SOTA ELMo benchmark.
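
As an illustration of how an hour-of-the-day embedding can be fused with a pretrained sequence embedding before classification, here is a small PyTorch sketch; the dimensions, class count, and random inputs are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ActivityHead(nn.Module):
    """Classifier head that concatenates a pretrained sensor-sequence embedding
    with a learned hour-of-day embedding (dimensions are illustrative)."""
    def __init__(self, seq_dim=256, hour_dim=16, n_classes=10):
        super().__init__()
        self.hour_emb = nn.Embedding(24, hour_dim)
        self.fc = nn.Linear(seq_dim + hour_dim, n_classes)

    def forward(self, seq_embedding, hour):
        # seq_embedding: (batch, seq_dim) from the pretrained encoder
        # hour: (batch,) integer hour of day in [0, 23]
        h = self.hour_emb(hour)
        return self.fc(torch.cat([seq_embedding, h], dim=-1))

head = ActivityHead()
logits = head(torch.randn(8, 256), torch.randint(0, 24, (8,)))
```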

[LG-5] EEG-Reptile: An Automatized Reptile-Based Meta-Learning Library for BCIs

链接: https://arxiv.org/abs/2412.19725
作者: Daniil A. Berdyshev,Artem M. Grachev,Sergei L. Shishkin,Bogdan L. Kozyrskiy
关键词: BCI classifier training, enable efficient BCI, Reptile meta-learning algorithm, promising approach, approach to enable
类目: Machine Learning (cs.LG)
*备注: For proposed python library, see EEG-Reptile GitHub: this https URL

点击查看摘要

Abstract:Meta-learning, i.e., “learning to learn”, is a promising approach to enable efficient BCI classifier training with limited amounts of data. It can effectively use collections of classification tasks that are similar in some way, with rapid adaptation to new tasks where only minimal data are available. However, applying meta-learning to existing classifiers and BCI tasks requires significant effort. To address this issue, we propose EEG-Reptile, an automated library that leverages meta-learning to improve the classification accuracy of neural networks in BCIs and other EEG-based applications. It utilizes the Reptile meta-learning algorithm to adapt neural network classifiers of EEG data to the inter-subject domain, allowing for more efficient fine-tuning for a new subject on a small amount of data. The proposed library incorporates an automated hyperparameter tuning module, a data management pipeline, and an implementation of the Reptile meta-learning algorithm. EEG-Reptile’s level of automation allows it to be used without a deep understanding of meta-learning. We demonstrate the effectiveness of EEG-Reptile on two benchmark datasets (BCI IV 2a, Lee2019 MI) and three neural network architectures (EEGNet, FBCNet, EEG-Inception). Our library achieved improvements in both zero-shot and few-shot learning scenarios compared to traditional transfer learning approaches.
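
As a rough illustration of the Reptile update that EEG-Reptile builds on, the sketch below implements the generic meta-update θ ← θ + ε(φ − θ) on toy regression tasks. The task sampler, loss, and hyperparameters are placeholders, not the library's actual EEG pipeline; the real implementation lives in the linked GitHub repository.

```python
# Minimal Reptile sketch in NumPy (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def task_loss_grad(theta, X, y):
    # Gradient of a simple least-squares loss; a stand-in for the
    # EEG classifier's true loss (assumption, not from the paper).
    return 2 * X.T @ (X @ theta - y) / len(y)

def reptile(theta, tasks, inner_steps=5, inner_lr=0.01, meta_lr=0.1, meta_iters=100):
    for _ in range(meta_iters):
        X, y = tasks[rng.integers(len(tasks))]      # sample a task (e.g., one subject)
        phi = theta.copy()
        for _ in range(inner_steps):                # a few SGD steps on that task
            phi -= inner_lr * task_loss_grad(phi, X, y)
        theta += meta_lr * (phi - theta)            # Reptile meta-update
    return theta

# Toy tasks: linear regression problems sharing a common structure.
true_w = rng.normal(size=3)
tasks = []
for _ in range(8):
    X = rng.normal(size=(50, 3))
    y = X @ (true_w + 0.1 * rng.normal(size=3))
    tasks.append((X, y))

theta_meta = reptile(np.zeros(3), tasks)
print("meta-initialization:", theta_meta)
```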

[LG-6] Toward Scalable Multirobot Control: Fast Policy Learning in Distributed MPC

链接: https://arxiv.org/abs/2412.19669
作者: Xinglong Zhang,Wei Pan,Cong Li,Xin Xu,Xiangke Wang,Ronghua Zhang,Dewen Hu
关键词: MRS, achieving optimal cooperative, DMPC, promising in achieving, achieving optimal
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 26 pages, 19 figures

点击查看摘要

Abstract:Distributed model predictive control (DMPC) is promising in achieving optimal cooperative control in multirobot systems (MRS). However, real-time DMPC implementation relies on numerical optimization tools to periodically calculate local control sequences online. This process is computationally demanding and lacks scalability for large-scale, nonlinear MRS. This article proposes a novel distributed learning-based predictive control (DLPC) framework for scalable multirobot control. Unlike conventional DMPC methods that calculate open-loop control sequences, our approach centers around a computationally fast and efficient distributed policy learning algorithm that generates explicit closed-loop DMPC policies for MRS without using numerical solvers. The policy learning is executed incrementally and forward in time in each prediction interval through an online distributed actor-critic implementation. The control policies are successively updated in a receding-horizon manner, enabling fast and efficient policy learning with the closed-loop stability guarantee. The learned control policies could be deployed online to MRS with varying robot scales, enhancing scalability and transferability for large-scale MRS. Furthermore, we extend our methodology to address the multirobot safe learning challenge through a force field-inspired policy learning approach. We validate our approach’s effectiveness, scalability, and efficiency through extensive experiments on cooperative tasks of large-scale wheeled robots and multirotor drones. Our results demonstrate the rapid learning and deployment of DMPC policies for MRS with scales up to 10,000 units.

[LG-7] Asymmetrical Reciprocity-based Federated Learning for Resolving Disparities in Medical Diagnosis KDD2025

链接: https://arxiv.org/abs/2412.19654
作者: Jiaqi Wang,Ziyi Yin,Quanzeng You,Lingjuan Lyu,Fenglong Ma
关键词: pressing global challenge, middle-income nations, pose a pressing, pressing global, underserved regions
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Jiaqi Wang and Ziyi Yin equally contributed to this paper. This paper has been accepted by KDD 2025

点击查看摘要

Abstract:Geographic health disparities pose a pressing global challenge, particularly in underserved regions of low- and middle-income nations. Addressing this issue requires a collaborative approach to enhance healthcare quality, leveraging support from medically more developed areas. Federated learning emerges as a promising tool for this purpose. However, the scarcity of medical data and limited computation resources in underserved regions make collaborative training of powerful machine learning models challenging. Furthermore, there exists an asymmetrical reciprocity between underserved and developed regions. To overcome these challenges, we propose a novel cross-silo federated learning framework, named FedHelp, aimed at alleviating geographic health disparities and fortifying the diagnostic capabilities of underserved regions. Specifically, FedHelp leverages foundational model knowledge via one-time API access to guide the learning process of underserved small clients, addressing the challenge of insufficient data. Additionally, we introduce a novel asymmetric dual knowledge distillation module to manage the issue of asymmetric reciprocity, facilitating the exchange of necessary knowledge between developed large clients and underserved small clients. We validate the effectiveness and utility of FedHelp through extensive experiments on both medical image classification and segmentation tasks. The experimental results demonstrate significant performance improvement compared to state-of-the-art baselines, particularly benefiting clients in underserved regions.
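
To make the knowledge-distillation ingredient concrete, here is a generic temperature-scaled distillation loss as a minimal sketch. FedHelp's asymmetric dual knowledge distillation between developed large clients and underserved small clients is more involved and is not specified in the abstract, so none of this reflects the paper's actual module.

```python
# Generic knowledge-distillation loss (reference sketch, not FedHelp's module).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student outputs.
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
print(distillation_loss(student, teacher))
```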

[LG-8] Goal-oriented Communications based on Recursive Early Exit Neural Networks

链接: https://arxiv.org/abs/2412.19587
作者: Jary Pomponi,Mattia Merluzzi,Alessio Devoto,Mateus Pontes Mota,Paolo Di Lorenzo,Simone Scardapane
关键词: goal-oriented semantic communications, semantic communications leveraging, communications leveraging recursive, early exit models, paper presents
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a novel framework for goal-oriented semantic communications leveraging recursive early exit models. The proposed approach is built on two key components. First, we introduce an innovative early exit strategy that dynamically partitions computations, enabling samples to be offloaded to a server based on layer-wise recursive prediction dynamics that detect samples for which the confidence is not increasing fast enough over layers. Second, we develop a Reinforcement Learning-based online optimization framework that jointly determines early exit points, computation splitting, and offloading strategies, while accounting for wireless conditions, inference accuracy, and resource costs. Numerical evaluations in an edge inference scenario demonstrate the method’s adaptability and effectiveness in striking an excellent trade-off between performance, latency, and resource efficiency.
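
The confidence-based exit/offload rule below is only a schematic reading of the abstract: per-layer exit heads produce a confidence score, the sample exits locally once confidence is high enough, and it is offloaded to the server when confidence stops rising fast enough. The thresholds, heads, and layers are made-up placeholders, not the paper's learned policy.

```python
# Illustrative early-exit / offload sketch (assumed rule, not the paper's).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_with_early_exit(layers, exit_heads, x, conf_thresh=0.9, min_gain=0.02):
    """layers: list of callables h -> h; exit_heads: list of callables h -> logits."""
    prev_conf = 0.0
    h = x
    for layer, head in zip(layers, exit_heads):
        h = layer(h)
        conf = softmax(head(h)).max()
        if conf >= conf_thresh:
            return "exit-local", conf            # confident enough: stop on-device
        if conf - prev_conf < min_gain:
            return "offload-to-server", conf     # confidence not rising fast enough
        prev_conf = conf
    return "exit-local", prev_conf

# Toy example with random linear layers and exit heads.
rng = np.random.default_rng(1)
layers = [lambda h, W=rng.normal(size=(8, 8)): np.tanh(W @ h) for _ in range(4)]
heads = [lambda h, W=rng.normal(size=(3, 8)): W @ h for _ in range(4)]
print(run_with_early_exit(layers, heads, rng.normal(size=8)))
```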

[LG-9] Ultralight Signal Classification Model for Automatic Modulation Recognition

链接: https://arxiv.org/abs/2412.19585
作者: Alessandro Daniele Genuardi Oquendo,Agustín Matías Galante Cerviño,Nilotpal Sinha,Luc Andrea,Sam Mugel,Román Orús
关键词: radar signals demands, signals demands responsive, accurate detection systems, resource-constrained edge devices, growing complexity
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 8 pages, 8 figures

点击查看摘要

Abstract:The growing complexity of radar signals demands responsive and accurate detection systems that can operate efficiently on resource-constrained edge devices. Existing models, while effective, often rely on substantial computational resources and large datasets, making them impractical for edge deployment. In this work, we propose an ultralight hybrid neural network optimized for edge applications, delivering robust performance across unfavorable signal-to-noise ratios (mean accuracy of 96.3% at 0 dB) using less than 100 samples per class, and significantly reducing computational overhead.

[LG-10] The Value of AI Advice: Personalized and Value-Maximizing AI Advisors Are Necessary to Reliably Benefit Experts and Organizations

链接: https://arxiv.org/abs/2412.19530
作者: Nicholas Wolczynski,Maytal Saar-Tsechansky,Tong Wang
关键词: performance and interpretability, increase the time, time and effort, invest to make, undermine experts’ decisions
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite advances in AI’s performance and interpretability, AI advisors can undermine experts’ decisions and increase the time and effort experts must invest to make decisions. Consequently, AI systems deployed in high-stakes settings often fail to consistently add value across contexts and can even diminish the value that experts alone provide. Beyond harm in specific domains, such outcomes impede progress in research and practice, underscoring the need to understand when and why different AI advisors add or diminish value. To bridge this gap, we stress the importance of assessing the value AI advice brings to real-world contexts when designing and evaluating AI advisors. Building on this perspective, we characterize key pillars – pathways through which AI advice impacts value – and develop a framework that incorporates these pillars to create reliable, personalized, and value-adding advisors. Our results highlight the need for system-level, value-driven development of AI advisors that advise selectively, are tailored to experts’ unique behaviors, and are optimized for context-specific trade-offs between decision improvements and advising costs. They also reveal how the lack of inclusion of these pillars in the design of AI advising systems may be contributing to the failures observed in practical applications.

[LG-11] Real-time classification of EEG signals using Machine Learning deployment

链接: https://arxiv.org/abs/2412.19515
作者: Swati Chowdhuri,Satadip Saha,Samadrita Karmakar,Ankur Chanda
关键词: prevailing educational methods, educational methods predominantly, methods predominantly rely, traditional classroom instruction, online delivery
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: Published in Romanian Journal of Information Technology and Automatic Control

点击查看摘要

Abstract:The prevailing educational methods predominantly rely on traditional classroom instruction or online delivery, often limiting the teachers’ ability to engage effectively with all the students simultaneously. A more intrinsic method of evaluating student attentiveness during lectures can enable the educators to tailor the course materials and their teaching styles in order to better meet the students’ needs. The aim of this paper is to enhance teaching quality in real time, thereby fostering a higher student engagement in the classroom activities. By monitoring the students’ electroencephalography (EEG) signals and employing machine learning algorithms, this study proposes a comprehensive solution for addressing this challenge. Machine learning has emerged as a powerful tool for simplifying the analysis of complex variables, enabling the effective assessment of the students’ concentration levels based on specific parameters. However, the real-time impact of machine learning models necessitates a careful consideration as their deployment is concerned. This study proposes a machine learning-based approach for predicting the level of students’ comprehension with regard to a certain topic. A browser interface was introduced that accesses the values of the system’s parameters to determine a student’s level of concentration on a chosen topic. The deployment of the proposed system made it necessary to address the real-time challenges faced by the students, consider the system’s cost, and establish trust in its efficacy. This paper presents the efforts made for approaching this pertinent issue through the implementation of innovative technologies and provides a framework for addressing key considerations for future research directions.

[LG-12] Uncertainty quantification for improving radiomic-based models in radiation pneumonitis prediction

链接: https://arxiv.org/abs/2412.19511
作者: Chanon Puttanawarut,Romen Samuel Wabina,Nat Sirirutbunkajorn
关键词: Background and Objective, thoracic radiation therapy, Radiation pneumonitis, radiation therapy, thoracic radiation
类目: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Background and Objective: Radiation pneumonitis (RP) is a side effect of thoracic radiation therapy. Recently, machine learning (ML) models enhanced with radiomic and dosiomic features provide better predictions by incorporating spatial information beyond DVHs. However, to improve the clinical decision process, we propose to use uncertainty quantification (UQ) to improve the confidence in model prediction. This study evaluates the impact of post hoc UQ methods on the discriminative performance and calibration of ML models for RP prediction. Methods: This study evaluated four ML models: logistic regression (LR), support vector machines (SVM), extreme gradient boosting (XGB), and random forest (RF), using radiomic, dosiomic, and dosimetric features to predict RP. We applied UQ methods, including Platt scaling, isotonic regression, the Venn-ABERS predictor, and conformal prediction, to quantify uncertainty. Model performance was assessed through Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), and Adaptive Calibration Error (ACE) using Leave-One-Out Cross-Validation (LOO-CV). Results: UQ methods enhanced predictive performance, particularly for high-certainty predictions, while also improving calibration. Radiomic and dosiomic features increased model accuracy but introduced calibration challenges, especially for non-linear models like XGB and RF. Performance gains from UQ methods were most noticeable at higher certainty thresholds. Conclusion: Integrating UQ into ML models with radiomic and dosiomic features improves both predictive accuracy and calibration, supporting more reliable clinical decision-making. The findings emphasize the value of UQ methods in enhancing the applicability of predictive models for RP in healthcare settings.
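
For readers unfamiliar with the post hoc calibration methods named above, the sketch below applies Platt scaling and isotonic regression to held-out scores of a generic classifier. The study's radiomic/dosiomic features, Venn-ABERS and conformal predictors, and LOO-CV protocol are not reproduced; synthetic data stands in for the clinical dataset.

```python
# Hedged sketch of post hoc calibration with Platt scaling and isotonic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
p_cal = rf.predict_proba(X_cal)[:, 1]            # uncalibrated scores on held-out data

# Platt scaling: logistic regression fitted on the model's scores.
platt = LogisticRegression().fit(p_cal.reshape(-1, 1), y_cal)
p_platt = platt.predict_proba(p_cal.reshape(-1, 1))[:, 1]

# Isotonic regression: monotone, non-parametric recalibration.
iso = IsotonicRegression(out_of_bounds="clip").fit(p_cal, y_cal)
p_iso = iso.predict(p_cal)

print(p_platt[:5], p_iso[:5])
```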

[LG-13] RobotDiffuse: Motion Planning for Redundant Manipulator based on Diffusion Model

链接: https://arxiv.org/abs/2412.19500
作者: Xiaohan Zhang,Xudong Mou,Rui Wang,Tianyu Wo,Ningbo Gu,Tiejun Wang,Cangbai Xu,Xudong Liu
关键词: Degrees of Freedom, offer enhanced kinematic, enhanced kinematic performance, higher Degrees, surgical robotics
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Redundant manipulators, with their higher Degrees of Freedom (DOFs), offer enhanced kinematic performance and versatility, making them suitable for applications like manufacturing, surgical robotics, and human-robot collaboration. However, motion planning for these manipulators is challenging due to increased DOFs and complex, dynamic environments. While traditional motion planning algorithms struggle with high-dimensional spaces, deep learning-based methods often face instability and inefficiency in complex tasks. This paper introduces RobotDiffuse, a diffusion model-based approach for motion planning in redundant manipulators. By integrating physical constraints with a point cloud encoder and replacing the U-Net structure with an encoder-only transformer, RobotDiffuse improves the model’s ability to capture temporal dependencies and generate smoother, more coherent motion plans. We validate the approach using a complex simulator, and release a new dataset with 35M robot poses and 0.14M obstacle avoidance scenarios. Experimental results demonstrate the effectiveness of RobotDiffuse and the promise of diffusion models for motion planning tasks. The code can be accessed at this https URL.

[LG-14] Towards Simple and Provable Parameter-Free Adaptive Gradient Methods

链接: https://arxiv.org/abs/2412.19444
作者: Yuanzhe Tao,Huizhuo Yuan,Xun Zhou,Yuan Cao,Quanquan Gu
关键词: Adam, significantly advanced, advanced the training, models by dynamically, dynamically adjusting
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 34 pages, 16 figures, 3 tables

点击查看摘要

Abstract:Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, adhoc tuning of learning rates poses a challenge, leading to inefficiencies in practice. To address this issue, recent research has focused on developing “learning-rate-free” or “parameter-free” algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates. Experimental results across various deep learning tasks validate the competitive performance of AdaGrad++ and Adam++.
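
For context, the sketch below is the standard AdaGrad update that these parameter-free variants start from. It is not AdaGrad++ or Adam++ themselves; their exact update rules are given in the paper rather than the abstract, so this is only the textbook baseline.

```python
# Standard AdaGrad update (reference only; the paper's variants remove the
# need to hand-tune the learning rate below).
import numpy as np

def adagrad(grad_fn, w0, lr=0.1, eps=1e-8, steps=200):
    w = w0.astype(float).copy()
    g2_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        g2_sum += g ** 2
        w -= lr * g / (np.sqrt(g2_sum) + eps)   # per-coordinate adaptive step size
    return w

# Toy convex problem: minimize ||Aw - b||^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
grad = lambda w: 2 * A.T @ (A @ w - b) / len(b)
print(adagrad(grad, np.zeros(5)))
```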

[LG-15] Improving Generalization for AI-Synthesized Voice Detection AAAI25

链接: https://arxiv.org/abs/2412.19279
作者: Hainan Ren,Lin Li,Chun-Hao Liu,Xin Wang,Shu Hu
关键词: create realistic human, realistic human voices, AI-synthesized voice technology, beneficial applications, malicious purposes
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: AAAI25

点击查看摘要

Abstract:AI-synthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing AI-synthesized voice detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use diverse data and advanced machine learning techniques (e.g., domain-invariant representation, self-supervised learning), but are limited by predefined vocoders and sensitivity to factors like background noise and speaker identity. In this work, we introduce an innovative disentanglement framework aimed at extracting domain-agnostic artifact features related to vocoders. Utilizing these features, we enhance model learning in a flat loss landscape, enabling escape from suboptimal solutions and improving generalization. Extensive experiments on benchmarks show our approach outperforms state-of-the-art methods, achieving up to 5.12% improvement in the equal error rate metric in intra-domain and 7.59% in cross-domain evaluations.

[LG-16] Virtual Nodes Can Help: Tackling Distribution Shifts in Federated Graph Learning AAAI2025

链接: https://arxiv.org/abs/2412.19229
作者: Xingbo Fu,Zihan Chen,Yinhan He,Song Wang,Binchi Zhang,Chen Chen,Jundong Li
关键词: Graph Neural Networks, Federated Graph Learning, Neural Networks, powerful graph learning, graph-related downstream tasks
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Federated Graph Learning (FGL) enables multiple clients to jointly train powerful graph learning models, e.g., Graph Neural Networks (GNNs), without sharing their local graph data for graph-related downstream tasks, such as graph property prediction. In the real world, however, the graph data can suffer from significant distribution shifts across clients as the clients may collect their graph data for different purposes. In particular, graph properties are usually associated with invariant label-relevant substructures (i.e., subgraphs) across clients, while label-irrelevant substructures can appear in a client-specific manner. The issue of distribution shifts of graph data hinders the efficiency of GNN training and leads to serious performance degradation in FGL. To tackle the aforementioned issue, we propose a novel FGL framework entitled FedVN that eliminates distribution shifts through client-specific graph augmentation strategies with multiple learnable Virtual Nodes (VNs). Specifically, FedVN lets the clients jointly learn a set of shared VNs while training a global GNN model. To eliminate distribution shifts, each client trains a personalized edge generator that determines how the VNs connect local graphs in a client-specific manner. Furthermore, we provide theoretical analyses indicating that FedVN can eliminate distribution shifts of graph data across clients. Comprehensive experiments on four datasets under five settings demonstrate the superiority of our proposed FedVN over nine baselines.

[LG-17] Multi-view Fake News Detection Model Based on Dynamic Hypergraph

链接: https://arxiv.org/abs/2412.19227
作者: Rongping Ye,Xiaobing Pei
关键词: content moderation mechanisms, online social networks, moderation mechanisms, rapid development, development of online
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of online social networks and the inadequacies in content moderation mechanisms, the detection of fake news has emerged as a pressing concern for the public. Various methods have been proposed for fake news detection, including text-based approaches as well as a series of graph-based approaches. However, the deceptive nature of fake news renders text-based approaches less effective. Propagation tree-based methods focus on the propagation process of individual news, capturing pairwise relationships but lacking the capability to capture high-order complex relationships. Large heterogeneous graph-based approaches necessitate the incorporation of substantial additional information beyond news text and user data, while hypergraph-based approaches rely on predefined hypergraph structures. To tackle these issues, we propose a novel dynamic hypergraph-based multi-view fake news detection model (DHy-MFND) that learns news embeddings across three distinct views: text-level, propagation tree-level, and hypergraph-level. By employing hypergraph structures to model complex high-order relationships among multiple news pieces and introducing dynamic hypergraph structure learning, we optimize predefined hypergraph structures while learning news embeddings. Additionally, we introduce contrastive learning to capture authenticity-relevant embeddings across different views. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our proposed DHy-MFND compared with a broad range of competing baselines.

[LG-18] Applying the maximum entropy principle to multi-species neural networks improves species distribution models

链接: https://arxiv.org/abs/2412.19217
作者: Maxime Ryckewaert,Diego Marcos,Christophe Botella,Maximilien Servajean,Pierre Bonnet,Alexis Joly
关键词: citizen science initiatives, biodiversity databases, rapid expansion, expansion of citizen, citizen science
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted to Methods in Ecology and Evolution

点击查看摘要

Abstract:The rapid expansion of citizen science initiatives has led to a significant growth of biodiversity databases, and particularly presence-only (PO) observations. PO data are invaluable for understanding species distributions and their dynamics, but their use in Species Distribution Models (SDM) is curtailed by sampling biases and the lack of information on absences. Poisson point processes are widely used for SDMs, with Maxent being one of the most popular methods. Maxent maximises the entropy of a probability distribution across sites as a function of predefined transformations of environmental variables, called features. In contrast, neural networks and deep learning have emerged as a promising technique for automatic feature extraction from complex input variables. In this paper, we propose DeepMaxent, which harnesses neural networks to automatically learn shared features among species, using the maximum entropy principle. To do so, it employs a normalised Poisson loss where, for each species, presence probabilities across sites are modelled by a neural network. We evaluate DeepMaxent on a benchmark dataset known for its spatial sampling biases, using PO data for calibration and presence-absence (PA) data for validation across six regions with different biological groups and environmental covariates. Our results indicate that DeepMaxent improves model performance over Maxent and other state-of-the-art SDMs across regions and taxonomic groups. The method performs particularly well in regions of uneven sampling, demonstrating substantial potential to improve species distribution modelling. The method opens the possibility of learning more robust environmental features that jointly predict many species, and it scales to arbitrarily large numbers of sites without increased memory demand.

[LG-19] Towards Better Spherical Sliced-Wasserstein Distance Learning with Data-Adaptive Discriminative Projection Direction AAAI2025

链接: https://arxiv.org/abs/2412.19212
作者: Hongliang Zhang,Shuo Chen,Lei Luo,Jian Yang
关键词: original SSW distance, Discriminative Spherical Sliced-Wasserstein, original SSW, spherical data distributions, deep representation learning
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Spherical Sliced-Wasserstein (SSW) has recently been proposed to measure the discrepancy between spherical data distributions in various fields, such as geology, medical domains, computer vision, and deep representation learning. However, in the original SSW, all projection directions are treated equally, which is too idealistic and cannot accurately reflect the importance of different projection directions for various data distributions. To address this issue, we propose a novel data-adaptive Discriminative Spherical Sliced-Wasserstein (DSSW) distance, which utilizes a projected energy function to determine the discriminative projection direction for SSW. In our new DSSW, we introduce two types of projected energy functions to generate the weights for projection directions with complete theoretical guarantees. The first type employs a non-parametric deterministic function that transforms the projected Wasserstein distance into its corresponding weight in each projection direction. This improves the performance of the original SSW distance with negligible additional computational overhead. The second type utilizes a neural network-induced function that learns the projection direction weight through a parameterized neural network based on data projections. This further enhances the performance of the original SSW distance with less extra computational overhead. Finally, we evaluate the performance of our proposed DSSW by comparing it with several state-of-the-art methods across a variety of machine learning tasks, including gradient flows, density estimation on real earth data, and self-supervised learning.
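
The sketch below computes the ordinary (Euclidean) sliced-Wasserstein distance with uniformly weighted random projections, i.e. the "all directions treated equally" baseline the paper improves on. The spherical geometry and the learned projected-energy weights of DSSW are not implemented here.

```python
# Plain sliced-Wasserstein distance via random projections (not the paper's
# spherical or weighted variant).
import numpy as np

def sliced_wasserstein(X, Y, n_proj=200, p=2, seed=None):
    """X, Y: (n, d) samples from two distributions (equal n for simplicity)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    dirs = rng.normal(size=(n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit projection directions
    px, py = X @ dirs.T, Y @ dirs.T                       # (n, n_proj) 1D projections
    px, py = np.sort(px, axis=0), np.sort(py, axis=0)     # 1D optimal transport = sort and match
    return (np.mean(np.abs(px - py) ** p)) ** (1 / p)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Y = rng.normal(size=(500, 3)) + 1.0
print(sliced_wasserstein(X, Y, seed=0))
```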

[LG-20] Large Language Models Meet Graph Neural Networks: A Perspective of Graph Mining

链接: https://arxiv.org/abs/2412.19211
作者: Yuxin You,Zhen Liu,Xiangchao Wen,Yongtao Zhang,Wei Ai
关键词: extracting valuable information, involves extracting valuable, Large Language Models, important area, extracting valuable
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph mining is an important area in data mining and machine learning that involves extracting valuable information from graph-structured data. In recent years, significant progress has been made in this field through the development of graph neural networks (GNNs). However, GNNs are still deficient in generalizing to diverse graph data. Aiming to this issue, Large Language Models (LLMs) could provide new solutions for graph mining tasks with their superior semantic understanding. In this review, we systematically review the combination and application techniques of LLMs and GNNs and present a novel taxonomy for research in this interdisciplinary field, which involves three main categories: GNN-driving-LLM, LLM-driving-GNN, and GNN-LLM-co-driving. Within this framework, we reveal the capabilities of LLMs in enhancing graph feature extraction as well as improving the effectiveness of downstream tasks such as node classification, link prediction, and community detection. Although LLMs have demonstrated their great potential in handling graph-structured data, their high computational requirements and complexity remain challenges. Future research needs to continue to explore how to efficiently fuse LLMs and GNNs to achieve more powerful graph learning and reasoning capabilities and provide new impetus for the development of graph mining techniques.

[LG-21] Context-Aware Deep Learning for Multi Modal Depression Detection

链接: https://arxiv.org/abs/2412.19209
作者: Genevieve Lam,Huang Dongyan,Weisi Lin
关键词: Analysis Interview Corpus, Distress Analysis Interview, focus on automated, automated approaches, approaches to detect
类目: Machine Learning (cs.LG)
*备注: Presented as an Oral at International Conference on Acoustics, Speech and Signal Processing 2019, United Kingdom

点击查看摘要

Abstract:In this study, we focus on automated approaches to detect depression from clinical interviews using multi-modal machine learning (ML). Our approach differs from other successful ML methods, such as context-aware analysis through feature engineering and end-to-end deep neural networks, for depression detection utilizing the Distress Analysis Interview Corpus. We propose a novel method that incorporates: (1) pre-trained Transformer combined with data augmentation based on topic modelling for textual data; and (2) deep 1D convolutional neural network (CNN) for acoustic feature modeling. The simulation results demonstrate the effectiveness of the proposed method for training multi-modal deep learning models. Our deep 1D CNN and Transformer models achieved state-of-the-art performance for audio and text modalities respectively. Combining them in a multi-modal framework also outperforms the state-of-the-art for the combined setting. Code available at this https URL

[LG-22] Developing Explainable Machine Learning Model using Augmented Concept Activation Vector

链接: https://arxiv.org/abs/2412.19208
作者: Reza Hassanpour,Kasim Oztoprak,Niels Netten,Tony Busker,Mortaza S. Bargh,Sunil Choenni,Beyza Kizildag,Leyla Sena Kilinc
关键词: high dimensional feature, dimensional feature spaces, Machine learning models, Machine learning, class labels
类目: Machine Learning (cs.LG)
*备注: 11 pages, 8 figures, “to be published in the Journal of Computer Science”

点击查看摘要

Abstract:Machine learning models use high dimensional feature spaces to map their inputs to the corresponding class labels. However, these features often do not have a one-to-one correspondence with physical concepts understandable by humans, which hinders the ability to provide a meaningful explanation for the decisions made by these models. We propose a method for measuring the correlation between high-level concepts and the decisions made by a machine learning model. Our method can isolate the impact of a given high-level concept and accurately measure it quantitatively. Additionally, this study aims to determine the prevalence of frequent patterns in machine learning models, which often occur in imbalanced datasets. We have successfully applied the proposed method to fundus images and managed to quantitatively measure the impact of radiomic patterns on the model decisions.
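
As a reference point only, the sketch below follows the classic concept-activation-vector recipe: a linear probe separates activations of concept examples from those of random examples, and the probe's weight vector serves as the concept direction. The paper's augmented variant and its quantitative impact measure are not described in the abstract and are not implemented here; all arrays are synthetic placeholders.

```python
# Classic CAV-style probe (baseline sketch, not the paper's augmented method).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts_concept = rng.normal(loc=1.0, size=(200, 64))   # activations for concept examples
acts_random = rng.normal(loc=0.0, size=(200, 64))    # activations for random examples

X = np.vstack([acts_concept, acts_random])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit concept direction

# Sensitivity of a class score to the concept: sign of the directional derivative,
# approximated here with a random "gradient" placeholder.
grad_of_class_score = rng.normal(size=64)
print("concept sensitivity:", float(grad_of_class_score @ cav))
```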

[LG-23] GAIS: A Novel Approach to Instance Selection with Graph Attention Networks

链接: https://arxiv.org/abs/2412.19201
作者: Zahiriddin Rustamov,Ayham Zaitouny,Rafat Damseh,Nazar Zaki
关键词: Attention-based Instance Selection, Graph Attention Networks, reduce dataset size, Instance selection, Attention Networks
类目: Machine Learning (cs.LG)
*备注: Accepted at ICKG 2024. Code is available at this https URL

点击查看摘要

Abstract:Instance selection (IS) is a crucial technique in machine learning that aims to reduce dataset size while maintaining model performance. This paper introduces a novel method called Graph Attention-based Instance Selection (GAIS), which leverages Graph Attention Networks (GATs) to identify the most informative instances in a dataset. GAIS represents the data as a graph and uses GATs to learn node representations, enabling it to capture complex relationships between instances. The method processes data in chunks, applies random masking and similarity thresholding during graph construction, and selects instances based on confidence scores from the trained GAT model. Experiments on 13 diverse datasets demonstrate that GAIS consistently outperforms traditional IS methods in terms of effectiveness, achieving high reduction rates (average 96%) while maintaining or improving model performance. Although GAIS exhibits slightly higher computational costs, its superior performance in maintaining accuracy with significantly reduced training data makes it a promising approach for graph-based data selection.

[LG-24] ERGNN: Spectral Graph Neural Network with Explicitly-optimized Rational Graph Filters ICASSP2025

链接: https://arxiv.org/abs/2412.19106
作者: Guoming Li,Jian Yang,Shangsong Liang
关键词: shown substantial performance, Approximation-based spectral graph, graph neural networks, graph learning tasks, Approximation-based spectral
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA)
*备注: Accepted in 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

点击查看摘要

Abstract:Approximation-based spectral graph neural networks, which construct graph filters with function approximation, have shown substantial performance in graph learning tasks. Despite their great success, existing works primarily employ polynomial approximation to construct the filters, whereas another superior option, namely rational approximation, remains underexplored. Although a handful of prior works have attempted to deploy the rational approximation, their implementations often involve intensive computational demands or still resort to polynomial approximations, hindering the full potential of the rational graph filters. To address the issues, this paper introduces ERGNN, a novel spectral GNN with an explicitly-optimized rational filter. ERGNN adopts a unique two-step framework that sequentially applies the numerator filter and the denominator filter to the input signals, thus streamlining the model paradigm while enabling explicit optimization of both the numerator and denominator of the rational filter. Extensive experiments validate the superiority of ERGNN over state-of-the-art methods, establishing it as a practical solution for deploying rational-based GNNs.

[LG-25] Tint Your Models Task-wise for Improved Multi-task Model Merging

链接: https://arxiv.org/abs/2412.19098
作者: Aecheon Jung,Seunghwan Lee,Dongyoon Han,Sungeun Hong
关键词: Traditional model merging, Traditional model, sign consensus, multi-task learning, weight averaging
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional model merging methods for multi-task learning (MTL) address task conflicts with straightforward strategies such as weight averaging, sign consensus, or minimal test-time adjustments. This presumably counts on the assumption that a merged encoder still retains abundant task knowledge from individual encoders, implying that its shared representation is sufficiently general across tasks. However, our insight is that adding just a single trainable task-specific layer further can bring striking performance gains, as demonstrated by our pilot study. Motivated by this finding, we propose Model Tinting, a new test-time approach that introduces a single task-specific layer for each task as trainable adjustments. Our method jointly trains merging coefficients and task-specific layers, which effectively reduces task conflicts with minimal additional costs. Additionally, we propose a sampling method that utilizes the difference in confidence levels of both merged and individual encoders. Extensive experiments demonstrate our method’s effectiveness, which achieves state-of-the-art performance across both computer vision and natural language processing tasks and significantly surpasses prior works. Our code is available at this https URL.

[LG-26] Assessing Pre-trained Models for Transfer Learning through Distribution of Spectral Components

链接: https://arxiv.org/abs/2412.19085
作者: Tengxue Zhang,Yang Shu,Xinyang Chen,Yifei Long,Chenjuan Guo,Bin Yang
关键词: Pre-trained model assessment, Pre-trained model, Spectral Components, Pre-trained, aims to identify
类目: Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

Abstract:Pre-trained model assessment for transfer learning aims to identify the optimal candidate for the downstream tasks from a model hub, without the need of time-consuming fine-tuning. Existing advanced works mainly focus on analyzing the intrinsic characteristics of the entire features extracted by each pre-trained model or how well such features fit the target labels. This paper proposes a novel perspective for pre-trained model assessment through the Distribution of Spectral Components (DISCO). Through singular value decomposition of features extracted from pre-trained models, we investigate different spectral components and observe that they possess distinct transferability, contributing diversely to the fine-tuning performance. Inspired by this, we propose an assessment method based on the distribution of spectral components which measures the proportions of their corresponding singular values. Pre-trained models with features concentrating on more transferable components are regarded as better choices for transfer learning. We further leverage the labels of downstream data to better estimate the transferability of each spectral component and derive the final assessment criterion. Our proposed method is flexible and can be applied to both classification and regression tasks. We conducted comprehensive experiments across three benchmarks and two tasks including image classification and object detection, demonstrating that our method achieves state-of-the-art performance in choosing proper pre-trained models from the model hub for transfer learning.
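
A minimal sketch of the basic ingredient, assuming features are extracted as an (n_samples, dim) matrix: singular value decomposition followed by the proportion of each spectral component. How DISCO combines these proportions with downstream labels into its final assessment criterion is not specified in the abstract and is left out.

```python
# Distribution of spectral components via SVD of a feature matrix (sketch only).
import numpy as np

def spectral_proportions(features, k=10):
    """features: (n_samples, dim) activations from a pre-trained model."""
    features = features - features.mean(axis=0, keepdims=True)
    s = np.linalg.svd(features, compute_uv=False)   # singular values, descending
    props = s / s.sum()                              # proportion of each component
    return props[:k]

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128)) @ rng.normal(size=(128, 128))  # placeholder features
print(spectral_proportions(feats, k=5))
```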

[LG-27] Effective and secure federated online learning to rank

链接: https://arxiv.org/abs/2412.19069
作者: Shuyi Wang
关键词: Federated Online Learning, Learning to Rank, optimises ranking models, Online Learning, implicit user feedback
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注: PhD Thesis

点击查看摘要

Abstract:Online Learning to Rank (OLTR) optimises ranking models using implicit user feedback, such as clicks. Unlike traditional Learning to Rank (LTR) methods that rely on a static set of training data with relevance judgements to learn a ranking model, OLTR methods update the model continually as new data arrives. Thus, it addresses several drawbacks such as the high cost of human annotations, potential misalignment between user preferences and human judgments, and the rapid changes in user query intents. However, OLTR methods typically require the collection of searchable data, user queries, and clicks, which poses privacy concerns for users. Federated Online Learning to Rank (FOLTR) integrates OLTR within a Federated Learning (FL) framework to enhance privacy by not sharing raw data. While promising, FOLTR methods currently lag behind traditional centralised OLTR due to challenges in ranking effectiveness, robustness with respect to data distribution across clients, susceptibility to attacks, and the ability to unlearn client interactions and data. This thesis presents a comprehensive study on Federated Online Learning to Rank, addressing its effectiveness, robustness, security, and unlearning capabilities, thereby expanding the landscape of FOLTR.

[LG-28] FFCG: Effective and Fast Family Column Generation for Solving Large-Scale Linear Program

链接: https://arxiv.org/abs/2412.19066
作者: Yi-Xiang Hu,Feng Wu,Shaoang Li,Yifang Zhao,Xiang-Yang Li
关键词: large-scale linear programs, solve large-scale linear, linear programs, Family Column Generation, Column Generation
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Column Generation (CG) is an effective and iterative algorithm to solve large-scale linear programs (LP). During each CG iteration, new columns are added to improve the solution of the LP. Typically, CG greedily selects one column with the most negative reduced cost, which can be improved by adding more columns at once. However, selecting all columns with negative reduced costs would lead to the addition of redundant columns that do not improve the objective value. Therefore, selecting the appropriate columns to add is still an open problem and previous machine-learning-based approaches for CG only add a constant quantity of columns per iteration due to the state-space explosion problem. To address this, we propose Fast Family Column Generation (FFCG) – a novel reinforcement-learning-based CG that selects a variable number of columns as needed in an iteration. Specifically, we formulate the column selection problem in CG as an MDP and design a reward metric that balances both the convergence speed and the number of redundant columns. In our experiments, FFCG converges faster on the common benchmarks and reduces the number of CG iterations by 77.1% for Cutting Stock Problem (CSP) and 84.8% for Vehicle Routing Problem with Time Windows (VRPTW), and a 71.4% reduction in computing time for CSP and 84.0% for VRPTW on average compared to several state-of-the-art baselines.

[LG-29] Revealing the Self: Brainwave-Based Human Trait Identification

链接: https://arxiv.org/abs/2412.19041
作者: Md Mirajul Islam,Md Nahiyan Uddin,Maoyejatun Hasana,Debojit Pandit,Nafis Mahmud Rahman,Sriram Chellappan,Sami Azam,A. B. M. Alim Al Islam
关键词: People exhibit unique, exhibit unique emotional, People exhibit, unique emotional responses, exhibit unique
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
*备注: 11th International Conference on Networking, Systems, and Security (NSysS '24)

点击查看摘要

Abstract:People exhibit unique emotional responses. In the same scenario, the emotional reactions of two individuals can be either similar or vastly different. For instance, consider one person’s reaction to an invitation to smoke versus another person’s response to a query about their sleep quality. The identification of these individual traits through the observation of common physical parameters opens the door to a wide range of applications, including psychological analysis, criminology, disease prediction, addiction control, and more. While there has been previous research in the fields of psychometrics, inertial sensors, computer vision, and audio analysis, this paper introduces a novel technique for identifying human traits in real time using brainwave data. To achieve this, we begin with an extensive study of brainwave data collected from 80 participants using a portable EEG headset. We also conduct a statistical analysis of the collected data utilizing box plots. Our analysis uncovers several new insights, leading us to a groundbreaking unified approach for identifying diverse human traits by leveraging machine learning techniques on EEG data. Our analysis demonstrates that this proposed solution achieves high accuracy. Moreover, we explore two deep-learning models to compare the performance of our solution. Consequently, we have developed an integrated, real-time trait identification solution using EEG data, based on the insights from our analysis. To validate our approach, we conducted a rigorous user evaluation with an additional 20 participants. The outcomes of this evaluation illustrate both high accuracy and favorable user ratings, emphasizing the robust potential of our proposed method to serve as a versatile solution for human trait identification.

[LG-30] Detection and classification of DDoS flooding attacks by machine learning method

链接: https://arxiv.org/abs/2412.18990
作者: Dmytro Tymoshchuk,Oleh Yasniy,Mykola Mytnyk,Nataliya Zagorodna,Vitaliy Tymoshchuk
关键词: ACK Flooding, HTTP Flooding, SYN Flooding, UDP Flooding, Flooding
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Paper Submitted to BAIT 2024 CEUR-WS, see this https URL

点击查看摘要

Abstract:This study focuses on a method for detecting and classifying distributed denial of service (DDoS) attacks, such as SYN Flooding, ACK Flooding, HTTP Flooding, and UDP Flooding, using neural networks. Machine learning, particularly neural networks, is highly effective in detecting malicious traffic. A dataset containing normal traffic and various DDoS attacks was used to train a neural network model with a 24-106-5 architecture. The model achieved high Accuracy (99.35%), Precision (99.32%), Recall (99.54%), and F-score (0.99) in the classification task. All major attack types were correctly identified. The model was also further tested in the lab using virtual infrastructures to generate normal and DDoS traffic. The results showed that the model can accurately classify attacks under near-real-world conditions, demonstrating 95.05% accuracy and balanced F-score scores for all attack types. This confirms that neural networks are an effective tool for detecting DDoS attacks in modern information security systems.
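
A minimal sketch of the 24-106-5 network mentioned above using scikit-learn, with synthetic placeholder data standing in for the 24 traffic features and the five classes (normal traffic plus four flooding attacks). The actual dataset, preprocessing, and training setup from the study are not reproduced.

```python
# Sketch of a 24-106-5 classifier for DDoS traffic (placeholder data).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 24))                  # 24 traffic features per flow (assumed)
y = rng.integers(0, 5, size=2000)                # 5 classes: normal + 4 DDoS types

clf = MLPClassifier(hidden_layer_sizes=(106,), max_iter=300, random_state=0)
clf.fit(X, y)                                     # 24 -> 106 -> 5 architecture
print(clf.predict(X[:3]))
```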

[LG-31] Evaluating deep learning models for fault diagnosis of a rotating machinery with epistemic and aleatoric uncertainty

链接: https://arxiv.org/abs/2412.18980
作者: Reza Jalayer,Masoud Jalayer,Andrea Mor,Carlotta Orsenigo,Carlo Vercellis
关键词: recently gained attention, uncertainty-aware fault diagnosis, Uncertainty-aware deep learning, models recently gained, fault diagnosis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Uncertainty-aware deep learning (DL) models recently gained attention in fault diagnosis as a way to promote the reliable detection of faults when out-of-distribution (OOD) data arise from unseen faults (epistemic uncertainty) or the presence of noise (aleatoric uncertainty). In this paper, we present the first comprehensive comparative study of state-of-the-art uncertainty-aware DL architectures for fault diagnosis in rotating machinery, where different scenarios affected by epistemic uncertainty and different types of aleatoric uncertainty are investigated. The selected architectures include sampling by dropout, Bayesian neural networks, and deep ensembles. Moreover, to distinguish between in-distribution and OOD data in the different scenarios two uncertainty thresholds, one of which is introduced in this paper, are alternatively applied. Our empirical findings offer guidance to practitioners and researchers who have to deploy real-world uncertainty-aware fault diagnosis systems. In particular, they reveal that, in the presence of epistemic uncertainty, all DL models are capable of effectively detecting, on average, a substantial portion of OOD data across all the scenarios. However, deep ensemble models show superior performance, independently of the uncertainty threshold used for discrimination. In the presence of aleatoric uncertainty, the noise level plays an important role. Specifically, low noise levels hinder the models’ ability to effectively detect OOD data. Even in this case, however, deep ensemble models exhibit a milder degradation in performance, dominating the others. These achievements, combined with their shorter inference time, make deep ensemble architectures the preferred choice.
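
The toy sketch below shows the deep-ensemble recipe in its simplest form: several independently trained models, averaged predictive probabilities, and predictive entropy as the uncertainty score that a threshold turns into an in-distribution versus OOD decision. Models, data, and thresholds are placeholders, not the architectures or the uncertainty thresholds studied in the paper.

```python
# Toy deep-ensemble sketch with predictive entropy as the uncertainty score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
ensemble = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=s).fit(X, y)
    for s in range(5)
]

def predictive_entropy(x):
    probs = np.mean([m.predict_proba(x) for m in ensemble], axis=0)  # ensemble average
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

x_in = X[:5]
x_ood = np.random.default_rng(1).normal(size=(5, 16)) * 5  # shifted "unseen fault" data
print("in-dist entropy:", predictive_entropy(x_in))
print("OOD entropy:", predictive_entropy(x_ood))
```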

[LG-32] Adopting Trustworthy AI for Sleep Disorder Prediction: Deep Time Series Analysis with Temporal Attention Mechanism and Counterfactual Explanations

链接: https://arxiv.org/abs/2412.18971
作者: Pegah Ahadian,Wei Xu,Sherry Wang,Qiang Guan
关键词: sleep disorder prediction, major impact, Sleep disorders, Effective sleep disorder, Temporal Convolutional Networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sleep disorders have a major impact on both lifestyle and health. Effective sleep disorder prediction from lifestyle and physiological data can provide essential details for early intervention. This research utilizes three deep time series models and facilitates them with explainability approaches for sleep disorder prediction. Specifically, our approach adopts Temporal Convolutional Networks (TCN), Long Short-Term Memory (LSTM) for time series data analysis, and Temporal Fusion Transformer model (TFT). Meanwhile, the temporal attention mechanism and counterfactual explanation with SHapley Additive exPlanations (SHAP) approach are employed to ensure dependable, accurate, and interpretable predictions. Finally, using a large dataset of sleep health measures, our evaluation demonstrates the effect of our method in predicting sleep disorders.

[LG-33] Malware Classification using a Hybrid Hidden Markov Model-Convolutional Neural Network

链接: https://arxiv.org/abs/2412.18932
作者: Ritik Mehta,Olha Jureckova,Mark Stamp
关键词: malware detection approaches, Convolutional Neural Network, malware variants poses, traditional malware detection, Hidden Markov Model
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2307.11032

点击查看摘要

Abstract:The proliferation of malware variants poses significant challenges to traditional malware detection approaches, such as signature-based methods, necessitating the development of advanced machine learning techniques. In this research, we present a novel approach based on a hybrid architecture combining features extracted using a Hidden Markov Model (HMM) with a Convolutional Neural Network (CNN) that is then used for malware classification. Inspired by the strong results in previous work using an HMM-Random Forest model, we propose integrating HMMs, which serve to capture sequential patterns in opcode sequences, with CNNs, which are adept at extracting hierarchical features. We demonstrate the effectiveness of our approach on the popular Malicia dataset, and we obtain superior performance, as compared to other machine learning methods – our results surpass the aforementioned HMM-Random Forest model. Our findings underscore the potential of hybrid HMM-CNN architectures in bolstering malware classification capabilities, offering several promising avenues for further research in the field of cybersecurity.

[LG-34] FedCFA: Alleviating Simpson's Paradox in Model Aggregation with Counterfactual Federated Learning

链接: https://arxiv.org/abs/2412.18904
作者: Zhonghua Jiang,Jimin Xu,Shengyu Zhang,Tao Shen,Jiwei Li,Kun Kuang,Haibin Cai,Fei Wu
关键词: Simpson Paradox, Federated learning, distributed optimization, promising technology, privacy and distributed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a promising technology for data privacy and distributed optimization, but it suffers from data imbalance and heterogeneity among clients. Existing FL methods try to solve the problems by aligning client with server model or by correcting client model with control variables. These methods excel on IID and general Non-IID data but perform mediocrely in Simpson’s Paradox scenarios. Simpson’s Paradox refers to the phenomenon that the trend observed on the global dataset disappears or reverses on a subset, which may lead to the fact that global model obtained through aggregation in FL does not accurately reflect the distribution of global data. Thus, we propose FedCFA, a novel FL framework employing counterfactual learning to generate counterfactual samples by replacing local data critical factors with global average data, aligning local data distributions with the global and mitigating Simpson’s Paradox effects. In addition, to improve the quality of counterfactual samples, we introduce factor decorrelation (FDC) loss to reduce the correlation among features and thus improve the independence of extracted factors. We conduct extensive experiments on six datasets and verify that our method outperforms other FL methods in terms of efficiency and global model accuracy under limited communication rounds.

[LG-35] Adversarial Training for Graph Neural Networks via Graph Subspace Energy Optimization

链接: https://arxiv.org/abs/2412.18886
作者: Ganlin Liu,Ziling Liang,Xiaowei Huang,Xinping Yi,Shi Jin
关键词: GNN adversarial training, graph neural networks, adversarial training, GNN, GNN adversarial
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite impressive capability in learning over graph-structured data, graph neural networks (GNN) suffer from adversarial topology perturbation in both training and inference phases. While adversarial training has demonstrated remarkable effectiveness in image classification tasks, its suitability for GNN models has been doubted until a recent advance that shifts the focus from transductive to inductive learning. Still, GNN robustness in the inductive setting is under-explored, and it calls for deeper understanding of GNN adversarial training. To this end, we propose a new concept of graph subspace energy (GSE) – a generalization of graph energy that measures graph stability – of the adjacency matrix, as an indicator of GNN robustness against topology perturbations. To further demonstrate the effectiveness of such concept, we propose an adversarial training method with the perturbed graphs generated by maximizing the GSE regularization term, referred to as AT-GSE. To deal with the local and global topology perturbations raised respectively by LRBCD and PRBCD, we employ randomized SVD (RndSVD) and Nystrom low-rank approximation to favor the different aspects of the GSE terms. An extensive set of experiments shows that AT-GSE outperforms consistently the state-of-the-art GNN adversarial training methods over different homophily and heterophily datasets in terms of adversarial accuracy, whilst more surprisingly achieving a superior clean accuracy on non-perturbed graphs.

[LG-36] Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL NEURIPS2024

链接: https://arxiv.org/abs/2412.18855
作者: Qin-Wen Luo,Ming-Kun Xie,Ye-Wen Wang,Sheng-Jun Huang
关键词: limited online interactions, improve performance rapidly, offline pre-trained policy, policy as initialization, initialization to improve
类目: Machine Learning (cs.LG)
*备注: Accepted to Neurips 2024

点击查看摘要

Abstract:Offline-to-online (O2O) reinforcement learning (RL) provides an effective means of leveraging an offline pre-trained policy as initialization to improve performance rapidly with limited online interactions. Recent studies often design fine-tuning strategies for a specific offline RL method and cannot perform general O2O learning from any offline method. To deal with this problem, we disclose that there are evaluation and improvement mismatches between the offline dataset and the online environment, which hinders the direct application of pre-trained policies to online fine-tuning. In this paper, we propose to handle these two mismatches simultaneously, which aims to achieve general O2O learning from any offline method to any online method. Before online fine-tuning, we re-evaluate the pessimistic critic trained on the offline dataset in an optimistic way and then calibrate the misaligned critic with the reliable offline actor to avoid erroneous updates. After obtaining an optimistic and aligned critic, we perform constrained fine-tuning to combat distribution shift during online learning. We show empirically that the proposed method can achieve stable and efficient performance improvement on multiple simulated tasks when compared to the state-of-the-art methods.

[LG-37] TPCH: Tensor-interacted Projection and Cooperative Hashing for Multi-view Clustering

链接: https://arxiv.org/abs/2412.18847
作者: Zhongwen Wang,Xingfeng Li,Yinghui Sun,Quansen Sun,Yuan Sun,Han Ling,Jian Dai,Zhenwen Ren
关键词: hash-based multi-view clustering, recent years, anchor and hash-based, handling large-scale data, gained attention
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, anchor and hash-based multi-view clustering methods have gained attention for their efficiency and simplicity in handling large-scale data. However, existing methods often overlook the interactions among multi-view data and higher-order cooperative relationships during projection, negatively impacting the quality of hash representation in low-dimensional spaces, clustering performance, and sensitivity to noise. To address this issue, we propose a novel approach named Tensor-Interacted Projection and Cooperative Hashing for Multi-View Clustering (TPCH). TPCH stacks multiple projection matrices into a tensor, taking into account the synergies and communications during the projection process. By capturing higher-order multi-view information through dual projection and Hamming space, TPCH employs an enhanced tensor nuclear norm to learn more compact and distinguishable hash representations, promoting communication within and between views. Experimental results demonstrate that this refined method significantly outperforms state-of-the-art methods in clustering on five large-scale multi-view datasets. Moreover, in terms of CPU time, TPCH achieves substantial acceleration compared to the most advanced current methods. The code is available at this https URL.

[LG-38] Enhancing Federated Graph Learning via Adaptive Fusion of Structural and Node Characteristics

链接: https://arxiv.org/abs/2412.18845
作者: Xianjun Gao,Jianchun Liu,Hongli Xu,Shilong Wang,Liusheng Huang
关键词: Federated Graph Learning, Graph Neural Network, Neural Network, global Graph Neural, Federated Graph
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Graph Learning (FGL) has demonstrated the advantage of training a global Graph Neural Network (GNN) model across distributed clients using their local graph data. Unlike Euclidean data (e.g., images), graph data is composed of nodes and edges, where the overall node-edge connections determine the topological structure, and individual nodes along with their neighbors capture local node features. However, existing studies tend to prioritize one aspect over the other, leading to an incomplete understanding of the data and the potential misidentification of key characteristics across varying graph scenarios. Additionally, the non-independent and identically distributed (non-IID) nature of graph data makes the extraction of these two data characteristics even more challenging. To address the above issues, we propose a novel FGL framework, named FedGCF, which aims to simultaneously extract and fuse structural properties and node features to effectively handle diverse graph scenarios. FedGCF first clusters clients by structural similarity, performing model aggregation within each cluster to form the shared structural model. Next, FedGCF selects the clients with common node features and aggregates their models to generate a common node model. This model is then propagated to all clients, allowing common node features to be shared. By combining these two models with a proper ratio, FedGCF can achieve a comprehensive understanding of the graph data and deliver better performance, even under non-IID distributions. Experimental results show that FedGCF improves accuracy by 4.94%-7.24% under different data distributions and reduces communication cost by 64.18%-81.25% to reach the same accuracy compared to baselines.

[LG-39] CausalTAD: Causal Implicit Generative Model for Debiased Online Trajectory Anomaly Detection ICDE2024

链接: https://arxiv.org/abs/2412.18820
作者: Wenbin Li,Di Yao,Chang Gong,Xiaokai Chu,Quanliang Jing,Xiaolei Zhou,Yuxuan Zhang,Yunxia Fan,Jingping Bi
关键词: Trajectory anomaly detection, real-world applications, Trajectory anomaly, anomaly risk, anomaly
类目: Machine Learning (cs.LG)
*备注: Accepted by ICDE 2024

点击查看摘要

Abstract:Trajectory anomaly detection, aiming to estimate the anomaly risk of trajectories given the Source-Destination (SD) pairs, has become a critical problem for many real-world applications. Existing solutions directly train a generative model for observed trajectories and calculate the conditional generative probability P(T|C) as the anomaly risk, where T and C represent the trajectory and SD pair respectively. However, we argue that the observed trajectories are confounded by road network preference, which is a common cause of both SD distribution and trajectories. Existing methods ignore this issue, limiting their generalization ability on out-of-distribution trajectories. In this paper, we define the debiased trajectory anomaly detection problem and propose a causal implicit generative model, namely CausalTAD, to solve it. CausalTAD adopts do-calculus to eliminate the confounding bias of road network preference and estimates P(T|do(C)) as the anomaly criterion. Extensive experiments show that CausalTAD can not only achieve superior performance on trained trajectories but also generally improve the performance of out-of-distribution data, with improvements of 2.1%~5.7% and 10.6%~32.7% respectively.

[LG-40] On Improved Regret Bounds In Bayesian Optimization with Gaussian Noise

链接: https://arxiv.org/abs/2412.18789
作者: Jingyi Wang,Haowei Wang,Cosmin G. Petra,Nai-Yuan Chiang
关键词: black-box optimization method, powerful black-box optimization, optimization method, black-box optimization, surrogate models
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) with Gaussian process (GP) surrogate models is a powerful black-box optimization method. Acquisition functions are a critical part of a BO algorithm as they determine how the new samples are selected. Some of the most widely used acquisition functions include upper confidence bound (UCB) and Thompson sampling (TS). The convergence analysis of BO algorithms has focused on the cumulative regret under both the Bayesian and frequentist settings for the objective. In this paper, we establish new pointwise bounds on the prediction error of GP under the frequentist setting with Gaussian noise. Consequently, we prove improved convergence rates of cumulative regret bound for both GP-UCB and GP-TS. Of note, the new prediction error bound under Gaussian noise can be applied to general BO algorithms and convergence analysis, e.g., the asymptotic convergence of expected improvement (EI) with noise.
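For readers less familiar with the acquisition functions analyzed here, the sketch below shows a generic, textbook GP-UCB loop with scikit-learn; it is context for the analysis, not the paper's method, and the objective, the beta schedule, and the candidate grid are placeholder assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):  # hypothetical noisy black-box function
    return -np.sin(3 * x) - x**2 + 0.7 * x + 0.1 * np.random.randn()

candidates = np.linspace(-1.0, 2.0, 200).reshape(-1, 1)
X, y = [np.array([[0.0]])], [objective(0.0)]

for t in range(1, 21):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)
    gp.fit(np.vstack(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    beta = 2.0 * np.log(candidates.shape[0] * t**2)  # a common heuristic schedule
    ucb = mu + np.sqrt(beta) * sigma                 # upper confidence bound
    x_next = candidates[np.argmax(ucb)]
    X.append(x_next.reshape(1, -1))
    y.append(objective(float(x_next[0])))

best = np.vstack(X)[np.argmax(y)]
print("best query so far:", best, "value:", max(y))
```

GP-TS would replace the UCB line with a draw from the posterior over the candidate grid; the paper's contribution is tighter regret bounds for both variants, not a new loop.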

[LG-41] Thermal-Mechanical Physics Informed Deep Learning For Fast Prediction of Thermal Stress Evolution in Laser Metal Deposition

链接: https://arxiv.org/abs/2412.18786
作者: R. Sharma,Y.B. Guo
关键词: producing high-quality components, Understanding thermal stress, metal additive manufacturing, Understanding thermal, additive manufacturing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding thermal stress evolution in metal additive manufacturing (AM) is crucial for producing high-quality components. Recent advancements in machine learning (ML) have shown great potential for modeling complex multiphysics problems in metal AM. While physics-based simulations face the challenge of high computational costs, conventional data-driven ML models require large, labeled training datasets to achieve accurate predictions. Unfortunately, generating large datasets for ML model training through time-consuming experiments or high-fidelity simulations is highly expensive in metal AM. To address these challenges, this study introduces a physics-informed neural network (PINN) framework that incorporates governing physical laws into deep neural networks (NNs) to predict temperature and thermal stress evolution during the laser metal deposition (LMD) process. The study also discusses the enhanced accuracy and efficiency of the PINN model when supplemented with small simulation data. Furthermore, it highlights the PINN transferability, enabling fast predictions with a set of new process parameters using a pre-trained PINN model as an online soft sensor, significantly reducing computation time compared to physics-based numerical models while maintaining accuracy.
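The LMD thermo-mechanical residuals are not given in the abstract, so as a hedged stand-in the sketch below uses the 1D heat equation u_t = alpha * u_xx to show the general PINN pattern: differentiate the network output with autograd and add the PDE residual to the data/initial-condition loss. The geometry, diffusivity, sampling, and loss weights are illustrative assumptions; boundary terms are omitted for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
alpha = 0.1  # assumed thermal diffusivity

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def pde_residual(x, t):
    x.requires_grad_(True)
    t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - alpha * u_xx  # heat-equation residual

for step in range(2000):
    x_c, t_c = torch.rand(256, 1), torch.rand(256, 1)   # interior collocation points
    x_i, t_i = torch.rand(128, 1), torch.zeros(128, 1)  # initial condition at t = 0
    loss_pde = pde_residual(x_c, t_c).pow(2).mean()
    loss_ic = (net(torch.cat([x_i, t_i], 1)) - torch.sin(torch.pi * x_i)).pow(2).mean()
    loss = loss_pde + loss_ic
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The paper's framework couples coupled thermal and mechanical residuals for LMD and adds small simulation data; the structure of the composite loss is the same idea as above.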

[LG-42] Robustness Evaluation of Offline Reinforcement Learning for Robot Control Against Action Perturbations

链接: https://arxiv.org/abs/2412.18781
作者: Shingo Ayabe,Takuto Otomo,Hiroshi Kera,Kazuhiko Kawamoto
关键词: Offline reinforcement learning, reinforcement learning, reinforcement learning methods, environmental interaction, gained attention
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures

点击查看摘要

Abstract:Offline reinforcement learning, which learns solely from datasets without environmental interaction, has gained attention. This approach, similar to traditional online deep reinforcement learning, is particularly promising for robot control applications. Nevertheless, its robustness against real-world challenges, such as joint actuator faults in robots, remains a critical concern. This study evaluates the robustness of existing offline reinforcement learning methods using legged robots from OpenAI Gym based on average episodic rewards. For robustness evaluation, we simulate failures by incorporating both random and adversarial perturbations, representing worst-case scenarios, into the joint torque signals. Our experiments show that existing offline reinforcement learning methods exhibit significant vulnerabilities to these action perturbations and are more vulnerable than online reinforcement learning methods, highlighting the need for more robust approaches in this field.
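The exact perturbation magnitudes and the adversarial procedure are left to the paper; below is a hedged sketch of only the random-perturbation part of such an evaluation, using the Gymnasium API (MuJoCo extras required for legged environments) and a placeholder policy standing in for a pre-trained offline RL agent.

```python
import numpy as np
import gymnasium as gym

def evaluate(policy, noise_scale, episodes=10, env_id="Ant-v4"):
    """Average episodic reward when Gaussian noise is added to the joint torques."""
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = policy(obs)
            # Random perturbation applied to the action (joint torque) signal.
            action = action + noise_scale * np.random.randn(*action.shape)
            action = np.clip(action, env.action_space.low, env.action_space.high)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns))

# Placeholder policy; a real study would load the trained offline RL agent here.
random_policy = lambda obs: np.random.uniform(-1, 1, size=8)
for scale in (0.0, 0.1, 0.3):
    print(scale, evaluate(random_policy, scale, episodes=2))
```

Adversarial (worst-case) perturbations would replace the Gaussian noise with perturbations optimized to minimize the return, which is the harder setting the paper studies.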

[LG-43] Towards a Statistical Understanding of Neural Networks: Beyond the Neural Tangent Kernel Theories

链接: https://arxiv.org/abs/2412.18756
作者: Haobo Zhang,Jianfa Lai,Yicheng Li,Qian Lin,Jun S. Liu
关键词: theoretically analyze due, feature learning, feature learning characteristics, training dynamics, neural networks lies
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:A primary advantage of neural networks lies in their feature learning characteristics, which is challenging to theoretically analyze due to the complexity of their training dynamics. We propose a new paradigm for studying feature learning and the resulting benefits in generalizability. After reviewing the neural tangent kernel (NTK) theory and recent results in kernel regression, which address the generalization issue of sufficiently wide neural networks, we examine limitations and implications of the fixed kernel theory (as the NTK theory) and review recent theoretical advancements in feature learning. Moving beyond the fixed kernel/feature theory, we consider neural networks as adaptive feature models. Finally, we propose an over-parameterized Gaussian sequence model as a prototype model to study the feature learning characteristics of neural networks.

[LG-44] Adaptive Self-supervised Learning for Social Recommendations

链接: https://arxiv.org/abs/2412.18735
作者: Xin He,Shanru Lin,Wenqi Fan,Mingchen Sun,Ying Wang,Xin Wang
关键词: self-supervised auxiliary tasks, auxiliary tasks, self-supervised auxiliary, exploit social relations, auxiliary
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:In recent years, researchers have attempted to exploit social relations to improve the performance in recommendation systems. Generally, most existing social recommendation methods heavily depend on substantial domain knowledge and expertise in primary recommendation tasks for designing useful auxiliary tasks. Meanwhile, Self-Supervised Learning (SSL) recently has received considerable attention in the field of recommendation, since it can provide self-supervision signals in assisting the improvement of target recommendation systems by constructing self-supervised auxiliary tasks from raw data without human-annotated labels. Despite the great success, these SSL-based social recommendations are insufficient to adaptively balance various self-supervised auxiliary tasks, since assigning equal weights on various auxiliary tasks can result in sub-optimal recommendation performance, where different self-supervised auxiliary tasks may contribute differently to improving the primary social recommendation across different datasets. To address this issue, in this work, we propose Adaptive Self-supervised Learning for Social Recommendations (AdasRec) by taking advantage of various self-supervised auxiliary tasks. More specifically, an adaptive weighting mechanism is proposed to learn adaptive weights for various self-supervised auxiliary tasks, so as to balance the contribution of such self-supervised auxiliary tasks for enhancing representation learning in social recommendations. The adaptive weighting mechanism is used to assign different weights to the auxiliary tasks, achieving an overall weighting of all auxiliary tasks that ultimately assists the primary recommendation task; this is achieved by a meta learning optimization problem with an adaptive weighting network. Comprehensive experiments on various real-world datasets are conducted to verify the effectiveness of our proposed method.
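AdasRec solves a bi-level meta-learning problem with a weighting network; as a much simpler hedged illustration of the underlying idea (learning non-uniform weights over auxiliary SSL losses instead of fixing them to be equal), here is a softmax-parameterized weighting in PyTorch. The class name and the toy losses are assumptions, and this is not the paper's formulation.

```python
import torch
import torch.nn as nn

class WeightedAuxLoss(nn.Module):
    """Learn a softmax weighting over K auxiliary self-supervised losses."""
    def __init__(self, num_aux):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_aux))

    def forward(self, primary_loss, aux_losses):
        w = torch.softmax(self.logits, dim=0)               # adaptive weights, sum to 1
        return primary_loss + (w * torch.stack(aux_losses)).sum()

# Toy usage with placeholder scalar losses.
weighter = WeightedAuxLoss(num_aux=3)
opt = torch.optim.Adam(weighter.parameters(), lr=1e-2)
primary = torch.tensor(1.0, requires_grad=True)
aux = [torch.tensor(v, requires_grad=True) for v in (0.5, 2.0, 0.1)]
loss = weighter(primary, aux)
loss.backward()
opt.step()
print(torch.softmax(weighter.logits, dim=0))
```

Naively minimizing this joint loss would collapse the weights toward the smallest auxiliary loss, which is exactly why the paper optimizes the weights on a meta objective (validation performance of the primary task) rather than on the training loss itself.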

[LG-45] Elucidating Flow Matching ODE Dynamics with respect to Data Geometries

链接: https://arxiv.org/abs/2412.18730
作者: Gal Mishne,Zhengchao Wan,Qingsong Wang,Yusu Wang
关键词: flow matching models, flow matching, matching models, image generation, standard for image
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-based generative models have become the standard for image generation. ODE-based samplers and flow matching models improve efficiency, in comparison to diffusion models, by reducing sampling steps through learned vector fields. However, the theoretical foundations of flow matching models remain limited, particularly regarding the convergence of individual sample trajectories at terminal time - a critical property that impacts sample quality and is a critical assumption for models like the consistency model. In this paper, we advance the theory of flow matching models through a comprehensive analysis of sample trajectories, centered on the denoiser that drives ODE dynamics. We establish the existence, uniqueness and convergence of ODE trajectories at terminal time, ensuring stable sampling outcomes under minimal assumptions. Our analysis reveals how trajectories evolve from capturing global data features to local structures, providing the geometric characterization of per-sample behavior in flow matching models. We also explain the memorization phenomenon in diffusion-based training through our terminal time analysis. These findings bridge critical gaps in understanding flow matching models, with practical implications for sampling stability and model design.

[LG-46] Effective and Lightweight Representation Learning for Link Sign Prediction in Signed Bipartite Graphs

链接: https://arxiv.org/abs/2412.18720
作者: Gyeongmin Gu,Minseo Jeon,Hyun-Je Song,Jinhong Jung
关键词: signed bipartite, signed bipartite graphs, bipartite, effectively and efficiently, bipartite graphs
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:How can we effectively and efficiently learn node representations in signed bipartite graphs? A signed bipartite graph is a graph consisting of two node sets where nodes of different types are positively or negatively connected, and it has been extensively used to model various real-world relationships such as e-commerce, etc. To analyze such a graph, previous studies have focused on designing methods for learning node representations using graph neural networks. In particular, these methods insert edges between nodes of the same type based on balance theory, enabling them to leverage augmented structures in their learning. However, the existing methods rely on a naive message passing design, which is prone to over-smoothing and susceptible to noisy interactions in real-world graphs. Furthermore, they suffer from computational inefficiency due to their heavy design and the significant increase in the number of added edges. In this paper, we propose ELISE, an effective and lightweight GNN-based approach for learning signed bipartite graphs. We first extend personalized propagation to a signed bipartite graph, incorporating signed edges during message passing. This extension adheres to balance theory without introducing additional edges, mitigating the over-smoothing issue and enhancing representation power. We then jointly learn node embeddings on a low-rank approximation of the signed bipartite graph, which reduces potential noise and emphasizes its global structure, further improving expressiveness without significant loss of efficiency. We encapsulate these ideas into ELISE, designing it to be lightweight, unlike the previous methods that add too many edges and cause inefficiency. Through extensive experiments on real-world signed bipartite graphs, we demonstrate that ELISE outperforms its competitors for predicting link signs while providing faster training and inference time.

[LG-47] Variational Bayesian Inference for Tensor Robust Principal Component Analysis

链接: https://arxiv.org/abs/2412.18717
作者: Chao Wang,Huiwen Zheng,Raymond Chan,Youwen Wen
关键词: Principal Component Analysis, Robust Principal Component, Tensor Robust Principal, Robust Principal, Component Analysis
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tensor Robust Principal Component Analysis (TRPCA) holds a crucial position in machine learning and computer vision. It aims to recover underlying low-rank structures and characterize the sparse structures of noise. Current approaches often encounter difficulties in accurately capturing the low-rank properties of tensors and balancing the trade-off between low-rank and sparse components, especially in a mixed-noise scenario. To address these challenges, we introduce a Bayesian framework for TRPCA, which integrates a low-rank tensor nuclear norm prior and a generalized sparsity-inducing prior. By embedding the proposed priors within the Bayesian framework, our method can automatically determine the optimal tensor nuclear norm and achieve a balance between the nuclear norm and sparse components. Furthermore, our method can be efficiently extended to the weighted tensor nuclear norm model. Experiments conducted on synthetic and real-world datasets demonstrate the effectiveness and superiority of our method compared to state-of-the-art approaches.

[LG-48] TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications

链接: https://arxiv.org/abs/2412.18695
作者: Neiwen Ling,Guojun Chen,Lin Zhong
关键词: Large Language Models, Large Language, Language Models, comprehend complex commands, process diverse tasks
类目: Robotics (cs.RO); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend complex commands and process diverse tasks. This advancement facilitates their application in controlling drones and robots for various tasks. However, existing LLM serving systems typically employ a first-come, first-served (FCFS) batching mechanism, which fails to address the time-sensitive requirements of robotic applications. To address this, this paper proposes a new system named TimelyLLM that serves multiple robotic agents with time-sensitive requests. TimelyLLM introduces novel mechanisms of segmented generation and scheduling that optimally leverage redundancy between robot plan generation and execution phases. We report an implementation of TimelyLLM on a widely-used LLM serving framework and evaluate it on a range of robotic applications. Our evaluation shows that TimelyLLM improves the time utility up to 1.97x, and reduces the overall waiting time by 84%.

[LG-49] Comparing analytic and data-driven approaches to parameter identifiability: A power systems case study

链接: https://arxiv.org/abs/2412.18663
作者: Nikolaos Evangelou,Alexander M. Stankovic,Ioannis G. Kevrekidis,Mark K. Transtrum
关键词: Parameter identifiability refers, quantify parameter identifiability, Parameter identifiability, Parameter, capability of accurately
类目: Machine Learning (cs.LG)
*备注: 15 Pages, 14 Figures, 5 Tables

点击查看摘要

Abstract:Parameter identifiability refers to the capability of accurately inferring the parameter values of a model from its observations (data). Traditional analysis methods exploit analytical properties of the closed form model, in particular sensitivity analysis, to quantify the response of the model predictions to variations in parameters. Techniques developed to analyze data, specifically manifold learning methods, have the potential to complement, and even extend the scope of the traditional analytical approaches. We report on a study comparing and contrasting analytical and data-driven approaches to quantify parameter identifiability and, importantly, perform parameter reduction tasks. We use the infinite bus synchronous generator model, a well-understood model from the power systems domain, as our benchmark problem. Our traditional analysis methods use the Fisher Information Matrix to quantify parameter identifiability analysis, and the Manifold Boundary Approximation Method to perform parameter reduction. We compare these results to those arrived at through data-driven manifold learning schemes: Output - Diffusion Maps and Geometric Harmonics. For our test case, we find that the two suites of tools (analytical when a model is explicitly available, as well as data-driven when the model is lacking and only measurement data are available) give (correct) comparable results; these results are also in agreement with traditional analysis based on singular perturbation theory. We then discuss the prospects of using data-driven methods for such model analysis.
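As a generic sketch of the Fisher-Information-based side of this analysis: under additive Gaussian noise, FIM = J^T J / sigma^2, where J holds the sensitivities of the model predictions to the parameters, and small FIM eigenvalues flag poorly identifiable ("sloppy") parameter combinations. The synchronous generator model is not reproduced here; a toy exponential model with a known non-identifiable product stands in, and the finite-difference step is an assumption.

```python
import numpy as np

def fisher_information(model, theta, t, sigma=1.0, eps=1e-6):
    """FIM = J^T J / sigma^2 for additive Gaussian noise; J via finite differences."""
    y0 = model(theta, t)
    J = np.zeros((y0.size, theta.size))
    for j in range(theta.size):
        d = np.zeros_like(theta)
        d[j] = eps
        J[:, j] = (model(theta + d, t) - y0) / eps
    return J.T @ J / sigma**2

# Toy stand-in model: only the product a*b is identifiable, not a and b separately.
model = lambda th, t: th[0] * th[1] * np.exp(-th[2] * t)
theta = np.array([2.0, 3.0, 0.5])
t = np.linspace(0, 5, 50)
fim = fisher_information(model, theta, t)
eigvals = np.linalg.eigvalsh(fim)
print("FIM eigenvalues (near-zero values indicate unidentifiable directions):", eigvals)
```

The data-driven route in the paper (Diffusion Maps, Geometric Harmonics) reaches similar conclusions without access to the closed-form model, which is the comparison the study is about.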

[LG-50] An efficient search-and-score algorithm for ancestral graphs using multivariate information scores

链接: https://arxiv.org/abs/2412.17508
作者: Nikita Lagrange,Herve Isambert
关键词: unobserved latent variables, propose a greedy, originating from unobserved, latent variables, include directed
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 22 pages, 4 figures

点击查看摘要

Abstract:We propose a greedy search-and-score algorithm for ancestral graphs, which include directed as well as bidirected edges, originating from unobserved latent variables. The normalized likelihood score of ancestral graphs is estimated in terms of multivariate information over relevant "ac-connected subsets" of vertices, C, that are connected through collider paths confined to the ancestor set of C. For computational efficiency, the proposed two-step algorithm relies on local information scores limited to the close surrounding vertices of each node (step 1) and edge (step 2). This computational strategy, although restricted to information contributions from ac-connected subsets containing up to two-collider paths, is shown to outperform state-of-the-art causal discovery methods on challenging benchmark datasets.

[LG-51] LASER: A new method for locally adaptive nonparametric regression

链接: https://arxiv.org/abs/2412.19802
作者: Sabyasachi Chatterjee,Subhajit Goswami,Soumendu Sundar Mukherjee
关键词: Adaptive Smoothing Estimator, Smoothing Estimator, performs variable bandwidth, Locally Adaptive Smoothing, computationally efficient locally
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 29 pages, 6 figures

点击查看摘要

Abstract:In this article, we introduce LASER (Locally Adaptive Smoothing Estimator for Regression), a computationally efficient locally adaptive nonparametric regression method that performs variable bandwidth local polynomial regression. We prove that it adapts (near-)optimally to the local Hölder exponent of the underlying regression function simultaneously at all points in its domain. Furthermore, we show that there is a single ideal choice of a global tuning parameter under which the above mentioned local adaptivity holds. Despite the vast literature on nonparametric regression, instances of practicable methods with provable guarantees of such a strong notion of local adaptivity are rare. The proposed method achieves excellent performance across a broad range of numerical experiments in comparison to popular alternative locally adaptive methods.
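LASER's bandwidth selection rule is in the paper, not the abstract, so the sketch below only shows the primitive it builds on: a local linear (kernel-weighted least squares) estimator that accepts a per-point bandwidth. The kernel, the toy data, and the crude piecewise bandwidth choice are placeholder assumptions, not LASER's rule.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear regression estimate at x0 with a Gaussian kernel of bandwidth h."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    X = np.column_stack([np.ones_like(x), x - x0])   # intercept + local slope
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]                                    # fitted value at x0

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(8 * x) + 0.2 * rng.standard_normal(300)

# A crude stand-in for variable bandwidth: a different h on each half of the domain.
for x0 in np.linspace(0.05, 0.95, 10):
    h = 0.05 if x0 < 0.5 else 0.1
    print(round(x0, 2), round(local_linear(x0, x, y, h), 3))
```

The point of LASER is that the per-point bandwidth is chosen in a data-driven way, with a single global tuning parameter, so that the estimator tracks the local smoothness automatically.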

[LG-52] Symbolic Approximations to Ricci-flat Metrics Via Extrinsic Symmetries of Calabi-Yau Hypersurfaces

链接: https://arxiv.org/abs/2412.19778
作者: Viktor Mirjanić,Challenger Mishra
关键词: Yau non-constructive existence, non-constructive existence proof, explicit construction remains, Yau non-constructive, proof of Ricci-flat
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Algebraic Geometry (math.AG); Differential Geometry (math.DG)
*备注: 40 pages, 14 figures

点击查看摘要

Abstract:Ever since Yau’s non-constructive existence proof of Ricci-flat metrics on Calabi-Yau manifolds, finding their explicit construction remains a major obstacle to development of both string theory and algebraic geometry. Recent computational approaches employ machine learning to create novel neural representations for approximating these metrics, offering high accuracy but limited interpretability. In this paper, we analyse machine learning approximations to flat metrics of Fermat Calabi-Yau n-folds and some of their one-parameter deformations in three dimensions in order to discover their new properties. We formalise cases in which the flat metric has more symmetries than the underlying manifold, and prove that these symmetries imply that the flat metric admits a surprisingly compact representation for certain choices of complex structure moduli. We show that such symmetries uniquely determine the flat metric on certain loci, for which we present an analytic form. We also incorporate our theoretical results into neural networks to achieve state-of-the-art reductions in Ricci curvature for multiple Calabi-Yau manifolds. We conclude by distilling the ML models to obtain for the first time closed-form expressions for Kähler metrics with near-zero scalar curvature.

[LG-53] Learning to Forget: Bayesian Time Series Forecasting using Recurrent Sparse Spectrum Signature Gaussian Processes

链接: https://arxiv.org/abs/2412.19727
作者: Csaba Tóth,Masaki Adachi,Michael A. Osborne,Harald Oberhauser
关键词: strong theoretical guarantees, time series, signature features, time series forecasting, stochastic analysis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The signature kernel is a kernel between time series of arbitrary length and comes with strong theoretical guarantees from stochastic analysis. It has found applications in machine learning such as covariance functions for Gaussian processes. A strength of the underlying signature features is that they provide a structured global description of a time series. However, this property can quickly become a curse when local information is essential and forgetting is required; so far this has only been addressed with ad-hoc methods such as slicing the time series into subsegments. To overcome this, we propose a principled, data-driven approach by introducing a novel forgetting mechanism for signatures. This allows the model to dynamically adapt its context length to focus on more recent information. To achieve this, we revisit the recently introduced Random Fourier Signature Features, and develop Random Fourier Decayed Signature Features (RFDSF) with Gaussian processes (GPs). This results in a Bayesian time series forecasting algorithm with variational inference, that offers a scalable probabilistic algorithm that processes and transforms a time series into a joint predictive distribution over time steps in one pass using recurrence. For example, it processes a sequence of length 10^4 steps in roughly 10^-2 seconds and in 1 GB of GPU memory. We demonstrate that it outperforms other GP-based alternatives and competes with state-of-the-art probabilistic time series forecasting algorithms.

[LG-54] Causal machine learning for heterogeneous treatment effects in the presence of missing outcome data

链接: https://arxiv.org/abs/2412.19711
作者: Matthew Pryce,Karla Diaz-Ordaz,Ruth H. Keogh,Stijn Vansteelandt
关键词: estimating heterogeneous treatment, treatment effect estimation, heterogeneous treatment effects, complicate treatment effect, treatment effect
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 34 pages, 6 figures, 4 tables

点击查看摘要

Abstract:When estimating heterogeneous treatment effects, missing outcome data can complicate treatment effect estimation, causing certain subgroups of the population to be poorly represented. In this work, we discuss this commonly overlooked problem and consider the impact that missing at random (MAR) outcome data has on causal machine learning estimators for the conditional average treatment effect (CATE). We then propose two de-biased machine learning estimators for the CATE, the mDR-learner and mEP-learner, which address the issue of under-representation by integrating inverse probability of censoring weights into the DR-learner and EP-learner respectively. We show that under reasonable conditions, these estimators are oracle efficient, and illustrate their favorable performance through simulated data settings, comparing them to existing CATE estimators, including comparison to estimators which use common missing data techniques. Guidance on the implementation of these estimators is provided and we present an example of their application using the ACTG175 trial, exploring treatment effect heterogeneity when comparing Zidovudine mono-therapy against alternative antiretroviral therapies among HIV-1-infected individuals.

[LG-55] Combining Machine Learning with Recurrence Analysis for resonance detection

链接: https://arxiv.org/abs/2412.19683
作者: Ondřej Zelenka,Ondřej Kopáček,Georgios Lukes-Gerakopoulos
关键词: chaotic motion, compact object systems, EMRI, compact object, integrable system
类目: General Relativity and Quantum Cosmology (gr-qc); Machine Learning (cs.LG)
*备注: 12 pages, 10 figures

点击查看摘要

Abstract:The width of a resonance in a nearly integrable system, i.e. in a non-integrable system where chaotic motion is still not prominent, can tell us how a perturbation parameter is driving the system away from integrability. Although the tool that we are presenting here is quite generic and can be used in a variety of systems, our particular interest lies in binary compact object systems known as extreme mass ratio inspirals (EMRIs). In an EMRI a lighter compact object, like a black hole or a neutron star, inspirals into a supermassive black hole due to gravitational radiation reaction. During this inspiral the lighter object crosses resonances, which are still not very well modeled. Measuring the width of resonances in EMRI models allows us to estimate the importance of each perturbation parameter able to drive the system away from resonances and decide whether its impact should be included in EMRI waveform modeling or not. To tackle this issue in our study we show first that recurrence quantifiers of orbits carry imprints of resonant behavior, regardless of the system’s dimensionality. As a next step, we apply a long short-term memory machine learning architecture to automate the resonance detection procedure. Our analysis is developed on a simple standard map and gradually we extend it to more complicated systems until finally we employ it in a generic deformed Kerr spacetime known in the literature as the Johannsen-Psaltis spacetime.

[LG-56] Deep ReLU networks – injectivity capacity upper bounds

链接: https://arxiv.org/abs/2412.19677
作者: Mihailo Stojnic
关键词: feed forward neural, forward neural networks, ReLU feed forward, feed forward, forward neural
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study deep ReLU feed forward neural networks (NN) and their injectivity abilities. The main focus is on precisely determining the so-called injectivity capacity. For any given hidden layers architecture, it is defined as the minimal ratio between number of network’s outputs and inputs which ensures unique recoverability of the input from a realizable output. A strong recent progress in precisely studying single ReLU layer injectivity properties is here moved to a deep network level. In particular, we develop a program that connects deep $l$-layer net injectivity to an $l$-extension of the $\ell_0$ spherical perceptrons, thereby massively generalizing an isomorphism between studying single layer injectivity and the capacity of the so-called (1-extension) $\ell_0$ spherical perceptrons discussed in [82]. Random duality theory (RDT) based machinery is then created and utilized to statistically handle properties of the extended $\ell_0$ spherical perceptrons and implicitly of the deep ReLU NNs. A sizeable set of numerical evaluations is conducted as well to put the entire RDT machinery in practical use. From these we observe a rapidly decreasing tendency in needed layers’ expansions, i.e., we observe a rapid expansion saturation effect. Only 4 layers of depth are sufficient to closely approach level of no needed expansion – a result that fairly closely resembles observations made in practical experiments and that has so far remained completely untouchable by any of the existing mathematical methodologies.

[LG-57] Deep Linear Hawkes Processes

链接: https://arxiv.org/abs/2412.19634
作者: Yuxin Chang,Alex Boyd,Cao Xiao,Taha Kass-Hout,Parminder Bhatia,Padhraic Smyth,Andrew Warrington
关键词: Marked temporal point, irregular arrival times, Marked temporal, temporal point processes, linear Hawkes processes
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Marked temporal point processes (MTPPs) are used to model sequences of different types of events with irregular arrival times, with broad applications ranging from healthcare and social networks to finance. We address shortcomings in existing point process models by drawing connections between modern deep state-space models (SSMs) and linear Hawkes processes (LHPs), culminating in an MTPP that we call the deep linear Hawkes process (DLHP). The DLHP modifies the linear differential equations in deep SSMs to be stochastic jump differential equations, akin to LHPs. After discretizing, the resulting recurrence can be implemented efficiently using a parallel scan. This brings parallelism and linear scaling to MTPP models. This contrasts with attention-based MTPPs, which scale quadratically, and RNN-based MTPPs, which do not parallelize across the sequence length. We show empirically that DLHPs match or outperform existing models across a broad range of metrics on eight real-world datasets. Our proposed DLHP model is the first instance of the unique architectural capabilities of SSMs being leveraged to construct a new class of MTPP models.

[LG-58] Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping

链接: https://arxiv.org/abs/2412.19529
作者: Zijian Liu,Zhengyuan Zhou
关键词: first-order nonconvex stochastic, nonconvex stochastic optimization, mathfrak, Batched Normalized Stochastic, Stochastic Gradient Descent
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: In submission

点击查看摘要

Abstract:Recently, the study of heavy-tailed noises in first-order nonconvex stochastic optimization has gotten a lot of attention since it was recognized as a more realistic condition as suggested by many empirical observations. Specifically, the stochastic noise (the difference between the stochastic and true gradient) is considered only to have a finite $\mathfrak{p}$-th moment where $\mathfrak{p}\in(1,2]$ instead of assuming it always satisfies the classical finite variance assumption. To deal with this more challenging setting, people have proposed different algorithms and proved them to converge at an optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate for smooth objectives after $T$ iterations. Notably, all these new-designed algorithms are based on the same technique - gradient clipping. Naturally, one may want to know whether the clipping method is a necessary ingredient and the only way to guarantee convergence under heavy-tailed noises. In this work, by revisiting the existing Batched Normalized Stochastic Gradient Descent with Momentum (Batched NSGDM) algorithm, we provide the first convergence result under heavy-tailed noises but without gradient clipping. Concretely, we prove that Batched NSGDM can achieve the optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate even under the relaxed smooth condition. More interestingly, we also establish the first $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{2\mathfrak{p}}})$ convergence rate in the case where the tail index $\mathfrak{p}$ is unknown in advance, which is arguably the common scenario in practice.
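The update at the heart of Batched NSGDM (batched stochastic gradients, a momentum buffer, a step along the normalized momentum direction) admits a short generic sketch. The hyperparameters, the toy objective, and the synthetic heavy-tailed noise below are placeholders, not the constants from the paper's analysis.

```python
import numpy as np

def batched_nsgdm(grad_fn, x0, steps=500, batch=32, lr=0.05, beta=0.9, eps=1e-12):
    """Normalized SGD with momentum: step along m / ||m|| using batched gradients."""
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = np.mean([grad_fn(x) for _ in range(batch)], axis=0)  # batched stochastic gradient
        m = beta * m + (1 - beta) * g                            # momentum buffer
        x -= lr * m / (np.linalg.norm(m) + eps)                  # normalized step, no clipping
    return x

# Toy nonconvex objective with synthetic heavy-tailed (infinite-variance) gradient noise.
rng = np.random.default_rng(0)
def noisy_grad(x):
    clean = 4 * x**3 - 4 * x                                 # gradient of x^4 - 2x^2
    return clean + rng.standard_t(df=1.5, size=x.shape)      # Student-t noise, tail index ~1.5
print(batched_nsgdm(noisy_grad, np.array([2.5])))
```

The normalization bounds the step size regardless of how large an individual gradient sample is, which is intuitively why clipping turns out not to be necessary in the paper's analysis.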

[LG-59] Meta-Learning-Based Delayless Subband Adaptive Filter using Complex Self-Attention for Active Noise Control

链接: https://arxiv.org/abs/2412.19471
作者: Pengxing Feng,Hing Cheung So
关键词: Active noise control, typically employs adaptive, employs adaptive filtering, control typically employs, noise control typically
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 31 pages, 8 figures

点击查看摘要

Abstract:Active noise control typically employs adaptive filtering to generate secondary noise, where the least mean square algorithm is the most widely used. However, traditional updating rules are linear and exhibit limited effectiveness in addressing nonlinear environments and nonstationary noise. To tackle this challenge, we reformulate the active noise control problem as a meta-learning problem and propose a meta-learning-based delayless subband adaptive filter with deep neural networks. The core idea is to utilize a neural network as an adaptive algorithm that can adapt to different environments and types of noise. The neural network will train under noisy observations, implying that it recognizes the optimized updating rule without true labels. A single-headed attention recurrent neural network is devised with learnable feature embedding to update the adaptive filter weight efficiently, enabling accurate computation of the secondary source to attenuate the unwanted primary noise. In order to relax the time constraint on updating the adaptive filter weights, the delayless subband architecture is employed, which will allow the system to be updated less frequently as the downsampling factor increases. In addition, the delayless subband architecture does not introduce additional time delays in active noise control systems. A skip updating strategy is introduced to decrease the updating frequency further so that machines with limited resources have more possibility to board our meta-learning-based model. Extensive multi-condition training ensures generalization and robustness against various types of noise and environments. Simulation results demonstrate that our meta-learning-based model achieves superior noise reduction performance compared to traditional methods.

[LG-60] Comparative Performance Analysis of Quantum Machine Learning Architectures for Credit Card Fraud Detection

链接: https://arxiv.org/abs/2412.19441
作者: Mansour El Alami,Nouhaila Innan,Muhammad Shafique,Mohamed Bennai
关键词: Quantum Neural Network, Quantum Machine Learning, Sampler Quantum Neural, Estimator Quantum Neural, effective detection methods
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 12 pages, 17 figures, 7 tables, under review

点击查看摘要

Abstract:As financial fraud becomes increasingly complex, effective detection methods are essential. Quantum Machine Learning (QML) introduces certain capabilities that may enhance both accuracy and efficiency in this area. This study examines how different quantum feature map and ansatz configurations affect the performance of three QML-based classifiers-the Variational Quantum Classifier (VQC), the Sampler Quantum Neural Network (SQNN), and the Estimator Quantum Neural Network (EQNN)-when applied to two non-standardized financial fraud datasets. Different quantum feature map and ansatz configurations are evaluated, revealing distinct performance patterns. The VQC consistently demonstrates strong classification results, achieving an F1 score of 0.88, while the SQNN also delivers promising outcomes. In contrast, the EQNN struggles to produce robust results, emphasizing the challenges presented by non-standardized data. These findings highlight the importance of careful model configuration in QML-based financial fraud detection. By showing how specific feature maps and ansatz choices influence predictive success, this work guides researchers and practitioners in refining QML approaches for complex financial applications.

[LG-61] Low-Rank Contextual Reinforcement Learning from Heterogeneous Human Feedback

链接: https://arxiv.org/abs/2412.19436
作者: Seong Jin Lee,Will Wei Sun,Yufeng Liu
关键词: aligning large language, large language models, cornerstone for aligning, aligning large, large language
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has become a cornerstone for aligning large language models with human preferences. However, the heterogeneity of human feedback, driven by diverse individual contexts and preferences, poses significant challenges for reward learning. To address this, we propose a Low-rank Contextual RLHF (LoCo-RLHF) framework that integrates contextual information to better model heterogeneous feedback while maintaining computational efficiency. Our approach builds on a contextual preference model, leveraging the intrinsic low-rank structure of the interaction between user contexts and query-answer pairs to mitigate the high dimensionality of feature representations. Furthermore, we address the challenge of distributional shifts in feedback through our Pessimism in Reduced Subspace (PRS) policy, inspired by pessimistic offline reinforcement learning techniques. We theoretically demonstrate that our policy achieves a tighter sub-optimality gap compared to existing methods. Extensive experiments validate the effectiveness of LoCo-RLHF, showcasing its superior performance in personalized RLHF settings and its robustness to distribution shifts.

[LG-62] Asymptotically Optimal Search for a Change Point Anomaly under a Composite Hypothesis Model

链接: https://arxiv.org/abs/2412.19392
作者: Liad Lea Didi,Tomer Gafni,Kobi Cohen
关键词: change point, problem of searching, finite set, anomalous process, anomalous process transitions
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:We address the problem of searching for a change point in an anomalous process among a finite set of M processes. Specifically, we address a composite hypothesis model in which each process generates measurements following a common distribution with an unknown parameter (vector). This parameter belongs to either a normal or abnormal space depending on the current state of the process. Before the change point, all processes, including the anomalous one, are in a normal state; after the change point, the anomalous process transitions to an abnormal state. Our goal is to design a sequential search strategy that minimizes the Bayes risk by balancing sample complexity and detection accuracy. We propose a deterministic search algorithm with the following notable properties. First, we analytically demonstrate that when the distributions of both normal and abnormal processes are unknown, the algorithm is asymptotically optimal in minimizing the Bayes risk as the error probability approaches zero. In the second setting, where the parameter under the null hypothesis is known, the algorithm achieves asymptotic optimality with improved detection time based on the true normal state. Simulation results are presented to validate the theoretical findings.

[LG-63] Minimal Batch Adaptive Learning Policy Engine for Real-Time Mid-Price Forecasting in High-Frequency Trading

链接: https://arxiv.org/abs/2412.19372
作者: Adamantios Ntakaris,Gbenga Ibikunle
关键词: making reliable short-term, modern financial markets, transformed modern financial, reliable short-term price, High-frequency trading
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-frequency trading (HFT) has transformed modern financial markets, making reliable short-term price forecasting models essential. In this study, we present a novel approach to mid-price forecasting using Level 1 limit order book (LOB) data from NASDAQ, focusing on 100 U.S. stocks from the S&P 500 index during the period from September to November 2022. Expanding on our previous work with Radial Basis Function Neural Networks (RBFNN), which leveraged automated feature importance techniques based on mean decrease impurity (MDI) and gradient descent (GD), we introduce the Adaptive Learning Policy Engine (ALPE) - a reinforcement learning (RL)-based agent designed for batch-free, immediate mid-price forecasting. ALPE incorporates adaptive epsilon decay to dynamically balance exploration and exploitation, outperforming a diverse range of highly effective machine learning (ML) and deep learning (DL) models in forecasting performance.
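The "adaptive epsilon decay" mentioned here is, at its core, an exploration schedule; the minimal hedged sketch below shows the generic mechanism (epsilon-greedy selection with a decaying epsilon), not the ALPE agent, its state representation, or its reward design.

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy action selection with multiplicative epsilon decay."""
    def __init__(self, n_actions, eps=1.0, eps_min=0.05, decay=0.999):
        self.n_actions, self.eps, self.eps_min, self.decay = n_actions, eps, eps_min, decay

    def select(self, q_values):
        if random.random() < self.eps:
            action = random.randrange(self.n_actions)        # explore
        else:
            action = max(range(self.n_actions), key=lambda a: q_values[a])  # exploit
        self.eps = max(self.eps_min, self.eps * self.decay)  # decay after every decision
        return action

# Toy usage: three hypothetical actions (e.g., predict down / flat / up).
agent = EpsilonGreedy(n_actions=3)
print([agent.select([0.1, 0.5, 0.2]) for _ in range(5)], round(agent.eps, 4))
```

In ALPE the decay is described as adaptive rather than a fixed schedule, so the constant `decay` factor above should be read only as the simplest possible stand-in.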

[LG-64] Deep learning and whole-brain networks for biomarker discovery: modeling the dynamics of brain fluctuations in resting-state and cognitive tasks

链接: https://arxiv.org/abs/2412.19329
作者: Facundo Roffet,Gustavo Deco,Claudio Delrieux,Gustavo Patow
关键词: biomarkers remains underexplored, models offer insights, network models offer, remains underexplored, bifurcation
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures, 1 table

点击查看摘要

Abstract:Background: Brain network models offer insights into brain dynamics, but the utility of model-derived bifurcation parameters as biomarkers remains underexplored. Objective: This study evaluates bifurcation parameters from a whole-brain network model as biomarkers for distinguishing brain states associated with resting-state and task-based cognitive conditions. Methods: Synthetic BOLD signals were generated using a supercritical Hopf brain network model to train deep learning models for bifurcation parameter prediction. Inference was performed on Human Connectome Project data, including both resting-state and task-based conditions. Statistical analyses assessed the separability of brain states based on bifurcation parameter distributions. Results: Bifurcation parameter distributions differed significantly across task and resting-state conditions (p < 0.0001 for all but one comparison). Task-based brain states exhibited higher bifurcation values compared to rest. Conclusion: Bifurcation parameters effectively differentiate cognitive and resting states, warranting further investigation as biomarkers for brain state characterization and neurological disorder assessment.

[LG-65] Adaptive Conformal Inference by Betting

链接: https://arxiv.org/abs/2412.19318
作者: Aleksandr Podkopaev,Darren Xu,Kuang-Chih Lee
关键词: quantifying predictive uncertainty, machine learning models, adaptive conformal inference, adaptive conformal, valuable tool
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction is a valuable tool for quantifying predictive uncertainty of machine learning models. However, its applicability relies on the assumption of data exchangeability, a condition which is often not met in real-world scenarios. In this paper, we consider the problem of adaptive conformal inference without any assumptions about the data generating process. Existing approaches for adaptive conformal inference are based on optimizing the pinball loss using variants of online gradient descent. A notable shortcoming of such approaches is in their explicit dependence on and sensitivity to the choice of the learning rates. In this paper, we propose a different approach for adaptive conformal inference that leverages parameter-free online convex optimization techniques. We prove that our method controls long-term miscoverage frequency at a nominal level and demonstrate its convincing empirical performance without any need of performing cumbersome parameter tuning.
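For contrast with the parameter-free method proposed here, the online-gradient-descent baseline the abstract refers to (adjusting the miscoverage level with a pinball-loss gradient step, which requires an explicit learning rate) can be sketched as follows; the score stream, calibration set, and learning rate are placeholder assumptions, and the paper's own betting-based update is not reproduced.

```python
import numpy as np

def aci_stream(scores, calib_scores, alpha=0.1, lr=0.02):
    """Standard adaptive conformal inference: alpha_t follows an online gradient step."""
    alpha_t, coverage = alpha, []
    calib = np.sort(calib_scores)
    for s in scores:
        q_level = min(max(1 - alpha_t, 0.0), 1.0)
        q = np.quantile(calib, q_level)    # conformal threshold from calibration scores
        err = float(s > q)                 # 1 if the new point falls outside the set
        coverage.append(1 - err)
        alpha_t += lr * (alpha - err)      # pinball-loss gradient step with learning rate lr
    return np.mean(coverage), alpha_t

rng = np.random.default_rng(0)
cov, a_final = aci_stream(rng.normal(size=1000), rng.normal(size=500))
print(round(cov, 3), round(a_final, 3))
```

The sensitivity of this baseline to `lr` is exactly the shortcoming the paper targets by switching to parameter-free online convex optimization.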

[LG-66] Localized exploration in contextual dynamic pricing achieves dimension-free regret

链接: https://arxiv.org/abs/2412.19252
作者: Jinhang Chai,Yaqi Duan,Jianqing Fan,Kaizheng Wang
关键词: linear demand model, contextual dynamic pricing, demand model, study the problem, problem of contextual
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 60 pages, 9 figures

点击查看摘要

Abstract:We study the problem of contextual dynamic pricing with a linear demand model. We propose a novel localized exploration-then-commit (LetC) algorithm which starts with a pure exploration stage, followed by a refinement stage that explores near the learned optimal pricing policy, and finally enters a pure exploitation stage. The algorithm is shown to achieve a minimax optimal, dimension-free regret bound when the time horizon exceeds a polynomial of the covariate dimension. Furthermore, we provide a general theoretical framework that encompasses the entire time spectrum, demonstrating how to balance exploration and exploitation when the horizon is limited. The analysis is powered by a novel critical inequality that depicts the exploration-exploitation trade-off in dynamic pricing, mirroring its existing counterpart for the bias-variance trade-off in regularized regression. Our theoretical results are validated by extensive experiments on synthetic and real-world data.

[LG-67] Sentiment trading with large language models

链接: https://arxiv.org/abs/2412.19245
作者: Kemal Kirtac,Guido Germano
关键词: large language models, Loughran-McDonald dictionary model, Loughran-McDonald dictionary, large language, potential in predicting
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Econometrics (econ.EM); Portfolio Management (q-fin.PM); Trading and Market Microstructure (q-fin.TR)
*备注:

点击查看摘要

Abstract:We investigate the efficacy of large language models (LLMs) in sentiment analysis of U.S. financial news and their potential in predicting stock market returns. We analyze a dataset comprising 965,375 news articles that span from January 1, 2010, to June 30, 2023; we focus on the performance of various LLMs, including BERT, OPT, FINBERT, and the traditional Loughran-McDonald dictionary model, which has been a dominant methodology in the finance literature. The study documents a significant association between LLM scores and subsequent daily stock returns. Specifically, OPT, which is a GPT-3 based LLM, shows the highest accuracy in sentiment prediction with an accuracy of 74.4%, slightly ahead of BERT (72.5%) and FINBERT (72.2%). In contrast, the Loughran-McDonald dictionary model demonstrates considerably lower effectiveness with only 50.1% accuracy. Regression analyses highlight a robust positive impact of OPT model scores on next-day stock returns, with coefficients of 0.274 and 0.254 in different model specifications. BERT and FINBERT also exhibit predictive relevance, though to a lesser extent. Notably, we do not observe a significant relationship between the Loughran-McDonald dictionary model scores and stock returns, challenging the efficacy of this traditional method in the current financial context. In portfolio performance, the long-short OPT strategy excels with a Sharpe ratio of 3.05, compared to 2.11 for BERT and 2.07 for FINBERT long-short strategies. Strategies based on the Loughran-McDonald dictionary yield the lowest Sharpe ratio of 1.23. Our findings emphasize the superior performance of advanced LLMs, especially OPT, in financial market prediction and portfolio management, marking a significant shift in the landscape of financial analysis tools with implications to financial regulation and policy analysis.
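The return regressions are beyond a short sketch, but the long-short portfolio construction described (long the highest-sentiment names, short the lowest) can be illustrated with a small hedged pandas example; the column names, decile cutoff, and synthetic data are assumptions, not the paper's dataset or exact methodology.

```python
import numpy as np
import pandas as pd

def long_short_returns(df, q=0.1):
    """Daily long-short return: long the top sentiment decile, short the bottom decile."""
    def one_day(day):
        hi = day["sentiment"].quantile(1 - q)
        lo = day["sentiment"].quantile(q)
        longs = day.loc[day["sentiment"] >= hi, "next_ret"].mean()
        shorts = day.loc[day["sentiment"] <= lo, "next_ret"].mean()
        return longs - shorts
    return df.groupby("date").apply(one_day)

# Synthetic example frame standing in for per-stock daily sentiment scores and next-day returns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2022-01-03", periods=50, freq="B").repeat(100),
    "sentiment": rng.normal(size=5000),
    "next_ret": rng.normal(scale=0.02, size=5000),
})
ls = long_short_returns(df)
sharpe = np.sqrt(252) * ls.mean() / ls.std()
print(round(sharpe, 2))
```

On random data the annualized Sharpe ratio hovers around zero; the paper's reported values (e.g., 3.05 for the OPT-based strategy) come from real sentiment scores with predictive content.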

[LG-68] Stochastic normalizing flows for Effective String Theory

链接: https://arxiv.org/abs/2412.19109
作者: Michele Caselle,Elia Cellini,Alessandro Nada
关键词: Effective String Theory, thin vibrating string, Effective String, pure gauge theories, vibrating string
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 1+ 10 pages, 2 figures, contribution for the 41st International Symposium on Lattice Field Theory (Lattice 2024), 28 July - 3 August 2024, Liverpool, UK;

点击查看摘要

Abstract:Effective String Theory (EST) is a powerful tool used to study confinement in pure gauge theories by modeling the confining flux tube connecting a static quark-anti-quark pair as a thin vibrating string. Recently, flow-based samplers have been applied as an efficient numerical method to study EST regularized on the lattice, opening the route to study observables previously inaccessible to standard analytical methods. Flow-based samplers are a class of algorithms based on Normalizing Flows (NFs), deep generative models recently proposed as a promising alternative to traditional Markov Chain Monte Carlo methods in lattice field theory calculations. By combining NF layers with out-of-equilibrium stochastic updates, we obtain Stochastic Normalizing Flows (SNFs), a scalable class of machine learning algorithms that can be explained in terms of stochastic thermodynamics. In this contribution, we outline EST and SNFs, and report some numerical results for the shape of the flux tube.

[LG-69] Neural Networks Perform Sufficient Dimension Reduction

链接: https://arxiv.org/abs/2412.19033
作者: Shuntuo Xu,Zhou Yu
关键词: sufficient dimension reduction, inherently perform SDR, networks inherently perform, neural networks inherently, neural networks
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the connection between neural networks and sufficient dimension reduction (SDR), demonstrating that neural networks inherently perform SDR in regression tasks under appropriate rank regularizations. Specifically, the weights in the first layer span the central mean subspace. We establish the statistical consistency of the neural network-based estimator for the central mean subspace, underscoring the suitability of neural networks in addressing SDR-related challenges. Numerical experiments further validate our theoretical findings, and highlight the underlying capability of neural networks to facilitate SDR compared to the existing methods. Additionally, we discuss an extension to unravel the central subspace, broadening the scope of our investigation.
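The claim that the first-layer weights span the central mean subspace (under a rank constraint) suggests a simple recipe: fit a network whose first layer is narrow, then read off an orthonormal basis from those weights. The sketch below illustrates this on a synthetic single-index model; the architecture, training budget, and data are placeholder assumptions rather than the paper's estimator.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, p, d = 2000, 10, 1                               # d = assumed structural dimension
X = torch.randn(n, p)
beta = torch.zeros(p)
beta[0], beta[1] = 1.0, -1.0
y = torch.sin(X @ beta).unsqueeze(1) + 0.1 * torch.randn(n, 1)   # single-index model

# Rank regularization imposed by a narrow first layer of width d.
net = nn.Sequential(nn.Linear(p, d), nn.Tanh(),
                    nn.Linear(d, 32), nn.Tanh(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    loss = ((net(X) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

W = net[0].weight.detach()                          # (d, p) first-layer weights
basis, _ = torch.linalg.qr(W.T)                     # orthonormal basis of the estimated subspace
print(basis.squeeze())                              # should align (up to sign) with span{beta}
```

The paper's contribution is the statistical consistency of this kind of subspace estimate; the code only shows why reading the subspace off the first layer is natural.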

[LG-70] Adaptivity can help exponentially for shadow tomography

链接: https://arxiv.org/abs/2412.19022
作者: Sitan Chen,Weiyuan Gong,Zhihan Zhang
关键词: make unentangled measurements, recent years, significant interest, interest in understanding, understanding the statistical
类目: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:In recent years there has been significant interest in understanding the statistical complexity of learning from quantum data under the constraint that one can only make unentangled measurements. While a key challenge in establishing tight lower bounds in this setting is to deal with the fact that the measurements can be chosen in an adaptive fashion, a recurring theme has been that adaptivity offers little advantage over more straightforward, nonadaptive protocols. In this note, we offer a counterpoint to this. We show that for the basic task of shadow tomography, protocols that use adaptively chosen two-copy measurements can be exponentially more sample-efficient than any protocol that uses nonadaptive two-copy measurements.

[LG-71] Optimal Federated Learning for Functional Mean Estimation under Heterogeneous Privacy Constraints

链接: https://arxiv.org/abs/2412.18992
作者: Tony Cai,Abhinav Chakraborty,Lasse Vuursteen
关键词: gained significant importance, significant importance due, learning technique designed, preserve data privacy, machine learning technique
类目: Statistics Theory (math.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a distributed machine learning technique designed to preserve data privacy and security, and it has gained significant importance due to its broad range of applications. This paper addresses the problem of optimal functional mean estimation from discretely sampled data in a federated setting. We consider a heterogeneous framework where the number of individuals, measurements per individual, and privacy parameters vary across one or more servers, under both common and independent design settings. In the common design setting, the same design points are measured for each individual, whereas in the independent design, each individual has their own random collection of design points. Within this framework, we establish minimax upper and lower bounds for the estimation error of the underlying mean function, highlighting the nuanced differences between common and independent designs under distributed privacy constraints. We propose algorithms that achieve the optimal trade-off between privacy and accuracy and provide optimality results that quantify the fundamental limits of private functional mean estimation across diverse distributed settings. These results characterize the cost of privacy and offer practical insights into the potential for privacy-preserving statistical analysis in federated environments.
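
To make the setting concrete, here is a toy sketch (not the paper's minimax-optimal procedure) of federated functional mean estimation under a common design with heterogeneous privacy budgets: each server averages its individuals' discretely sampled curves on a shared grid, releases the average through a Gaussian mechanism calibrated to its own epsilon, and the center combines the releases. The numbers, clipping rule, and weighting below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 50)                         # common design points
true_mean = np.sin(2 * np.pi * grid)

def local_release(n_i, eps_i, delta=1e-5, clip=2.0):
    # n_i individuals observed with noise on the shared grid; values clipped to [-clip, clip].
    curves = np.clip(true_mean + rng.normal(0, 0.5, size=(n_i, grid.size)), -clip, clip)
    local_mean = curves.mean(axis=0)
    sens = 2 * clip * np.sqrt(grid.size) / n_i       # L2 sensitivity of the clipped mean
    sigma = sens * np.sqrt(2 * np.log(1.25 / delta)) / eps_i
    return local_mean + rng.normal(0, sigma, size=grid.size), n_i

servers = [(100, 1.0), (400, 0.5), (50, 4.0)]        # heterogeneous (n_i, eps_i) per server
releases = [local_release(n, e) for n, e in servers]
weights = np.array([n for _, n in releases], dtype=float)
estimate = sum(w * r for (r, _), w in zip(releases, weights / weights.sum()))
print("max abs error:", np.abs(estimate - true_mean).max())
```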

[LG-72] Derandomized shallow shadows: Efficient Pauli learning with bounded-depth circuits

链接: https://arxiv.org/abs/2412.18973
作者: Katherine Van Kirk,Christian Kokail,Jonathan Kunjummen,Hong-Ye Hu,Yanting Teng,Madelyn Cain,Jacob Taylor,Susanne F. Yelin,Hannes Pichler,Mikhail Lukin
关键词: quantum science tasks, estimating large numbers, Efficiently estimating large, non-commuting observables, science tasks
类目: Quantum Physics (quant-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注: 10+29 pages, 9 figures

点击查看摘要

Abstract:Efficiently estimating large numbers of non-commuting observables is an important subroutine of many quantum science tasks. We present the derandomized shallow shadows (DSS) algorithm for efficiently learning a large set of non-commuting observables, using shallow circuits to rotate into measurement bases. Exploiting tensor network techniques to ensure polynomial scaling of classical resources, our algorithm outputs a set of shallow measurement circuits that approximately minimizes the sample complexity of estimating a given set of Pauli strings. We numerically demonstrate systematic improvement, in comparison with state-of-the-art techniques, for energy estimation of quantum chemistry benchmarks and verification of quantum many-body systems, and we observe DSS’s performance consistently improves as one allows deeper measurement circuits. These results indicate that in addition to being an efficient, low-depth, stand-alone algorithm, DSS can also benefit many larger quantum algorithms requiring estimation of multiple non-commuting observables.

[LG-73] Label-free SERS Discrimination of Proline from Hydroxylated Proline at Single-molecule Level Assisted by a Deep Learning Model

链接: https://arxiv.org/abs/2412.18935
作者: Yingqi Zhao,Kuo Zhan,Pei-Lin Xin,Zuyan Chen,Shuai Li,Francesco De Angelis,Jianan Huang
关键词: evaluating therapeutic outcomes, Discriminating the low-abundance, require single-molecule sensors, low-abundance hydroxylated proline, hydroxylated proline
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注:

点击查看摘要

Abstract:Discriminating the low-abundance hydroxylated proline from proline is crucial for monitoring diseases and evaluating therapeutic outcomes, which requires single-molecule sensors. While the plasmonic nanopore sensor can detect the hydroxylation with single-molecule sensitivity by surface enhanced Raman spectroscopy (SERS), it suffers from intrinsic fluctuations of single-molecule signals as well as strong interference from citrates. Here, we used the occurrence frequency histogram of the single-molecule SERS peaks to extract overall dataset spectral features, overcome the signal fluctuations, and investigate the citrate-replaced plasmonic nanopore sensors for clean and distinguishable signals of proline and hydroxylated proline. By ligand exchange of the citrates by analyte molecules, the representative peaks of citrates decreased with incubation time, proving occupation of the plasmonic hot spot by the analytes. As a result, the discrimination of the single-molecule SERS signals of proline and hydroxylated proline was possible with a convolutional neural network model with 96.6% accuracy.
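
For readers curious what a convolutional neural network over single-molecule SERS data might look like in code, the following is an illustrative 1D CNN skeleton for classifying spectra (or peak-occurrence histograms) into proline vs. hydroxylated proline. The architecture, layer sizes, and input length are assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class SpectraCNN(nn.Module):
    def __init__(self, n_channels=1, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_classes)   # proline vs. hydroxylated proline

    def forward(self, x):                            # x: (batch, 1, n_wavenumber_bins)
        return self.classifier(self.features(x).squeeze(-1))

model = SpectraCNN()
dummy = torch.randn(8, 1, 1024)                      # 8 spectra with 1024 bins each
print(model(dummy).shape)                            # torch.Size([8, 2])
```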

[LG-74] Learning Broken Symmetries with Approximate Invariance

链接: https://arxiv.org/abs/2412.18773
作者: Seth Nabat,Aishik Ghosh,Edmund Witkowski,Gregor Kasieczka,Daniel Whiteson
关键词: Recognizing symmetries, neural network training, significant boosts, training data, Recognizing
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 7 pages, 8 figures

点击查看摘要

Abstract:Recognizing symmetries in data allows for significant boosts in neural network training, which is especially important where training data are limited. In many cases, however, the exact underlying symmetry is present only in an idealized dataset and is broken in actual data, due to asymmetries in the detector or varying response resolution as a function of particle momentum. Standard approaches, such as data augmentation or equivariant networks, fail to represent the nature of the full, broken symmetry, effectively overconstraining the response of the neural network. We propose a learning model which balances the generality and asymptotic performance of unconstrained networks with the rapid learning of constrained networks. This is achieved through a dual-subnet structure, where one network is constrained by the symmetry and the other is not, along with a learned symmetry factor. In a simplified toy example that demonstrates violation of Lorentz invariance, our model learns as rapidly as symmetry-constrained networks but escapes their performance limitations.
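
A minimal sketch of the dual-subnet idea described in the abstract: one branch sees symmetry-constrained features, the other sees raw features, and a single learned symmetry factor interpolates between their outputs. The specific feature inputs, widths, and the sigmoid parameterization of the factor are assumptions.

```python
import torch
import torch.nn as nn

class DualSubnet(nn.Module):
    def __init__(self, dim_raw, dim_sym, hidden=64):
        super().__init__()
        self.constrained = nn.Sequential(nn.Linear(dim_sym, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.unconstrained = nn.Sequential(nn.Linear(dim_raw, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.alpha_logit = nn.Parameter(torch.zeros(1))   # learned symmetry factor

    def forward(self, x_raw, x_sym):
        a = torch.sigmoid(self.alpha_logit)                # mixing weight in (0, 1)
        return a * self.constrained(x_sym) + (1 - a) * self.unconstrained(x_raw)

net = DualSubnet(dim_raw=4, dim_sym=2)
out = net(torch.randn(16, 4), torch.randn(16, 2))
print(out.shape, float(torch.sigmoid(net.alpha_logit)))    # torch.Size([16, 1]) 0.5
```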

[LG-75] BoostMD: Accelerating molecular sampling by leveraging ML force field features from previous time-steps

链接: https://arxiv.org/abs/2412.18633
作者: Lars L. Schaaf,Ilyes Batatia,Christoph Brunken,Thomas D. Barrett,Jules Tilly
关键词: Simulating atomic-scale processes, Simulating atomic-scale, atomic-scale processes, catalytic reactions, advancements in biology
类目: Chemical Physics (physics.chem-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Simulating atomic-scale processes, such as protein dynamics and catalytic reactions, is crucial for advancements in biology, chemistry, and materials science. Machine learning force fields (MLFFs) have emerged as powerful tools that achieve near quantum mechanical accuracy, with promising generalization capabilities. However, their practical use is often limited by long inference times compared to classical force fields, especially when running extensive molecular dynamics (MD) simulations required for many biological applications. In this study, we introduce BoostMD, a surrogate model architecture designed to accelerate MD simulations. BoostMD leverages node features computed at previous time steps to predict energies and forces based on positional changes. This approach reduces the complexity of the learning task, allowing BoostMD to be both smaller and significantly faster than conventional MLFFs. During simulations, the computationally intensive reference MLFF is evaluated only every N steps, while the lightweight BoostMD model handles the intermediate steps at a fraction of the computational cost. Our experiments demonstrate that BoostMD achieves an eight-fold speedup compared to the reference model and generalizes to unseen dipeptides. Furthermore, we find that BoostMD accurately samples the ground-truth Boltzmann distribution when running molecular dynamics. By combining efficient feature reuse with a streamlined architecture, BoostMD offers a robust solution for conducting large-scale, long-timescale molecular simulations, making high-accuracy ML-driven modeling more accessible and practical.
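
The "reference MLFF every N steps, surrogate in between" pattern can be written down schematically. In the sketch below, reference_mlff and boostmd_step are placeholders (toy functions, not real APIs) standing in for the expensive model that returns forces plus node features and the lightweight surrogate that reuses those cached features for intermediate steps.

```python
import numpy as np

def reference_mlff(positions):
    # Placeholder for the expensive reference MLFF: returns forces and per-atom
    # node features (here a toy harmonic force and a toy feature vector).
    return -positions, np.tanh(positions)

def boostmd_step(positions, ref_positions, node_features):
    # Placeholder for the lightweight surrogate: in BoostMD it would predict
    # energies/forces from the cached node features and the displacement
    # (positions - ref_positions); here it is a toy stand-in.
    return -positions + 0.0 * node_features

def run_md(positions, n_steps=1000, N=10, dt=1e-3):
    velocities = np.zeros_like(positions)
    for step in range(n_steps):
        if step % N == 0:                      # expensive reference call every N steps
            forces, features = reference_mlff(positions)
            ref_positions = positions.copy()
        else:                                  # cheap surrogate for intermediate steps
            forces = boostmd_step(positions, ref_positions, features)
        velocities = velocities + dt * forces  # unit masses; proper integrators omitted
        positions = positions + dt * velocities
    return positions

print(run_md(np.random.default_rng(0).normal(size=(5, 3))).shape)   # (5, 3)
```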

[LG-76] How to explain grokking

链接: https://arxiv.org/abs/2412.18624
作者: S.V. Kozyrev
关键词: gradient Langevin dynamics, stochastic gradient Langevin, Brownian motion, Langevin dynamics, delayed generalization
类目: atistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:An explanation of grokking (delayed generalization) in learning is given by modeling grokking with stochastic gradient Langevin dynamics (Brownian motion) and applying ideas from thermodynamics.
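
For concreteness, the stochastic gradient Langevin update the note builds on is a plain gradient step plus Gaussian noise whose scale is set by a temperature, so long training behaves like Brownian motion over the loss landscape. A minimal sketch (step size and temperature are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, grad, lr=1e-3, T=1e-3):
    # Gradient descent plus Brownian noise: the injected variance 2*lr*T plays the
    # role of temperature in the thermodynamic picture of delayed generalization.
    return theta - lr * grad + np.sqrt(2 * lr * T) * rng.normal(size=theta.shape)

theta = np.array([1.0, -2.0])
for _ in range(1000):
    theta = sgld_step(theta, grad=2 * theta)   # gradient of the toy loss ||theta||^2
print(theta)                                   # fluctuates around the minimum at 0
```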

信息检索

[IR-0] RecLM: Recommendation Instruction Tuning

链接: https://arxiv.org/abs/2412.19302
作者: Yangqin Jiang,Yuhao Yang,Lianghao Xia,Da Luo,Kangyi Lin,Chao Huang
关键词: deeply understand users’, understand users’ complex, Graph Neural Networks, Modern recommender systems, Modern recommender
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Modern recommender systems aim to deeply understand users’ complex preferences through their past interactions. While deep collaborative filtering approaches using Graph Neural Networks (GNNs) excel at capturing user-item relationships, their effectiveness is limited when handling sparse data or zero-shot scenarios, primarily due to constraints in ID-based embedding functions. To address these challenges, we propose a model-agnostic recommendation instruction-tuning paradigm that seamlessly integrates large language models with collaborative filtering. Our proposed Recommendation Language Model (RecLM) enhances the capture of user preference diversity through a carefully designed reinforcement learning reward function that facilitates self-augmentation of language models. Comprehensive evaluations demonstrate significant advantages of our approach across various settings, and its plug-and-play compatibility with state-of-the-art recommender systems results in notable performance enhancements.

[IR-1] Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning AAAI AAAI-25

链接: https://arxiv.org/abs/2412.19200
作者: Dengming Zhang,Weitao You,Ziheng Liu,Lingyun Sun,Pei Chen
关键词: Dynamic Music Emotion, Music Emotion Recognition, music information retrieval, Dynamic Music, music information
类目: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

点击查看摘要

Abstract:Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. The existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though everyone has their own personalized emotional perception in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short and long-term dependencies using a dual-scale attention transformer, improving the performance in traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotators. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER.

[IR-2] Towards Popularity-Aware Recommendation: A Multi-Behavior Enhanced Framework with Orthogonality Constraint

链接: https://arxiv.org/abs/2412.19172
作者: Yishan Han,Biao Xu,Yao Wang,Shanxing Gao
关键词: involves inferring latent, inferring latent user, recommendation involves inferring, generating personalized recommendations, latent user preferences
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Top-K recommendation involves inferring latent user preferences and generating personalized recommendations accordingly, which is now ubiquitous in various decision systems. Nonetheless, recommender systems usually suffer from severe popularity bias, leading to the over-recommendation of popular items. Such a bias deviates from the central aim of reflecting user preference faithfully, compromising both customer satisfaction and retailer profits. Despite the prevalence, existing methods tackling popularity bias still have limitations due to the considerable accuracy-debias tradeoff and the sensitivity to extensive parameter selection, further exacerbated by the extreme sparsity in positive user-item interactions. In this paper, we present a Popularity-aware top-K recommendation algorithm integrating multi-behavior Side Information (PopSI), aiming to enhance recommendation accuracy and debias performance simultaneously. Specifically, by leveraging multiple types of user feedback that mirror similar user preferences and formulating them as a three-dimensional tensor, PopSI can utilize all slices to capture the desired user preferences effectively. Subsequently, we introduce a novel orthogonality constraint to refine the estimated item feature space, enforcing it to be invariant to item popularity features and thereby addressing our model's sensitivity to popularity bias. Comprehensive experiments on real-world e-commerce datasets demonstrate the general improvements of PopSI over state-of-the-art debias methods with a marginal accuracy-debias tradeoff and scalability to practical applications. The source code for our algorithm and experiments is available at this https URL.
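
A hedged sketch of the orthogonality-constraint pattern described above (not PopSI's exact objective): penalize the component of the learned item embeddings that aligns with an item-popularity feature, so the estimated item feature space becomes approximately invariant to popularity. The popularity proxy and penalty form below are assumptions.

```python
import torch

n_items, dim = 1000, 32
item_emb = torch.randn(n_items, dim, requires_grad=True)       # estimated item features
counts = torch.randint(1, 500, (n_items, 1)).float()            # e.g. interaction counts
pop = torch.log1p(counts)
pop_dir = pop / pop.norm()                                      # unit popularity feature

def orthogonality_penalty(V, p):
    # ||p^T V||^2: how much of the item feature space is explained by popularity.
    # Driving this to zero pushes the embeddings to be popularity-invariant.
    return (p.T @ V).pow(2).sum()

loss = orthogonality_penalty(item_emb, pop_dir)
loss.backward()                                # would be added to the main recommendation loss
print(float(loss), item_emb.grad.shape)
```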

[IR-3] Jasper and Stella: distillation of SOTA embedding models

链接: https://arxiv.org/abs/2412.19048
作者: Dun Zhang,Fulong Wang
关键词: FAQ and RAG, Maximum Inner Product, Product Search, convert raw text, deep learning applications
类目: Information Retrieval (cs.IR)
*备注: 7 pages, 1 figures

点击查看摘要

Abstract:A crucial component of many deep learning applications (such as FAQ and RAG) is dense retrieval, in which embedding models are used to convert raw text into numerical vectors and then retrieve the most similar text via MIPS (Maximum Inner Product Search). Several text embedding benchmarks (e.g. MTEB, BEIR, and AIR-Bench) have been established to evaluate embedding models accurately. Thanks to these benchmarks, we can use SOTA models; however, the deployment and application of these models in industry are hampered by their large vector dimensions and numerous parameters. To alleviate this problem, 1) we present a distillation technique that enables a smaller student model to achieve good performance; 2) inspired by MRL, we present a training approach that reduces the vector dimensions based on the model's own vectors or its teacher's vectors; 3) we perform simple yet effective alignment training between images and text to make our model a multimodal encoder. We trained the Stella and Jasper models using the technologies above and achieved high scores on the MTEB leaderboard. We release the model and data at Hugging Face Hub (this https URL), and the training logs are at this https URL.
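
The two training ideas mentioned above can be sketched as a single loss (a hypothetical illustration, not the released Jasper/Stella training code): align student embeddings to teacher embeddings by cosine similarity, and apply the same alignment to several truncated prefixes of the vector, MRL-style, so smaller output dimensions remain usable at inference time.

```python
import torch
import torch.nn.functional as F

def matryoshka_distill_loss(student_vecs, teacher_vecs, dims=(64, 256, 1024)):
    # Align student to teacher at several prefix lengths (1 - cosine similarity),
    # so truncating the student vector to a smaller dimension still works.
    total = 0.0
    for d in dims:
        s = F.normalize(student_vecs[:, :d], dim=-1)
        t = F.normalize(teacher_vecs[:, :d], dim=-1)
        total = total + (1 - (s * t).sum(-1)).mean()
    return total / len(dims)

student = torch.randn(8, 1024, requires_grad=True)   # student embeddings for a batch of texts
teacher = torch.randn(8, 1024)                       # teacher embeddings for the same texts
loss = matryoshka_distill_loss(student, teacher)
loss.backward()
print(float(loss))
```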

[IR-4] Don't Lose Yourself: Boosting Multimodal Recommendation via Reducing Node-neighbor Discrepancy in Graph Convolutional Network ICASSP2025

链接: https://arxiv.org/abs/2412.18962
作者: Zheyu Chen,Jinfeng Xu,Haibo Hu
关键词: multimodal recommendation systems, recommendation systems, data sparsity problem, rapid expansion, expansion of multimedia
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:The rapid expansion of multimedia content has led to the emergence of multimodal recommendation systems. This direction has attracted increasing attention because fully utilizing data from different modalities alleviates the persistent data sparsity problem. As such, multimodal recommendation models can learn personalized information about nodes in terms of visual and textual modalities. To further alleviate the data sparsity problem, some previous works have introduced graph convolutional networks (GCNs) into multimodal recommendation systems to enhance the semantic representations of users and items by capturing the potential relationships between them. However, adopting GCNs inevitably introduces the over-smoothing problem, which makes nodes too similar. Unfortunately, incorporating multimodal information exacerbates this challenge, because nodes that are too similar lose the personalized information learned through multimodal information. To address this problem, we propose a novel model that retains the personalized information of ego nodes during feature aggregation by Reducing Node-neighbor Discrepancy (RedN^nD). Extensive experiments on three public datasets show that RedN^nD achieves state-of-the-art performance on accuracy and robustness, with significant improvements over existing GCN-based multimodal frameworks.
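
An illustrative aggregation step in the spirit of the abstract (not the exact RedN^nD formulation): after a LightGCN-style neighbor average, mix the ego node's own embedding back in with a retention weight, so over-smoothing does not wash out the personalized multimodal signal. The mixing rule and toy graph below are assumptions.

```python
import torch

def aggregate_with_ego(adj_norm, H, ego_weight=0.5):
    # adj_norm: (n, n) symmetrically normalized adjacency; H: (n, d) node embeddings.
    # Blend the neighbor average with the ego embedding to retain personalized signal.
    neighbor = adj_norm @ H
    return ego_weight * H + (1 - ego_weight) * neighbor

n, d = 6, 8
A = torch.rand(n, n)
A = ((A + A.T) > 1.0).float()                         # toy symmetric interaction graph
deg = A.sum(1, keepdim=True).clamp(min=1)
adj_norm = A / (deg.sqrt() @ deg.sqrt().T)            # D^{-1/2} A D^{-1/2}
H = torch.randn(n, d)
print(aggregate_with_ego(adj_norm, H).shape)          # torch.Size([6, 8])
```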

[IR-5] Musings About the Future of Search: A Return to the Past?

链接: https://arxiv.org/abs/2412.18956
作者: Jimmy Lin,Pankaj Gupta,Will Horn,Gilad Mishne
关键词: question answered, question, Abstract, effective, answered
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:When you have a question, the most effective way to have the question answered is to directly connect with experts on the topic and have a conversation with them. Prior to the invention of writing, this was the only way. Although effective, this solution exhibits scalability challenges. Writing allowed knowledge to be materialized, preserved, and replicated, enabling the development of different technologies over the centuries to connect information seekers with relevant information. This progression ultimately culminated in the ten-blue-links web search paradigm we’re familiar with, just before the recent emergence of generative AI. However, we often forget that consuming static content is an imperfect solution. With the advent of large language models, it has become possible to develop a superior experience by allowing users to directly engage with experts. These interactions can of course satisfy information needs, but expert models can do so much more. This coming future requires reimagining search.

[IR-6] Attack-in-the-Chain: Bootstrapping Large Language Models for Attacks Against Black-box Neural Ranking Models AAAI25

链接: https://arxiv.org/abs/2412.18770
作者: Yu-An Liu,Ruqing Zhang,Jiafeng Guo,Maarten de Rijke,Yixing Fan,Xueqi Cheng
关键词: Neural ranking models, Neural ranking, highly effective, effective in terms, terms of retrieval
类目: Information Retrieval (cs.IR)
*备注: Accepted by AAAI25

点击查看摘要

Abstract:Neural ranking models (NRMs) have been shown to be highly effective in terms of retrieval performance. Unfortunately, they have also displayed a higher degree of sensitivity to attacks than previous generation models. To help expose and address this lack of robustness, we introduce a novel ranking attack framework named Attack-in-the-Chain, which tracks interactions between large language models (LLMs) and NRMs based on chain-of-thought (CoT) prompting to generate adversarial examples under black-box settings. Our approach starts by identifying anchor documents with higher ranking positions than the target document as nodes in the reasoning chain. We then dynamically assign the number of perturbation words to each node and prompt LLMs to execute attacks. Finally, we verify the attack performance of all nodes at each reasoning step and proceed to generate the next reasoning step. Empirical results on two web search benchmarks show the effectiveness of our method.

[IR-7] On the Robustness of Generative Information Retrieval Models ECIR2025

链接: https://arxiv.org/abs/2412.18768
作者: Yu-An Liu,Ruqing Zhang,Jiafeng Guo,Changjiang Zhou,Maarten de Rijke,Xueqi Cheng
关键词: methods retrieve documents, OOD robustness, information retrieval methods, retrieval methods retrieve, generating their identifiers
类目: Information Retrieval (cs.IR)
*备注: Accepted by ECIR 2025. arXiv admin note: substantial text overlap with arXiv:2306.12756

点击查看摘要

Abstract:Generative information retrieval methods retrieve documents by directly generating their identifiers. Much effort has been devoted to developing effective generative IR models. Less attention has been paid to the robustness of these models. It is critical to assess the out-of-distribution (OOD) generalization of generative IR models, i.e., how would such models generalize to new distributions? To answer this question, we focus on OOD scenarios from four perspectives in retrieval problems: (i) query variations; (ii) unseen query types; (iii) unseen tasks; and (iv) corpus expansion. Based on this taxonomy, we conduct empirical studies to analyze the OOD robustness of representative generative IR models against dense retrieval models. Our empirical results indicate that the OOD robustness of generative IR models is in need of improvement. By inspecting the OOD robustness of generative IR models we aim to contribute to the development of more reliable IR models. The code is available at this https URL.

[IR-8] Position-aware Graph Transformer for Recommendation

链接: https://arxiv.org/abs/2412.18731
作者: Jiajia Chen,Jiancan Wu,Jiawei Chen,Chongming Gao,Yong Li,Xiang Wang
关键词: fundamentally involves learning, involves learning high-quality, learning high-quality user, recommendation fundamentally involves, Collaborative recommendation fundamentally
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Collaborative recommendation fundamentally involves learning high-quality user and item representations from interaction data. Recently, graph convolution networks (GCNs) have advanced the field by utilizing high-order connectivity patterns in interaction graphs, as evidenced by state-of-the-art methods like PinSage and LightGCN. However, one key limitation has not been well addressed in existing solutions: capturing long-range collaborative filtering signals, which are crucial for modeling user preference. In this work, we propose a new graph transformer (GT) framework – Position-aware Graph Transformer for Recommendation (PGTR), which combines the global modeling capability of Transformer blocks with the local neighborhood feature extraction of GCNs. The key insight is to explicitly incorporate node position and structure information from the user-item interaction graph into the GT architecture via several purpose-designed positional encodings. The long-range collaborative signals from the Transformer block are then combined linearly with the local neighborhood features from the GCN backbone to enhance node embeddings for final recommendations. Empirical studies demonstrate the effectiveness of the proposed PGTR method when implemented on various GCN-based backbones across four real-world datasets, and the robustness against interaction sparsity as well as noise.
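
A minimal sketch of the linear combination described above (not the released PGTR code): node embeddings plus positional encodings pass through a Transformer encoder layer for global mixing, and the result is combined linearly with the local GCN-backbone embeddings through a learned weight. The layer sizes and the single-sequence treatment of the node set are assumptions.

```python
import torch
import torch.nn as nn

class GlobalLocalCombiner(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.lam = nn.Parameter(torch.tensor(0.5))     # learned global/local mixing weight

    def forward(self, gcn_emb, pos_enc):
        # gcn_emb, pos_enc: (n_nodes, dim); the node set is treated as one "sequence".
        global_emb = self.encoder((gcn_emb + pos_enc).unsqueeze(0)).squeeze(0)
        return self.lam * global_emb + (1 - self.lam) * gcn_emb

model = GlobalLocalCombiner()
out = model(torch.randn(100, 64), torch.randn(100, 64))
print(out.shape)                                       # torch.Size([100, 64])
```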

[IR-9] Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants

链接: https://arxiv.org/abs/2412.18784
作者: Mequanent Argaw Muluneh,Yan-Tsung Peng,Worku Abebe Degife,Nigussie Abate Tadesse,Aknachew Mebreku Demeku,Li Su
关键词: musical styles worldwide, Orthodox Tewahedo Church, Ethiopian Orthodox Tewahedo, advancing music production, Computational music research
类目: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR); Signal Processing (eess.SP)
*备注: 6 pages

点击查看摘要

Abstract:Computational music research plays a critical role in advancing music production, distribution, and understanding across various musical styles worldwide. Despite their immense cultural and religious significance, the Ethiopian Orthodox Tewahedo Church (EOTC) chants are relatively underrepresented in computational music research. This paper contributes to this field by introducing a new dataset specifically tailored for analyzing EOTC chants, also known as Yaredawi Zema. This work provides a comprehensive overview of the creation and curation process of a 10-hour dataset comprising 369 instances, including rigorous quality assurance measures. Our dataset provides detailed word-level temporal boundaries and reading-tone annotations, along with the corresponding chanting-mode label for each audio recording. Moreover, we have also identified the chanting options associated with multiple chanting notations in the manuscript by annotating them accordingly. Our goal in making this dataset publicly available is to encourage more research and study of EOTC chants, including lyrics transcription, lyric-to-audio alignment, and music generation tasks. Such research will advance knowledge and efforts to preserve this distinctive liturgical music, a priceless cultural artifact for the Ethiopian people.

附件下载

点击下载今日全部论文列表