This post presents the latest paper list retrieved from arXiv.org on 2024-12-02. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2024-12-02)
A total of 612 new papers were published today, including:
- Natural Language Processing: 83 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 147 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 203 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 232 papers (Machine Learning, cs.LG)
Natural Language Processing
[NLP-0] T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs
[Quick Read]: This paper targets the use of multimodal large language models (MLLMs) for video understanding, and in particular the limitations of zero-shot inference and of further fine-tuning. Zero-shot inference suffers from limited generalization and a lack of temporal understanding, while fine-tuning directly on all available video data is learning-inefficient, mainly because the training data lacks instruction diversity. The paper proposes T2Vid, a data-augmentation method that synthesizes video-like samples to enrich the instruction diversity of the training corpus. With only 15% of the sample size, training matches or even surpasses using the full video dataset, and the method also improves long-video understanding without training on long-video samples. The key of the solution is using T2Vid to diversify the training data, thereby improving learning efficiency and performance.
Link: https://arxiv.org/abs/2411.19951
Authors: Shukang Yin,Chaoyou Fu,Sirui Zhao,Yunhang Shen,Chunjiang Ge,Yan Yang,Zuwei Long,Yuhan Dai,Tong Xu,Xing Sun,Ran He,Caifeng Shan,Enhong Chen
Keywords-EN: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures, 5 tables. Project page: this https URL
Abstract:The success of Multimodal Large Language Models (MLLMs) in the image domain has garnered wide attention from the research community. Drawing on previous successful experiences, researchers have recently explored extending the success to the video understanding realms. Apart from training from scratch, an efficient way is to utilize the pre-trained image-LLMs, leading to two mainstream approaches, i.e. zero-shot inference and further fine-tuning with video data. In this work, our study of these approaches harvests an effective data augmentation method. We first make a deeper inspection of the zero-shot inference way and identify two limitations, i.e. limited generalization and lack of temporal understanding capabilities. Thus, we further investigate the fine-tuning approach and find a low learning efficiency when simply using all the video data samples, which can be attributed to a lack of instruction diversity. Aiming at this issue, we develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus. Integrating these data enables a simple and efficient training scheme, which achieves performance comparable to or even superior to using full video datasets by training with just 15% the sample size. Meanwhile, we find that the proposed scheme can boost the performance of long video understanding without training with long video samples. We hope our study will spark more thinking about using MLLMs for video understanding and curation of high-quality data. The code is released at this https URL.
[NLP-1] Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
[Quick Read]: This paper addresses the problem that certain critical tokens steer Large Language Models (LLMs) onto incorrect reasoning trajectories. The key of the proposed solution, cDPO (contrastive DPO), is to identify critical tokens automatically via contrastive estimation and to apply token-level rewards to them during alignment. Concretely, a positive and a negative model are fine-tuned separately, and their generation likelihoods are compared to locate the tokens that lead to wrong outcomes; the differential likelihood then serves as an importance weight in token-level DPO, further aligning the model with the critical-token information. Experiments on the GSM8K and MATH500 benchmarks show clear gains for Llama-3 (8B and 70B) and deepseek-math (7B).
Link: https://arxiv.org/abs/2411.19943
Authors: Zicheng Lin,Tian Liang,Jiahao Xu,Xing Wang,Ruilin Luo,Chufan Shi,Siheng Li,Yujiu Yang,Zhaopeng Tu
Keywords-EN: Large Language Models, Large Language, exhibited remarkable performance, critical tokens, Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress
Abstract:Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach - cDPO - designed to automatically recognize and conduct token-level rewards for the critical tokens during the alignment process. Specifically, we develop a contrastive estimation approach to automatically identify critical tokens. It is achieved by comparing the generation likelihood of positive and negative models. To achieve this, we separately fine-tune the positive and negative models on various reasoning trajectories; consequently, they are capable of identifying critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithms to token-level DPO and utilize the differential likelihood from the aforementioned positive and negative models as important weights for token-level DPO. Experimental results on GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed approach cDPO.
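To make the contrastive-estimation idea concrete, below is a minimal sketch (not the authors' released code) of how per-token likelihood gaps between a positive and a negative fine-tuned model could be turned into token-level weights; the checkpoint paths, the softmax weighting, and the temperature are illustrative assumptions.

```python
# Sketch only: score each token of a trajectory under a positive model (fine-tuned on correct
# trajectories) and a negative model (fine-tuned on incorrect ones); a large likelihood gap
# flags a candidate "critical token". Checkpoint paths below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POS_CKPT = "path/to/positive-finetuned-model"   # hypothetical
NEG_CKPT = "path/to/negative-finetuned-model"   # hypothetical

tok = AutoTokenizer.from_pretrained(POS_CKPT)
pos = AutoModelForCausalLM.from_pretrained(POS_CKPT).eval()
neg = AutoModelForCausalLM.from_pretrained(NEG_CKPT).eval()

@torch.no_grad()
def token_logprobs(model, input_ids):
    """Log-probability of each token given its prefix under a causal LM."""
    logits = model(input_ids).logits[:, :-1, :]
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, input_ids[:, 1:, None]).squeeze(-1)   # [1, seq_len-1]

def critical_token_weights(trajectory: str, temperature: float = 1.0):
    ids = tok(trajectory, return_tensors="pt").input_ids
    gap = token_logprobs(neg, ids) - token_logprobs(pos, ids)    # "negative-like" tokens get large gaps
    weights = torch.softmax(gap / temperature, dim=-1)           # one simple way to turn gaps into weights
    tokens = tok.convert_ids_to_tokens(ids[0, 1:].tolist())
    return sorted(zip(tokens, weights[0].tolist()), key=lambda t: -t[1])

# Tokens at the top of the returned list are candidate critical tokens of an incorrect trajectory.
```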
[NLP-2] Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark
[Quick Read]: This paper reports the Second Perception Test challenge, organised to benchmark state-of-the-art video models and to measure progress since last year. The key addition is a new video question-answering benchmark, 1h-walk VQA, for hour-long video understanding. The challenge grew from six tracks to seven, covering low-level and high-level tasks with language and non-language interfaces across video, audio, and text modalities: object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, grounded video question-answering, and hour-long video question-answering, providing a comprehensive assessment of current video models.
Link: https://arxiv.org/abs/2411.19941
Authors: Joseph Heyward,João Carreira,Dima Damen,Andrew Zisserman,Viorica Pătrăucean
Keywords-EN: CVF European Conference, Perception Test benchmark, Perception Test challenge, Perception Test, CVF European
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2312.13090
Abstract:Following the successful 2023 edition, we organised the Second Perception Test challenge as a half-day workshop alongside the IEEE/CVF European Conference on Computer Vision (ECCV) 2024, with the goal of benchmarking state-of-the-art video models and measuring the progress since last year using the Perception Test benchmark. This year, the challenge had seven tracks (up from six last year) and covered low-level and high-level tasks, with language and non-language interfaces, across video, audio, and text modalities; the additional track covered hour-long video understanding and introduced a novel video QA benchmark 1h-walk VQA. Overall, the tasks in the different tracks were: object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, grounded video question-answering, and hour-long video question-answering. We summarise in this report the challenge tasks and results, and introduce in detail the novel hour-long video QA benchmark 1h-walk VQA.
[NLP-3] VLSBench: Unveiling Visual Leakage in Multimodal Safety
[Quick Read]: This paper tackles the visual safety information leakage (VSIL) problem in safety evaluation of multimodal large language models (MLLMs): in existing multimodal safety benchmarks, the risky or sensitive content of the image is often revealed in the textual query, so models can easily refuse such image-text queries from the text alone. The key contribution is VLSBench, a multimodal visual leakless safety benchmark of 2.4k image-text pairs that prevents visual safety information from leaking from the image into the textual query. Experiments show that VLSBench poses a significant challenge to both open-source and closed-source MLLMs (LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o), and demonstrate that textual alignment suffices for multimodal safety scenarios with VSIL, whereas multimodal alignment is the more effective solution when VSIL is absent.
Link: https://arxiv.org/abs/2411.19939
Authors: Xuhao Hu,Dongrui Liu,Hao Li,Xuanjing Huang,Jing Shao
Keywords-EN: large language models, Multimodal large language, Safety, multimodal safety benchmarks, multimodal safety
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter-intuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs trained with image-text pairs. To explain such a counter-intuitive phenomenon, we discover a visual safety information leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky and sensitive content in the image has been revealed in the textual query. In this way, MLLMs can easily refuse these sensitive text-image queries according to textual queries. However, image-text pairs without VSIL are common in real-world scenarios and are overlooked by existing multimodal safety benchmarks. To this end, we construct multimodal visual leakless safety benchmark (VLSBench) preventing visual safety leakage from image to textual query with 2.4k image-text pairs. Experimental results indicate that VLSBench poses a significant challenge to both open-source and close-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o. This study demonstrates that textual alignment is enough for multimodal safety scenarios with VSIL, while multimodal alignment is a more promising solution for multimodal safety scenarios without VSIL. Please see our code and data at: this http URL
[NLP-4] On Domain-Specific Post-Training for Multimodal Large Language Models
[Quick Read]: This paper studies how to adapt general multimodal large language models (MLLMs) to specific domains such as scientific fields and industrial applications. The key is a systematic domain-adaptation recipe based on post-training: (1) data synthesis, where a visual instruction synthesizer built from open-source models generates diverse visual instruction tasks from domain-specific image-text pairs, outperforming tasks produced by manual rules, GPT-4, and GPT-4V at improving domain-specific performance; (2) a single-stage training pipeline, instead of the usual two-stage one, to increase task diversity during domain-specific post-training; and (3) task evaluation, post-training MLLMs of different sources and scales in two domains, biomedicine and food, and evaluating them on a range of domain-specific tasks.
Link: https://arxiv.org/abs/2411.19930
Authors: Daixuan Cheng,Shaohan Huang,Ziyu Zhu,Xintong Zhang,Wayne Xin Zhao,Zhongzhi Luan,Bo Dai,Zhenliang Zhang
Keywords-EN: multimodal large language, Recent years, general multimodal large, large language models, years have witnessed
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Recent years have witnessed the rapid development of general multimodal large language models (MLLMs). However, adapting general MLLMs to specific domains, such as scientific fields and industrial applications, remains less explored. This paper systematically investigates domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. (1) Data Synthesis: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs. (2) Training Pipeline: While the two-stage training–initially on image-caption pairs followed by visual instruction tasks–is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. (3) Task Evaluation: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks. To support further research in MLLM domain adaptation, we will open-source our implementations.
[NLP-5] SIMS: Simulating Human-Scene Interactions with Real World Script Planning
[Quick Read]: This paper addresses the simulation of long-term human-scene interaction, in particular generating physics-based animation with detailed narratives. The key is to combine Large Language Models (LLMs) with script data extracted from videos: an LLM-based pipeline extracts scripts from real-world videos and then imitates and creates new ones, capturing complex, time-series human behaviours and interactions with the environment. A dual-aware policy balances language comprehension and scene understanding to guide character motion under contextual and spatial constraints. The authors also contribute a comprehensive planning dataset of diverse motion sequences and re-annotate clips from existing motion datasets so the policy can learn diverse skills. Experiments demonstrate effective execution of versatile tasks and good generalization to varied scenarios.
Link: https://arxiv.org/abs/2411.19921
Authors: Wenjia Wang,Liang Pan,Zhiyang Dou,Zhouyingcheng Liao,Yuke Lou,Lei Yang,Jingbo Wang,Taku Komura
Keywords-EN: Simulating long-term human-scene, Simulating long-term, long-term human-scene interaction, Large Language Models, challenging yet fascinating
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
Comments:
Abstract:Simulating long-term human-scene interaction is a challenging yet fascinating task. Previous works have not effectively addressed the generation of long-term human scene interactions with detailed narratives for physics-based animation. This paper introduces a novel framework for the planning and controlling of long-horizon physical plausible human-scene interaction. On the one hand, films and shows with stylish human locomotions or interactions with scenes are abundantly available on the internet, providing a rich source of data for script planning. On the other hand, Large Language Models (LLMs) can understand and generate logical storylines. This motivates us to marry the two by using an LLM-based pipeline to extract scripts from videos, and then employ LLMs to imitate and create new scripts, capturing complex, time-series human behaviors and interactions with environments. By leveraging this, we utilize a dual-aware policy that achieves both language comprehension and scene understanding to guide character motions within contextual and spatial constraints. To facilitate training and evaluation, we contribute a comprehensive planning dataset containing diverse motion sequences extracted from real-world videos and expand them with large language models. We also collect and re-annotate motion clips from existing kinematic datasets to enable our policy learn diverse skills. Extensive experiments demonstrate the effectiveness of our framework in versatile task execution and its generalization ability to various scenarios, showing remarkably enhanced performance compared with existing methods. Our code and data will be publicly available soon.
[NLP-6] AIDetx: a compression-based method for identification of machine-learning generated text
[Quick Read]: This paper targets the detection of machine-generated text, where traditional approaches such as deep-learning classifiers suffer from high computational cost and limited interpretability. The key is AIDetx, a classification framework based on data compression: finite-context models (FCMs) are trained separately on human-written and AI-generated text, and a new input is assigned to the class whose model achieves the higher compression ratio. AIDetx reaches F1 scores above 97% and 99% on two benchmark datasets, and compared with large language models (LLMs) it offers a more interpretable and computationally efficient solution, greatly reducing training time and hardware requirements.
Link: https://arxiv.org/abs/2411.19869
Authors: Leonardo Almeida,Pedro Rodrigues,Diogo Magalhães,Armando J. Pinho,Diogo Pratas
Keywords-EN: data compression techniques, detecting machine-generated text, paper introduces AIDetx, paper introduces, detecting machine-generated
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:This paper introduces AIDetx, a novel method for detecting machine-generated text using data compression techniques. Traditional approaches, such as deep learning classifiers, often suffer from high computational costs and limited interpretability. To address these limitations, we propose a compression-based classification framework that leverages finite-context models (FCMs). AIDetx constructs distinct compression models for human-written and AI-generated text, classifying new inputs based on which model achieves a higher compression ratio. We evaluated AIDetx on two benchmark datasets, achieving F1 scores exceeding 97% and 99%, respectively, highlighting its high accuracy. Compared to current methods, such as large language models (LLMs), AIDetx offers a more interpretable and computationally efficient solution, significantly reducing both training time and hardware requirements (e.g., no GPUs needed). The full implementation is publicly available at this https URL.
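As an illustration of the compression-based idea (a sketch under stated assumptions, not the released AIDetx implementation), the snippet below trains an order-k finite-context model per class and labels a new text by whichever model encodes it in fewer bits; the context order and smoothing constant are arbitrary choices.

```python
# Minimal finite-context-model (FCM) classifier in the spirit of AIDetx: a smaller code length
# under a class model means better compression, hence a better fit to that class.
import math
from collections import defaultdict

class FCM:
    def __init__(self, order: int = 3, alpha: float = 1.0):
        self.order, self.alpha = order, alpha
        self.counts = defaultdict(lambda: defaultdict(int))   # context -> symbol -> count
        self.alphabet = set()

    def train(self, text: str) -> None:
        self.alphabet.update(text)
        for i in range(self.order, len(text)):
            self.counts[text[i - self.order:i]][text[i]] += 1

    def bits(self, text: str) -> float:
        """Estimated code length of `text` under this model (lower = better compression)."""
        total, a = 0.0, max(len(self.alphabet), 1)
        for i in range(self.order, len(text)):
            ctx, sym = text[i - self.order:i], text[i]
            c = self.counts.get(ctx, {})
            p = (c.get(sym, 0) + self.alpha) / (sum(c.values()) + self.alpha * a)
            total += -math.log2(p)
        return total

def classify(text: str, human: FCM, machine: FCM) -> str:
    return "human" if human.bits(text) < machine.bits(text) else "machine"

# Usage: train one FCM on concatenated human-written text and one on AI-generated text,
# then call classify() on unseen inputs.
```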
[NLP-7] Reverse Thinking Makes LLMs Stronger Reasoners
[Quick Read]: This paper asks how to give large language models (LLMs) the ability to think in reverse, i.e. to reason from a solution back to the problem. The key is Reverse-Enhanced Thinking (RevThink), a framework combining data augmentation and multi-task learning objectives: the training set is augmented with structured forward and backward reasoning collected from a teacher model, and a smaller student model is trained with three objectives: generating forward reasoning, generating the backward question, and generating backward reasoning. Across datasets covering commonsense, math, and logical reasoning, RevThink substantially improves over the student model's zero-shot performance, is sample-efficient, and generalizes well to out-of-distribution data.
Link: https://arxiv.org/abs/2411.19865
Authors: Justin Chih-Yao Chen,Zifeng Wang,Hamid Palangi,Rujun Han,Sayna Ebrahimi,Long Le,Vincent Perot,Swaroop Mishra,Mohit Bansal,Chen-Yu Lee,Tomas Pfister
Keywords-EN: Reverse thinking plays, reasoning, Reverse thinking, forward reasoning, plays a crucial
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages
Abstract:Reverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning performance as it enables consistency checks between their forward and backward thinking. To enable Large Language Models (LLMs) to perform reverse thinking, we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data augmentation and learning objectives. In RevThink, we augment the dataset by collecting structured forward-backward reasoning from a teacher model, consisting of: (1) the original question, (2) forward reasoning, (3) backward question, and (4) backward reasoning. We then employ three objectives to train a smaller student model in a multi-task learning fashion: (a) generate forward reasoning from a question, (b) generate a backward question from a question, and (c) generate backward reasoning from the backward question. Experiments across 12 datasets covering commonsense, math, and logical reasoning show an average 13.53% improvement over the student model’s zero-shot performance and a 6.84% improvement over the strongest knowledge distillation baselines. Moreover, our method demonstrates sample efficiency – using only 10% of the correct forward reasoning from the training data, it outperforms a standard fine-tuning method trained on 10x more forward reasoning. RevThink also exhibits strong generalization to out-of-distribution held-out datasets.
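A small sketch of how the augmented multi-task records described above might be laid out; the `teacher` callable and the prompt wording are assumptions for illustration, not taken from the paper.

```python
# Build the three student-training objectives of RevThink-style augmentation from a teacher model.
# `teacher` is any callable that maps a prompt string to the teacher LLM's answer (placeholder).
def build_revthink_records(question: str, teacher) -> list[dict]:
    forward_reasoning = teacher(f"Solve step by step:\n{question}")
    backward_question = teacher(
        f"Rewrite this problem so that it starts from the answer and asks for an original given quantity:\n{question}")
    backward_reasoning = teacher(f"Solve step by step:\n{backward_question}")
    return [
        {"task": "forward_reasoning",  "input": question,          "target": forward_reasoning},
        {"task": "backward_question",  "input": question,          "target": backward_question},
        {"task": "backward_reasoning", "input": backward_question, "target": backward_reasoning},
    ]

# Each record becomes one example in multi-task fine-tuning of the smaller student model.
```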
[NLP-8] What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric review
[Quick Read]: This paper analyses the correlation between linguistics and artificial intelligence (AI), as most visibly manifested in deep-learning language models. The key is a scientometric analysis, carried out with CiteSpace and VOSviewer, of 5750 Web of Science-indexed articles published between 1974 and 2024, producing knowledge maps and visualizations that reveal research trends, emerging hotspots, and the development and application of deep-learning language models such as ChatGPT.
Link: https://arxiv.org/abs/2411.19858
Authors: Mohammed Q. Shormani
Keywords-EN: artificial intelligence, deep learning language, learning language models, strong correlation, linguistics and artificial
Subjects: Computation and Language (cs.CL)
Comments: 26 pages, 15 figures
Abstract:There is a strong correlation between linguistics and artificial intelligence (AI), best manifested by deep learning language models. This study provides a thorough scientometric analysis of this correlation, synthesizing the intellectual production during 51 years, from 1974 to 2024. It involves 5750 Web of Science-indexed articles published in 2124 journals, which are written by 20835 authors belonging to 13773 research centers in 794 countries. Two powerful software, viz., CiteSpace and VOSviewer, were used to generate mapping visualizations of the intellectual landscape, trending issues and (re)emerging hotspots. The results indicate that in the 1980s and 1990s, linguistics and AI research was not robust, characterized by unstable publication over time. It has, however, witnessed a remarkable increase of publication since then, reaching 1478 articles in 2023, and 546 articles in January-March timespan in 2024, involving emerging issues and hotspots, addressing new horizons, new topics, and launching new applications and powerful deep learning language models including ChatGPT.
[NLP-9] Artificial intelligence contribution to translation industry: looking back and forward
[Quick Read]: This paper provides a comprehensive analysis of artificial intelligence's contribution to the translation industry (ACTI) and its evolution from 1980 to 2024. The key is a combination of scientometric and thematic analyses of 13220 articles retrieved from WoS, Scopus, and Lens: the scientometric part examines clusters, subject categories, keywords, burstness, centrality, and research centers, while the thematic part reviews 18 purposefully selected articles for their purpose, approach, findings, and contribution to future ACTI directions. The findings show that early AI contributions were not rigorous, yielding unsatisfactory rule-based and statistical machine translation; as AI advanced, machine translation incorporated neural network algorithms and (deep) language learning models such as ChatGPT, and translation output improved considerably. Rigorous research is still needed for low-resource languages, multi-dialectal and free word order languages, and cultural and religious registers.
Link: https://arxiv.org/abs/2411.19855
Authors: Mohammed Q. Shormani
Keywords-EN: artificial intelligence, years from 1980-2024, forty-one years, translation, comprehensive analysis
Subjects: Computation and Language (cs.CL)
Comments: 20 pages, 4 figures
Abstract:This study provides a comprehensive analysis of artificial intelligence (AI) contribution to translation industry (ACTI) research, synthesizing it over forty-one years from 1980-2024. 13220 articles were retrieved from three sources, namely WoS, Scopus, and Lens. We provided two types of analysis, viz., scientometric and thematic, focusing on cluster, subject categories, keywords, burstness, centrality and research centers as for the former. For the latter, we thematically review 18 articles, selected purposefully from the articles involved, centering on purpose, approach, findings, and contribution to ACTI future directions. The findings reveal that in the past AI contribution to translation industry was not rigorous, resulting in rule-based machine translation and statistical machine translation whose output was not satisfactory. However, the more AI develops, the more machine translation develops, incorporating Neural Networking Algorithms and (Deep) Language Learning Models like ChatGPT whose translation output has developed considerably. However, much rigorous research is still needed to overcome several problems encountering translation industry, specifically concerning low-source languages, multi-dialectical and free word order languages, and cultural and religious registers.
[NLP-10] Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation
[Quick Read]: This paper addresses the limitations of existing content-moderation tools in customisation, accuracy across diverse sensitive categories, and privacy. The key is a unified dataset for social media content moderation covering six sensitive categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. Collected and annotated with consistent retrieval strategies and guidelines, it fills the gaps left by earlier work that focused mostly on toxic language and overlooked categories such as substance abuse or self-harm. The analysis shows that fine-tuning large language models (LLMs) on this dataset yields significant detection improvements over off-the-shelf open models such as LLaMA and even proprietary OpenAI models, which underperform by 10-15% overall; the shortfall is even more pronounced for popular moderation APIs, which cannot easily be tailored to specific sensitive content categories.
Link: https://arxiv.org/abs/2411.19832
Authors: Dimosthenis Antypas,Indira Sen,Carla Perez-Almendros,Jose Camacho-Collados,Francesco Barbieri
Keywords-EN: crucial for ensuring, ensuring that shared, shared and analysed, free from harmful, sensitive categories
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The detection of sensitive content in large datasets is crucial for ensuring that shared and analysed data is free from harmful material. However, current moderation tools, such as external APIs, suffer from limitations in customisation, accuracy across diverse sensitive categories, and privacy concerns. Additionally, existing datasets and open-source models focus predominantly on toxic language, leaving gaps in detecting other sensitive categories such as substance abuse or self-harm. In this paper, we put forward a unified dataset tailored for social media content moderation across six sensitive categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. By collecting and annotating data with consistent retrieval strategies and guidelines, we address the shortcomings of previous focalised research. Our analysis demonstrates that fine-tuning large language models (LLMs) on this novel dataset yields significant improvements in detection performance compared to open off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which underperform by 10-15% overall. This limitation is even more pronounced on popular moderation APIs, which cannot be easily tailored to specific sensitive content categories, among others.
[NLP-11] SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition
[Quick Read]: This paper addresses the common problem of incomplete modalities in Multimodal Emotion Recognition in Conversations (Incomplete Multimodal Emotion Recognition in Conversations, IMERC). The key is a Spectral Domain Reconstruction Graph Neural Network (SDR-GNN): it builds an utterance semantic-interaction graph with a sliding window over speaker and context relationships to model emotional dependencies, uses weighted relationship aggregation for consistent semantic feature extraction across utterances, and performs multi-frequency aggregation in the spectral domain so that both high- and low-frequency information can be used to recover missing modalities. Multi-head attention then fuses and refines the features for emotion recognition. Experiments on several real-world datasets show that the approach handles incomplete multimodal learning well and outperforms current state-of-the-art methods.
Link: https://arxiv.org/abs/2411.19822
Authors: Fangze Fu,Wei Ai,Fan Yang,Yuntao Shou,Tao Meng,Keqin Li
Keywords-EN: Multimodal Emotion Recognition, Emotion Recognition, Incomplete Multimodal, aims to classify, visual modal features
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 8 figures
Abstract:Multimodal Emotion Recognition in Conversations (MERC) aims to classify utterance emotions using textual, auditory, and visual modal features. Most existing MERC methods assume each utterance has complete modalities, overlooking the common issue of incomplete modalities in real-world scenarios. Recently, graph neural networks (GNNs) have achieved notable results in Incomplete Multimodal Emotion Recognition in Conversations (IMERC). However, traditional GNNs focus on binary relationships between nodes, limiting their ability to capture more complex, higher-order information. Moreover, repeated message passing can cause over-smoothing, reducing their capacity to preserve essential high-frequency details. To address these issues, we propose a Spectral Domain Reconstruction Graph Neural Network (SDR-GNN) for incomplete multimodal learning in conversational emotion recognition. SDR-GNN constructs an utterance semantic interaction graph using a sliding window based on both speaker and context relationships to model emotional dependencies. To capture higher-order and high-frequency information, SDR-GNN utilizes weighted relationship aggregation, ensuring consistent semantic feature extraction across utterances. Additionally, it performs multi-frequency aggregation in the spectral domain, enabling efficient recovery of incomplete modalities by extracting both high- and low-frequency information. Finally, multi-head attention is applied to fuse and optimize features for emotion recognition. Extensive experiments on various real-world datasets demonstrate that our approach is effective in incomplete multimodal learning and outperforms current state-of-the-art methods.
[NLP-12] A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning
[Quick Read]: This paper addresses two obstacles in Speech Emotion Recognition (SER): the lack of large-scale public datasets and limited generalization across data distributions. The key is a cross-corpus SER method based on supervised contrastive learning with a two-stage fine-tuning process: a self-supervised speech representation model is first fine-tuned with supervised contrastive learning on multiple speech emotion datasets, and a classifier is then fine-tuned on the target dataset. With a WavLM backbone, the model reaches 77.41% unweighted accuracy (UA) on IEMOCAP and 96.49% UA on CASIA, surpassing the previous state of the art on both datasets.
Link: https://arxiv.org/abs/2411.19803
Authors: Xiang minjie
Keywords-EN: Speech Emotion Recognition, limited generalization capability, large-scale public datasets, emotion recognition method, Speech Emotion
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Research on Speech Emotion Recognition (SER) often faces challenges such as the lack of large-scale public datasets and limited generalization capability when dealing with data from different distributions. To solve this problem, this paper proposes a cross-corpus speech emotion recognition method based on supervised contrast learning. The method employs a two-stage fine-tuning process: first, the self-supervised speech representation model is fine-tuned using supervised contrastive learning on multiple speech emotion datasets; then, the classifier is fine-tuned on the target dataset. The experimental results show that the WavLM-based model achieved unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset, outperforming the state-of-the-art results on the two datasets.
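For readers curious what "supervised contrastive learning" means operationally in stage one, here is a compact PyTorch sketch of a supervised contrastive (SupCon-style) loss over utterance embeddings; the speech encoder (e.g., WavLM) that produces the embeddings is assumed and not shown.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """embeddings: [N, D] utterance embeddings; labels: [N] emotion class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                               # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))             # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log-softmax over the other samples

    pos_counts = pos_mask.sum(1).clamp(min=1)
    per_anchor = -(log_prob * pos_mask).sum(1) / pos_counts
    return per_anchor[pos_mask.any(1)].mean()                   # anchors without positives are skipped

# Stage one would minimise this loss while fine-tuning the speech representation model;
# stage two then trains a plain classifier on the target corpus.
```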
[NLP-13] INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
[Quick Read]: This paper addresses the performance gap of multilingual large language models (LLMs) across languages, a bottleneck caused largely by the scarcity of high-quality evaluation resources outside English. The key is INCLUDE, a comprehensive knowledge- and reasoning-centric benchmark of 197,243 QA pairs drawn from local exam sources in 44 written languages. It is designed to evaluate multilingual LLMs in the actual language environments where they would be deployed, overcoming the current practice of translating English resources while ignoring regional and cultural knowledge.
Link: https://arxiv.org/abs/2411.19799
Authors: Angelika Romanou,Negar Foroutan,Anna Sotnikova,Zeming Chen,Sree Harsha Nelaturu,Shivalika Singh,Rishabh Maheshwary,Micol Altomare,Mohamed A. Haggag,Snegha A,Alfonso Amayuelas,Azril Hafizi Amirudin,Viraat Aryabumi,Danylo Boiko,Michael Chang,Jenny Chim,Gal Cohen,Aditya Kumar Dalmia,Abraham Diress,Sharad Duwal,Daniil Dzenhaliou,Daniel Fernando Erazo Florez,Fabian Farestam,Joseph Marvin Imperial,Shayekh Bin Islam,Perttu Isotalo,Maral Jabbarishiviari,Börje F. Karlsson,Eldar Khalilov,Christopher Klamm,Fajri Koto,Dominik Krzemiński,Gabriel Adriano de Melo,Syrielle Montariol,Yiyang Nan,Joel Niklaus,Jekaterina Novikova,Johan Samir Obando Ceron,Debjit Paul,Esther Ploeger,Jebish Purbey,Swati Rajwal,Selvan Sunitha Ravi,Sara Rydell,Roshan Santhosh,Drishti Sharma,Marjana Prifti Skenduli,Arshia Soltani Moakhar,Bardia Soltani Moakhar,Ran Tamir,Ayush Kumar Tarun,Azmine Toushik Wasi,Thenuka Ovin Weerasinghe,Serhan Yilmaz,Mike Zhang,Imanol Schlag,Marzieh Fadaee,Sara Hooker,Antoine Bosselut
Keywords-EN: large language models, inhibiting the potential, differential of large, hinders their effective, effective deployment
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.
[NLP-14] Voice Communication Analysis in Esports
[Quick Read]: This paper looks at the efficiency and synergy of voice communication in team-based esports, specifically how improving voice communication can raise team performance in official matches. The key is applying large language models (LLMs) and natural language processing (NLP) techniques to better understand and improve the effectiveness of voice communication. The study is carried out through the prism of League of Legends esports, but the main concepts and methods are easily applicable to other team-based esports.
Link: https://arxiv.org/abs/2411.19793
Authors: Aymeric Vinot,Nicolas Perez
Keywords-EN: Large Language Models, Natural Language Processing, efficiency and synergy, effective voice communication, team effective voice
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 17 pages, 11 figures. Independent research
Abstract:In most team-based esports, voice communications are prominent in the team efficiency and synergy. In fact it has been observed that not only the skill aspect of the team but also the team effective voice communication comes into play when trying to have good performance in official matches. With the recent emergence of LLM (Large Language Models) tools regarding NLP (Natural Language Processing) (Vaswani et. al.), we decided to try applying them in order to have a better understanding on how to improve the effectiveness of the voice communications. In this paper the study has been made through the prism of League of Legends esport. However the main concepts and ideas can be easily applicable in any other team related esports.
[NLP-15] MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks
[Quick Read]: This paper addresses the fact that existing text-and-motion methods mostly generate motion from textual descriptions and neglect the reciprocal task of generating text from motion. The key is MoTe, a unified multi-modal model that learns the marginal, conditional, and joint distributions of motion and text simultaneously, so that paired text-motion generation, motion captioning, and text-driven motion generation can all be handled by simply changing the input context. MoTe consists of a Motion Encoder-Decoder (MED), a Text Encoder-Decoder (TED), and a Motion-Text Diffusion Model (MTDM): MED and TED extract latent embeddings and reconstruct motion sequences and textual descriptions from them, while MTDM performs iterative denoising on the input context to serve the different tasks. On benchmark datasets, MoTe achieves superior text-to-motion generation and competitive motion captioning.
Link: https://arxiv.org/abs/2411.19786
Authors: Yiming Wu,Wei Ji,Kecheng Zheng,Zicheng Wang,Dong Xu
Keywords-EN: experienced great improvement, great improvement due, inspiring generative models, large language model, human motion analysis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Five figures, six tables
Abstract:Recently, human motion analysis has experienced great improvement due to inspiring generative models such as the denoising diffusion model and large language model. However, the existing approaches mainly focus on generating motions with textual descriptions and overlook the reciprocal task. In this paper, we present MoTe, a unified multi-modal model that could handle diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe enables us to handle the paired text-motion generation, motion captioning, and text-driven motion generation by simply modifying the input context. Specifically, MoTe is composed of three components: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and Motion-Text Diffusion Model (MTDM). In particular, MED and TED are trained for extracting latent embeddings, and subsequently reconstructing the motion sequences and textual descriptions from the extracted embeddings, respectively. MTDM, on the other hand, performs an iterative denoising process on the input context to handle diverse tasks. Experimental results on the benchmark datasets demonstrate the superior performance of our proposed method on text-to-motion generation and competitive performance on motion captioning.
[NLP-16] PerLA: Perceptive 3D Language Assistant
[Quick Read]: This paper studies how Large Language Models (LLMs) can understand point clouds of the 3D physical world while balancing local detail against global context; current methods downsample the scene or split it into parts, risking the loss of either key local details or global contextual information. The key is PerLA, a 3D language assistant that captures high-resolution local details from different point-cloud areas in parallel and integrates them with global context obtained from a lower-resolution whole point cloud. A novel algorithm preserves point-cloud locality via the Hilbert curve and effectively aggregates local-to-global information with cross-attention and a graph neural network, and a new loss for local representation consensus improves training stability. Experiments show PerLA outperforms state-of-the-art 3D language assistants on ScanQA, ScanRefer, and Nr3D.
Link: https://arxiv.org/abs/2411.19774
Authors: Guofeng Mei,Wei Lin,Luigi Riz,Yujiao Wu,Fabio Poiesi,Yiming Wang
Keywords-EN: Enabling Large Language, Large Language Models, Enabling Large, challenging research direction, Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Enabling Large Language Models (LLMs) to understand the 3D physical world is an emerging yet challenging research direction. Current strategies for processing point clouds typically downsample the scene or divide it into smaller parts for separate analysis. However, both approaches risk losing key local details or global contextual information. In this paper, we introduce PerLA, a 3D language assistant designed to be more perceptive to both details and context, making visual representations more informative for the LLM. PerLA captures high-resolution (local) details in parallel from different point cloud areas and integrates them with (global) context obtained from a lower-resolution whole point cloud. We present a novel algorithm that preserves point cloud locality through the Hilbert curve and effectively aggregates local-to-global information via cross-attention and a graph neural network. Lastly, we introduce a novel loss for local representation consensus to promote training stability. PerLA outperforms state-of-the-art 3D language assistants, with gains of up to +1.34 CiDEr on ScanQA for question answering, and +4.22 on ScanRefer and +3.88 on Nr3D for dense captioning. this https URL
[NLP-17] LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
[Quick Read]: This paper addresses the lack of fine-grained event annotations for multi-modal video understanding, where real-world videos combine vision, audio, and speech. The key is an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. With it, the authors build LongVALE, the first Vision-Audio-Language Event understanding benchmark, containing 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions across 8.4K high-quality long videos. They further build a baseline that uses LongVALE to give video large language models (LLMs) omni-modality fine-grained temporal video understanding for the first time.
Link: https://arxiv.org/abs/2411.19772
Authors: Tiantian Geng,Jinrui Zhang,Qingni Wang,Teng Wang,Jinming Duan,Feng Zheng
Keywords-EN: efforts remain limited, visual-only video tasks, impressive advancements, efforts remain, remain limited
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 18 pages, 15 figures
Abstract:Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
[NLP-18] Noro: A Noise-Robust One-shot Voice Conversion System with Hidden Speaker Representation Capabilities
[Quick Read]: This paper addresses the degradation of one-shot voice conversion (VC) in real-world use, where the target reference speech often contains disturbances such as background noise. The key is Noro, a noise-robust one-shot VC system with components designed for noisy references, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. Experiments show Noro outperforms the baseline in both clean and noisy scenarios, demonstrating its value for real-world applications. The authors also repurpose the baseline's reference encoder as a speaker encoder and find it competitive with advanced self-supervised learning models for speaker representation under the SUPERB settings.
Link: https://arxiv.org/abs/2411.19770
Authors: Haorui He,Yuchen Song,Yuancheng Wang,Haoyang Li,Xueyao Zhang,Li Wang,Gongping Huang,Eng Siong Chng,Zhizheng Wu
Keywords-EN: original source speech, One-shot voice conversion, single reference speech, source speech, Noise Robust One-shot
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Submitted to IEEE OJSP
Abstract:One-shot voice conversion (VC) aims to alter the timbre of speech from a source speaker to match that of a target speaker using just a single reference speech from the target, while preserving the semantic content of the original source speech. Despite advancements in one-shot VC, its effectiveness decreases in real-world scenarios where reference speeches, often sourced from the internet, contain various disturbances like background noise. To address this issue, we introduce Noro, a Noise Robust One-shot VC system. Noro features innovative components tailored for VC using noisy reference speeches, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. Experimental results demonstrate that Noro outperforms our baseline system in both clean and noisy scenarios, highlighting its efficacy for real-world applications. Additionally, we investigate the hidden speaker representation capabilities of our baseline system by repurposing its reference encoder as a speaker encoder. The results show that it is competitive with several advanced self-supervised learning models for speaker representation under the SUPERB settings, highlighting the potential for advancing speaker representation learning through the one-shot VC task.
[NLP-19] A Deep Learning Approach to Language-independent Gender Prediction on Twitter
[Quick Read]: This paper predicts the gender of Twitter users from language-independent features extracted from their tweets. The key is to build models with logistic regression (LR) and feed-forward neural networks (FFNN) under two settings: Inter-Lingual (IL), where training and testing use the same language, and Cross-Lingual (CL), where the Italian and German datasets are held out for testing and the remaining languages are combined for training and development. The results show that neural models underperform traditional ones when the training set is small but clearly beat them given large enough data, and the feature analysis confirms that men and women differ in writing style independently of language.
Link: https://arxiv.org/abs/2411.19733
Authors: Reyhaneh Hashempour,Barbara Plank,Aline Villavicencio,Renato Cordeiro de Amorim
Keywords-EN: Twitter users based, gender of Twitter, language-independent features extracted, Twitter users, work presents
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This work presents a set of experiments conducted to predict the gender of Twitter users based on language-independent features extracted from the text of the users’ tweets. The experiments were performed on a version of TwiSty dataset including tweets written by the users of six different languages: Portuguese, French, Dutch, English, German, and Italian. Logistic regression (LR), and feed-forward neural networks (FFNN) with back-propagation were used to build models in two different settings: Inter-Lingual (IL) and Cross-Lingual (CL). In the IL setting, the training and testing were performed on the same language whereas in the CL, Italian and German datasets were set aside and only used as test sets and the rest were combined to compose training and development sets. In the IL, the highest accuracy score belongs to LR whereas in the CL, FFNN with three hidden layers yields the highest score. The results show that neural network based models underperform traditional models when the size of the training set is small; however, they beat traditional models by a non-trivial margin, when they are fed with large enough data. Finally, the feature analysis confirms that men and women have different writing styles independent of their language.
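To illustrate what a pipeline of this kind can look like, here is a tiny scikit-learn sketch with both an LR and a three-hidden-layer FFNN; the concrete feature list below is only a guess at the sort of language-independent surface features meant, not the paper's feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def features(tweet: str) -> list[float]:
    """Surface features that do not depend on the vocabulary of any particular language."""
    n = max(len(tweet), 1)
    return [
        float(len(tweet)),                              # tweet length
        sum(c.isupper() for c in tweet) / n,            # capitalisation rate
        sum(c in "!?." for c in tweet) / n,             # punctuation rate
        sum(c.isdigit() for c in tweet) / n,            # digit rate
        float(tweet.count("@")),                        # mentions
        float(tweet.count("#")),                        # hashtags
        sum(not c.isascii() for c in tweet) / n,        # emoji / non-ASCII rate
    ]

# toy data; the real experiments use TwiSty tweets with gold gender labels
X = np.array([features(t) for t in ["Great match today!!! #sports", "reading a book at home 😊"]])
y = np.array([0, 1])

lr = LogisticRegression().fit(X, y)                                              # LR baseline
ffnn = MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=500).fit(X, y)    # FFNN, 3 hidden layers
```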
[NLP-20] Towards Santali Linguistic Inclusion: Building the First Santali-to-English Translation Model using mT5 Transformer and Data Augmentation
[Quick Read]: This paper addresses the absence of translation models for Santali and examines the feasibility of building one in a low-resource setting. The key is transfer learning: the pretrained mT5 transformer, trained on large amounts of English data, outperforms untrained transformers on a Santali-English parallel corpus and also beats the Santali-Bangla setup, and data augmentation further improves the model.
Link: https://arxiv.org/abs/2411.19726
Authors: Syed Mohammed Mostaque Billah,Ateya Ahmed Subarna,Sudipta Nandi Sarna,Ahmad Shawkat Wasit,Anika Fariha,Asif Sushmit,Arig Yousuf Sadeque
Keywords-EN: Nepal speak Santali, individuals in India, Nepal speak, Austroasiatic language, Austroasiatic language family
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Around seven million individuals in India, Bangladesh, Bhutan, and Nepal speak Santali, positioning it as nearly the third most commonly used Austroasiatic language. Despite its prominence among the Austroasiatic language family’s Munda subfamily, Santali lacks global recognition. Currently, no translation models exist for the Santali language. Our paper aims to include Santali in the NLP spectrum. We aim to examine the feasibility of building Santali translation models based on available Santali corpora. The paper successfully addressed the low-resource problem and, with promising results, examined the possibility of creating a functional Santali machine translation model in a low-resource setup. Our study shows that the Santali-English parallel corpus performs better in pretrained transformers like mT5 as opposed to untrained transformers, proving that transfer learning can be a viable technique that works with the Santali language. Moreover, with the mT5 transformer, Santali-English performs better than the Santali-Bangla parallel corpus, as mT5 has been trained on far more English data than Bangla data. Lastly, our study shows that with data augmentation, our model performs better.
[NLP-21] TakeLab Retriever: AI-Driven Search Engine for Articles from Croatian News Outlets
[Quick Read]: This paper asks how to efficiently discover, collect, and semantically analyse news articles from Croatian news outlets in order to reveal trends, patterns, and correlations that general-purpose search engines cannot provide. The key is TakeLab Retriever, an AI-driven search engine that applies state-of-the-art natural language processing (NLP) methods and lets users filter articles by named entities, phrases, and topics through a web application. The report also details the engine's design, addressing the software-engineering challenges of building a microservice-based semantic search engine capable of handling over ten million news articles.
Link: https://arxiv.org/abs/2411.19718
Authors: David Dukić,Marin Petričević,Sven Ćurković,Jan Šnajder
Keywords-EN: AI-driven search engine, search engine designed, designed to discover, TakeLab Retriever, semantically analyze
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:TakeLab Retriever is an AI-driven search engine designed to discover, collect, and semantically analyze news articles from Croatian news outlets. It offers a unique perspective on the history and current landscape of Croatian online news media, making it an essential tool for researchers seeking to uncover trends, patterns, and correlations that general-purpose search engines cannot provide. TakeLab retriever utilizes cutting-edge natural language processing (NLP) methods, enabling users to sift through articles using named entities, phrases, and topics through the web application. This technical report is divided into two parts: the first explains how TakeLab Retriever is utilized, while the second provides a detailed account of its design. In the second part, we also address the software engineering challenges involved and propose solutions for developing a microservice-based semantic search engine capable of handling over ten million news articles published over the past two decades.
[NLP-22] MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating Multi-Insight Multi-Document Extraction Tasks
[Quick Read]: This paper addresses the difficulty of evaluating large language models (LLMs) on complex, real-world applications. The key is defining and developing an evaluation framework for Multi-Insight Multi-Document Extraction (MIMDE) tasks, which involve extracting an optimal set of insights from a document corpus and mapping them back to their source documents. The authors introduce a complementary set of human and synthetic datasets to examine the potential of synthetic data for LLM evaluation and benchmark 20 state-of-the-art LLMs on both. The analysis finds a strong correlation (0.71) between LLMs' insight-extraction ability on the two datasets, but the synthetic data fails to capture the complexity of document-level analysis. These findings offer guidance on both the potential and the limits of synthetic data for evaluating text analysis systems.
Link: https://arxiv.org/abs/2411.19689
Authors: John Francis,Saba Esnaashari,Anton Poletaev,Sukankana Chakraborty,Youmna Hashem,Jonathan Bright
Keywords-EN: Large language models, demonstrated remarkable capabilities, Large language, real-world applications remains, applications remains challenging
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in text analysis tasks, yet their evaluation on complex, real-world applications remains challenging. We define a set of tasks, Multi-Insight Multi-Document Extraction (MIMDE) tasks, which involve extracting an optimal set of insights from a document corpus and mapping these insights back to their source documents. This task is fundamental to many practical applications, from analyzing survey responses to processing medical records, where identifying and tracing key insights across documents is crucial. We develop an evaluation framework for MIMDE and introduce a novel set of complementary human and synthetic datasets to examine the potential of synthetic data for LLM evaluation. After establishing optimal metrics for comparing extracted insights, we benchmark 20 state-of-the-art LLMs on both datasets. Our analysis reveals a strong correlation (0.71) between the ability of LLMs to extract insights on our two datasets, but synthetic data fails to capture the complexity of document-level analysis. These findings offer crucial guidance for the use of synthetic data in evaluating text analysis systems, highlighting both its potential and limitations.
[NLP-23] ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information
[Quick Read]: This paper addresses the fact that existing coarse-grained text datasets no longer satisfy the growing demands for domain-specific capability and safety in large language model (LLM) pre-training. The key is MDFG-tool, a tool-chain for building large-scale, high-quality Chinese datasets with multi-dimensional and fine-grained information: manually crafted rules first discard explicit noisy text; then a quality evaluation model, a domain classifier, and a toxicity evaluation model assess the cleaned data; finally, the resulting fine-grained information (quality score, domain labels, toxicity label, and toxicity score) is attached to every text. With this approach the authors release ChineseWebText 2.0, a 3.8TB corpus in which every text carries fine-grained information, allowing LLM researchers to select data by whichever fine-grained criterion they need.
Link: https://arxiv.org/abs/2411.19668
Authors: Wanyue Zhang,Ziyong Li,Wen Yang,Chunlin Leng,Yinan Bai,Qianlong Du,Chengqing Zong,Jiajun Zhang
Keywords-EN: shaping LLMs’ capabilities, large language models, pre-training data play, development of large, large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ChineseWebText2.0 dataset is available at this https URL
Abstract:During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs’ capabilities. In recent years several large-scale and high-quality pre-training datasets have been released to accelerate the research of LLMs, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has increasingly shifted to domain-specific capabilities and safety concerns, making those previous coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we employ manually crafted rules to discard explicit noisy texts from raw contents. Second, the quality evaluation model, domain classifier, and toxicity evaluation model are well-designed to assess the remaining cleaned data respectively. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release the largest, high-quality and fine-grained Chinese text ChineseWebText2.0, which consists of 3.8TB and each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, facilitating the LLM researchers to select data based on various types of fine-grained information. The data, codes and the tool-chain are available on this website this https URL
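As a sketch of what the released per-text annotations could look like when consumed downstream (the field names and the placeholder scorers below are illustrative assumptions, not the MDFG-tool code):

```python
import json

def quality_score(text: str) -> float:
    return min(len(set(text)) / 80.0, 1.0)               # placeholder standing in for the quality model

def domain_labels(text: str) -> list[str]:
    return ["news"] if "报道" in text else ["general"]    # placeholder standing in for the domain classifier

def toxicity(text: str) -> tuple[int, float]:
    return 0, 0.01                                        # placeholder standing in for the toxicity model

def tag_record(text: str) -> dict:
    tox_label, tox_score = toxicity(text)
    return {"text": text, "quality_score": quality_score(text), "domain": domain_labels(text),
            "toxicity_label": tox_label, "toxicity_score": tox_score}

with open("sample.jsonl", "w", encoding="utf-8") as f:
    for text in ["据报道,今天发布了一个新的中文预训练语料库。", "一段普通的网页文本。"]:
        f.write(json.dumps(tag_record(text), ensure_ascii=False) + "\n")

# Downstream LLM training can then filter records, e.g. quality_score > 0.9 and toxicity_label == 0.
```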
[NLP-24] Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
[Quick Read]: This paper tackles hallucination in natural language generation, i.e. LLM outputs that are not grounded in factual information. The key is LLM-Oasis, to the authors' knowledge the largest resource for training end-to-end factuality evaluators: claims are extracted from Wikipedia, a subset of them is falsified, and pairs of factual and unfactual texts are generated, with human annotators validating dataset quality and creating a gold-standard benchmark test set. Experiments show LLM-Oasis is challenging for state-of-the-art LLMs, with GPT-4o reaching only up to 60% accuracy on the proposed end-to-end factuality evaluation task, underscoring the resource's potential to drive future research in the field.
Link: https://arxiv.org/abs/2411.19655
Authors: Alessandro Scirè,Andrei Stefan Bejgu,Simone Tedeschi,Karim Ghonim,Federico Martelli,Roberto Navigli
Keywords-EN: Large Language Models, Natural Language Generation, including Text Summarization, Machine Translation, Summarization and Machine
Subjects: Computation and Language (cs.CL)
Comments: 15 pages. To be submitted to CL journal
Abstract:After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task, highlighting its potential to drive future research in the field.
[NLP-25] CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
[Quick Read]: This paper addresses the low task success rates of existing large Vision-Language-Action (VLA) models for robotic manipulation. The key is a new VLA architecture that, instead of directly repurposing a pretrained Vision-Language-Model (VLM) for action prediction via simple action quantization, adopts a componentized design with a specialized action module conditioned on the VLM output. Using diffusion action transformers for action-sequence modeling brings strong performance gains and favorable scaling behavior. Evaluations on five robot embodiments in simulation and on real robots show the model significantly surpasses existing VLAs in task success and adapts well to new robots and to unseen objects and backgrounds.
Link: https://arxiv.org/abs/2411.19650
Authors: Qixiu Li,Yaobo Liang,Zeyu Wang,Lin Luo,Xi Chen,Mozheng Liao,Fangyun Wei,Yu Deng,Sicheng Xu,Yizhong Zhang,Xiaofan Wang,Bei Liu,Jianlong Fu,Jianmin Bao,Dong Chen,Yuanchun Shi,Jiaolong Yang,Baining Guo
Keywords-EN: improved robotic manipulation, language-guided task execution, significantly improved robotic, improved robotic, robotic manipulation
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project Webpage: this https URL
Abstract:The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA which has similar model size (7B) with ours by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rates in simulation. Code and models can be found on our project page (this https URL).
[NLP-26] LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
[Quick Read]: This paper addresses multilingual news topic classification without manual data annotation, using large language models (LLMs) to develop classifiers of reasonable size. The key is a teacher-student framework: a Generative Pretrained Transformer (GPT) model serves as the teacher and automatically annotates news articles in Slovenian, Croatian, Greek, and Catalan with IPTC Media Topic labels, showing high zero-shot performance and agreement with human annotators comparable to inter-annotator agreement. Smaller BERT-like student models are then fine-tuned on the GPT-annotated data to meet the computational demands of processing millions of texts daily, reaching performance comparable to the teacher. The authors also study the effect of training-data size and the students' monolingual, multilingual, and zero-shot cross-lingual abilities, and release the best-performing classifier, which supports multilingual classification with the top-level categories of the IPTC Media Topic schema.
Link: https://arxiv.org/abs/2411.19638
Authors: Taja Kuzman,Nikola Ljubešić
Keywords-EN: enhancing readers’ access, IPTC Media Topic, Generative Pretrained Transformer, stories available online, relevant content
Subjects: Computation and Language (cs.CL)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers’ access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop an IPTC Media Topic training dataset through automatic annotation of news articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits a high zero-shot performance on all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.
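A condensed sketch of the teacher-student loop described above, under the assumption of a `teacher_label` helper that wraps the GPT teacher (it returns a dummy label here) and a generic multilingual encoder as the student; none of the names below are the paper's.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

TOPICS = ["politics", "sport", "economy, business and finance"]   # subset of IPTC top-level labels

def teacher_label(article: str) -> int:
    """Placeholder for the GPT teacher: a real pipeline prompts the LLM with the IPTC label
    set and parses its answer. A dummy label is returned so the sketch runs end to end."""
    return 0

class SilverDataset(torch.utils.data.Dataset):
    """Articles paired with teacher-generated (silver) labels."""
    def __init__(self, articles, tokenizer):
        self.enc = tokenizer(articles, truncation=True, padding=True)
        self.labels = [teacher_label(a) for a in articles]
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def train_student(articles):
    tok = AutoTokenizer.from_pretrained("xlm-roberta-base")      # generic multilingual student
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=len(TOPICS))
    args = TrainingArguments(output_dir="student", num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=SilverDataset(articles, tok)).train()
    return model
```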
zh
[NLP-27] Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 中视觉标记 (visual tokens) 过度使用导致的冗余和计算成本过高的问题。解决方案的关键在于提出了一种名为动态视觉标记退出 (Dynamic Visual-Token Exit, DyVTE) 的方法。DyVTE 通过使用轻量级超网络 (hyper-networks) 感知文本标记状态,并在特定层后决定移除所有视觉标记,从而有效解决了视觉标记的冗余问题,显著提升了 MLLMs 的效率。
链接: https://arxiv.org/abs/2411.19628
作者: Qiong Wu,Wenhao Lin,Weihao Ye,Yiyi Zhou,Xiaoshuai Sun,Rongrong Ji
关键词-EN: Large Language Models, Multimoal Large Language, existing Multimoal Large, Language Models, Multimoal Large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes to play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate DyVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a range of benchmarks. The experiment results not only show the effectiveness of DyVTE in improving MLLMs' efficiency, but also yield the general modeling patterns of MLLMs, facilitating an in-depth understanding of MLLMs. Our code is anonymously released at this https URL.
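To make the exit mechanism concrete, the sketch below shows one way a lightweight gate over the text-token states could trigger the removal of all visual tokens from a given layer onward. The gate architecture, pooling, and threshold are assumptions for illustration; the paper's hyper-network may differ.

```python
import torch
import torch.nn as nn

class VisualTokenExit(nn.Module):
    """Toy stand-in for DyVTE's lightweight hyper-network (architecture assumed, not from the paper)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(hidden_size, hidden_size // 4), nn.GELU(),
                                    nn.Linear(hidden_size // 4, 1))

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, num_text_tokens, hidden) -> one exit probability per sample
        pooled = text_hidden.mean(dim=1)
        return torch.sigmoid(self.scorer(pooled)).squeeze(-1)

def maybe_drop_visual_tokens(hidden, visual_mask, exit_prob, threshold=0.5):
    """Remove visual positions once the gate fires. hidden: (B, T, H); visual_mask: (B, T) bool."""
    if bool((exit_prob > threshold).all()):
        keep = ~visual_mask[0]            # assumes the same token layout across the batch
        return hidden[:, keep, :]         # all subsequent layers only see text tokens
    return hidden
```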
zh
[NLP-28] Can Large Language Models Reason about the Region Connection Calculus?
【速读】: 该论文试图解决的问题是评估大型语言模型(Large Language Models, LLMs)在经典定性空间推理任务中的表现,特别是基于RCC-8(Region Connection Calculus-8)的推理能力。解决方案的关键在于通过三组实验(组合表重建、与人类组合偏好对齐、概念邻域重建)来测试LLMs在处理同名关系和匿名关系时的表现,以确定LLMs是否依赖于训练过程中获得的关系名称知识。每组实验重复30次,以测量LLMs的随机性。
链接: https://arxiv.org/abs/2411.19589
作者: Anthony G Cohn,Robert E Blackwell
关键词-EN: Geographical Information Systems, Computer Vision, Geographical Information, Information Systems, Systems to Robotics
类目: Computation and Language (cs.CL)
备注: 13 pages. arXiv admin note: text overlap with arXiv:2309.15577
点击查看摘要
Abstract:Qualitative Spatial Reasoning is a well explored area of Knowledge Representation and Reasoning and has multiple applications ranging from Geographical Information Systems to Robotics and Computer Vision. Recently, many claims have been made for the reasoning capabilities of Large Language Models (LLMs). Here, we investigate the extent to which a set of representative LLMs can perform classical qualitative spatial reasoning tasks on the mereotopological Region Connection Calculus, RCC-8. We conduct three pairs of experiments (reconstruction of composition tables, alignment to human composition preferences, conceptual neighbourhood reconstruction) using state-of-the-art LLMs; in each pair one experiment uses eponymous relations and one, anonymous relations (to test the extent to which the LLM relies on knowledge about the relation names obtained during training). All instances are repeated 30 times to measure the stochasticity of the LLMs.
zh
[NLP-29] In-Context Learning with Noisy Labels
【速读】: 该论文试图解决在上下文学习(in-context learning)中,由于任务演示(task demonstrations)中存在不可避免的噪声标签(noisy labels)而导致性能下降的问题。解决方案的关键在于提出了一种新的任务——带有噪声标签的上下文学习(in-context learning with noisy labels),并基于噪声标签学习(learning with noisy labels)的研究,设计了一种新的方法和基线方法来应对这一挑战。实验结果表明,所提出的方法能够有效防止因噪声标签导致的性能下降。
链接: https://arxiv.org/abs/2411.19581
作者: Junyong Kang,Donghyun Son,Hwanjun Song,Buru Chang
关键词-EN: large language models, In-context learning, language models, additional training, noisy labels
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In-context learning refers to the emerging ability of large language models (LLMs) to perform a target task without additional training, utilizing demonstrations of the task. Recent studies aim to enhance in-context learning performance by selecting more useful demonstrations. However, they overlook the presence of inevitable noisy labels in task demonstrations that arise during the labeling process in the real-world. In this paper, we propose a new task, in-context learning with noisy labels, which aims to solve real-world problems for in-context learning where labels in task demonstrations would be corrupted. Moreover, we propose a new method and baseline methods for the new task, inspired by studies in learning with noisy labels. Through experiments, we demonstrate that our proposed method can serve as a safeguard against performance degradation in in-context learning caused by noisy labels.
zh
[NLP-30] ICPR 2024 Competition on Multilingual Claim-Span Identification ICPR2024
【速读】: 该论文试图解决社交媒体帖子中“声明范围识别”(Claim Span Identification)的问题,即在给定的文本中自动识别出哪些部分是声明。解决方案的关键在于采用先进的模式识别(Pattern Recognition)、自然语言处理(Natural Language Processing)和机器学习(Machine Learning)技术,以应对这一任务比传统的二分类(将文本分为声明或非声明)更为复杂的挑战。论文中提到的解决方案基于一个新开发的数据集HECSI,该数据集包含约8000条英文和8000条印地语的帖子,且声明部分已由人工标注,为参赛团队提供了训练和评估模型的基础。
链接: https://arxiv.org/abs/2411.19579
作者: Soham Poddar,Biswajit Paul,Moumita Basu,Saptarshi Ghosh
关键词-EN: social media posts, misinformation or fake, media posts, social media, Natural Language Processing
类目: Computation and Language (cs.CL)
备注: To appear at ICPR 2024
点击查看摘要
Abstract:A lot of claims are made in social media posts, which may contain misinformation or fake news. Hence, it is crucial to identify claims as a first step towards claim verification. Given the huge number of social media posts, the task of identifying claims needs to be automated. This competition deals with the task of ‘Claim Span Identification’ in which, given a text, parts / spans that correspond to claims are to be identified. This task is more challenging than the traditional binary classification of text into claim or not-claim, and requires state-of-the-art methods in Pattern Recognition, Natural Language Processing and Machine Learning. For this competition, we used a newly developed dataset called HECSI containing about 8K posts in English and about 8K posts in Hindi with claim-spans marked by human annotators. This paper gives an overview of the competition, and the solutions developed by the participating teams.
zh
[NLP-31] KV Shifting Attention Enhances Language Modeling
【速读】: 该论文试图解决当前大型语言模型中基于解码器结构Transformer的上下文学习(In-context Learning, ICL)能力依赖于深度和宽度较大的归纳头(induction heads)机制的问题。解决方案的关键在于提出了一种新的KV偏移注意力(KV shifting attention)机制,通过理论证明和实验验证,该机制能够有效降低模型对归纳头机制深度和宽度的要求,从而在保持或提升模型性能的同时,加速模型的收敛速度,尤其是在参数规模超过100亿的大型预训练模型中表现尤为显著。
链接: https://arxiv.org/abs/2411.19574
作者: Mingyu Xu,Wei Cheng,Bingning Wang,Weipeng Chen
关键词-EN: decode-only structure transformers, induction heads mechanism, current large language, great in-context learning, induction heads
类目: Computation and Language (cs.CL)
备注: 22 pages
点击查看摘要
Abstract:The current large language models are mainly based on decoder-only transformers, which have strong in-context learning (ICL) capabilities. It is generally believed that an important foundation of this ICL capability is the induction heads mechanism, which requires at least two attention layers. In order to implement the model's induction ability more efficiently, we revisit the induction heads mechanism and propose KV shifting attention. We theoretically prove that KV shifting attention reduces the model's requirements on the depth and width of the induction heads mechanism. Our experimental results demonstrate that KV shifting attention is beneficial to learning induction heads and language modeling, leading to better performance or faster convergence, from toy models to pre-trained models with more than 10B parameters.
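One plausible reading of "KV shifting" is that each token's key and value are mixed with the previous token's key and value through learnable scalars, so that a single attention layer can emulate the two-layer induction-head pattern. The toy single-head module below follows that reading; the exact parameterization in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVShiftingAttention(nn.Module):
    """Single-head toy attention whose K and V are mixed with the previous token's K and V
    via learnable scalars (one plausible reading of 'KV shifting'; details are assumptions)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(dim, dim, bias=False) for _ in range(4))
        self.alpha = nn.Parameter(torch.tensor([1.0, 0.0]))  # mixing weights for K
        self.beta = nn.Parameter(torch.tensor([1.0, 0.0]))   # mixing weights for V

    @staticmethod
    def shift(x: torch.Tensor) -> torch.Tensor:
        # Shift the sequence right by one position (token t sees token t-1); pad the first slot with zeros.
        return F.pad(x, (0, 0, 1, 0))[:, :-1, :]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(x), self.k(x), self.v(x)
        k = self.alpha[0] * k + self.alpha[1] * self.shift(k)
        v = self.beta[0] * v + self.beta[1] * self.shift(v)
        causal = torch.tril(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device))
        att = (q @ k.transpose(-2, -1)) / k.size(-1) ** 0.5
        att = att.masked_fill(~causal, float("-inf")).softmax(dim=-1)
        return self.o(att @ v)
```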
zh
[NLP-32] Ensemble Watermarks for Large Language Models
【速读】: 该论文试图解决现有大语言模型 (Large Language Models, LLMs) 水印方法在面对重述攻击 (paraphrasing attacks) 时缺乏灵活性和检测率低的问题。解决方案的关键在于提出一种多特征集成水印方法 (multi-feature ensemble watermarking),通过结合多种独特的水印特征(如首字母缩略词 (acrostica)、感官运动规范 (sensorimotor norms) 和传统的红绿水印 (red-green watermark)),显著提高水印的检测率和抗攻击能力。具体来说,该方法在未受攻击时达到98%的检测率,即使在重述攻击后仍保持95%的高检测率,远超单一红绿水印的49%检测率。通过灵活组合不同特征,该方法能够适应不同的需求和权衡,同时保持检测函数的一致性,从而在提高模型可追溯性和防止社会危害方面具有重要意义。
链接: https://arxiv.org/abs/2411.19563
作者: Georg Niess,Roman Kern
关键词-EN: large language models, language models, humans and machines, rapid advancement, advancement of large
类目: Computation and Language (cs.CL)
备注: 9 pages in the main body. Code is available at this http URL . arXiv admin note: substantial text overlap with arXiv:2405.08400
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) has made it increasingly difficult to distinguish between text written by humans and machines. While watermarks already exist for LLMs, they often lack flexibility, and struggle with attacks such as paraphrasing. To address these issues, we propose a multi-feature method for generating watermarks that combines multiple distinct watermark features into an ensemble watermark. Concretely, we combine acrostica and sensorimotor norms with the established red-green watermark to achieve a 98% detection rate. After a paraphrasing attack the performance remains high with 95% detection rate. The red-green feature alone as baseline achieves a detection rate of 49%. The evaluation of all feature combinations reveals that the ensemble of all three consistently has the highest detection rate across several LLMs and watermark strength settings. Due to the flexibility of combining features in the ensemble, various requirements and trade-offs can be addressed. Additionally, for all ensemble configurations the same detection function can be used without adaptations. This method is particularly of interest to facilitate accountability and prevent societal harm.
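As a rough illustration of how several watermark features could be pooled into one ensemble decision, the sketch below combines a standard red-green z-score with placeholder scores for the acrostica and sensorimotor features. The weighted-sum combination and the threshold are assumptions; the paper only specifies that the features are combined into an ensemble.

```python
import math

def red_green_zscore(green_hits: int, total_tokens: int, gamma: float = 0.5) -> float:
    """Standard red-green watermark detection statistic: z-score of the green-token count."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_hits - expected) / std

def ensemble_score(feature_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Assumed combination rule: weighted sum of per-feature scores (acrostica, sensorimotor, red-green)."""
    return sum(weights[name] * score for name, score in feature_scores.items())

# Illustrative values only; the acrostica and sensorimotor scores would come from their own detectors.
scores = {"red_green": red_green_zscore(green_hits=312, total_tokens=500),
          "acrostica": 2.1, "sensorimotor": 1.4}
is_watermarked = ensemble_score(scores, {name: 1.0 for name in scores}) > 4.0  # placeholder threshold
```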
zh
[NLP-33] Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning
【速读】: 该论文试图解决在低秩适配器(Low-rank Adapters)中实现高效微调大型语言模型(LLMs)时,性能往往不如全模型微调的问题。解决方案的关键在于提出了一种名为LoRA Silver Bullet(LoRA-SB)的方法,通过精心设计的初始化策略,在低秩子空间中近似全模型微调。具体来说,论文通过在B和A之间插入一个可训练的(r x r)矩阵,并保持其他矩阵固定,从而在理论上证明了LoRA-XS架构提供了实现这种近似所需的精确条件。这种方法利用了其受限的更新空间,实现了高秩梯度更新的最佳缩放,同时消除了对超参数调整的需求。实验结果表明,LoRA-SB在数学推理、常识推理和语言理解任务中,不仅超越了标准LoRA的性能,而且使用的参数数量减少了27-90倍,全面优于LoRA-XS。
链接: https://arxiv.org/abs/2411.19557
作者: Kaustubh Ponkshe,Raghav Singhal,Eduard Gorbunov,Alexey Tumanov,Samuel Horvath,Praneeth Vepakomma
关键词-EN: efficiently fine-tuning large, large language models, LoRA Silver Bullet, full fine-tuning, fall short
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Kaustubh Ponkshe and Raghav Singhal contributed equally to this work
点击查看摘要
Abstract:Low-rank adapters have become a standard approach for efficiently fine-tuning large language models (LLMs), but they often fall short of achieving the performance of full fine-tuning. We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a carefully designed initialization strategy. We theoretically demonstrate that the architecture of LoRA-XS, which inserts a trainable (r x r) matrix between B and A while keeping other matrices fixed, provides the precise conditions needed for this approximation. We leverage its constrained update space to achieve optimal scaling for high-rank gradient updates while removing the need for hyperparameter tuning. We prove that our initialization offers an optimal low-rank approximation of the initial gradient and preserves update directions throughout training. Extensive experiments across mathematical reasoning, commonsense reasoning, and language understanding tasks demonstrate that our approach exceeds the performance of standard LoRA while using 27-90x fewer parameters, and comprehensively outperforms LoRA-XS. Our findings establish that it is possible to simulate full fine-tuning in low-rank subspaces, and achieve significant efficiency gains without sacrificing performance. Our code is publicly available at this https URL.
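The structural idea is easy to state in code: keep the base weight and the low-rank factors B and A frozen, and train only an (r x r) matrix R inserted between them. The sketch below shows that structure; the update-approximation initialization that gives LoRA-SB its name is omitted, and the random frozen factors here are placeholders.

```python
import torch
import torch.nn as nn

class LoRASBLinear(nn.Module):
    """Frozen base weight plus a B @ R @ A update in which only the (r x r) matrix R is trainable
    (the LoRA-XS-style structure described in the abstract; the special initialization is omitted)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, in_f) / in_f ** 0.5, requires_grad=False)  # frozen
        self.B = nn.Parameter(torch.randn(out_f, rank) / rank ** 0.5, requires_grad=False)  # frozen
        self.R = nn.Parameter(torch.zeros(rank, rank))  # the only trainable piece

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ (self.B @ self.R @ self.A).T
```

Training only R keeps the parameter count at r squared per adapted layer, which is consistent with the 27-90x parameter reduction the abstract reports relative to standard LoRA.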
zh
[NLP-34] Training Agents with Weakly Supervised Feedback from Large Language Models
【速读】: 该论文试图解决现有基于大型语言模型(Large Language Models, LLMs)的代理在复杂任务中依赖专家轨迹或明确环境反馈的问题,这些方法限制了其在特定场景(如游戏或代码生成)中的应用。论文提出的解决方案关键在于使用弱监督信号,即通过一个批评型LLM(critic LLM)来选择和更新代理生成的轨迹,从而在不需要专家轨迹或明确反馈的情况下,实现代理能力的迭代提升。该方法在API-bank数据集上的广泛测试表明,尽管使用参数较少的开源模型,其性能仍可与GPT-4相媲美。
链接: https://arxiv.org/abs/2411.19547
作者: Dihong Gong,Pu Lu,Zelong Wang,Meng Zhou,Xiuqiang He
关键词-EN: Large Language Models, Large Language, tackle complex tasks, Language Models, offer a promising
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) offer a promising basis for creating agents that can tackle complex tasks through iterative environmental interaction. Existing methods either require these agents to mimic expert-provided trajectories or rely on definitive environmental feedback for reinforcement learning, which limits their application to specific scenarios like gaming or code generation. This paper introduces a novel training method for LLM-based agents using weakly supervised signals from a critic LLM, bypassing the need for expert trajectories or definitive feedback. Our agents are trained in an iterative manner, where they initially generate trajectories through environmental interaction. Subsequently, a critic LLM selects a subset of good trajectories, which are then used to update the agents, enabling them to generate improved trajectories in the next iteration. Extensive tests on the API-bank dataset show consistent improvement in our agents' capabilities and comparable performance to GPT-4, despite using open-source models with far fewer parameters.
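A compact way to read the training procedure is as a generate-filter-update loop. The sketch below assumes hypothetical `agent`, `critic_llm`, and `env` objects and a binary "good"/"bad" judgment from the critic; the actual selection criterion and fine-tuning recipe are not specified here.

```python
def train_with_critic(agent, critic_llm, env, num_iterations: int = 3, rollouts_per_iter: int = 64):
    """Sketch of the loop described above; `agent`, `critic_llm`, and `env` are hypothetical
    stand-ins for the fine-tunable agent, the judging LLM, and the task environment."""
    for it in range(num_iterations):
        # 1) Generate trajectories through environmental interaction.
        trajectories = [agent.rollout(env.sample_task()) for _ in range(rollouts_per_iter)]
        # 2) The critic LLM keeps only the trajectories it judges to be good (weak supervision).
        good = [t for t in trajectories if critic_llm.judge(t) == "good"]
        # 3) Supervised update on the selected trajectories before the next iteration.
        agent.finetune(good)
        print(f"iteration {it}: kept {len(good)}/{len(trajectories)} trajectories")
```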
zh
[NLP-35] Knowledge Management for Automobile Failure Analysis Using Graph RAG
【速读】: 该论文试图解决汽车故障分析中知识传递的难题,特别是从经验丰富的工程师到年轻工程师的知识传递。解决方案的关键在于优化基于检索增强生成 (Retrieval-Augmented Generation, RAG) 的大语言模型 (Large Language Models, LLMs) 和知识图谱 (Knowledge Graphs, KGs) 的结合,以提高从现有知识图谱中提取和理解子图的效率。通过改进 Graph RAG 流程,论文提出的方法在原始 QA 数据集上的 ROUGE F1 分数平均提高了 157.6%,显著提升了汽车故障分析的效果。
链接: https://arxiv.org/abs/2411.19539
作者: Yuta Ojima,Hiroki Sakaji,Tadashi Nakamura,Hiroaki Sakata,Kazuya Seki,Yuu Teshigawara,Masami Yamashita,Kazuhiro Aoyama
关键词-EN: large language models, automobile failure analysis, Graph RAG, failure analysis, knowledge graphs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 7 pages, 6 figures, to be published in 2024 IEEE International Conference on Big Data (BigData)
点击查看摘要
Abstract:This paper presents a knowledge management system for automobile failure analysis using retrieval-augmented generation (RAG) with large language models (LLMs) and knowledge graphs (KGs). In the automotive industry, there is a growing demand for transferring failure-analysis knowledge from experienced engineers to young engineers. However, failure events occur as chain reactions, making them difficult for beginners to analyze. Knowledge graphs are effective in representing failure events because they can describe semantic relationships and structured information, capturing the relationships between components; however, a KG contains a large amount of information, so it is challenging for young engineers to extract and understand relevant sub-graphs from it. On the other hand, there is increasing interest in the use of Graph RAG, a type of RAG that combines LLMs and KGs for knowledge management. However, when using the current Graph RAG framework with an existing knowledge graph for automobile failures, several issues arise because it is difficult to generate executable queries for a knowledge graph database that was not constructed by LLMs. To address this, we focused on optimizing the Graph RAG pipeline for existing knowledge graphs. On an original QA dataset, the ROUGE F1 score of the sentences generated by the proposed method showed an average improvement of 157.6% compared to the current method. This highlights the effectiveness of the proposed method for automobile failure analysis.
zh
[NLP-36] TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension
【速读】: 该论文试图解决在大规模语言模型(LLMs)在复杂的多表关系数据上的问答(QA)任务中,现有基准测试主要集中在单表QA,未能充分捕捉跨多个关系表的推理复杂性这一问题。解决方案的关键在于提出了一个新的多表QA基准测试——TQA-Bench,该基准测试通过引入多样化的真实世界数据集和灵活的采样机制,创建了具有不同多表上下文长度的任务(从8K到64K tokens),并结合符号扩展评估框架,以评估LLMs在复杂数据检索和推理能力上的表现。通过系统地评估不同规模(从7亿到700亿参数)的开源和闭源LLMs,该研究揭示了LLMs在多表QA任务中的性能瓶颈和改进机会。
链接: https://arxiv.org/abs/2411.19504
作者: Zipeng Qiu,You Peng,Guangxin He,Binhang Yuan,Chen Wang
关键词-EN: unlocked great opportunities, question answering, unlocked great, large language models, complicated multi-table relational
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:The advent of large language models (LLMs) has unlocked great opportunities in complex data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing heterogeneous table structures and potential large scale of serialized relational data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of reasoning across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. To address this gap, we present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of LLMs in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets and introduces a flexible sampling mechanism to create tasks with varying multi-table context lengths, ranging from 8K to 64K tokens. To ensure robustness and reliability, we integrate symbolic extensions into the evaluation framework, enabling the assessment of LLM reasoning capabilities beyond simple data retrieval or probabilistic pattern matching. We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments. Our benchmark implementation and results are available at this https URL.
zh
[NLP-37] COLD: Causal reasOning in cLosed Daily activities NEURIPS2024
【速读】: 该论文试图解决大语言模型(LLMs)在因果推理方面的局限性问题,特别是在日常活动中的因果关系理解。解决方案的关键在于提出了COLD(Causal reasOning in cLosed Daily activities)框架,该框架基于人类对日常真实世界活动的理解,用于推理事件的因果性质。COLD框架不仅能够生成大量因果查询(约900万条),接近于模拟因果推理的迷你图灵测试,还能通过后门准则评估事件之间的因果强度,从而更接近真实世界的因果推理。
链接: https://arxiv.org/abs/2411.19500
作者: Abhinav Joshi,Areeb Ahmad,Ashutosh Modi
关键词-EN: Large Language Models, Large Language, Language Models, causal reasoning, causal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Paper accepted at NeurIPS 2024; Total 37 Pages
点击查看摘要
Abstract:Large Language Models (LLMs) have shown state-of-the-art performance in a variety of tasks, including arithmetic and reasoning; however, to gauge the intellectual capabilities of LLMs, causal reasoning has become a reliable proxy for validating a general understanding of the mechanics and intricacies of the world similar to humans. Previous works in natural language processing (NLP) have either focused on open-ended causal reasoning via causal commonsense reasoning (CCR) or framed a symbolic representation-based question answering for theoretically backed-up analysis via a causal inference engine. The former adds an advantage of real-world grounding but lacks theoretically backed-up analysis/validation, whereas the latter is far from real-world grounding. In this work, we bridge this gap by proposing the COLD (Causal reasOning in cLosed Daily activities) framework, which is built upon human understanding of daily real-world activities to reason about the causal nature of events. We show that the proposed framework facilitates the creation of enormous causal queries (~9 million) and comes close to a mini-Turing test, simulating causal reasoning to evaluate the understanding of a daily real-world task. We evaluate multiple LLMs on the created causal queries and find that causal reasoning is challenging even for activities trivial to humans. We further explore the causal reasoning abilities of LLMs using the backdoor criterion to determine the causal strength between events.
zh
[NLP-38] A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
【速读】: 该论文提出了一种针对大型语言模型(LLMs)测试时计算效率的两阶段通用算法,旨在通过证明其可扩展性来解决LLMs在测试阶段的计算效率问题。解决方案的关键在于:首先生成N个候选解决方案,然后通过多轮淘汰赛选择最佳方案。每对候选方案进行K次比较,只有胜者进入下一轮。该算法在最小化实现中仅依赖于黑箱LLM,无需外部验证器或奖励模型,总共需要N×(K+1)次高度并行化的LLM调用。理论证明显示,当候选方案生成正确概率p_gen > 0且比较正确与错误方案的胜率p_comp > 0.5时,算法的失败概率随N和K的增加呈指数级衰减。
链接: https://arxiv.org/abs/2411.19477
作者: Yanxi Chen,Xuchen Pan,Yaliang Li,Bolin Ding,Jingren Zhou
关键词-EN: general two-stage algorithm, large language models, provable scaling law, proposed algorithm, propose a general
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress
点击查看摘要
Abstract:We propose a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates $N$ candidate solutions, and then chooses the best one via a multiple-round knockout tournament where each pair of candidates is compared $K$ times and only the winners move on to the next round. In a minimalistic implementation, both stages can be executed with a black-box LLM alone and nothing else (e.g., no external verifier or reward model), and a total of $N \times (K + 1)$ highly parallelizable LLM calls are needed for solving an input problem. Assuming that a generated candidate solution is correct with probability $p_{\text{gen}} > 0$ and a comparison between a pair of correct and incorrect solutions identifies the right winner with probability $p_{\text{comp}} > 0.5$ (i.e., better than a random guess), we prove theoretically that the failure probability of the proposed algorithm decays to zero exponentially with respect to $N$ and $K$: $\mathbb{P}(\text{final output is incorrect}) \le (1 - p_{\text{gen}})^N + \lceil \log_2 N \rceil \, e^{-2K(p_{\text{comp}} - 0.5)^2}$. Our empirical results with the challenging MMLU-Pro benchmark validate the technical assumptions, as well as the efficacy of the proposed algorithm and the gains from scaling up its test-time compute.
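Both stages can be written down in a few lines once the two black-box calls are fixed. In the sketch below, `llm_generate` and `llm_compare` are hypothetical wrappers around the same LLM; the pairing, bye handling, and majority rule are one straightforward instantiation of the described knockout tournament.

```python
import random

def llm_generate(problem: str) -> str:
    """Hypothetical black-box call that returns one candidate solution."""
    raise NotImplementedError

def llm_compare(problem: str, a: str, b: str) -> str:
    """Hypothetical black-box call that returns which candidate it judges better: 'a' or 'b'."""
    raise NotImplementedError

def knockout(problem: str, n: int = 8, k: int = 3) -> str:
    candidates = [llm_generate(problem) for _ in range(n)]   # stage 1: N independent generations
    while len(candidates) > 1:                               # stage 2: knockout rounds
        random.shuffle(candidates)
        winners = []
        if len(candidates) % 2 == 1:                         # odd candidate gets a bye this round
            winners.append(candidates.pop())
        for a, b in zip(candidates[0::2], candidates[1::2]):
            votes_a = sum(llm_compare(problem, a, b) == "a" for _ in range(k))
            winners.append(a if votes_a > k / 2 else b)      # majority over K comparisons
        candidates = winners
    return candidates[0]
```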
zh
[NLP-39] Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension Ability
【速读】: 该论文试图解决的问题是评估大型语言模型(LLMs)在自然语言任务中是否真正理解深层结构(即核心语义),而不仅仅是依赖于表层结构(如呈现格式)。解决方案的关键在于提出了因果中介分析方法,通过将深层结构的理解定义为直接因果效应(DCE),将表层结构的理解定义为间接因果效应(ICE),并开发了相应的可量化替代指标,包括近似DCE(ADCE)和近似ICE(AICE)。这些指标解决了原始DCE和ICE因无法隔离深层和表层结构相互影响而难以估计的问题。通过应用ADCE评估一系列主流LLMs,研究发现大多数模型展现出深层结构理解能力,且这种能力随着预测准确性的提高而增强。此外,比较ADCE和AICE的结果显示,闭源LLMs更依赖深层结构,而开源LLMs对表层结构更敏感,但随着模型规模的增大,这种敏感性降低。理论上,ADCE作为一种双向评估方法,不仅衡量深层结构变化对输出变化的充分性,还衡量其必要性,从而提供了比传统准确性评估更全面的评估方式。
链接: https://arxiv.org/abs/2411.19456
作者: Yujin Han,Lei Xu,Sirui Chen,Difan Zou,Chaochao Lu
关键词-EN: Large language models, natural language tasks, Large language, deep structure, surface structure
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 28 pages, 14 figures, 10 tables
点击查看摘要
Abstract:Large language models (LLMs) have shown remarkable capability in natural language tasks, yet debate persists on whether they truly comprehend deep structure (i.e., core semantics) or merely rely on surface structure (e.g., presentation format). Prior studies observe that LLMs' performance declines when intervening on surface structure, arguing their success relies on surface structure recognition. However, surface structure sensitivity does not prevent deep structure comprehension. Rigorously evaluating LLMs' capability requires analyzing both, yet deep structure is often overlooked. To this end, we assess LLMs' comprehension ability using causal mediation analysis, aiming to fully discover the capability of using both deep and surface structures. Specifically, we formulate the comprehension of deep structure as the direct causal effect (DCE) and that of surface structure as the indirect causal effect (ICE), respectively. To address the non-estimability of the original DCE and ICE, which stems from the infeasibility of isolating the mutual influences of deep and surface structures, we develop the corresponding quantifiable surrogates, including approximated DCE (ADCE) and approximated ICE (AICE). We further apply the ADCE to evaluate a series of mainstream LLMs, showing that most of them exhibit deep structure comprehension ability, which grows along with the prediction accuracy. Comparing ADCE and AICE demonstrates that closed-source LLMs rely more on deep structure, while open-source LLMs are more surface-sensitive, and this sensitivity decreases with model scale. Theoretically, ADCE is a bidirectional evaluation, which measures both the sufficiency and necessity of deep structure changes in causing output variations, thus offering a more comprehensive assessment than accuracy, a common evaluation in LLMs. Our work provides new insights into LLMs' deep structure comprehension and offers novel methods for LLMs evaluation.
zh
[NLP-40] Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models
【速读】: 该论文试图解决现有迭代检索方法在实现过程中依赖于少样本提示或手动构建规则,导致推理开销增加且未能充分利用大型语言模型(LLMs)的强大推理能力的问题。解决方案的关键在于提出了Auto-RAG,一种基于LLM决策能力的自主迭代检索模型。Auto-RAG通过与检索器进行多轮对话,系统地规划检索并优化查询,以获取有价值的知识,直至收集到足够的外部信息后将结果呈现给用户。该方法的核心在于自主合成基于推理的决策指令,并微调最新的开源LLMs,从而实现高效的自主迭代交互,显著提升了在多个基准测试中的表现。
链接: https://arxiv.org/abs/2411.19443
作者: Tian Yu,Shaolei Zhang,Yang Feng
关键词-EN: Iterative retrieval refers, Retrieval-Augmented Generation, Iterative retrieval, iterative retrieval model, model continuously queries
类目: Computation and Language (cs.CL)
备注: Code is available at this https URL
点击查看摘要
Abstract:Iterative retrieval refers to the process in which the model continuously queries the retriever during generation to enhance the relevance of the retrieved knowledge, thereby improving the performance of Retrieval-Augmented Generation (RAG). Existing work typically employs few-shot prompting or manually constructed rules to implement iterative retrieval. This introduces additional inference overhead and overlooks the remarkable reasoning capabilities of Large Language Models (LLMs). In this paper, we introduce Auto-RAG, an autonomous iterative retrieval model centered on the LLM's powerful decision-making capabilities. Auto-RAG engages in multi-turn dialogues with the retriever, systematically planning retrievals and refining queries to acquire valuable knowledge. This process continues until sufficient external information is gathered, at which point the results are presented to the user. To this end, we develop a method for autonomously synthesizing reasoning-based decision-making instructions in iterative retrieval and fine-tune the latest open-source LLMs. The experimental results indicate that Auto-RAG is capable of autonomous iterative interaction with the retriever, effectively leveraging the remarkable reasoning and decision-making abilities of LLMs, which lead to outstanding performance across six benchmarks. Further analysis reveals that Auto-RAG can autonomously adjust the number of iterations based on the difficulty of the questions and the utility of the retrieved knowledge, without requiring any human intervention. Moreover, Auto-RAG expresses the iterative retrieval process in natural language, enhancing interpretability while providing users with a more intuitive experience. Code is available at this https URL.
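The autonomous loop can be pictured as a dialogue in which the LLM alternately decides to retrieve or to answer. The sketch below assumes a hypothetical `llm.step` interface that returns either a refined query or a final answer, and a `retriever.search` call; both are placeholders, not the released code's API.

```python
def auto_rag(question: str, llm, retriever, max_iterations: int = 5) -> str:
    """Sketch of the iterative retrieval loop described above. `llm.step` is a hypothetical call
    that, given the dialogue so far, either emits a refined query or a final answer."""
    context = [f"Question: {question}"]
    for _ in range(max_iterations):
        decision = llm.step(context)                  # natural-language reasoning + decision
        if decision["action"] == "answer":
            return decision["text"]                   # enough external information gathered
        docs = retriever.search(decision["query"], top_k=5)
        context.append(f"Query: {decision['query']}")
        context.extend(f"Doc: {d}" for d in docs)     # feed retrieved knowledge back to the LLM
    return llm.step(context + ["No more retrieval budget; answer now."])["text"]
```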
zh
[NLP-41] Actions and Objects Pathways for Domain Adaptation in Video Question Answering
【速读】: 该论文试图解决视频问答任务中的跨域泛化问题,即在不经过显式训练的情况下,如何使模型在未见过的领域中表现良好。解决方案的关键在于提出了Actions and Objects Pathways (AOPath),该方法通过将预训练模型中的特征分离为动作特征和对象特征,并通过独立的推理路径进行处理,从而增强了模型的泛化能力。AOPath引入了一种新颖的模块,该模块能够将跨域特征转换为与域无关的特征,且不引入任何可训练的权重。实验结果表明,AOPath在跨域和同域数据集上的表现分别比传统分类器高出5%和4%,并且相比需要训练数百万参数的先前方法,AOPath仅训练了极少的参数。
链接: https://arxiv.org/abs/2411.19434
作者: Safaa Abdullahi Moallim Mohamud,Ho-Young Jung
关键词-EN: question answering tasks, video question answering, generalization in video, answering tasks, video question
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this paper, we introduce the Actions and Objects Pathways (AOPath) for out-of-domain generalization in video question answering tasks. AOPath leverages features from a large pretrained model to enhance generalizability without the need for explicit training on the unseen domains. Inspired by human brain, AOPath dissociates the pretrained features into action and object features, and subsequently processes them through separate reasoning pathways. It utilizes a novel module which converts out-of-domain features into domain-agnostic features without introducing any trainable weights. We validate the proposed approach on the TVQA dataset, which is partitioned into multiple subsets based on genre to facilitate the assessment of generalizability. The proposed approach demonstrates 5% and 4% superior performance over conventional classifiers on out-of-domain and in-domain datasets, respectively. It also outperforms prior methods that involve training millions of parameters, whereas the proposed approach trains very few parameters.
zh
[NLP-42] Libra: Leveraging Temporal Images for Biomedical Radiology Analysis
【速读】: 该论文试图解决放射报告生成 (Radiology Report Generation, RRG) 中多时间点图像分析时忽略时间信息的问题。解决方案的关键在于引入了一个名为 Libra 的时序感知多模态大语言模型 (Multimodal Large Language Model, MLLM),该模型结合了放射学专用图像编码器和 MLLM,并通过一种新颖的时序对齐连接器 (Temporal Alignment Connector) 来精确捕捉和综合不同时间点图像的时序信息。这一创新使得 Libra 在 MIMIC-CXR 数据集上的 RRG 任务中达到了新的最先进水平,显著提升了 RadCliQ 指标和所有词汇指标的表现。
链接: https://arxiv.org/abs/2411.19378
作者: Xi Zhang,Zaiqiao Meng,Jake Lever,Edmond S. L. Ho
关键词-EN: Radiology report generation, Radiology report, accurate report generation, report generation, understanding of medical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Radiology report generation (RRG) is a challenging task, as it requires a thorough understanding of medical images, integration of multiple temporal inputs, and accurate report generation. Effective interpretation of medical images, such as chest X-rays (CXRs), demands sophisticated visual-language reasoning to map visual findings to structured reports. Recent studies have shown that multimodal large language models (MLLMs) can acquire multimodal capabilities by aligning with pre-trained vision encoders. However, current approaches predominantly focus on single-image analysis or utilise rule-based symbolic processing to handle multiple images, thereby overlooking the essential temporal information derived from comparing current images with prior ones. To overcome this critical limitation, we introduce Libra, a temporal-aware MLLM tailored for CXR report generation using temporal images. Libra integrates a radiology-specific image encoder with a MLLM and utilises a novel Temporal Alignment Connector to capture and synthesise temporal information of images across different time points with unprecedented precision. Extensive experiments show that Libra achieves new state-of-the-art performance among the same parameter scale MLLMs for RRG tasks on the MIMIC-CXR. Specifically, Libra improves the RadCliQ metric by 12.9% and makes substantial gains across all lexical metrics compared to previous models.
zh
[NLP-43] DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities
【速读】: 该论文试图解决的问题是如何系统地评估影响语言模型(LMs)在长输入上下文中提取特定信息(即“大海捞针”任务)能力的因素。解决方案的关键在于开发了一个名为DENIAHL(Data-oriented Evaluation of NIAH for LLM’s)的综合基准测试,该基准不仅考虑了上下文长度,还系统地分析了数据类型、数据大小和数据模式等因素对模型性能的影响。通过对比GPT-3.5和LLaMA 2-7B在DENIAHL上的表现,研究发现这些因素显著影响了模型的“大海捞针”能力,特别是在增加项目大小或改变数据类型时,模型的召回性能显著下降。这一发现对于理解大型上下文模型在实际应用中的表现具有重要意义。
链接: https://arxiv.org/abs/2411.19360
作者: Hui Dai,Dan Pechi,Xinyi Yang,Garvit Banga,Raghav Mantri
关键词-EN: assess language models’, long input context, language models’, general task, assess language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The Needle-in-a-haystack (NIAH) test is a general task used to assess language models’ (LMs’) abilities to recall particular information from long input context. This framework however does not provide a means of analyzing what factors, beyond context length, contribute to LMs’ abilities or inabilities to separate and recall needles from their haystacks. To provide a systematic means of assessing what features contribute to LMs’ NIAH capabilities, we developed a synthetic benchmark called DENIAHL (Data-oriented Evaluation of NIAH for LLM’s). Our work expands on previous NIAH studies by ablating NIAH features beyond typical context length including data type, size, and patterns. We find stark differences between GPT-3.5 and LLaMA 2-7B’s performance on DENIAHL, and drops in recall performance when features like item size are increased, and to some degree when data type is changed from numbers to letters. This has implications for increasingly large context models, demonstrating factors beyond item-number impact NIAH capabilities.
zh
[NLP-44] CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections
【速读】: 该论文试图解决在无标签数据情况下,如何提升基于CLIP的图像分类性能的问题。解决方案的关键在于提出了一种无需标签的提示调优方法,该方法结合了自监督学习模型(DINO)的丰富视觉特征和大语言模型(LLMs)的广泛文本知识。具体步骤包括:(1) 利用LLMs生成更准确表示对象类别的文本特征嵌入,以实现更有效的零样本分类;(2) 使用这些文本嵌入生成伪标签,训练一个整合LLM描述性文本嵌入和DINO视觉特征的对齐模块;(3) 通过DINO辅助监督,使用训练好的对齐模块对CLIP的视觉编码器进行提示调优。这种方法充分利用了视觉和文本基础模型的优势,显著提升了无标签分类的性能,并在多个数据集上超越了现有最先进的方法。
链接: https://arxiv.org/abs/2411.19346
作者: Mohamed Fazli Imam,Rufael Fedaku Marew,Jameel Hassan,Mustansar Fiaz,Alham Fikri Aji,Hisham Cholakkal
关键词-EN: common embedding space, tool for aligning, aligning text, visual features, visual
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet, these SSL models require an additional supervised linear probing step, which relies on fully labeled data which is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of self-supervised learning models (DINO) and the broad textual knowledge of large language models (LLMs) to largely enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs, enabling more effective zero-shot classification compared to CLIP’s default name-specific prompts. (2) These textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and DINO’s visual features. (3) Finally, we prompt-tune CLIP’s vision encoder through DINO-assisted supervision using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFter across 11 diverse image classification datasets.
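Steps (1) and (2) boil down to zero-shot scoring with LLM-description embeddings followed by confidence-filtered pseudo-labeling. The sketch below shows that scoring step with a CLIP-style temperature; the confidence threshold and filtering rule are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def pseudo_label(image_feats: torch.Tensor, class_text_feats: torch.Tensor,
                 conf_threshold: float = 0.3):
    """image_feats: (N, D) CLIP image embeddings; class_text_feats: (C, D) embeddings of
    LLM-written class descriptions. Returns pseudo-labels for confident images only.
    (Thresholding rule is an assumption, not the paper's exact recipe.)"""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(class_text_feats, dim=-1)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)     # CLIP-style logit scale of 100
    conf, labels = probs.max(dim=-1)
    keep = conf > conf_threshold                      # keep only confident pseudo-labels
    return labels[keep], keep
```

The retained pseudo-labels would then supervise the alignment module in step (2) before the DINO-assisted prompt tuning in step (3).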
zh
[NLP-45] Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
【速读】: 该论文试图解决开放词汇分割 (Open-Vocabulary Segmentation, OVS) 中图像分割任务的问题,即在没有预定义训练类别的情况下,根据自由形式的文本概念进行图像分割。现有方法如 CLIP 和 DINO 各自存在局限:CLIP 在全局对齐图像和文本特征时面临空间定位挑战,而 DINO 在细粒度视觉编码方面表现出色但缺乏与语言的整合。论文提出的解决方案之关键是 Talk2DINO,一种结合 DINOv2 的空间准确性和 CLIP 的语言理解能力的新型混合方法。通过学习映射函数,Talk2DINO 将 CLIP 的文本嵌入与 DINOv2 的补丁级特征对齐,无需微调底层模型。训练时,利用 DINOv2 的注意力图选择性地对齐局部视觉补丁与文本嵌入,从而增强分割过程的语义和定位能力,实现更自然、噪声更少的分割效果,并有效区分前景和背景。实验结果表明,Talk2DINO 在多个无监督 OVS 基准测试中达到了最先进的性能。
链接: https://arxiv.org/abs/2411.19331
作者: Luca Barsellotti,Lorenzo Bianchi,Nicola Messina,Fabio Carrara,Marcella Cornia,Lorenzo Baraldi,Fabrizio Falchi,Rita Cucchiara
关键词-EN: predefined training classes, free-form textual concepts, aims at segmenting, concepts without predefined, Vision Transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: this https URL.
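The core of the approach is a small learned mapping that carries CLIP text embeddings into DINOv2's patch-feature space, after which per-patch similarities act as a coarse segmentation map. The sketch below uses a two-layer MLP as the mapping and default embedding sizes (512 for CLIP text, 768 for DINOv2); both choices are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToDINOMapper(nn.Module):
    """Learned mapping from CLIP text embeddings into DINOv2 patch-feature space
    (a minimal stand-in; the actual mapping in Talk2DINO may differ)."""
    def __init__(self, clip_dim: int = 512, dino_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(clip_dim, dino_dim), nn.GELU(),
                                  nn.Linear(dino_dim, dino_dim))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(text_emb)

def patch_similarity_map(dino_patches: torch.Tensor, mapped_text: torch.Tensor) -> torch.Tensor:
    """dino_patches: (P, D) patch features; mapped_text: (D,). Returns per-patch similarity
    scores that can be reshaped into a coarse segmentation heat map."""
    return F.normalize(dino_patches, dim=-1) @ F.normalize(mapped_text, dim=0)
```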
zh
[NLP-46] Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows
【速读】: 该论文试图解决从科学文章中提取生物信息学工作流程详细信息的难题,这一难题由于缺乏标注语料库而受到阻碍。解决方案的关键在于将问题框架为低资源提取任务,并测试了四种策略:1) 创建定制的标注语料库;2) 使用自回归语言模型进行少样本命名实体识别 (NER);3) 使用掩码语言模型结合现有和新语料库进行NER;4) 将工作流程知识整合到NER模型中。通过使用新构建的BioToFlow语料库(包含52篇文章和16个实体的标注),基于SciBERT的NER模型达到了70.4的F-measure,与标注者间的一致性相当。尽管知识整合在特定实体上提升了性能,但在整个信息模式上效果较差。研究结果表明,高性能的生物信息学工作流程信息提取是可实现的。
链接: https://arxiv.org/abs/2411.19295
作者: Clémence Sebe,Sarah Cohen-Boulakia,Olivier Ferret,Aurélie Névéol
关键词-EN: complex biological data, biological data analyses, public repositories, essential for complex, complex biological
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.
zh
[NLP-47] Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks
【速读】: 该论文试图解决尼泊尔语(Nepali)在自然语言处理(NLP)评估中面临的独特挑战,特别是其复杂的文字系统(Devanagari script)、形态学特征和多种方言带来的问题。现有的Nepali Language Understanding Evaluation (Nep-gLUE)基准仅涵盖四个任务,限制了其对NLP模型进行全面评估的能力。为解决这一局限,论文引入了八个新的数据集,创建了新的Nepali Language Understanding Evaluation (NLUE)基准,涵盖了总共12个任务,包括单句分类、相似性和释义任务以及自然语言推理(NLI)任务。通过扩展任务范围,论文揭示了现有模型在处理复杂NLU任务时的不足,从而为评估、比较和推进模型设定了新的标准,对推动低资源语言的NLP研究具有重要意义。
链接: https://arxiv.org/abs/2411.19244
作者: Jinu Nyachhyon,Mridul Sharma,Prajwal Thapa,Bal Krishna Bal
关键词-EN: Nepali Language Understanding, Language Understanding Evaluation, Devanagari script, distinct linguistic features, Nepali language
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The Nepali language has distinct linguistic features, especially its complex script (Devanagari script), morphology, and various dialects, which pose a unique challenge for natural language processing (NLP) evaluation. While the Nepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a foundation for evaluating models, it remains limited in scope, covering four tasks. This restricts their utility for comprehensive assessments of NLP models. To address this limitation, we introduce eight new datasets, creating a new benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark, which covers a total of 12 tasks for evaluating the performance of models across a diverse set of Natural Language Understanding (NLU) tasks. The added tasks include single-sentence classification, similarity and paraphrase tasks, and Natural Language Inference (NLI) tasks. On evaluating the models using added tasks, we observe that the existing models fall short in handling complex NLU tasks effectively. This expanded benchmark sets a new standard for evaluating, comparing, and advancing models, contributing significantly to the broader goal of advancing NLP research for low-resource languages.
zh
[NLP-48] How far can bias go? – Tracing bias from pretraining data to alignment
【速读】: 该论文试图解决生成式大型语言模型 (LLMs) 在用户应用中集成的过程中,如何应对和理解性别职业偏见的问题。解决方案的关键在于研究预训练数据中的性别职业偏见与其在模型输出中的表现之间的关联,特别是通过分析Dolma数据集和OLMo模型。研究采用零样本提示和词元共现分析方法,揭示了预训练数据中的偏见如何在模型输出中被放大。此外,研究还探讨了提示类型、超参数和指令微调对偏见表达的影响,发现指令微调能在一定程度上缓解表征偏见,但仍保留总体的性别刻板印象,而超参数和提示变化对偏见表达的影响较小。研究强调了在预训练阶段缓解偏见的重要性。
链接: https://arxiv.org/abs/2411.19240
作者: Marion Thaler,Abdullatif Köksal,Alina Leidinger,Anna Korhonen,Hinrich Schütze
关键词-EN: perpetuate societal inequalities, user-facing applications, inequalities is crucial, increasingly integrated, integrated into user-facing
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and their manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding instruction-tuning partially alleviating representational bias while still maintaining overall stereotypical gender associations, whereas hyperparameters and prompting variation have a lesser effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pretraining stage.
zh
[NLP-49] An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation
【速读】: 该论文试图解决大语言模型(LLMs)在数据到文本生成(DTG)任务中生成事实一致性文本的挑战。解决方案的关键在于对五个广泛使用的DTG数据集(E2E, ViGGo, WikiTableText, DART, WebNLG)和五个知名的LLM家族(T5, BART, OPT, BLOOM, Llama 2)进行深入的事实一致性评估。通过采用四种最先进的自动评估指标和必要的人工评估,论文揭示了三个关键发现:1) Llama 2在生成事实一致性文本方面表现优异,而较小的模型如T5和BART在词汇多样性较低的大型数据集上也能实现较强的事实一致性;2) 模型规模的增加(即训练参数数量的增加)通常会提高LLMs在DTG中的事实一致性;3) 源参考文本的语义分歧通常会降低LLMs在DTG中的事实一致性。
链接: https://arxiv.org/abs/2411.19203
作者: Joy Mahapatra,Utpal Garain
关键词-EN: Large Language Models, Large Language, shown exceptional performance, DTG, factual consistency
类目: Computation and Language (cs.CL)
备注: 15 pages
点击查看摘要
Abstract:Large Language Models (LLMs) have shown exceptional performance across various Data-to-Text Generation (DTG) tasks. However, generating factually consistent text in DTG remains challenging for LLMs. Despite this, in-depth evaluations of LLM factual consistency for DTG remain missing in the current literature. This paper addresses this gap by providing an extensive evaluation of factual consistency in LLMs for DTG. Our evaluation covers five widely used DTG datasets (E2E, ViGGo, WikiTableText, DART, and WebNLG) and five prominent LLM families (T5, BART, OPT, BLOOM, and Llama 2). To ensure a thorough evaluation of factual consistency, we use four state-of-the-art automatic metrics and include essential human assessments. Our extensive evaluations reveals three key findings regarding factual consistency in LLMs for DTG. First, Llama 2 often excels in generating factually consistent text, although smaller models like T5 and BART can achieve strong factual consistency on larger, lexically less-diverse datasets. Second, the average rate of change (AROC) indicates that increasing model size (number of model trainable parameters) generally enhances factual consistency of LLMs in DTG. Third, we observe that source-reference divergence (i.e., when the reference text diverges semantically from the source) typically reduces the factual consistency of LLMs in DTG.
zh
[NLP-50] Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection Grounding in VLMs
【速读】: 该论文试图解决大模态模型(Large Multimodal Models, LMMs)中普遍存在的幻觉问题,特别是在视觉理解方面的幻觉。解决方案的关键在于提出了一种改进的方法,该方法利用LMMs中间层的上下文词嵌入(contextual token embeddings)来增强幻觉检测和定位能力。这种方法不仅显著提高了对多种类别(如动作和OCR)的幻觉检测和定位效果,还在需要上下文理解的任务(如空间关系和属性比较)中表现出色。通过生成高度精确的边界框,该技术促进了从零样本对象分割(Zero-Shot Object Segmentation)到基于定位的视觉问答(Grounded Visual Question Answering)的过渡,从而为构建更可靠和可解释的多模态模型铺平了道路。
链接: https://arxiv.org/abs/2411.19187
作者: Anirudh Phukan,Divyansh,Harshit Kumar Morj,Vaishnavi,Apoorv Saxena,Koustava Goswami
关键词-EN: Large Language Models, integrating modality-specific encoders, Large Language, Large Multimodal Models, development of Large
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The rapid development of Large Multimodal Models (LMMs) has significantly advanced multimodal understanding by harnessing the language abilities of Large Language Models (LLMs) and integrating modality-specific encoders. However, LMMs are plagued by hallucinations that limit their reliability and adoption. While traditional methods to detect and mitigate these hallucinations often involve costly training or rely heavily on external models, recent approaches utilizing internal model features present a promising alternative. In this paper, we critically assess the limitations of the state-of-the-art training-free technique, the logit lens, in handling generalized visual hallucinations. We introduce a refined method that leverages contextual token embeddings from middle layers of LMMs. This approach significantly improves hallucination detection and grounding across diverse categories, including actions and OCR, while also excelling in tasks requiring contextual understanding, such as spatial relations and attribute comparison. Our novel grounding technique yields highly precise bounding boxes, facilitating a transition from Zero-Shot Object Segmentation to Grounded Visual Question Answering. Our contributions pave the way for more reliable and interpretable multimodal models.
zh
[NLP-51] Examining Multimodal Gender and Content Bias in ChatGPT-4o
【速读】: 该论文试图解决生成式 AI 模型(如 ChatGPT-4o)在多模态内容生成中存在的性别偏见和内容审查不平衡问题。研究指出,ChatGPT-4o 在处理性内容和裸露内容时表现出严格的审查,而在处理暴力和毒品相关主题时则较为宽松,尤其是对女性相关内容的审查更为严格。解决方案的关键在于强调 AI 系统需要超越政治正确,真正遵循伦理标准和承担责任,以实现更平衡和道德的内容审查实践。研究呼吁业界在设计和实施 AI 内容审查机制时,应更加注重公正性和伦理考量,以减少偏见并提升系统的透明度和可信度。
链接: https://arxiv.org/abs/2411.19140
作者: Roberto Balestri
关键词-EN: highlighting significant disparities, nudity versus violent, study investigates, highlighting significant, drug-related themes
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Other Statistics (stat.OT)
备注: 17 pages, 4 figures, 3 tables. Conference: “14th International Conference on Artificial Intelligence, Soft Computing and Applications (AIAA 2024), London, 23-24 November 2024” It will be published in the proceedings “David C. Wyld et al. (Eds): IoTE, CNDC, DSA, AIAA, NLPTA, DPPR - 2024”
点击查看摘要
Abstract:This study investigates ChatGPT-4o’s multimodal content generation, highlighting significant disparities in its treatment of sexual content and nudity versus violent and drug-related themes. Detailed analysis reveals that ChatGPT-4o consistently censors sexual content and nudity, while showing leniency towards violence and drug use. Moreover, a pronounced gender bias emerges, with female-specific content facing stricter regulation compared to male-specific content. This disparity likely stems from media scrutiny and public backlash over past AI controversies, prompting tech companies to impose stringent guidelines on sensitive issues to protect their reputations. Our findings emphasize the urgent need for AI systems to uphold genuine ethical standards and accountability, transcending mere political correctness. This research contributes to the understanding of biases in AI-driven language and multimodal models, calling for more balanced and ethical content moderation practices.
zh
[NLP-52] Integration of Contextual Descriptors in Ontology Alignment for Enrichment of Semantic Correspondence ATC
【速读】: 该论文试图解决语义本体对齐问题,特别是如何通过引入上下文描述符来提升本体对齐的准确性。解决方案的关键在于开发了一种形式化方法,能够将基本描述符和上下文描述符整合,构建一个综合的知识模型。这种模型通过层次化的语义结构和数学工具来分析概念间的潜在冲突,如在人工智能背景下“透明性”与“隐私”之间的冲突。实验结果显示,引入上下文描述符后,本体对齐的指标显著提升,特别是在隐私、责任和自由自主性等领域,平均整体提升约4.36%。这表明该方法能更准确地反映知识的复杂性和其上下文依赖性。
链接: https://arxiv.org/abs/2411.19113
作者: Eduard Manziuk,Oleksander Barmak,Pavlo Radiuk,Vladislav Kuznetsov,Iurii Krak,Sergiy Yakovlev
关键词-EN: contextual descriptors, paper proposes, semantic ontology alignment, contextual, ontology alignment
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Ontology alignment, contextual descriptors, semantic matching, knowledge representation, essential descriptors, ontology integration, hierarchical structure, semantic heterogeneity, ethical AI
点击查看摘要
Abstract:This paper proposes a novel approach to semantic ontology alignment using contextual descriptors. A formalization was developed that enables the integration of essential and contextual descriptors to create a comprehensive knowledge model. The hierarchical structure of the semantic approach and the mathematical apparatus for analyzing potential conflicts between concepts, particularly in the example of “Transparency” and “Privacy” in the context of artificial intelligence, are demonstrated. Experimental studies showed a significant improvement in ontology alignment metrics after the implementation of contextual descriptors, especially in the areas of privacy, responsibility, and freedom autonomy. The application of contextual descriptors achieved an average overall improvement of approximately 4.36%. The results indicate the effectiveness of the proposed approach for more accurately reflecting the complexity of knowledge and its contextual dependence.
zh
[NLP-53] VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
【速读】: 该论文试图解决的问题是如何构建一个高效的双语(韩语-英语)视觉语言模型 (VLM),以实现图像与文本的理解和生成。解决方案的关键在于引入了一种逐步训练策略,使得模型能够在保留骨干模型知识的同时,学习语言和视觉信息。具体来说,VARCO-VISION 模型通过这种策略在双语图像-文本理解与生成任务中表现出色,并具备定位、引用和光学字符识别 (OCR) 的能力,从而扩展了其在实际应用中的潜力。此外,论文还发布了五个韩语评估数据集,进一步支持了模型的评估和应用。
链接: https://arxiv.org/abs/2411.19103
作者: Jeongho Ju,Daeyoung Kim,SunYoung Park,Youngjune Kim
关键词-EN: open-source Korean-English vision-language, Korean-English vision-language model, introduce an open-source, open-source Korean-English, Korean-English vision-language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 24 pages, 15 figures, 4 tables. Model weights at this https URL . Benchmarks released at NCSOFT’s HuggingFace repositories (K-MMBench, K-SEED, K-MMStar, K-DTCBench, K-LLaVA-W). VARCO-VISION is an open-source Korean-English VLM with OCR, grounding, and referring capabilities
点击查看摘要
Abstract:In this paper, we introduce an open-source Korean-English vision-language model (VLM), VARCO-VISION. We incorporate a step-by-step training strategy that allows the model to learn both linguistic and visual information while preserving the backbone model's knowledge. Our model demonstrates outstanding performance in diverse settings requiring bilingual image-text understanding and generation abilities compared to models of similar size. VARCO-VISION is also capable of grounding, referring, and OCR, expanding its usage and potential applications for real-world scenarios. In addition to the model, we release five Korean evaluation datasets, including four closed-set benchmarks and one open-set benchmark. We anticipate that our milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at this https URL.
zh
[NLP-54] Pralekha: An Indic Document Alignment Evaluation Benchmark
【速读】: 该论文试图解决现有句子嵌入模型在处理文档级信息时的局限性,以及缺乏高质量平行文档对评估基准的问题,特别是在印度语言(Indic languages)中的应用。解决方案的关键在于引入了大规模基准Pralekha,该基准包含超过200万份文档,涵盖11种印度语言和英语,并采用1:2的未对齐与对齐文档比例。论文提出了一种新的评分方法——文档对齐系数(Document Alignment Coefficient, DAC),用于在句子级和块级对齐中评估文档对齐效果。DAC显著优于传统的池化方法,在噪声环境下实现了平均20-30%的精确度提升和15-20%的F1分数提升,从而有效解决了印度语言平行文档挖掘中的对齐问题。
链接: https://arxiv.org/abs/2411.19096
作者: Sanjay Suryanarayanan,Haiyue Song,Mohammed Safi Ur Rahman Khan,Anoop Kunchukuttan,Mitesh M. Khapra,Raj Dabre
关键词-EN: limited context windows, capturing document-level information, effectively capturing document-level, Indic languages, context windows
类目: Computation and Language (cs.CL)
备注: Work in Progress
点击查看摘要
Abstract:Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC’s effectiveness in parallel document mining for Indic languages.
zh
[NLP-55] Way to Specialist: Closing Loop Between Specialized LLM and Evolving Domain Knowledge Graph KDD2025
【速读】: 该论文试图解决通用大型语言模型(LLMs)在需要专业知识的推理任务中表现不足的问题。解决方案的关键在于提出了Way-to-Specialist (WTS)框架,该框架通过结合检索增强生成(retrieval-augmented generation)和知识图谱(Knowledge Graphs, KGs),在不进行专门训练的情况下增强LLMs的专业能力。WTS创新性地提出了“LLM \circlearrowright KG”范式,实现了专业LLM与领域知识图谱(Domain Knowledge Graph, DKG)之间的双向增强。该范式包括两个紧密耦合的组件:DKG增强的LLM和LLM辅助的DKG进化。前者从DKG中检索与问题相关的领域知识,并用其提示LLM以增强领域特定任务的推理能力;后者利用LLM从处理的任务中生成新的领域知识,并用其进化DKG。WTS通过闭环机制,使得DKG增强的LLM和LLM辅助的DKG进化之间形成持续改进,从而在逐步回答和学习领域特定问题的过程中不断提升领域专业化水平。
链接: https://arxiv.org/abs/2411.19064
作者: Yutong Zhang,Lixing Chen,Shenghong Li,Nan Cao,Yang Shi,Jiaxin Ding,Zhe Qu,Pan Zhou,Yang Bai
关键词-EN: Large language models, Large language, LLM, LLM-Assisted DKG Evolution, demonstrated exceptional performance
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2025
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated exceptional performance across a wide variety of domains. Nonetheless, generalist LLMs continue to fall short in reasoning tasks necessitating specialized knowledge. Prior investigations into specialized LLMs focused on domain-specific training, which entails substantial efforts in domain data acquisition and model parameter fine-tuning. To address these challenges, this paper proposes the Way-to-Specialist (WTS) framework, which synergizes retrieval-augmented generation with knowledge graphs (KGs) to enhance the specialized capability of LLMs in the absence of specialized training. In distinction to existing paradigms that merely utilize external knowledge from general KGs or static domain KGs to prompt LLM for enhanced domain-specific reasoning, WTS proposes an innovative “LLM \circlearrowright KG” paradigm, which achieves bidirectional enhancement between specialized LLM and domain knowledge graph (DKG). The proposed paradigm encompasses two closely coupled components: the DKG-Augmented LLM and the LLM-Assisted DKG Evolution. The former retrieves question-relevant domain knowledge from DKG and uses it to prompt LLM to enhance the reasoning capability for domain-specific tasks; the latter leverages LLM to generate new domain knowledge from processed tasks and use it to evolve DKG. WTS closes the loop between DKG-Augmented LLM and LLM-Assisted DKG Evolution, enabling continuous improvement in the domain specialization as it progressively answers and learns from domain-specific questions. We validate the performance of WTS on 6 datasets spanning 5 domains. The experimental results show that WTS surpasses the previous SOTA in 4 specialized domains and achieves a maximum performance improvement of 11.3%.
zh
[NLP-56] DIESEL – Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs
【速读】: 该论文试图解决对话式大型语言模型(LLMs)在生成响应时可能不符合人类价值观(如伦理标准、安全性和社会规范)的问题。解决方案的关键是提出了一种轻量级的推理引导技术,称为DIESEL,它能够无缝集成到任何自回归LLM中,通过在潜在空间中重新排序LLM提出的标记,基于其与预定义负面概念的相似性,从而从响应中语义过滤掉不希望的概念。DIESEL不仅作为独立的安全保障,还可以作为额外的防御层,增强响应的安全性,同时保持高效的计算性能。
链接: https://arxiv.org/abs/2411.19038
作者: Ben Ganon,Alon Zolfi,Omer Hofman,Inderjeet Singh,Hisashi Kojima,Yuval Elovici,Asaf Shabtai
关键词-EN: making significant advancements, online customer engagement, shown tremendous success, large language models, conversational large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In recent years, conversational large language models (LLMs) have shown tremendous success in tasks such as casual conversation, question answering, and personalized dialogue, making significant advancements in domains like virtual assistance, social interaction, and online customer engagement. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety, or social norms), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM’s proposed tokens based on their similarity to predefined negative concepts in the latent space. This approach provides an efficient and effective solution for maintaining alignment with human values. Our evaluation demonstrates DIESEL’s effectiveness on state-of-the-art conversational models (e.g., Llama 3), even in challenging jailbreaking scenarios that test the limits of response safety. We further show that DIESEL can be generalized to use cases other than safety, providing a versatile solution for general-purpose response filtering with minimal computational overhead.
zh
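To make the reranking idea above more concrete (downweighting candidate tokens whose latent representations sit close to predefined negative concepts), here is a small numpy sketch. The mixing weight `alpha`, the use of output-token embeddings, and the scoring rule are illustrative assumptions; DIESEL's actual procedure may differ.

```python
import numpy as np

def rerank_top_k(logits, token_embs, negative_embs, k=10, alpha=0.5):
    """Rerank the k most likely next tokens, downweighting those whose
    embeddings sit close to any predefined negative-concept embedding."""
    top = np.argsort(logits)[-k:]                        # indices of the k best candidates
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()                                         # renormalized candidate probabilities
    cand = token_embs[top]
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    neg = negative_embs / np.linalg.norm(negative_embs, axis=1, keepdims=True)
    closeness = (cand @ neg.T).max(axis=1)               # similarity to the nearest negative concept
    score = (1 - alpha) * p + alpha * (1 - closeness)    # trade off fluency against safety
    return top[np.argsort(score)[::-1]]                  # candidate token ids, best first

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    vocab, dim = 100, 32
    logits = rng.normal(size=vocab)
    embs = rng.normal(size=(vocab, dim))
    negatives = embs[:3] + 0.01 * rng.normal(size=(3, dim))  # pretend tokens 0-2 are undesirable
    print(rerank_top_k(logits, embs, negatives, k=5))
```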
[NLP-57] A Survey on Automatic Online Hate Speech Detection in Low-Resource Languages
【速读】: 该论文试图解决低资源语言中仇恨言论检测的问题。解决方案的关键在于详细调查全球范围内低资源语言的仇恨言论检测现状,包括可用的数据集、利用的特征和采用的技术。论文进一步讨论了现有的调查、仇恨言论相关的重叠概念、研究挑战和机遇,旨在为政策制定者和研究人员提供全面的视角和方法,以应对低资源语言中仇恨言论检测的复杂性。
链接: https://arxiv.org/abs/2411.19017
作者: Susmita Das,Arpita Dutta,Kingshuk Roy,Abir Mondal,Arnab Mukhopadhyay
关键词-EN: social media platforms, hate speech, social media, expanding influence, past decade
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 34 pages, 12 figures
点击查看摘要
Abstract:The expanding influence of social media platforms over the past decade has impacted the way people communicate. The level of obscurity provided by social media and the easy accessibility of the internet have facilitated the spread of hate speech. The terms and expressions related to hate speech get updated with changing times, which poses an obstacle to policy-makers and researchers in hate speech identification. With a growing number of individuals using their native languages to communicate with each other, hate speech in these low-resource languages is also growing. Although there is awareness of English-related approaches, much attention has not been given to these low-resource languages due to the lack of datasets and online available data. This article provides a detailed survey of hate speech detection in low-resource languages around the world, with details of available datasets, features utilized and techniques used. The survey further discusses prior surveys, overlapping concepts related to hate speech, research challenges and opportunities.
zh
[NLP-58] Talking to oneself in CMC: a study of self replies in Wikipedia talk pages
【速读】: 该论文试图解决的问题是对维基百科讨论页面中自我回复(self replies)现象的定性分析,特别是当讨论的前两条消息由同一用户撰写的情况。解决方案的关键在于提出了一个七类别的分类法,用于标注和分析这种特定模式。通过分析两个参考样本(英语和法语各100个讨论线程),研究比较了人工标注者和指令微调的大型语言模型(LLMs)在处理这些类别时的表现,揭示了人工标注者在整体效率上表现合理,而LLMs在某些类别上遇到较大困难。
链接: https://arxiv.org/abs/2411.19007
作者: Ludovic Tanguy(CLLE),Céline Poudat,Lydia-Mai Ho-Dac(CLLE)
关键词-EN: Wikipedia talk pages, replies in Wikipedia, Wikipedia talk, talk pages, qualitative analysis
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This study proposes a qualitative analysis of self replies in Wikipedia talk pages, more precisely when the first two messages of a discussion are written by the same user. This specific pattern occurs in more than 10% of threads with two messages or more and can be explained by a number of reasons. After a first examination of the lexical specificities of second messages, we propose a seven-category typology and use it to annotate two reference samples (English and French) of 100 threads each. Finally, we analyse and compare the performance of human annotators (who reach a reasonable global efficiency) and instruction-tuned LLMs (which encounter important difficulties with several categories).
zh
[NLP-59] USTCCTSU at SemEval-2024 Task 1: Reducing Anisotropy for Cross-lingual Semantic Textual Relatedness Task
【速读】: 该论文试图解决跨语言语义文本相关性任务(Cross-lingual semantic textual relatedness task),这一任务在跨语言交流和文本理解中具有重要意义。解决方案的关键在于选择XLM-R-base作为基础模型,并通过基于白化的预训练句子表示来减少维度。此外,设计了一种精细的数据过滤方法以缓解多语言数据集的诅咒效应。这些策略使得论文在西班牙语和印尼语的评测中分别获得了第二和第三的成绩,并在竞赛的C赛道中多次进入前十。
链接: https://arxiv.org/abs/2411.18990
作者: Jianjian Li,Shengwei Liang,Yong Liao,Hongping Deng,Haiyang Yu
关键词-EN: semantic textual relatedness, textual relatedness task, Cross-lingual semantic textual, http URL, textual relatedness
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures
点击查看摘要
Abstract:The cross-lingual semantic textual relatedness task is an important research task that addresses challenges in cross-lingual communication and text understanding. It helps establish semantic connections between different languages, which is crucial for downstream tasks like machine translation, multilingual information retrieval, and other cross-lingual text applications. Based on extensive comparative experiments, we choose XLM-R-base as our base model and use pre-trained sentence representations based on whitening to reduce anisotropy. Then, for the given training data, we design a delicate data filtering method to alleviate the curse of multilingualism. With our approach, we achieve a 2nd score in Spanish, a 3rd in Indonesian, and multiple entries in the top ten results in the competition’s track C. We further conduct a comprehensive analysis to inspire future research aimed at improving performance on cross-lingual tasks.
zh
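The whitening step mentioned in the abstract is a standard post-processing recipe for reducing anisotropy in sentence embeddings, sketched below. The helper names and the optional `out_dim` truncation are illustrative choices rather than details taken from the system description.

```python
import numpy as np

def fit_whitening(embeddings, out_dim=None):
    """Compute the mean and whitening matrix W so that (x - mu) @ W has
    (approximately) zero mean and identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)
    u, s, _ = np.linalg.svd(cov)
    W = u @ np.diag(1.0 / np.sqrt(s))
    if out_dim is not None:
        W = W[:, :out_dim]          # keep only the leading directions (dimensionality reduction)
    return mu, W

def whiten(embeddings, mu, W):
    return (embeddings - mu) @ W

if __name__ == "__main__":
    x = np.random.default_rng(0).normal(size=(500, 64)) @ np.diag(np.linspace(0.1, 5.0, 64))
    mu, W = fit_whitening(x, out_dim=32)
    z = whiten(x, mu, W)
    print(z.shape, np.round(np.cov(z.T)[:2, :2], 2))    # near-identity covariance after whitening
```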
[NLP-60] Zero-shot Slot Filling in the Age of LLMs for Dialogue Systems COLING2025
【速读】: 该论文试图解决在对话数据中进行零样本槽填充(Zero-shot slot filling)的挑战,特别是在面对对话数据的动态性、话题突变、打断和隐含引用等复杂情况时,现有方法难以直接应用的问题。解决方案的关键在于提出了自动数据标注策略,结合槽位归纳(slot induction)和从教师大语言模型(LLM)到较小模型的黑箱知识蒸馏(black-box knowledge distillation, KD),从而显著提升了F1分数(绝对提升26%)。此外,论文还设计了一种高效的系统架构,适用于呼叫中心产品设置,相比现成的抽取模型,F1分数相对提升了34%,能够在保持低延迟的同时实现近乎实时的对话流推理,并提高准确性。
链接: https://arxiv.org/abs/2411.18980
作者: Mansi Rana,Kadri Hacioglu,Sindhuja Gopalan,Maragathamani Boothalingam
关键词-EN: Natural Language Understanding, subtask of Natural, Language Understanding, Natural Language, NLU
类目: Computation and Language (cs.CL)
备注: To appear in Proceedings of COLING 2025
点击查看摘要
Abstract:Zero-shot slot filling is a well-established subtask of Natural Language Understanding (NLU). However, most existing methods primarily focus on single-turn text data, overlooking the unique complexities of conversational dialogue. Conversational data is highly dynamic, often involving abrupt topic shifts, interruptions, and implicit references that make it difficult to directly apply zero-shot slot filling techniques, even with the remarkable capabilities of large language models (LLMs). This paper addresses these challenges by proposing strategies for automatic data annotation with slot induction and black-box knowledge distillation (KD) from a teacher LLM to a smaller model, outperforming vanilla LLMs on internal datasets by 26% absolute increase in F1 score. Additionally, we introduce an efficient system architecture for call center product settings that surpasses off-the-shelf extractive models by 34% relative F1 score, enabling near real-time inference on dialogue streams with higher accuracy, while preserving low latency.
zh
[NLP-61] Rephrasing Electronic Health Records for Pretraining Clinical Language Models
【速读】: 该论文试图解决临床语言模型预训练数据获取困难的问题,特别是由于患者隐私保护导致的电子健康记录(EHR)中的临床文本难以大规模获取。解决方案的关键在于利用大型语言模型(LLMs)对现有临床笔记进行改写,生成合成预训练语料库。通过这种方式,研究者能够创建不依赖真实临床文本的合成临床文本,用于预训练解码器和编码器语言模型。实验结果表明,这种方法在语言建模和下游任务中表现优于以往的合成方法,并且通过不同LLMs生成的合成语料库增强原始临床笔记,即使在有限的标记预算下也能提升性能,显示出该方法在机构层面进行预训练或扩展至大规模临床语料库合成的潜力。
链接: https://arxiv.org/abs/2411.18940
作者: Jinghui Liu,Anthony Nguyen
关键词-EN: applications in healthcare, development depends, depends on access, access to extensive, extensive clinical text
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Clinical language models are important for many applications in healthcare, but their development depends on access to extensive clinical text for pretraining. However, obtaining clinical notes from electronic health records (EHRs) at scale is challenging due to patient privacy concerns. In this study, we rephrase existing clinical notes using LLMs to generate synthetic pretraining corpora, drawing inspiration from previous work on rephrasing web data. We examine four popular small-sized LLMs (10B) to create synthetic clinical text to pretrain both decoder-based and encoder-based language models. The method yields better results in language modeling and downstream tasks than previous synthesis approaches without referencing real clinical text. We find that augmenting original clinical notes with synthetic corpora from different LLMs improves performances even at a small token budget, showing the potential of this method to support pretraining at the institutional level or be scaled to synthesize large-scale clinical corpora.
zh
[NLP-62] ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges
【速读】: 该论文试图解决现有大型多模态模型(LMMs)在视觉编程推理能力评估中的局限性问题。现有评估方法主要依赖于图像到代码的基准测试,这些测试将逻辑推理和多模态理解能力分开,无法全面评估模型的编程意图理解能力。论文提出的解决方案是引入ScratchEval,这是一个基于Scratch(一种广泛用于儿童编程教育的块状视觉编程语言)的新型基准测试。ScratchEval通过整合视觉元素和嵌入式编程逻辑,要求模型同时处理视觉信息和代码结构,从而全面评估其编程意图理解能力。这一方法超越了传统的图像到代码映射,强调统一的逻辑思维和问题解决能力,为评估LMMs的视觉编程能力提供了一个更全面和更具挑战性的框架。
链接: https://arxiv.org/abs/2411.18932
作者: Rao Fu,Ziyang Luo,Hongzhan Lin,Zhen Ye,Jing Ma
关键词-EN: Recent advancements, code generation capabilities, showcased impressive code, impressive code generation, generation capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in large multimodal models (LMMs) have showcased impressive code generation capabilities, primarily evaluated through image-to-code benchmarks. However, these benchmarks are limited to specific visual programming scenarios where the logic reasoning and the multimodal understanding capacities are split apart. To fill this gap, we propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children’s programming education. By integrating visual elements and embedded programming logic, ScratchEval requires the model to process both visual information and code structure, thereby comprehensively evaluating its programming intent understanding ability. Our evaluation approach goes beyond the traditional image-to-code mapping and focuses on unified logical thinking and problem-solving abilities, providing a more comprehensive and challenging framework for evaluating the visual programming ability of LMMs. ScratchEval not only fills the gap in existing evaluation methods, but also provides new insights for the future development of LMMs in the field of visual programming. Our benchmark can be accessed at this https URL .
zh
[NLP-63] The Impact of Example Selection in Few-Shot Prompting on Automated Essay Scoring Using GPT Models
【速读】: 该论文试图解决在自动作文评分 (Automated Essay Scoring, AES) 中,使用少量样本提示 (few-shot prompting) 时,示例选择对GPT模型性能的影响问题。解决方案的关键在于通过实验评估不同版本的GPT-3.5和GPT-4模型在不同示例选择和排序下的表现,并使用二次加权Kappa (Quadratic Weighted Kappa, QWK) 来衡量GPT评分与人工评分的一致性。研究结果表明,示例选择对GPT-3.5的影响大于GPT-4,且存在多数标签偏差 (majority label bias) 和最近示例偏差 (recency bias)。通过精心选择示例,GPT-3.5模型甚至可以超越某些GPT-4模型。此外,研究强调了每个模型版本(即使是微小版本)的独立性能评估的重要性。
链接: https://arxiv.org/abs/2411.18924
作者: Lui Yoshida
关键词-EN: au-tomated essay scoring, few-shot prompting, study investigates, GPT models, models
类目: Computation and Language (cs.CL)
备注: Accepted in AIED2024. This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in Communications in Computer and Information Science, vol 2150, and is available online at this https URL
点击查看摘要
Abstract:This study investigates the impact of example selection on the performance of automated essay scoring (AES) using few-shot prompting with GPT models. We evaluate the effects of the choice and order of examples in few-shot prompting on several versions of GPT-3.5 and GPT-4 models. Our experiments involve 119 prompts with different examples, and we calculate the quadratic weighted kappa (QWK) to measure the agreement between GPT and human rater scores. Regression analysis is used to quantitatively assess biases introduced by example selection. The results show that the impact of example selection on QWK varies across models, with GPT-3.5 being more influenced by examples than GPT-4. We also find evidence of majority label bias, which is a tendency to favor the majority label among the examples, and recency bias, which is a tendency to favor the label of the most recent example, in GPT-generated essay scores and QWK, with these biases being more pronounced in GPT-3.5. Notably, careful example selection enables GPT-3.5 models to outperform some GPT-4 models. However, among the GPT models, the June 2023 version of GPT-4, which is not the latest model, exhibits the highest stability and performance. Our findings provide insights into the importance of example selection in few-shot prompting for AES, especially in GPT-3.5 models, and highlight the need for individual performance evaluations of each model, even for minor versions.
zh
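Since quadratic weighted kappa (QWK) is the agreement metric at the center of this study, a compact reference implementation may help. This follows the standard definition of QWK; the rating range and toy score lists below are invented for illustration.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two integer score lists, penalizing large disagreements quadratically."""
    a = np.asarray(rater_a) - min_rating
    b = np.asarray(rater_b) - min_rating
    n = max_rating - min_rating + 1
    observed = np.zeros((n, n))
    for i, j in zip(a, b):
        observed[i, j] += 1                                   # confusion matrix of observed ratings
    weights = np.array([[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)])
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

if __name__ == "__main__":
    human = [1, 2, 3, 4, 4, 2, 3, 1]    # hypothetical human rater scores
    model = [1, 2, 3, 3, 4, 2, 4, 2]    # hypothetical GPT-assigned scores
    print(round(quadratic_weighted_kappa(human, model, 1, 4), 3))
```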
[NLP-64] EzSQL: An SQL intermediate representation for improving SQL-to-text Generation
【速读】: 该论文试图解决SQL到文本生成任务中,将SQL查询直接作为序列输入预训练生成式语言模型 (Generative Language Models) 的不足。解决方案的关键在于提出了一种新的SQL中间表示 (SQL intermediate representation) 称为EzSQL,通过简化SQL查询并使其更接近自然语言文本,从而优化了SQL到文本的生成过程。EzSQL通过修改操作符和关键词,并去除集合操作符,使得SQL查询更易于被生成式语言模型理解和处理。论文提出的模型使用EzSQL作为输入,结合预训练生成式语言模型,显著提升了在WikiSQL和Spider数据集上的文本生成效果,并展示了通过生成预训练数据来增强Text-to-SQL解析器性能的潜力。
链接: https://arxiv.org/abs/2411.18923
作者: Meher Bhardwaj,Hrishikesh Ethari,Dennis Singh Moirangthem
关键词-EN: generation task traditionally, template base, traditionally uses template, SQL, pre-trained generative language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review at Expert System With Applications Journal
点击查看摘要
Abstract:The SQL-to-text generation task traditionally uses template-based, Seq2Seq, tree-to-sequence, and graph-to-sequence models. Recent models take advantage of pre-trained generative language models for this task in the Seq2Seq framework. However, treating SQL as a sequence of inputs to the pre-trained models is not optimal. In this work, we put forward a new SQL intermediate representation called EzSQL to align SQL with the natural language text sequence. EzSQL simplifies the SQL queries and brings them closer to natural language text by modifying operators and keywords, which can usually be described in natural language. EzSQL also removes the need for set operators. Our proposed SQL-to-text generation model uses EzSQL as the input to a pre-trained generative language model for generating the text descriptions. We demonstrate that our model is an effective state-of-the-art method to generate text narrations from SQL queries on the WikiSQL and Spider datasets. We also show that by generating pretraining data using our SQL-to-text generation model, we can enhance the performance of Text-to-SQL parsers.
zh
[NLP-65] Devising a Set of Compact and Explainable Spoken Language Feature for Screening Alzheimer's Disease
【速读】: 该论文试图解决阿尔茨海默病 (Alzheimer’s disease, AD) 的早期检测问题,特别是在老龄化社会中,利用口语语言进行大规模检测的需求。解决方案的关键在于设计了一种可解释且有效的特征集,该特征集结合了大型语言模型 (Large Language Model, LLM) 的视觉能力和词频-逆文档频率 (Term Frequency-Inverse Document Frequency, TF-IDF) 模型。通过基于“Cookie Theft”图片描述任务的实验,论文展示了新提出的特征在两个不同分类器中均优于传统语言特征,并且具有高维度效率。这些新特征不仅提高了自动AD筛查的准确性,还增强了其可解释性,使得每一步的解释和解读都更加清晰。
链接: https://arxiv.org/abs/2411.18922
作者: Junan Li,Yunxiang Li,Yuren Wang,Xixin Wu,Helen Meng
关键词-EN: significant health challenges, Alzheimer disease, aging society, significant health, health challenges
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published at ISCSLP 2024
点击查看摘要
Abstract:Alzheimer’s disease (AD) has become one of the most significant health challenges in an aging society. The use of spoken language-based AD detection methods has gained prevalence due to their scalability. Based on the Cookie Theft picture description task, we devised an explainable and effective feature set that leverages the visual capabilities of a large language model (LLM) and the Term Frequency-Inverse Document Frequency (TF-IDF) model. Our experimental results show that the newly proposed features consistently outperform traditional linguistic features across two different classifiers with high dimension efficiency. Our new features can be well explained and interpreted step by step, which enhances the interpretability of automatic AD screening.
zh
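To make the TF-IDF half of the proposed feature set concrete, the sketch below turns picture-description transcripts into TF-IDF vectors and fits a simple classifier with scikit-learn. The toy transcripts and labels are invented, and the LLM-derived visual features described in the abstract are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy picture-description transcripts (invented); label 1 = AD, 0 = control.
transcripts = [
    "the boy is taking the cookie jar and the stool is falling",
    "mother is drying dishes while the sink overflows onto the floor",
    "there is a a boy and uh the thing the thing is falling",
    "the the water is uh going and the lady is there",
]
labels = [0, 0, 1, 1]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),   # unigram and bigram TF-IDF features
    LogisticRegression(max_iter=1000),
)
model.fit(transcripts, labels)
print(model.predict(["the boy the boy is uh taking the the cookie"]))
```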
[NLP-66] MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications
【速读】: 该论文试图解决在处理表格数据问题时,依赖于闭源或大型模型、外部数据或大量提示工程的问题。解决方案的关键是引入了一种名为MATATA的新方法,通过推理、规划和工具使用来训练小型语言模型(Small Language Models, SLMs),特别是适用于本地托管和数据隐私至关重要的敏感业务环境。MATATA采用渐进式自我改进范式和迭代弱监督,使3.8B/8B的SLM能够在不同数据集上灵活且可重用地使用工具,从而实现强大的性能和有效的任务共享扩展。实验表明,MATATA在基于开源模型的推理框架中达到了FinQA和TAT-QA的最新性能,并且在TabMWP上与基于GPT-4的框架竞争。
链接: https://arxiv.org/abs/2411.18915
作者: Vishnou Vinayagame,Gregory Senay,Luis Martí
关键词-EN: Mathematical reasoning capabilities, extensive prompt engineering, tool-augmented language agents, Mathematical reasoning, prompt engineering
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Mathematical reasoning capabilities are increasing with tool-augmented language agents, but methods often rely on closed-source or large models, external data, or extensive prompt engineering. This work introduces MATATA, a novel cost-effective method to train LLM agents for tabular data problems through reasoning, planning, and tool use. With a progressive self-improvement paradigm and iterative weak supervision, it empowers 3.8B/8B Small Language Models (SLMs), particularly suited for local hosting and sensitive business contexts where data privacy is crucial. By employing flexible and reusable tools across different datasets, it achieves robust performance with effective scalability across shared tasks. Experiments show that MATATA reaches state-of-the-art performance on FinQA and TAT-QA among reasoning frameworks based on open-source models. Moreover, MATATA models compete with GPT-4 based frameworks on TabMWP, while being SLMs.
zh
[NLP-67] Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
【速读】: 该论文试图解决稀疏自编码器 (Sparse Autoencoders, SAEs) 在缺乏高质量性能评估指标方面的瓶颈问题。解决方案的关键在于引入基于 SHIFT 任务的自动化评估方法,通过使用大型语言模型 (LLM) 替代人工标注者来判断任务无关特征,从而实现对 SAE 质量的自动评估。此外,论文还提出了目标探针扰动 (Targeted Probe Perturbation, TPP) 指标,用于量化 SAE 在解耦相似概念方面的能力,从而扩展 SHIFT 的应用范围至更多数据集。通过在多个开源模型上应用 SHIFT 和 TPP 指标,论文验证了这些评估方法能够有效区分不同 SAE 训练超参数和架构的性能差异。
链接: https://arxiv.org/abs/2411.18895
作者: Adam Karvonen,Can Rager,Samuel Marks,Neel Nanda
关键词-EN: interpretability technique aimed, decomposing neural network, neural network activations, Sparse Autoencoders, Sparse Feature Circuits
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. However, a major bottleneck for SAE development has been the lack of high-quality performance metrics, with prior work largely relying on unsupervised proxies. In this work, we introduce a family of evaluations based on SHIFT, a downstream task from Marks et al. (Sparse Feature Circuits, 2024) in which spurious cues are removed from a classifier by ablating SAE features judged to be task-irrelevant by a human annotator. We adapt SHIFT into an automated metric of SAE quality; this involves replacing the human annotator with an LLM. Additionally, we introduce the Targeted Probe Perturbation (TPP) metric that quantifies an SAE’s ability to disentangle similar concepts, effectively scaling SHIFT to a wider range of datasets. We apply both SHIFT and TPP to multiple open-source models, demonstrating that these metrics effectively differentiate between various SAE training hyperparameters and architectures.
zh
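The SHIFT and TPP evaluations above both rely on ablating individual SAE features. For readers unfamiliar with the setup, here is a generic sparse autoencoder sketch showing how selected latent features can be zeroed out before reconstruction; the single ReLU encoder/decoder form is a common baseline, not necessarily the architectures evaluated in the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A minimal SAE: activations -> overcomplete sparse features -> reconstruction."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor, ablate_ids=None):
        feats = torch.relu(self.encoder(x))       # sparse, (ideally) interpretable feature activations
        if ablate_ids is not None:
            feats = feats.clone()
            feats[..., ablate_ids] = 0.0           # ablate selected features, as in SHIFT/TPP-style probes
        return self.decoder(feats), feats

if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=64, d_dict=512)
    acts = torch.randn(8, 64)                      # stand-in for residual-stream activations
    recon_full, _ = sae(acts)
    recon_ablated, _ = sae(acts, ablate_ids=[3, 17, 200])
    print((recon_full - recon_ablated).abs().mean().item())
```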
[NLP-68] ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for Arabic Words
【速读】: 该论文试图解决阿拉伯语脑电图(EEG)数据集稀缺的问题,特别是在脑机接口(BCI)研究领域中缺乏公开可用的阿拉伯语EEG数据集。解决方案的关键在于引入了ArEEG_Words数据集,这是一个新颖的EEG数据集,由22名参与者使用14通道的Emotiv Epoc X设备记录,参与者在想象16个常用阿拉伯语单词时进行记录。该数据集包含352个EEG记录,每个记录被分割成多个250毫秒的信号,总计15,360个EEG信号。这是首个公开可用的阿拉伯语EEG数据集,旨在填补阿拉伯语EEG研究领域的数据空白。
链接: https://arxiv.org/abs/2411.18888
作者: Hazem Darwish,Abdalrahman Al Malah,Khloud Al Jallad,Nada Ghneim
关键词-EN: support communication-impaired patients, BCI involves Electroencephalography, translating neural signals, BCI EEG research, EEG
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: arXiv admin note: substantial text overlap with arXiv:2402.15733
点击查看摘要
Abstract:Brain-Computer-Interface (BCI) aims to support communication-impaired patients by translating neural signals into speech. A notable research topic in BCI involves Electroencephalography (EEG) signals that measure the electrical activity in the brain. While significant advancements have been made in BCI EEG research, a major limitation still exists: the scarcity of publicly available EEG datasets for non-English languages, such as Arabic. To address this gap, we introduce in this paper the ArEEG_Words dataset, a novel EEG dataset recorded from 22 participants with a mean age of 22 years (5 female, 17 male) using a 14-channel Emotiv Epoc X device. The participants were asked to be free from anything affecting their nervous system, such as coffee, alcohol and cigarettes, for 8 hours before recording. They were asked to stay calm in a calm room while imagining one of the 16 Arabic words for 10 seconds. The words include 16 commonly used words such as up, down, left, and right. A total of 352 EEG recordings were collected, then each recording was divided into multiple 250ms signals, resulting in a total of 15,360 EEG signals. To the best of our knowledge, ArEEG_Words is the first dataset of its kind in the Arabic EEG domain. Moreover, it is publicly available for researchers, as we hope it will fill the gap in Arabic EEG research.
zh
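The dataset description above splits each recording into 250 ms segments. One straightforward way to do that is sketched below; the 128 Hz sampling rate and the (channels, samples) array layout are assumptions for the example and may not match the released files.

```python
import numpy as np

def segment_recording(recording: np.ndarray, fs: int = 128, window_ms: int = 250) -> np.ndarray:
    """Split a (channels, samples) EEG recording into non-overlapping windows.

    Returns an array of shape (num_windows, channels, samples_per_window).
    """
    samples_per_window = int(fs * window_ms / 1000)
    num_windows = recording.shape[1] // samples_per_window
    trimmed = recording[:, : num_windows * samples_per_window]
    return trimmed.reshape(recording.shape[0], num_windows, samples_per_window).swapaxes(0, 1)

if __name__ == "__main__":
    ten_seconds = np.random.default_rng(0).normal(size=(14, 128 * 10))  # 14 channels, 10 s at 128 Hz
    windows = segment_recording(ten_seconds)
    print(windows.shape)   # (40, 14, 32): forty 250 ms segments per recording
```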
[NLP-69] Sneaking Syntax into Transformer Language Models with Tree Regularization
【速读】: 该论文试图解决的问题是如何在不显著增加模型复杂度或限制其表达能力的情况下,为基于Transformer的语言模型(LMs)引入语法归纳偏置(syntactic inductive biases),以提高其在数据效率和语法泛化能力方面的表现。解决方案的关键在于引入了一种名为TREEREG的辅助损失函数,该函数通过将银解析(silver parses)中的括号决策转换为向量隐藏状态上的可微正交约束,从而“软性”地注入语法结构信息。TREEREG能够无缝集成到标准的语言模型目标函数中,无需对模型架构进行任何修改,同时显著提升了模型在分布外数据上的困惑度(perplexity)和语法泛化能力。
链接: https://arxiv.org/abs/2411.18885
作者: Ananjan Nandi,Christopher D. Manning,Shikhar Murty
关键词-EN: hierarchical tree-like process, direct inductive bias, human language understanding, tree-like process, compositional accounts
类目: Computation and Language (cs.CL)
备注: 17 pages, 16 figures, 8 tables
点击查看摘要
Abstract:While compositional accounts of human language understanding are based on a hierarchical tree-like process, neural models like transformers lack a direct inductive bias for such tree structures. Introducing syntactic inductive biases could unlock more robust and data-efficient learning in transformer language models (LMs), but existing methods for incorporating such structure greatly restrict models, either limiting their expressivity or increasing inference complexity. This work instead aims to softly inject syntactic inductive biases into given transformer circuits, through a structured regularizer. We introduce TREEREG, an auxiliary loss function that converts bracketing decisions from silver parses into a set of differentiable orthogonality constraints on vector hidden states. TREEREG integrates seamlessly with the standard LM objective, requiring no architectural changes. LMs pre-trained with TreeReg on natural language corpora such as WikiText-103 achieve up to 10% lower perplexities on out-of-distribution data and up to 9.5 point improvements in syntactic generalization, requiring less than half the training data to outperform standard LMs. TreeReg still provides gains for pre-trained LLMs: Continued pre-training of Sheared Llama with TreeReg results in improved syntactic generalization, and fine-tuning on MultiNLI with TreeReg mitigates degradation of performance on adversarial NLI benchmarks by 41.2 points.
zh
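TREEREG is described as a differentiable orthogonality constraint derived from silver bracketings, but the abstract does not spell out the loss. The sketch below is one simple stand-in: penalize the squared cosine similarity between the mean hidden state inside each silver constituent and the mean hidden state outside it. Treat it only as an illustration of the general idea, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def tree_orthogonality_penalty(hidden: torch.Tensor, spans) -> torch.Tensor:
    """hidden: (seq_len, d) hidden states; spans: silver constituents as (start, end) pairs.

    Encourages the representation of each constituent to be (near-)orthogonal to the
    representation of the rest of the sentence; added to the LM loss as an auxiliary term.
    """
    penalty = hidden.new_zeros(())
    for start, end in spans:
        inside = hidden[start:end].mean(dim=0)
        mask = torch.ones(hidden.size(0), dtype=torch.bool, device=hidden.device)
        mask[start:end] = False
        if mask.any():
            outside = hidden[mask].mean(dim=0)
            penalty = penalty + F.cosine_similarity(inside, outside, dim=0).pow(2)
    return penalty / max(len(spans), 1)

if __name__ == "__main__":
    h = torch.randn(12, 64, requires_grad=True)
    loss = tree_orthogonality_penalty(h, [(0, 3), (3, 7), (7, 12)])
    loss.backward()        # differentiable, so it can be combined with the usual LM objective
    print(loss.item())
```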
[NLP-70] Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark
【速读】: 该论文试图解决的问题是如何评估生物医学文献的方法学强度,以确保基于这些文献的问答系统能够得出可靠的结论。解决方案的关键在于提出了一个基于风险-of-bias框架的基准(benchmark),用于测量生物医学论文的方法学强度。该基准包括四个任务,涵盖研究方法分析和偏倚风险评估,并包含2000个专家生成的偏倚注释和一个经过人工验证的细粒度对齐研究论文内容的流程。通过这一基准,论文评估了多种大型语言模型(large language models)的性能,发现这些模型在专家级表现上存在显著差距。该基准的提出有助于指导大规模科学数据聚合系统,确保其对研究质量的判断更加标准化和准确。
链接: https://arxiv.org/abs/2411.18831
作者: Jianyou Wang,Weili Cao,Longtian Bao,Youze Zheng,Gil Pasternak,Kaicheng Wang,Xiaoyue Wang,Ramamohan Paturi,Leon Bergen
关键词-EN: increasingly feasible, answer questions, questions by reviewing, Systems that answer, Systems
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Systems that answer questions by reviewing the scientific literature are becoming increasingly feasible. To draw reliable conclusions, these systems should take into account the quality of available evidence, placing more weight on studies that use a valid methodology. We present a benchmark for measuring the methodological strength of biomedical papers, drawing on the risk-of-bias framework used for systematic reviews. The four benchmark tasks, drawn from more than 500 papers, cover the analysis of research study methodology, followed by evaluation of risk of bias in these studies. The benchmark contains 2000 expert-generated bias annotations, and a human-validated pipeline for fine-grained alignment with research paper content. We evaluate a range of large language models on the benchmark, and find that these models fall significantly short of expert-level performance. By providing a standardized tool for measuring judgments of study quality, the benchmark can help to guide systems that perform large-scale aggregation of scientific data. The dataset is available at this https URL.
zh
[NLP-71] NewsEdits 2.0: Learning the Intentions Behind Updating News
【速读】: 该论文试图解决新闻文章中事实更新导致的过时信息传播问题。解决方案的关键在于利用语言特征(linguistic features)预测新闻文章中哪些事实可能会更新,而无需依赖外部资源如搜索引擎。通过引入NewsEdits 2.0分类法,将事实更新与风格和叙事更新区分开来,并训练高分的集成模型来应用这一分类法,从而实现对旧文章草稿中事实更新情况的高精度预测。最终,通过构建语言模型问答(LLM-QA)任务,展示如何让语言模型在信息可能过时时选择不回答问题,以达到接近理想水平的准确性。
链接: https://arxiv.org/abs/2411.18811
作者: Alexander Spangher,Kung-Hsiang Huang,Hyundong Cho,Jonathan May
关键词-EN: risk propagating outdated, events progress, risk propagating, propagating outdated facts, update
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: 9 pages main body, 11 pages appendix
点击查看摘要
Abstract:As events progress, news articles often update with new information: if we are not cautious, we risk propagating outdated facts. In this work, we hypothesize that linguistic features indicate factual fluidity, and that we can predict which facts in a news article will update using solely the text of a news article (i.e. not external resources like search engines). We test this hypothesis, first, by isolating fact-updates in large news revisions corpora. News articles may update for many reasons (e.g. factual, stylistic, narrative). We introduce the NewsEdits 2.0 taxonomy, an edit-intentions schema that separates fact updates from stylistic and narrative updates in news writing. We annotate over 9,200 pairs of sentence revisions and train high-scoring ensemble models to apply this schema. Then, taking a large dataset of silver-labeled pairs, we show that we can predict when facts will update in older article drafts with high precision. Finally, to demonstrate the usefulness of these findings, we construct a language model question asking (LLM-QA) abstention task. We wish the LLM to abstain from answering questions when information is likely to become outdated. Using our predictions, we show that LLM abstention reaches near-oracle levels of accuracy.
zh
[NLP-72] Reconstructing Animals and the Wild
【速读】: 该论文试图解决从单张图像中重建包含树木、灌木、岩石和动物等自然场景的3D结构问题。解决方案的关键在于利用大型语言模型(Large Language Models)中嵌入的强世界先验知识,并训练一个自回归模型(autoregressive model)将CLIP嵌入解码为结构化的组合场景表示(RAW)。通过构建一个包含百万张图像和数千个资产的合成数据集,该方法在仅使用合成数据训练的情况下,能够泛化到真实世界图像中动物及其环境的重建任务。
链接: https://arxiv.org/abs/2411.18807
作者: Peter Kulits,Michael J. Black,Silvia Zuffi
关键词-EN: computer vision, understanding is foundational, foundational in computer, scene understanding, animals
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 12 pages; project page: this https URL
点击查看摘要
Abstract:The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has been focused on the anthropocentric, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here, we consider a more challenging problem where such assumptions do not hold: the reconstruction of natural scenes containing trees, bushes, boulders, and animals. While numerous works have attempted to tackle the problem of reconstructing animals in the wild, they have focused solely on the animal, neglecting environmental context. This limits their usefulness for analysis tasks, as animals exist inherently within the 3D world, and information is lost when environmental factors are disregarded. We propose a method to reconstruct natural scenes from single images. We base our approach on recent advances leveraging the strong world priors ingrained in Large Language Models and train an autoregressive model to decode a CLIP embedding into a structured compositional scene representation, encompassing both animals and the wild (RAW). To enable this, we propose a synthetic dataset comprising one million images and thousands of assets. Our approach, having been trained solely on synthetic data, generalizes to the task of reconstructing animals and their environments in real-world images. We will release our dataset and code to encourage future research at this https URL
zh
[NLP-73] UOE: Unlearning One Expert Is Enough For Mixture-of-experts LLMs
【速读】: 该论文试图解决在大规模语言模型(LLM)中,特别是稀疏混合专家模型(Mixture-of-Experts, MoE)中,如何有效且高效地进行数据遗忘(unlearning)的问题。解决方案的关键在于提出了一种新的单专家遗忘框架,称为UOE(Unlearning on a single Expert)。该框架通过专家归属(expert attribution)将遗忘集中于与特定知识最相关的专家,同时应用锚定损失(anchor loss)来稳定路由器对目标专家的选择,从而实现精确控制遗忘过程并保持模型效用。UOE框架不仅提高了遗忘质量(最高达5%),还提升了模型效用(最高达35%),并且仅涉及极少量的模型参数(0.06%)。
链接: https://arxiv.org/abs/2411.18797
作者: Haomin Zhuang,Yihua Zhang,Kehan Guo,Jinghan Jia,Gaowen Liu,Sijia Liu,Xiangliang Zhang
关键词-EN: shown remarkable success, removing unwanted data-model, unwanted data-model influences, Recent advancements, MoE LLMs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advancements in large language model (LLM) unlearning have shown remarkable success in removing unwanted data-model influences while preserving the model’s utility for legitimate knowledge. However, despite these strides, sparse Mixture-of-Experts (MoE) LLMs–a key subset of the LLM family–have received little attention and remain largely unexplored in the context of unlearning. As MoE LLMs are celebrated for their exceptional performance and highly efficient inference processes, we ask: How can unlearning be performed effectively and efficiently on MoE LLMs? And will traditional unlearning methods be applicable to MoE architectures? Our pilot study shows that the dynamic routing nature of MoE LLMs introduces unique challenges, leading to substantial utility drops when existing unlearning methods are applied. Specifically, unlearning disrupts the router’s expert selection, causing significant selection shift from the most unlearning target-related experts to irrelevant ones. As a result, more experts than necessary are affected, leading to excessive forgetting and loss of control over which knowledge is erased. To address this, we propose a novel single-expert unlearning framework, referred to as UOE, for MoE LLMs. Through expert attribution, unlearning is concentrated on the most actively engaged expert for the specified knowledge. Concurrently, an anchor loss is applied to the router to stabilize the active state of this targeted expert, ensuring focused and controlled unlearning that preserves model utility. The proposed UOE framework is also compatible with various unlearning algorithms. Extensive experiments demonstrate that UOE enhances both forget quality (by up to 5%) and model utility (by 35%) on MoE LLMs across various benchmarks and LLM architectures, while unlearning only 0.06% of the model parameters.
zh
[NLP-74] Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models
【速读】: 该论文试图解决从非结构化文本中提取网络攻击技术信息的问题,特别是在网络安全情报 (CTI) 报告中。解决方案的关键在于提出了一种利用辅助数据增强训练数据的方法,以提高低资源环境下网络攻击分类任务的性能。具体来说,系统首先使用增强后的训练数据训练模型,然后再使用主要数据进行进一步训练。实验结果表明,该方法在TRAM数据集上显著提升了Macro-F1分数,同时保持了Micro-F1分数的竞争力。
链接: https://arxiv.org/abs/2411.18755
作者: Weiqiu You,Youngja Park
关键词-EN: mitigation measures, crucial for comprehending, comprehending the attacker, attacker behaviors, behaviors and implementing
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Understanding the attack patterns associated with a cyberattack is crucial for comprehending the attacker’s behaviors and implementing the right mitigation measures. However, the majority of the information regarding new attacks is typically presented in unstructured text, posing significant challenges for security analysts in collecting necessary information. In this paper, we present a sentence classification system that can identify the attack techniques described in natural language sentences from cyber threat intelligence (CTI) reports. We propose a new method for utilizing auxiliary data with the same labels to improve classification for the low-resource cyberattack classification task. The system first trains the model using the augmented training data and then trains further using only the primary data. We validate our model using the TRAM data and the MITRE ATT&CK framework. Experiments show that our method enhances Macro-F1 by 5 to 9 percentage points and keeps Micro-F1 scores competitive when compared to the baseline performance on the TRAM dataset.
zh
[NLP-75] Multi-Task Model Merging via Adaptive Weight Disentanglement
【速读】: 该论文试图解决在多任务学习中,任务向量(task vectors)之间的干扰问题,这种干扰会降低合并模型的性能。解决方案的关键在于提出了任务一致性属性(Task Consistency Property),并通过理论推导表明,通过寻找正交的任务向量可以近似实现这一属性。基于此,论文提出了自适应权重解耦(Adaptive Weight Disentanglement, AWD)方法,该方法将传统的任务向量分解为冗余向量和多个解耦的任务向量,主要优化目标是实现这些解耦任务向量之间的正交性,从而接近理想的解决方案。这些解耦的任务向量可以无缝集成到现有的合并方法中,实验结果表明,AWD显著且一致地改进了之前的合并方法,达到了最先进的性能。
链接: https://arxiv.org/abs/2411.18729
作者: Feng Xiong,Runxi Cheng,Wang Chen,Zhanqiu Zhang,Yiwen Guo,Chun Yuan,Ruifeng Xu
关键词-EN: gained increasing attention, task vectors, unified multi-task model, integrating task-specific weights, Task
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Model merging has gained increasing attention as an efficient and effective technique for integrating task-specific weights from various tasks into a unified multi-task model without retraining or additional data. As a representative approach, Task Arithmetic (TA) has demonstrated that combining task vectors through arithmetic operations facilitates efficient capability transfer between different tasks. In this framework, task vectors are obtained by subtracting the parameter values of a pre-trained model from those of individually fine-tuned models initialized from it. Despite the notable effectiveness of TA, interference among task vectors can adversely affect the performance of the merged model. In this paper, we relax the constraints of the Task Arithmetic Property and propose the Task Consistency Property, which can be regarded as being free from task interference. Through theoretical derivation, we show that such a property can be approximately achieved by seeking orthogonal task vectors. Guided by this insight, we propose Adaptive Weight Disentanglement (AWD), which decomposes traditional task vectors into a redundant vector and several disentangled task vectors. The primary optimization objective of AWD is to achieve orthogonality among the disentangled task vectors, thereby closely approximating the desired solution. Notably, these disentangled task vectors can be seamlessly integrated into existing merging methodologies. Experimental results demonstrate that our AWD consistently and significantly improves upon previous merging approaches, achieving state-of-the-art results. Our code is available at this https URL.
zh
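Task Arithmetic and the orthogonality goal discussed above are easy to make concrete. The sketch below builds task vectors as parameter deltas, checks how far two task vectors are from orthogonal, and performs a simple merge; the scaling factor and the toy network are placeholders, and the AWD decomposition itself (optimizing for disentangled, orthogonal components) is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def task_vector(finetuned: nn.Module, pretrained: nn.Module) -> dict:
    """Parameter-wise difference between a fine-tuned model and its pre-trained initialization."""
    p0 = dict(pretrained.named_parameters())
    return {name: p.detach() - p0[name].detach() for name, p in finetuned.named_parameters()}

def cosine_between(tv_a: dict, tv_b: dict) -> float:
    a = torch.cat([v.flatten() for v in tv_a.values()])
    b = torch.cat([v.flatten() for v in tv_b.values()])
    return float(F.cosine_similarity(a, b, dim=0))        # 0.0 would mean perfectly orthogonal

def merge(pretrained: nn.Module, task_vectors: list, lam: float = 0.3) -> dict:
    """Task-Arithmetic-style merge: add scaled task vectors onto the pre-trained weights."""
    merged = {n: p.detach().clone() for n, p in pretrained.named_parameters()}
    for tv in task_vectors:
        for name in merged:
            merged[name] += lam * tv[name]
    return merged                                          # load via model.load_state_dict(merged, strict=False)

if __name__ == "__main__":
    torch.manual_seed(0)
    make = lambda: nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
    base, ft1, ft2 = make(), make(), make()
    ft1.load_state_dict(base.state_dict()); ft2.load_state_dict(base.state_dict())
    with torch.no_grad():                                  # pretend fine-tuning: small random updates
        for p in ft1.parameters(): p.add_(0.01 * torch.randn_like(p))
        for p in ft2.parameters(): p.add_(0.01 * torch.randn_like(p))
    tv1, tv2 = task_vector(ft1, base), task_vector(ft2, base)
    print("cosine(tv1, tv2) =", round(cosine_between(tv1, tv2), 4))
    merged = merge(base, [tv1, tv2])
    print("merged", len(merged), "parameter tensors")
```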
[NLP-76] Evaluating Vision-Language Models as Evaluators in Path Planning
【速读】: 该论文试图解决的问题是:尽管大型语言模型(LLMs)在复杂推理方面具有潜力,但在端到端规划任务中表现有限。论文探讨了这些模型是否可以作为规划框架中的有用计划评估者,特别是当这些模型与视觉理解能力结合时,即视觉语言模型(VLMs)。解决方案的关键在于引入了一个名为PathEval的新基准,用于评估VLMs在复杂路径规划场景中作为计划评估者的能力。成功通过该基准要求VLM能够从场景描述中抽象出最优路径的特征,展示对每条路径的精确低级感知,并将这些信息整合以决定更好的路径。论文分析发现,当前最先进的VLMs在基准测试中面临显著挑战,尤其是视觉组件在感知路径低级细节方面存在瓶颈。实验结果表明,简单的端到端微调无法解决这一问题,需要对这些视觉编码器进行任务特定的判别性适应,以使VLMs成为有效的路径评估者。
链接: https://arxiv.org/abs/2411.18711
作者: Mohamed Aghzal,Xiang Yue,Erion Plaku,Ziyu Yao
关键词-EN: large language models, perform complex reasoning, large language, promise to perform, limited effectiveness
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. This has inspired an intriguing question: if these models cannot plan well, can they still contribute to the planning framework as a helpful plan evaluator? In this work, we generalize this question to consider LLMs augmented with visual understanding, i.e., Vision-Language Models (VLMs). We introduce PathEval, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Succeeding in the benchmark requires a VLM to be able to abstract traits of optimal paths from the scenario description, demonstrate precise low-level perception on each path, and integrate this information to decide the better path. Our analysis of state-of-the-art VLMs reveals that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and exhibit mixed performance in integrating the provided information. Yet, their vision component presents a critical bottleneck, with models struggling to perceive low-level details about a path. Our experimental results show that this issue cannot be trivially addressed via end-to-end fine-tuning; rather, task-specific discriminative adaptation of these vision encoders is needed for these VLMs to become effective path evaluators.
zh
[NLP-77] On the Effectiveness of Incremental Training of Large Language Models
【速读】: 该论文试图解决训练大型语言模型(LLMs)时计算资源消耗巨大的问题,提出了一种增量层级训练策略,即逐步引入层级以优化训练过程。解决方案的关键在于将训练过程分为多个阶段,逐步增加层级,期望通过这种方式加速收敛并更高效地利用计算资源。然而,实验结果表明,尽管增量训练在初期显示出一定的计算效率,但最终达到与传统全规模训练相当的性能时,其总体计算成本更高。虽然增量训练最终能缩小与基准的性能差距,但这需要显著延长的持续训练时间。因此,论文得出结论,增量层级训练可能不是训练大型语言模型的可行替代方案,并指出了该方法的局限性和效率问题。
链接: https://arxiv.org/abs/2411.18700
作者: Miles Q. Li,Benjamin C. M. Fung,Shih-Chia Huang
关键词-EN: computationally intensive process, requires substantial resources, Training, training process, computationally intensive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Training large language models is a computationally intensive process that often requires substantial resources to achieve state-of-the-art results. Incremental layer-wise training has been proposed as a potential strategy to optimize the training process by progressively introducing layers, with the expectation that this approach would lead to faster convergence and more efficient use of computational resources. In this paper, we investigate the effectiveness of incremental training for LLMs, dividing the training process into multiple stages where layers are added progressively. Our experimental results indicate that while the incremental approach initially demonstrates some computational efficiency, it ultimately requires greater overall computational costs to reach comparable performance to traditional full-scale training. Although the incremental training process can eventually close the performance gap with the baseline, it does so only after significantly extended continual training. These findings suggest that incremental layer-wise training may not be a viable alternative for training large language models, highlighting its limitations and providing valuable insights into the inefficiencies of this approach.
zh
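As a concrete picture of the incremental layer-wise schedule being evaluated, here is a toy sketch of a model that appends one Transformer block per stage and rebuilds its optimizer after each growth step. Dimensions, stage counts, and the synthetic batches are placeholders, and no causal mask is used, so this is purely illustrative rather than the paper's training setup.

```python
import torch
import torch.nn as nn

class GrowingLM(nn.Module):
    """A toy language model whose depth can be increased between training stages."""
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList()
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.d_model, self.n_heads = d_model, n_heads

    def grow(self):
        """Append one more Transformer block (the incremental step)."""
        self.blocks.append(nn.TransformerEncoderLayer(
            self.d_model, self.n_heads, dim_feedforward=4 * self.d_model, batch_first=True))

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for block in self.blocks:        # note: no causal mask here, purely for illustration
            x = block(x)
        return self.lm_head(x)

if __name__ == "__main__":
    torch.manual_seed(0)
    model, loss_fn = GrowingLM(), nn.CrossEntropyLoss()
    data = torch.randint(0, 1000, (16, 32))                    # toy token batches
    for stage in range(3):                                      # grow from 1 to 3 blocks
        model.grow()
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)    # rebuilt so new layers are optimized too
        for _ in range(5):                                      # a few steps per stage
            logits = model(data[:, :-1])
            loss = loss_fn(logits.reshape(-1, 1000), data[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()
        print(f"stage {stage}: {len(model.blocks)} blocks, loss {loss.item():.3f}")
```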
[NLP-78] An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)
【速读】: 该论文试图解决的问题是如何绕过文本生成模型(如DALL-E 3)的伦理保护机制,使其生成有害内容。解决方案的关键在于引入了一种名为“单轮渐进攻击”(Single-Turn Crescendo Attack, STCA)的创新方法,通过在单一提示中策略性地逐步增加上下文,并结合信任建立机制,巧妙地诱导模型产生非预期的输出。该方法在文本到图像模型中的应用表明,其能够有效绕过广泛使用的模型的防护机制,生成的输出与未受审查的模型Flux Schnell相当。这一研究为评估和增强文本到图像模型防护机制的鲁棒性提供了框架。
链接: https://arxiv.org/abs/2411.18699
作者: Ted Kwartler,Nataliia Bagan,Ivan Banny,Alan Aqrawi,Arian Abbasi
关键词-EN: Single-Turn Crescendo Attack, generate harmful content, Aqrawi and Abbasi, innovative method designed, Single-Turn Crescendo
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The Single-Turn Crescendo Attack (STCA), first introduced in Aqrawi and Abbasi [2024], is an innovative method designed to bypass the ethical safeguards of text-to-text AI models, compelling them to generate harmful content. This technique leverages a strategic escalation of context within a single prompt, combined with trust-building mechanisms, to subtly deceive the model into producing unintended outputs. Extending the application of STCA to text-to-image models, we demonstrate its efficacy by compromising the guardrails of a widely-used model, DALL-E 3, achieving outputs comparable to outputs from the uncensored model Flux Schnell, which served as a baseline control. This study provides a framework for researchers to rigorously evaluate the robustness of guardrails in text-to-image models and benchmark their resilience against adversarial attacks.
zh
[NLP-79] Verbalized Representation Learning for Interpretable Few-Shot Generalization
【速读】: 该论文试图解决在低数据环境下对象识别模型的泛化能力问题。解决方案的关键在于提出了一种名为“语言化表示学习 (Verbalized Representation Learning, VRL)”的新方法,通过使用视觉-语言模型 (Vision-Language Model, VLM) 自动提取人类可解释的特征。VRL 通过捕捉不同类别之间的差异和同一类别内的共性,以自然语言的形式表达这些特征,并将这些语言化特征映射为数值向量,从而显著提升模型在少样本数据情况下的泛化能力。实验结果表明,VRL 在相同模型规模下,相比现有最先进方法,实现了24%的绝对性能提升,同时使用数据量减少了95%,模型规模也更小。
链接: https://arxiv.org/abs/2411.18651
作者: Cheng-Fu Yang,Da Yin,Wenbo Hu,Nanyun Peng,Bolei Zhou,Kai-Wei Chang
关键词-EN: Humans recognize objects, remarkable capability enabled, Humans recognize, inherent language understanding, real-world environment
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: this https URL.
zh
[NLP-80] Towards Advanced Speech Signal Processing: A Statistical Perspective on Convolution-Based Architectures and its Applications
【速读】: 该论文旨在探讨基于卷积的模型(如卷积神经网络 (CNNs)、Conformers、残差网络 (ResNets) 和卷积循环神经网络 (CRNNs))在语音信号处理中的应用,并提供这些模型的统计背景及其在语音识别、说话人识别、情感识别和语音增强等领域的应用。解决方案的关键在于通过比较训练成本、模型大小、准确性和速度等指标,评估各模型的优缺点,识别潜在错误,并提出进一步研究的方向,从而强调这些模型在推动语音技术应用中的核心作用。
链接: https://arxiv.org/abs/2411.18636
作者: Nirmal Joshua Kapu,Raghav Karan
关键词-EN: convolutional neural networks, article surveys convolution-based, including convolutional neural, CRNNs-as speech signal, speech signal processing
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:This article surveys convolution-based models, including convolutional neural networks (CNNs), Conformers, ResNets, and CRNNs, as speech signal processing models, provides their statistical backgrounds, and covers their applications in speech recognition, speaker identification, emotion recognition, and speech enhancement. Through comparative assessment of training cost, model size, accuracy, and speed, we compare the strengths and weaknesses of each model, identify potential errors, and propose avenues for further research, emphasizing the central role convolution-based architectures play in advancing applications of speech technologies.
zh
[NLP-81] Semantic Orthographic and Morphological Biases in Humans Wordle Gameplay
【速读】: 该论文试图解决的问题是探究人类玩家在Wordle游戏中猜测单词时,其猜测行为是否受到先前猜测单词的语义、正字法和形态学特征的影响。解决方案的关键在于通过比较实际玩家猜测与接近最优猜测之间的差异,揭示出玩家猜测偏向于与先前猜测在语义、正字法和形态学上相似的现象。
链接: https://arxiv.org/abs/2411.18634
作者: Gary Liang,Adam Kabbara,Cindy Liu,Ronaldo Luo,Kina Kim,Michael Guerzhoy
关键词-EN: human players’ gameplay, human players’ guesses, game of Wordle, Wordle is influenced, human players’
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We show that human players’ gameplay in the game of Wordle is influenced by the semantics, orthography, and morphology of the player’s previous guesses. We demonstrate this influence by comparing actual human players’ guesses to near-optimal guesses, showing that human players’ guesses are biased to be similar to previous guesses semantically, orthographically, and morphologically.
zh
[NLP-82] Classical and Quantum Algorithms for the Deterministic L-system Inductive Inference Problem
【速读】: 该论文试图解决从给定字符串序列中自动推断确定性上下文无关L系统(D0L-system)的问题。解决方案的关键在于引入特征图(characteristic graph),将推断D0L-system的问题在多项式时间内转化为最大独立集问题(MIS)和SAT问题。随后,论文提出了一种经典的精确算法和一种近似的量子算法来解决这一问题。
链接: https://arxiv.org/abs/2411.19906
作者: Ali Lotfi,Ian McQuillan,Steven Rayan
关键词-EN: biological processes, plant development, made to model, model and create, create simulations
类目: Quantum Physics (quant-ph); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注: 16 pages, 1 figure
点击查看摘要
Abstract:L-systems can be made to model and create simulations of many biological processes, such as plant development. Finding an L-system for a given process is typically solved by hand, by experts, in a hugely time-consuming process. It would be significant if this could be done automatically from data, such as from sequences of images. In this paper, we are interested in inferring a particular type of L-system, deterministic context-free L-system (D0L-system) from a sequence of strings. We introduce the characteristic graph of a sequence of strings, which we then utilize to translate our problem (inferring D0L-system) in polynomial time into the maximum independent set problem (MIS) and the SAT problem. After that, we offer a classical exact algorithm and an approximate quantum algorithm for the problem.
zh
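As background for the inference problem above, the sketch below shows the easy forward direction: applying a candidate set of D0L productions in parallel and checking consistency with an observed derivation sequence. The paper's actual contribution (reducing inference to MIS/SAT and solving it classically or on a quantum device) is far more involved and is not reproduced here.

```python
def rewrite(productions: dict, word: str) -> str:
    """Apply a deterministic, context-free production to every symbol in parallel."""
    return "".join(productions.get(symbol, symbol) for symbol in word)

def is_consistent(productions: dict, observed: list) -> bool:
    """Check whether each observed string rewrites exactly into the next one."""
    return all(rewrite(productions, a) == b for a, b in zip(observed, observed[1:]))

if __name__ == "__main__":
    # Lindenmayer's classic algae system: A -> AB, B -> A.
    algae = {"A": "AB", "B": "A"}
    derivation = ["A", "AB", "ABA", "ABAAB", "ABAABABA"]
    print(is_consistent(algae, derivation))                  # True
    print(is_consistent({"A": "AA", "B": "A"}, derivation))  # False
```

The inference task studied in the paper is the inverse: recover an unknown `productions` table from `observed` alone, which is where the characteristic-graph construction and the MIS/SAT reductions come in.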
计算机视觉
[CV-0] AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos NEURIPS2024
【速读】: 该论文试图解决3D平面重建中的精确性和灵活性问题,提出了一种名为AlphaTablets的新型通用3D平面表示方法。解决方案的关键在于将3D平面表示为带有alpha通道的矩形,结合了2D和3D平面表示的优点,实现了对3D平面的精确、一致和灵活建模。通过在AlphaTablets上推导可微分光栅化,论文提出了一种自底向上的单目视频3D平面重建流程,从2D超像素和预训练模型的几何线索出发,初始化3D平面为AlphaTablets并通过可微分渲染进行优化。引入有效的合并方案以促进AlphaTablets的生长和细化,通过迭代优化和合并,最终重建出具有坚实表面和清晰边界的完整3D平面。实验结果表明,AlphaTablets在ScanNet数据集上的3D平面重建性能达到了最先进水平,展示了其在多种应用中的巨大潜力。
链接: https://arxiv.org/abs/2411.19950
作者: Yuze He,Wang Zhao,Shaohui Liu,Yubin Hu,Yushi Bai,Yu-Hui Wen,Yong-Jin Liu
关键词-EN: precise boundary delineation, features continuous, boundary delineation, precise boundary, planes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: NeurIPS 2024
点击查看摘要
Abstract:We introduce AlphaTablets, a novel and generic representation of 3D planes that features continuous 3D surface and precise boundary delineation. By representing 3D planes as rectangles with alpha channels, AlphaTablets combine the advantages of current 2D and 3D plane representations, enabling accurate, consistent and flexible modeling of 3D planes. We derive differentiable rasterization on top of AlphaTablets to efficiently render 3D planes into images, and propose a novel bottom-up pipeline for 3D planar reconstruction from monocular videos. Starting with 2D superpixels and geometric cues from pre-trained models, we initialize 3D planes as AlphaTablets and optimize them via differentiable rendering. An effective merging scheme is introduced to facilitate the growth and refinement of AlphaTablets. Through iterative optimization and merging, we reconstruct complete and accurate 3D planes with solid surfaces and clear boundaries. Extensive experiments on the ScanNet dataset demonstrate state-of-the-art performance in 3D planar reconstruction, underscoring the great potential of AlphaTablets as a generic 3D plane representation for various applications. Project page is available at: this https URL
zh
[CV-1] DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation
【速读】: 该论文试图解决在数据集蒸馏(dataset distillation)中,批量到全局匹配方法(batch-to-global matching)由于独立优化样本和重复使用全局监督信号导致的合成图像多样性不足的问题。解决方案的关键在于提出了一种新的多样性驱动的早-晚期训练方案(Diversity-driven EarlyLate Training, DELT),通过将预定义的每类图像数量(IPC)样本划分为更小的子任务,并采用局部优化来从不同阶段提取每个子集的分布,从而减少统一优化过程导致的均匀性。这种方法不仅提高了合成图像的多样性,还显著减少了合成时间,提升了训练效率。
链接: https://arxiv.org/abs/2411.19946
作者: Zhiqiang Shen,Ammar Sherif,Zeyuan Yin,Shitong Shao
关键词-EN: Recent advances, main directions, distillation have led, led to solutions, dataset distillation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent advances in dataset distillation have led to solutions in two main directions. The conventional batch-to-batch matching mechanism is ideal for small-scale datasets and includes bi-level optimization methods on models and syntheses, such as FRePo, RCIG, and RaT-BPTT, as well as other methods like distribution matching, gradient matching, and weight trajectory matching. Conversely, batch-to-global matching typifies decoupled methods, which are particularly advantageous for large-scale datasets. This approach has garnered substantial interest within the community, as seen in SRe ^2 L, G-VBSM, WMDD, and CDA. A primary challenge with the second approach is the lack of diversity among syntheses within each class since samples are optimized independently and the same global supervision signals are reused across different synthetic images. In this study, we propose a new Diversity-driven EarlyLate Training (DELT) scheme to enhance the diversity of images in batch-to-global matching with less computation. Our approach is conceptually simple yet effective, it partitions predefined IPC samples into smaller subtasks and employs local optimizations to distill each subset into distributions from distinct phases, reducing the uniformity induced by the unified optimization process. These distilled images from the subtasks demonstrate effective generalization when applied to the entire task. We conduct extensive experiments on CIFAR, Tiny-ImageNet, ImageNet-1K, and its sub-datasets. Our approach outperforms the previous state-of-the-art by 2 \sim 5% on average across different datasets and IPCs (images per class), increasing diversity per class by more than 5% while reducing synthesis time by up to 39.3% for enhancing the training efficiency. Code is available at: this https URL.
zh
[CV-2] Free-form Generation Enhances Challenging Clothed Human Modeling
【速读】: 该论文试图解决现有基于学习的方法在处理宽松衣物(如长裙)时,由于衣物远离身体导致的标准化过程不明确,从而产生不连贯和碎片化结果的问题。解决方案的关键在于提出了一种新颖的混合框架,通过区分人体的不同区域(未穿衣区域、变形区域和生成区域),并采用不同的策略进行建模。具体来说,未穿衣区域直接复制,变形区域利用线性混合蒙皮(LBS)处理,而生成区域则引入了一种新的自由形式、部件感知生成器来建模宽松衣物,从而增强了框架的灵活性和表现力,能够捕捉复杂宽松衣物的几何细节。
链接: https://arxiv.org/abs/2411.19942
作者: Hang Ye,Xiaoxuan Ma,Hai Ci,Wentao Zhu,Yizhou Wang
关键词-EN: Achieving realistic animated, Linear Blend Skinning, Achieving realistic, realistic animated human, animated human avatars
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 23 pages, 25 figures
点击查看摘要
Abstract:Achieving realistic animated human avatars requires accurate modeling of pose-dependent clothing deformations. Existing learning-based methods heavily rely on the Linear Blend Skinning (LBS) of minimally-clothed human models like SMPL to model deformation. However, these methods struggle to handle loose clothing, such as long dresses, where the canonicalization process becomes ill-defined when the clothing is far from the body, leading to disjointed and fragmented results. To overcome this limitation, we propose a novel hybrid framework to model challenging clothed humans. Our core idea is to use dedicated strategies to model different regions, depending on whether they are close to or distant from the body. Specifically, we segment the human body into three categories: unclothed, deformed, and generated. We simply replicate unclothed regions that require no deformation. For deformed regions close to the body, we leverage LBS to handle the deformation. As for the generated regions, which correspond to loose clothing areas, we introduce a novel free-form, part-aware generator to model them, as they are less affected by movements. This free-form generation paradigm brings enhanced flexibility and expressiveness to our hybrid framework, enabling it to capture the intricate geometric details of challenging loose clothing, such as skirts and dresses. Experimental results on the benchmark dataset featuring loose clothing demonstrate that our method achieves state-of-the-art performance with superior visual fidelity and realism, particularly in the most challenging cases.
zh
[CV-3] Quantifying the synthetic and real domain gap in aerial scene understanding
【速读】: 该论文试图解决合成图像与真实世界图像之间的差距量化问题,特别是在依赖大量数据的基于Transformer的模型和数据集改进方面,尤其是在航空场景理解等未充分探索的领域。解决方案的关键在于引入了一种新的场景复杂度评估方法,使用多模型共识度量 (Multi-Model Consensus Metric, MMCM) 和基于深度的结构度量,以实现对不同领域间感知和结构差异的稳健评估。通过对比真实世界 (Dronescapes) 和合成 (Skyscenes) 数据集的实验分析,论文揭示了真实世界场景在现有视觉Transformer模型中表现出更高的共识度,而合成场景则显示出更大的变异性和对模型适应性的挑战。这一发现强调了增强模拟真实性和模型泛化能力的必要性,并为航空场景理解中的领域自适应策略提供了改进路径。
链接: https://arxiv.org/abs/2411.19913
作者: Alina Marcu
关键词-EN: volumes of data, impact is significant, imagery is essential, essential for improving, improving both transformer-based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages (including references), 5 figures, 2 tables. Accepted for publication in the “Scientific Bulletin”, Series C, Electrical Engineering and Computer Science, ISSN 2286-3540
点击查看摘要
Abstract:Quantifying the gap between synthetic and real-world imagery is essential for improving both transformer-based models - that rely on large volumes of data - and datasets, especially in underexplored domains like aerial scene understanding where the potential impact is significant. This paper introduces a novel methodology for scene complexity assessment using Multi-Model Consensus Metric (MMCM) and depth-based structural metrics, enabling a robust evaluation of perceptual and structural disparities between domains. Our experimental analysis, utilizing real-world (Dronescapes) and synthetic (Skyscenes) datasets, demonstrates that real-world scenes generally exhibit higher consensus among state-of-the-art vision transformers, while synthetic scenes show greater variability and challenge model adaptability. The results underline the inherent complexities and domain gaps, emphasizing the need for enhanced simulation fidelity and model generalization. This work provides critical insights into the interplay between domain characteristics and model performance, offering a pathway for improved domain adaptation strategies in aerial scene understanding.
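The abstract does not spell out how consensus is computed, so the sketch below only shows one plausible reading: consensus as the mean pairwise IoU between the semantic maps several pre-trained models predict for the same image, with higher agreement suggesting a less ambiguous scene. Function and variable names are hypothetical and the paper's MMCM may be defined differently.

```python
import itertools
import numpy as np

def pairwise_consensus(seg_maps, num_classes):
    """seg_maps: list of (H, W) integer label maps predicted by different models."""
    scores = []
    for a, b in itertools.combinations(seg_maps, 2):
        ious = []
        for c in range(num_classes):
            inter = np.logical_and(a == c, b == c).sum()
            union = np.logical_or(a == c, b == c).sum()
            if union > 0:
                ious.append(inter / union)
        if ious:
            scores.append(np.mean(ious))
    return float(np.mean(scores))  # higher value => the models agree more on the scene
```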
zh
[CV-4] C3-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual Neural Radiance Fields
【速读】: 该论文试图解决神经辐射场 (NeRF) 在处理多个3D场景时的可扩展性问题。传统方法为每个场景单独训练一个模型,导致存储需求和训练时间随场景数量线性增加。论文提出了一种名为 C³-NeRF 的新型条件累积持续框架,通过使用简单的伪场景标签将多个场景编码到单个神经辐射场模型的参数中。关键在于,该框架不仅能够高效地处理多个场景,还具备持续学习能力(通过生成式重放),几乎不会遗忘先前学习的场景,从而在不增加额外参数的情况下适应新场景。
链接: https://arxiv.org/abs/2411.19903
作者: Prajwal Singh,Ashish Tiwari,Gautam Vashishtha,Shanmuganathan Raman
关键词-EN: exhibited highly photorealistic, highly photorealistic rendering, exhibited highly, highly photorealistic, views through per-scene
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Neural radiance fields (NeRF) have exhibited highly photorealistic rendering of novel views through per-scene optimization over a single 3D scene. With the growing popularity of NeRF and its variants, they have become ubiquitous and have been identified as efficient 3D resources. However, they are still far from being scalable since a separate model needs to be stored for each scene, and the training time increases linearly with every newly added scene. Surprisingly, the idea of encoding multiple 3D scenes into a single NeRF model is heavily under-explored. In this work, we propose a novel conditional-cum-continual framework, called C^3 -NeRF, to accommodate multiple scenes into the parameters of a single neural radiance field. Unlike conventional approaches that leverage feature extractors and pre-trained priors for scene conditioning, we use simple pseudo-scene labels to model multiple scenes in NeRF. Interestingly, we observe the framework is also inherently continual (via generative replay) with minimal, if not no, forgetting of the previously learned scenes. Consequently, the proposed framework adapts to multiple new scenes without necessarily accessing the old data. Through extensive qualitative and quantitative evaluation using synthetic and real datasets, we demonstrate the inherent capacity of the NeRF model to accommodate multiple scenes with high-quality novel-view renderings without adding additional parameters. We provide implementation details and dynamic visualizations of our results in the supplementary file.
zh
[CV-5] GuardSplat: Robust and Efficient Watermarking for 3D Gaussian Splatting
【速读】: 该论文试图解决3D高斯喷射(3D Gaussian Splatting, 3DGS)资产的版权保护问题,特别是现有水印方法在安全性、容量和不可见性方面的不足,以及优化时间过长的问题。解决方案的关键在于提出了GuardSplat框架,该框架包含三个创新模块:1) CLIP引导的消息解耦优化模块,利用CLIP的对齐能力和丰富表示,实现高提取精度和低优化成本;2) 球谐感知(Spherical-harmonic-aware, SH-aware)消息嵌入模块,通过一组球谐偏移量将消息无缝嵌入到每个3D高斯的球谐特征中,同时保持原始3D结构,确保水印的不可见性和安全性;3) 抗畸变消息提取模块,提高对各种视觉畸变的鲁棒性。这些模块共同作用,使得GuardSplat在保护3DGS资产版权方面表现出色,并实现了快速的优化速度。
链接: https://arxiv.org/abs/2411.19895
作者: Zixuan Chen,Guangcong Wang,Jiahao Zhu,Jianhuang Lai,Xiaohua Xie
关键词-EN: recently created impressive, Gaussian Splatting, created impressive assets, recently created, created impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Project page: this https URL and Code: this https URL
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has recently created impressive assets for various applications. However, the copyright of these assets is not well protected as existing watermarking methods are not suited for 3DGS considering security, capacity, and invisibility. Besides, these methods often require hours or even days for optimization, limiting the application scenarios. In this paper, we propose GuardSplat, an innovative and efficient framework that effectively protects the copyright of 3DGS assets. Specifically, 1) We first propose a CLIP-guided Message Decoupling Optimization module for training the message decoder, leveraging CLIP’s aligning capability and rich representations to achieve a high extraction accuracy with minimal optimization costs, presenting exceptional capability and efficiency. 2) Then, we propose a Spherical-harmonic-aware (SH-aware) Message Embedding module tailored for 3DGS, which employs a set of SH offsets to seamlessly embed the message into the SH features of each 3D Gaussian while maintaining the original 3D structure. It enables the 3DGS assets to be watermarked with minimal fidelity trade-offs and prevents malicious users from removing the messages from the model files, meeting the demands for invisibility and security. 3) We further propose an Anti-distortion Message Extraction module to improve robustness against various visual distortions. Extensive experiments demonstrate that GuardSplat outperforms the state-of-the-art methods and achieves fast optimization speed.
zh
[CV-6] FlowCLAS: Enhancing Normalizing Flow Via Contrastive Learning For Anomaly Segmentation
【速读】: 该论文试图解决在安全关键应用中,异常分割任务在面对无标签数据和预训练视觉编码器时的局限性问题。具体来说,现有的场景级异常分割方法依赖于多样化的内类标签进行训练,这限制了它们利用大量无标签数据和预训练视觉编码器的能力,特别是在颜色多样性较低和物体类别有限的领域中表现不佳。同时,现有的无监督方法在处理多样化场景时也存在困难。论文提出的解决方案是FlowCLAS,一种新颖的自监督框架,利用视觉基础模型提取丰富特征,并通过归一化流网络学习这些特征的密度分布。关键在于通过在潜在空间中结合Outlier Exposure和对比学习来增强模型的判别能力,从而在不依赖内类分割标签的情况下,显著提升在空间机器人和自动驾驶领域的异常分割性能。
链接: https://arxiv.org/abs/2411.19888
作者: Chang Won Lee,Selina Leveugle,Svetlana Stolpner,Chris Langley,Paul Grouchy,Jonathan Kelly,Steven L. Waslander
关键词-EN: computer vision task, valuable computer vision, Anomaly segmentation, unexpected events, Anomaly
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Anomaly segmentation is a valuable computer vision task for safety-critical applications that need to be aware of unexpected events. Current state-of-the-art (SOTA) scene-level anomaly segmentation approaches rely on diverse inlier class labels during training, limiting their ability to leverage vast unlabeled datasets and pre-trained vision encoders. These methods may underperform in domains with reduced color diversity and limited object classes. Conversely, existing unsupervised methods struggle with anomaly segmentation with the diverse scenes of less restricted domains. To address these challenges, we introduce FlowCLAS, a novel self-supervised framework that utilizes vision foundation models to extract rich features and employs a normalizing flow network to learn their density distribution. We enhance the model’s discriminative power by incorporating Outlier Exposure and contrastive learning in the latent space. FlowCLAS significantly outperforms all existing methods on the ALLO anomaly segmentation benchmark for space robotics and demonstrates competitive results on multiple road anomaly segmentation benchmarks for autonomous driving, including Fishyscapes LostFound and Road Anomaly. These results highlight FlowCLAS’s effectiveness in addressing the unique challenges of space anomaly segmentation while retaining SOTA performance in the autonomous driving domain without reliance on inlier segmentation labels.
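As a rough illustration of the scoring side of such a pipeline, the sketch below computes an anomaly score as the negative log-likelihood of a frozen-encoder feature under a normalizing flow with a standard-normal base distribution. The `flow` interface (returning the latent and the log-determinant of its Jacobian) is an assumption for this sketch, not the paper's API.

```python
import math
import torch

def anomaly_score(features, flow):
    """features: (N, D) frozen foundation-model features; flow: invertible network x -> (z, log|det J|)."""
    z, log_det = flow(features)                      # change of variables
    d = z.shape[-1]
    log_pz = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * d * math.log(2 * math.pi)
    log_px = log_pz + log_det                        # log-likelihood under the flow
    return -log_px                                   # high score => likely anomaly
```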
zh
[CV-7] SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection
【速读】: 该论文试图解决自动驾驶系统中多模态感知融合的问题,特别是雷达和相机数据融合中的深度估计和目标检测精度问题。解决方案的关键在于提出了一种名为SpaRC的新型稀疏融合Transformer,通过三个主要贡献来实现:(1) 稀疏视锥融合 (Sparse Frustum Fusion, SFF) 用于跨模态特征对齐;(2) 范围自适应雷达聚合 (Range-Adaptive Radar Aggregation, RAR) 用于精确的目标定位;(3) 局部自注意力机制 (Local Self-Attention, LSA) 用于聚焦的查询聚合。这些方法避免了传统密集鸟瞰图 (Bird’s Eye View, BEV) 渲染的高计算成本,直接在编码点特征上操作,显著提高了效率和准确性。
链接: https://arxiv.org/abs/2411.19860
作者: Philipp Wolters,Johannes Gilg,Torben Teepe,Fabian Herzog,Felix Fent,Gerhard Rigoll
关键词-EN: integrates multi-view image, multi-view image semantics, Bird Eye View, integrates multi-view, multi-view image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages, 11 figures
点击查看摘要
Abstract:In this work, we present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird’s Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance metrics of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at this https URL.
zh
[CV-8] Towards Class-wise Robustness Analysis
【速读】: 该论文试图解决深度神经网络在实际应用中因域偏移(domain shifts)如常见数据损坏和对抗攻击而表现不佳的问题。解决方案的关键在于识别和分析对抗训练的鲁棒分类模型中存在的类间偏差(class-to-class biases),并评估这些模型在面对常见损坏和对抗攻击时的类级鲁棒性(class-wise robustness)。通过引入类误报评分(Class False Positive Score),论文旨在提供一种公平的评估方法,以衡量每个类别对错误分类的敏感性,从而揭示潜在的攻击途径。
链接: https://arxiv.org/abs/2411.19853
作者: Tejaswini Medi,Julia Grabinski,Margret Keuper
关键词-EN: deep neural networks, downstream tasks, successful in solving, solving many downstream, networks is limited
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While being very successful in solving many downstream tasks, the application of deep neural networks is limited in real-life scenarios because of their susceptibility to domain shifts such as common corruptions, and adversarial attacks. The existence of adversarial examples and data corruption significantly reduces the performance of deep classification models. Researchers have made strides in developing robust neural architectures to bolster decisions of deep classifiers. However, most of these works rely on effective adversarial training methods, and predominantly focus on overall model robustness, disregarding class-wise differences in robustness, which are critical. Exploiting weakly robust classes is a potential avenue for attackers to fool the image recognition models. Therefore, this study investigates class-to-class biases across adversarially trained robust classification models to understand their latent space structures and analyze their strong and weak class-wise properties. We further assess the robustness of classes against common corruptions and adversarial attacks, recognizing that class vulnerability extends beyond the number of correct classifications for a specific class. We find that the number of false positives of classes as specific target classes significantly impacts their vulnerability to attacks. Through our analysis on the Class False Positive Score, we assess a fair evaluation of how susceptible each class is to misclassification.
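A per-class false-positive statistic of the kind discussed here can be computed directly from a confusion matrix, as in the sketch below: for each class, count predictions that land in that class but belong elsewhere, normalized by how often the class is predicted. The exact normalization used for the paper's Class False Positive Score may differ.

```python
import numpy as np

def class_false_positive_scores(conf_mat):
    """conf_mat[i, j] = number of samples of true class i predicted as class j."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    fp = conf_mat.sum(axis=0) - np.diag(conf_mat)    # predicted as class c but actually another class
    total_predicted = conf_mat.sum(axis=0)
    return fp / np.maximum(total_predicted, 1)       # fraction of predictions into c that are wrong
```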
zh
[CV-9] A Visual-inertial Localization Algorithm using Opportunistic Visual Beacons and Dead-Reckoning for GNSS-Denied Large-scale Applications
【速读】: 该论文试图解决在智能城市发展背景下,大规模城市环境中持续行人导航的需求问题,特别是在复杂的城市峡谷环境中,全球导航卫星系统 (GNSS) 的定位服务受限的情况下。解决方案的关键在于提出了一种低成本的视觉-惯性定位方案,该方案结合了轻量级多尺度组卷积 (MSGC) 基础的视觉位置识别 (VPR) 神经网络、行人航位推算 (PDR) 算法以及基于卡尔曼滤波器 (Kalman filter) 的视觉/惯性融合方法,并加入了粗差抑制技术。VPR 作为卡尔曼滤波器的条件观测,有效校正了 PDR 方法中累积的误差,从而确保了在 GNSS 缺失区域的长期定位可靠性。实验结果表明,该方法在大规模移动中保持了稳定的定位性能,并且在公共数据集上的召回率 (Recall@1) 和参数数量方面优于现有的轻量级 VPR 方法。
链接: https://arxiv.org/abs/2411.19845
作者: Liqiang Zhang, Ye Tian, Dongyan Wei
关键词-EN: continuous pedestrian navigation, smart cities, significantly increased, development of smart, demand for continuous
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:With the development of smart cities, the demand for continuous pedestrian navigation in large-scale urban environments has significantly increased. While global navigation satellite systems (GNSS) provide low-cost and reliable positioning services, they are often hindered in complex urban canyon environments. Thus, exploring opportunistic signals for positioning in urban areas has become a key solution. Augmented reality (AR) allows pedestrians to acquire real-time visual information. Accordingly, we propose a low-cost visual-inertial positioning solution. This method comprises a lightweight multi-scale group convolution (MSGC)-based visual place recognition (VPR) neural network, a pedestrian dead reckoning (PDR) algorithm, and a visual/inertial fusion approach based on a Kalman filter with gross error suppression. The VPR serves as a conditional observation to the Kalman filter, effectively correcting the errors accumulated through the PDR method. This enables the entire algorithm to ensure the reliability of long-term positioning in GNSS-denied areas. Extensive experimental results demonstrate that our method maintains stable positioning during large-scale movements. Compared to the lightweight MobileNetV3-based VPR method, our proposed VPR solution improves Recall@1 by at least 3% on two public datasets while reducing the number of parameters by 63.37%. It also achieves performance that is comparable to the VGG16-based method. The VPR-PDR algorithm improves localization accuracy by more than 40% compared to the original PDR.
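The fusion step can be pictured with a textbook Kalman update in which the VPR fix is the observation and a chi-square gate on the innovation stands in for gross-error suppression. The sketch below is generic; the state layout and gate threshold are assumptions, not the paper's exact filter.

```python
import numpy as np

def vpr_update(x, P, z, R, gate=9.21):
    """x: (2,) PDR position estimate, P: (2,2) covariance, z: (2,) VPR fix, R: (2,2) measurement noise."""
    H = np.eye(2)
    y = z - H @ x                                # innovation
    S = H @ P @ H.T + R
    if y @ np.linalg.solve(S, y) > gate:         # Mahalanobis test, ~chi2(2) at 99%
        return x, P                              # likely a wrong place match: reject it
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ y
    P_new = (np.eye(2) - K @ H) @ P
    return x_new, P_new
```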
zh
[CV-10] Feedback-driven object detection and iterative model improvement
【速读】: 该论文试图解决自动化目标检测中高效、高质量标注的难题。解决方案的关键在于开发了一个交互式平台,该平台允许用户上传和标注图像,并对目标检测模型进行微调。用户可以手动审查和细化标注,进一步生成改进的快照,用于后续图像的自动目标检测,这一过程被称为半自动标注(semi-automatic annotation),显著提高了标注效率。实验结果表明,与手动标注相比,半自动标注的时间减少了高达53%,且未牺牲标注质量,甚至在某些情况下超过了手动标注的准确性。该平台为创建高质量的目标检测数据集提供了潜力,并为未来标注平台的发展提供了最佳实践。
链接: https://arxiv.org/abs/2411.19835
作者: Sönke Tenckhoff,Mario Koddenbrock,Erik Rodner
关键词-EN: Automated object detection, object detection, Automated object, object detection models, diverse applications
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: AI4EA24 preprint
点击查看摘要
Abstract:Automated object detection has become increasingly valuable across diverse applications, yet efficient, high-quality annotation remains a persistent challenge. In this paper, we present the development and evaluation of a platform designed to interactively improve object detection models. The platform allows uploading and annotating images as well as fine-tuning object detection models. Users can then manually review and refine annotations, further creating improved snapshots that are used for automatic object detection on subsequent image uploads - a process we refer to as semi-automatic annotation resulting in a significant gain in annotation efficiency. Whereas iterative refinement of model results to speed up annotation has become common practice, we are the first to quantitatively evaluate its benefits with respect to time, effort, and interaction savings. Our experimental results show clear evidence for a significant time reduction of up to 53% for semi-automatic compared to manual annotation. Importantly, these efficiency gains did not compromise annotation quality, while matching or occasionally even exceeding the accuracy of manual annotations. These findings demonstrate the potential of our lightweight annotation platform for creating high-quality object detection datasets and provide best practices to guide future development of annotation platforms. The platform is open-source, with the frontend and backend repositories available on GitHub.
zh
[CV-11] SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens
【速读】: 该论文试图解决在单张RGB图像中进行实时多人3D人体网格估计时,高分辨率输入带来的计算开销显著增加的问题。解决方案的关键在于引入尺度自适应的token(scale-adaptive tokens),这些token根据图像中个体相对于相机的尺度动态调整分辨率。具体来说,对于图像中较小的个体(远离相机),使用更高的分辨率处理,而对于较大的个体(靠近相机),则使用较低的分辨率处理,背景区域则进一步精简。这种尺度自适应的token能够更高效地编码图像特征,促进后续的人体网格回归,同时使模型能够更有效地分配计算资源,专注于更具挑战性的情况。实验结果表明,该方法在保持高分辨率处理精度优势的同时,显著降低了计算成本,实现了与SOTA方法相当的实时推理性能。
链接: https://arxiv.org/abs/2411.19824
作者: Chi Su,Xiaoxuan Ma,Jiajun Su,Yizhou Wang
关键词-EN: single RGB image, single RGB, RGB image, human mesh estimation, smaller scales
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 12 figures
点击查看摘要
Abstract:We propose a one-stage framework for real-time multi-person 3D human mesh estimation from a single RGB image. While current one-stage methods, which follow a DETR-style pipeline, achieve state-of-the-art (SOTA) performance with high-resolution inputs, we observe that this particularly benefits the estimation of individuals in smaller scales of the image (e.g., those far from the camera), but at the cost of significantly increased computation overhead. To address this, we introduce scale-adaptive tokens that are dynamically adjusted based on the relative scale of each individual in the image within the DETR framework. Specifically, individuals in smaller scales are processed at higher resolutions, larger ones at lower resolutions, and background regions are further distilled. These scale-adaptive tokens more efficiently encode the image features, facilitating subsequent decoding to regress the human mesh, while allowing the model to allocate computational resources more effectively and focus on more challenging cases. Experiments show that our method preserves the accuracy benefits of high-resolution processing while substantially reducing computational cost, achieving real-time inference with performance comparable to SOTA methods.
zh
[CV-12] Gaussian multi-target filtering with target dynamics driven by a stochastic differential equation
【速读】: 该论文试图解决在连续时间目标动态和离散时间测量条件下,多目标滤波问题。解决方案的关键在于提出了一种新的高斯连续-离散泊松多伯努利混合滤波器 (Gaussian continuous-discrete Poisson multi-Bernoulli mixture (PMBM) filter),并通过最小化Kullback-Leibler散度进行矩匹配,计算每个目标出生时的最佳拟合均值和协方差。此外,论文还推导了新出生目标集合的分布,并将其扩展到由非线性随机微分方程驱动的目标动态模型。
链接: https://arxiv.org/abs/2411.19814
作者: Ángel F. García-Fernández,Simo Särkkä
关键词-EN: discrete time instants, paper proposes multi-target, paper proposes, measurements are obtained, obtained at discrete
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Probability (math.PR); Computation (stat.CO)
备注:
点击查看摘要
Abstract:This paper proposes multi-target filtering algorithms in which target dynamics are given in continuous time and measurements are obtained at discrete time instants. In particular, targets appear according to a Poisson point process (PPP) in time with a given Gaussian spatial distribution, targets move according to a general time-invariant linear stochastic differential equation, and the life span of each target is modelled with an exponential distribution. For this multi-target dynamic model, we derive the distribution of the set of new born targets and calculate closed-form expressions for the best fitting mean and covariance of each target at its time of birth by minimising the Kullback-Leibler divergence via moment matching. This yields a novel Gaussian continuous-discrete Poisson multi-Bernoulli mixture (PMBM) filter, and its approximations based on Poisson multi-Bernoulli and probability hypothesis density filtering. These continuous-discrete multi-target filters are also extended to target dynamics driven by nonlinear stochastic differential equations.
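The moment-matching step mentioned in the abstract rests on a standard identity: minimizing the Kullback-Leibler divergence from a fixed density to a Gaussian approximation is achieved by matching the first two moments. In generic notation (not the paper's):

```latex
(\mu^\ast, \Sigma^\ast) = \arg\min_{\mu,\Sigma}
  \mathrm{KL}\big(p \,\|\, \mathcal{N}(\mu,\Sigma)\big),
\qquad
\mu^\ast = \mathbb{E}_{p}[x],
\quad
\Sigma^\ast = \mathbb{E}_{p}\big[(x-\mu^\ast)(x-\mu^\ast)^{\top}\big].
```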
zh
[CV-13] LaVIDE: A Language-Vision Discriminator for Detecting Changes in Satellite Image with Map References
【速读】: 该论文试图解决在仅有一张卫星图像的情况下进行变化检测的问题。解决方案的关键在于提出了一个名为LaVIDE(Language-VIsion Discriminator for dEtecting changes in satellite image with map references)的模型,该模型利用语言来弥合地图与图像之间的信息差距。具体来说,LaVIDE将变化检测问题形式化为“像素是否属于[类别]?”,通过在语言-视觉模型的特征空间中对齐地图和图像,将高层次的地图类别与低层次的图像细节关联起来。此外,模型构建了一个混合专家判别模块,该模块在多个语义层面上比较地图的语言特征与图像的视觉特征,从而实现全面语义比较,以进行变化检测。实验结果表明,LaVIDE在四个基准数据集上的表现优于现有的最先进的变化检测算法。
链接: https://arxiv.org/abs/2411.19758
作者: Shuguo Jiang,Fang Xu,Sen Jia,Gui-Song Xia
关键词-EN: typically relies, significantly hindered, Change detection, single image, textbf
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Change detection, which typically relies on the comparison of bi-temporal images, is significantly hindered when only a single image is available. Comparing a single image with an existing map, such as OpenStreetMap, which is continuously updated through crowd-sourcing, offers a viable solution to this challenge. Unlike images that carry low-level visual details of ground objects, maps convey high-level categorical information. This discrepancy in abstraction levels complicates the alignment and comparison of the two data types. In this paper, we propose a Language-VIsion Discriminator for dEtecting changes in satellite image with map references, namely LaVIDE, which leverages language to bridge the information gap between maps and images. Specifically, LaVIDE formulates change detection as the problem of "Does the pixel belong to [class]?", aligning maps and images within the feature space of the language-vision model to associate high-level map categories with low-level image details. Moreover, we build a mixture-of-experts discriminative module, which compares linguistic features from maps with visual features from images across various semantic perspectives, achieving comprehensive semantic comparison for change detection. Extensive evaluation on four benchmark datasets demonstrates that LaVIDE can effectively detect changes in satellite image with map references, outperforming state-of-the-art change detection algorithms, e.g., with gains of about 13.8% on the DynamicEarthNet dataset and 4.3% on the SECOND dataset.
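The "Does the pixel belong to [class]?" formulation can be pictured as scoring each pixel feature against the text embedding of the category the map assigns to that location, and flagging low-similarity pixels as changed. The sketch below shows only this schematic core; it omits the mixture-of-experts discriminator, and the names and threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def change_map(pixel_feats, class_text_embs, map_labels, threshold=0.2):
    """pixel_feats: (H, W, D); class_text_embs: (C, D); map_labels: (H, W) integer map categories."""
    feats = F.normalize(pixel_feats, dim=-1)
    texts = F.normalize(class_text_embs, dim=-1)
    expected = texts[map_labels]                   # (H, W, D): embedding of the class the map claims
    sim = (feats * expected).sum(dim=-1)           # cosine similarity per pixel
    return sim < threshold                         # True where the map no longer matches the image
```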
zh
[CV-14] Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models NEURIPS2024
【速读】: 该论文试图解决在微调基础模型时,模型对分布偏移的鲁棒性下降的问题。解决方案的关键是提出了双重风险最小化(Dual Risk Minimization, DRM)方法,该方法结合了经验风险最小化和最坏情况风险最小化,以更好地保留下游任务的核心特征。具体而言,DRM利用大型语言模型(LLMs)生成的核心特征描述来引导基于核心的零样本预测,这些预测作为代理来估计最坏情况风险。DRM在模型鲁棒性的两个关键方面——期望性能和最坏情况性能之间取得了平衡,并在多个真实世界基准测试中达到了新的最优水平,显著提升了CLIP ViT-L/14@336在ImageNet、WILDS-iWildCam和WILDS-FMoW上的分布外性能。
链接: https://arxiv.org/abs/2411.19757
作者: Kaican Li,Weiyan Xie,Yongxiang Huang,Didan Deng,Lanqing Hong,Zhenguo Li,Ricardo Silva,Nevin L. Zhang
关键词-EN: Fine-tuning foundation models, distribution shifts, robust fine-tuning methods, Fine-tuning foundation, robust fine-tuning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2024
点击查看摘要
Abstract:Fine-tuning foundation models often compromises their robustness to distribution shifts. To remedy this, most robust fine-tuning methods aim to preserve the pre-trained features. However, not all pre-trained features are robust and those methods are largely indifferent to which ones to preserve. We propose dual risk minimization (DRM), which combines empirical risk minimization with worst-case risk minimization, to better preserve the core features of downstream tasks. In particular, we utilize core-feature descriptions generated by LLMs to induce core-based zero-shot predictions which then serve as proxies to estimate the worst-case risk. DRM balances two crucial aspects of model robustness: expected performance and worst-case performance, establishing a new state of the art on various real-world benchmarks. DRM significantly improves the out-of-distribution performance of CLIP ViT-L/14@336 on ImageNet (75.9 to 77.1), WILDS-iWildCam (47.1 to 51.8), and WILDS-FMoW (50.7 to 53.1); opening up new avenues for robust fine-tuning. Our code is available at this https URL .
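A hedged sketch of the dual-risk idea: the ordinary fine-tuning loss is mixed with a worst-case proxy driven by core-feature zero-shot predictions. Here the proxy is simply a KL term toward those zero-shot predictions and `lam` is a made-up weight; the paper's actual worst-case risk estimator may be constructed differently.

```python
import torch.nn.functional as F

def dual_risk_loss(logits, core_zero_shot_logits, labels, lam=0.5):
    """logits: fine-tuned model outputs; core_zero_shot_logits: core-feature zero-shot predictions."""
    erm = F.cross_entropy(logits, labels)                        # expected (empirical) risk
    core_targets = core_zero_shot_logits.softmax(dim=-1)
    worst_case_proxy = F.kl_div(logits.log_softmax(dim=-1),      # stay close to the core-based
                                core_targets, reduction="batchmean")  # zero-shot predictions
    return erm + lam * worst_case_proxy
```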
zh
[CV-15] DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering
【速读】: 该论文试图解决在静态3D环境中进行快速新颖视图合成时,由于干扰物或遮挡物破坏多视角一致性假设而导致的3D重建困难问题。解决方案的关键在于提出了一种名为DeSplat的新方法,该方法通过直接基于高斯基元的体积渲染来分离干扰物和静态场景元素。DeSplat在每个相机视图中初始化高斯基元,以重建视图特定的干扰物,并在alpha合成阶段分别建模静态3D场景和干扰物,从而实现静态元素和干扰物的显式场景分离。这种方法在不牺牲渲染速度的情况下,达到了与先前无干扰物方法相媲美的结果。
链接: https://arxiv.org/abs/2411.19756
作者: Yihao Wang,Marcus Klasson,Matias Turkulainen,Shuzhe Wang,Juho Kannala,Arno Solin
关键词-EN: splatting enables fast, Gaussian splatting enables, splatting enables, enables fast, Gaussian splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Gaussian splatting enables fast novel view synthesis in static 3D environments. However, reconstructing real-world environments remains challenging as distractors or occluders break the multi-view consistency assumption required for accurate 3D reconstruction. Most existing methods rely on external semantic information from pre-trained models, introducing additional computational overhead as pre-processing steps or during optimization. In this work, we propose a novel method, DeSplat, that directly separates distractors and static scene elements purely based on volume rendering of Gaussian primitives. We initialize Gaussians within each camera view for reconstructing the view-specific distractors to separately model the static 3D scene and distractors in the alpha compositing stages. DeSplat yields an explicit scene separation of static elements and distractors, achieving comparable results to prior distractor-free approaches without sacrificing rendering speed. We demonstrate DeSplat’s effectiveness on three benchmark data sets for distractor-free novel view synthesis. See the project website at this https URL.
zh
[CV-16] A Comprehensive Content Verification System for ensuring Digital Integrity in the Age of Deep Fakes
【速读】: 该论文试图解决在数字内容广泛共享的时代,内容完整性验证超越单一社交媒体平台的问题。解决方案的关键在于开发一个内容验证系统 (Content Verification System),该系统旨在跨数字平台验证图像和视频的真实性。通过超越传统的“蓝勾”验证方式,该系统赋予个人和影响力者验证其数字足迹真实性的能力,从而在互联世界中保护其声誉。
链接: https://arxiv.org/abs/2411.19750
作者: RaviKanth Kaja
关键词-EN: robust content-integrity verification, social media platforms, Content Verification System, era marked, widespread sharing
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:
点击查看摘要
Abstract:In an era marked by the widespread sharing of digital content, the need for a robust content-integrity verification goes beyond the confines of individual social media platforms. While verified profiles (such as blue ticks on platforms like Instagram and X) have become synonymous with credibility, the content they share often traverses a complex network of interconnected platforms, by means of re-sharing, re-posting, etc., leaving a void in the authentication process of the content itself. With the advent of easily accessible AI tools (like DALL-E, Sora, and the tools that are explicitly built for generating deepfakes face swaps), the risk of misinformation through social media platforms is growing exponentially. This paper discusses a solution, a Content Verification System, designed to authenticate images and videos shared as posts or stories across the digital landscape. Going beyond the limitations of blue ticks, this system empowers individuals and influencers to validate the authenticity of their digital footprint, safeguarding their reputation in an interconnected world.
zh
[CV-17] A Multi-Loss Strategy for Vehicle Trajectory Prediction: Combining Off-Road Diversity and Directional Consistency Losses
【速读】: 该论文试图解决自动驾驶车辆轨迹预测中未能充分捕捉复杂交通规则和潜在车辆运动范围的问题。解决方案的关键在于引入了三种新的损失函数:Offroad Loss(防止预测路径超出驾驶区域边界)、Direction Consistency Error(确保预测路径与交通方向一致)和Diversity Loss(增加预测路径的多样性)。这些损失函数应用于所有预测模式,克服了传统“胜者通吃”训练方法的不足,不仅提升了模型训练效果,还可作为评估轨迹预测现实性和多样性的指标。通过在nuScenes和Argoverse 2数据集上的广泛验证,该方法在保持准确性的同时显著提高了安全性和鲁棒性,平均减少了47%的原生场景和37%的攻击场景中的越界错误。
链接: https://arxiv.org/abs/2411.19747
作者: Ahmad Rahimi,Alexandre Alahi
关键词-EN: efficiency of planning, loss functions, Direction Consistency Error, loss, potential vehicle movements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: Preprint, 7 pages, 4 figures and 2 tables
点击查看摘要
Abstract:Trajectory prediction is essential for the safety and efficiency of planning in autonomous vehicles. However, current models often fail to fully capture complex traffic rules and the complete range of potential vehicle movements. Addressing these limitations, this study introduces three novel loss functions: Offroad Loss, Direction Consistency Error, and Diversity Loss. These functions are designed to keep predicted paths within driving area boundaries, aligned with traffic directions, and cover a wider variety of plausible driving scenarios. As all prediction modes should adhere to road rules and conditions, this work overcomes the shortcomings of traditional “winner takes all” training methods by applying the loss functions to all prediction modes. These loss functions not only improve model training but can also serve as metrics for evaluating the realism and diversity of trajectory predictions. Extensive validation on the nuScenes and Argoverse 2 datasets with leading baseline models demonstrates that our approach not only maintains accuracy but significantly improves safety and robustness, reducing offroad errors on average by 47% on original and by 37% on attacked scenes. This work sets a new benchmark for trajectory prediction in autonomous driving, offering substantial improvements in navigating complex environments. Our code is available at this https URL .
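Since the abstract notes that the proposed terms can double as evaluation metrics, the sketch below shows the "apply to every mode" pattern with a hard off-road rate: each predicted mode is checked against a drivable-area mask and the penalty is averaged over all modes rather than only the best one. The grid lookup and the names are illustrative assumptions, not the paper's differentiable losses.

```python
import torch

def offroad_rate(traj, drivable, to_grid):
    """traj: (T, 2) waypoints; drivable: (H, W) bool mask; to_grid maps x,y to integer grid indices."""
    idx = to_grid(traj)                                   # (T, 2) long tensor of grid indices
    on_road = drivable[idx[:, 0], idx[:, 1]]
    return 1.0 - on_road.float().mean()                   # fraction of waypoints off the road

def all_mode_offroad(pred_modes, drivable, to_grid):
    """pred_modes: (K, T, 2); every mode is penalised, not just the winner."""
    rates = torch.stack([offroad_rate(m, drivable, to_grid) for m in pred_modes])
    return rates.mean()
```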
zh
[CV-18] Real-Time Anomaly Detection in Video Streams
【速读】: 该论文旨在开发一种人工智能系统,用于实时检测视频流中的危险情况。解决方案的关键在于结合时间分析(temporal analysis)和空间分析(spatial analysis),通过集成物体检测(object detection)、人体姿态检测(human pose detection)和运动分析(motion analysis)来提升异常检测的准确性。此外,论文还扩展了图像分析中常用的激活图(activation maps)和显著性图(saliency maps)技术至视频领域,并提出了一种新的方法以增强结果的可解释性。所提出的架构可根据需求进行二分类或多分类,以识别是否存在警报或确定警报的原因。论文中测试了多种神经网络模型,最终选择了YOLO进行空间分析,结合VGG19和GRU的卷积循环神经网络(CRNN)进行时间分析,以及多层感知器(multi-layer perceptron)进行分类。这些模型可以并行或串行组合,尽管并行模式速度更快,但串行模式通常更可靠。为了训练这些模型,论文采用了监督学习方法,并创建了两个专有数据集,分别关注可能引发异常的物体和包含异常或非异常的视频。
链接: https://arxiv.org/abs/2411.19731
作者: Fabien Poirier
关键词-EN: LIASD laboratory, CIFRE agreement, company Othello, thesis is part, CIFRE
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream. To achieve this, a novel approach combining temporal and spatial analysis has been proposed. Several avenues have been explored to improve anomaly detection by integrating object detection, human pose detection, and motion analysis. For result interpretability, techniques commonly used for image analysis, such as activation and saliency maps, have been extended to videos, and an original method has been proposed. The proposed architecture performs binary or multiclass classification depending on whether an alert or the cause needs to be identified. Numerous neural network models have been tested, and three of them have been selected. You Only Look Once (YOLO) has been used for spatial analysis, a Convolutional Recurrent Neural Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. For training these models, supervised learning was chosen, and two proprietary datasets were created. The first dataset focuses on objects that may play a potential role in anomalies, while the second consists of videos containing anomalies or non-anomalies. This approach allows for the processing of both continuous video streams and finite videos, providing greater flexibility in detection.
zh
[CV-19] JetFormer: An Autoregressive Generative Model of Raw Images and Text
【速读】: 该论文试图解决多模态模型中依赖于单独训练的组件(如特定模态的编码器和解码器)的问题,提出了一种更为简化的图像和文本联合生成模型。解决方案的关键在于提出了一种自回归解码器仅变换器模型——JetFormer,该模型通过直接最大化原始数据的似然性进行训练,无需依赖任何预训练组件,并且能够理解和生成文本和图像。具体来说,JetFormer利用归一化流模型(normalizing flow model)获得软令牌图像表示,该表示与自回归多模态变换器联合训练。归一化流模型在推理过程中既作为感知任务的图像编码器,也作为图像生成任务的图像解码器。JetFormer在文本到图像生成质量上与基于VQ-VAE和VAE的基线模型相媲美,同时展示了强大的图像理解能力,是首个能够生成高保真图像并提供强似然性界限的模型。
链接: https://arxiv.org/abs/2411.19722
作者: Michael Tschannen,André Susano Pinto,Alexander Kolesnikov
关键词-EN: Removing modeling constraints, Removing modeling, training large multimodal, constraints and unifying, unifying architectures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer - JetFormer - which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. The normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation tasks during inference. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model that is capable of generating high-fidelity images and producing strong log-likelihood bounds.
zh
[CV-20] MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by Planar-Parallax Geometry in Automotive Applications WACV25
【速读】: 该论文试图解决自监督单目深度估计(Self-supervised Monocular Depth Estimation, MDE)中常见的尺度不变性问题,即在没有额外训练信号的情况下,深度估计结果通常是尺度不变的。解决方案的关键在于引入了一种新颖的自监督度量尺度MDE模型,该模型仅利用单目视频数据和摄像机的安装位置,这两者在现代车辆中都容易获得。具体来说,该方法利用平面视差几何(planar-parallax geometry)来重建场景结构,并通过多帧网络(multi-frame network)处理连续帧来估计静态场景的结构,进而作为教师网络,向单帧网络(singleframe network)传递尺度信息、可行驶区域掩码、度量尺度深度以及动态物体掩码。此外,多帧网络还辅助姿态网络(pose network)预测两幅连续图像之间的度量尺度相对姿态。该方法在KITTI驾驶基准测试中实现了最先进的度量尺度深度预测结果,并且是首批在具有挑战性的Cityscapes数据集上实现自监督度量尺度深度预测的方法之一。
链接: https://arxiv.org/abs/2411.19717
作者: Gasser Elazab,Torben Gräber,Michael Unterreiner,Olaf Hellwich
关键词-EN: self-supervised metric-scaled MDE, monocular depth estimation, gained popularity, popularity for obtaining, obtaining depth predictions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted at WACV 25, project page: this https URL
点击查看摘要
Abstract:Self-supervised monocular depth estimation (MDE) has gained popularity for obtaining depth predictions directly from videos. However, these methods often produce scale-invariant results, unless additional training signals are provided. Addressing this challenge, we introduce a novel self-supervised metric-scaled MDE model that requires only monocular video data and the camera’s mounting position, both of which are readily available in modern vehicles. Our approach leverages planar-parallax geometry to reconstruct scene structure. The full pipeline consists of three main networks: a multi-frame network, a single-frame network, and a pose network. The multi-frame network processes sequential frames to estimate the structure of the static scene using planar-parallax geometry and the camera mounting position. Based on this reconstruction, it acts as a teacher, distilling knowledge such as scale information, masked drivable area, metric-scale depth for the static scene, and dynamic object mask to the single-frame network. It also aids the pose network in predicting a metric-scaled relative pose between two subsequent images. Our method achieved state-of-the-art results for the driving benchmark KITTI for metric-scaled depth prediction. Notably, it is one of the first methods to produce self-supervised metric-scaled depth prediction for the challenging Cityscapes dataset, demonstrating its effectiveness and versatility.
zh
[CV-21] Forensics Adapter: Adapting CLIP for Generalizable Face Forgery Detection
【速读】: 该论文试图解决将通用视觉语言模型CLIP(Contrastive Language-Image Pretraining)高效且泛化地应用于人脸伪造检测的问题。解决方案的关键在于引入了一个名为Forensics Adapter的适配器网络,该适配器专门学习伪造人脸的独特痕迹——混合边界(blending boundaries),并通过任务特定的目标进行引导。通过这种设计,适配器能够增强CLIP的视觉标记(visual tokens),并采用专门的交互策略在CLIP和适配器之间传递知识,从而显著提升人脸伪造检测的性能。该方法仅使用5.7M的可训练参数,在五个标准数据集上平均提升了约7%的检测准确率,为未来的CLIP基人脸伪造检测方法提供了新的基准。
链接: https://arxiv.org/abs/2411.19715
作者: Xinjie Cui,Yuezun Li,Ao Luo,Jiaran Zhou,Junyu Dong
关键词-EN: describe the Forensics, adapter network designed, Forensics Adapter, face forgery, face forgery detector
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We describe the Forensics Adapter, an adapter network designed to transform CLIP into an effective and generalizable face forgery detector. Although CLIP is highly versatile, adapting it for face forgery detection is non-trivial as forgery-related knowledge is entangled with a wide range of unrelated knowledge. Existing methods treat CLIP merely as a feature extractor, lacking task-specific adaptation, which limits their effectiveness. To address this, we introduce an adapter to learn face forgery traces – the blending boundaries unique to forged faces, guided by task-specific objectives. Then we enhance the CLIP visual tokens with a dedicated interaction strategy that communicates knowledge across CLIP and the adapter. Since the adapter is alongside CLIP, its versatility is highly retained, naturally ensuring strong generalizability in face forgery detection. With only 5.7M trainable parameters, our method achieves a significant performance boost, improving by approximately 7% on average across five standard datasets. We believe the proposed method can serve as a baseline for future CLIP-based face forgery detection methods.
zh
[CV-22] The Streetscape Application Services Stack (SASS): Towards a Distributed Sensing Architecture for Urban Applications
【速读】: 该论文试图解决城市环境中智能街道应用中的复杂性和可扩展性问题,特别是在行人安全和自适应交通管理方面。解决方案的关键在于Streetscape Application Services Stack (SASS),它通过三个核心服务来应对这些挑战:多模态数据同步(Multimodal Data Synchronization)、时空数据融合(Spatiotemporal Data Fusion)和分布式边缘计算(Distributed Edge Computing)。SASS通过提供清晰的、可组合的抽象层,简化了多模态数据的集成,提高了数据同步的精度(减少88%的时间错位误差),增强了目标检测的准确性(提高10%以上),并显著提升了系统吞吐量(增加一个数量级),从而支持实时、可扩展的城市应用。
链接: https://arxiv.org/abs/2411.19714
作者: Navid Salami Pargoo,Mahshid Ghasemi,Shuren Xia,Mehmet Kerem Turkcan,Taqiya Ehsan,Chengbo Zang,Yuan Sun,Javad Ghaderi,Gil Zussman,Zoran Kostic,Jorge Ortiz
关键词-EN: urban populations grow, smart cities, populations grow, driving the deployment, multimodal data synchronization
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:As urban populations grow, cities are becoming more complex, driving the deployment of interconnected sensing systems to realize the vision of smart cities. These systems aim to improve safety, mobility, and quality of life through applications that integrate diverse sensors with real-time decision-making. Streetscape applications-focusing on challenges like pedestrian safety and adaptive traffic management-depend on managing distributed, heterogeneous sensor data, aligning information across time and space, and enabling real-time processing. These tasks are inherently complex and often difficult to scale. The Streetscape Application Services Stack (SASS) addresses these challenges with three core services: multimodal data synchronization, spatiotemporal data fusion, and distributed edge computing. By structuring these capabilities as clear, composable abstractions with clear semantics, SASS allows developers to scale streetscape applications efficiently while minimizing the complexity of multimodal integration. We evaluated SASS in two real-world testbed environments: a controlled parking lot and an urban intersection in a major U.S. city. These testbeds allowed us to test SASS under diverse conditions, demonstrating its practical applicability. The Multimodal Data Synchronization service reduced temporal misalignment errors by 88%, achieving synchronization accuracy within 50 milliseconds. Spatiotemporal Data Fusion service improved detection accuracy for pedestrians and vehicles by over 10%, leveraging multicamera integration. The Distributed Edge Computing service increased system throughput by more than an order of magnitude. Together, these results show how SASS provides the abstractions and performance needed to support real-time, scalable urban applications, bridging the gap between sensing infrastructure and actionable streetscape intelligence.
zh
[CV-23] Explaining the Impact of Training on Vision Models via Activation Clustering
【速读】: 该论文试图解决可解释人工智能(XAI)领域中视觉模型特征编码器所提取信息的问题。解决方案的关键在于提出了一种名为神经激活视觉解释(Neuro-Activated Vision Explanations, NAVE)的方法,通过聚类冻结网络的特征激活来提取编码器捕获的信息。NAVE不旨在解释模型的预测结果,而是回答图像中哪些部分被相似处理或深层网络中保留了哪些信息等问题。实验结果显示,训练数据集和监督程度影响所捕获的概念,并揭示了视觉变换器(ViT)中寄存器的影响以及训练集中水印Clever Hans效应引起的信息饱和。
链接: https://arxiv.org/abs/2411.19700
作者: Ahcène Boubekki,Samuel G. Fadel,Sebastian Mair
关键词-EN: explainable artificial intelligence, vision models investigate, Recent developments, Neuro-Activated Vision Explanations, artificial intelligence
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent developments in the field of explainable artificial intelligence (XAI) for vision models investigate the information extracted by their feature encoder. We contribute to this effort and propose Neuro-Activated Vision Explanations (NAVE), which extracts the information captured by the encoder by clustering the feature activations of the frozen network to be explained. The method does not aim to explain the model’s prediction but to answer questions such as which parts of the image are processed similarly or which information is kept in deeper layers. Experimentally, we leverage NAVE to show that the training dataset and the level of supervision affect which concepts are captured. In addition, our method reveals the impact of registers on vision transformers (ViT) and the information saturation caused by the watermark Clever Hans effect in the training set.
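The clustering step can be reproduced in a few lines: run k-means on the spatial activations of the frozen encoder and upsample the cluster assignment to image resolution to obtain a segmentation-like explanation map. The number of clusters and the nearest-neighbour upsampling are arbitrary choices made for this sketch.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def activation_clusters(feature_map, image_size, k=8):
    """feature_map: (C, h, w) activations of the frozen encoder; image_size: (H, W) of the input image."""
    c, h, w = feature_map.shape
    flat = feature_map.permute(1, 2, 0).reshape(-1, c).cpu().numpy()      # (h*w, C) activation vectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(flat)            # cluster the activations
    label_map = torch.from_numpy(labels).float().reshape(1, 1, h, w)
    return F.interpolate(label_map, size=image_size, mode="nearest").long()[0, 0]  # (H, W) cluster ids
```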
zh
[CV-24] Gated-Attention Feature-Fusion Based Framework for Poverty Prediction ICDE
【速读】: 该论文试图解决在发展中地区准确估计贫困水平的问题,传统方法如家庭调查成本高、频率低且迅速过时。解决方案的关键在于提出了一种先进的卷积神经网络 (CNN) 架构,该架构基于 ResNet50 模型并集成了门控注意力特征融合模块 (Gated-Attention Feature-Fusion Module, GAFM)。这一设计旨在增强模型捕捉和结合卫星图像中全局和局部特征的能力,从而实现更精确的贫困估计。通过这种改进,模型在贫困地图绘制中达到了 75% 的 R2 分数,显著优于现有领先方法。
链接: https://arxiv.org/abs/2411.19690
作者: Muhammad Umer Ramzan,Wahab Khaddim,Muhammad Ehsan Rana,Usman Ali,Manohar Ali,Fiaz ul Hassan,Fatima Mehmood
关键词-EN: research paper addresses, Convolutional Neural Network, accurately estimating poverty, estimating poverty levels, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: The paper has accepted for publication at 5th International Conference on Data Engineering and Communication Technology (ICDECT)
点击查看摘要
Abstract:This research paper addresses the significant challenge of accurately estimating poverty levels using deep learning, particularly in developing regions where traditional methods like household surveys are often costly, infrequent, and quickly become outdated. To address these issues, we propose a state-of-the-art Convolutional Neural Network (CNN) architecture, extending the ResNet50 model by incorporating a Gated-Attention Feature-Fusion Module (GAFM). Our architecture is designed to improve the model’s ability to capture and combine both global and local features from satellite images, leading to more accurate poverty estimates. The model achieves a 75% R2 score, significantly outperforming existing leading methods in poverty mapping. This improvement is due to the model’s capacity to focus on and refine the most relevant features, filtering out unnecessary data, which makes it a powerful tool for remote sensing and poverty estimation.
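The module name suggests a gated mix of a global and a local feature branch; a generic block of that kind is sketched below. The concrete layer layout inside the paper's GAFM may differ, so treat this as an illustration of the pattern rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Learned sigmoid gate that mixes a global and a local feature map of equal shape."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, global_feat, local_feat):
        g = self.gate(torch.cat([global_feat, local_feat], dim=1))
        return g * global_feat + (1 - g) * local_feat   # element-wise gated combination
```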
zh
[CV-25] SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks
【速读】: 该论文试图解决视觉-语言模型 (Vision-Language Models, VLMs) 在医疗任务中,特别是视觉问答 (Visual Question Answering, VQA) 任务中的鲁棒性评估问题。当前的评估方法未能充分考虑模型在真实世界数据分布偏移下的表现,且缺乏对模型行为深入的语义理解和可解释性分析。解决方案的关键在于提出了一个名为 SURE-VQA 的新框架,该框架围绕三个核心要求:1) 评估模型在真实世界数据分布偏移下的鲁棒性;2) 使用大型语言模型 (Large Language Models, LLMs) 进行更准确的语义评估,以替代传统的词匹配度量;3) 引入有意义的基准线 (sanity baselines),以评估多模态数据对 VLM 的影响,从而提高模型的可解释性。通过在三个医疗数据集上进行实验,研究揭示了不同微调方法在面对四种分布偏移时的表现,验证了 SURE-VQA 框架的有效性。
链接: https://arxiv.org/abs/2411.19688
作者: Kim-Celine Kahl,Selen Erkan,Jeremias Traub,Carsten T. Lüth,Klaus Maier-Hein,Lena Maier-Hein,Paul F. Jaeger
关键词-EN: Visual Question Answering, Question Answering, Visual Question, patients and clinicians, great potential
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have great potential in medical tasks, like Visual Question Answering (VQA), where they could act as interactive assistants for both patients and clinicians. Yet their robustness to distribution shifts on unseen data remains a critical concern for safe deployment. Evaluating such robustness requires a controlled experimental setup that allows for systematic insights into the model’s behavior. However, we demonstrate that current setups fail to offer sufficiently thorough evaluations, limiting their ability to accurately assess model robustness. To address this gap, our work introduces a novel framework, called SURE-VQA, centered around three key requirements to overcome the current pitfalls and systematically analyze the robustness of VLMs: 1) Since robustness on synthetic shifts does not necessarily translate to real-world shifts, robustness should be measured on real-world shifts that are inherent to the VQA data; 2) Traditional token-matching metrics often fail to capture underlying semantics, necessitating the use of large language models (LLMs) for more accurate semantic evaluation; 3) Model performance often lacks interpretability due to missing sanity baselines, thus meaningful baselines should be reported that allow assessing the multimodal impact on the VLM. To demonstrate the relevance of this framework, we conduct a study on the robustness of various fine-tuning methods across three medical datasets with four different types of distribution shifts. Our study reveals several important findings: 1) Sanity baselines that do not utilize image data can perform surprisingly well; 2) We confirm LoRA as the best-performing PEFT method; 3) No PEFT method consistently outperforms others in terms of robustness to shifts. Code is provided at this https URL.
zh
[CV-26] TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting
【速读】: 该论文试图解决现有方法在生成3D网格的PBR材质时,由于使用预训练的2D扩散模型进行多视角图像合成,导致生成的纹理与输入的3D网格之间存在严重不一致的问题。解决方案的关键是提出了一种名为TexGaussian的新方法,该方法利用八分体对齐的3D高斯分布(octant-aligned 3D Gaussian Splatting)来快速生成PBR材质。具体来说,TexGaussian将每个3D高斯分布放置在从输入3D网格构建的八叉树的最细叶节点上,以渲染多视角图像,不仅包括反照率图,还包括粗糙度和金属度。此外,该模型采用回归方式进行训练,而非扩散去噪,能够在单次前向传播过程中生成PBR材质。实验结果表明,TexGaussian在无条件和文本条件场景下均能生成更美观且与几何结构更一致的PBR材质,并且运行速度优于以往方法。
链接: https://arxiv.org/abs/2411.19654
作者: Bojun Xiong,Jialun Liu,Jiakui Hu,Chenming Wu,Jinbo Wu,Xing Liu,Chen Zhao,Errui Ding,Zhouhui Lian
关键词-EN: Physically Based Rendering, enabling photorealistic rendering, Physically Based, Based Rendering, photorealistic rendering
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Technical Report
点击查看摘要
Abstract:Physically Based Rendering (PBR) materials play a crucial role in modern graphics, enabling photorealistic rendering across diverse environment maps. Developing an effective and efficient algorithm that is capable of automatically generating high-quality PBR materials rather than RGB texture for 3D meshes can significantly streamline the 3D content creation. Most existing methods leverage pre-trained 2D diffusion models for multi-view image synthesis, which often leads to severe inconsistency between the generated textures and input 3D meshes. This paper presents TexGaussian, a novel method that uses octant-aligned 3D Gaussian Splatting for rapid PBR material generation. Specifically, we place each 3D Gaussian on the finest leaf node of the octree built from the input 3D mesh to render the multiview images not only for the albedo map but also for roughness and metallic. Moreover, our model is trained in a regression manner instead of diffusion denoising, capable of generating the PBR material for a 3D mesh in a single feed-forward process. Extensive experiments on publicly available benchmarks demonstrate that our method synthesizes more visually pleasing PBR materials and runs faster than previous methods in both unconditional and text-conditional scenarios, which exhibit better consistency with the given geometry. Our code and trained models are available at this https URL.
zh
[CV-27] Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing WACV2025
【速读】: 该论文试图解决文本引导图像生成和编辑中,现有无调优方法在保持图像保真度和编辑精度之间平衡的难题。解决方案的关键在于分析重建过程中的结构问题,并提出一种新方法,通过用均匀注意力图(uniform attention maps)替代传统的交叉注意力机制(cross-attention mechanism),显著提高图像重建的保真度。此外,论文还引入了一种自适应掩码引导的编辑技术,与重建方法无缝集成,确保编辑任务的一致性和准确性。实验结果表明,该方法在实现高保真图像重建的同时,在实际图像合成和编辑场景中也表现出色。
链接: https://arxiv.org/abs/2411.19652
作者: Wenyi Mo,Tianyu Zhang,Yalong Bai,Bing Su,Ji-Rong Wen
关键词-EN: achieved remarkable advancements, Text-guided image generation, remarkable advancements, Text-guided image, achieved remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to WACV 2025
点击查看摘要
Abstract:Text-guided image generation and editing using diffusion models have achieved remarkable advancements. Among these, tuning-free methods have gained attention for their ability to perform edits without extensive model adjustments, offering simplicity and efficiency. However, existing tuning-free approaches often struggle with balancing fidelity and editing precision. Reconstruction errors in DDIM Inversion are partly attributed to the cross-attention mechanism in U-Net, which introduces misalignments during the inversion and reconstruction process. To address this, we analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps, significantly enhancing image reconstruction fidelity. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. To complement this improvement, we introduce an adaptive mask-guided editing technique that integrates seamlessly with our reconstruction approach, ensuring consistency and accuracy in editing tasks. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios. This study underscores the potential of uniform attention maps to enhance the fidelity and versatility of diffusion-based image processing methods. Code is available at this https URL.
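Replacing a cross-attention map with a uniform one has a simple meaning: every image query puts weight 1/N on each of the N text-token values, so the attended output is just their mean. The sketch below shows that operation in isolation; it is not the authors' U-Net modification.

```python
import torch

def uniform_cross_attention(queries, values):
    """queries: (B, Q, D) image tokens; values: (B, N, D) projected text-token values."""
    b, q, _ = queries.shape
    n = values.shape[1]
    attn = torch.full((b, q, n), 1.0 / n,
                      dtype=values.dtype, device=values.device)   # uniform map, ignores query content
    return attn @ values                                          # each query gets the mean value vector
```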
zh
[CV-28] GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
【速读】: 该论文试图解决开放词汇3D物体功能性区域定位的问题,即通过任意指令预测3D物体上的“动作可能性”区域,这对于机器人普遍感知真实场景并响应操作变化至关重要。现有方法主要通过结合图像或语言与3D几何体来引入外部交互先验,但这些方法在利用隐含的不变几何体和潜在交互意图方面存在局限性,导致语义空间受限。论文提出的解决方案之关键是GREAT(GeometRy-intEntion collAboraTive inference)框架,该框架通过挖掘物体的不变几何属性,并在潜在交互场景中进行类比推理,形成功能性知识,从而全面结合几何和视觉内容来定位3D物体的功能性区域。此外,论文还引入了目前最大的3D物体功能性数据集Point Image Affordance Dataset v2 (PIADv2)来支持这一任务。
链接: https://arxiv.org/abs/2411.19626
作者: Yawen Shao,Wei Zhai,Yuhang Yang,Hongchen Luo,Yang Cao,Zheng-Jun Zha
关键词-EN: action possibilities regions, generically perceive real, perceive real scenarios, affordance grounding aims, object affordance grounding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions, which is crucial for robots to generically perceive real scenarios and respond to operational changes. Existing methods focus on combining images or languages that depict interactions with 3D geometries to introduce external interaction priors. However, they are still vulnerable to a limited semantic space by failing to leverage implied invariant geometries and potential interaction intentions. Normally, humans address complex tasks through multi-step reasoning and respond to diverse situations by leveraging associative and analogical thinking. In light of this, we propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding, a novel framework that mines the object invariant geometry attributes and performs analogical reasoning in potential interaction scenarios to form affordance knowledge, fully combining the knowledge with both geometries and visual contents to ground 3D object affordance. Besides, we introduce the Point Image Affordance Dataset v2 (PIADv2), the largest 3D object affordance dataset at present to support the task. Extensive experiments demonstrate the effectiveness and superiority of GREAT. Code and dataset are available at the project page.
zh
[CV-29] FairDD: Fair Dataset Distillation via Synchronized Matching
【速读】: 该论文试图解决在图像分类任务中,数据集蒸馏(Dataset Distillation, DD)可能导致对受保护属性(Protected Attributes, PA)如性别和种族的不公平偏见问题。解决方案的关键在于提出了一种新的公平数据集蒸馏(Fair Dataset Distillation, FDD)框架,即FairDD。FairDD的核心创新在于同步匹配合成数据集与原始数据集中按受保护属性划分的各个子组,而不是像传统DD方法那样无差别地对齐整个分布,后者往往被多数群体主导。这种同步匹配机制使得合成数据集能够避免偏向多数群体,并促进对所有受保护属性群体的平衡生成。FairDD通过这种方式有效地规范了传统DD方法,使其在保持目标属性分类准确性的同时,减少对少数群体的偏见。
链接: https://arxiv.org/abs/2411.19623
作者: Qihang Zhou,Shenhao Fang,Shibo He,Wenchao Meng,Jiming Chen
关键词-EN: Condensing large datasets, Condensing large, counterparts has demonstrated, demonstrated its promise, datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Condensing large datasets into smaller synthetic counterparts has demonstrated its promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased towards protected attributes (PA), such as gender and race. Our investigation reveals that dataset distillation (DD) fails to alleviate the unfairness towards minority groups within original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge the research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches, requiring no modifications to their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of original datasets, rather than indiscriminate alignment to the whole distributions in vanilla DDs, dominated by majority groups. This synchronized matching allows synthetic datasets to avoid collapsing into majority groups and bootstrap their balanced generation to all PA groups. Consequently, FairDD could effectively regularize vanilla DDs to favor biased generation toward minority groups while maintaining the accuracy of target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness compared to vanilla DD methods, without sacrificing classification accuracy. Its consistent superiority across diverse DDs, spanning Distribution and Gradient Matching, establishes it as a versatile FDD approach.
zh
[CV-30] TOrtho-Gaussian: Splatting True Digital Orthophoto Maps
【速读】: 该论文试图解决真数字正射影像图 (True Digital Orthophoto Maps, TDOM) 传统生成流程中面临的多种挑战,包括不准确的数字表面模型 (Digital Surface Model, DSM)、退化的遮挡检测以及在弱纹理区域和反射表面上的视觉伪影等问题。解决方案的关键在于引入了一种名为 TOrtho-Gaussian 的新方法,该方法受 3D 高斯喷射 (3D Gaussian Splatting, 3DGS) 启发,通过优化各向异性高斯核的正交喷射来生成 TDOM。具体来说,该方法通过将高斯核正交投影到二维图像平面上,简化了正射影像的生成过程,避免了显式的 DSM 和遮挡检测需求。此外,采用分治策略优化了 3DGS 的内存使用和训练渲染效率,并设计了适应不同区域特征的各向异性高斯核,特别是提高了反射表面和细长结构的渲染质量。实验结果表明,该方法在建筑物边界精度、低纹理区域和建筑物立面的视觉质量等方面优于现有的商业软件。
链接: https://arxiv.org/abs/2411.19594
作者: Xin Wang,Wendi Zhang,Hong Xie,Haibin Ai,Qiangqiang Yuan,Zongqian Zhan
关键词-EN: Geographic Information Systems, True Digital Orthophoto, Digital Orthophoto Maps, Information Systems, Geographic Information
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE Transactions on Geoscience and Remote Sensing for possible publication
点击查看摘要
Abstract:True Digital Orthophoto Maps (TDOMs) are essential products for digital twins and Geographic Information Systems (GIS). Traditionally, TDOM generation involves a complex set of traditional photogrammetric process, which may deteriorate due to various challenges, including inaccurate Digital Surface Model (DSM), degenerated occlusion detections, and visual artifacts in weak texture regions and reflective surfaces, etc. To address these challenges, we introduce TOrtho-Gaussian, a novel method inspired by 3D Gaussian Splatting (3DGS) that generates TDOMs through orthogonal splatting of optimized anisotropic Gaussian kernel. More specifically, we first simplify the orthophoto generation by orthographically splatting the Gaussian kernels onto 2D image planes, formulating a geometrically elegant solution that avoids the need for explicit DSM and occlusion detection. Second, to produce TDOM of large-scale area, a divide-and-conquer strategy is adopted to optimize memory usage and time efficiency of training and rendering for 3DGS. Lastly, we design a fully anisotropic Gaussian kernel that adapts to the varying characteristics of different regions, particularly improving the rendering quality of reflective surfaces and slender structures. Extensive experimental evaluations demonstrate that our method outperforms existing commercial software in several aspects, including the accuracy of building boundaries, the visual quality of low-texture regions and building facades. These results underscore the potential of our approach for large-scale urban scene reconstruction, offering a robust alternative for enhancing TDOM quality and scalability.
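为帮助理解“将 3D 高斯核正交投影到二维像平面”这一步,下面给出一个最简化的线性代数示意:正交投影只取旋转到像平面坐标系后的 x、y 两个分量,没有透视除法,因此无需显式 DSM。矩阵记号与函数名均为假设,渲染排序、alpha 混合以及多通道输出等均未涉及,并非论文官方实现。

```python
import numpy as np

def ortho_project_gaussian(mu3d, cov3d, R):
    # mu3d: (3,) 高斯均值;cov3d: (3, 3) 协方差;R: (3, 3) 世界系到正射像平面系的旋转
    P = R[:2, :]                 # 正交投影:只取旋转后的 x、y 分量,无透视除法
    mu2d = P @ mu3d              # (2,)
    cov2d = P @ cov3d @ P.T      # (2, 2)
    return mu2d, cov2d
```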
zh
[CV-31] Gaussian Splashing: Direct Volumetric Rendering Underwater
【速读】: 该论文试图解决水下图像中由于水的遮挡导致的三维重建问题,特别是在使用现有的神经辐射场方法(NeRFs)和三维高斯喷射(3DGS)时,这些方法在空气场景中表现良好,但在水下场景中由于遮挡和散射效应而失效。解决方案的关键在于提出了一种名为“高斯喷溅”(Gaussian Splashing)的新方法,该方法结合了3DGS的速度优势和一种新的图像形成模型,用于捕捉散射效应。该方法在渲染和深度估计过程中引入了创新,并改进了3DGS的损失函数,从而能够在几分钟内完成重建,并以140帧每秒(FPS)的速度渲染新场景,显著提高了重建图像的细节和远距离场景的清晰度。
链接: https://arxiv.org/abs/2411.19588
作者: Nir Mualem,Roy Amoyal,Oren Freifeld,Derya Akkaynak
关键词-EN: Neural Radiance Field, occluded by water, features are occluded, Radiance Field methods, underwater
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In underwater images, most useful features are occluded by water. The extent of the occlusion depends on imaging geometry and can vary even across a sequence of burst images. As a result, 3D reconstruction methods robust on in-air scenes, like Neural Radiance Field methods (NeRFs) or 3D Gaussian Splatting (3DGS), fail on underwater scenes. While a recent underwater adaptation of NeRFs achieved state-of-the-art results, it is impractically slow: reconstruction takes hours and its rendering rate, in frames per second (FPS), is less than 1. Here, we present a new method that takes only a few minutes for reconstruction and renders novel underwater scenes at 140 FPS. Named Gaussian Splashing, our method unifies the strengths and speed of 3DGS with an image formation model for capturing scattering, introducing innovations in the rendering and depth estimation procedures and in the 3DGS loss function. Despite the complexities of underwater adaptation, our method produces images at unparalleled speeds with superior details. Moreover, it reveals distant scene details with far greater clarity than other methods, dramatically improving reconstructed and rendered images. We demonstrate results on existing datasets and a new dataset we have collected. Additional visual results are available at: this https URL.
zh
[CV-32] LDA-AQU: Adaptive Query-guided Upsampling via Local Deformable Attention ACM-MM2024
【速读】: 该论文试图解决现有上采样方法在深度卷积神经网络中缺乏特定特征指导或依赖高分辨率特征图的问题,导致性能和灵活性下降。解决方案的关键在于将局部自注意力机制(local self-attention)引入上采样任务,并提出了一种基于局部自注意力的变形机制(deformation mechanism),即LDA-AQU。LDA-AQU利用查询特征(queries)自适应地调整邻近点的位置和聚合权重,从而在各种复杂场景中满足上采样需求。该方法不仅轻量级且易于集成到多种模型架构中,并在物体检测、实例分割、全景分割和语义分割四项密集预测任务中均表现出色,显著优于现有最先进的上采样方法。
链接: https://arxiv.org/abs/2411.19585
作者: Zewen Du,Zhenjiang Hu,Guiyu Zhao,Ying Jin,Hongbin Ma
关键词-EN: convolutional neural networks, constructing deep convolutional, deep convolutional neural, local self-attention, Feature
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ACM MM2024
点击查看摘要
Abstract:Feature upsampling is an essential operation in constructing deep convolutional neural networks. However, existing upsamplers either lack specific feature guidance or necessitate the utilization of high-resolution feature maps, resulting in a loss of performance and flexibility. In this paper, we find that the local self-attention naturally has the feature guidance capability, and its computational paradigm aligns closely with the essence of feature upsampling (i.e., feature reassembly of neighboring points). Therefore, we introduce local self-attention into the upsampling task and demonstrate that the majority of existing upsamplers can be regarded as special cases of upsamplers based on local self-attention. Considering the potential semantic gap between upsampled points and their neighboring points, we further introduce the deformation mechanism into the upsampler based on local self-attention, thereby proposing LDA-AQU. As a novel dynamic kernel-based upsampler, LDA-AQU utilizes the feature of queries to guide the model in adaptively adjusting the position and aggregation weight of neighboring points, thereby meeting the upsampling requirements across various complex scenarios. In addition, LDA-AQU is lightweight and can be easily integrated into various model architectures. We evaluate the effectiveness of LDA-AQU across four dense prediction tasks: object detection, instance segmentation, panoptic segmentation, and semantic segmentation. LDA-AQU consistently outperforms previous state-of-the-art upsamplers, achieving performance enhancements of 1.7 AP, 1.5 AP, 2.0 PQ, and 2.5 mIoU compared to the baseline models in the aforementioned four tasks, respectively. Code is available at this https URL.
zh
[CV-33] Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding
【速读】: 该论文试图解决在3D高斯喷洒(3D Gaussian Splatting, 3DGS)中注入语义信息时,现有方法依赖2D监督导致跨视图语义一致性差和数据准备复杂的问题。解决方案的关键在于提出了FreeGS框架,该框架通过引入身份耦合语义场(IDentity-coupled Semantic Field, IDSF)来实现无监督的视图一致性3D场景理解。IDSF不仅捕捉语义表示,还捕捉视图一致的实例索引,并通过两步交替优化策略来优化IDSF,同时采用2D-3D联合对比损失来增强视图一致的3D几何与丰富语义之间的互补性。这一方法避免了复杂的2D标签依赖和数据预处理过程,同时在多个数据集上展示了与现有最先进方法相当的表现。
链接: https://arxiv.org/abs/2411.19551
作者: Wenbo Zhang,Lu Zhang,Ping Hu,Liqian Ma,Yunzhi Zhuge,Huchuan Lu
关键词-EN: garnered significant attention, recently garnered significant, Gaussian Splatting, Injecting semantics, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Injecting semantics into 3D Gaussian Splatting (3DGS) has recently garnered significant attention. While current approaches typically distill 3D semantic features from 2D foundational models (e.g., CLIP and SAM) to facilitate novel view segmentation and semantic understanding, their heavy reliance on 2D supervision can undermine cross-view semantic consistency and necessitate complex data preparation processes, therefore hindering view-consistent scene understanding. In this work, we present FreeGS, an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. Instead of directly learning semantic features, we introduce the IDentity-coupled Semantic Field (IDSF) into 3DGS, which captures both semantic representations and view-consistent instance indices for each Gaussian. We optimize IDSF with a two-step alternating strategy: semantics help to extract coherent instances in 3D space, while the resulting instances regularize the injection of stable semantics from 2D space. Additionally, we adopt a 2D-3D joint contrastive loss to enhance the complementarity between view-consistent 3D geometry and rich semantics during the bootstrapping process, enabling FreeGS to uniformly perform tasks such as novel-view semantic segmentation, object selection, and 3D object detection. Extensive experiments on LERF-Mask, 3D-OVS, and ScanNet datasets demonstrate that FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
zh
[CV-34] ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration
【速读】: 该论文试图解决现有传感器模拟方法在渲染复杂驾驶场景(如多车道变换)时的局限性问题。解决方案的关键在于引入ReconDreamer,通过逐步集成世界模型知识来增强驾驶场景的重建。具体来说,提出了DriveRestorer来通过在线修复减少伪影,并采用渐进式数据更新策略以确保高质量渲染。据论文所述,ReconDreamer是首个能够有效渲染大型驾驶操作的方法,实验结果表明其在NTA-IoU、NTL-IoU和FID指标上均优于现有方法,特别是在大型驾驶操作渲染方面显著超越了DriveDreamer4D。
链接: https://arxiv.org/abs/2411.19548
作者: Chaojun Ni,Guosheng Zhao,Xiaofeng Wang,Zheng Zhu,Wenkang Qin,Guan Huang,Chen Liu,Yuyin Chen,Yida Wang,Xueyang Zhang,Yifei Zhan,Kun Zhan,Peng Jia,Xianpeng Lang,Xingang Wang,Wenjun Mei
关键词-EN: Closed-loop simulation, Closed-loop, autonomous driving, driving, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project Page: this https URL
点击查看摘要
Abstract:Closed-loop simulation is crucial for end-to-end autonomous driving. Existing sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes based on conditions that closely mirror training data distributions. However, these methods struggle with rendering novel trajectories, such as lane changes. Recent works have demonstrated that integrating world model knowledge alleviates these issues. Despite their efficiency, these approaches still encounter difficulties in the accurate representation of more complex maneuvers, with multi-lane shifts being a notable example. Therefore, we introduce ReconDreamer, which enhances driving scene reconstruction through incremental integration of world model knowledge. Specifically, DriveRestorer is proposed to mitigate artifacts via online restoration. This is complemented by a progressive data update strategy designed to ensure high-quality rendering for more complex maneuvers. To the best of our knowledge, ReconDreamer is the first method to effectively render in large maneuvers. Experimental results demonstrate that ReconDreamer outperforms Street Gaussians in the NTA-IoU, NTL-IoU, and FID, with relative improvements by 24.87%, 6.72%, and 29.97%. Furthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large maneuver rendering, as verified by a relative improvement of 195.87% in the NTA-IoU metric and a comprehensive user study.
zh
[CV-35] SkelMamba: A State Space Model for Efficient Skeleton Action Recognition of Neurological Disorders
【速读】: 该论文试图解决基于骨骼的人类动作识别问题,特别是在临床诊断和一般动作识别任务中提升现有技术的性能。解决方案的关键在于引入了一种基于状态空间模型 (State-Space Model, SSM) 的新框架,该框架采用解剖学引导的架构,将骨骼运动分析分解为空间、时间和时空流,并通过通道分区有效捕捉不同的运动特征。通过在SSM中实施结构化的多方向扫描策略,模型能够捕捉局部关节交互和全局运动模式,从而增强识别细微运动模式的能力,这对于医疗诊断中的步态异常等神经疾病尤为重要。该方法在NTU RGB+D、NTU RGB+D 120和NW-UCLA等公共动作识别基准测试中表现优异,准确率提升高达3.2%,且计算复杂度低于先前的基于Transformer的模型。此外,论文还引入了一个新的医疗数据集,用于验证该方法在基于运动的神经疾病自动诊断中的潜力。
链接: https://arxiv.org/abs/2411.19544
作者: Niki Martinel,Mariano Serrao,Christian Micheloni
关键词-EN: skeleton-based human action, based framework, architecture that improves, NTU RGB, framework for skeleton-based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We introduce a novel state-space model (SSM)-based framework for skeleton-based human action recognition, with an anatomically-guided architecture that improves state-of-the-art performance in both clinical diagnostics and general action recognition tasks. Our approach decomposes skeletal motion analysis into spatial, temporal, and spatio-temporal streams, using channel partitioning to capture distinct movement characteristics efficiently. By implementing a structured, multi-directional scanning strategy within SSMs, our model captures local joint interactions and global motion patterns across multiple anatomical body parts. This anatomically-aware decomposition enhances the ability to identify subtle motion patterns critical in medical diagnosis, such as gait anomalies associated with neurological conditions. On public action recognition benchmarks, i.e., NTU RGB+D, NTU RGB+D 120, and NW-UCLA, our model outperforms current state-of-the-art methods, achieving accuracy improvements up to 3.2% with lower computational complexity than previous leading transformer-based models. We also introduce a novel medical dataset for motion-based patient neurological disorder analysis to validate our method’s potential in automated disease diagnosis.
zh
[CV-36] Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook
【速读】: 该论文试图解决深度伪造(deepfake)内容日益逼真化带来的识别难题,特别是如何有效检测和防范这些伪造内容。解决方案的关键在于系统性地综述和分类现有的深度伪造生成与检测技术,涵盖图像、视频、音频及多模态内容,并提出一个新颖的多模态基准测试来评估检测器在分布外内容上的表现。论文还指出,现有的最先进检测器在面对未见过的深度伪造生成器时表现不佳,因此提出了未来研究方向以开发更鲁棒和强大的深度伪造检测器。
链接: https://arxiv.org/abs/2411.19537
作者: Florinel-Alin Croitoru,Andrei-Iulian Hiji,Vlad Hondru,Nicolae Catalin Ristea,Paul Irofti,Marius Popescu,Cristian Rusu,Radu Tudor Ionescu,Fahad Shahbaz Khan,Mubarak Shah
关键词-EN: Neural Radiance Fields, detect manipulated media, media content online, generative modeling, steady pace
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:With the recent advancements in generative modeling, the realism of deepfake content has been increasing at a steady pace, even reaching the point where people often fail to detect manipulated media content online, thus being deceived into various kinds of scams. In this paper, we survey deepfake generation and detection techniques, including the most recent developments in the field, such as diffusion models and Neural Radiance Fields. Our literature review covers all deepfake media types, comprising image, video, audio and multimodal (audio-visual) content. We identify various kinds of deepfakes, according to the procedure used to alter or generate the fake content. We further construct a taxonomy of deepfake generation and detection methods, illustrating the important groups of methods and the domains where these methods are applied. Next, we gather datasets used for deepfake detection and provide updated rankings of the best performing deepfake detectors on the most popular datasets. In addition, we develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content. The results indicate that state-of-the-art detectors fail to generalize to deepfake content generated by unseen deepfake generators. Finally, we propose future directions to obtain robust and powerful deepfake detectors. Our project page and new benchmark are available at this https URL.
zh
[CV-37] QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain
【速读】: 该论文试图解决生成式文本到图像模型在量化对象数量时面临的领域特定性问题,即避免为每个新的图像领域重新训练模型所带来的高计算成本和有限的可扩展性。解决方案的关键在于提出了QUOTA,一个优化框架,通过双循环元学习策略优化领域不变提示词,结合可学习的计数和领域标记,实现了在不重新训练模型的情况下,对未见领域的对象数量进行有效量化。该方法不仅捕捉了风格变化,还保持了准确性,即使在训练中未遇到的对象类别上也能表现出色。
链接: https://arxiv.org/abs/2411.19534
作者: Wenfang Sun,Yingjun Du,Gaowen Liu,Cees G. M. Snoek
关键词-EN: quantifying the number, object quantification, object, object quantification accuracy, quantification
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 6 figures
点击查看摘要
Abstract:We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new benchmark for efficient and scalable text-to-image generation for any domain.
zh
[CV-38] RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation
【速读】: 该论文试图解决在标准服装资产生成过程中,由于高度标准化的采样分布和精确的结构要求,现有模型在空间感知能力有限且容易产生结构幻觉的问题。解决方案的关键在于提出了一个名为RAGDiffusion的新型检索增强生成框架(Retrieval-Augmented Generation, RAG),通过整合外部知识(如大型语言模型和数据库)来增强结构确定性和减少幻觉。RAGDiffusion的核心包括两个过程:(1)基于检索的结构聚合,利用对比学习和结构局部线性嵌入(Structure Locally Linear Embedding, SLLE)来推导全局结构和空间地标,提供软硬指导以对抗结构模糊性;(2)全级别忠实服装生成,引入三级对齐机制,确保扩散过程中的结构、图案和解码组件的忠实性。
链接: https://arxiv.org/abs/2411.19528
作者: Xianfeng Tan,Yuhan Li,Wenxiang Shang,Yubo Wu,Jian Wang,Xuanhong Chen,Yi Zhang,Ran Lin,Bingbing Ni
关键词-EN: involves creating forward-facing, creating forward-facing flat-lay, highly standardized sampling, standardized sampling distributions, precise structural requirements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project website: this https URL
点击查看摘要
Abstract:Standard clothing asset generation involves creating forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized sampling distributions and precise structural requirements in the generated images. Existing models have limited spatial perception and often exhibit structural hallucinations in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating external knowledge from LLM and databases. RAGDiffusion consists of two core processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a three-level alignment that ensures fidelity in structural, pattern, and decoding components within the diffusion process. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and detail-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
zh
[CV-39] DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
【速读】: 该论文试图解决生成模型在处理连续动态的人类运动时面临的挑战,特别是离散量化方法(如VQ-VAEs)在表达能力和帧间噪声方面的局限性,以及连续方法在高维复杂性和训练数据有限情况下的不足。解决方案的关键在于引入DisCoRD(Discrete Tokens to Continuous Motion via Rectified Flow Decoding),通过矫正流(Rectified Flow)将离散运动标记解码为连续运动。DisCoRD在连续空间中采用迭代细化过程,捕捉细粒度动态并确保更平滑和自然的运动,同时保持对条件信号的忠实性。该方法兼容任何基于离散的框架,显著提升了运动生成的自然度,并在HumanML3D和KIT-ML数据集上实现了最先进的性能。
链接: https://arxiv.org/abs/2411.19527
作者: Jungbin Cho,Junwan Kim,Jisoo Kim,Minseo Kim,Mingu Kang,Sungeun Hong,Tae-Hyun Oh,Youngjae Yu
关键词-EN: presents significant challenges, Human motion, presents significant, generative models, significant challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages 18 figures
点击查看摘要
Abstract:Human motion, inherently continuous and dynamic, presents significant challenges for generative models. Despite their dominance, discrete quantization methods, such as VQ-VAEs, suffer from inherent limitations, including restricted expressiveness and frame-wise noise artifacts. Continuous approaches, while producing smoother and more natural motions, often falter due to high-dimensional complexity and limited training data. To resolve this “discord” between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that decodes discrete motion tokens into continuous motion through rectified flow. By employing an iterative refinement process in the continuous space, DisCoRD captures fine-grained dynamics and ensures smoother and more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results solidify DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Our project page is available at: this https URL.
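下面是修正流(Rectified Flow)标准训练目标的一个极简示意:在噪声 x0 与目标样本 x1 之间做直线插值,并让网络回归恒定速度场 x1 - x0。这只是该类方法的通用写法,model 的结构、条件输入 cond 以及离散运动标记如何作为条件均为假设,并非 DisCoRD 的官方实现。

```python
import torch

def rectified_flow_loss(model, x1, cond):
    # x1: 一个 batch 的目标连续运动表示;cond: 条件(如离散运动标记的嵌入),均为假设
    x0 = torch.randn_like(x1)                                  # 起点:高斯噪声
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                                 # 直线插值路径
    v_target = x1 - x0                                         # 该路径对应的恒定速度场
    v_pred = model(xt, t.flatten(), cond)                      # 网络预测速度
    return ((v_pred - v_target) ** 2).mean()
```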
zh
[CV-40] LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis
【速读】: 该论文试图解决基于神经辐射场 (NeRF) 的说话头合成中存在的视觉伪影和高训练成本问题。解决方案的关键在于建立细粒度和可泛化的驱动信号与生成结果之间的对应关系。具体来说,论文提出了 LokiTalk 框架,通过引入区域特定变形场 (Region-Specific Deformation Fields) 来分解整体肖像运动为唇部运动、眨眼、头部姿态和躯干运动,从而实现细粒度的对应关系。此外,通过层次化建模驱动信号及其相关区域,使用两个级联的变形场显著提高了动态准确性并减少了合成伪影。论文还提出了身份感知知识迁移 (ID-Aware Knowledge Transfer) 模块,从多身份视频中学习可泛化的动态和静态对应关系,同时提取身份特定的动态和静态特征以细化个体角色的描绘。这些创新显著提升了合成结果的保真度和训练效率。
链接: https://arxiv.org/abs/2411.19525
作者: Tianqi Li,Ruobing Zheng,Bonan Li,Zicheng Zhang,Meng Wang,Jingdong Chen,Ming Yang
关键词-EN: Neural Radiance Fields, large-scale commercial adoption, Neural Radiance, high training costs, training costs persist
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Despite significant progress in talking head synthesis since the introduction of Neural Radiance Fields (NeRF), visual artifacts and high training costs persist as major obstacles to large-scale commercial adoption. We propose that identifying and establishing fine-grained and generalizable correspondences between driving signals and generated results can simultaneously resolve both problems. Here we present LokiTalk, a novel framework designed to enhance NeRF-based talking heads with lifelike facial dynamics and improved training efficiency. To achieve fine-grained correspondences, we introduce Region-Specific Deformation Fields, which decompose the overall portrait motion into lip movements, eye blinking, head pose, and torso movements. By hierarchically modeling the driving signals and their associated regions through two cascaded deformation fields, we significantly improve dynamic accuracy and minimize synthetic artifacts. Furthermore, we propose ID-Aware Knowledge Transfer, a plug-and-play module that learns generalizable dynamic and static correspondences from multi-identity videos, while simultaneously extracting ID-specific dynamic and static features to refine the depiction of individual characters. Comprehensive evaluations demonstrate that LokiTalk delivers superior high-fidelity results and training efficiency compared to previous methods. The code will be released upon acceptance.
zh
[CV-41] Subjective and Objective Quality Assessment Methods of Stereoscopic Videos with Visibility Affecting Distortions
【速读】: 该论文试图解决立体视频(Stereoscopic 3D, S3D)质量评估的问题。解决方案的关键在于开发了一种无感知(Opinion Unaware, OU)和无失真(Distortion Unaware, DU)的视频质量评估模型。该模型通过构建单眼帧(cyclopean frames)并将其分割为非重叠块,分析自然场景统计(Natural Scene Statistics, NSS)特征,并使用单变量广义高斯分布(Univariate Generalized Gaussian Distribution, UGGD)对NSS特征进行建模。通过在多个空间尺度和方向上计算UGGD模型参数(α, β),并进行多元高斯(Multivariate Gaussian, MVG)建模,计算均值向量和协方差矩阵的巴塔查里亚距离(Bhattacharyya distance),从而估计测试视频与原始视频集之间的感知偏差,最终综合这些距离度量来评估S3D视频的整体质量。
链接: https://arxiv.org/abs/2411.19522
作者: Sria Biswas,Balasubramanyam Appina,Priyanka Kokil,Sumohana S Channappayya
关键词-EN: video, resolution stereoscopic, video dataset comprised, present two major, major contributions
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
点击查看摘要
Abstract:We present two major contributions in this work: 1) we create a full HD resolution stereoscopic (S3D) video dataset comprised of 12 reference and 360 distorted videos. The test stimuli are produced by simulating the five levels of fog and haze ambiances on the pristine left and right video sequences. We perform subjective analysis on the created video dataset with 24 viewers and compute Difference Mean Opinion Scores (DMOS) as quality representative of the dataset, 2) an Opinion Unaware (OU) and Distortion Unaware (DU) video quality assessment model is developed for S3D videos. We construct cyclopean frames from the individual views of an S3D video and partition them into nonoverlapping blocks. We analyze the Natural Scene Statistics (NSS) of all patches of pristine and test videos, and empirically model the NSS features with Univariate Generalized Gaussian Distribution (UGGD). We compute UGGD model parameters (α, β) at multiple spatial scales and multiple orientations of spherical steerable pyramid decomposition and show that the UGGD parameters are distortion discriminable. Further, we perform Multivariate Gaussian (MVG) modeling on the pristine and distorted video feature sets and compute the corresponding mean vectors and covariance matrices of MVG fits. We compute the Bhattacharyya distance measure between mean vectors and covariance matrices to estimate the perceptual deviation of a test video from pristine video set. Finally, we pool both distance measures to estimate the overall quality score of an S3D video. The performance of the proposed objective algorithm is verified on the popular S3D video datasets such as IRCCYN, LFOVIAS3DPh1, LFOVIAS3DPh2 and the proposed VAD stereo dataset. The algorithm delivers consistent performance across all datasets and shows competitive performance against off-the-shelf 2D and 3D image and video quality assessment algorithms.
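摘要中用到的两组多元高斯(MVG)拟合之间的 Bhattacharyya 距离是一个标准公式,下面给出其直接实现作为参考;NSS/UGGD 特征提取与最终分数池化不在此示意范围内,函数名为假设,并非论文官方代码。

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    # mu*: (d,) 均值向量;cov*: (d, d) 协方差矩阵
    def logdet(m):
        return np.linalg.slogdet(m)[1]                 # 用 slogdet 保证数值稳定
    cov = (cov1 + cov2) / 2.0
    diff = (mu1 - mu2).reshape(-1, 1)
    term1 = 0.125 * float(diff.T @ np.linalg.inv(cov) @ diff)
    term2 = 0.5 * (logdet(cov) - 0.5 * (logdet(cov1) + logdet(cov2)))
    return term1 + term2
```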
zh
[CV-42] Retrieval-guided Cross-view Image Synthesis
【速读】: 该论文试图解决跨视图图像合成中的几个关键问题,包括对额外数据的依赖、视图特定语义的关注不足以及缺乏多样化的复杂城市环境数据集。解决方案的关键在于:1) 提出了一种新颖的检索引导框架,利用检索网络作为嵌入器来解决领域差距;2) 设计了一个创新的生成器,增强目标视图的语义一致性和多样性,以提高图像质量和真实感;3) 引入了一个新的数据集 VIGOR-GEN,提供了多样化的城市环境中的跨视图图像对,以丰富数据集的多样性。这些创新显著提升了生成图像的真实感,并在多个数据集上的实验中超越了当前领先的方法。
链接: https://arxiv.org/abs/2411.19510
作者: Hongji Yang,Yiru Li,Yingying Zhu
关键词-EN: synthesis involves generating, image synthesis involves, Cross-view image synthesis, synthesis involves, involves generating
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Cross-view image synthesis involves generating new images of a scene from different viewpoints or perspectives, given one input image from other viewpoints. Despite recent advancements, there are several limitations in existing methods: 1) reliance on additional data such as semantic segmentation maps or preprocessing modules to bridge the domain gap; 2) insufficient focus on view-specific semantics, leading to compromised image quality and realism; and 3) a lack of diverse datasets representing complex urban environments. To tackle these challenges, we propose: 1) a novel retrieval-guided framework that employs a retrieval network as an embedder to address the domain gap; 2) an innovative generator that enhances semantic consistency and diversity specific to the target view to improve image quality and realism; and 3) a new dataset, VIGOR-GEN, providing diverse cross-view image pairs in urban settings to enrich dataset diversity. Extensive experiments on well-known CVUSA, CVACT, and new VIGOR-GEN datasets demonstrate that our method generates images of superior realism, significantly outperforming current leading approaches, particularly in SSIM and FID evaluations.
zh
[CV-43] Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis
【速读】: 该论文试图解决基于扩散模型的音频驱动说话头合成中的三个主要问题:慢推理速度、对面部运动的精细控制不足以及偶尔出现的视觉伪影。解决方案的关键在于引入Ditto框架,通过显式的身份无关运动空间替代传统的VAE隐式表示,从而显著降低扩散学习的复杂性并实现对合成说话头的精确控制。此外,论文提出了一种联合优化音频特征提取、运动生成和视频合成的推理策略,以实现流式处理、实时推理和低首帧延迟,这对于交互式应用如AI助手至关重要。
链接: https://arxiv.org/abs/2411.19509
作者: Tianqi Li,Ruobing Zheng,Minghui Yang,Jingdong Chen,Ming Yang
关键词-EN: Recent advances, revolutionized audio-driven talking, models have revolutionized, revolutionized audio-driven, talking head synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel in generating subtle expressions and natural head movements that are well-aligned with the audio signal. However, these methods are confronted by slow inference speed, insufficient fine-grained control over facial motions, and occasional visual artifacts largely due to an implicit latent space derived from Variational Auto-Encoders (VAE), which prevent their adoption in realtime interaction applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space, replacing conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, realtime inference, and low first-frame delay, which are the functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.
zh
[CV-44] An Approach Towards Learning K-means-friendly Deep Latent Representation
【速读】: 该论文试图解决高维数据(如图像)在传统基于质心的聚类方法(如K-means)中面临的困难。解决方案的关键在于提出了一种交替学习聚类友好数据表示和基于K-means的聚类中心的方法。具体来说,论文建议在每次数据批次更新时,同时学习潜在空间中的数据表示和聚类中心,而不是像传统K-means那样保持聚类空间不变。这种方法通过实验证明在基准数据集上优于先前的聚类方法。
链接: https://arxiv.org/abs/2411.19496
作者: Debapriya Roy
关键词-EN: long-standing problem area, Clustering, K-means, long-standing problem, problem area
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Clustering is a long-standing problem area in data mining. The centroid-based classical approaches to clustering mainly face difficulty in the case of high dimensional inputs such as images. With the advent of deep neural networks, a common approach to this problem is to map the data to some latent space of comparatively lower dimensions and then do the clustering in that space. Network architectures adopted for this are generally autoencoders that reconstruct a given input in the output. To keep the input in some compact form, the encoder in AE’s learns to extract useful features that get decoded at the reconstruction end. A well-known centroid-based clustering algorithm is K-means. In the context of deep feature learning, recent works have empirically shown the importance of learning the representations and the cluster centroids together. However, in this aspect of joint learning, recently a continuous variant of K-means has been proposed; where the softmax function is used in place of argmax to learn the clustering and network parameters jointly using stochastic gradient descent (SGD). However, unlike K-means, where the input space stays constant, here the learning of the centroid is done in parallel to the learning of the latent space for every batch of data. Such batch updates disagree with the concept of classical K-means, where the clustering space remains constant as it is the input space itself. To this end, we propose to alternatively learn a clustering-friendly data representation and K-means based cluster centers. Experiments on some benchmark datasets have shown improvements of our approach over the previous approaches.
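摘要中“用 softmax 代替 argmax、使聚类中心与表示可通过 SGD 联合学习”的思路可以用如下极简代码说明:软分配对距离可微,聚类损失与自编码器重构损失相加即可联合反向传播。温度 tau、损失权重与类名等均为本文假设,并非论文官方实现。

```python
import torch
import torch.nn as nn

class SoftKMeansHead(nn.Module):
    def __init__(self, n_clusters, latent_dim, tau=1.0):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, latent_dim))
        self.tau = tau

    def forward(self, z):
        # z: (B, latent_dim) 编码器输出的潜在表示
        d2 = torch.cdist(z, self.centroids) ** 2        # (B, K) 到各中心的平方距离
        assign = torch.softmax(-d2 / self.tau, dim=1)   # 软分配:可微,代替 argmax
        cluster_loss = (assign * d2).sum(dim=1).mean()
        return assign, cluster_loss

# 用法示意:total_loss = recon_loss + lam * cluster_loss,一次反向传播
# 即可同时更新编码器、解码器与聚类中心(lam 为假设的权重超参数)
```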
zh
[CV-45] Diorama: Unleashing Zero-shot Single-view 3D Scene Modeling
【速读】: 该论文试图解决从单视角RGB图像中重建结构化3D场景的问题,特别是通过使用CAD对象来实现高效且紧凑的场景表示,同时保持场景的组合性和交互性。现有方法依赖于昂贵且不准确的现实世界标注或可控但单调的合成数据,这些方法难以泛化到未见过的对象或领域。论文提出的Diorama系统是首个零样本开放世界系统,能够在无需端到端训练或人工标注的情况下,从单视角RGB图像中整体建模3D场景。解决方案的关键在于将问题分解为子任务,并引入鲁棒且可泛化的解决方案,包括建筑结构重建、3D形状检索、对象姿态估计和场景布局优化。通过在合成和真实世界数据上的评估,证明了该系统显著优于现有基线方法,并展示了其对互联网图像和文本到场景任务的泛化能力。
链接: https://arxiv.org/abs/2411.19492
作者: Qirui Wu,Denys Iliash,Daniel Ritchie,Manolis Savva,Angel X. Chang
关键词-EN: CAD objects unlocks, Reconstructing structured, objects unlocks efficient, compact scene representations, compositionality and interactability
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introduce robust, generalizable solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show we significantly outperform baselines from prior work. We also demonstrate generalization to internet images and the text-to-scene task.
zh
[CV-46] Interleaved-Modal Chain-of-Thought
【速读】: 该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在处理复杂推理任务时,其纯文本的推理步骤难以表达与原始图像的细粒度关联的问题。解决方案的关键在于提出了交错模态思维链(Interleaved-modal Chain-of-Thought, ICoT),即一种融合图像的多模态思维链,通过生成由成对的视觉与文本推理步骤构成的序列来推导最终答案。具体实现上,论文提出了注意力驱动的选择机制(Attention-driven Selection, ADS),该机制通过智能地插入输入图像的区域来生成细粒度的多模态推理步骤,且无需额外的参数化,因此是一种即插即用的策略,适用于多种VLMs。实验结果表明,ICoT提示方法在性能和可解释性方面相比现有的多模态思维链提示方法有显著提升。
链接: https://arxiv.org/abs/2411.19488
作者: Jun Gao,Yongqi Li,Ziqiang Cao,Wenjie Li
关键词-EN: elicits large language, large language models, prompting elicits large, intermediate reasoning steps, elicits large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, the novel ICoT requires VLMs to enable the generation of fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with ignorable additional latency. ADS relies solely on the attention map of VLMs without the need for parameterization, and therefore it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations of three benchmarks have shown that ICoT prompting achieves substantial performance (up to 14%) and interpretability improvements compared to existing multimodal CoT prompting methods.
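ADS 的核心动作是利用 VLM 自身的注意力图挑选输入图像中最相关的区域并插入推理序列。下面只示意“按注意力权重取 top-k patch 并换算成像素框”这一步,patch 网格尺寸、top_k 等参数均为假设,实际的交错生成流程与 VLM 的对接方式未涉及,并非论文官方实现。

```python
import torch

def select_patch_boxes(attn_over_patches, grid_h, grid_w, patch_size, top_k=2):
    # attn_over_patches: (grid_h * grid_w,) 当前生成步对各图像 patch 的注意力权重
    attn = torch.as_tensor(attn_over_patches).flatten()
    assert attn.numel() == grid_h * grid_w
    idx = attn.topk(top_k).indices
    boxes = []
    for i in idx.tolist():
        r, c = divmod(i, grid_w)
        # 换算为原图上的像素框 (x0, y0, x1, y1),由调用方裁剪后插入推理序列
        boxes.append((c * patch_size, r * patch_size,
                      (c + 1) * patch_size, (r + 1) * patch_size))
    return boxes
```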
zh
[CV-47] V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
【速读】: 该论文试图解决从无声的说话人脸视频中直接生成自然且可理解的语音的问题。解决方案的关键在于将语音信号分解为可管理的子空间(内容、音高和说话者信息),并直接从视觉输入中预测这些属性。随后,通过基于Transformer架构的修正流匹配解码器生成连贯且逼真的语音,该解码器能够高效地从随机噪声到目标语音分布建模概率路径。
链接: https://arxiv.org/abs/2411.19486
作者: Jeongsoo Choi,Ji-Hoon Kim,Jinyu Li,Joon Son Chung,Shujie Liu
关键词-EN: talking face videos, silent talking face, framework designed, face videos, intelligible speech directly
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances.
zh
[CV-48] FLARE: Towards Universal Dataset Purification against Backdoor Attacks
【速读】: 该论文试图解决深度神经网络 (DNNs) 在面对后门攻击时的脆弱性问题。后门攻击通过在数据集中植入特定的触发器,使得模型在遇到这些触发器时会输出攻击者指定的目标标签。论文揭示了现有高级净化方法的一个潜在假设——后门连接比良性特征更容易学习——在某些攻击类型(如全对全 (A2A) 和非目标 (UT) 攻击)中并不成立。因此,基于输入-输出空间或最终隐藏层空间分离的净化方法效果不佳。论文提出了一种名为 FLARE 的通用净化方法,其关键在于从所有隐藏层中聚合异常激活来构建表示,并通过自适应子空间选择算法来优化分离空间,从而将整个数据集划分为两个集群。FLARE 通过评估每个集群的稳定性来识别并净化被污染的数据。实验结果表明,FLARE 对多种后门攻击(包括全对一 (A2O)、全对全 (A2A) 和非目标 (UT) 攻击)具有显著效果,并能抵御自适应攻击。
链接: https://arxiv.org/abs/2411.19479
作者: Linshan Hou,Wei Luo,Zhongyun Hua,Songhua Chen,Leo Yu Zhang,Yiming Li
关键词-EN: Deep neural networks, Deep neural, enabling malicious manipulation, adversaries poison datasets, backdoor attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages
点击查看摘要
Abstract:Deep neural networks (DNNs) are susceptible to backdoor attacks, where adversaries poison datasets with adversary-specified triggers to implant hidden backdoors, enabling malicious manipulation of model predictions. Dataset purification serves as a proactive defense by removing malicious training samples to prevent backdoor injection at its source. We first reveal that the current advanced purification methods rely on a latent assumption that the backdoor connections between triggers and target labels in backdoor attacks are simpler to learn than the benign features. We demonstrate that this assumption, however, does not always hold, especially in all-to-all (A2A) and untargeted (UT) attacks. As a result, purification methods that analyze the separation between the poisoned and benign samples in the input-output space or the final hidden layer space are less effective. We observe that this separability is not confined to a single layer but varies across different hidden layers. Motivated by this understanding, we propose FLARE, a universal purification method to counter various backdoor attacks. FLARE aggregates abnormal activations from all hidden layers to construct representations for clustering. To enhance separation, FLARE develops an adaptive subspace selection algorithm to isolate the optimal space for dividing an entire dataset into two clusters. FLARE assesses the stability of each cluster and identifies the cluster with higher stability as poisoned. Extensive evaluations on benchmark datasets demonstrate the effectiveness of FLARE against 22 representative backdoor attacks, including all-to-one (A2O), all-to-all (A2A), and untargeted (UT) attacks, and its robustness to adaptive attacks.
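FLARE 的一个关键步骤是聚合所有隐藏层的激活并把整个数据集二聚类。下面给出一个非常简化的示意:逐层归一化后拼接表示,再用 KMeans 分成两簇;自适应子空间选择与簇稳定性评估这两个关键环节此处省略,特征接口为假设,并非论文官方实现。

```python
import numpy as np
from sklearn.cluster import KMeans

def split_dataset(layer_activations):
    # layer_activations: 列表,每个元素为某一隐藏层上全体训练样本的激活,形状 (N, d_l)
    feats = [a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
             for a in layer_activations]                  # 逐层 L2 归一化
    rep = np.concatenate(feats, axis=1)                   # 聚合为 (N, sum d_l) 表示
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(rep)
    return labels  # 两簇标签;后续按簇的稳定性判定哪一簇为中毒样本(此处省略)
```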
zh
[CV-49] Effective Fine-Tuning of Vision-Language Models for Accurate Galaxy Morphology Analysis
【速读】: 该论文试图解决在星系形态分析任务中,直接训练领域专用模型成本高昂,而微调视觉基础模型在较小天文图像数据集上效果不佳的问题。解决方案的关键在于提出了一种名为GalaxAlign的新方法,该方法通过扩展对比学习架构,将三种类型的数据(即表示星系形状和结构的示意图符号、这些符号的文本标签以及星系图像)对齐,从而在微调过程中整合领域特定的多模态知识。这种方法不仅消除了昂贵的预训练需求,还显著提升了微调效果,在星系分类和相似性搜索任务中表现出色。
链接: https://arxiv.org/abs/2411.19475
作者: Ruoqi Wang,Haitao Wang,Qiong Luo
关键词-EN: morphology analysis involves, analysis involves classifying, involves classifying galaxies, Galaxy morphology analysis, morphology analysis
类目: Computer Vision and Pattern Recognition (cs.CV); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Galaxy morphology analysis involves classifying galaxies by their shapes and structures. For this task, directly training domain-specific models on large, annotated astronomical datasets is effective but costly. In contrast, fine-tuning vision foundation models on a smaller set of astronomical images is more resource-efficient but generally results in lower accuracy. To harness the benefits of both approaches and address their shortcomings, we propose GalaxAlign, a novel method that fine-tunes pre-trained foundation models to achieve high accuracy on astronomical tasks. Specifically, our method extends a contrastive learning architecture to align three types of data in fine-tuning: (1) a set of schematic symbols representing galaxy shapes and structures, (2) textual labels of these symbols, and (3) galaxy images. This way, GalaxAlign not only eliminates the need for expensive pretraining but also enhances the effectiveness of fine-tuning. Extensive experiments on galaxy classification and similarity search demonstrate that our method effectively fine-tunes general pre-trained models for astronomical tasks by incorporating domain-specific multi-modal knowledge.
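GalaxAlign 将示意符号、文本标签与星系图像三种模态对齐。下面用 CLIP 风格的两两 InfoNCE 损失给出一个通用示意(假设三路编码器已给出同维度、按行配对的嵌入,温度 tau 为假设值),仅说明“三模态对比对齐”的一般做法,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    # a, b: (B, d) 两种模态的嵌入,按行一一配对
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def tri_modal_loss(img_emb, sym_emb, txt_emb):
    # 星系图像 / 示意符号 / 文本标签三路嵌入,两两对比后取平均
    return (info_nce(img_emb, sym_emb) +
            info_nce(img_emb, txt_emb) +
            info_nce(sym_emb, txt_emb)) / 3
```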
zh
[CV-50] ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection
【速读】: 该论文试图解决多模态大型语言模型(M-LLMs)在图像篡改检测(IMD)任务中存在的推理文本出现幻觉和过度思考的问题。解决方案的关键在于提出了ForgerySleuth,该方法利用M-LLMs进行全面的线索融合,并生成指示篡改区域的分割输出。此外,通过Chain-of-Clues提示构建了ForgeryAnalysis数据集,包含分析和推理文本,以升级图像篡改检测任务。论文还引入了一个数据引擎来构建更大规模的数据集用于预训练阶段。实验结果表明,ForgeryAnalysis和ForgerySleuth在泛化性、鲁棒性和可解释性方面显著优于现有方法。
链接: https://arxiv.org/abs/2411.19466
作者: Zhihao Sun,Haoran Jiang,Haoran Chen,Yixin Cao,Xipeng Qiu,Zuxuan Wu,Yu-Gang Jiang
关键词-EN: Multimodal large language, large language models, Multimodal large, multimodal tasks, large language
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multimodal large language models have unlocked new possibilities for various multimodal tasks. However, their potential in image manipulation detection remains unexplored. When directly applied to the IMD task, M-LLMs often produce reasoning texts that suffer from hallucinations and overthinking. To address this, in this work, we propose ForgerySleuth, which leverages M-LLMs to perform comprehensive clue fusion and generate segmentation outputs indicating specific regions that are tampered with. Moreover, we construct the ForgeryAnalysis dataset through the Chain-of-Clues prompt, which includes analysis and reasoning text to upgrade the image manipulation detection task. A data engine is also introduced to build a larger-scale dataset for the pre-training phase. Our extensive experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability.
zh
[CV-51] Robust Bayesian Scene Reconstruction by Leveraging Retrieval-Augmented Priors
【速读】: 该论文试图解决从单张RGBD图像中重建多物体场景的三维几何表示问题。解决方案的关键在于提出了BRRP方法,该方法通过利用预先存在的网格数据集构建一个鲁棒的概率重建过程中的信息先验。为了提高效率,引入了检索增强先验的概念,在推理过程中检索相关先验分布的组成部分。这种方法不仅比深度学习方法更鲁棒,而且比使用非信息先验的方法更准确。
链接: https://arxiv.org/abs/2411.19461
作者: Herbert Wright,Weiming Zhi,Matthew Johnson-Roberson,Tucker Hermans
关键词-EN: downstream manipulation tasks, manipulation tasks, Constructing, geometry is critical, downstream manipulation
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Constructing 3D representations of object geometry is critical for many downstream manipulation tasks. These representations must be built from potentially noisy partial observations. In this work we focus on the problem of reconstructing a multi-object scene from a single RGBD image. Current deep learning approaches to this problem can be brittle to noisy real world observations and out-of-distribution objects. Other approaches that do not rely on training data cannot accurately infer the backside of objects. We propose BRRP, a reconstruction method that can leverage preexisting mesh datasets to build an informative prior during robust probabilistic reconstruction. In order to make our method more efficient, we introduce the concept of retrieval-augmented prior, where we retrieve relevant components of our prior distribution during inference. Our method produces a distribution over object shape that can be used for reconstruction or measuring uncertainty. We evaluate our method in both procedurally generated scenes and in real world scenes. We show our method is more robust than a deep learning approach while being more accurate than a method with an uninformative prior.
zh
[CV-52] Look Every Frame All at Once: Video-Ma2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
【速读】: 该论文试图解决大规模视频数据处理中由于现有基于Transformer的大规模多模态模型(LMMs)导致的内存和计算需求呈二次增长的问题。解决方案的关键在于引入了一种名为Video-Ma²mba的新架构,该架构在Mamba-2框架中嵌入了状态空间模型(State Space Models, SSMs),以替代注意力机制(Attention Mechanisms)。这种设计使得LMMs在时间和内存需求上能够线性扩展,从而能够高效处理长时间的视频内容。此外,论文还提出了多轴梯度检查点(Multi-Axis Gradient Checkpointing, MA-GC)方法,通过策略性地管理内存,仅保留多个计算轴上的关键激活,显著减少了内存占用。这些创新使得Video-Ma²mba能够在单个GPU上处理长达数百万个token或超过两小时的连续视频序列(以1 FPS的速度),同时保持对时间动态的详细捕捉,从而在长时间视频理解任务中提高了准确性和相关性。
链接: https://arxiv.org/abs/2411.19460
作者: Hosu Lee,Junho Kim,Hyunjun Kim,Yong Man Ro
关键词-EN: transformer-based Large Multi-modal, Large Multi-modal Models, Large Multi-modal, poses significant challenges, significant challenges due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL
点击查看摘要
Abstract:With the growing scale and complexity of video data, efficiently processing long video sequences poses significant challenges due to the quadratic increase in memory and computational demands associated with existing transformer-based Large Multi-modal Models (LMMs). To address these issues, we introduce Video-Ma²mba, a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework, replacing the attention mechanisms. This allows the LMMs to scale linearly in terms of time and memory requirements, making it feasible to handle long-duration video content. Furthermore, we enhance the memory efficiency by introducing the Multi-Axis Gradient Checkpointing (MA-GC) method, which strategically manages memory by retaining only essential activations across multiple computational axes. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. Empirical analyses show that Video-Ma²mba can process extensive video sequences, equivalent to millions of tokens or over two hours of continuous sequences at 1 FPS, on a single GPU. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks, demonstrating substantial advantages over existing frameworks.
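MA-GC 在多个计算轴上选择性保留激活,其出发点与标准梯度检查点一致:前向不保存中间激活、反向时重算,以计算换显存。下面仅给出在“层”这一单一轴上的标准 torch.utils.checkpoint 用法作为对照,并非 MA-GC 本身的实现,类名与接口均为假设。

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            # 前向不缓存该层内部激活,反向传播时重算,以额外计算换取显存
            x = checkpoint(layer, x, use_reentrant=False)
        return x
```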
zh
[CV-53] Fleximo: Towards Flexible Text-to-Human Motion Video Generation
【速读】: 该论文试图解决从参考图像和自然语言生成人类运动视频的问题,现有的方法依赖于从参考视频中提取姿态序列,这限制了灵活性和控制性,并且由于姿态检测技术的局限性,提取的姿态序列可能不准确,导致视频输出质量低。解决方案的关键在于提出了一个名为Fleximo的新框架,该框架利用大规模预训练的文本到3D运动模型,并通过引入基于锚点的重缩放方法和设计骨骼适配器来填补缺失细节,从而弥合文本到运动与运动到视频生成之间的差距。此外,通过使用大型语言模型(LLM)将自然语言分解为离散的运动序列,实现了任意长度运动视频的生成。最后,通过视频细化过程进一步提高视频质量。
链接: https://arxiv.org/abs/2411.19459
作者: Yuhang Zhang,Yuan Zhou,Zeyu Liu,Yuxuan Cai,Qiuyue Wang,Aidong Men,Huan Yang
关键词-EN: extracting pose sequences, generating human motion, Current methods, pose sequences, rely on extracting
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Current methods for generating human motion videos rely on extracting pose sequences from reference videos, which restricts flexibility and control. Additionally, due to the limitations of pose detection techniques, the extracted pose sequences can sometimes be inaccurate, leading to low-quality video outputs. We introduce a novel task aimed at generating human motion videos solely from reference images and natural language. This approach offers greater flexibility and ease of use, as text is more accessible than the desired guidance videos. However, training an end-to-end model for this task requires millions of high-quality text and human motion video pairs, which are challenging to obtain. To address this, we propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models. This approach is not straightforward, as the text-generated skeletons may not consistently match the scale of the reference image and may lack detailed information. To overcome these challenges, we introduce an anchor point based rescale method and design a skeleton adapter to fill in missing details and bridge the gap between text-to-motion and motion-to-video generation. We also propose a video refinement process to further enhance video quality. A large language model (LLM) is employed to decompose natural language into discrete motion sequences, enabling the generation of motion videos of any desired length. To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions. We also propose a new metric, MotionScore, to evaluate the accuracy of motion following. Both qualitative and quantitative results demonstrate that our method outperforms existing text-conditioned image-to-video generation methods. All code and model weights will be made publicly available.
zh
[CV-54] Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning
【速读】: 该论文试图解决基于ViT(Vision Transformer)的视觉基础模型在理解和处理3D空间关系方面的不足。解决方案的关键在于通过系统评估和增强模型的3D等变性(3D equivariance),特别是通过检查不同视角下语义嵌入的一致性来实现。研究结果表明,提升3D等变性能够显著提高模型在姿态估计、跟踪和语义传递等下游任务中的表现。为此,论文提出了一种基于3D对应关系的简单而有效的微调策略,该策略显著增强了现有视觉模型对3D对应关系的理解,甚至在单个对象上进行一次迭代微调就能带来显著的性能提升。
链接: https://arxiv.org/abs/2411.19458
作者: Yang You,Yixin Li,Congyue Deng,Yue Wang,Leonidas Guibas
关键词-EN: revolutionized image understanding, providing rich semantic, ViT family, revolutionized image, providing rich
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their abilities on grasping 3D spatial relationships are still unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. All code and resources will be made publicly available to support further advancements in 3D-aware vision models. Our code is available at this https URL.
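下面给出一个概念性代码示意(并非该论文的官方实现):在已知两视角像素对应关系的前提下,用匹配像素特征的余弦距离构造"3D 对应关系一致性"损失,用于微调特征提取器。其中特征图 `feat_a`/`feat_b`、对应点坐标 `pix_a`/`pix_b` 以及函数名均为假设的接口。

```python
import torch
import torch.nn.functional as F

def correspondence_consistency_loss(feat_a, feat_b, pix_a, pix_b):
    """feat_*: (C, H, W) dense features of two views of the same object.
    pix_*:  (N, 2) pixel coordinates (x, y) of N known 3D correspondences.
    Returns a loss that is small when matched pixels have similar features."""
    def sample(feat, pix):
        C, H, W = feat.shape
        # normalize pixel coords to [-1, 1] for grid_sample
        grid = torch.stack([pix[:, 0] / (W - 1) * 2 - 1,
                            pix[:, 1] / (H - 1) * 2 - 1], dim=-1)
        grid = grid.view(1, 1, -1, 2)
        out = F.grid_sample(feat.unsqueeze(0), grid, align_corners=True)
        return out.squeeze(0).squeeze(1).t()          # (N, C)
    fa = F.normalize(sample(feat_a, pix_a), dim=-1)
    fb = F.normalize(sample(feat_b, pix_b), dim=-1)
    return (1 - (fa * fb).sum(dim=-1)).mean()          # 1 - cosine similarity

# toy usage with random tensors
fa, fb = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
pix = torch.randint(0, 32, (100, 2)).float()
print(correspondence_consistency_loss(fa, fb, pix, pix))
```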
zh
[CV-55] GausSurf: Geometry-Guided 3D Gaussian Splatting for Surface Reconstruction
【速读】: 该论文试图解决使用3D高斯(3D Gaussians)进行高质量表面重建时细节不足的问题。解决方案的关键在于引入GausSurf方法,通过几何引导(geometry guidance)来提升重建质量。具体来说,该方法利用多视角一致性(multi-view consistency)在纹理丰富区域和法线先验(normal priors)在纹理贫乏区域,分别采用基于传统补丁匹配的多视角立体(patch-match based Multi-View Stereo, MVS)方法和预训练的法线估计模型来引导优化过程。这种迭代优化方案实现了高斯优化与补丁匹配精化的相互增强,显著提高了重建结果的质量并加速了训练过程。
链接: https://arxiv.org/abs/2411.19454
作者: Jiepeng Wang,Yuan Liu,Peng Wang,Cheng Lin,Junhui Hou,Xin Li,Taku Komura,Wenping Wang
关键词-EN: real-time rendering capabilities, achieved impressive performance, Gaussian Splatting, Splatting has achieved, rendering capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:3D Gaussian Splatting has achieved impressive performance in novel view synthesis with real-time rendering capabilities. However, reconstructing high-quality surfaces with fine details using 3D Gaussians remains a challenging task. In this work, we introduce GausSurf, a novel approach to high-quality surface reconstruction by employing geometry guidance from multi-view consistency in texture-rich areas and normal priors in texture-less areas of a scene. We observe that a scene can be mainly divided into two primary regions: 1) texture-rich and 2) texture-less areas. To enforce multi-view consistency at texture-rich areas, we enhance the reconstruction quality by incorporating a traditional patch-match based Multi-View Stereo (MVS) approach to guide the geometry optimization in an iterative scheme. This scheme allows for mutual reinforcement between the optimization of Gaussians and patch-match refinement, which significantly improves the reconstruction results and accelerates the training process. Meanwhile, for the texture-less areas, we leverage normal priors from a pre-trained normal estimation model to guide optimization. Extensive experiments on the DTU and Tanks and Temples datasets demonstrate that our method surpasses state-of-the-art methods in terms of reconstruction quality and computation time.
zh
[CV-56] Learning Visual Abstract Reasoning through Dual-Stream Networks
【速读】: 该论文试图解决视觉抽象推理任务中深度神经网络的局限性问题,特别是针对Raven’s Progressive Matrices (RPM)的挑战。解决方案的关键在于提出了Dual-stream Reasoning Network (DRNet),该网络利用两个并行的分支来捕捉图像特征。这两个分支分别处理局部或空间信息,并通过一个推理模块将高层次特征融合,然后使用规则提取器处理上下文图像与候选图像的组合,提取离散的抽象规则,并利用多层感知器(MLP)进行预测。实验结果表明,DRNet在多个RPM基准测试中达到了最先进的平均性能,并展示了强大的泛化能力,甚至在各种分布外场景中也能表现出色。
链接: https://arxiv.org/abs/2411.19451
作者: Kai Zhao,Chang Xu,Bailu Si
关键词-EN: Raven Progressive Matrices, deep neural networks, neural network model, exposing limitations, Visual abstract reasoning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 6 figures
点击查看摘要
Abstract:Visual abstract reasoning tasks present challenges for deep neural networks, exposing limitations in their capabilities. In this work, we present a neural network model that addresses the challenges posed by Raven’s Progressive Matrices (RPM). Inspired by the two-stream hypothesis of visual processing, we introduce the Dual-stream Reasoning Network (DRNet), which utilizes two parallel branches to capture image features. On top of the two streams, a reasoning module first learns to merge the high-level features of the same image. Then, it employs a rule extractor to handle combinations involving the eight context images and each candidate image, extracting discrete abstract rules and utilizing a multilayer perceptron (MLP) to make predictions. Empirical results demonstrate that the proposed DRNet achieves state-of-the-art average performance across multiple RPM benchmarks. Furthermore, DRNet demonstrates robust generalization capabilities, even extending to various out-of-distribution scenarios. The dual streams within DRNet serve distinct functions by addressing local or spatial information. They are then integrated into the reasoning module, leveraging abstract rules to facilitate the execution of visual reasoning tasks. These findings indicate that the dual-stream architecture could play a crucial role in visual abstract reasoning.
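以下是一个极简的双流网络骨架示意(非官方实现;分支结构、卷积核大小、特征维度均为假设),仅用于说明"两条并行分支提取特征,推理模块融合后由规则提取器与预测头打分"的整体数据流。

```python
import torch
import torch.nn as nn

class DRNetSketch(nn.Module):
    """Minimal sketch of a dual-stream RPM reasoner (hypothetical dims)."""
    def __init__(self, feat_dim=128, n_candidates=8):
        super().__init__()
        # stream 1: small-kernel conv branch (local detail)
        self.local_stream = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # stream 2: large-kernel conv branch (spatial layout)
        self.spatial_stream = nn.Sequential(
            nn.Conv2d(1, 32, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)      # merge the two streams
        self.rule_extractor = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, context, candidates):
        # context: (B, 8, 1, H, W) panels, candidates: (B, 8, 1, H, W) choices
        def encode(x):
            B, N = x.shape[:2]
            x = x.flatten(0, 1)
            f = torch.cat([self.local_stream(x), self.spatial_stream(x)], dim=-1)
            return self.fuse(f).view(B, N, -1)
        ctx, cand = encode(context), encode(candidates)
        scores = []
        for i in range(cand.shape[1]):
            seq = torch.cat([ctx, cand[:, i:i + 1]], dim=1)   # 8 context + 1 candidate
            _, h = self.rule_extractor(seq)                   # abstract "rule" state
            scores.append(self.head(h[-1]))
        return torch.cat(scores, dim=-1)                      # (B, 8) candidate scores

model = DRNetSketch()
out = model(torch.randn(2, 8, 1, 80, 80), torch.randn(2, 8, 1, 80, 80))
print(out.shape)  # torch.Size([2, 8])
```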
zh
[CV-57] Adaptive Interactive Segmentation for Multimodal Medical Imaging via Selection Engine
【速读】: 该论文试图解决医学图像分析中多模态数据分割的适应性和泛化性问题。解决方案的关键在于提出了策略驱动的交互式分割模型(Strategy-driven Interactive Segmentation Model, SISeg),该模型基于SAM2构建,并通过集成选择引擎来增强不同医学成像模态下的分割性能。具体来说,论文开发了自适应帧选择引擎(Adaptive Frame Selection Engine, AFSE),用于在2D图像序列推理过程中自动选择最佳提示帧,从而缓解内存瓶颈并优化提示帧选择,同时通过交互反馈机制增强模型的可解释性。实验结果表明,SISeg模型在多模态任务中表现出强大的适应性和泛化能力。
链接: https://arxiv.org/abs/2411.19447
作者: Zhi Li,Kai Zhao,Yaqi Wang,Shuai Wang
关键词-EN: achieving fast, diagnosis and treatment, medical imaging modalities, medical image analysis, Strategy-driven Interactive Segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In medical image analysis, achieving fast, efficient, and accurate segmentation is essential for automated diagnosis and treatment. Although recent advancements in deep learning have significantly improved segmentation accuracy, current models often face challenges in adaptability and generalization, particularly when processing multi-modal medical imaging data. These limitations stem from the substantial variations between imaging modalities and the inherent complexity of medical data. To address these challenges, we propose the Strategy-driven Interactive Segmentation Model (SISeg), built on SAM2, which enhances segmentation performance across various medical imaging modalities by integrating a selection engine. To mitigate memory bottlenecks and optimize prompt frame selection during the inference of 2D image sequences, we developed an automated system, the Adaptive Frame Selection Engine (AFSE). This system dynamically selects the optimal prompt frames without requiring extensive prior medical knowledge and enhances the interpretability of the model’s inference process through an interactive feedback mechanism. We conducted extensive experiments on 10 datasets covering 7 representative medical imaging modalities, demonstrating the SISeg model’s robust adaptability and generalization in multi-modal tasks. The project page and code will be available at: [URL].
zh
[CV-58] Any-Resolution AI-Generated Image Detection by Spectral Learning
【速读】: 该论文试图解决生成式 AI (Generative AI) 模型在生成图像中引入的频谱伪影问题,特别是这些伪影在不同生成模型之间的显著差异导致现有方法难以泛化到训练过程中未见过的生成器。解决方案的关键在于利用真实图像频谱分布的不变性和高度区分性,通过自监督的掩码频谱学习(masked spectral learning)和频域重建(frequency reconstruction)作为前置任务,捕捉生成图像与真实图像在频谱上的差异。论文提出了频谱重建相似度(spectral reconstruction similarity)来量化这种差异,并引入了频谱上下文注意力(spectral context attention)机制,以高效地捕捉图像中任何分辨率下的细微频谱不一致性。最终,该方法(SPAI)在检测生成图像方面相比之前的最先进方法在AUC上提升了5.5%,并表现出对常见在线扰动的鲁棒性。
链接: https://arxiv.org/abs/2411.19417
作者: Dimitrios Karageorgiou,Symeon Papadopoulos,Ioannis Kompatsiaris,Efstratios Gavves
关键词-EN: labeled data, spectral, images, approaches, recent generative approaches
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Recent works have established that AI models introduce spectral artifacts into generated images and propose approaches for learning to capture them using labeled data. However, the significant differences in such artifacts among different generative models hinder these approaches from generalizing to generators not seen during training. In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. To model this under a self-supervised setup, we employ masked spectral learning using the pretext task of frequency reconstruction. Since generated images constitute out-of-distribution samples for this model, we propose spectral reconstruction similarity to capture this divergence. Moreover, we introduce spectral context attention, which enables our approach to efficiently capture subtle spectral inconsistencies in images of any resolution. Our spectral AI-generated image detection approach (SPAI) achieves a 5.5% absolute improvement in AUC over the previous state-of-the-art across 13 recent generative approaches, while exhibiting robustness against common online perturbations.
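下面是"频谱重建相似度"这一思路的高度简化示意(非 SPAI 官方实现):先对图像做 FFT 并随机掩码部分频率,再用某个频率重建器(此处以假设的占位函数代替训练好的模型)恢复被掩码的频谱,最后比较重建频谱与真实频谱的差异作为检测分数;对重建器而言,生成图像属于分布外样本,预期得分更高。

```python
import numpy as np

def spectral_reconstruction_score(img, reconstructor, mask_ratio=0.5, seed=0):
    """img: (H, W) grayscale array in [0, 1].
    reconstructor: callable that predicts the full log-magnitude spectrum
    from a masked one (a hypothetical stand-in for the trained model).
    Returns a divergence score; generated images are expected to score higher."""
    spec = np.fft.fftshift(np.fft.fft2(img))
    logmag = np.log1p(np.abs(spec))
    rng = np.random.default_rng(seed)
    mask = rng.random(logmag.shape) < mask_ratio          # random frequency mask
    masked = np.where(mask, 0.0, logmag)
    recon = reconstructor(masked)
    # cosine distance between reconstructed and true spectra on the masked bins
    a, b = recon[mask].ravel(), logmag[mask].ravel()
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return 1.0 - cos

# toy usage: a dummy "reconstructor" standing in for the trained model
dummy_reconstructor = lambda m: m * 0.9
img = np.random.rand(64, 64)
print(spectral_reconstruction_score(img, dummy_reconstructor))
```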
zh
[CV-59] AMO Sampler: Enhancing Text Rendering with Overshooting
【速读】: 该论文试图解决文本到图像生成中文字渲染的精确对齐问题,特别是在图像中呈现书面文字时,现有的先进模型如Stable Diffusion 3 (SD3)、Flux和AuraFlow仍存在文字拼写错误或不一致的问题。解决方案的关键在于引入一种无需额外训练且计算开销极小的方法,即通过交替超调预训练的修正流(Rectified Flow)模型中的学习常微分方程(ODE)和重新引入噪声,来改进文本渲染质量。具体而言,提出的超调采样器(Overshooting sampler)相比欧拉采样器(Euler sampler),能有效引入额外的朗之万动力学项(Langevin dynamics term),纠正连续欧拉步骤中的累积误差,从而提升文本渲染效果。为解决超调强度过高导致的图像过度平滑问题,论文进一步提出了注意力调制超调采样器(Attention Modulated Overshooting sampler, AMO),根据图像块与文本内容的注意力分数自适应调整超调强度,从而在不牺牲整体图像质量或增加推理成本的前提下,显著提高文本渲染的准确性。
链接: https://arxiv.org/abs/2411.19415
作者: Xixi Hu,Keyang Xu,Bo Liu,Qiang Liu,Hongliang Fei
关键词-EN: Achieving precise alignment, Achieving precise, significant challenge, precise alignment, alignment between textual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages
点击查看摘要
Abstract:Achieving precise alignment between textual instructions and generated images in text-to-image generation is a significant challenge, particularly in rendering written text within images. State-of-the-art models like Stable Diffusion 3 (SD3), Flux, and AuraFlow still struggle with accurate text depiction, resulting in misspelled or inconsistent text. We introduce a training-free method with minimal computational overhead that significantly enhances text rendering quality. Specifically, we introduce an overshooting sampler for pretrained rectified flow (RF) models, by alternating between over-simulating the learned ordinary differential equation (ODE) and reintroducing noise. Compared to the Euler sampler, the overshooting sampler effectively introduces an extra Langevin dynamics term that can help correct the compounding error from successive Euler steps and therefore improve the text rendering. However, when the overshooting strength is high, we observe over-smoothing artifacts on the generated images. To address this issue, we propose an Attention Modulated Overshooting sampler (AMO), which adaptively controls the strength of overshooting for each image patch according to their attention score with the text content. AMO demonstrates a 32.3% and 35.9% improvement in text rendering accuracy on SD3 and Flux without compromising overall image quality or increasing inference cost.
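以下是"超调采样"单步更新的概念性示意(非论文官方实现):先沿学习到的 ODE 超调到更靠前的时间,再用新噪声把样本重新加噪回目标时间步。其中时间约定(t=1 为纯噪声、t=0 为数据)、速度场接口 `velocity_fn` 以及重新加噪的系数选择都只是一种假设写法,与原论文的精确公式可能不同。

```python
import torch

def overshoot_step(x_t, t, t_next, velocity_fn, c=1.0):
    """One conceptual 'overshoot then re-noise' update for a rectified-flow sampler.
    Convention (assumed here): t=1 is pure noise, t=0 is data, and an Euler step
    toward the data is x - dt * v. velocity_fn(x, t) -> predicted velocity;
    c >= 0 controls the overshoot strength."""
    v = velocity_fn(x_t, t)
    dt = t - t_next
    t_over = max(t_next - c * dt, 0.0)          # overshoot past the target time
    x_over = x_t - (t - t_over) * v             # over-simulate the learned ODE
    if t_over == t_next:
        return x_over                            # c=0 reduces to a plain Euler step
    # re-noise from t_over back to t_next with fresh Gaussian noise
    # (coefficients assume the linear interpolation x_t = (1-t)*x0 + t*eps and
    #  match the marginal noise scale; one possible choice, not the paper's exact one)
    a = (1.0 - t_next) / (1.0 - t_over)
    noise_scale = (t_next ** 2 - (a * t_over) ** 2) ** 0.5
    return a * x_over + noise_scale * torch.randn_like(x_over)

# toy usage with a dummy velocity field
dummy_v = lambda x, t: -x
x = torch.randn(1, 4, 8, 8)
x = overshoot_step(x, t=1.0, t_next=0.9, velocity_fn=dummy_v, c=2.0)
print(x.shape)
```

其中 c 控制超调强度,c=0 时退化为普通欧拉步;AMO 进一步按图像块与文本的注意力分数自适应调节该强度,此处未展示。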
zh
[CV-60] DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models WACV2025
【速读】: 该论文试图解决个性化图像生成中存在的权衡问题,即在微调预训练的文本到图像扩散模型时,如何在提示忠实度(prompt fidelity)、主体忠实度(subject fidelity)和多样性(diversity)之间取得平衡。解决方案的关键在于提出了一种名为DreamBlend的方法,通过在推理阶段结合早期检查点的提示忠实度和后期检查点的主体忠实度,实现跨注意力引导的图像合成。具体来说,DreamBlend利用早期检查点生成的图像作为指导,引导后期检查点生成具有更高主体忠实度、提示忠实度和多样性的图像,从而在处理复杂提示时超越现有的最先进微调方法。
链接: https://arxiv.org/abs/2411.19390
作者: Shwetha Ram,Tal Neiman,Qianli Feng,Andrew Stuart,Son Tran,Trishul Chilimbi
关键词-EN: prompt fidelity, subject fidelity, fidelity, fine-tune large pre-trained, subject
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Accepted to WACV 2025
点击查看摘要
Abstract:Given a small number of images of a subject, personalized image generation techniques can fine-tune large pre-trained text-to-image diffusion models to generate images of the subject in novel contexts, conditioned on text prompts. In doing so, a trade-off is made between prompt fidelity, subject fidelity and diversity. As the pre-trained model is fine-tuned, earlier checkpoints synthesize images with low subject fidelity but high prompt fidelity and diversity. In contrast, later checkpoints generate images with low prompt fidelity and diversity but high subject fidelity. This inherent trade-off limits the prompt fidelity, subject fidelity and diversity of generated images. In this work, we propose DreamBlend to combine the prompt fidelity from earlier checkpoints and the subject fidelity from later checkpoints during inference. We perform a cross attention guided image synthesis from a later checkpoint, guided by an image generated by an earlier checkpoint, for the same prompt. This enables generation of images with better subject fidelity, prompt fidelity and diversity on challenging prompts, outperforming state-of-the-art fine-tuning methods.
zh
[CV-61] Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints
【速读】: 该论文试图解决手绘草图动画化的挑战,特别是如何基于描述性文本提示生成流畅且保持拓扑结构的动画。解决方案的关键在于利用预训练的文本到视频扩散模型(text-to-video diffusion model)结合SDS损失(SDS loss)来引导草图笔画的动态变化,并通过引入长度-面积正则化(length-area (LA) regularization)确保帧间控制点的平滑位移,从而实现时间一致性。此外,采用形状保持的As-Rigid-As-Possible (ARAP)损失来维持草图的刚性,避免拓扑变化。这些方法共同提升了动画的流畅性和拓扑保持性,超越了现有最先进的技术。
链接: https://arxiv.org/abs/2411.19381
作者: Gaurav Rai,Ojaswa Sharma
关键词-EN: Animating hand-drawn sketches, challenging and complex, traditional tools, tools is challenging, Animating hand-drawn
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Animating hand-drawn sketches using traditional tools is challenging and complex. Sketches provide a visual basis for explanations, and animating these sketches offers an experience of real-time scenarios. We propose an approach for animating a given input sketch based on a descriptive text prompt. Our method utilizes a parametric representation of the sketch’s strokes. Unlike previous methods, which struggle to estimate smooth and accurate motion and often fail to preserve the sketch’s topology, we leverage a pre-trained text-to-video diffusion model with SDS loss to guide the motion of the sketch’s strokes. We introduce length-area (LA) regularization to ensure temporal consistency by accurately estimating the smooth displacement of control points across the frame sequence. Additionally, to preserve shape and avoid topology changes, we apply a shape-preserving As-Rigid-As-Possible (ARAP) loss to maintain sketch rigidity. Our method surpasses state-of-the-art performance in both quantitative and qualitative evaluations.
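下面用一个简化示意说明"长度正则化"可能的形式(非论文官方实现,面积项与 ARAP 损失此处省略):约束相邻帧之间每条笔画折线长度的变化,鼓励控制点位移随时间平滑。输入张量的形状约定为假设。

```python
import torch

def stroke_length(points):
    """points: (F, N, 2) control points of one stroke over F frames.
    Returns the per-frame polyline length, shape (F,)."""
    seg = points[:, 1:, :] - points[:, :-1, :]
    return seg.norm(dim=-1).sum(dim=-1)

def length_regularization(points):
    """Penalize frame-to-frame changes in stroke length (temporal consistency)."""
    L = stroke_length(points)
    return ((L[1:] - L[:-1]) ** 2).mean()

# toy usage: 16 frames, 8 control points per stroke
pts = torch.randn(16, 8, 2, requires_grad=True)
loss = length_regularization(pts)
loss.backward()
print(loss.item(), pts.grad.shape)
```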
zh
[CV-62] owards a Mechanistic Explanation of Diffusion Model Generalization NEURIPS2024
【速读】: 该论文试图解决扩散模型在泛化能力方面的问题,提出了一种基于局部去噪操作的机制。解决方案的关键在于识别并利用扩散模型中的局部归纳偏置(local inductive biases),通过分析网络和经验去噪器,证明了局部去噪操作可以近似最优的扩散去噪器。论文构建了一个基于局部经验去噪器的去噪器,能够近似扩散模型去噪器在正向和反向扩散过程中的泛化行为。
链接: https://arxiv.org/abs/2411.19339
作者: Matthew Niedoba,Berend Zwartsenberg,Kevin Murphy,Frank Wood
关键词-EN: local denoising operations, propose a mechanism, denoising operations, local denoising, diffusion generalization based
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 15 figures. Accepted to NeurIPS 2024 Workshop on Attributing Model Behavior at Scale
点击查看摘要
Abstract:We propose a mechanism for diffusion generalization based on local denoising operations. Through analysis of network and empirical denoisers, we identify local inductive biases in diffusion models. We demonstrate that local denoising operations can be used to approximate the optimal diffusion denoiser. Using a collection of patch-based, local empirical denoisers, we construct a denoiser which approximates the generalization behaviour of diffusion model denoisers over forward and reverse diffusion processes.
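作为参考,下面给出"基于图像块的局部经验去噪器"的一个极简示意(非论文官方实现):对噪声图像的每个图像块,用训练集图像块在高斯似然下的加权平均作为去噪估计,相当于把经验最优去噪器限制在局部窗口内;噪声模型此处简化为加性高斯噪声。

```python
import numpy as np

def local_empirical_denoiser(noisy, train_patches, sigma, patch=8):
    """noisy: (H, W) noisy image; train_patches: (M, patch, patch) clean patches.
    For each non-overlapping patch, return the posterior mean under the
    empirical patch distribution and additive Gaussian noise of std sigma."""
    H, W = noisy.shape
    out = np.zeros_like(noisy)
    flat_train = train_patches.reshape(len(train_patches), -1)   # (M, p*p)
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            y = noisy[i:i + patch, j:j + patch].reshape(-1)
            d2 = ((flat_train - y) ** 2).sum(axis=1)
            logw = -d2 / (2 * sigma ** 2)
            w = np.exp(logw - logw.max())                        # stable softmax weights
            w /= w.sum()
            out[i:i + patch, j:j + patch] = (w[:, None] * flat_train).sum(0).reshape(patch, patch)
    return out

# toy usage with random data
rng = np.random.default_rng(0)
clean = rng.random((32, 32))
patches = rng.random((500, 8, 8))
denoised = local_empirical_denoiser(clean + 0.1 * rng.normal(size=clean.shape), patches, sigma=0.1)
print(denoised.shape)
```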
zh
[CV-63] GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
【速读】: 该论文试图解决现有通用视觉语言模型(Vision-Language Models, VLMs)在地理空间应用中的不足,特别是这些模型在处理地理空间数据的复杂性方面表现不佳的问题。解决方案的关键在于提出了GEOBench-VLM,这是一个专门为评估VLMs在地理空间任务中的表现而设计的综合基准。GEOBench-VLM涵盖了场景理解、对象计数、定位、细粒度分类和时间分析等多项任务,并包含了超过10,000条手动验证的指令,覆盖了视觉条件、对象类型和尺度的多样化变体。通过评估多个最先进的VLMs在地理空间上下文中的准确性,研究结果表明,尽管现有VLMs显示出一定的潜力,但在处理地理空间特定示例时仍面临挑战,这为未来的改进提供了空间。
链接: https://arxiv.org/abs/2411.19325
作者: Muhammad Sohail Danish,Muhammad Akhtar Munir,Syed Roshaan Ali Shah,Kartik Kuckreja,Fahad Shahbaz Khan,Paolo Fraccaro,Alexandre Lacoste,Salman Khan
关键词-EN: generic Vision-Language Models, Vision-Language Models, evaluating generic Vision-Language, numerous recent benchmarks, recent benchmarks focus
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they fall short in addressing the unique demands of geospatial applications. Generic VLM benchmarks are not designed to handle the complexities of geospatial data, which is critical for applications such as environmental monitoring, urban planning, and disaster management. Some of the unique challenges in geospatial domain include temporal analysis for changes, counting objects in large quantities, detecting tiny objects, and understanding relationships between entities occurring in Remote Sensing imagery. To address this gap in the geospatial domain, we present GEOBench-VLM, a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, and temporal analysis. Our benchmark features over 10,000 manually verified instructions and covers a diverse set of variations in visual conditions, object type, and scale. We evaluate several state-of-the-art VLMs to assess their accuracy within the geospatial context. The results indicate that although existing VLMs demonstrate potential, they face challenges when dealing with geospatial-specific examples, highlighting the room for further improvements. Specifically, the best-performing GPT4o achieves only 40% accuracy on MCQs, which is only double the random guess performance. Our benchmark is publicly available at this https URL .
zh
[CV-64] rajectory Attention for Fine-grained Video Motion Control
【速读】: 该论文试图解决视频生成中相机运动控制的问题,特别是在创建视图定制化视觉内容时,现有方法常出现的输出不精确或忽视时间相关性的问题。解决方案的关键在于引入了一种名为“轨迹注意力 (trajectory attention)”的新方法,该方法沿着可用的像素轨迹进行注意力操作,从而实现细粒度的相机运动控制。与传统方法不同,轨迹注意力被建模为与传统时间注意力并行的辅助分支,使得两者能够协同工作,确保在轨迹信息部分可用的情况下,既能实现精确的运动控制,又能生成新的内容。这一设计显著提高了相机运动控制的精度和长期一致性,同时保持了高质量的生成效果,并可扩展到其他视频运动控制任务,如首帧引导的视频编辑。
链接: https://arxiv.org/abs/2411.19324
作者: Zeqi Xiao,Wenqi Ouyang,Yifan Zhou,Shuai Yang,Lei Yang,Jianlou Si,Xingang Pan
关键词-EN: creating view-customized visual, Recent advancements, camera motion control, motion control, view-customized visual content
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this http URL
点击查看摘要
Abstract:Recent advancements in video generation have been greatly driven by video diffusion models, with camera motion control emerging as a crucial challenge in creating view-customized visual content. This paper introduces trajectory attention, a novel approach that performs attention along available pixel trajectories for fine-grained camera motion control. Unlike existing methods that often yield imprecise outputs or neglect temporal correlations, our approach possesses a stronger inductive bias that seamlessly injects trajectory information into the video generation process. Importantly, our approach models trajectory attention as an auxiliary branch alongside traditional temporal attention. This design enables the original temporal attention and the trajectory attention to work in synergy, ensuring both precise motion control and new content generation capability, which is critical when the trajectory is only partially available. Experiments on camera motion control for images and videos demonstrate significant improvements in precision and long-range consistency while maintaining high-quality generation. Furthermore, we show that our approach can be extended to other video motion control tasks, such as first-frame-guided video editing, where it excels in maintaining content consistency over large spatial and temporal ranges.
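以下是"沿像素轨迹做注意力"这一思想的概念性示意(非官方实现;轨迹坐标、特征维度均为假设输入):先按轨迹把各帧特征聚合成序列,在轨迹(时间)维上做自注意力,再以残差形式写回特征,作为与常规时间注意力并行的辅助分支。

```python
import torch
import torch.nn as nn

class TrajectoryAttentionSketch(nn.Module):
    """Attend along pre-computed pixel trajectories (hypothetical interface)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, traj):
        # feats: (T, H, W, C) per-frame features
        # traj:  (N, T, 2) integer pixel coords (x, y) of N trajectories over T frames
        T, H, W, C = feats.shape
        x, y = traj[..., 0], traj[..., 1]                   # (N, T)
        t = torch.arange(T).expand_as(x)                    # frame index per point
        tokens = feats[t, y, x]                             # (N, T, C): gather along tracks
        out, _ = self.attn(tokens, tokens, tokens)          # attention along each trajectory
        # scatter back and add as an auxiliary residual branch
        updated = feats.clone()
        updated[t, y, x] = updated[t, y, x] + out
        return updated

# toy usage
feats = torch.randn(8, 16, 16, 64)
traj = torch.randint(0, 16, (100, 8, 2))
print(TrajectoryAttentionSketch()(feats, traj).shape)
```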
zh
[CV-65] SAMa: Material-aware 3D Selection and Segmentation
【速读】: 该论文试图解决将3D资产分解为材质部分这一常见但高度手动的过程,提出了名为Select Any Material (SAMa)的材质选择方法。解决方案的关键在于扩展了SAM2视频选择模型的能力至材质领域,利用模型的跨视图一致性创建一个3D一致的中间材质相似性表示(以点云形式),并通过最近邻查找在此相似性云中高效重建对象表面的连续选择掩码。该方法设计为多视图一致,无需对比学习或特征场预处理,且在几秒内完成优化,适用于任意3D表示,并在选择精度和多视图一致性方面优于多个强基线方法。
链接: https://arxiv.org/abs/2411.19322
作者: Michael Fischer,Iliyan Georgiev,Thibault Groueix,Vladimir G. Kim,Tobias Ritschel,Valentin Deschaintre
关键词-EN: highly manual process, artists and creators, manual process, common task, task for artists
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Decomposing 3D assets into material parts is a common task for artists and creators, yet remains a highly manual process. In this work, we introduce Select Any Material (SAMa), a material selection approach for various 3D representations. Building on the recently introduced SAM2 video selection model, we extend its capabilities to the material domain. We leverage the model’s cross-view consistency to create a 3D-consistent intermediate material-similarity representation in the form of a point cloud from a sparse set of views. Nearest-neighbour lookups in this similarity cloud allow us to efficiently reconstruct accurate continuous selection masks over objects’ surfaces that can be inspected from any view. Our method is multiview-consistent by design, alleviating the need for contrastive learning or feature-field pre-processing, and performs optimization-free selection in seconds. Our approach works on arbitrary 3D representations and outperforms several strong baselines in terms of selection accuracy and multiview consistency. It enables several compelling applications, such as replacing the diffuse-textured materials on a text-to-3D output, or selecting and editing materials on NeRFs and 3D-Gaussians.
zh
[CV-66] GRAPE: Generalizing Robot Policy via Preference Alignment
【速读】: 该论文试图解决视觉-语言-动作(VLA)模型在机器人任务中普遍存在的泛化能力差的问题,主要原因是这些模型过度依赖于从成功演示中进行行为克隆,导致对未见任务的适应性不足。解决方案的关键在于引入GRAPE(Generalizing Robot Policy via Preference Alignment),通过在轨迹级别上对齐VLA模型,并隐式地从成功和失败试验中建模奖励,从而提升模型对多样化任务的泛化能力。此外,GRAPE将复杂操作任务分解为独立阶段,并通过大型视觉-语言模型提出的关键点,自动引导偏好建模,结合定制的时空约束,灵活调整模型以适应不同的任务目标,如安全性、效率和任务成功率。实验结果表明,GRAPE显著提升了现有VLA模型的性能,在已知和未见任务中的成功率分别提高了51.79%和60.36%,同时在安全性和效率方面也取得了显著改进。
链接: https://arxiv.org/abs/2411.19309
作者: Zijian Zhang,Kaiyuan Zheng,Zhaorun Chen,Joel Jang,Yi Li,Chaoqi Wang,Mingyu Ding,Dieter Fox,Huaxiu Yao
关键词-EN: behavior cloning exclusively, Generalizing Robot Policy, recent advancements, variety of robotics, suffer from critical
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Website: this https URL
点击查看摘要
Abstract:Despite the recent advancements of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failure trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks to independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively. All code, models, and data are available at this https URL
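下面给出"轨迹级偏好对齐"损失的一种常见形式作为示意(非 GRAPE 官方实现):对成功/失败两条轨迹,比较当前策略与冻结参考策略的轨迹对数似然差,套用 DPO 风格的目标。轨迹对数似然的计算接口为假设。

```python
import torch
import torch.nn.functional as F

def trajectory_dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO-style preference loss at the trajectory level.
    logp_*:     sum of action log-probs of a whole trajectory under the policy.
    ref_logp_*: the same quantities under the frozen reference policy.
    'win' is the preferred (e.g., successful) trajectory, 'lose' the rejected one."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()

# toy usage with random trajectory log-likelihoods for a batch of preference pairs
lw, ll = torch.randn(32), torch.randn(32)
rw, rl = torch.randn(32), torch.randn(32)
print(trajectory_dpo_loss(lw, ll, rw, rl))
```

其中 beta 控制与参考策略偏离的惩罚强度;GRAPE 还结合时空约束与阶段分解来构造偏好对,此处未展示。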
zh
[CV-67] Enhancing Parameter-Efficient Fine-Tuning of Vision Transformers through Frequency-Based Adaptation
【速读】: 该论文试图解决传统参数高效微调方法(PEFT)在捕捉高频特征方面的局限性,这些高频特征对于区分细微图像结构至关重要。解决方案的关键是引入FreqFit,一个新颖的频率微调模块,该模块位于视觉Transformer(ViT)块之间,通过在频率域中操作特征来增强模型的适应性。FreqFit能够与所有现有的PEFT方法集成,显著提升其性能,实验结果表明,FreqFit在24个数据集上的表现均优于原始PEFT方法,性能提升范围从1%到16%不等。
链接: https://arxiv.org/abs/2411.19297
作者: Son Thai Ly,Hien V. Nguyen
关键词-EN: Adapting vision transformer, vision transformer foundation, Adapting vision, PEFT methods, transformer foundation models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 24 pages
点击查看摘要
Abstract:Adapting vision transformer foundation models through parameter-efficient fine-tuning (PEFT) methods has become increasingly popular. These methods optimize a limited subset of parameters, enabling efficient adaptation without the need to fine-tune the entire model while still achieving competitive performance. However, traditional PEFT methods may limit the model’s capacity to capture complex patterns, especially those associated with high-frequency spectra. This limitation becomes particularly problematic as existing research indicates that high-frequency features are crucial for distinguishing subtle image structures. To address this issue, we introduce FreqFit, a novel Frequency Fine-tuning module between ViT blocks to enhance model adaptability. FreqFit is simple yet surprisingly effective, and can be integrated with all existing PEFT methods to boost their performance. By manipulating features in the frequency domain, our approach allows models to capture subtle patterns more effectively. Extensive experiments on 24 datasets, using both supervised and self-supervised foundational models with various state-of-the-art PEFT methods, reveal that FreqFit consistently improves performance over the original PEFT methods with performance gains ranging from 1% to 16%. For instance, FreqFit-LoRA surpasses the performances of state-of-the-art baselines on CIFAR100 by more than 10% even without applying regularization or strong augmentation. For reproducibility purposes, the source code is available at this https URL.
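以下是一个插在 ViT block 之间、在频域上调制 token 特征的极简模块示意(非 FreqFit 官方实现;滤波器形式借用常见的全局频域滤波写法,属于假设):把 token 还原成空间网格做 2D FFT,乘上可学习滤波器后逆变换,并以残差形式返回。

```python
import torch
import torch.nn as nn

class FreqModuleSketch(nn.Module):
    """Frequency-domain feature modulation between ViT blocks (conceptual)."""
    def __init__(self, h, w, dim):
        super().__init__()
        # learnable complex-valued filter over the rFFT grid, one per channel
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, tokens, h, w):
        # tokens: (B, N, C) with N == h * w (class token handled outside, if any)
        B, N, C = tokens.shape
        x = tokens.view(B, h, w, C)
        Xf = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")          # (B, h, w//2+1, C)
        Xf = Xf * torch.view_as_complex(self.filter)               # frequency modulation
        x = torch.fft.irfft2(Xf, s=(h, w), dim=(1, 2), norm="ortho")
        return tokens + x.reshape(B, N, C)                          # residual connection

# toy usage: 14x14 patch grid, 384-dim tokens
mod = FreqModuleSketch(14, 14, 384)
tok = torch.randn(2, 196, 384)
print(mod(tok, 14, 14).shape)
```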
zh
[CV-68] UrbanCAD: Towards Highly Controllable and Photorealistic 3D Vehicles for Urban Scene Simulation
【速读】: 该论文试图解决自动驾驶模拟和数据增强中高可控性与高真实感之间的权衡问题。解决方案的关键在于提出了UrbanCAD框架,该框架通过从单一城市图像和一组免费3D CAD模型及手工制作的材质中生成高度可控且真实感强的3D车辆数字孪生体,从而推动了这一权衡的边界。UrbanCAD不仅支持360度真实渲染、车辆插入、材质转移、重新照明和组件操作(如开门和车窗下降),还支持构建长尾场景。其核心技术在于采用了一种检索-优化方式的新型流水线,能够在适应观测数据的同时保持灵活的可控性和细粒度的手工细节。此外,通过利用多视角背景透视和鱼眼图像,UrbanCAD能够近似环境光照并使用3DGS重建背景,从而将优化的CAD模型真实地插入到渲染的新视角背景中。实验结果表明,UrbanCAD在真实感方面优于基于重建和检索的基线方法,并且在下游应用中创建安全关键的驾驶场景方面具有显著优势。
链接: https://arxiv.org/abs/2411.19292
作者: Yichong Lu,Yichi Cai,Shangzhan Zhang,Hongyu Zhou,Haoji Hu,Huimin Yu,Andreas Geiger,Yiyi Liao
关键词-EN: autonomous driving simulation, CAD models, essential for autonomous, CAD, CAD models provide
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Photorealistic 3D vehicle models with high controllability are essential for autonomous driving simulation and data augmentation. While handcrafted CAD models provide flexible controllability, free CAD libraries often lack the high-quality materials necessary for photorealistic rendering. Conversely, reconstructed 3D models offer high-fidelity rendering but lack controllability. In this work, we introduce UrbanCAD, a framework that pushes the frontier of the photorealism-controllability trade-off by generating highly controllable and photorealistic 3D vehicle digital twins from a single urban image and a collection of free 3D CAD models and handcrafted materials. These digital twins enable realistic 360-degree rendering, vehicle insertion, material transfer, relighting, and component manipulation such as opening doors and rolling down windows, supporting the construction of long-tail scenarios. To achieve this, we propose a novel pipeline that operates in a retrieval-optimization manner, adapting to observational data while preserving flexible controllability and fine-grained handcrafted details. Furthermore, given multi-view background perspective and fisheye images, we approximate environment lighting using fisheye images and reconstruct the background with 3DGS, enabling the photorealistic insertion of optimized CAD models into rendered novel view backgrounds. Experimental results demonstrate that UrbanCAD outperforms baselines based on reconstruction and retrieval in terms of photorealism. Additionally, we show that various perception models maintain their accuracy when evaluated on UrbanCAD with in-distribution configurations but degrade when applied to realistic out-of-distribution data generated by our method. This suggests that UrbanCAD is a significant advancement in creating photorealistic, safety-critical driving scenarios for downstream applications.
zh
[CV-69] SADG: Segment Any Dynamic Gaussian Without Object Trackers
【速读】: 该论文试图解决动态3D场景中的语义信息整合问题,以实现更全面的3D重建,从而支持扩展现实(XR)和自动驾驶等应用。解决方案的关键在于引入了一种名为SADG(Segment Any Dynamic Gaussian Without Object Trackers)的新方法,该方法结合了动态高斯Splatting表示和语义信息,且不依赖于对象ID的监督。具体来说,SADG通过利用Segment Anything Model(SAM)生成的掩码和基于硬像素挖掘的新型对比学习目标,学习语义感知特征,从而实现动态3D对象的一致性分割。这种方法无需进一步的后处理即可有效聚类高斯特征,从而支持快速的对象级编辑操作,如对象移除、组合和风格转换。
链接: https://arxiv.org/abs/2411.19290
作者: Yun-Jin Li,Mariia Gladkova,Yan Xia,Daniel Cremers
关键词-EN: including extended reality, Understanding dynamic, including extended, extended reality, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page this https URL
点击查看摘要
Abstract:Understanding dynamic 3D scenes is fundamental for various applications, including extended reality (XR) and autonomous driving. Effectively integrating semantic information into 3D reconstruction enables holistic representation that opens opportunities for immersive and interactive applications. We introduce SADG, Segment Any Dynamic Gaussian Without Object Trackers, a novel approach that combines dynamic Gaussian Splatting representation and semantic information without reliance on object IDs. In contrast to existing works, we do not rely on supervision based on object identities to enable consistent segmentation of dynamic 3D objects. To this end, we propose to learn semantically-aware features by leveraging masks generated from the Segment Anything Model (SAM) and utilizing our novel contrastive learning objective based on hard pixel mining. The learned Gaussian features can be effectively clustered without further post-processing. This enables fast computation for further object-level editing, such as object removal, composition, and style transfer by manipulating the Gaussians in the scene. We further extend several dynamic novel-view datasets with segmentation benchmarks to enable testing of learned feature fields from unseen viewpoints. We evaluate SADG on proposed benchmarks and demonstrate the superior performance of our approach in segmenting objects within dynamic scenes along with its effectiveness for further downstream editing tasks.
zh
[CV-70] GMS-VINS:Multi-category Dynamic Objects Semantic Segmentation for Enhanced Visual-Inertial Odometry Using a Promptable Foundation Model
【速读】: 该论文试图解决视觉惯性里程计 (Visual-inertial Odometry, VIO) 在处理动态环境时,由于动态物体(如车辆、行人等)的存在而导致姿态估计精度下降的问题。解决方案的关键在于引入 GMS-VINS,它通过集成增强的 SORT 算法和鲁棒的多类别分割框架,有效提高了在动态物体多样且频繁遮挡的环境中姿态估计的准确性。具体来说,GMS-VINS 利用基础模型的即时响应能力,能够高效地跟踪和分割多种类别的动态物体,同时增强的 SORT 算法显著提升了在城市环境中对多个动态物体(尤其是在部分遮挡或快速移动情况下)的跟踪可靠性。
链接: https://arxiv.org/abs/2411.19289
作者: Rui Zhou,Jingbin Liu,Junbin Xie,Jianyu Zhang,Yingze Hu,Jiele Zhao
关键词-EN: Visual-inertial odometry, autonomous vehicles, complementary sensors, low cost, cost and complementary
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual-inertial odometry (VIO) is widely used in various fields, such as robots, drones, and autonomous vehicles, due to its low cost and complementary sensors. Most VIO methods presuppose that observed objects are static and time-invariant. However, real-world scenes often feature dynamic objects, compromising the accuracy of pose estimation. These moving entities include cars, trucks, buses, motorcycles, and pedestrians. The diversity and partial occlusion of these objects present a tough challenge for existing dynamic object removal techniques. To tackle this challenge, we introduce GMS-VINS, which integrates an enhanced SORT algorithm along with a robust multi-category segmentation framework into VIO, thereby improving pose estimation accuracy in environments with diverse dynamic objects and frequent occlusions. Leveraging the promptable foundation model, our solution efficiently tracks and segments a wide range of object categories. The enhanced SORT algorithm significantly improves the reliability of tracking multiple dynamic objects, especially in urban settings with partial occlusions or swift movements. We evaluated our proposed method using multiple public datasets representing various scenes, as well as in a real-world scenario involving diverse dynamic objects. The experimental results demonstrate that our proposed method performs impressively in multiple scenarios, outperforming other state-of-the-art methods. This highlights its remarkable generalization and adaptability in diverse dynamic environments, showcasing its potential to handle various dynamic objects in practical applications.
zh
[CV-71] OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration
【速读】: 该论文试图解决深度完成 (Depth Completion, DC) 任务中现有方法在新数据集或未见过的稀疏深度模式下泛化能力差的问题。解决方案的关键在于提出了OMNI-DC模型,该模型通过引入多分辨率深度集成层 (multi-resolution depth integration layer) 和基于概率的损失函数 (probability-based loss),能够有效处理不同密度的稀疏深度图。此外,OMNI-DC在混合的合成数据集上进行训练,并采用尺度归一化技术 (scale normalization technique),从而显著提升了模型在各种稀疏深度模式下的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2411.19278
作者: Yiming Zuo,Willow Yang,Zeyu Ma,Jia Deng
关键词-EN: RGB image, sparse depth observations, sparse depth patterns, sparse depth, aims to predict
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Depth completion (DC) aims to predict a dense depth map from an RGB image and sparse depth observations. Existing methods for DC generalize poorly on new datasets or unseen sparse depth patterns, limiting their practical applications. We propose OMNI-DC, a highly robust DC model that generalizes well across various scenarios. Our method incorporates a novel multi-resolution depth integration layer and a probability-based loss, enabling it to deal with sparse depth maps of varying densities. Moreover, we train OMNI-DC on a mixture of synthetic datasets with a scale normalization technique. To evaluate our model, we establish a new evaluation protocol named Robust-DC for zero-shot testing under various sparse depth patterns. Experimental results on Robust-DC and conventional benchmarks show that OMNI-DC significantly outperforms the previous state of the art. The checkpoints, training code, and evaluations are available at this https URL.
zh
[CV-72] On-chip Hyperspectral Image Segmentation with Fully Convolutional Networks for Scene Understanding in Autonomous Driving
【速读】: 该论文试图解决在恶劣天气和复杂光照条件下,基于计算机视觉的高级驾驶辅助系统(ADAS)在物体检测和跟踪方面的可靠性问题。解决方案的关键在于利用高光谱成像(Hyperspectral Imaging, HSI)技术,特别是近红外(NIR)光谱反射率信息,来增强物体在真实驾驶场景中的分割效果。论文通过实验验证了高光谱数据在自然户外场景中的信息提取挑战,并探讨了如何通过结合标准的小型全卷积网络(Fully Convolutional Network, FCN)模型的空间特征,来提升高光谱分割系统在ADAS应用中的性能。
链接: https://arxiv.org/abs/2411.19274
作者: Jon Gutiérrez-Zaballa,Koldo Basterretxea,Javier Echanobe,M. Victoria Martínez,Unai Martínez-Corral,Óscar Mata Carballeira,Inés del Campo
关键词-EN: vision-based advanced driver, advanced driver assistance, driver assistance systems, computer vision-based advanced, vision-based advanced
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Most of current computer vision-based advanced driver assistance systems (ADAS) perform detection and tracking of objects quite successfully under regular conditions. However, under adverse weather and changing lighting conditions, and in complex situations with many overlapping objects, these systems are not completely reliable. The spectral reflectance of the different objects in a driving scene beyond the visible spectrum can offer additional information to increase the reliability of these systems, especially under challenging driving conditions. Furthermore, this information may be significant enough to develop vision systems that allow for a better understanding and interpretation of the whole driving scene. In this work we explore the use of snapshot, video-rate hyperspectral imaging (HSI) cameras in ADAS on the assumption that the near infrared (NIR) spectral reflectance of different materials can help to better segment the objects in real driving scenarios. To do this, we have used the HSI-Drive 1.1 dataset to perform various experiments on spectral classification algorithms. However, the information retrieval of hyperspectral recordings in natural outdoor scenarios is challenging, mainly because of deficient colour constancy and other inherent shortcomings of current snapshot HSI technology, which poses some limitations to the development of pure spectral classifiers. In consequence, in this work we analyze to what extent the spatial features codified by standard, tiny fully convolutional network (FCN) models can improve the performance of HSI segmentation systems for ADAS applications. The abstract above is truncated due to submission limits. For the full abstract, please refer to the published article. (Journal reference: Journal of Systems Architecture, 2023; DOI: https://doi.org/10.1016/j.sysarc.2023.102878)
zh
[CV-73] AGS-Mesh: Adaptive Gaussian Splatting and Meshing with Geometric Priors for Indoor Room Reconstruction Using Smartphones
【速读】: 该论文试图解决在室内场景的3D重建中,由于移动设备深度传感器分辨率低和单目几何估计器多视角一致性及精度差导致的深度估计不准确问题。解决方案的关键在于提出了一种联合表面深度和法线细化的方法,通过高斯溅射(Gaussian Splatting)技术进行精确的3D重建。具体来说,论文开发了自适应过滤低质量深度和法线估计的监督策略,通过在优化过程中比较先验的一致性来实现。此外,论文还提出了一种尺度感知的网格化策略,受截断符号距离函数(TSDF)和基于八叉树的等值面提取启发,能够从高斯模型中恢复更精细的几何细节。
链接: https://arxiv.org/abs/2411.19271
作者: Xuqian Ren,Matias Turkulainen,Jiepeng Wang,Otto Seiskari,Iaroslav Melekhov,Juho Kannala,Esa Rahtu
关键词-EN: incorporating geometric priors, Geometric priors, incorporating geometric, Geometric, Gaussian Splatting methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Geometric priors are often used to enhance 3D reconstruction. With many smartphones featuring low-resolution depth sensors and the prevalence of off-the-shelf monocular geometry estimators, incorporating geometric priors as regularization signals has become common in 3D vision tasks. However, the accuracy of depth estimates from mobile devices is typically poor for highly detailed geometry, and monocular estimators often suffer from poor multi-view consistency and precision. In this work, we propose an approach for joint surface depth and normal refinement of Gaussian Splatting methods for accurate 3D reconstruction of indoor scenes. We develop supervision strategies that adaptively filters low-quality depth and normal estimates by comparing the consistency of the priors during optimization. We mitigate regularization in regions where prior estimates have high uncertainty or ambiguities. Our filtering strategy and optimization design demonstrate significant improvements in both mesh estimation and novel-view synthesis for both 3D and 2D Gaussian Splatting-based methods on challenging indoor room datasets. Furthermore, we explore the use of alternative meshing strategies for finer geometry extraction. We develop a scale-aware meshing strategy inspired by TSDF and octree-based isosurface extraction, which recovers finer details from Gaussian models compared to other commonly used open-source meshing tools. Our code is released in this https URL.
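下面是"按一致性自适应过滤低质量深度先验"这一思路的简化示意(非论文官方实现;阈值、尺度对齐方式与一致性度量均为假设):将单目估计深度按中位数比例对齐到传感器尺度后,比较两者的相对误差,仅在二者一致的位置保留先验监督。

```python
import torch

def consistency_mask(sensor_depth, mono_depth, rel_thresh=0.1):
    """sensor_depth, mono_depth: (H, W) depth maps (0 = invalid sensor reading).
    Align the monocular depth to the sensor scale with a per-image median ratio,
    then keep only pixels where the two sources agree within a relative threshold."""
    valid = sensor_depth > 0
    scale = (sensor_depth[valid] / mono_depth[valid].clamp(min=1e-6)).median()
    mono_aligned = mono_depth * scale
    rel_err = (mono_aligned - sensor_depth).abs() / sensor_depth.clamp(min=1e-6)
    return valid & (rel_err < rel_thresh)     # supervise the prior only where consistent

# toy usage
sensor = torch.rand(48, 64) * 5
mono = sensor / 2 + 0.05 * torch.randn(48, 64)
mask = consistency_mask(sensor, mono)
print(mask.float().mean())   # fraction of pixels kept for prior supervision
```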
zh
[CV-74] Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention
【速读】: 该论文试图解决在开放领域场景下生成多主体一致图像时,现有训练自由扩散模型在处理多个主体时表现不佳的问题。具体来说,现有方法在处理多个主体时存在两个主要问题:一是目标图像中不同主体之间的不期望干扰,二是由于主体在参考图像和目标图像中的位置差异较大,导致注意力机制的效果降低。为解决这些问题,论文提出了一种名为IR-Diffusion的训练自由扩散模型,其关键在于引入隔离注意力(Isolation Attention)和重定位注意力(Reposition Attention)。隔离注意力确保目标图像中的多个主体不相互参考,从而有效消除主体融合现象;重定位注意力则通过将参考图像和目标图像中的主体缩放并重定位到相同位置,使得目标图像中的主体能够更好地参考参考图像中的主体,从而保持更好的主体一致性。实验结果表明,该方法显著提升了多主体一致性,在开放领域场景下优于所有现有方法。
链接: https://arxiv.org/abs/2411.19261
作者: Huiguo He,Qiuyue Wang,Yuan Zhou,Yuxuan Cai,Hongyang Chao,Jian Yin,Huan Yang
关键词-EN: achieved remarkable progress, generating multi-subject consistent, achieved remarkable, remarkable progress, progress in generating
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Training-free diffusion models have achieved remarkable progress in generating multi-subject consistent images within open-domain scenarios. The key idea of these methods is to incorporate reference subject information within the attention layer. However, existing methods still obtain suboptimal performance when handling numerous subjects. This paper reveals the two primary issues contributing to this deficiency. Firstly, there is undesired interference among different subjects within the target image. Secondly, tokens tend to reference nearby tokens, which reduces the effectiveness of the attention mechanism when there is a significant positional difference between subjects in reference and target images. To address these challenges, we propose a training-free diffusion model with Isolation and Reposition Attention, named IR-Diffusion. Specifically, Isolation Attention ensures that multiple subjects in the target image do not reference each other, effectively eliminating the subject fusion. On the other hand, Reposition Attention involves scaling and repositioning subjects in both reference and target images to the same position within the images. This ensures that subjects in the target image can better reference those in the reference image, thereby maintaining better consistency. Extensive experiments demonstrate that the proposed methods significantly enhance multi-subject consistency, outperforming all existing methods in open-domain scenarios.
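下面用一个注意力掩码的简化示意(非 IR-Diffusion 官方实现)说明"隔离注意力"的核心操作:给每个 token 标注其所属主体,屏蔽不同主体之间的相互注意,只允许同主体或背景 token 相互参考;重定位注意力未在此展示。

```python
import torch
import torch.nn.functional as F

def isolation_attention(q, k, v, subject_id):
    """q, k, v: (N, C) tokens of one image; subject_id: (N,) int labels,
    where 0 marks background and 1..S mark different subjects.
    Tokens of different subjects are not allowed to attend to each other."""
    N, C = q.shape
    same = subject_id[:, None] == subject_id[None, :]
    background = (subject_id[None, :] == 0)
    allowed = same | background                       # keep same-subject and background keys
    scores = (q @ k.t()) / C ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# toy usage: 16 tokens, two subjects plus background
q = k = v = torch.randn(16, 32)
ids = torch.tensor([0] * 6 + [1] * 5 + [2] * 5)
print(isolation_attention(q, k, v, ids).shape)
```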
zh
[CV-75] Face2QR: A Unified Framework for Aesthetic Face-Preserving and Scannable QR Code Generation
【速读】: 该论文试图解决在生成美观的二维码(QR codes)时,将人脸身份融入二维码设计中可能导致的美观性与可扫描性之间的矛盾。解决方案的关键在于提出了一种名为Face2QR的新型流水线,该流水线包含三个创新组件:ID-refined QR integration (IDQR)、ID-aware QR ReShuffle (IDRS)和ID-preserved Scannability Enhancement (IDSE)。IDQR通过统一的基于Stable Diffusion (SD)的框架与控制网络,将背景样式与人脸身份无缝融合;IDRS通过重新排列二维码模块,有效解决人脸身份与二维码图案之间的冲突,同时保持面部特征的完整性;IDSE通过潜在代码优化显著提升扫描的鲁棒性,确保在保持人脸身份、美观质量与二维码功能之间的微妙平衡。这些组件共同确保了生成的个性化二维码在美观性、人脸身份识别和可扫描性方面均表现出色。
链接: https://arxiv.org/abs/2411.19246
作者: Xuehao Cui,Guangyang Wu,Zhenghao Gan,Guangtao Zhai,Xiaohong Liu
关键词-EN: style transfer techniques, human face identity, incorporate human face, transfer techniques, tend to compromise
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing methods to generate aesthetic QR codes, such as image and style transfer techniques, tend to compromise either the visual appeal or the scannability of QR codes when they incorporate human face identity. Addressing these imperfections, we present Face2QR, a novel pipeline specifically designed for generating personalized QR codes that harmoniously blend aesthetics, face identity, and scannability. Our pipeline introduces three innovative components. First, the ID-refined QR integration (IDQR) seamlessly intertwines the background styling with face ID, utilizing a unified Stable Diffusion (SD)-based framework with control networks. Second, the ID-aware QR ReShuffle (IDRS) effectively rectifies the conflicts between face IDs and QR patterns, rearranging QR modules to maintain the integrity of facial features without compromising scannability. Lastly, the ID-preserved Scannability Enhancement (IDSE) markedly boosts scanning robustness through latent code optimization, striking a delicate balance between face ID, aesthetic quality and QR functionality. In comprehensive experiments, Face2QR demonstrates remarkable performance, outperforming existing approaches, particularly in preserving facial recognition features within custom QR code designs. Codes are available at this https URL.
zh
[CV-76] InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception
【速读】: 该论文试图解决3D场景理解中3D高斯喷射(3D Gaussian Splatting, 3DGS)方法面临的三个主要挑战:1) 外观与语义之间的不平衡;2) 外观与语义之间的不一致性;3) 依赖自上而下的实例分割方法导致的类别分布不均问题。解决方案的关键在于提出了一种名为InstanceGaussian的新方法,该方法通过以下创新点来解决这些问题:i) 引入了一种新的语义支架高斯表示(Semantic-Scaffold-GS),平衡了外观和语义特征,从而改善了特征表示和边界划分;ii) 采用了一种渐进的外观-语义联合训练策略,增强了训练的稳定性和分割精度;iii) 提出了一种自底向上、类别无关的实例聚合方法,通过最远点采样和连通分量分析来解决分割难题。这些创新使得该方法在类别无关、开放词汇的3D点级别分割任务中达到了最先进的性能。
链接: https://arxiv.org/abs/2411.19235
作者: Haijie Li,Yanmin Wu,Jiarui Meng,Qiankun Gao,Zhiyao Zhang,Ronggang Wang,Jian Zhang
关键词-EN: autonomous driving, augmented reality, essential area, area of research, research with applications
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report, 13 pages
点击查看摘要
Abstract:3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies. Project page: this https URL
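下面给出其中"最远点采样"这一子步骤的简洁实现示意(标准算法;输入可以是高斯中心等任意点集,并非论文完整的自底向上实例聚合流程):

```python
import torch

def farthest_point_sampling(points, k):
    """points: (N, D) coordinates (e.g., Gaussian centers); returns indices of k
    samples chosen greedily so each new sample is farthest from those already picked."""
    N = points.shape[0]
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    idx[0] = torch.randint(N, (1,)).item()
    for i in range(1, k):
        dist = torch.minimum(dist, (points - points[idx[i - 1]]).norm(dim=-1))
        idx[i] = dist.argmax()
    return idx

# toy usage: subsample 32 seeds from 10k Gaussian centers
pts = torch.rand(10000, 3)
print(farthest_point_sampling(pts, 32).shape)
```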
zh
[CV-77] Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes
【速读】: 该论文试图解决现有新视角合成方法在静态3D场景中缺乏“生动性”的问题,这是创建引人入胜的3D体验的关键要素。解决方案的关键在于提出了一种名为Gaussians2Life的方法,通过利用视频扩散模型(video diffusion models)作为生成组件,并结合一种将2D视频提升为有意义的3D运动的技术,来动画化高质量3D场景中的部分内容。这种方法不仅能够实现复杂预先存在的3D场景的真实动画化,还能支持多种对象类别的动画化,相较于以往主要集中在基于先验的角色动画或单一3D对象的工作,这是一个显著的进步。
链接: https://arxiv.org/abs/2411.19233
作者: Thomas Wimmer,Michael Oechsle,Michael Niemeyer,Federico Tombari
关键词-EN: achieve impressive results, synthesis methods achieve, methods achieve impressive, view synthesis methods, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL
点击查看摘要
Abstract:State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack “liveliness,” a key component for creating engaging 3D experiences. Recently, novel video diffusion models generate realistic videos with complex motion and enable animations of 2D images, however they cannot naively be used to animate 3D scenes as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and further enables the animation of a large variety of object classes, while related work is mostly focused on prior-based character animation, or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.
zh
[CV-78] Z-STAR: A Zero-shot Style Transfer Method via Adjusting Style Distribution
【速读】: 该论文试图解决风格迁移中风格表示的局限性问题,传统方法通过预定义的风格损失(style loss)来约束风格表示,但这种方法常导致风格表达受限和生成伪影。论文提出了一种基于扩散模型(diffusion models)的零样本风格迁移方法(zero-shot style transfer),称为Z-STAR+。其关键在于利用扩散模型中的潜在特征(latent features)自然包含的风格和内容分布,直接提取风格信息并将其融入内容图像,无需重新训练模型。具体解决方案包括采用双去噪路径(dual denoising paths)在潜在空间中表示内容和风格参考,通过交叉注意力重加权模块(Cross-attention Reweighting module)根据局部内容特征查询最适合输入块的风格图像信息,以及设计缩放自适应实例归一化(scaled adaptive instance normalization)来全局调整风格和风格化图像之间的颜色分布一致性。
链接: https://arxiv.org/abs/2411.19231
作者: Yingying Deng,Xiangyu He,Fan Tang,Weiming Dong
关键词-EN: Style transfer presents, Style, significant challenge, primarily centered, transfer presents
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
点击查看摘要
Abstract:Style transfer presents a significant challenge, primarily centered on identifying an appropriate style representation. Conventional methods employ style loss, derived from second-order statistics or contrastive learning, to constrain style representation in the stylized result. However, these pre-defined style representations often limit stylistic expression, leading to artifacts. In contrast to existing approaches, we have discovered that latent features in vanilla diffusion models inherently contain natural style and content distributions. This allows for direct extraction of style information and seamless integration of generative priors into the content image without necessitating retraining. Our method adopts dual denoising paths to represent content and style references in latent space, subsequently guiding the content image denoising process with style latent codes. We introduce a Cross-attention Reweighting module that utilizes local content features to query style image information best suited to the input patch, thereby aligning the style distribution of the stylized results with that of the style image. Furthermore, we design a scaled adaptive instance normalization to mitigate inconsistencies in color distribution between style and stylized images on a global scale. Through theoretical analysis and extensive experimentation, we demonstrate the effectiveness and superiority of our diffusion-based zero-shot style transfer via adjusting style distribution, termed Z-STAR+.
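作为参照,下面给出标准 AdaIN 以及在其基础上加一个缩放系数的简化写法(非 Z-STAR+ 官方实现,系数 alpha 为假设的超参数),用于说明"在全局上把风格化结果的通道统计量对齐到风格图"的基本操作。

```python
import torch

def adain(content, style, eps=1e-5):
    """content, style: (B, C, H, W). Align channel-wise mean/std of content to style."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

def scaled_adain(content, style, alpha=0.6):
    """Interpolate between the original features and the AdaIN-normalized ones,
    so the strength of the global statistics alignment can be tuned."""
    return alpha * adain(content, style) + (1 - alpha) * content

# toy usage
c, s = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64) * 2 + 1
print(scaled_adain(c, s).shape)
```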
zh
[CV-79] Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection
【速读】: 该论文试图解决工业产品缺陷和异常的自动化检测问题,传统的人工检测方法存在速度慢、主观性强和易出错等缺点。解决方案的关键在于提出了一种无需训练的零样本学习方法,通过多模态机器学习管道实现自动化检测。具体步骤包括:首先利用大型语言模型GPT-3生成描述正常和异常产品外观的文本提示;然后使用基于对象检测的模型Grounding DINO定位图像中的产品;最后通过零样本图像-文本匹配模型CLIP比较裁剪后的产品图像块与生成的提示,以识别任何异常。该方法在MVTec-AD和VisA两个工业产品图像数据集上的实验表明,能够在无需模型训练的情况下实现高精度的缺陷和异常检测,从而在工业制造环境中实现高效、可扩展和客观的质量控制。
链接: https://arxiv.org/abs/2411.19220
作者: Tsun-Hin Cheung,Ka-Chun Fung,Songjiang Lai,Kwan-Ho Lin,Vincent Ng,Kin-Man Lam
关键词-EN: Identifying defects, Identifying, quality control task, control task, critical quality control
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted to APSIPA ASC 2024
点击查看摘要
Abstract:Identifying defects and anomalies in industrial products is a critical quality control task. Traditional manual inspection methods are slow, subjective, and error-prone. In this work, we propose a novel zero-shot training-free approach for automated industrial image anomaly detection using a multimodal machine learning pipeline, consisting of three foundation models. Our method first uses a large language model, i.e., GPT-3, to generate text prompts describing the expected appearances of normal and abnormal products. We then use a grounding object detection model, called Grounding DINO, to locate the product in the image. Finally, we compare the cropped product image patches to the generated prompts using a zero-shot image-text matching model, called CLIP, to identify any anomalies. Our experiments on two datasets of industrial product images, namely MVTec-AD and VisA, demonstrate the effectiveness of this method, achieving high accuracy in detecting various types of defects and anomalies without the need for model training. Our proposed model enables efficient, scalable, and objective quality control in industrial manufacturing settings.
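下面给出该流水线中"用 CLIP 比较产品图块与正常/异常文本提示"这一步的简化示意(非官方实现;模型名与提示词仅为示例,且假定图块已由检测模型裁剪好,提示词在实际流程中由语言模型生成):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# prompts of the kind a language model might generate for a given product
prompts = ["a photo of a normal metal nut without defects",
           "a photo of a metal nut with scratches, cracks or other defects"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def anomaly_score(crop: Image.Image) -> float:
    """Return the probability that the cropped product patch matches the defect prompt."""
    inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 1].item()          # index 1 = abnormal prompt

# toy usage with a blank image standing in for a detector crop
print(anomaly_score(Image.new("RGB", (224, 224), "gray")))
```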
zh
[CV-80] Cross-Spectral Attention for Unsupervised RGB-IR Face Verification and Person Re-identification
【速读】: 该论文试图解决跨光谱生物识别(Cross-spectral biometrics)中RGB和红外(IR)图像之间的大光谱差异问题,特别是在监督学习方法中标注数据获取困难和扩展性差的问题。解决方案的关键在于提出了一种新的无监督跨光谱框架,该框架结合了以下三个核心组件:(1) 一种新的伪三元组损失(pseudo triplet loss)与跨光谱投票机制;(2) 一种利用多子空间的跨光谱注意力网络(cross-spectral attention network);(3) 结构化稀疏性(structured sparsity)以实现更具区分性的跨光谱聚类。通过这些创新,论文在两个具有挑战性的基准数据集(ARL-VTF和RegDB)上与最新的最先进模型进行了广泛比较,并在某些情况下实现了优于完全监督方法的性能。
链接: https://arxiv.org/abs/2411.19215
作者: Kshitij Nikhal,Cedric Nimpa Fondje,Benjamin S. Riggan
关键词-EN: focal plane arrays, visible spectrum, increasing sensitivity, rapidly advanced, decade due
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Cross-spectral biometrics, such as matching imagery of faces or persons from visible (RGB) and infrared (IR) bands, have rapidly advanced over the last decade due to increasing sensitivity, size, quality, and ubiquity of IR focal plane arrays and enhanced analytics beyond the visible spectrum. Current techniques for mitigating large spectral disparities between RGB and IR imagery often include learning a discriminative common subspace by exploiting precisely curated data acquired from multiple spectra. Although there are challenges with determining robust architectures for extracting common information, a critical limitation for supervised methods is poor scalability in terms of acquiring labeled data. Therefore, we propose a novel unsupervised cross-spectral framework that combines (1) a new pseudo triplet loss with cross-spectral voting, (2) a new cross-spectral attention network leveraging multiple subspaces, and (3) structured sparsity to perform more discriminative cross-spectral clustering. We extensively compare our proposed RGB-IR biometric learning framework (and its individual components) with recent and previous state-of-the-art models on two challenging benchmark datasets: DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF) and RegDB person re-identification dataset, and, in some cases, achieve performance superior to completely supervised methods.
zh
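摘要中的"伪三元组损失 + 跨光谱投票"可以简化理解为:在伪聚类标签下,以一种光谱的特征为 anchor、另一种光谱的同类/异类特征为正/负样本。下面是一个极简的示意实现(非论文实现;投票与聚类逻辑被简化为已给定的 pseudo_labels,margin 为假设值)。

```python
import torch
import torch.nn.functional as F

def pseudo_triplet_loss(rgb_feats, ir_feats, pseudo_labels, margin=0.3):
    """跨光谱伪三元组损失示意:RGB 特征作 anchor,
    同一伪类别的 IR 特征作正样本、不同伪类别的 IR 特征作负样本,
    取 batch 内最难正/负样本构成三元组。"""
    rgb = F.normalize(rgb_feats, dim=1)
    ir = F.normalize(ir_feats, dim=1)
    dist = torch.cdist(rgb, ir)                                        # [N, N]
    same = pseudo_labels.unsqueeze(1) == pseudo_labels.unsqueeze(0)
    hardest_pos = dist.masked_fill(~same, float("-inf")).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

# 示例:8 个样本、4 个伪聚类
rgb = torch.randn(8, 128)
ir = torch.randn(8, 128)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(pseudo_triplet_loss(rgb, ir, labels))
```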
[CV-81] ANDHRA Bandersnatch: Training Neural Networks to Predict Parallel Realities
【速读】: 该论文试图解决在神经网络中通过引入多分支架构来提高模型性能的问题。解决方案的关键在于提出了一种基于多世界解释(Many-Worlds Interpretation, MWI)的新型神经网络架构,该架构在每一层将输入信号分裂成并行的分支,使用超整流激活(Hyper Rectified Activation, ANDHRA),并且这些分支在网络中不合并,形成独立的网络路径,从而生成多个输出预测头。通过联合训练这些独立的预测头并结合它们的损失值,实验结果表明在CIFAR-10/100数据集上,该架构能够在相同的参数和计算成本下,实现统计学上显著的精度提升。
链接: https://arxiv.org/abs/2411.19213
作者: Venkata Satya Sai Ajay Daliparthi
关键词-EN: Hyper Rectified Activation, Rectified Activation, Hyper Rectified, utilizing a Hyper, Many-Worlds Interpretation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: New World!
点击查看摘要
Abstract:Inspired by the Many-Worlds Interpretation (MWI), this work introduces a novel neural network architecture that splits the same input signal into parallel branches at each layer, utilizing a Hyper Rectified Activation, referred to as ANDHRA. The branched layers do not merge and form separate network paths, leading to multiple network heads for output prediction. For a network with a branching factor of 2 at three levels, the total number of heads is 2^3 = 8. The individual heads are jointly trained by combining their respective loss values. However, the proposed architecture requires additional parameters and memory during training due to the additional branches. During inference, the experimental results on CIFAR-10/100 demonstrate that there exists one individual head that outperforms the baseline accuracy, achieving statistically significant improvement with equal parameters and computational cost.
zh
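"每层分裂为 2 个并行分支、三层后得到 2^3 = 8 个预测头并把各头损失相加联合训练"这一结构,可以用如下玩具级 PyTorch 代码示意(非论文实现,层宽、激活等超参均为随意假设,仅用于说明分支拓扑)。

```python
import torch
import torch.nn as nn

class BranchingNet(nn.Module):
    """每个层级把每条路径复制到 2 个并行分支,3 个层级后形成 8 条独立路径与 8 个头。"""
    def __init__(self, in_dim=32, hidden=64, num_classes=10, levels=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleList([nn.Sequential(
                nn.Linear(in_dim if l == 0 else hidden, hidden), nn.ReLU())
                for _ in range(2 ** (l + 1))])
            for l in range(levels)
        ])
        self.heads = nn.ModuleList([nn.Linear(hidden, num_classes)
                                    for _ in range(2 ** levels)])

    def forward(self, x):
        paths = [x]
        for level_blocks in self.blocks:
            new_paths = []
            for i, p in enumerate(paths):
                # 第 i 条路径分裂给第 2i 与 2i+1 个分支,分支之间不再合并
                new_paths.append(level_blocks[2 * i](p))
                new_paths.append(level_blocks[2 * i + 1](p))
            paths = new_paths
        return [head(p) for head, p in zip(self.heads, paths)]

net = BranchingNet()
x, y = torch.randn(4, 32), torch.randint(0, 10, (4,))
loss = sum(nn.functional.cross_entropy(out, y) for out in net(x))  # 各头损失相加联合训练
loss.backward()
```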
[CV-82] Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation
【速读】: 该论文试图解决零样本(zero-shot)情况下从可见掩码(visible masks)进行模态完成(amodal completion)的问题。解决方案的关键在于提出了一种新的数据集(TABE-51)、处理流程(TABE pipeline)和评估框架,这些工具能够在不需要预训练类别标签的情况下,仅使用第一帧中对象可见的单个查询掩码(query mask)进行灵活的零样本推理。特别地,TABE-51数据集提供了高度准确的模态分割掩码(amodal segmentation masks),无需人工估计或3D重建,而TABE pipeline则专门设计用于处理对象完全遮挡情况下的模态完成任务。此外,论文还引入了一个专门的评估框架,该框架能够隔离模态完成性能,避免传统视觉分割指标的影响。
链接: https://arxiv.org/abs/2411.19210
作者: Finlay G. C. Hudson,William A. P. Smith
关键词-EN: present Track, Track, Abstract, amodal completion, amodal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present Track Anything Behind Everything (TABE), a novel dataset, pipeline, and evaluation framework for zero-shot amodal completion from visible masks. Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible, enabling flexible, zero-shot inference. Our dataset, TABE-51 provides highly accurate ground truth amodal segmentation masks without the need for human estimation or 3D reconstruction. Our TABE pipeline is specifically designed to handle amodal completion, even in scenarios where objects are completely occluded. We also introduce a specialised evaluation framework that isolates amodal completion performance, free from the influence of traditional visual segmentation metrics.
zh
[CV-83] Video Depth without Video Models
【速读】: 该论文试图解决单目视频深度估计中的时间连续性问题,特别是在相机运动导致深度范围突然变化时,直接应用单帧深度估计器会导致深度视频中的闪烁和不一致性。解决方案的关键在于将单图像潜在扩散模型 (Latent Diffusion Model, LDM) 转化为一个先进的视频深度估计器,即 RollingDepth。其核心创新包括:(i) 一个多帧深度估计器,基于单图像 LDM,能够将非常短的视频片段(通常是三帧一组)映射到深度片段;(ii) 一个基于优化的鲁棒配准算法,用于将不同帧率采样的深度片段最佳地组装成一致的视频。这种方法能够高效处理长视频,并提供比专用视频深度估计器和单帧模型更准确的深度视频。
链接: https://arxiv.org/abs/2411.19189
作者: Bingxin Ke,Dominik Narnhofer,Shengyu Huang,Lei Ke,Torben Peters,Katerina Fragkiadaki,Anton Obukhov,Konrad Schindler
关键词-EN: estimation lifts monocular, inferring dense depth, monocular video clips, lifts monocular video, depth
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations; including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets. (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various different frame rates back into a consistent video. RollingDepth is able to efficiently handle long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: this http URL.
zh
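摘要里"把不同帧率采样的深度片段通过优化配准拼回一致视频"的最小版本,可以理解为对每个片段求一组尺度/平移,使其在与相邻片段重叠的帧上深度一致。下面用 NumPy 给出一个闭式最小二乘对齐的示意(仅演示两个片段、一帧重叠的情形,数值均为假设,并非 RollingDepth 的完整优化算法)。

```python
import numpy as np

def align_scale_shift(src, ref):
    """求 s, t 使 s*src + t 在最小二乘意义下逼近 ref(src/ref 为重叠帧的深度图)。"""
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s, t

# 两个相邻深度片段,帧 2 重叠(数值为假设的相对深度)
snippet_a = {0: np.random.rand(4, 4), 1: np.random.rand(4, 4), 2: np.random.rand(4, 4)}
snippet_b = {2: snippet_a[2] * 2.0 + 0.1, 3: np.random.rand(4, 4), 4: np.random.rand(4, 4)}

s, t = align_scale_shift(snippet_b[2], snippet_a[2])
aligned_b = {k: s * v + t for k, v in snippet_b.items()}   # 将片段 B 配准到片段 A 的尺度
print("scale=%.3f shift=%.3f" % (s, t))
```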
[CV-84] SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation
【速读】: 该论文试图解决扩散生成模型中信息扩散的无序性和混乱性导致的图像区域间干扰问题,从而提升图像生成的细节保留和上下文一致性。解决方案的关键在于重新构思无序扩散为文本-视觉到图像生成任务(TV2I)中的有效工具,通过引入循环单向扩散(Cyclic One-Way Diffusion, COW)和选择性单向扩散(Selective One-Way Diffusion, SOW)来实现像素级条件保真度,同时保持视觉和语义的一致性。COW提供了一个高效的单向扩散框架,用于精确的信息传递并最小化干扰;SOW则利用多模态大语言模型(Multimodal Large Language Models, MLLMs)来明确图像中的语义和空间关系,并通过注意力机制动态调节扩散的方向和强度。
链接: https://arxiv.org/abs/2411.19182
作者: Yuhan Pei,Ruoyu Wang,Yongqi Yang,Ye Zhu,Olga Russakovsky,Yu Wu
关键词-EN: random movement, random walk, phenomenon in physics, collisions of particles, denoising trajectory
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
点击查看摘要
Abstract:Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.
zh
[CV-85] HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
【速读】: 该论文试图解决在3D环境中以自我为中心的手和物体跟踪问题,解决方案的关键在于引入了一个名为HOT3D的公开数据集。HOT3D数据集提供了超过833分钟的多视角RGB/单色图像流,展示了19名受试者与33个不同刚性物体的交互,包括眼动追踪、场景点云等多模态信号,以及全面的地面实况标注,如物体、手和相机的3D姿态,以及手和物体的3D模型。通过使用Meta的Project Aria和Quest 3设备记录数据,并结合专业的动作捕捉系统获取地面实况姿态,HOT3D数据集有效地支持了多视角自我中心数据的三个主要任务:3D手部跟踪、6DoF物体姿态估计和未知手持物体的3D提升。实验结果表明,多视角方法在HOT3D数据集上的表现显著优于单视角方法,从而验证了该数据集在解决相关问题中的关键作用。
链接: https://arxiv.org/abs/2411.19167
作者: Prithviraj Banerjee,Sindi Shkodrani,Pierre Moulon,Shreyas Hampali,Shangchen Han,Fan Zhang,Linguang Zhang,Jade Fountain,Edward Miller,Selen Basol,Richard Newcombe,Robert Wang,Jakob Julian Engel,Tomas Hodan
关键词-EN: objects, Project Aria, dataset, multi-view RGB, image streams showing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: arXiv admin note: substantial text overlap with arXiv:2406.09598
点击查看摘要
Abstract:We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.
zh
[CV-86] Lost & Found: Updating Dynamic 3D Scene Graphs from Egocentric Observations
【速读】: 该论文试图解决动态环境中对象的6自由度(6DoF)姿态跟踪问题,特别是在人机交互场景中,传统的静态语义地图无法捕捉动态变化,而重新扫描环境以更新地图既昂贵又低效。解决方案的关键在于提出了一种名为“Lost & Found”的方法,该方法基于自我中心视角的记录和相应的手部位置及相机姿态估计,能够在检测到的交互间隔内实时跟踪移动对象的6DoF姿态。这些变化被在线应用于一个可变换的场景图,该图捕捉对象级别的关系。与现有的最先进对象姿态跟踪器相比,该方法在处理自我中心视角和缺乏深度信息的情况下更为可靠,并且在平移和方向误差方面分别优于第二最佳方法34%和56%,生成的6DoF对象轨迹更为平滑。此外,该方法还展示了如何在机器人应用中利用动态场景图中的交互信息,例如通过教学-重复命令移动机械手,以及根据先前的交互信息从抽屉中检索隐藏的对象。
链接: https://arxiv.org/abs/2411.19162
作者: Tjark Behrens,René Zurbrügg,Marc Pollefeys,Zuria Bauer,Hermann Blum
关键词-EN: Recent approaches, equipping downstream applications, approaches have successfully, successfully focused, equipping downstream
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Webpage: this https URL
点击查看摘要
Abstract:Recent approaches have successfully focused on the segmentation of static reconstructions, thereby equipping downstream applications with semantic 3D understanding. However, the world in which we live is dynamic, characterized by numerous interactions between the environment and humans or robotic agents. Static semantic maps are unable to capture this information, and the naive solution of rescanning the environment after every change is both costly and ineffective in tracking e.g. objects being stored away in drawers. With Lost & Found we present an approach that addresses this limitation. Based solely on egocentric recordings with corresponding hand position and camera pose estimates, we are able to track the 6DoF poses of the moving object within the detected interaction interval. These changes are applied online to a transformable scene graph that captures object-level relations. Compared to state-of-the-art object pose trackers, our approach is more reliable in handling the challenging egocentric viewpoint and the lack of depth information. It outperforms the second-best approach by 34% and 56% for translational and orientational error, respectively, and produces visibly smoother 6DoF object trajectories. In addition, we illustrate how the acquired interaction information in the dynamic scene graph can be employed in the context of robotic applications that would otherwise be unfeasible: We show how our method allows to command a mobile manipulator through teach & repeat, and how information about prior interaction allows a mobile manipulator to retrieve an object hidden in a drawer. Code, videos and corresponding data are accessible at this https URL.
zh
[CV-87] Neural Shadow Art
【速读】: 该论文试图解决传统影子艺术(Shadow Art)中投影形状与目标图像匹配度不高的问题,特别是在不同光照方向和屏幕方向下难以精确匹配的问题。解决方案的关键在于引入神经影子艺术(Neural Shadow Art),通过隐式函数表示(implicit function representations)来优化3D模型的投影效果。具体来说,该方法允许投影几何体相对于输入的二值图像进行刚性变换,同时通过优化光照方向和屏幕方向,确保投影与目标图像高度一致。此外,该方法还支持特定的角度约束,允许用户在必要时固定投影角度。这种方法不仅在艺术创作上提供了更大的灵活性和精确性,还在工业应用中展示了材料使用效率的提升和几何平滑性的增强。
链接: https://arxiv.org/abs/2411.19161
作者: Caoliwen Wang,Bailin Deng
关键词-EN: Neural Shadow Art, Shadow art, introduce Neural Shadow, Neural Shadow, high accuracy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
点击查看摘要
Abstract:Shadow art is a captivating form of sculptural expression, where the projection of a sculpture in a specific direction reveals a desired shape with high accuracy. In this work, we introduce Neural Shadow Art, which leverages implicit function representations to expand the possibilities of shadow art. Our method provides a more flexible framework that allows projections to match input binary images under various lighting directions and screen orientations, without requiring the light source to be perpendicular to the screen. Unlike previous approaches, our method permits rigid transformations of the projected geometry relative to the input binary image. By optimizing lighting directions and screen orientations simultaneously through the implicit representation of 3D models, we ensure the projection closely resembles the target image. Additionally, like prior works, our method accommodates specific angular constraints, allowing users to fix the projection angle when necessary. Beyond its artistic significance, our approach proves valuable for industrial applications, demonstrating lower material usage and enhanced geometric smoothness. This capability avoids oversimplified results, such as the intersection of cylindrical volumes formed by light rays and the projection image. Furthermore, our approach excels in generating sculptures with complex topologies, surpassing previous methods and achieving sculptural effects akin to those in contemporary art.
zh
[CV-88] LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair
【速读】: 该论文试图解决图像编辑中视觉指令的准确性和可扩展性问题。解决方案的关键在于提出了LoRA of Change (LoC)框架,通过动态学习特定指令的LoRA(Low-Rank Adaptation)来编码前后图像对中的“变化”,从而增强模型的解释性和复用性。此外,论文引入了LoRA Reverse优化技术,使得模型能够仅使用成对数据进行大规模训练,克服了传统方法依赖四元数据(quad data)的局限性,从而支持更广泛的实际视觉指令。
链接: https://arxiv.org/abs/2411.19156
作者: Xue Song,Jiequan Cui,Hanwang Zhang,Jiaxin Shi,Jingjing Chen,Chi Zhang,Yu-Gang Jiang
关键词-EN: before-after image pair, visual instructions, before-after image, image editing, image pair
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we propose the LoRA of Change (LoC) framework for image editing with visual instructions, i.e., before-after image pairs. Compared to the ambiguities, insufficient specificity, and diverse interpretations of natural language, visual instructions can accurately reflect users’ intent. Building on the success of LoRA in text-based image editing and generation, we dynamically learn an instruction-specific LoRA to encode the “change” in a before-after image pair, enhancing the interpretability and reusability of our model. Furthermore, generalizable models for image editing with visual instructions typically require quad data, i.e., a before-after image pair, along with query and target images. Due to the scarcity of such quad data, existing models are limited to a narrow range of visual instructions. To overcome this limitation, we introduce the LoRA Reverse optimization technique, enabling large-scale training with paired data alone. Extensive qualitative and quantitative experiments demonstrate that our model produces high-quality images that align with user intent and support a broad spectrum of real-world visual instructions.
zh
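LoC 的核心依赖标准的低秩适配(LoRA)结构:在冻结权重旁增加一个低秩旁路,只训练极少量参数。下面给出一个通用的 LoRA 线性层示意(与论文中 LoRA 的具体注入位置、秩等超参无关,仅说明 LoRA 本身的形式,超参为假设值)。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """在冻结的线性层旁增加低秩分支:y = W x + (alpha/r) * B A x。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # 冻结原权重
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # 零初始化
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))                  # 只有 lora_A / lora_B 参与训练
```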
[CV-89] Counting Stacked Objects from Multi-View Images
【速读】: 该论文试图解决在计算机视觉中,由于3D物体堆叠导致的物体计数困难问题。解决方案的关键在于将任务分解为两个互补的子问题:估计物体堆叠的3D几何形状和从多视角图像中计算占用率。通过结合几何重建和基于深度学习的深度分析,该方法能够准确地计数容器内不规则堆叠的相同物体。
链接: https://arxiv.org/abs/2411.19149
作者: Corentin Dumery,Noa Etté,Jingyi Xu,Aoxiang Fan,Ren Li,Hieu Le,Pascal Fua
关键词-EN: fundamental computer vision, numerous real-world applications, computer vision task, vision task underpinning, task underpinning numerous
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
点击查看摘要
Abstract:Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems - estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D Counting pipeline on diverse real-world and large-scale synthetic datasets, which we will release publicly to facilitate further research.
zh
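按摘要的分解思路,计数逻辑可以粗略概括为:物体数量 ≈ 堆叠区域体积 × 占用率 ÷ 单个物体体积。下面是一个按此理解写的示意计算(数值与公式形式均为根据摘要做的假设,非论文给出的精确流程)。

```python
# 假设由多视角重建得到堆叠区域体积与网络预测的占用率,单个物体体积已知
stack_volume = 0.012      # 容器内堆叠区域体积,单位 m^3(假设值)
occupancy_ratio = 0.62    # 占用率(假设值)
unit_volume = 6.5e-5      # 单个物体体积,单位 m^3(假设值)

estimated_count = stack_volume * occupancy_ratio / unit_volume
print(round(estimated_count))   # 约 114 个
```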
[CV-90] Co-Learning: Towards Semi-Supervised Object Detection with Road-side Cameras
【速读】: 该论文试图解决在实际应用中获取大量标注数据的高成本和困难问题,特别是在边缘设备如路边摄像头上的目标检测任务。解决方案的关键在于开发了一种基于师生模型的半监督学习框架(Co-Learning),通过相互学习和标注对齐策略,有效利用未标注数据,从而在仅使用10%标注数据的情况下,实现了与全监督学习方法相当的性能。
链接: https://arxiv.org/abs/2411.19143
作者: Jicheng Yuan,Anh Le-Tuan,Ali Ganbarov,Manfred Hauswirth,Danh Le-Phuoc
关键词-EN: experienced rapid expansion, supervised learning methodologies, rapid expansion, contributing significantly, experienced rapid
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at EAmSI24: Edge AI meets swarm intelligence
点击查看摘要
Abstract:Recently, deep learning has experienced rapid expansion, contributing significantly to the progress of supervised learning methodologies. However, acquiring labeled data in real-world settings can be costly, labor-intensive, and sometimes scarce. This challenge inhibits the extensive use of neural networks for practical tasks due to the impractical nature of labeling vast datasets for every individual application. To tackle this, semi-supervised learning (SSL) offers a promising solution by using both labeled and unlabeled data to train object detectors, potentially enhancing detection efficacy and reducing annotation costs. Nevertheless, SSL faces several challenges, including pseudo-target inconsistencies, disharmony between classification and regression tasks, and efficient use of abundant unlabeled data, especially on edge devices, such as roadside cameras. Thus, we developed a teacher-student-based SSL framework, Co-Learning, which employs mutual learning and annotation-alignment strategies to adeptly navigate these complexities and achieves comparable performance as fully-supervised solutions using 10% labeled data.
zh
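师生式半监督检测的通用骨架是"教师参数按 EMA 更新 + 教师高置信度预测作伪标签"。下面给出一个与具体检测器无关的示意(并非 Co-Learning 的完整实现,相互学习与标注对齐策略未包含;教师模型的输出接口、阈值和动量均为假设)。

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """教师参数 = momentum * 教师 + (1 - momentum) * 学生。"""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

@torch.no_grad()
def make_pseudo_labels(teacher, unlabeled_images, score_thr=0.7):
    """用教师在未标注图像上的高置信度预测作为学生的伪标签。
    假设教师接口与 torchvision 检测模型一致:返回含 boxes/scores/labels 的字典列表。"""
    pseudo = []
    for p in teacher(unlabeled_images):
        keep = p["scores"] > score_thr
        pseudo.append({"boxes": p["boxes"][keep], "labels": p["labels"][keep]})
    return pseudo
```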
[CV-91] On Moving Object Segmentation from Monocular Video with Transformers ICCV2023
【速读】: 该论文试图解决单目移动摄像机下的运动目标检测与分割问题,这一任务需要对识别、运动和三维几何有深入理解。解决方案的关键在于提出了一种新的融合架构——M3Former,该架构利用了Transformer在分割和多模态融合方面的强大性能。由于单目视频重建运动是一个不适定问题,论文系统地分析了不同的二维和三维运动表示方法及其对分割性能的重要性。此外,论文还强调了训练数据多样性的重要性,指出需要多样化的数据集才能在Kitti和Davis等基准上达到最先进的性能。
链接: https://arxiv.org/abs/2411.19141
作者: Christian Homeyer,Christoph Schnörr
关键词-EN: Moving object detection, single moving camera, Moving object, single moving, moving camera
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: WICCV2023
点击查看摘要
Abstract:Moving object detection and segmentation from a single moving camera is a challenging task, requiring an understanding of recognition, motion and 3D geometry. Combining both recognition and reconstruction boils down to a fusion problem, where appearance and motion features need to be combined for classification and segmentation. In this paper, we present a novel fusion architecture for monocular motion segmentation - M3Former, which leverages the strong performance of transformers for segmentation and multi-modal fusion. As reconstructing motion from monocular video is ill-posed, we systematically analyze different 2D and 3D motion representations for this problem and their importance for segmentation performance. Finally, we analyze the effect of training data and show that diverse datasets are required to achieve SotA performance on Kitti and Davis.
zh
[CV-92] Visual SLAMMOT Considering Multiple Motion Models
【速读】: 该论文试图解决在自动驾驶领域中,传统SLAM(Simultaneous Localization and Mapping)和MOT(Multi-Object Tracking)作为独立模块处理时存在的局限性问题。传统SLAM方法假设环境是静态的,不适用于动态的户外场景;而传统MOT方法依赖于车辆已知状态,限制了对象状态估计的准确性。论文提出了一种视觉SLAMMOT解决方案,关键在于将多个运动模型(multiple motion models)纳入SLAMMOT框架中,实现SLAM和MOT的紧密耦合(tightly coupled SLAM and MOT),并通过视觉传感机制验证了IMM-SLAMMOT方法在视觉领域的可行性和优势。
链接: https://arxiv.org/abs/2411.19134
作者: Peilin Tian,Hao Li
关键词-EN: Simultaneous Localization, Localization and Mapping, attracting considerable research, considerable research attention, attracting considerable
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Simultaneous Localization and Mapping (SLAM) and Multi-Object Tracking (MOT) are pivotal tasks in the realm of autonomous driving, attracting considerable research attention. While SLAM endeavors to generate real-time maps and determine the vehicle’s pose in unfamiliar settings, MOT focuses on the real-time identification and tracking of multiple dynamic objects. Despite their importance, the prevalent approach treats SLAM and MOT as independent modules within an autonomous vehicle system, leading to inherent limitations. Classical SLAM methodologies often rely on a static environment assumption, suitable for indoor rather than dynamic outdoor scenarios. Conversely, conventional MOT techniques typically rely on the vehicle’s known state, constraining the accuracy of object state estimations based on this prior. To address these challenges, previous efforts introduced the unified SLAMMOT paradigm, yet primarily focused on simplistic motion patterns. In our team’s previous work IMM-SLAMMOT, we present a novel methodology incorporating consideration of multiple motion models into SLAMMOT, i.e., tightly coupled SLAM and MOT, demonstrating its efficacy in LiDAR-based systems. This paper studies the feasibility and advantages of instantiating this methodology as visual SLAMMOT, bridging the gap between LiDAR and vision-based sensing mechanisms. Specifically, we propose a solution of visual SLAMMOT considering multiple motion models and validate the inherent advantages of IMM-SLAMMOT in the visual domain.
zh
[CV-93] MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation
【速读】: 该论文试图解决基于连续场景生成多场景视频的评估问题,特别是相较于传统短视频生成,这类视频需要考虑角色一致性、艺术连贯性、审美质量以及生成内容与预期提示的匹配度等多重因素。解决方案的关键在于提出了一种基于分数的评估基准(score-based evaluation benchmark),该基准通过自动化评分过程,替代了传统的手动选择最佳镜头的方式,从而实现对复杂视频生成过程的客观和高效评估。这一方法使得能够基于自动评分选择最佳结果,生成高质量的多场景视频。
链接: https://arxiv.org/abs/2411.19121
作者: Daewon Yoon,Hyungsuk Lee,Wonsik Shin
关键词-EN: traditional short video, short video generation, continuous scenario, paper addresses, addresses the metrics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper addresses the metrics required for generating multi-scene videos based on a continuous scenario, as opposed to traditional short video generation. Scenario-based videos require a comprehensive evaluation that considers multiple factors such as character consistency, artistic coherence, aesthetic quality, and the alignment of the generated content with the intended prompt. Additionally, in video generation, unlike single images, the movement of characters across frames introduces potential issues like distortion or unintended changes, which must be effectively evaluated and corrected. In the context of probabilistic models like diffusion, generating the desired scene requires repeated sampling and manual selection, akin to how a film director chooses the best shots from numerous takes. We propose a score-based evaluation benchmark that automates this process, enabling a more objective and efficient assessment of these complexities. This approach allows for the generation of high-quality multi-scene videos by selecting the best outcomes based on automated scoring rather than manual inspection.
zh
[CV-94] Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models
【速读】: 该论文试图解决生成式模型(Generative Models)在面部合成和编辑中带来的深度伪造(Deepfake)风险问题。解决方案的关键在于利用无训练检测方法(Training-free Detection Methods),通过直接从视觉基础模型(Vision Foundation Models)中提取统计特性来区分真实和伪造图像。具体来说,论文引入了RIGID方法,利用DINOv2模型对图像空间扰动的敏感性来检测伪造图像,发现伪造图像的嵌入(Embeddings)比真实图像的嵌入对扰动更敏感。通过实验,论文发现检测性能与模型鲁棒性(Model Robustness)密切相关,自监督学习(Self-Supervised Learning, SSL)模型提供了更可靠的表示。此外,论文提出了Contrastive Blur和MINDER方法,分别通过增强面部图像的检测性能和解决噪声类型偏差(Noise Type Bias)来进一步提高检测效果。这些方法不仅提升了检测性能,还为生成模型和检测社区提供了关于模型鲁棒性在深度伪造检测中应用的深入见解。
链接: https://arxiv.org/abs/2411.19117
作者: Chung-Ting Tsai,Ching-Yun Ko,I-Hsin Chung,Yu-Chiang Frank Wang,Pin-Yu Chen
关键词-EN: introduced serious risks, synthesis and editing, rapid advancement, including deepfake techniques, detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The rapid advancement of generative models has introduced serious risks, including deepfake techniques for facial synthesis and editing. Traditional approaches rely on training classifiers and enhancing generalizability through various feature extraction techniques. Meanwhile, training-free detection methods address issues like limited data and overfitting by directly leveraging statistical properties from vision foundation models to distinguish between real and fake images. The current leading training-free approach, RIGID, utilizes DINOv2 sensitivity to perturbations in image space for detecting fake images, with fake image embeddings exhibiting greater sensitivity than those of real images. This observation prompts us to investigate how detection performance varies across model backbones, perturbation types, and datasets. Our experiments reveal that detection performance is closely linked to model robustness, with self-supervised (SSL) models providing more reliable representations. While Gaussian noise effectively detects general objects, it performs worse on facial images, whereas Gaussian blur is more effective due to potential frequency artifacts. To further improve detection, we introduce Contrastive Blur, which enhances performance on facial images, and MINDER (MINimum distance DetEctoR), which addresses noise type bias, balancing performance across domains. Beyond performance gains, our work offers valuable insights for both the generative and detection communities, contributing to a deeper understanding of model robustness property utilized for deepfake detection.
zh
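RIGID 一类免训练检测的核心判据是"伪造图像的特征对扰动更敏感":比较原图与扰动图(例如高斯模糊)在基础模型特征空间中的余弦相似度,相似度越低越可能是生成图像。下面给出一个与具体骨干无关的示意(encoder 以任意返回特征向量的模型代替,阈值为假设值,需在验证集上标定)。

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

@torch.no_grad()
def perturbation_score(encoder, image_tensor, kernel_size=9, sigma=2.0):
    """返回原图与高斯模糊图特征的余弦相似度;相似度越低,越可能是生成图像。"""
    blurred = TF.gaussian_blur(image_tensor,
                               kernel_size=[kernel_size, kernel_size],
                               sigma=[sigma, sigma])
    f_orig = encoder(image_tensor.unsqueeze(0))     # 假设 encoder 输出 [1, D] 特征
    f_pert = encoder(blurred.unsqueeze(0))
    return F.cosine_similarity(f_orig, f_pert, dim=-1).item()

def is_fake(encoder, image_tensor, threshold=0.95):
    # 阈值为假设值;针对人脸图像,论文建议用模糊而非噪声作为扰动
    return perturbation_score(encoder, image_tensor) < threshold
```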
[CV-95] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
【速读】: 该论文试图解决扩散模型在视频生成中由于去噪过程的顺序性导致的低推理速度问题。解决方案的关键是引入了一种名为“Timestep Embedding Aware Cache (TeaCache)”的无训练缓存方法,该方法通过估计和利用模型输出在不同时间步之间的波动差异来加速推理过程。TeaCache不直接使用耗时的模型输出,而是关注与模型输出强相关的模型输入,这些输入在计算成本上几乎可以忽略不计。通过使用时间步嵌入调制噪声输入,TeaCache确保输入差异更好地近似模型输出差异,并引入重缩放策略来细化估计差异,从而指导输出缓存。实验结果表明,TeaCache在视觉质量几乎无损的情况下,实现了高达4.41倍的加速。
链接: https://arxiv.org/abs/2411.19108
作者: Feng Liu,Shiwei Zhang,Xiaofeng Wang,Yujie Wei,Haonan Qiu,Yuzhong Zhao,Yingya Zhang,Qixiang Ye,Fang Wan
关键词-EN: model outputs, low inference speed, inference speed due, video generation, nature of denoising
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project: this https URL
点击查看摘要
Abstract:As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings to ensure their differences better approximating those of model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to indicate output caching. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% Vbench score) degradation of visual quality.
zh
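TeaCache 的缓存判据可以概括为:用时间步嵌入调制后的输入的相对变化量来估计输出变化,将其跨时间步累积,未超过阈值就复用上一次缓存的输出。下面是该判据的示意实现(与 Open-Sora 等具体模型解耦;论文中的重标定策略此处省略,阈值为假设值)。

```python
import torch

class TeaCacheLikeGate:
    """累积"调制后输入"的相对 L1 变化量,低于阈值则复用缓存输出。"""
    def __init__(self, threshold=0.1):
        self.threshold = threshold
        self.prev_input = None
        self.cached_output = None
        self.accumulated = 0.0

    def should_recompute(self, modulated_input: torch.Tensor) -> bool:
        if self.prev_input is None or self.cached_output is None:
            return True
        rel_diff = ((modulated_input - self.prev_input).abs().mean()
                    / self.prev_input.abs().mean()).item()
        self.accumulated += rel_diff        # 论文还会对该估计做重标定,这里省略
        return self.accumulated > self.threshold

    def step(self, modulated_input, compute_fn):
        """compute_fn 代表一次昂贵的去噪网络前向。"""
        if self.should_recompute(modulated_input):
            self.cached_output = compute_fn(modulated_input)
            self.accumulated = 0.0
        self.prev_input = modulated_input
        return self.cached_output
```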
[CV-96] Detailed Object Description with Controllable Dimensions
【速读】: 该论文试图解决多模态大语言模型(MLLMs)生成的对象描述中可能包含大量与用户意图无关内容的问题。解决方案的关键在于提出了一种无需训练的描述优化流程,称为“Dimension Tailor”。该流程通过三个步骤——维度提取、擦除和补充,将描述分解为预定义的维度,并根据用户意图进行调整。这种方法不仅提高了对象细节的质量,还提供了根据用户偏好包含或排除特定维度的灵活性,从而显著提升了MLLMs在可控对象描述生成方面的性能。
链接: https://arxiv.org/abs/2411.19106
作者: Xinran Wang,Haiwen Zhang,Baoteng Li,Kongming Liang,Hao Sun,Zhongjiang He,Zhanyu Ma,Jun Guo
关键词-EN: visually impaired individuals, Object description plays, plays an important, important role, role for visually
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Object description plays an important role for visually impaired individuals to understand and compare the differences between objects. Recent multimodal large language models (MLLMs) exhibit powerful perceptual abilities and demonstrate impressive potential for generating object-centric captions. However, the descriptions generated by such models may still usually contain a lot of content that is not relevant to the user intent. Under special scenarios, users may only need the details of certain dimensions of an object. In this paper, we propose a training-free captioning refinement pipeline, Dimension Tailor, designed to enhance user-specified details in object descriptions. This pipeline includes three steps: dimension extracting, erasing, and supplementing, which decompose the description into pre-defined dimensions and correspond to user intent. Therefore, it can not only improve the quality of object details but also offer flexibility in including or excluding specific dimensions based on user preferences. We conducted extensive experiments to demonstrate the effectiveness of Dimension Tailor on controllable object descriptions. Notably, the proposed pipeline can consistently improve the performance of the recent MLLMs. The code is currently accessible at the following anonymous link: this https URL.
zh
[CV-97] 360Recon: An Accurate Reconstruction Method Based on Depth Fusion from 360 Images
【速读】: 该论文试图解决全景图像(360-degree images)在多视图重建(Multi-View Stereo, MVS)中由于广角视场引起的畸变问题,这种畸变影响了特征提取和匹配,进而导致几何一致性问题。解决方案的关键在于提出了360Recon算法,该算法包含一个球面特征提取模块(spherical feature extraction module),能够有效缓解畸变效应。通过结合构建的3D代价体(3D cost volume)与来自全景图像的多尺度增强特征,360Recon实现了高精度的场景重建,同时保持了局部几何一致性。实验结果表明,该方法在现有的全景重建数据集上达到了最先进的性能和高效率。
链接: https://arxiv.org/abs/2411.19102
作者: Zhongmiao Yan,Qi Wu,Songpengcheng Xia,Junyuan Deng,Xiang Mu,Renbiao Jin,Ling Pei
关键词-EN: traditional pinhole cameras, enabling sparse sampling, significantly wider field, pinhole cameras, enabling sparse
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:360-degree images offer a significantly wider field of view compared to traditional pinhole cameras, enabling sparse sampling and dense 3D reconstruction in low-texture environments. This makes them crucial for applications in VR, AR, and related fields. However, the inherent distortion caused by the wide field of view affects feature extraction and matching, leading to geometric consistency issues in subsequent multi-view reconstruction. In this work, we propose 360Recon, an innovative MVS algorithm for ERP images. The proposed spherical feature extraction module effectively mitigates distortion effects, and by combining the constructed 3D cost volume with multi-scale enhanced features from ERP images, our approach achieves high-precision scene reconstruction while preserving local geometric consistency. Experimental results demonstrate that 360Recon achieves state-of-the-art performance and high efficiency in depth estimation and 3D reconstruction on existing public panoramic reconstruction datasets.
zh
[CV-98] Tracking Progress Towards Sustainable Development Goal 6 Using Satellite Imagery
【速读】: 该论文试图解决全球范围内清洁水和卫生设施普及率数据覆盖不足的问题,特别是在非洲地区。解决方案的关键在于整合非传统数据源,包括Afrobarometer调查数据、Landsat 8和Sentinel-2卫星影像,以及深度学习技术(Meta的DINO模型),构建一个高精度的建模框架,用于评估非洲各地的管道供水和污水处理系统覆盖情况。该框架通过卫星影像实现了超过96%和97%的准确率,能够为政策制定者和利益相关者提供一个筛查工具,帮助识别需要优先改善水与卫生基础设施的区域,并结合空间人口数据,估算和追踪全国范围内享有管道供水和污水处理系统的人口比例。未来,该方法还可能扩展到评估其他可持续发展目标(SDGs),特别是与关键基础设施相关的目标。
链接: https://arxiv.org/abs/2411.19093
作者: Othmane Echchabi,Nizar Talty,Josh Manto,Aya Lahlou,Ka Leung Lam
关键词-EN: Sustainable Development Goal, Nations’ Sustainable Development, significant global disparities, United Nations’ Sustainable, global disparities remain
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Clean water and sanitation are essential for health, well-being, and sustainable development, yet significant global disparities remain. Although the United Nations’ Sustainable Development Goal 6 has clear targets for universal access to clean water and sanitation, data coverage and openness remain obstacles for tracking progress in many countries. Nontraditional data sources are needed to fill this gap. This study incorporated Afrobarometer survey data, satellite imagery (Landsat 8 and Sentinel-2), and deep learning techniques (Meta’s DINO model) to develop a modelling framework for evaluating access to piped water and sewage systems across diverse African regions. The modelling framework demonstrated high accuracy, achieving over 96% and 97% accuracy in identifying areas with piped water access and sewage system access respectively using satellite imagery. It can serve as a screening tool for policymakers and stakeholders to potentially identify regions for more targeted and prioritized efforts to improve water and sanitation infrastructure. When coupled with spatial population data, the modelling framework can also estimate and track the national-level percentages of the population with access to piped water and sewage systems. In the future, this approach could potentially be extended to evaluate other SDGs, particularly those related to critical infrastructure.
zh
[CV-99] ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos
【速读】: 该论文试图解决自我中心视角与外部视角物体对应 (Ego-Exo Object Correspondence) 这一计算机视觉领域的新兴挑战,旨在将物体在自我中心视角和外部视角之间进行映射。解决方案的关键在于引入了ObjectRelator这一新方法,该方法包含两个核心模块:多模态条件融合 (Multimodal Condition Fusion, MCFuse) 和基于自监督的跨视角物体对齐 (SSL-based Cross-View Object Alignment, XObjAlign)。MCFuse通过有效融合语言和视觉条件来增强目标物体的定位,而XObjAlign则通过自监督对齐策略确保物体表示在不同视角下的一致性。实验结果表明,ObjectRelator在Ego2Exo和Exo2Ego任务上达到了最先进的性能,且仅需极少的额外参数。
链接: https://arxiv.org/abs/2411.19083
作者: Yuqian Fu,Runze Wang,Yanwei Fu,Danda Pani Paudel,Xuanjing Huang,Luc Van Gool
关键词-EN: Ego-Exo Object Correspondence, Object Correspondence task, Object Correspondence, Correspondence task, Multimodal Condition Fusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this paper, we focus on the Ego-Exo Object Correspondence task, an emerging challenge in the field of computer vision that aims to map objects across ego-centric and exo-centric views. We introduce ObjectRelator, a novel method designed to tackle this task, featuring two new modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse effectively fuses language and visual conditions to enhance target object localization, while XObjAlign enforces consistency in object representations across views through a self-supervised alignment strategy. Extensive experiments demonstrate the effectiveness of ObjectRelator, achieving state-of-the-art performance on Ego2Exo and Exo2Ego tasks with minimal additional parameters. This work provides a foundation for future research in comprehensive cross-view object relation understanding, highlighting the potential of leveraging multimodal guidance and cross-view alignment. Codes and models will be released to advance further research in this direction.
zh
[CV-100] Dynamic Attention and Bi-directional Fusion for Safety Helmet Wearing Detection
【速读】: 该论文试图解决建筑工地安全中工人安全帽佩戴的实时检测问题,特别是在复杂环境、密集工作区域和因建筑物遮挡导致的小物体或重叠物体难以检测的情况下。解决方案的关键在于提出了一种结合动态注意力机制的新算法,该机制包括特征级注意力(feature-level attention)用于尺度适应,空间注意力(spatial attention)用于空间定位,以及通道注意力(channel attention)用于任务特定洞察,从而在不增加计算开销的情况下提升小物体检测能力。此外,通过双向融合策略(two-way fusion strategy)实现双向信息流,通过自适应多尺度加权(adaptive multi-scale weighting)优化特征融合,增强对遮挡目标的识别能力。实验结果表明,该方法在mAP@[.5:.95]指标上比最佳基线提高了1.7%,同时在大尺寸图像上减少了11.9%的GFLOPs,显示出其在实际建筑安全监控中的高效性和实用性。
链接: https://arxiv.org/abs/2411.19071
作者: Junwei Feng,Xueyan Fan,Yuyang Chen,Yi Li
关键词-EN: densely populated work, populated work areas, Ensuring construction site, site safety requires, safety requires accurate
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Ensuring construction site safety requires accurate and real-time detection of workers’ safety helmet use, despite challenges posed by cluttered environments, densely populated work areas, and hard-to-detect small or overlapping objects caused by building obstructions. This paper proposes a novel algorithm for safety helmet wearing detection, incorporating a dynamic attention within the detection head to enhance multi-scale perception. The mechanism combines feature-level attention for scale adaptation, spatial attention for spatial localization, and channel attention for task-specific insights, improving small object detection without additional computational overhead. Furthermore, a two-way fusion strategy enables bidirectional information flow, refining feature fusion through adaptive multi-scale weighting, and enhancing recognition of occluded targets. Experimental results demonstrate a 1.7% improvement in mAP@[.5:.95] compared to the best baseline while reducing GFLOPs by 11.9% on larger sizes. The proposed method surpasses existing models, providing an efficient and practical solution for real-world construction safety monitoring.
zh
[CV-101] MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
【速读】: 该论文试图解决的是在指代图像分割 (Referring Image Segmentation, RIS) 任务中,传统数据增强方法未能有效提升模型性能的问题。解决方案的关键在于提出了一个名为掩码指代图像分割 (Masked Referring Image Segmentation, MaskRIS) 的新训练框架。该框架通过结合图像和文本的掩码操作,并引入畸变感知上下文学习 (Distortion-aware Contextual Learning, DCL),显著提升了模型对遮挡、信息不完整以及语言复杂性的鲁棒性。实验结果表明,MaskRIS 不仅在全监督和弱监督设置下均优于现有方法,还在多个基准数据集上达到了新的最先进性能。
链接: https://arxiv.org/abs/2411.19067
作者: Minhyun Lee,Seungho Lee,Song Park,Dongyoon Han,Byeongho Heo,Hyunjung Shim
关键词-EN: Referring Image Segmentation, advanced vision-language task, free-form text descriptions, Masked Referring Image, Image Segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: First two authors contributed equally
点击查看摘要
Abstract:Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, exploring training techniques, such as data augmentation, remains underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that the conventional image augmentations fall short of RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model’s robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at this https URL.
zh
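MaskRIS 的数据增强核心是"同时对图像做随机块遮挡、对文本做随机词遮挡"。下面给出这两步的极简示意(遮挡比例、patch 大小等超参为假设值,DCL 部分未包含)。

```python
import random
import torch

def random_patch_mask(image: torch.Tensor, patch=16, ratio=0.3):
    """按 patch 随机把图像的一部分区域置零。image 形状为 [C, H, W]。"""
    c, h, w = image.shape
    out = image.clone()
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if random.random() < ratio:
                out[:, y:y + patch, x:x + patch] = 0.0
    return out

def random_word_mask(text: str, ratio=0.2, mask_token="[MASK]"):
    """随机把指代表达中的部分词替换为掩码符号。"""
    return " ".join(mask_token if random.random() < ratio else w
                    for w in text.split())

img = torch.rand(3, 224, 224)
masked_img = random_patch_mask(img)
masked_txt = random_word_mask("the man in a red jacket holding an umbrella")
```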
[CV-102] I Dream My Painting: Connecting MLLM s and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting WACV2025
【速读】: 该论文试图解决多区域图像修复(multi-mask inpainting)的问题,即在同一图像中同时修复多个缺失或损坏区域,并使用不同的提示(prompts)来指导每个区域的修复。解决方案的关键在于设计了一种针对多模态大语言模型(multimodal LLMs)如LLaVA的微调程序,使其能够自动生成多区域修复的提示。这些生成的提示随后被输入到经过微调的Stable Diffusion模型中,该模型通过修正的交叉注意力机制(rectified cross-attention)确保每个提示仅作用于其指定的修复区域。实验结果表明,该方法在WikiArt和Densely Captioned Images数据集上的数字化绘画修复中表现出色,能够生成创意且准确的修复结果。
链接: https://arxiv.org/abs/2411.19050
作者: Nicola Fanelli,Gennaro Vessio,Giovanna Castellano
关键词-EN: content and style, blend seamlessly, surrounding content, Inpainting, Densely Captioned Images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2025
点击查看摘要
Abstract:Inpainting focuses on filling missing or corrupted regions of an image to blend seamlessly with its surrounding content and style. While conditional diffusion models have proven effective for text-guided inpainting, we introduce the novel task of multi-mask inpainting, where multiple regions are simultaneously inpainted using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically using corrupted images as inputs. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, enforcing prompts onto their designated regions for filling. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results. Our code, data, and trained models are available at this https URL.
zh
[CV-103] TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition
【速读】: 该论文试图解决跨域少样本动作识别(Cross-Domain Few-Shot Action Recognition, CDFSAR)中的领域差距问题,特别是在源域到目标域的迁移学习中。现有方法主要通过联合训练源域和目标域数据来缓解领域差距的副作用,但存在计算成本高和未能充分利用预训练模型潜力的问题。论文提出的解决方案是引入一个简单而有效的基线方法,称为时间感知模型微调(Temporal-Aware Model Tuning, TAMT)。其关键在于采用解耦范式,即在源域数据上进行预训练,然后在目标域数据上进行微调,避免了多次目标数据与单个源数据联合训练的重计算。TAMT的核心组件包括层次化时间微调网络(Hierarchical Temporal Tuning Network, HTTN),其中包含局部时间感知适配器(Temporal-Aware Adapters, TAA)和全局时间感知动量微调(Global Temporal-Aware Moment Tuning, GTMT)。TAA通过学习少量参数来重新校准冻结预训练模型的中间特征,从而实现对目标域的高效适应;GTMT则有助于生成强大的视频表示,提升目标域的匹配性能。实验结果表明,TAMT在多个广泛使用的视频基准上显著优于现有方法,达到了新的CDFSAR技术水平。
链接: https://arxiv.org/abs/2411.19041
作者: Yilong Wang,Zilin Gao,Qilong Wang,Zhaofeng Chen,Peihua Li,Qinghua Hu
关键词-EN: cross-domain FSAR, attracted recent research, recent research interests, FSAR, few-shot action recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Going beyond few-shot action recognition (FSAR), cross-domain FSAR (CDFSAR) has attracted recent research interests by solving the domain gap lying in source-to-target transfer learning. Existing CDFSAR methods mainly focus on joint training of source and target data to mitigate the side effect of domain gap. However, such kind of methods suffer from two limitations: First, pair-wise joint training requires retraining deep models in case of one source data and multiple target ones, which incurs heavy computation cost, especially for large source and small target data. Second, pre-trained models after joint training are adopted to target domain in a straightforward manner, hardly taking full potential of pre-trained models and then limiting recognition performance. To overcome above limitations, this paper proposes a simple yet effective baseline, namely Temporal-Aware Model Tuning (TAMT) for CDFSAR. Specifically, our TAMT involves a decoupled paradigm by performing pre-training on source data and fine-tuning target data, which avoids retraining for multiple target data with single source. To effectively and efficiently explore the potential of pre-trained models in transferring to target domain, our TAMT proposes a Hierarchical Temporal Tuning Network (HTTN), whose core involves local temporal-aware adapters (TAA) and a global temporal-aware moment tuning (GTMT). Particularly, TAA learns few parameters to recalibrate the intermediate features of frozen pre-trained models, enabling efficient adaptation to target domains. Furthermore, GTMT helps to generate powerful video representations, improving match performance on the target domain. Experiments on several widely used video benchmarks show our TAMT outperforms the recently proposed counterparts by 13%~31%, achieving new state-of-the-art CDFSAR results.
zh
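TAA 这类适配器的通用形式,是在冻结骨干的中间特征上加一个带残差的小瓶颈网络,只训练极少量参数。下面给出一个与具体骨干解耦的瓶颈适配器示意(并非论文的时间感知完整设计,GTMT 未包含,维度与缩减比为假设值)。

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """残差瓶颈适配器:y = x + W_up(GELU(W_down(x))),只训练 W_down / W_up。"""
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        hidden = max(dim // reduction, 1)
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)      # 零初始化,训练初期等价于恒等映射
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                   # x: [B, T, dim] 的帧级 token 特征
        return x + self.up(self.act(self.down(x)))

adapter = BottleneckAdapter(dim=768)
frozen_feat = torch.randn(2, 8, 768)        # 假设来自冻结的预训练模型中间层
tuned_feat = adapter(frozen_feat)
```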
[CV-104] 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes
【速读】: 该论文试图解决传统自回归模型(AR models)在处理大规模3D数据时存在的计算成本高和生成效率低的问题。解决方案的关键在于引入3D-WAG模型,该模型通过将3D形状编码为多尺度小波标记图(multi-scale wavelet token maps),并利用Transformer以自回归方式预测“下一更高分辨率标记图”,从而将3D自回归生成任务重新定义为“下一尺度”预测。这种方法不仅降低了生成过程中的计算成本,还通过更结构化和层次化的方式保留了3D形状的几何细节。
链接: https://arxiv.org/abs/2411.19037
作者: Tejaswini Medi,Arianna Rampini,Pradyumna Reddy,Pradeep Kumar Jayaraman,Margret Keuper
关键词-EN: remains largely unexplored, achieved remarkable success, modeling remains largely, shape modeling remains, largely unexplored
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Autoregressive (AR) models have achieved remarkable success in natural language and image generation, but their application to 3D shape modeling remains largely unexplored. Unlike diffusion models, AR models enable more efficient and controllable generation with faster inference times, making them especially suitable for data-intensive domains. Traditional 3D generative models using AR approaches often rely on "next-token" predictions at the voxel or point level. While effective for certain applications, these methods can be restrictive and computationally expensive when dealing with large-scale 3D data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional shape generation, class-conditioned and also text-conditioned shape generation. Our key idea is to encode shapes as multi-scale wavelet token maps and use a Transformer to predict the "next higher-resolution token map" in an autoregressive manner. By redefining the 3D AR generation task as "next-scale" prediction, we reduce the computational cost of generation compared to traditional "next-token" prediction models, while preserving essential geometric details of 3D shapes in a more structured and hierarchical manner. We evaluate 3D-WAG to showcase its benefit by quantitative and qualitative comparisons with state-of-the-art methods on widely used benchmarks. Our results show 3D-WAG achieves superior performance in key metrics like Coverage and MMD, generating high-fidelity 3D shapes that closely match the real data distribution.
zh
[CV-105] PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors
【速读】: 该论文试图解决点云完成(point cloud completion)中由于解空间过大导致的预测不准确问题。解决方案的关键在于利用大模型中的多视角扩散先验(multi-view diffusion priors)来生成目标形状的新视角图像,这些图像集编码了全局和局部的形状线索,特别有利于形状完成。论文设计了一个形状融合模块(shape fusion module),用于从多模态输入(如图像和点云)生成初始完整形状,并随后通过形状巩固模块(shape consolidation module)剔除由扩散先验不一致引入的不稳定点,从而获得最终的完整形状。
链接: https://arxiv.org/abs/2411.19036
作者: Guangshun Wei,Yuan Feng,Long Ma,Chen Wang,Yuanfeng Zhou,Changjian Li
关键词-EN: paper presents PCDreamer, partial point clouds, presents PCDreamer, point clouds, paper presents
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:This paper presents PCDreamer, a novel method for point cloud completion. Traditional methods typically extract features from partial point clouds to predict missing regions, but the large solution space often leads to unsatisfactory results. More recent approaches have started to use images as extra guidance, effectively improving performance, but obtaining paired data of images and partial point clouds is challenging in practice. To overcome these limitations, we harness the relatively view-consistent multi-view diffusion priors within large models, to generate novel views of the desired shape. The resulting image set encodes both global and local shape cues, which is especially beneficial for shape completion. To fully exploit the priors, we have designed a shape fusion module for producing an initial complete shape from multi-modality input (i.e., images and point clouds), and a follow-up shape consolidation module to obtain the final complete shape by discarding unreliable points introduced by the inconsistency from diffusion priors. Extensive experimental results demonstrate our superior performance, especially in recovering fine details.
zh
[CV-106] Enhancing Neural Network Robustness Against Fault Injection Through Non-linear Weight Transformations
【速读】: 该论文试图解决在实际环境中部署深度神经网络 (DNNs) 时,由于硬件故障(如辐射、老化和温度波动)导致的模型失效问题。解决方案的关键在于通过应用饱和激活函数 (SAFs) 来约束 DNN 权重,防止故障导致权重过大从而引发模型失败。具体方法是在训练阶段使用 SAFs 约束权重,在部署阶段将未应用 SAFs 的权重写入易出错的介质,读取时再应用 SAFs 进行推理。这种方法不仅增强了 DNN 对故障注入的鲁棒性,还在一定程度上提升了模型性能。论文通过在 CIFAR10、CIFAR100 和 ImageNet 2012 数据集上进行实验,验证了该方法在不同数据类型(FP32、16-bit 浮点和 8-bit 定点)下的有效性。
链接: https://arxiv.org/abs/2411.19027
作者: Ninnart Fuengfusin,Hakaru Tamukoh
关键词-EN: Deploying deep neural, deep neural networks, real-world environments poses, environments poses challenges, poses challenges due
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 6 figures
点击查看摘要
Abstract:Deploying deep neural networks (DNNs) in real-world environments poses challenges due to faults that can manifest in physical hardware from radiation, aging, and temperature fluctuations. To address this, previous works have focused on protecting DNNs via activation range restriction using clipped ReLU and finding the optimal clipping threshold. However, this work instead focuses on constraining DNN weights by applying saturated activation functions (SAFs): Tanh, Arctan, and others. SAFs prevent faults from causing DNN weights to become excessively large, which can lead to model failure. These methods not only enhance the robustness of DNNs against fault injections but also improve DNN performance by a small margin. Before deployment, DNNs are trained with weights constrained by SAFs. During deployment, the weights without applied SAF are written to mediums with faults. When read, weights with faults are applied with SAFs and are used for inference. We demonstrate our proposed method across three datasets (CIFAR10, CIFAR100, ImageNet 2012) and across three datatypes (32-bit floating point (FP32), 16-bit floating point, and 8-bit fixed point). We show that our method enables FP32 ResNet18 with ImageNet 2012 to operate at a bit-error rate of 0.00001 with minor accuracy loss, while without the proposed method, the FP32 DNN only produces random guesses. Furthermore, to accelerate the training process, we demonstrate that an ImageNet 2012 pre-trained ResNet18 can be adapted to SAF by training for a few epochs with a slight improvement in Top-1 accuracy while still ensuring robustness against fault injection.
zh
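"用饱和激活函数约束权重"的核心做法是:前向计算实际使用 w_eff = s·tanh(w),写入易出错介质的是未过 SAF 的原始参数,读出后再套一次 SAF,这样位翻转造成的超大数值会被压回有界区间 (-s, s)。下面是一个最小示意(缩放系数 s 为假设值,仅演示该约束的效果)。

```python
import torch
import torch.nn as nn

class SAFLinear(nn.Module):
    """权重经 s*tanh(.) 饱和后参与前向,位翻转产生的极端值被压到 (-s, s)。"""
    def __init__(self, in_f, out_f, scale=0.5):
        super().__init__()
        self.raw_weight = nn.Parameter(torch.randn(out_f, in_f) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_f))
        self.scale = scale

    def forward(self, x):
        w = self.scale * torch.tanh(self.raw_weight)   # 读出(可能含故障)后套 SAF
        return nn.functional.linear(x, w, self.bias)

layer = SAFLinear(128, 64)
with torch.no_grad():                                   # 模拟一次位翻转导致的异常大权重
    layer.raw_weight[0, 0] = 1e30
out = layer(torch.randn(4, 128))                        # 前向输出仍有界,不会溢出
print(out.abs().max())
```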
[CV-107] Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement
【速读】: 该论文试图解决从粗糙面部素描生成高保真彩色图像的问题。解决方案的关键在于采用了一种基于卷积块注意力机制的自动编码器网络(Convolutional Block Attention-based Auto-encoder Network, CA2N),通过编码器-解码器架构中的块注意力机制有效捕捉和增强关键面部特征。随后,利用噪声诱导的条件生成对抗网络(Conditional Generative Adversarial Network, cGAN)过程,使系统在训练过程中未见过的领域上仍能保持高性能。这些技术显著提升了图像的真实感和保真度,模型在CelebAMask-HQ、CUHK和CUFSF数据集上的表现优于现有最佳方法,分别提高了17、23和38的FID得分,从而在素描到图像生成领域达到了新的技术水平。
链接: https://arxiv.org/abs/2411.19005
作者: Muhammad Umer Ramzan,Ali Zia,Abdelwahed Khamis,Ayman Elgharabawy,Ahmad Liaqat,Usman Ali
关键词-EN: rudimentary face sketches, Attention-based Auto-encoder Network, high-fidelity colour images, Convolutional Block Attention-based, Block Attention-based Auto-encoder
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted for publication in 25th International Conference on Digital Image Computing: Techniques Applications (DICTA) 2024
点击查看摘要
Abstract:This paper presents a novel deep-learning framework that significantly enhances the transformation of rudimentary face sketches into high-fidelity colour images. Employing a Convolutional Block Attention-based Auto-encoder Network (CA2N), our approach effectively captures and enhances critical facial features through a block attention mechanism within an encoder-decoder architecture. Subsequently, the framework utilises a noise-induced conditional Generative Adversarial Network (cGAN) process that allows the system to maintain high performance even on domains unseen during the training. These enhancements lead to considerable improvements in image realism and fidelity, with our model achieving superior performance metrics that outperform the best method by FID margins of 17, 23, and 38 on the CelebAMask-HQ, CUHK, and CUFSF datasets, respectively. The model sets a new state-of-the-art in sketch-to-image generation, can generalize across sketch types, and offers a robust solution for applications such as criminal identification in law enforcement.
zh
[CV-108] MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers
【速读】: 该论文试图解决视觉变换器(Vision Transformers, ViTs)在特征学习效率上的问题,特别是当前研究主要集中在有效的token混合器(token mixers)上,而忽略了与归一化(normalization)方法之间的潜在关系。论文提出的解决方案关键在于引入两个新组件:多视图归一化模块(Multi-View Normalization, MVN)和多视图token混合器(Multi-View Token Mixer, MVTM)。MVN通过学习加权和的方式整合了批归一化(batch normalization)、层归一化(layer normalization)和实例归一化(instance normalization)三种不同归一化方法的特征,从而提供多样化的模式信息给token混合器,增强特征学习的多样性。MVTM则是一种基于卷积的多尺度token混合器,通过配置不同感受野的局部、中间和全局滤波器,捕捉视觉模式的不同范围,并根据不同阶段的特点进行调整,以提高视觉变换器的整体性能。通过在MetaFormer块中结合MVN和MVTM,论文提出的多视图变换器(Multi-Vision Transformer, MVFormer)在图像分类、目标检测和实例及语义分割任务上均表现出色,超越了当前最先进的基于卷积的视觉变换器。
链接: https://arxiv.org/abs/2411.18995
作者: Jongseong Bae,Susang Kim,Minsu Cho,Ha Young Kim
关键词-EN: Active research, token mixer, underway to enhance, enhance the efficiency, efficiency of vision
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN integrates three differently normalized features via batch, layer, and instance normalization using a learnable weighted sum. Each normalization method outputs a different distribution, generating distinct features. Thus, the MVN is expected to offer diverse pattern information to the token mixer, resulting in beneficial synergy. The MVTM is a convolution-based multiscale token mixer with local, intermediate, and global filters, and it incorporates stage specificity by configuring various receptive fields for the token mixer at each stage, efficiently capturing ranges of visual patterns. We propose a novel ViT model, multi-vision transformer (MVFormer), adopting the MVN and MVTM in the MetaFormer block, the generalized ViT scheme. Our MVFormer outperforms state-of-the-art convolution-based ViTs on image classification, object detection, and instance and semantic segmentation with the same or lower parameters and MACs. Particularly, MVFormer variants, MVFormer-T, S, and B achieve 83.4%, 84.3%, and 84.6% top-1 accuracy, respectively, on ImageNet-1K benchmark.
zh
[CV-109] Harden Deep Neural Networks Against Fault Injections Through Weight Scaling
【速读】: 该论文试图解决深度神经网络 (DNNs) 在硬件设备上部署时,由于老化、温度变化和写入错误等因素导致的权重位翻转问题,这些问题会显著降低 DNN 的性能。解决方案的关键在于提出了一种简单而有效的方法,即在将权重存储到易出错的介质之前,通过乘以常数来硬化权重。在使用时,这些权重通过除以相同的常数来恢复原始比例。该方法基于位翻转错误具有类似加性噪声的特性,因此通过除以常数可以减少位翻转引起的绝对误差。实验结果表明,仅通过乘以常数,8-bit 定点 ResNet50 在位错误率为 0.0001 时,Top-1 准确率提高了 54.418。
链接: https://arxiv.org/abs/2411.18993
作者: Ninnart Fuengfusin,Hakaru Tamukoh
关键词-EN: Deep neural networks, Deep neural, enabled smart applications, neural networks, hardware devices
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 8 figures
点击查看摘要
Abstract:Deep neural networks (DNNs) have enabled smart applications on hardware devices. However, these hardware devices are vulnerable to unintended faults caused by aging, temperature variance, and write errors. These faults can cause bit-flips in DNN weights and significantly degrade the performance of DNNs. Thus, protection against these faults is crucial for the deployment of DNNs in critical applications. Previous works have proposed error-correction-code-based methods; however, these methods often require high overheads in both memory and computation. In this paper, we propose a simple yet effective method to harden DNN weights by multiplying weights by constants before storing them to a fault-prone medium. When used, these weights are divided back by the same constants to restore the original scale. Our method is based on the observation that errors from bit-flips have properties similar to additive noise, so dividing by the constants can reduce the absolute error from bit-flips. To demonstrate our method, we conduct experiments across four ImageNet 2012 pre-trained models along with three different data types: 32-bit floating point, 16-bit floating point, and 8-bit fixed point. This method demonstrates that by simply multiplying weights by constants, the Top-1 accuracy of 8-bit fixed point ResNet50 is improved by 54.418 at a bit-error rate of 0.0001.
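下面用一个简化的 NumPy 片段示意“存储前乘以常数、读取后再除回”的加固思路:这里假设一种步长固定的 8 位定点格式,并用随机翻转一位来模拟存储介质中的位错误;Q 格式、常数 c 的取值等均为演示用假设,与论文的具体实验设置无关。

```python
import numpy as np

SCALE = 1 / 64  # fixed-point step of an assumed Q-format (sketch, not the paper's exact format)

def to_fixed(x):
    return np.clip(np.round(x / SCALE), -128, 127).astype(np.int8)

def from_fixed(q):
    return q.astype(np.float32) * SCALE

def flip_random_bit(q, rng):
    """Flip one random bit in each stored 8-bit value to simulate faults."""
    bits = rng.integers(0, 8, size=q.shape)
    return (q.view(np.uint8) ^ (1 << bits).astype(np.uint8)).view(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=10_000).astype(np.float32)
c = 4.0  # illustrative hardening constant; c * w must stay inside the representable range

plain = from_fixed(flip_random_bit(to_fixed(w), rng))          # store as-is, then corrupt
hardened = from_fixed(flip_random_bit(to_fixed(w * c), rng)) / c  # scale, corrupt, rescale

print("mean abs error (plain):   ", np.abs(plain - w).mean())
print("mean abs error (hardened):", np.abs(hardened - w).mean())
```

在这个玩具设定下,打印出的加固后误差应大致缩小到未加固误差的 1/c 左右,直观体现常数缩放对位翻转绝对误差的抑制作用。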
zh
[CV-110] SPAgent : Adaptive Task Decomposition and Model Selection for General Video Generation and Editing
【速读】: 该论文试图解决现有开源视频生成和编辑模型在面对多样化用户需求时,由于各自功能局限性而难以协调使用的问题。解决方案的关键在于提出了一个基于语义规划代理 (Semantic Planning Agent, SPAgent) 的新型视频生成和编辑系统。SPAgent 通过集成当前最先进的开源图像和视频生成与编辑模型,形成一个工具库,并通过一个三步框架(解耦意图识别、原则引导的路径规划和基于能力的执行模型选择)自动协调这些工具,以满足用户的多样化需求。此外,SPAgent 还增强了视频质量评估能力,使其能够自主评估并整合新的视频生成和编辑模型,无需人工干预。实验结果表明,SPAgent 能够有效协调模型生成或编辑视频,展示了其在各种视频任务中的多功能性和适应性。
链接: https://arxiv.org/abs/2411.18983
作者: Rong-Cheng Tu,Wenhao Sun,Zhao Jin,Jingyi Liao,Jiaxing Huang,Dacheng Tao
关键词-EN: generation and editing, made significant progress, video generation, generation, video
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:While open-source video generation and editing models have made significant progress, individual models are typically limited to specific tasks, failing to meet the diverse needs of users. Effectively coordinating these models can unlock a wide range of video generation and editing capabilities. However, manual coordination is complex and time-consuming, requiring users to deeply understand task requirements and possess comprehensive knowledge of each model’s performance, applicability, and limitations, thereby increasing the barrier to entry. To address these challenges, we propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models, enhancing the adaptability, efficiency, and overall quality of video generation and editing. Specifically, the SPAgent assembles a tool library integrating state-of-the-art open-source image and video generation and editing models as tools. After fine-tuning on our manually annotated dataset, SPAgent can automatically coordinate the tools for video generation and editing, through our novelly designed three-step framework: (1) decoupled intent recognition, (2) principle-guided route planning, and (3) capability-based execution model selection. Additionally, we enhance the SPAgent’s video quality evaluation capability, enabling it to autonomously assess and incorporate new video generation and editing models into its tool library without human intervention. Experimental results demonstrate that the SPAgent effectively coordinates models to generate or edit videos, highlighting its versatility and adaptability across various video tasks.
zh
[CV-111] Det-SAM2:Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2
【速读】: 该论文试图解决视频分割和分割结果精化的自动化问题。解决方案的关键在于构建了一个名为 Det-SAM2 的完全自动化管道,其中通过检测模型自动生成对象提示,以辅助 SAM2 进行推理和精化。这一管道不仅实现了对无限长视频流的推理,而且在保持恒定的 VRAM 和 RAM 使用量的同时,保持了与原始 SAM2 相同的效率和准确性。论文重点介绍了 Det-SAM2 框架的构建及其在工程优化上的应用,并通过一个基于 Det-SAM2 框架的案例——台球场景中的 AI 裁判系统,展示了其在实际业务中的应用潜力。
链接: https://arxiv.org/abs/2411.18977
作者: Zhiting Wang,Qiangong Zhou,Zongyang Liu
关键词-EN: demonstrates exceptional performance, segmentation results, demonstrates exceptional, exceptional performance, Segment Anything Model
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Segment Anything Model 2 (SAM2) demonstrates exceptional performance in video segmentation and refinement of segmentation results. We anticipate that it can further evolve to achieve higher levels of automation for practical applications. Building upon SAM2, we conducted a series of practices that ultimately led to the development of a fully automated pipeline, termed Det-SAM2, in which object prompts are automatically generated by a detection model to facilitate inference and refinement by SAM2. This pipeline enables inference on infinitely long video streams with constant VRAM and RAM usage, all while preserving the same efficiency and accuracy as the original SAM2. This technical report focuses on the construction of the overall Det-SAM2 framework and the subsequent engineering optimization applied to SAM2. We present a case demonstrating an application built on the Det-SAM2 framework: AI refereeing in a billiards scenario, derived from our business context. The project is available at this https URL.
zh
[CV-112] Perception of Visual Content: Differences Between Humans and Foundation Models
【速读】: 该论文试图解决的问题是如何在多样化的社会经济背景下,比较人类标注和机器学习(ML)生成的图像标注的差异,并识别内容解释中可能存在的偏见。解决方案的关键在于通过对比人类和机器生成的标注在语义上的差异,评估它们对预测模型的影响。研究结果表明,尽管在低层次的词汇类型和句子结构上,人类和机器的标注相似度较低,但在不同地区图像的感知相似性上,两者表现出相似性。此外,人类标注在地区分类上表现最佳且最平衡,而机器生成的对象和描述在收入回归上表现最佳。这一发现强调了图像质量和标注中区分性特征的重要性,同时也揭示了人类和机器在感知图像时缺乏偏见的相似性。
链接: https://arxiv.org/abs/2411.18968
作者: Nardiena A. Pratama,Shaoyang Fan,Gianluca Demartini
关键词-EN: train machine learning, Human-annotated content, annotations, Human-annotated, human
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Human-annotated content is often used to train machine learning (ML) models. However, recently, language and multi-modal foundational models have been used to replace and scale-up human annotator’s efforts. This study compares human-generated and ML-generated annotations of images representing diverse socio-economic contexts. We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels washing their hands. We compare human and ML-generated annotations semantically and evaluate their impact on predictive models. Our results show low similarity between human and machine annotations from a low-level perspective, i.e., types of words that appear and sentence structures, but are alike in how similar or dissimilar they perceive images across different regions. Additionally, human annotations resulted in best overall and most balanced region classification performance on the class level, while ML Objects and ML Captions performed best for income regression. Humans and machines’ similarity in their lack of bias when perceiving images highlights how they are more alike than what was initially perceived. The superior and fairer performance of using human annotations for region classification and machine annotations for income regression show how important the quality of the images and the discriminative features in the annotations are.
zh
[CV-113] SuperGaussians: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors
【速读】: 该论文试图解决当前基于高斯显式表示的多视图重建方法中,高斯基元仅具有单一视角依赖的颜色和不透明度,导致场景表示不紧凑的问题。解决方案的关键在于引入了一种名为SuperGaussians的新方法,通过在高斯基元中引入空间变化的颜色和不透明度,从而提升其表示能力。具体实现包括双线性插值、可移动核以及微型神经网络作为空间变化函数。实验结果表明,这三种方法均优于基线方法,其中可移动核在多个数据集上的新视角合成性能表现尤为突出,凸显了空间变化函数的强大潜力。
链接: https://arxiv.org/abs/2411.18966
作者: Rui Xu,Wenyue Chen,Jiepeng Wang,Yuan Liu,Peng Wang,Lin Gao,Shiqing Xin,Taku Komura,Xin Li,Wenping Wang
关键词-EN: multi-view reconstruction based, Splattings demonstrate impressive, Gaussian Splattings demonstrate, Gaussian Splattings, Gaussian explicit representations
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Gaussian Splattings demonstrate impressive results in multi-view reconstruction based on Gaussian explicit representations. However, the current Gaussian primitives only have a single view-dependent color and an opacity to represent the appearance and geometry of the scene, resulting in a non-compact representation. In this paper, we introduce a new method called SuperGaussians that utilizes spatially varying colors and opacity in a single Gaussian primitive to improve its representation ability. We have implemented bilinear interpolation, movable kernels, and even tiny neural networks as spatially varying functions. Quantitative and qualitative experimental results demonstrate that all three functions outperform the baseline, with the best movable kernels achieving superior novel view synthesis performance on multiple datasets, highlighting the strong potential of spatially varying functions.
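作为理解辅助,下面给出“单个高斯基元内颜色随空间位置双线性插值”这一思路的 PyTorch 草图;网格大小取 4×4、局部坐标取值范围设为 [0, 1] 等均为示意性假设,与论文实现无关。

```python
import torch

def bilinear_color(color_grid: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
    """Sketch of a spatially varying Gaussian color.
    color_grid: (H, W, 3) learnable colors attached to one primitive.
    uv: (N, 2) local coordinates in [0, 1] inside the primitive's footprint.
    Returns (N, 3) bilinearly interpolated colors."""
    H, W, _ = color_grid.shape
    x = uv[:, 0] * (W - 1)
    y = uv[:, 1] * (H - 1)
    x0, y0 = x.floor().long(), y.floor().long()
    x1, y1 = (x0 + 1).clamp(max=W - 1), (y0 + 1).clamp(max=H - 1)
    wx, wy = (x - x0.float()).unsqueeze(1), (y - y0.float()).unsqueeze(1)
    c00, c01 = color_grid[y0, x0], color_grid[y0, x1]
    c10, c11 = color_grid[y1, x0], color_grid[y1, x1]
    top = c00 * (1 - wx) + c01 * wx
    bot = c10 * (1 - wx) + c11 * wx
    return top * (1 - wy) + bot * wy

grid = torch.rand(4, 4, 3, requires_grad=True)  # assumed 4x4 learnable color grid for one primitive
uv = torch.rand(8, 2)                           # query points inside the primitive
print(bilinear_color(grid, uv).shape)           # torch.Size([8, 3])
```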
zh
[CV-114] Random Sampling for Diffusion-based Adversarial Purification
【速读】: 该论文试图解决在对抗性净化中,现有的基于扩散模型的方法忽视了一个基本问题,即原始的扩散模型采样(DDPM)是为稳定生成设计的,可能不是对抗性净化的最佳解决方案。论文提出的解决方案之关键是引入了一种新的采样方案——随机采样(random sampling),该方案在每次扩散过程中从随机噪声空间中采样,相比DDPM和DDIM的连续采样,增加了更多的随机性,从而提高了对抗攻击的鲁棒性。此外,论文还提出了一种新的中介条件引导(mediator conditional guidance),以确保净化后的图像与干净图像输入下的预测一致性。通过这些创新,论文建立了一个名为DiffAP的基准方法,显著优于现有的最先进(SOTA)方法,在性能和防御稳定性方面取得了显著提升。
链接: https://arxiv.org/abs/2411.18956
作者: Jiancheng Zhang,Peiran Dong,Yongyong Chen,Yin-Ping Zhao,Song Guo
关键词-EN: Diffusion Probabilistic Models, Denoising Diffusion Probabilistic, gained great attention, Probabilistic Models, Diffusion Implicit Model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Denoising Diffusion Probabilistic Models (DDPMs) have gained great attention in adversarial purification. Current diffusion-based works focus on designing effective condition-guided mechanisms while ignoring a fundamental problem, i.e., the original DDPM sampling is intended for stable generation, which may not be the optimal solution for adversarial purification. Inspired by the stability of the Denoising Diffusion Implicit Model (DDIM), we propose an opposite sampling scheme called random sampling. In brief, random sampling will sample from a random noisy space during each diffusion process, while DDPM and DDIM sampling will continuously sample from the adjacent or original noisy space. Thus, random sampling obtains more randomness and achieves stronger robustness against adversarial attacks. Correspondingly, we also introduce a novel mediator conditional guidance to guarantee the consistency of the prediction under the purified image and clean image input. To expand awareness of guided diffusion purification, we conduct a detailed evaluation with different sampling methods and our random sampling achieves an impressive improvement in multiple settings. Leveraging mediator-guided random sampling, we also establish a baseline method named DiffAP, which significantly outperforms state-of-the-art (SOTA) approaches in performance and defensive stability. Remarkably, under strong attack, our DiffAP even achieves a more than 20% robustness advantage with 10× sampling acceleration.
zh
[CV-115] Waterfall Transformer for Multi-person Pose Estimation
【速读】: 该论文试图解决多人体姿态估计问题,提出了一种名为Waterfall Transformer架构的Pose估计模型(WTPose)。解决方案的关键在于利用基于Transformer的瀑布模块,该模块从不同骨干阶段生成多尺度特征图,并通过级联架构中的过滤操作扩展感受野,捕捉局部和全局上下文,从而增强网络的整体特征表示能力。实验结果表明,结合改进的Swin骨干网络和Transformer瀑布模块的WTPose架构在COCO数据集上优于其他Transformer架构。
链接: https://arxiv.org/abs/2411.18944
作者: Navin Ranjan,Bruno Artacho,Andreas Savakis
关键词-EN: trainable framework designed, multi-person pose estimation, Pose estimation, transformer-based waterfall module, Waterfall Transformer architecture
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose the Waterfall Transformer architecture for Pose estimation (WTPose), a single-pass, end-to-end trainable framework designed for multi-person pose estimation. Our framework leverages a transformer-based waterfall module that generates multi-scale feature maps from various backbone stages. The module performs filtering in the cascade architecture to expand the receptive fields and to capture local and global context, thereby increasing the overall feature representation capability of the network. Our experiments on the COCO dataset demonstrate that the proposed WTPose architecture, with a modified Swin backbone and transformer-based waterfall module, outperforms other transformer architectures for multi-person pose estimation.
zh
[CV-116] Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition
【速读】: 该论文试图解决基于骨骼的动作识别中,由于骨骼表示缺乏图像级细节而导致相似动作轨迹难以区分的问题。解决方案的关键在于关注局部骨骼组件的细粒度运动细节,并引入ProtoGCN,一种基于图卷积网络 (Graph Convolutional Network, GCN) 的模型。ProtoGCN通过将整个骨骼序列的动力学分解为表示动作单元核心运动模式的多个可学习原型,并通过对比原型的重构来有效识别和增强相似动作的区分性表示。该方法在多个基准数据集上实现了最先进的性能,证明了其有效性。
链接: https://arxiv.org/abs/2411.18941
作者: Hongda Liu,Yunfan Liu,Min Ren,Hao Wang,Yunlong Wang,Zhenan Sun
关键词-EN: skeleton-based action recognition, Graph Convolutional Network, key challenge, challenge is distinguishing, trajectories of joints
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In skeleton-based action recognition, a key challenge is distinguishing between actions with similar trajectories of joints due to the lack of image-level details in skeletal representations. Recognizing that the differentiation of similar actions relies on subtle motion details in specific body parts, we direct our approach to focus on the fine-grained motion of local skeleton components. To this end, we introduce ProtoGCN, a Graph Convolutional Network (GCN)-based model that breaks down the dynamics of entire skeleton sequences into a combination of learnable prototypes representing core motion patterns of action units. By contrasting the reconstruction of prototypes, ProtoGCN can effectively identify and enhance the discriminative representation of similar actions. Without bells and whistles, ProtoGCN achieves state-of-the-art performance on multiple benchmark datasets, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, and FineGYM, which demonstrates the effectiveness of the proposed method. The code is available at this https URL.
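下面是“将特征重构为若干可学习原型的组合”这一思路的最小 PyTorch 草图;原型数量、softmax 组合系数以及用池化后的骨骼序列特征作为输入,都是为便于说明所做的假设,并非 ProtoGCN 的官方结构。

```python
import torch
import torch.nn as nn

class PrototypeReconstruction(nn.Module):
    """Sketch: re-express a pooled skeleton-sequence feature as a soft
    mixture of K learnable motion prototypes (dims are illustrative)."""
    def __init__(self, dim: int = 256, num_prototypes: int = 32):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.to_coeff = nn.Linear(dim, num_prototypes)

    def forward(self, feat: torch.Tensor):
        # feat: (B, dim) pooled feature of one skeleton sequence
        coeff = torch.softmax(self.to_coeff(feat), dim=-1)  # (B, K) mixture weights
        recon = coeff @ self.prototypes                      # (B, dim) prototype-based reconstruction
        return recon, coeff

model = PrototypeReconstruction()
feat = torch.randn(4, 256)
recon, coeff = model(feat)
# A contrastive-style objective could then compare reconstructions of similar
# vs. dissimilar actions to sharpen the discriminative structure.
print(recon.shape, coeff.shape)
```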
zh
[CV-117] Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects
【速读】: 该论文试图解决扩散模型在合成多个相似外观主体时存在的主题混合问题。解决方案的关键在于提出了一种自交叉扩散引导(Self-Cross diffusion guidance)方法,通过惩罚交叉注意力图与聚合自注意力图之间的重叠来有效消除主题混合。与仅依赖自注意力或交叉注意力的先前方法相比,自交叉引导不仅能更有效地解决混合问题,还能处理主体相关区域(如鸟的喙)的混合问题。该方法通过聚合自动选择的主体补丁的自注意力图来形成主体关注的区域,且无需额外训练,可提升任何基于Transformer的扩散模型的性能,如Stable Diffusion。
链接: https://arxiv.org/abs/2411.18936
作者: Weimin Qiu,Jieke Wang,Meng Tang
关键词-EN: achieved unprecedented fidelity, achieved unprecedented, unprecedented fidelity, fidelity and diversity, diversity for synthesizing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have achieved unprecedented fidelity and diversity for synthesizing image, video, 3D assets, etc. However, subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly for synthesizing multiple similar-looking subjects. We propose Self-Cross diffusion guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our self-cross guidance is more effective in eliminating subject mixing. What’s more, our guidance addresses mixing for all relevant patches of a subject beyond the most discriminant one, e.g., beak of a bird. We aggregate self-attention maps of automatically selected patches for a subject to form a region that the whole subject attends to. Our method is training-free and can boost the performance of any transformer-based diffusion model such as Stable Diffusion. We also release a more challenging benchmark with many text prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross guidance.
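下面用一小段 PyTorch 代码示意“惩罚一个主体的交叉注意力与另一主体聚合自注意力区域之间重叠”的思想;其中的归一化方式与逐元素乘积形式的重叠度量均为笔者的简化假设,并非论文中的精确损失形式。

```python
import torch

def self_cross_overlap_penalty(cross_maps: torch.Tensor, agg_self_maps: torch.Tensor) -> torch.Tensor:
    """Simplified sketch of a Self-Cross style guidance term.
    cross_maps:    (S, H, W) cross-attention map of each of the S subject tokens.
    agg_self_maps: (S, H, W) aggregated self-attention region of each subject.
    Penalizes the mass of subject i's cross-attention falling inside subject j's
    self-attention region for i != j (the exact loss form is an assumption)."""
    S = cross_maps.shape[0]
    cross = cross_maps.flatten(1)
    cross = cross / (cross.sum(dim=1, keepdim=True) + 1e-8)      # normalize each map to a distribution
    regions = agg_self_maps.flatten(1)
    regions = regions / (regions.amax(dim=1, keepdim=True) + 1e-8)
    penalty = cross.new_zeros(())
    for i in range(S):
        for j in range(S):
            if i != j:
                penalty = penalty + (cross[i] * regions[j]).sum()
    return penalty

# During sampling, the gradient of this penalty w.r.t. the latent could steer
# the update, in the spirit of attention-based guidance.
maps_c = torch.rand(2, 16, 16, requires_grad=True)
maps_s = torch.rand(2, 16, 16)
print(self_cross_overlap_penalty(maps_c, maps_s))
```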
zh
[CV-118] Efficient Track Anything
【速读】: 该论文试图解决视频对象分割和跟踪任务中,Segment Anything Model 2 (SAM 2) 由于其多阶段图像编码器和内存模块的高计算复杂度,导致在实际应用中(如移动设备上的视频对象分割)难以实现高效运行的问题。解决方案的关键在于提出了一种轻量级的跟踪任何模型 (EfficientTAMs),通过重新审视非层次化的Vision Transformer (ViT) 作为图像编码器,并引入高效的内存模块,从而在保持高质量结果的同时,显著降低计算复杂度和模型大小。具体来说,EfficientTAMs 使用简单的轻量级 ViT 和高效的内存模块构建,经过在 SA-1B 和 SA-V 数据集上的训练,能够在多个视频分割基准测试中与 SAM 2 模型相比,实现约2倍的加速和约2.4倍的参数减少,同时在移动设备上也能以合理的质量进行视频对象分割。
链接: https://arxiv.org/abs/2411.18933
作者: Yunyang Xiong,Chong Zhou,Xiaoyu Xiang,Lemeng Wu,Chenchen Zhu,Zechun Liu,Saksham Suri,Balakrishnan Varadarajan,Ramya Akula,Forrest Iandola,Raghuraman Krishnamoorthi,Bilge Soran,Vikas Chandra
关键词-EN: video object segmentation, video object, object segmentation, object, segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computation complexity of multistage image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track anything models that produce high-quality results with low latency and model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity for both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and efficient memory module to build EfficientTAMs, and train the models on SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with vanilla ViT perform comparably to SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for performing video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.
zh
[CV-119] VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference
【速读】: 该论文试图解决的问题是如何在扩散概率模型中有效地处理带有掩码或部分缺失的图像的条件生成问题。现有的启发式采样方法虽然能够解决一些逆问题,但它们并不直接近似推理查询所施加的真实条件分布,且在大面积掩码区域中效果不佳,尤其不适用于使用图像编码以提高效率的潜在扩散模型。论文提出的解决方案之关键是开发了一种分层变分推断算法,该算法通过解析地边缘化缺失特征,并利用严格的变分边界来优化非高斯马尔可夫近似,从而直接逼近真实的扩散后验分布。实验结果表明,该方法(VIPaint)在填补缺失区域的合理性和多样性方面显著优于先前的方法,并且可以轻松扩展到其他逆问题,如去模糊和超分辨率。
链接: https://arxiv.org/abs/2411.18929
作者: Sakshi Agarwal,Gabe Hoope,Erik B. Sudderth
关键词-EN: probabilistic models learn, learn to remove, artificially added, remove noise, Diffusion probabilistic models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 9 figures
点击查看摘要
Abstract:Diffusion probabilistic models learn to remove noise that is artificially added to the data during training. Novel data, like images, may then be generated from Gaussian noise through a sequence of denoising operations. While this Markov process implicitly defines a joint distribution over noise-free data, it is not simple to condition the generative process on masked or partial images. A number of heuristic sampling procedures have been proposed for solving inverse problems with diffusion priors, but these approaches do not directly approximate the true conditional distribution imposed by inference queries, and are often ineffective for large masked regions. Moreover, many of these baselines cannot be applied to latent diffusion models which use image encodings for efficiency. We instead develop a hierarchical variational inference algorithm that analytically marginalizes missing features, and uses a rigorous variational bound to optimize a non-Gaussian Markov approximation of the true diffusion posterior. Through extensive experiments with both pixel-based and latent diffusion models of images, we show that our VIPaint method significantly outperforms previous approaches in both the plausibility and diversity of imputations, and is easily generalized to other inverse problems like deblurring and superresolution.
zh
[CV-120] Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?
【速读】: 该论文试图解决医学领域数据稀缺问题,特别是在结肠镜检查视频中息肉定位任务中。解决方案的关键在于利用去噪扩散模型(Denoising Diffusion models)生成带有定位标注的结肠镜图像数据,以扩充训练数据集。通过这种方式,研究者能够在数据量有限的情况下,通过迁移学习提升基于YOLO v9的息肉定位模型的性能。
链接: https://arxiv.org/abs/2411.18926
作者: Adrian Tormos,Blanca Llauradó,Fernando Núñez,Axel Romero,Dario Garcia-Gasulla,Javier Béjar
关键词-EN: medical domains hinders, Deep Learning, data, Deep Learning models, medical domains
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The scarcity of data in medical domains hinders the performance of Deep Learning models. Data augmentation techniques can alleviate that problem, but they usually rely on functional transformations of the data that do not guarantee to preserve the original tasks. To approximate the distribution of the data using generative models is a way of reducing that problem and also to obtain new samples that resemble the original data. Denoising Diffusion models is a promising Deep Learning technique that can learn good approximations of different kinds of data like images, time series or tabular data. Automatic colonoscopy analysis and specifically Polyp localization in colonoscopy videos is a task that can assist clinical diagnosis and treatment. The annotation of video frames for training a deep learning model is a time consuming task and usually only small datasets can be obtained. The fine tuning of application models using a large dataset of generated data could be an alternative to improve their performance. We conduct a set of experiments training different diffusion models that can generate jointly colonoscopy images with localization annotations using a combination of existing open datasets. The generated data is used on various transfer learning experiments in the task of polyp localization with a model based on YOLO v9 on the low data regime.
zh
[CV-121] Textured As-Is BIM via GIS-informed Point Cloud Segmentation WWW
【速读】: 该论文试图解决从零开始创建现况模型(as-is models)所需的高人工成本和时间成本问题。解决方案的关键在于利用机器学习和深度学习模型对点云数据(Point Cloud Data, PCD)进行物体识别和语义分割,从而实现自动化生成语义丰富的3D几何模型。此外,通过整合地理信息系统(Geoinformation System, GIS)数据,进一步增强模型的语义信息,从而生成GIS信息增强且BIM(Building Information Modeling)就绪的现况建筑信息模型(BIM),特别是在铁路项目中。该方法展示了显著的成本节约潜力,并揭示了自由可用的GIS数据在自动化过程中的未充分利用资源。
链接: https://arxiv.org/abs/2411.18898
作者: Mohamed S. H. Alabassy
关键词-EN: money-consuming task due, high manual effort, manual effort, Deep Learning Models, money-consuming task
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Permission granted by all co-authors for the publication of the extended article to the conference paper “BIM Integration for Automated Identification of Relevant Geo-Context Information via Point Cloud Segmentation” (2023). URL: this https URL
点击查看摘要
Abstract:Creating as-is models from scratch is to this day still a time- and money-consuming task due to its high manual effort. Therefore, projects, especially those with a big spatial extent, could profit from automating the process of creating semantically rich 3D geometries from surveying data such as Point Cloud Data (PCD). An automation can be achieved by using Machine and Deep Learning Models for object recognition and semantic segmentation of PCD. As PCDs do not usually include more than the mere position and RGB colour values of points, tapping into semantically enriched Geoinformation System (GIS) data can be used to enhance the process of creating meaningful as-is models. This paper presents a methodology, an implementation framework and a proof of concept for the automated generation of GIS-informed and BIM-ready as-is Building Information Models (BIM) for railway projects. The results show a high potential for cost savings and reveal the unemployed resources of freely accessible GIS data within.
zh
[CV-122] T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving
【速读】: 该论文试图解决自动驾驶中理解和生成高清晰度(HD)地图的挑战,特别是如何准确建模交通场景中的车道、道路信号及其拓扑关系。解决方案的关键在于提出了一个新颖的交通拓扑场景图(Traffic Topology Scene Graph, T2SG),并通过TopoFormer模型生成该图。TopoFormer包含两个新设计的层:车道聚合层(Lane Aggregation Layer, LAL)用于利用车道中心线的几何距离来聚合全局信息,以及反事实干预层(Counterfactual Intervention Layer, CIL)用于在反事实干预下建模合理的道路结构。这些创新使得生成的T2SG能够更准确和可解释地描述交通场景的拓扑结构,从而在下游任务中显著提升交通拓扑推理性能,达到OpenLane-V2基准测试中的最先进性能(46.3 OLS)。
链接: https://arxiv.org/abs/2411.18894
作者: Changsheng Lv,Mengshi Qi,Liang Liu,Huadong Ma
关键词-EN: Topology Scene Graph, maps present significant, present significant challenges, Scene Graph, Traffic Topology Scene
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Understanding the traffic scenes and then generating high-definition (HD) maps present significant challenges in autonomous driving. In this paper, we defined a novel Traffic Topology Scene Graph, a unified scene graph explicitly modeling the lane, controlled and guided by different road signals (e.g., right turn), and topology relationships among them, which is always ignored by previous high-definition (HD) mapping methods. For the generation of T2SG, we propose TopoFormer, a novel one-stage Topology Scene Graph TransFormer with two newly designed layers. Specifically, TopoFormer incorporates a Lane Aggregation Layer (LAL) that leverages the geometric distance among the centerline of lanes to guide the aggregation of global information. Furthermore, we proposed a Counterfactual Intervention Layer (CIL) to model the reasonable road structure ( e.g., intersection, straight) among lanes under counterfactual intervention. Then the generated T2SG can provide a more accurate and explainable description of the topological structure in traffic scenes. Experimental results demonstrate that TopoFormer outperforms existing methods on the T2SG generation task, and the generated T2SG significantly enhances traffic topology reasoning in downstream tasks, achieving a state-of-the-art performance of 46.3 OLS on the OpenLane-V2 benchmark. We will release our source code and model.
zh
[CV-123] ETSM: Automating Dissection Trajectory Suggestion and Confidence Map-Based Safety Margin Prediction for Robot-assisted Endoscopic Submucosal Dissection
【速读】: 该论文试图解决机器人辅助内镜黏膜下剥离术 (Robot-assisted Endoscopic Submucosal Dissection, ESD) 中预测剥离轨迹的难题,特别是在肿瘤边缘多变和视觉条件动态变化的情况下。解决方案的关键在于引入了一个结合最佳剥离轨迹预测和基于置信度图的安全边缘的框架,并提出了基于回归的置信度图预测网络 (Regression-based Confidence Map Prediction Network, RCMNet)。RCMNet 通过回归方法预测剥离区域的置信度图,从而划定不同安全边缘级别,显著提高了预测的准确性和剥离过程的安全性。实验结果表明,该方法在置信度图预测任务中表现优异,平均绝对误差 (MAE) 仅为 3.18,展示了其在临床实践中的重要意义。
链接: https://arxiv.org/abs/2411.18884
作者: Mengya Xu,Wenjin Mo,Guankun Wang,Huxin Gao,An Wang,Long Bai,Chaoyang Lyu,Xiaoxiao Yang,Zhen Li,Hongliang Ren
关键词-EN: Robot-assisted Endoscopic Submucosal, Robot-assisted Endoscopic, Endoscopic Submucosal Dissection, Endoscopic Submucosal, Confidence Map-based Safety
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Robot-assisted Endoscopic Submucosal Dissection (ESD) improves the surgical procedure by providing a more comprehensive view through advanced robotic instruments and bimanual operation, thereby enhancing dissection efficiency and accuracy. Accurate prediction of dissection trajectories is crucial for better decision-making, reducing intraoperative errors, and improving surgical training. Nevertheless, predicting these trajectories is challenging due to variable tumor margins and dynamic visual conditions. To address this issue, we create the ESD Trajectory and Confidence Map-based Safety Margin (ETSM) dataset with 1849 short clips, focusing on submucosal dissection with a dual-arm robotic system. We also introduce a framework that combines optimal dissection trajectory prediction with a confidence map-based safety margin, providing a more secure and intelligent decision-making tool to minimize surgical risks for ESD procedures. Additionally, we propose the Regression-based Confidence Map Prediction Network (RCMNet), which utilizes a regression approach to predict confidence maps for dissection areas, thereby delineating various levels of safety margins. We evaluate our RCMNet using three distinct experimental setups: in-domain evaluation, robustness assessment, and out-of-domain evaluation. Experimental results show that our approach excels in the confidence map-based safety margin prediction task, achieving a mean absolute error (MAE) of only 3.18 . To the best of our knowledge, this is the first study to apply a regression approach for visual guidance concerning delineating varying safety levels of dissection areas. Our approach bridges gaps in current research by improving prediction accuracy and enhancing the safety of the dissection process, showing great clinical significance in practice.
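下面给出一个以回归方式预测逐像素置信度图、并用 L1(即 MAE)作为训练目标的最小 PyTorch 草图;其中通道数与两层卷积的结构只是示意,并非 RCMNet 的真实网络。

```python
import torch
import torch.nn as nn

class ConfidenceMapHead(nn.Module):
    """Sketch of a regression-style confidence-map head: backbone features are
    mapped to a per-pixel confidence in [0, 1] and trained with an L1 (MAE)
    objective, mirroring the regression formulation described above."""
    def __init__(self, in_ch: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, feats):
        return self.head(feats)

head = ConfidenceMapHead()
feats = torch.randn(2, 64, 128, 128)
target = torch.rand(2, 1, 128, 128)                # ground-truth confidence map (dummy data)
loss = nn.functional.l1_loss(head(feats), target)  # MAE training signal
print(loss.item())
```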
zh
[CV-124] GTPC-SSCD: Gate-guided Two-level Perturbation Consistency-based Semi-Supervised Change Detection
【速读】: 该论文试图解决现有半监督变化检测 (Semi-supervised Change Detection, SSCD) 方法在利用未标记数据时仅实施单一层次扰动,无法充分挖掘未标记数据潜力的问题。解决方案的关键在于引入了一种新的门控双层扰动一致性正则化方法 (Gate-guided Two-level Perturbation Consistency regularization-based SSCD, GTPC-SSCD),该方法在图像级别和特征级别同时保持强弱一致性,从而有效利用未标记数据。此外,设计了一个门控模块来评估不同样本的训练复杂度,并决定是否对每个样本进行特征扰动,这种差异化处理使得网络能够更有效地探索未标记数据的潜力。
链接: https://arxiv.org/abs/2411.18880
作者: Yan Xing,Qi’ao Xu,Zongyu Guo,Rui Huang,Yuxiang Zhang
关键词-EN: employs partially labeled, consistency regularization-based SSCD, partially labeled data, Semi-supervised change detection, unlabeled data
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures
点击查看摘要
Abstract:Semi-supervised change detection (SSCD) employs partially labeled data and a substantial amount of unlabeled data to identify differences between images captured in the same geographic area but at different times. However, existing consistency regularization-based SSCD methods only implement perturbations at a single level and can not exploit the full potential of unlabeled data. In this paper, we introduce a novel Gate-guided Two-level Perturbation Consistency regularization-based SSCD method (GTPC-SSCD), which simultaneously maintains strong-to-weak consistency at the image level and perturbation consistency at the feature level, thus effectively utilizing the unlabeled data. Moreover, a gate module is designed to evaluate the training complexity of different samples and determine the necessity of performing feature perturbations on each sample. This differential treatment enables the network to more effectively explore the potential of unlabeled data. Extensive experiments conducted on six public remote sensing change detection datasets demonstrate the superiority of our method over seven state-of-the-art SSCD methods.
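下面的 PyTorch 草图示意“图像级强弱一致性 + 由门控决定是否施加特征级扰动”的组合方式;其中用通道拼接的双时相输入、用 dropout 作为特征扰动、门控分数直接给定并按阈值筛选样本,均为演示用假设,并非论文的具体实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCDNet(nn.Module):
    """Minimal stand-in for a change-detection network (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Conv2d(6, 16, 3, padding=1)  # bi-temporal pair, channel-concatenated
        self.decode = nn.Conv2d(16, 2, 1)             # change / no-change logits

def gtpc_style_unsup_loss(model, weak, strong, gate_scores, thr=0.5):
    """Sketch of gate-guided two-level consistency on unlabeled pairs."""
    with torch.no_grad():
        pseudo = model.decode(model.encode(weak)).argmax(dim=1)  # pseudo label from the weak view

    feat_s = model.encode(strong)
    loss = F.cross_entropy(model.decode(feat_s), pseudo)         # image-level strong-to-weak consistency

    chosen = gate_scores > thr                                   # gate: which samples get feature perturbation
    if chosen.any():
        perturbed = F.dropout(feat_s[chosen], p=0.5)             # simple feature-level perturbation
        loss = loss + F.cross_entropy(model.decode(perturbed), pseudo[chosen])
    return loss

model = TinyCDNet()
weak, strong = torch.randn(4, 6, 64, 64), torch.randn(4, 6, 64, 64)
print(gtpc_style_unsup_loss(model, weak, strong, gate_scores=torch.rand(4)))
```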
zh
[CV-125] Comprehensive Performance Evaluation of YOLOv11 YOLOv10 YOLOv9 YOLOv8 and YOLOv5 on Object Detection of Power Equipment
【速读】: 该论文旨在解决电力设备故障检测的准确性和可靠性问题,通过评估和比较不同版本的YOLO模型(YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11)在电力设备目标检测中的性能。解决方案的关键在于采用YOLOv11模型,其在公共数据集上的平均精度均值(mAP)达到57.2%,显著高于其他模型,并且在召回率和减少误检方面表现出色。YOLOv11模型被证明是电力设备目标检测中一种可靠且有效的解决方案,有助于提升电力系统的运行可靠性。
链接: https://arxiv.org/abs/2411.18871
作者: Zijian He,Kang Wang,Tian Fang,Lei Su,Rui Chen,Xihong Fei
关键词-EN: global industrial production, power equipment, power equipment object, equipment object detection, power
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the rapid development of global industrial production, the demand for reliability in power equipment has been continuously increasing. Ensuring the stability of power system operations requires accurate methods to detect potential faults in power equipment, thereby guaranteeing the normal supply of electrical energy. In this article, the performance of YOLOv5, YOLOv8, YOLOv9, YOLOv10, and the state-of-the-art YOLOv11 methods was comprehensively evaluated for power equipment object detection. Experimental results demonstrate that the mean average precision (mAP) on a public dataset for power equipment was 54.4%, 55.5%, 43.8%, 48.0%, and 57.2%, respectively, with the YOLOv11 achieving the highest detection performance. Moreover, the YOLOv11 outperformed other methods in terms of recall rate and exhibited superior performance in reducing false detections. In conclusion, the findings indicate that the YOLOv11 model provides a reliable and effective solution for power equipment object detection, representing a promising approach to enhancing the operational reliability of power systems.
zh
[CV-126] RIGI: Rectifying Image-to-3D Generation Inconsistency via Uncertainty-aware Learning
【速读】: 该论文试图解决单张图像到3D生成过程中,由于多视角图像或视频引入的不一致性导致的噪声和伪影问题。解决方案的关键在于利用3D高斯喷射 (3D Gaussian Splatting, 3DGS) 进行3D重建,并明确地将不确定性学习融入重建过程。通过捕捉两个高斯模型之间的随机性,估计不确定性图,并用于不确定性感知的正则化,以纠正不一致性的影响。具体方法包括同时优化两个高斯模型,通过评估相同视角下渲染图像的差异来计算不确定性图,并基于此图应用自适应像素级损失加权来正则化模型,减少高不确定性区域的重建强度,从而动态检测并缓解多视角标签中的冲突,实现更平滑的结果并有效减少伪影。
链接: https://arxiv.org/abs/2411.18866
作者: Jiacheng Wang,Zhedong Zheng,Wei Xu,Ping Liu
关键词-EN: aims to reconstruct, geometric shape, single image, generation aims, Gaussian models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Given a single image of a target object, image-to-3D generation aims to reconstruct its texture and geometric shape. Recent methods often utilize intermediate media, such as multi-view images or videos, to bridge the gap between input image and the 3D target, thereby guiding the generation of both shape and texture. However, inconsistencies in the generated multi-view snapshots frequently introduce noise and artifacts along object boundaries, undermining the 3D reconstruction process. To address this challenge, we leverage 3D Gaussian Splatting (3DGS) for 3D reconstruction, and explicitly integrate uncertainty-aware learning into the reconstruction process. By capturing the stochasticity between two Gaussian models, we estimate an uncertainty map, which is subsequently used for uncertainty-aware regularization to rectify the impact of inconsistencies. Specifically, we optimize both Gaussian models simultaneously, calculating the uncertainty map by evaluating the discrepancies between rendered images from identical viewpoints. Based on the uncertainty map, we apply adaptive pixel-wise loss weighting to regularize the models, reducing reconstruction intensity in high-uncertainty regions. This approach dynamically detects and mitigates conflicts in multi-view labels, leading to smoother results and effectively reducing artifacts. Extensive experiments show the effectiveness of our method in improving 3D generation quality by reducing inconsistencies and artifacts.
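下面以一个简短的 PyTorch 函数示意“用两个模型同视角渲染结果的差异作为不确定性图,并据此对逐像素重建损失做自适应加权”的思路;指数形式的权重与系数 beta 为示意性选择,并非论文的具体公式。

```python
import torch

def uncertainty_weighted_l1(render_a: torch.Tensor,
                            render_b: torch.Tensor,
                            target: torch.Tensor,
                            beta: float = 5.0) -> torch.Tensor:
    """Sketch of uncertainty-aware regularization: the discrepancy between two
    renders of the same view acts as a pixel-wise uncertainty map, and pixels
    with high uncertainty are down-weighted in the reconstruction loss."""
    uncertainty = (render_a - render_b).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    weight = torch.exp(-beta * uncertainty)                              # small weight where renders disagree
    recon = 0.5 * (render_a + render_b)
    return (weight * (recon - target).abs()).mean()

a = torch.rand(1, 3, 64, 64, requires_grad=True)
b = torch.rand(1, 3, 64, 64, requires_grad=True)
gt = torch.rand(1, 3, 64, 64)  # multi-view supervision image (possibly inconsistent)
print(uncertainty_weighted_l1(a, b, gt))
```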
zh
[CV-127] Improving Batch Normalization with TTA for Robust Object Detection in Self-Driving
【速读】: 该论文试图解决在开放真实世界自动驾驶场景中,由于传感器故障和极端天气条件导致的领域偏移问题,从而影响自动驾驶感知模型在未见领域中的泛化能力。解决方案的关键在于提出了两种新的方法来改进基于测试时适应(Test-Time Adaptation, TTA)的批量归一化(Batch Normalization, BN)技术:(1) 引入基于广义搜索熵最小化(Generalized-search Entropy Minimization, GSEM)方法的可学习BN层(LearnableBN),通过辅助可学习参数动态更新输入数据的统计信息;(2) 提出基于语义一致性的双阶段适应策略,通过迭代搜索最优解并消除适应过程中的不稳定样本,从而提高模型在复杂环境下的鲁棒性和性能。实验结果表明,该方法在NuScenes-C数据集上使用BEVFormer作为基线模型时,在六种损坏类型和三种严重程度级别上实现了约8%的最大改进。
链接: https://arxiv.org/abs/2411.18860
作者: Dacheng Liao,Mengshi Qi,Liang Liu,Huadong Ma
关键词-EN: unseen domain due, autonomous driving perception, current open real-world, extreme weather conditions, weather conditions hinder
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In current open real-world autonomous driving scenarios, challenges such as sensor failure and extreme weather conditions hinder the generalization of most autonomous driving perception models to these unseen domain due to the domain shifts between the test and training data. As the parameter scale of autonomous driving perception models grows, traditional test-time adaptation (TTA) methods become unstable and often degrade model performance in most scenarios. To address these challenges, this paper proposes two new robust methods to improve the Batch Normalization with TTA for object detection in autonomous driving: (1) We introduce a LearnableBN layer based on Generalized-search Entropy Minimization (GSEM) method. Specifically, we modify the traditional BN layer by incorporating auxiliary learnable parameters, which enables the BN layer to dynamically update the statistics according to the different input data. (2) We propose a new semantic-consistency based dual-stage-adaptation strategy, which encourages the model to iteratively search for the optimal solution and eliminates unstable samples during the adaptation process. Extensive experiments on the NuScenes-C dataset shows that our method achieves a maximum improvement of about 8% using BEVFormer as the baseline model across six corruption types and three levels of severity. We will make our source code available soon.
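下面的 PyTorch 草图示意“在 BN 层中引入可学习参数、在测试时混合运行统计量与当前批次统计量,并以预测熵作为无监督适应目标”的大致做法;单个标量混合系数,以及用普通熵最小化代替论文中的 GSEM,均为笔者的简化假设。

```python
import torch
import torch.nn as nn

class LearnableMixBN(nn.Module):
    """Sketch of a BN layer whose statistics are adapted at test time via an
    auxiliary learnable coefficient mixing the stored running statistics with
    the statistics of the current input batch (illustrative parameterization)."""
    def __init__(self, bn: nn.BatchNorm2d):
        super().__init__()
        self.bn = bn
        self.alpha = nn.Parameter(torch.tensor(0.0))  # sigmoid(alpha) = 0.5 at init

    def forward(self, x):
        a = torch.sigmoid(self.alpha)
        mean_b = x.mean(dim=(0, 2, 3))
        var_b = x.var(dim=(0, 2, 3), unbiased=False)
        mean = a * mean_b + (1 - a) * self.bn.running_mean
        var = a * var_b + (1 - a) * self.bn.running_var
        x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.bn.eps)
        return x_hat * self.bn.weight[None, :, None, None] + self.bn.bias[None, :, None, None]

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the predictions, used here as the unsupervised TTA objective."""
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()

layer = LearnableMixBN(nn.BatchNorm2d(8))
x = torch.randn(4, 8, 16, 16)
print(layer(x).shape, prediction_entropy(torch.randn(4, 10)))
```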
zh
[CV-128] COMPrompter: reconceptualized segment anything model with multiprompt network for camouflaged object detection
【速读】: 该论文试图解决的是伪装物体检测 (Camouflaged Object Detection, COD) 问题,特别是在利用现有的分割任何模型 (Segment Anything Model, SAM) 进行伪装物体检测时,如何进一步提升其检测精度和泛化能力。解决方案的关键在于提出了一种名为 COMPrompter 的多提示网络,通过引入边缘梯度提取模块生成包含梯度信息的掩码作为新的边界提示,设计了边界与框提示的相互引导模块,以及利用离散小波变换提取图像嵌入中的高频特征,从而增强了模型对伪装物体的检测能力。实验结果表明,COMPrompter 在 COD 基准测试中表现优异,平均正向指标提升了 2.2%,在特定应用如息肉分割中也优于现有顶级方法。
链接: https://arxiv.org/abs/2411.18858
作者: Xiaoqin Zhang,Zhenni Yu,Li Zhao,Deng-Ping Fan,Guobao Xiao
关键词-EN: camouflaged object detection, rethink the segment, multiprompt network called, SAM, COD
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SCIENCE CHINA Information Sciences 2024
点击查看摘要
Abstract:We rethink the segment anything model (SAM) and propose a novel multiprompt network called COMPrompter for camouflaged object detection (COD). SAM has zero-shot generalization ability beyond other models and can provide an ideal framework for COD. Our network aims to enhance the single prompt strategy in SAM to a multiprompt strategy. To achieve this, we propose an edge gradient extraction module, which generates a mask containing gradient information regarding the boundaries of camouflaged objects. This gradient mask is then used as a novel boundary prompt, enhancing the segmentation process. Thereafter, we design a box-boundary mutual guidance module, which fosters more precise and comprehensive feature extraction via mutual guidance between a boundary prompt and a box prompt. This collaboration enhances the model’s ability to accurately detect camouflaged objects. Moreover, we employ the discrete wavelet transform to extract high-frequency features from image embeddings. The high-frequency features serve as a supplementary component to the multiprompt system. Finally, our COMPrompter guides the network to achieve enhanced segmentation results, thereby advancing the development of SAM in terms of COD. Experimental results across COD benchmarks demonstrate that COMPrompter achieves a cutting-edge performance, surpassing the current leading model by an average positive metric of 2.2% in COD10K. In the specific application of COD, the experimental results in polyp segmentation show that our model is superior to top-tier methods as well. The code will be made available at this https URL.
zh
[CV-129] Improving Accuracy and Generalization for Efficient Visual Tracking WACV2025
【速读】: 该论文试图解决高效视觉追踪器在训练分布外的序列(out-of-distribution, OOD)上表现不佳的问题,这是由于这些追踪器过度拟合于其训练分布,缺乏泛化能力。解决方案的关键在于引入了一种名为SiamABC的高效孪生追踪器(Siamese tracker),它通过新的架构设计和训练损失函数来提升追踪性能,特别是在OOD序列上。此外,SiamABC还采用了快速的无反向传播动态测试时适应方法(backward-free dynamic test-time adaptation method),以根据目标的动态视觉变化持续调整模型,从而直接解决OOD追踪的泛化问题。实验结果表明,SiamABC在OOD数据集上表现出显著的性能提升,同时在训练分布内的基准测试中也保持了高准确性。
链接: https://arxiv.org/abs/2411.18855
作者: Ram Zaveri,Shivang Patel,Yu Gu,Gianfranco Doretto
关键词-EN: efficient Siamese tracker, lack generalization abilities, highly efficient Siamese, visual trackers overfit, Efficient visual trackers
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: WACV 2025
点击查看摘要
Abstract:Efficient visual trackers overfit to their training distributions and lack generalization abilities, resulting in them performing well on their respective in-distribution (ID) test sets and not as well on out-of-distribution (OOD) sequences, imposing limitations to their deployment in-the-wild under constrained resources. We introduce SiamABC, a highly efficient Siamese tracker that significantly improves tracking performance, even on OOD sequences. SiamABC takes advantage of new architectural designs in the way it bridges the dynamic variability of the target, and of new losses for training. Also, it directly addresses OOD tracking generalization by including a fast backward-free dynamic test-time adaptation method that continuously adapts the model according to the dynamic visual changes of the target. Our extensive experiments suggest that SiamABC shows remarkable performance gains in OOD sets while maintaining accurate performance on the ID benchmarks. SiamABC outperforms MixFormerV2-S by 7.6% on the OOD AVisT benchmark while being 3x faster (100 FPS) on a CPU.
zh
[CV-130] CrossTracker: Robust Multi-modal 3D Multi-Object Tracking via Cross Correction
【速读】: 该论文试图解决在3D多目标跟踪(MOT)中,单一传感器(如LiDAR)检测失败的问题,并提出了一种新的解决方案。解决方案的关键在于引入了一个两阶段的方法,称为CrossTracker。该方法通过粗到细的方式,首先生成粗略的轨迹,然后通过独立的细化过程进行改进。具体来说,CrossTracker包括三个核心模块:1) 多模态建模(M^3)模块,通过融合多模态信息(图像、点云和平面几何)提供强大的轨迹生成基础;2) 粗轨迹生成(C-TG)模块,生成初始的粗略双流轨迹;3) 轨迹细化(TR)模块,通过相机和LiDAR流之间的交叉校正来细化粗略轨迹。这种方法有效地利用了相机和LiDAR传感器的协同优势,显著提升了多模态3D MOT的性能。
链接: https://arxiv.org/abs/2411.18850
作者: Lipeng Gu,Xuefeng Yan,Weiming Wang,Honghua Chen,Dingkun Zhu,Liangliang Nan,Mingqiang Wei
关键词-EN: LiDAR-based detections offers, mitigate tracking failures, tracking failures, offers a promising, promising solution
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The fusion of camera- and LiDAR-based detections offers a promising solution to mitigate tracking failures in 3D multi-object tracking (MOT). However, existing methods predominantly exploit camera detections to correct tracking failures caused by potential LiDAR detection problems, neglecting the reciprocal benefit of refining camera detections using LiDAR data. This limitation is rooted in their single-stage architecture, akin to single-stage object detectors, lacking a dedicated trajectory refinement module to fully exploit the complementary multi-modal information. To this end, we introduce CrossTracker, a novel two-stage paradigm for online multi-modal 3D MOT. CrossTracker operates in a coarse-to-fine manner, initially generating coarse trajectories and subsequently refining them through an independent refinement process. Specifically, CrossTracker incorporates three essential modules: i) a multi-modal modeling (M^3) module that, by fusing multi-modal information (images, point clouds, and even plane geometry extracted from images), provides a robust metric for subsequent trajectory generation. ii) a coarse trajectory generation (C-TG) module that generates initial coarse dual-stream trajectories, and iii) a trajectory refinement (TR) module that refines coarse trajectories through cross correction between camera and LiDAR streams. Comprehensive experiments demonstrate the superior performance of our CrossTracker over its eighteen competitors, underscoring its effectiveness in harnessing the synergistic benefits of camera and LiDAR sensors for robust multi-modal 3D MOT.
zh
[CV-131] FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
【速读】: 该论文试图解决图像超分辨率 (Image Super-Resolution, SR) 中恢复图像的真实性和结构一致性问题。解决方案的关键在于提出了一种名为 FaithDiff 的方法,该方法充分利用了潜在扩散模型 (Latent Diffusion Models, LDMs) 的强大能力。与现有冻结预训练扩散模型的方法不同,FaithDiff 释放了扩散先验以识别有用信息并恢复忠实结构。此外,论文还开发了一个有效的对齐模块,用于探索降质输入中的有用特征,并将其与扩散过程对齐。最后,通过在统一的优化框架中联合微调编码器和扩散模型,确保编码器提取的特征与扩散过程相一致,从而显著提升了超分辨率结果的质量和忠实度。
链接: https://arxiv.org/abs/2411.18824
作者: Junyang Chen,Jinshan Pan,Jiangxin Dong
关键词-EN: image generation tasks, restored images maintain, images maintain fidelity, Faithful image super-resolution, generation tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latent from the diffusion model, we then develop an effective alignment module to explore useful features from degraded inputs to align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, facilitating the encoder to extract useful features that coincide with diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.
zh
[CV-132] Multi-Task Label Discovery via Hierarchical Task Tokens for Partially Annotated Dense Predictions
【速读】: 该论文试图解决在部分标注的多任务密集预测学习中缺乏直接像素级监督的问题。解决方案的关键在于提出了一种新的方法,通过优化一组可学习的层次化任务标记(包括全局和细粒度的任务标记),来在特征和预测层面上发现一致的像素级监督信号。具体来说,全局任务标记用于在全局上下文中进行有效的跨任务特征交互,而一组细粒度的任务特定空间标记则从相应的全局任务标记中学习,并与每个任务特定的特征图进行密集交互。这些学习到的全局和局部细粒度任务标记进一步用于在不同粒度级别上发现伪任务特定的密集标签,并可直接用于监督多任务密集预测框架的学习。
链接: https://arxiv.org/abs/2411.18823
作者: Jingdong Zhang,Hanrong Ye,Xin Li,Wenping Wang,Dan Xu
关键词-EN: important research area, recent years, research area, data has emerged, important research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, simultaneous learning of multiple dense prediction tasks with partially annotated label data has emerged as an important research area. Previous works primarily focus on constructing cross-task consistency or conducting adversarial training to regularize cross-task predictions, which achieve promising performance improvements, while still suffering from the lack of direct pixel-wise supervision for multi-task dense predictions. To tackle this challenge, we propose a novel approach to optimize a set of learnable hierarchical task tokens, including global and fine-grained ones, to discover consistent pixel-wise supervision signals in both feature and prediction levels. Specifically, the global task tokens are designed for effective cross-task feature interactions in a global context. Then, a group of fine-grained task-specific spatial tokens for each task is learned from the corresponding global task tokens. It is embedded to have dense interactions with each task-specific feature map. The learned global and local fine-grained task tokens are further used to discover pseudo task-specific dense labels at different levels of granularity, and they can be utilized to directly supervise the learning of the multi-task dense prediction framework. Extensive experimental results on challenging NYUD-v2, Cityscapes, and PASCAL Context datasets demonstrate significant improvements over existing state-of-the-art methods for partially annotated multi-task dense prediction.
zh
[CV-133] Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds
【速读】: 该论文试图解决文本到图像扩散模型在处理组合性提示(如“两只狗”或“一只企鹅在碗的右边”)时产生不一致结果的问题。解决方案的关键在于识别和利用初始噪声模式对组合性提示可靠性的影响。研究发现,不同的初始随机种子会导致模型在图像中放置对象的位置不同,这些位置可能与特定的相机角度和图像构图模式相关。为此,论文提出了一种方法,通过挖掘这些可靠的噪声模式,生成一个无需手动标注的精选训练集,并通过微调文本到图像模型来显著提升其组合能力。实验结果显示,Stable Diffusion 和 PixArt-α 在数值组合上分别获得 29.3% 和 19.5% 的相对提升,在空间组合上分别获得 60.7% 和 21.1% 的相对提升。
链接: https://arxiv.org/abs/2411.18810
作者: Shuangqi Li,Hieu Le,Jingyi Xu,Mathieu Salzmann
关键词-EN: demonstrated remarkable capability, arbitrary text prompts, generating realistic images, demonstrated remarkable, remarkable capability
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as “two dogs” or “a penguin on the right of a bowl”. Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model’s compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-α, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-α.
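下面的 Python 草图示意“遍历随机种子、用检测器统计生成图中的目标数量、只保留与提示语一致的生成结果”这一挖掘流程;其中 generate_fn 与 count_objects_fn 均为假设的占位接口,分别对应某个文生图管线和某个现成的目标检测/计数器,并非论文或任何库的真实 API。

```python
def mine_reliable_images(prompt, target_count, generate_fn, count_objects_fn, seeds=range(100)):
    """Sketch of seed mining: render the same compositional prompt under many
    seeds and keep only generations whose object count matches the prompt.
    `generate_fn(prompt, seed) -> image` and `count_objects_fn(image) -> int`
    are hypothetical hooks standing in for a text-to-image pipeline and a detector."""
    curated = []
    for seed in seeds:
        image = generate_fn(prompt, seed)
        if count_objects_fn(image) == target_count:  # seed produced the intended composition
            curated.append((seed, image))
    return curated

if __name__ == "__main__":
    # Placeholder hooks so the sketch runs; swap in a real pipeline and detector in practice.
    fake_generate = lambda prompt, seed: seed        # the "image" stands in for a generation
    fake_count = lambda image: image % 3             # pretend the detector found (seed % 3) objects
    kept = mine_reliable_images("two dogs", 2, fake_generate, fake_count)
    print(len(kept), "reliable generations out of 100 seeds")
```

挖掘得到的 (提示语, 图像) 对随后即可作为无需人工标注的训练集,用于微调文生图模型自身。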
zh
[CV-134] Lifting Motion to the 3D World via 2D Diffusion
【速读】: 该论文试图解决从2D观测中估计3D运动的问题,特别是在缺乏3D地面真值数据的情况下。解决方案的关键在于引入了一种名为MVLift的新方法,该方法通过多阶段框架利用2D运动扩散模型逐步生成多视角下一致的2D姿态序列,从而恢复准确的全局3D运动。MVLift不仅不需要3D监督,还能在包括人体姿态、人-物交互和动物姿态在内的多个领域中实现泛化,并在多个数据集上超越了需要3D监督的现有方法。
链接: https://arxiv.org/abs/2411.18808
作者: Jiaman Li,C. Karen Liu,Jiajun Wu
关键词-EN: long-standing research challenge, research challenge, long-standing research, Estimating, motion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion – including both joint rotations and root trajectories in the world coordinate system – using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.
zh
[CV-135] GloFinder: AI-empowered QuPath Plugin for WSI-level Glomerular Detection Visualization and Curation
【速读】: 该论文试图解决现有开源工具在肾病理学中自动化检测肾小球(glomeruli)时存在的两个主要问题:一是这些工具通常以源代码或Docker容器形式分发,需要高级编程技能,限制了非程序员(如临床医生)的使用;二是当前模型通常仅在一个数据集上训练,缺乏调整预测置信度的灵活性。解决方案的关键是引入GloFinder,一个QuPath插件,通过图形用户界面(GUI)实现单击自动检测整个全片图像(WSIs)中的肾小球,并支持在线编辑。GloFinder采用CircleNet,一个基于圆形表示的无锚检测框架,用于精确对象定位,并结合Weighted Circle Fusion (WCF),一种集成方法,通过融合多个CircleNet模型的置信度得分来提高预测精度。该插件不仅提高了检测性能,还通过QuPath的直接可视化和编辑功能,增强了临床医生和研究人员的使用体验。
链接: https://arxiv.org/abs/2411.18795
作者: Jialin Yue,Tianyuan Yao,Ruining Deng,Siqi Lu,Junlin Guo,Quan Liu,Mengmeng Yin,Juming Xiong,Haichun Yang,Yuankai Huo
关键词-EN: demonstrated significant success, key functional units, Artificial intelligence, kidney pathology, slide images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Artificial intelligence (AI) has demonstrated significant success in automating the detection of glomeruli, the key functional units of the kidney, from whole slide images (WSIs) in kidney pathology. However, existing open-source tools are often distributed as source code or Docker containers, requiring advanced programming skills that hinder accessibility for non-programmers, such as clinicians. Additionally, current models are typically trained on a single dataset and lack flexibility in adjusting confidence levels for predictions. To overcome these challenges, we introduce GloFinder, a QuPath plugin designed for single-click automated glomeruli detection across entire WSIs with online editing through the graphical user interface (GUI). GloFinder employs CircleNet, an anchor-free detection framework utilizing circle representations for precise object localization, with models trained on approximately 160,000 manually annotated glomeruli. To further enhance accuracy, the plugin incorporates Weighted Circle Fusion (WCF), an ensemble method that combines confidence scores from multiple CircleNet models to produce refined predictions, achieving superior performance in glomerular detection. GloFinder enables direct visualization and editing of results in QuPath, facilitating seamless interaction for clinicians and providing a powerful tool for nephropathology research and clinical practice.
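下面给出“按置信度加权融合多个 CircleNet 模型圆形检测结果”这一思路的 NumPy 草图;其中按圆心距离做贪心聚类、以置信度归一化权重取加权平均等,都是为说明而做的简化假设,并非 Weighted Circle Fusion 的精确算法。

```python
import numpy as np

def weighted_circle_fusion(detections, match_dist=20.0):
    """Sketch of confidence-weighted fusion of circle detections from several
    models. Each element of `detections` is an (N, 4) array of [cx, cy, r, score]."""
    clusters = []  # each cluster: list of [cx, cy, r, score] rows believed to be the same object
    for dets in detections:
        for d in dets:
            for cluster in clusters:
                ref = np.mean(cluster, axis=0)
                if np.hypot(d[0] - ref[0], d[1] - ref[1]) < match_dist:
                    cluster.append(d)
                    break
            else:
                clusters.append([d])

    fused = []
    for cluster in clusters:
        c = np.asarray(cluster)
        w = c[:, 3] / c[:, 3].sum()                  # confidence weights
        cx, cy, r = (w[:, None] * c[:, :3]).sum(axis=0)
        fused.append([cx, cy, r, c[:, 3].mean()])    # fused circle plus averaged confidence
    return np.asarray(fused)

model_a = np.array([[100.0, 100.0, 12.0, 0.9], [300.0, 220.0, 10.0, 0.6]])
model_b = np.array([[104.0, 98.0, 14.0, 0.8]])
print(weighted_circle_fusion([model_a, model_b]))
```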
zh
[CV-136] MRI Breast tissue segmentation using nnU-Net for biomechanical modeling MICCAI2024
【速读】: 该论文试图解决二维乳腺X线摄影与三维磁共振成像(MRI)在乳腺癌诊断和治疗规划中整合的挑战,关键在于克服不同成像模态间的差异以及精确的组织分割和对齐需求。解决方案的核心在于两个方面:一是通过使用nnU-Net分割模型提高组织识别的准确性,二是评估有限元(FE)生物力学求解器,特别是NiftySim和FEBio的性能。论文通过详细的多类分割(六类)乳腺MRI数据,实现了高精度的组织分割,为三维重建和生物力学建模提供了坚实基础。随后,利用分割数据生成的三维网格和生物力学模型,模拟了乳腺组织在压缩下的物理行为,并通过对比NiftySim和FEBio的模拟结果,评估了这些模拟在研究乳腺组织响应中的准确性和可靠性。这些研究成果有望提升二维和三维成像模态的整合效果,从而提高乳腺癌的诊断精度和治疗规划。
链接: https://arxiv.org/abs/2411.18784
作者: Melika Pooyan,Hadeel Awwad,Eloy García,Robert Martí
关键词-EN: magnetic resonance imaging, magnetic resonance, breast cancer diagnosis, resonance imaging, achieving Dice Coefficients
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)
备注: Deep Breath @ MICCAI 2024
点击查看摘要
Abstract:Integrating 2D mammography with 3D magnetic resonance imaging (MRI) is crucial for improving breast cancer diagnosis and treatment planning. However, this integration is challenging due to differences in imaging modalities and the need for precise tissue segmentation and alignment. This paper addresses these challenges by enhancing biomechanical breast models in two main aspects: improving tissue identification using nnU-Net segmentation models and evaluating finite element (FE) biomechanical solvers, specifically comparing NiftySim and FEBio. We performed a detailed six-class segmentation of breast MRI data using the nnU-Net architecture, achieving Dice Coefficients of 0.94 for fat, 0.88 for glandular tissue, and 0.87 for pectoral muscle. The overall foreground segmentation reached a mean Dice Coefficient of 0.83 through an ensemble of 2D and 3D U-Net configurations, providing a solid foundation for 3D reconstruction and biomechanical modeling. The segmented data was then used to generate detailed 3D meshes and develop biomechanical models using NiftySim and FEBio, which simulate breast tissue’s physical behaviors under compression. Our results include a comparison between NiftySim and FEBio, providing insights into the accuracy and reliability of these simulations in studying breast tissue responses under compression. The findings of this study have the potential to improve the integration of 2D and 3D imaging modalities, thereby enhancing diagnostic accuracy and treatment planning for breast cancer.
zh
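上文以 Dice 系数衡量各组织的分割质量,其定义为 Dice = 2|A∩B| / (|A| + |B|)。下面是该指标的一个最小化 numpy 实现示意(与论文代码无关,掩码数据为随手构造的示例):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """pred/target: 同尺寸的二值掩码(0/1),返回 Dice 系数。"""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# 假想的某一组织的预测掩码与标注掩码
pred = np.zeros((128, 128), dtype=np.uint8); pred[30:90, 30:90] = 1
gt   = np.zeros((128, 128), dtype=np.uint8); gt[35:95, 35:95] = 1
print(f"Dice = {dice_coefficient(pred, gt):.3f}")
```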
[CV-137] Fall Leaf Adversarial Attack on Traffic Sign Classification
【速读】: 该论文试图解决在自动驾驶系统中,利用自然物体(如树叶)进行对抗性图像扰动攻击的问题。解决方案的关键在于利用自然界中的树叶作为扰动源,通过模拟树叶附着在交通标志上的情况,使神经网络对交通标志进行错误分类。这种方法具有较高的隐蔽性,因为树叶的自然附着行为难以被识别为恶意攻击。论文通过分析不同种类树叶的大小、颜色和旋转角度等参数,评估了这种新型对抗性攻击的成功率,并探讨了这些攻击如何影响图像分类算法中的边缘检测过程。
链接: https://arxiv.org/abs/2411.18776
作者: Anthony Etim,Jakub Szefer
关键词-EN: Adversarial input image, input image perturbation, machine learning algorithms, image classification setting, image perturbation attacks
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Adversarial input image perturbation attacks have emerged as a significant threat to machine learning algorithms, particularly in image classification setting. These attacks involve subtle perturbations to input images that cause neural networks to misclassify the input images, even though the images remain easily recognizable to humans. One critical area where adversarial attacks have been demonstrated is in automotive systems where traffic sign classification and recognition is critical, and where misclassified images can cause autonomous systems to take wrong actions. This work presents a new class of adversarial attacks. Unlike existing work that has focused on adversarial perturbations that leverage human-made artifacts to cause the perturbations, such as adding stickers, paint, or shining flashlights at traffic signs, this work leverages nature-made artifacts: tree leaves. By leveraging nature-made artifacts, the new class of attacks has plausible deniability: a fall leaf stuck to a street sign could come from a near-by tree, rather than be placed there by an malicious human attacker. To evaluate the new class of the adversarial input image perturbation attacks, this work analyses how fall leaves can cause misclassification in street signs. The work evaluates various leaves from different species of trees, and considers various parameters such as size, color due to tree leaf type, and rotation. The work demonstrates high success rate for misclassification. The work also explores the correlation between successful attacks and how they affect the edge detection, which is critical in many image classification algorithms.
zh
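下面用 PIL 给出这类"树叶贴附"攻击评估流程的一个示意(非论文官方实现):把带透明通道的树叶贴图按给定尺寸、角度和位置合成到标志图像上,再在参数网格中搜索能让分类器出错的摆放方式;其中 classifier 等接口均为示例假设。

```python
import numpy as np
from PIL import Image

def paste_leaf(sign_img, leaf_img, scale=0.3, angle=45, pos=(0.5, 0.4)):
    """把树叶贴到交通标志图像上:缩放、旋转后按相对位置粘贴(用 alpha 通道作掩码)。"""
    sign = sign_img.convert("RGB").copy()
    w, h = sign.size
    leaf = leaf_img.convert("RGBA").resize((int(w * scale), int(w * scale)))
    leaf = leaf.rotate(angle, expand=True)
    x = int(pos[0] * w - leaf.size[0] / 2)
    y = int(pos[1] * h - leaf.size[1] / 2)
    sign.paste(leaf, (x, y), mask=leaf)
    return sign

def attack_success(classifier, sign_img, leaf_img, true_label, grid):
    """在尺寸/角度/位置网格上搜索能使分类器出错的树叶摆放(示意)。"""
    for scale, angle, pos in grid:
        adv = paste_leaf(sign_img, leaf_img, scale, angle, pos)
        if classifier(adv) != true_label:
            return True, (scale, angle, pos)
    return False, None

# 占位图像演示接口(真实实验中应换成标志照片与树叶贴图)
sign = Image.fromarray(np.full((224, 224, 3), 200, dtype=np.uint8))
leaf = Image.fromarray(np.zeros((64, 64, 4), dtype=np.uint8))
print(paste_leaf(sign, leaf).size)
```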
[CV-138] CoVis: A Collaborative Framework for Fine-grained Graphic Visual Understanding
【速读】: 该论文试图解决视觉内容解释依赖于人类个人知识背景的问题,从而影响信息获取和理解的质量与效率。解决方案的关键在于提出了一个名为CoVis的协作框架,该框架通过设计并实现一个级联的双层分割网络与基于大语言模型(LLM)的内容生成器相结合,从图像中尽可能多地提取知识,并生成图像的视觉分析,帮助观察者从更全面的角度理解图像内容。实验结果表明,CoVis在特征提取方面优于现有方法,并能生成比当前通用大模型更全面和详细的视觉描述。
链接: https://arxiv.org/abs/2411.18764
作者: Xiaoyu Deng,Zhengjian Kang,Xintao Li,Yongzhe Zhang,Tianmin Guo
关键词-EN: Graphic visual content, promoting information communication, Graphic visual, inspiration divergence, communication and inspiration
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Graphic visual content helps in promoting information communication and inspiration divergence. However, the interpretation of visual content currently relies mainly on humans’ personal knowledge background, thereby affecting the quality and efficiency of information acquisition and understanding. To improve the quality and efficiency of visual information transmission and avoid the limitation of the observer due to the information cocoon, we propose CoVis, a collaborative framework for fine-grained visual understanding. By designing and implementing a cascaded dual-layer segmentation network coupled with a large-language-model (LLM) based content generator, the framework extracts as much knowledge as possible from an image. Then, it generates visual analytics for images, assisting observers in comprehending imagery from a more holistic perspective. Quantitative experiments and qualitative experiments based on 32 human participants indicate that the CoVis has better performance than current methods in feature extraction and can generate more comprehensive and detailed visual descriptions than current general-purpose large models.
zh
[CV-139] DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration
【速读】: 该论文试图解决视频修复中的一个挑战,即在动态、真实世界场景中重建被遮挡区域的问题。解决方案的关键在于提出了一种基于扩散的视频级修复模型,称为DiffMVR。该模型引入了动态双引导图像提示系统,利用自适应参考帧来指导修复过程,从而能够捕捉细粒度细节和视频帧之间的平滑过渡,提供对修复方向的精确控制,并显著提高在复杂动态环境中的修复准确性。
链接: https://arxiv.org/abs/2411.18745
作者: Zheyan Zhang,Diego Klabjan,Renee CB Manworren
关键词-EN: reconstructing occluded regions, real-world scenarios, reconstructing occluded, address a challenge, occluded regions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this work, we address a challenge in video inpainting: reconstructing occluded regions in dynamic, real-world scenarios. Motivated by the need for continuous human motion monitoring in healthcare settings, where facial features are frequently obscured, we propose a diffusion-based video-level inpainting model, DiffMVR. Our approach introduces a dynamic dual-guided image prompting system, leveraging adaptive reference frames to guide the inpainting process. This enables the model to capture both fine-grained details and smooth transitions between video frames, offering precise control over inpainting direction and significantly improving restoration accuracy in challenging, dynamic environments. DiffMVR represents a significant advancement in the field of diffusion-based inpainting, with practical implications for real-time applications in various dynamic settings.
zh
[CV-140] The Last Mile to Supervised Performance: Semi-Supervised Domain Adaptation for Semantic Segmentation
【速读】: 该论文试图解决在语义分割等密集任务中,由于标注数据获取困难而导致的监督深度学习性能受限的问题。解决方案的关键在于提出了一种半监督域适应 (Semi-Supervised Domain Adaptation, SSDA) 框架,该框架结合了一致性正则化 (consistency regularization)、像素对比学习 (pixel contrastive learning) 和自训练 (self-training) 技术,以有效利用少量目标域标签。通过这种方法,论文在GTA-to-Cityscapes等基准测试中超越了现有技术,并证明仅需50个目标域标签即可接近全监督性能。此外,研究还发现现有的无监督域适应 (Unsupervised Domain Adaptation, UDA) 和半监督学习 (Semi-Supervised Learning, SSL) 方法在SSDA设置中表现不佳,因此讨论了适应这些方法的设计模式。
链接: https://arxiv.org/abs/2411.18728
作者: Daniel Morales-Brotons,Grigorios Chrysos,Stratis Tzoumas,Volkan Cevher
关键词-EN: requires massive labeled, deep learning requires, learning requires massive, Unsupervised Domain Adaptation, massive labeled datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 28 pages, 6 figures
点击查看摘要
Abstract:Supervised deep learning requires massive labeled datasets, but obtaining annotations is not always easy or possible, especially for dense tasks like semantic segmentation. To overcome this issue, numerous works explore Unsupervised Domain Adaptation (UDA), which uses a labeled dataset from another domain (source), or Semi-Supervised Learning (SSL), which trains on a partially labeled set. Despite the success of UDA and SSL, reaching supervised performance at a low annotation cost remains a notoriously elusive goal. To address this, we study the promising setting of Semi-Supervised Domain Adaptation (SSDA). We propose a simple SSDA framework that combines consistency regularization, pixel contrastive learning, and self-training to effectively utilize a few target-domain labels. Our method outperforms prior art in the popular GTA-to-Cityscapes benchmark and shows that as little as 50 target labels can suffice to achieve near-supervised performance. Additional results on Synthia-to-Cityscapes, GTA-to-BDD and Synthia-to-BDD further demonstrate the effectiveness and practical utility of the method. Lastly, we find that existing UDA and SSL methods are not well-suited for the SSDA setting and discuss design patterns to adapt them.
zh
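下面给出该 SSDA 框架中"少量目标域监督 + 弱/强增广一致性自训练"这一组合目标的简化 PyTorch 示意(非官方实现,未包含像素对比学习项,阈值与权重均为示例假设):

```python
import torch
import torch.nn.functional as F

def ssda_loss(model, x_src, y_src, x_tgt_l, y_tgt_l, x_tgt_w, x_tgt_s,
              conf_thr=0.95, lambda_u=1.0):
    """半监督域适应的简化目标:源域/少量目标域监督 + 弱/强增广一致性自训练。
    x_tgt_w / x_tgt_s 为同一批目标域无标注图像的弱增广与强增广版本。"""
    # 1) 监督项:源域 + 少量目标域标签(语义分割时 y 为逐像素类别图)
    loss_sup = F.cross_entropy(model(x_src), y_src) + \
               F.cross_entropy(model(x_tgt_l), y_tgt_l)

    # 2) 一致性/自训练项:弱增广产生伪标签,约束强增广的预测
    with torch.no_grad():
        probs = torch.softmax(model(x_tgt_w), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= conf_thr).float()          # 只保留高置信度位置
    loss_unsup = (F.cross_entropy(model(x_tgt_s), pseudo,
                                  reduction="none") * mask).mean()

    return loss_sup + lambda_u * loss_unsup
```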
[CV-141] Generative Visual Communication in the Era of Vision-Language Models
【速读】: 该论文试图解决生成式视觉语言模型(Vision-Language Models, VLMs)在自动化创建有效视觉传达设计时面临的挑战,特别是如何将复杂信息简化为清晰、抽象的视觉元素,以及像素级输出的局限性问题。解决方案的关键在于约束模型的操作空间,并引入任务特定的正则化(task-specific regularizations),以探索视觉传达的各个方面,包括草图和视觉抽象、排版、动画和视觉灵感。
链接: https://arxiv.org/abs/2411.18727
作者: Yael Vinker
关键词-EN: prehistoric cave paintings, dating back, cave paintings, back to prehistoric, prehistoric cave
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: PhD Thesis
点击查看摘要
Abstract:Visual communication, dating back to prehistoric cave paintings, is the use of visual elements to convey ideas and information. In today’s visually saturated world, effective design demands an understanding of graphic design principles, visual storytelling, human psychology, and the ability to distill complex information into clear visuals. This dissertation explores how recent advancements in vision-language models (VLMs) can be leveraged to automate the creation of effective visual communication designs. Although generative models have made great progress in generating images from text, they still struggle to simplify complex ideas into clear, abstract visuals and are constrained by pixel-based outputs, which lack flexibility for many design tasks. To address these challenges, we constrain the models’ operational space and introduce task-specific regularizations. We explore various aspects of visual communication, namely, sketches and visual abstraction, typography, animation, and visual inspiration.
zh
[CV-142] Random Walks with Tweedie: A Unified Framework for Diffusion Models
【速读】: 该论文试图解决生成式扩散模型(Generative Diffusion Models)的理论复杂性问题,并提供一个简单且自洽的理论基础。解决方案的关键在于将扩散采样解释为一系列随机游走(random walks),并基于此提出了一种新的理论框架,该框架避免了使用马尔可夫链(Markov chains)或反向扩散(reverse diffusion)理论,而是以随机游走和Tweedie公式为核心。这一方法不仅简化了理论基础,还导出了统一的算法模板,用于网络训练和采样,特别是实现了训练和采样过程中噪声计划的分离。此外,该框架还支持条件采样(conditional sampling),无需进行似然估计(likelihood approximation)。
链接: https://arxiv.org/abs/2411.18702
作者: Chicago Y. Park,Michael T. McCann,Cristina Garcia-Cardona,Brendt Wohlberg,Ulugbek S. Kamilov
关键词-EN: designing generative diffusion, diffusion models, generative diffusion model, designing generative, model algorithms based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:We present a simple template for designing generative diffusion model algorithms based on an interpretation of diffusion sampling as a sequence of random walks. Score-based diffusion models are widely used to generate high-quality images. Diffusion models have also been shown to yield state-of-the-art performance in many inverse problems. While these algorithms are often surprisingly simple, the theory behind them is not, and multiple complex theoretical justifications exist in the literature. Here, we provide a simple and largely self-contained theoretical justification for score-based-diffusion models that avoids using the theory of Markov chains or reverse diffusion, instead centering the theory of random walks and Tweedie’s formula. This approach leads to unified algorithmic templates for network training and sampling. In particular, these templates cleanly separate training from sampling, e.g., the noise schedule used during training need not match the one used during sampling. We show that several existing diffusion models correspond to particular choices within this template and demonstrate that other, more straightforward algorithmic choices lead to effective diffusion models. The proposed framework has the added benefit of enabling conditional sampling without any likelihood approximation.
zh
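该文的核心是把扩散采样看作一串随机游走,并用 Tweedie 公式 E[x₀|x] = x + σ² ∇ₓ log p(x) 把去噪器和分数 (score) 联系起来。下面是这一思路的骨架示意(非论文官方实现,步长取法与去噪器接口均为示例假设):

```python
import torch

@torch.no_grad()
def annealed_random_walk(denoiser, shape, sigmas, steps_per_level=5, device="cpu"):
    """把扩散采样视为一串随机游走的骨架实现(示意)。
    denoiser(x, sigma) 假定返回 Tweedie 均值 E[x0 | x],即去噪结果。"""
    x = sigmas[0] * torch.randn(shape, device=device)      # 从最高噪声水平初始化
    for sigma in sigmas:
        step = 0.5 * sigma ** 2                            # 步长随噪声水平缩放(示例取法)
        for _ in range(steps_per_level):
            x0_hat = denoiser(x, sigma)
            score = (x0_hat - x) / sigma ** 2              # Tweedie 公式给出的分数估计
            x = x + step * score + (2 * step) ** 0.5 * torch.randn_like(x)  # 随机游走更新
    return denoiser(x, sigmas[-1])                         # 最后一次去噪作为输出

# 用一个"玩具去噪器"(目标分布为标准高斯时的后验均值)演示调用方式
toy_denoiser = lambda x, sigma: x / (1.0 + sigma ** 2)
sample = annealed_random_walk(toy_denoiser, (1, 3, 32, 32), sigmas=[1.0, 0.5, 0.25, 0.1])
print(sample.shape)
```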
[CV-143] MatchDiffusion: Training-free Generation of Match-cuts ATC
【速读】: 该论文试图解决传统影视剪辑中制作匹配剪辑(match-cuts)的挑战,即如何在不依赖大量资源和艺术规划的情况下,实现场景间的无缝过渡。解决方案的关键在于利用文本到视频扩散模型(text-to-video diffusion models),通过“联合扩散”(Joint Diffusion)和“分离扩散”(Disjoint Diffusion)两种策略来生成匹配剪辑。联合扩散通过共享噪声初始化两个提示的生成,确保场景结构和动作的对齐;分离扩散则允许视频在细节上产生差异,从而生成视觉上连贯的视频,适合用于匹配剪辑。这种方法不仅减少了资源需求,还提高了匹配剪辑制作的效率和普及性。
链接: https://arxiv.org/abs/2411.18677
作者: Alejandro Pardo,Fabio Pizzati,Tong Zhang,Alexander Pondaven,Philip Torr,Juan Camilo Perez,Bernard Ghanem
关键词-EN: delivering strong visual, powerful cinematic tools, create seamless transitions, delivering strong, metaphorical connections
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this https URL
点击查看摘要
Abstract:Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting match-cuts is a challenging, resource-intensive process requiring deliberate artistic planning. In MatchDiffusion, we present the first training-free method for match-cut generation using text-to-video diffusion models. MatchDiffusion leverages a key property of diffusion models: early denoising steps define the scene’s broad structure, while later steps add details. Guided by this insight, MatchDiffusion employs “Joint Diffusion” to initialize generation for two prompts from shared noise, aligning structure and motion. It then applies “Disjoint Diffusion”, allowing the videos to diverge and introduce unique details. This approach produces visually coherent videos suited for match-cuts. User studies and metrics demonstrate MatchDiffusion’s effectiveness and potential to democratize match-cut creation.
zh
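下面按摘要中"联合扩散 + 分离扩散"的描述给出一个无训练匹配剪辑生成循环的示意(非官方实现):前若干去噪步让两个提示共享同一潜变量(此处用两者预测的平均来更新,属于本示例的假设),之后两条轨迹各自独立去噪;denoise_step 接口同为假设。

```python
import torch

@torch.no_grad()
def match_cut_generate(denoise_step, latents_shape, prompts, num_steps=50, joint_steps=20):
    """MatchDiffusion 思路的骨架示意:
    前 joint_steps 步两个提示共享同一潜变量(联合扩散,对齐结构与运动),
    之后各自独立去噪(分离扩散,引入各自细节)。
    denoise_step(z, t, prompt) 假定返回该时间步去噪后的潜变量。"""
    assert len(prompts) == 2
    z_shared = torch.randn(latents_shape)                  # 共享初始噪声
    # 阶段一:联合扩散,用两个提示预测的平均来更新同一份潜变量
    for t in range(num_steps, num_steps - joint_steps, -1):
        z_a = denoise_step(z_shared, t, prompts[0])
        z_b = denoise_step(z_shared, t, prompts[1])
        z_shared = 0.5 * (z_a + z_b)
    # 阶段二:分离扩散,两条轨迹从同一中间状态出发、各自走完剩余步骤
    z_pair = [z_shared.clone(), z_shared.clone()]
    for t in range(num_steps - joint_steps, 0, -1):
        z_pair = [denoise_step(z, t, p) for z, p in zip(z_pair, prompts)]
    return z_pair                                          # 两段结构对齐、细节各异的潜变量

# 用占位去噪函数演示接口
toy_step = lambda z, t, prompt: 0.98 * z
za, zb = match_cut_generate(toy_step, (1, 4, 8, 32, 32), ["prompt A", "prompt B"])
print(za.shape, torch.allclose(za, zb))
```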
[CV-144] GaussianSpeech: Audio-Driven Gaussian Avatars
【速读】: 该论文试图解决从语音音频生成高保真、个性化3D人头动画序列的问题。解决方案的关键在于结合语音信号与3D高斯溅射(3D Gaussian splatting)技术,以捕捉人类头部的表达细节,包括皮肤褶皱和微小面部运动。论文提出了一种基于3DGS的紧凑且高效的虚拟形象表示方法,该方法能够生成与表情相关的颜色,并利用基于皱纹和感知的损失函数来合成包括不同表情下出现的皱纹在内的面部细节。此外,论文设计了一种音频条件下的Transformer模型,能够直接从音频输入中提取唇部和表情特征,从而实现对3D高斯溅射序列的建模。由于缺乏高质量的音视频对应数据集,研究团队还采集了一个新的多视角音视频序列数据集,涵盖了具有不同面部几何特征和英语口音的说话人。
链接: https://arxiv.org/abs/2411.18675
作者: Shivangi Aneja,Artem Sevastopolsky,Tobias Kirschstein,Justus Thies,Angela Dai,Matthias Nießner
关键词-EN: synthesizes high-fidelity animation, high-fidelity animation sequences, human head avatars, high-fidelity animation, synthesizes high-fidelity
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Paper Video: this https URL Project Page: this https URL
点击查看摘要
Abstract:We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple speech signal with 3D Gaussian splatting to create realistic, temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details, including wrinkles that occur with different expressions. To enable sequence modeling of 3D Gaussian splats with audio, we devise an audio-conditioned transformer model capable of extracting lip and expression features directly from audio input. Due to the absence of high-quality datasets of talking humans in correspondence with audio, we captured a new large-scale multi-view dataset of audio-visual sequences of talking humans with native English accents and diverse facial geometry. GaussianSpeech consistently achieves state-of-the-art performance with visually natural motion at real time rendering rates, while encompassing diverse facial expressions and styles.
zh
[CV-145] Active Data Curation Effectively Distills Large-Scale Multimodal Models
【速读】: 该论文试图解决在大规模模型压缩过程中,如何通过知识蒸馏 (Knowledge Distillation, KD) 将大型模型压缩为更小、更高效的模型的挑战。解决方案的关键在于提出了一种简单而有效的在线批次选择方法,称为 ACID (Active Data Curation as effective Distillation),用于对比多模态预训练。ACID 方法不仅在各种模型、数据和计算配置下优于传统的 KD 基线,而且发现这种主动数据筛选策略与标准 KD 是互补的,可以有效结合以训练出高性能且推理效率高的模型。论文进一步提出的 ACED (Active Data Curation for Efficient Distillation) 预训练框架,在27个零样本分类和检索任务中达到了最先进的结果,推理 FLOPs 减少了高达11%。此外,ACED 模型在 LiT-Decoder 设置下训练生成式多模态模型时,其视觉编码器在图像描述和视觉问答任务中表现优于更大的视觉编码器。
链接: https://arxiv.org/abs/2411.18674
作者: Vishaal Udandarao,Nikhil Parthasarathy,Muhammad Ferjad Naeem,Talfan Evans,Samuel Albanie,Federico Tombari,Yongqin Xian,Alessio Tonioni,Olivier J. Hénaff
关键词-EN: Knowledge distillation, compressing large-scale models, compressing large-scale, Knowledge, active data curation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach – active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find such an active data curation strategy to in fact be complementary to standard KD, and can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with upto 11% less inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.
zh
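ACID 的核心是在线批次选择:先在一个较大的候选批次上计算逐样本损失,再挑出"最值得学"的子批次做梯度更新。下面的打分方式(学生损失减参考模型损失)是常见的可学习性准则,仅作示意,并非论文的精确定义:

```python
import torch

def select_learnable_batch(learner_loss, reference_loss, keep_ratio=0.5):
    """在线批次选择的示意:learner_loss / reference_loss 为同一候选批次中
    每个图文对在学生模型与参考(教师)模型下的逐样本对比损失。
    可学习性得分取两者之差:学生损失高而参考损失低的样本优先保留。"""
    learnability = learner_loss - reference_loss
    k = max(1, int(keep_ratio * learnability.numel()))
    keep_idx = torch.topk(learnability, k).indices
    return keep_idx

# 用法示意:先对候选批次前向一次得到逐样本损失,再筛出子批次做真正的梯度更新
learner_loss = torch.tensor([2.3, 0.4, 1.7, 0.9, 3.1, 0.2])
reference_loss = torch.tensor([0.5, 0.3, 1.6, 0.8, 0.7, 0.2])
print(select_learnable_batch(learner_loss, reference_loss, keep_ratio=0.5))
```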
[CV-146] AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
【速读】: 该论文试图解决现有文本到视频生成模型中3D相机控制不精确以及视频生成质量下降的问题。解决方案的关键在于从基本原理出发,分析相机运动特性,并提出了一系列改进措施:首先,通过识别视频中相机运动引起的低频运动特性,调整训练和测试中的姿态条件调度,加速训练收敛并提高视觉和运动质量;其次,通过研究无条件视频扩散变换器的表示,发现其内部隐式执行相机姿态估计,且仅部分层包含相机信息,因此仅将相机条件注入到架构的一个子集,使训练参数减少至原来的1/4,训练速度加快,视觉质量提高10%;最后,通过补充一个包含20K个相机静止、内容动态多样的视频的精选数据集,帮助模型区分相机运动和场景运动,提升生成姿态条件视频的动态效果。这些发现共同构成了Advanced 3D Camera Control (AC3D)架构,成为生成视频建模中相机控制的新标准。
链接: https://arxiv.org/abs/2411.18673
作者: Sherwin Bahmani,Ivan Skorokhodov,Guocheng Qian,Aliaksandr Siarohin,Willi Menapace,Andrea Tagliasacchi,David B. Lindell,Sergey Tulyakov
关键词-EN: Numerous works, generation quality suffers, camera, camera control, recently integrated
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This suggested us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to 4x reduction of training parameters, improved training speed and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate the difference between camera and scene motion, and improves the dynamics of generated pose-conditioned videos. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.
zh
[CV-147] FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models
【速读】: 该论文试图解决医学视觉-语言模型在生成放射学报告中量化测量不准确的问题,即“幻觉”现象,这影响了临床可靠性。解决方案的关键是引入了一个名为FactCheXcker的模块化框架,该框架通过改进的查询-代码-更新范式来消除报告中的测量幻觉。具体来说,FactCheXcker利用大型语言模型的代码生成能力,结合专门的模块来处理基于原始报告生成的测量查询,提取可测量的发现,并将结果整合到更新后的报告中。实验结果表明,FactCheXcker显著减少了幻觉,提高了测量精度,并保持了原始报告的质量。
链接: https://arxiv.org/abs/2411.18672
作者: Alice Heiman,Xiaoman Zhang,Emma Chen,Sung Eun Kim,Pranav Rajpurkar
关键词-EN: undermine clinical reliability, generating accurate quantitative, accurate quantitative measurements, clinical reliability, struggle with generating
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical vision-language model models often struggle with generating accurate quantitative measurements in radiology reports, leading to hallucinations that undermine clinical reliability. We introduce FactCheXcker, a modular framework that de-hallucinates radiology report measurements by leveraging an improved query-code-update paradigm. Specifically, FactCheXcker employs specialized modules and the code generation capabilities of large language models to solve measurement queries generated based on the original report. After extracting measurable findings, the results are incorporated into an updated report. We evaluate FactCheXcker on endotracheal tube placement, which accounts for an average of 78% of report measurements, using the MIMIC-CXR dataset and 11 medical report-generation models. Our results show that FactCheXcker significantly reduces hallucinations, improves measurement precision, and maintains the quality of the original reports. Specifically, FactCheXcker improves the performance of all 11 models and achieves an average improvement of 94.0% in reducing measurement hallucinations measured by mean absolute error.
zh
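下面用一个只处理气管插管 (ETT) 深度的小例子示意"查询-代码-更新"范式(非官方实现):从报告中查询可测量的数值,调用专用测量模块重新计算,再把结果写回报告;measure_ett_depth 为假想的测量接口。

```python
import re

def factcheck_report(report: str, measure_ett_depth, image) -> str:
    """查询-代码-更新范式的极简示意:
    1) 从报告中查询可校验的测量(此处仅匹配 ETT 深度);
    2) 调用专用测量模块在影像上重新计算;
    3) 用测量结果更新报告中的数字。"""
    pattern = r"ETT[^.]*?(\d+(?:\.\d+)?)\s*cm"
    match = re.search(pattern, report, flags=re.IGNORECASE)
    if match is None:
        return report                       # 报告中没有可校验的测量,原样返回
    measured = measure_ett_depth(image)     # 由分割/关键点模型给出的客观测量
    return report[:match.start(1)] + f"{measured:.1f}" + report[match.end(1):]

# 用法示意(测量模块以常数桩函数代替)
report = "ETT tip is 7.5 cm above the carina. Lungs are clear."
print(factcheck_report(report, measure_ett_depth=lambda img: 4.2, image=None))
```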
[CV-148] TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
【速读】: 该论文试图解决在长视频中点跟踪的鲁棒性问题,特别是TAPTRv2在处理长时间序列时由于目标跟踪点随时间变化而导致的特征漂移问题。解决方案的关键在于TAPTRv3引入了空间和时间上下文信息,以改进特征查询的质量。具体来说,TAPTRv3采用了Context-aware Cross-Attention (CCA) 来利用周围空间上下文增强图像特征查询的注意力分数质量,以及Visibility-aware Long-Temporal Attention (VLTA) 来在考虑帧可见性的同时对所有过去帧进行时间注意力处理,从而有效解决了TAPTRv2中RNN-like长时序建模带来的特征漂移问题。这些改进使得TAPTRv3在多个挑战性数据集上显著超越了TAPTRv2,并达到了最先进的性能。
链接: https://arxiv.org/abs/2411.18671
作者: Jinyuan Qu,Hongyang Li,Shilong Liu,Tianhe Ren,Zhaoyang Zeng,Lei Zhang
关键词-EN: long videos, point tracking robustness, videos, querying, long
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we present TAPTRv3, which is built upon TAPTRv2 to improve its point tracking robustness in long videos. TAPTRv2 is a simple DETR-like framework that can accurately track any point in real-world videos without requiring cost-volume. TAPTRv3 improves TAPTRv2 by addressing its shortage in querying high quality features from long videos, where the target tracking points normally undergo increasing variation over time. In TAPTRv3, we propose to utilize both spatial and temporal context to bring better feature querying along the spatial and temporal dimensions for more robust tracking in long videos. For better spatial feature querying, we present Context-aware Cross-Attention (CCA), which leverages surrounding spatial context to enhance the quality of attention scores when querying image features. For better temporal feature querying, we introduce Visibility-aware Long-Temporal Attention (VLTA) to conduct temporal attention to all past frames while considering their corresponding visibilities, which effectively addresses the feature drifting problem in TAPTRv2 brought by its RNN-like long-temporal modeling. TAPTRv3 surpasses TAPTRv2 by a large margin on most of the challenging datasets and obtains state-of-the-art performance. Even when compared with methods trained with large-scale extra internal data, TAPTRv3 is still competitive.
zh
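下面给出"可见性感知长时序注意力 (VLTA)"这一思想的最小示意(非官方实现):对过去所有帧做注意力时,在 softmax 前按该点在各帧的可见性压低被遮挡帧的得分;张量形状与加权方式均为示例假设。

```python
import torch
import torch.nn.functional as F

def visibility_aware_attention(query, past_feats, past_vis, temperature=1.0):
    """可见性感知长时序注意力的示意。
    query:      (B, C)     当前帧的点查询特征
    past_feats: (B, T, C)  过去所有帧的点特征
    past_vis:   (B, T)     各帧该点的可见性(0~1)
    做法:注意力得分在 softmax 前加上 log(可见性),被遮挡帧的贡献被压低。"""
    scores = torch.einsum("bc,btc->bt", query, past_feats) / (query.shape[-1] ** 0.5)
    scores = scores / temperature + torch.log(past_vis.clamp_min(1e-6))
    attn = F.softmax(scores, dim=-1)
    return torch.einsum("bt,btc->bc", attn, past_feats)    # 聚合后的时序上下文

q = torch.randn(2, 64)
feats = torch.randn(2, 10, 64)
vis = torch.rand(2, 10)
print(visibility_aware_attention(q, feats, vis).shape)     # torch.Size([2, 64])
```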
[CV-149] SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality
【速读】: 该论文试图解决的问题是如何将基于自然RGB图像训练的视觉基础模型(如Segment Anything Model, SAM)有效地迁移到具有不同物理特性的其他成像模态(如偏振成像)。解决方案的关键在于提出了一种名为SimCMF的简单而有效的框架,该框架通过引入一个新颖的跨模态对齐模块(cross-modal alignment module)来解决模态对齐问题。SimCMF通过对不同基本组件的深入分析,最终实现了对新成像模态的支持,并在缺乏相关基准的情况下构建了性能评估基准。实验结果表明,SimCMF能够显著提升其他传感器模态的分割性能(mIoU),平均从22.15%提升至53.88%,并始终优于其他基线方法。
链接: https://arxiv.org/abs/2411.18669
作者: Chenyang Lei,Liyi Chen,Jun Cen,Xiao Chen,Zhen Lei,Felix Heide,Qifeng Chen,Zhaoxiang Zhang
关键词-EN: revolutionary social impact, ChatGPT and Sora, Foundation models, vision foundation models, foundation models trained
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL . arXiv admin note: substantial text overlap with arXiv:2409.08083
点击查看摘要
Abstract:Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework, SimCMF, to study an important problem: cross-modal fine-tuning from vision foundation models trained on natural RGB images to other imaging modalities of different physical properties (e.g., polarization). In SimCMF, we conduct a thorough analysis of different basic components from the most naive design and ultimately propose a novel cross-modal alignment module to address the modality misalignment problem. We apply SimCMF to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new imaging modality. Given the absence of relevant benchmarks, we construct a benchmark for performance evaluation. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors’ performance. SimCMF can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. The code is available at this https URL
zh
[CV-150] Towards Chunk-Wise Generation for Long Videos
【速读】: 该论文试图解决生成长时间视频时面临的内存需求过大和计算复杂度高的问题。解决方案的关键在于采用自回归分块策略,即将长时间视频生成任务分解为多个短时间视频生成子任务,通过生成多个具有强时空关联性的短视频块,然后将它们拼接在一起形成长时间视频。这种方法有效降低了每个子任务的计算成本,避免了内存溢出问题,并通过设计高效的k步搜索解决方案来缓解应用短图像到视频模型于长时间视频任务时产生的常见问题。
链接: https://arxiv.org/abs/2411.18668
作者: Siyang Zhang,Ser-Nam Lim
关键词-EN: substantial GPU memory, GPU memory demands, Generating long-duration videos, significant challenge due, memory demands required
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generating long-duration videos has always been a significant challenge due to the inherent complexity of spatio-temporal domain and the substantial GPU memory demands required to calculate huge size tensors. While diffusion based generative models achieve state-of-the-art performance in video generation task, they are typically trained with predefined video resolutions and lengths. During inference, a noise tensor with specific resolution and length should be specified at first, and the model will perform denoising on the entire video tensor simultaneously, all the frames together. Such approach will easily raise an out-of-memory (OOM) problem when the specified resolution and/or length exceed a certain limit. One of the solutions to this problem is to generate many short video chunks autoregressively with strong inter-chunk spatio-temporal relation and then concatenate them together to form a long video. In this approach, a long video generation task is divided into multiple short video generation subtasks, and the cost of each subtask is reduced to a feasible level. In this paper, we conduct a detailed survey on long video generation with the autoregressive chunk-by-chunk strategy. We address common problems caused by applying short image-to-video models to long video tasks and design an efficient k -step search solution to mitigate these problems.
zh
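下面是"逐块自回归生成长视频"这一基本流程的骨架示意(非论文官方实现):每次以上一块的末帧为条件生成下一个短视频块,去掉重叠帧后拼接;i2v_model 接口与重叠帧数均为示例假设。

```python
import torch

@torch.no_grad()
def generate_long_video(i2v_model, first_frame, num_chunks=8, overlap=1):
    """逐块自回归生成长视频的骨架。
    i2v_model(cond_frames) 假定返回形状为 (T, C, H, W) 的一段短视频。"""
    chunks = []
    cond = first_frame.unsqueeze(0)                  # (1, C, H, W) 作为初始条件
    for _ in range(num_chunks):
        clip = i2v_model(cond)                       # 生成一个短视频块
        chunks.append(clip if not chunks else clip[overlap:])  # 去掉与条件重叠的帧
        cond = clip[-overlap:]                       # 用块尾帧作为下一块的条件
    return torch.cat(chunks, dim=0)                  # (总帧数, C, H, W)

# 用一个恒等"模型"演示接口:把条件末帧重复成 16 帧
toy_i2v = lambda cond: cond[-1:].repeat(16, 1, 1, 1)
video = generate_long_video(toy_i2v, torch.randn(3, 64, 64), num_chunks=4)
print(video.shape)
```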
[CV-151] Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting
【速读】: 该论文试图解决现有基于渲染的自监督学习框架在预训练阶段计算量大和内存消耗高的问题。解决方案的关键在于提出了一种高效的框架,名为GS³,它将快速3D高斯Splatting(3D Gaussian Splatting)无缝集成到基于渲染的框架中。具体来说,GS³通过比较渲染的RGB图像与真实RGB图像来预训练点云编码器,因为只有富含所学几何和外观信息的高斯点才能生成高质量的渲染图像。该框架通过将输入的RGB-D图像反投影到3D空间,并使用点云编码器提取点特征,然后从学习到的点云特征中预测场景的3D高斯点,并使用基于瓦片的栅格化器进行图像渲染。最终,预训练的点云编码器可以微调以适应各种下游3D任务,包括高级感知任务如3D分割和检测,以及低级任务如3D场景重建。实验结果表明,相较于之前的渲染框架Ponder,GS³框架的预训练速度提升约9倍,内存开销不到其0.25倍,具有显著的效率提升和效果优势。
链接: https://arxiv.org/abs/2411.18667
作者: Hao Liu,Minglin Chen,Yanni Ma,Haihong Xiao,Ying He
关键词-EN: large-scale unlabeled datasets, unlabeled datasets contribute, point cloud encoder, point cloud, model achieving powerful
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures, 15 tables
点击查看摘要
Abstract:Pre-training on large-scale unlabeled datasets contribute to the model achieving powerful performance on 3D vision tasks, especially when annotations are limited. However, existing rendering-based self-supervised frameworks are computationally demanding and memory-intensive during pre-training due to the inherent nature of volume rendering. In this paper, we propose an efficient framework named GS ^3 to learn point cloud representation, which seamlessly integrates fast 3D Gaussian Splatting into the rendering-based framework. The core idea behind our framework is to pre-train the point cloud encoder by comparing rendered RGB images with real RGB images, as only Gaussian points enriched with learned rich geometric and appearance information can produce high-quality renderings. Specifically, we back-project the input RGB-D images into 3D space and use a point cloud encoder to extract point-wise features. Then, we predict 3D Gaussian points of the scene from the learned point cloud features and uses a tile-based rasterizer for image rendering. Finally, the pre-trained point cloud encoder can be fine-tuned to adapt to various downstream 3D tasks, including high-level perception tasks such as 3D segmentation and detection, as well as low-level tasks such as 3D scene reconstruction. Extensive experiments on downstream tasks demonstrate the strong transferability of the pre-trained point cloud encoder and the effectiveness of our self-supervised learning framework. In addition, our GS ^3 framework is highly efficient, achieving approximately 9 \times pre-training speedup and less than 0.25 \times memory cost compared to the previous rendering-based framework Ponder.
zh
[CV-152] 3D Scene Graph Guided Vision-Language Pre-training
【速读】: 该论文试图解决3D视觉-语言(3D Vision-Language, VL)推理任务中现有方法任务特定性强、依赖手工设计模块和辅助损失的问题。解决方案的关键在于提出了一种3D场景图引导的视觉-语言预训练(3D Scene Graph-Guided Vision-Language Pre-training, VLP)框架。该框架通过利用模态编码器、图卷积层和交叉注意力层,学习适用于多种3D VL推理任务的通用表示,从而消除了对任务特定设计的依赖。预训练目标包括场景图引导的对比学习和掩码模态学习,前者通过3D场景图与自然语言之间的强相关性,在不同细粒度级别上对齐3D对象与文本特征;后者利用跨模态信息重构掩码词和3D对象,通过位置线索预测其语义类别,而非直接重构3D点云。实验结果表明,该预训练模型在下游任务如3D视觉定位、3D密集标注和3D问答中表现优异。
链接: https://arxiv.org/abs/2411.18666
作者: Hao Liu,Yanni Ma,Yan Liu,Haihong Xiao,Ying He
关键词-EN: gained significant attention, significant attention due, natural language descriptions, physical world, gained significant
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 7 tables
点击查看摘要
Abstract:3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions. Existing approaches typically follow task-specific, highly specialized paradigms. Therefore, these methods focus on a limited range of reasoning sub-tasks and rely heavily on the hand-crafted modules and auxiliary losses. This highlights the need for a simpler, unified and general-purpose model. In this paper, we leverage the inherent connection between 3D scene graphs and natural language, proposing a 3D scene graph-guided vision-language pre-training (VLP) framework. Our approach utilizes modality encoders, graph convolutional layers and cross-attention layers to learn universal representations that adapt to a variety of 3D VL reasoning tasks, thereby eliminating the need for task-specific designs. The pre-training objectives include: 1) Scene graph-guided contrastive learning, which leverages the strong correlation between 3D scene graphs and natural language to align 3D objects with textual features at various fine-grained levels; and 2) Masked modality learning, which uses cross-modality information to reconstruct masked words and 3D objects. Instead of directly reconstructing the 3D point clouds of masked objects, we use position clues to predict their semantic categories. Extensive experiments demonstrate that our pre-training model, when fine-tuned on several downstream tasks, achieves performance comparable to or better than existing methods in tasks such as 3D visual grounding, 3D dense captioning, and 3D question answering.
zh
[CV-153] SpotLight: Shadow-Guided Object Relighting via Diffusion
【速读】: 该论文试图解决神经渲染引擎在插入虚拟对象时缺乏对光照设置的精确控制问题。解决方案的关键在于通过指定对象的阴影来实现精确的光照控制。具体来说,论文提出的方法“SpotLight”通过将对象的阴影注入预训练的扩散模型神经渲染器中,使其能够根据所需的光照位置准确地为对象着色,并自然地融入目标背景图像中。这种方法无需额外的训练,且在对象合成结果上表现出色,优于现有的专门用于重新照明的扩散模型。
链接: https://arxiv.org/abs/2411.18665
作者: Frédéric Fortier-Chouinard,Zitian Zhang,Louis-Etienne Messier,Mathieu Garon,Anand Bhattad,Jean-François Lalonde
关键词-EN: neural rendering engines, inserting virtual objects, powerful neural rendering, neural rendering, rendering engines
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL
点击查看摘要
Abstract:Recent work has shown that diffusion models can be used as powerful neural rendering engines that can be leveraged for inserting virtual objects into images. Unlike typical physics-based renderers, however, neural rendering engines are limited by the lack of manual control over the lighting setup, which is often essential for improving or personalizing the desired image outcome. In this paper, we show that precise lighting control can be achieved for object relighting simply by specifying the desired shadows of the object. Rather surprisingly, we show that injecting only the shadow of the object into a pre-trained diffusion-based neural renderer enables it to accurately shade the object according to the desired light position, while properly harmonizing the object (and its shadow) within the target background image. Our method, SpotLight, leverages existing neural rendering approaches and achieves controllable relighting results with no additional training. Specifically, we demonstrate its use with two neural renderers from the recent literature. We show that SpotLight achieves superior object compositing results, both quantitatively and perceptually, as confirmed by a user study, outperforming existing diffusion-based models specifically designed for relighting.
zh
[CV-154] Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling
【速读】: 该论文试图解决现有扩散模型在采样引导技术(如CFG)下生成高质量图像、视频和3D内容时,质量提升的同时牺牲多样性和动态性的问题。解决方案的关键在于引入了一种名为时空跳跃引导(Spatiotemporal Skip Guidance, STG)的无需训练的采样引导方法。STG通过自扰动方式隐式地模拟弱模型,避免了外部模型或额外训练的需求。其核心机制是通过选择性地跳过时空层,生成原始模型的对齐降级版本,从而在不损害多样性或动态性的前提下提升样本质量。
链接: https://arxiv.org/abs/2411.18664
作者: Junha Hyung,Kinam Kim,Susung Hong,Min-Jung Kim,Jaegul Choo
关键词-EN: generating high-quality images, high-quality images, powerful tool, tool for generating, generating high-quality
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:Diffusion models have emerged as a powerful tool for generating high-quality images, videos, and 3D content. While sampling guidance techniques like CFG improve quality, they reduce diversity and motion. Autoguidance mitigates these issues but demands extra weak model training, limiting its practicality for large-scale models. In this work, we introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models. STG employs an implicit weak model via self-perturbation, avoiding the need for external models or additional training. By selectively skipping spatiotemporal layers, STG produces an aligned, degraded version of the original model to boost sample quality without compromising diversity or dynamic degree. Our contributions include: (1) introducing STG as an efficient, high-performing guidance technique for video diffusion models, (2) eliminating the need for auxiliary models by simulating a weak model through layer skipping, and (3) ensuring quality-enhanced guidance without compromising sample diversity or dynamics unlike CFG. For additional results, visit this https URL.
zh
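STG 的做法可以类比 CFG:把"跳过部分时空层"的同一模型当作隐式弱模型,再沿两者预测之差外推。下面是单步引导的示意(非官方实现,model 支持按索引跳层这一接口为示例假设):

```python
import torch

@torch.no_grad()
def stg_denoise_step(model, z, t, prompt, skip_layers, w_stg=1.0):
    """时空跳跃引导的示意:
    eps_guided = eps_full + w * (eps_full - eps_skip),
    其中 eps_skip 来自跳过若干时空层的同一模型(隐式弱模型)。"""
    eps_full = model(z, t, prompt, skip_layers=None)         # 完整模型的预测
    eps_skip = model(z, t, prompt, skip_layers=skip_layers)  # 跳层后的"弱"预测
    return eps_full + w_stg * (eps_full - eps_skip)

# 用占位模型演示接口:跳层时输出被刻意"变弱"
toy_model = lambda z, t, prompt, skip_layers=None: (0.9 if skip_layers is None else 0.5) * z
z = torch.randn(1, 4, 8, 32, 32)
print(stg_denoise_step(toy_model, z, t=10, prompt="a cat", skip_layers=[4, 5]).shape)
```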
[CV-155] HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior
【速读】: 该论文试图解决现有文本到图像扩散模型在真实世界图像超分辨率(Real-ISR)中由于噪声文本提示和缺乏空间信息而导致的意外结果问题。解决方案的关键在于提出了HoliSDiP框架,该框架利用语义分割技术提供精确的文本和空间指导。具体来说,HoliSDiP使用语义标签作为简洁的文本提示,并通过分割掩码和提出的Segmentation-CLIP Map引入密集的语义指导,从而在减少提示噪声和增强空间控制方面显著提升图像质量。
链接: https://arxiv.org/abs/2411.18662
作者: Li-Yuan Tsao,Hao-Wei Chen,Hao-Wei Chung,Deqing Sun,Chun-Yi Lee,Kelvin C.K. Chan,Ming-Hsuan Yang
关键词-EN: real-world image super-resolution, diffusion models, models have emerged, emerged as powerful, powerful priors
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Text-to-image diffusion models have emerged as powerful priors for real-world image super-resolution (Real-ISR). However, existing methods may produce unintended results due to noisy text prompts and their lack of spatial information. In this paper, we present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for diffusion-based Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed Segmentation-CLIP Map. Extensive experiments demonstrate that HoliSDiP achieves significant improvement in image quality across various Real-ISR scenarios through reduced prompt noise and enhanced spatial control.
zh
[CV-156] OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains
【速读】: 该论文试图解决从文本描述生成逼真的三维人体-物体交互(Human-Object Interactions, HOIs)的问题,特别是在域外(Out-of-Domain, OOD)场景中确保物理合理性的挑战。解决方案的关键在于提出了OOD-HOI框架,该框架通过以下几个核心组件实现:1) 双分支互惠扩散模型(dual-branch reciprocal diffusion model)用于合成初始交互姿态;2) 接触引导的交互优化器(contact-guided interaction refiner)基于预测的接触区域提高物理准确性;3) 动态适应机制(dynamic adaptation mechanism)包括语义调整和几何变形,以增强鲁棒性。这些组件共同作用,使得生成的三维交互姿态在域外场景中更加真实和物理上合理。
链接: https://arxiv.org/abs/2411.18660
作者: Yixuan Zhang,Hui Yang,Chuanchen Luo,Junran Peng,Yuxi Wang,Zhaoxiang Zhang
关键词-EN: active research topic, augmented reality, text descriptions, active research, research topic
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generating realistic 3D human-object interactions (HOIs) from text descriptions is a active research topic with potential applications in virtual and augmented reality, robotics, and animation. However, creating high-quality 3D HOIs remains challenging due to the lack of large-scale interaction data and the difficulty of ensuring physical plausibility, especially in out-of-domain (OOD) scenarios. Current methods tend to focus either on the body or the hands, which limits their ability to produce cohesive and realistic interactions. In this paper, we propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions. Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism which includes semantic adjustment and geometry deformation to improve robustness. Experimental results demonstrate that our OOD-HOI could generate more realistic and physically plausible 3D interaction pose in OOD scenarios compared to existing methods.
zh
[CV-157] DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models
【速读】: 该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)中的幻觉问题,包括对象、属性和关系幻觉。解决方案的关键在于通过分析跨模态注意力模式(Cross-modal Attention Patterns)在幻觉和非幻觉状态下的差异,开发了一种轻量级的幻觉检测器,称为Detecting Hallucinations by Cross-modal Attention Patterns (DHCP)。DHCP方法无需额外的LVLM训练或推理步骤,实验结果表明其在幻觉检测方面表现出色,为提升LVLMs的可靠性和可信度提供了新的见解。
链接: https://arxiv.org/abs/2411.18659
作者: Yudong Zhang,Ruobing Xie,Jiansheng Chen,Xingwu Sun,Zhanhui kang,Yu Wang
关键词-EN: Large vision-language models, complex multimodal tasks, Large vision-language, demonstrated exceptional performance, cross-modal attention patterns
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures
点击查看摘要
Abstract:Large vision-language models (LVLMs) have demonstrated exceptional performance on complex multimodal tasks. However, they continue to suffer from significant hallucination issues, including object, attribute, and relational hallucinations. To accurately detect these hallucinations, we investigated the variations in cross-modal attention patterns between hallucination and non-hallucination states. Leveraging these distinctions, we developed a lightweight detector capable of identifying hallucinations. Our proposed method, Detecting Hallucinations by Cross-modal Attention Patterns (DHCP), is straightforward and does not require additional LVLM training or extra LVLM inference steps. Experimental results show that DHCP achieves remarkable performance in hallucination detection. By offering novel insights into the identification and analysis of hallucinations in LVLMs, DHCP contributes to advancing the reliability and trustworthiness of these models.
zh
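DHCP 的检测器本身很轻量:把解码过程中聚合到图像 token 上的跨模态注意力模式作为特征,训练一个小分类头判别幻觉与否。下面是这一思路的示意(非官方实现,特征的具体统计方式为示例假设):

```python
import torch
import torch.nn as nn

class AttentionPatternDetector(nn.Module):
    """把 LVLM 各层各注意力头对图像 token 的注意力统计量展平,
    用一个轻量 MLP 判别"幻觉 / 非幻觉"(示意)。"""
    def __init__(self, num_layers, num_heads, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_layers * num_heads, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                     # 二分类:幻觉 vs. 非幻觉
        )

    def forward(self, attn_to_image):
        # attn_to_image: (B, num_layers, num_heads),
        # 例如每层每个头上"答案 token 对图像 token 的注意力总量"
        return self.mlp(attn_to_image.flatten(1))

detector = AttentionPatternDetector(num_layers=32, num_heads=32)
fake_pattern = torch.rand(4, 32, 32)                  # 假想的注意力统计特征
print(detector(fake_pattern).shape)                   # torch.Size([4, 2])
```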
[CV-158] HDI-Former: Hybrid Dynamic Interaction ANN-SNN Transformer for Object Detection Using Frames and Events
【速读】: 该论文试图解决在复杂场景中进行物体检测时,现有方法使用两个独立的人工神经网络(ANN)分支导致跨模态信息交互受限,以及从事件流中提取时间线索时能耗较高的问题。解决方案的关键在于提出了一个混合动态交互的ANN-SNN Transformer架构,称为HDI-Former。该架构首次尝试直接训练一个混合ANN-SNN模型,以实现高精度和低能耗的物体检测。具体技术包括:1) 引入一种新的语义增强自注意力机制,以加强ANN Transformer分支中图像编码令牌之间的关联;2) 设计了一个脉冲Swin Transformer分支,用于以低能耗建模事件流中的时间线索;3) 提出了一种生物启发的动态交互机制,用于实现ANN和SNN子网络之间的跨模态信息交互。实验结果表明,HDI-Former在性能上显著优于现有的11种最先进方法和4种基线方法,并且在DSEC-Detection数据集上,SNN分支的能耗仅为相同架构ANN的1/10.57。
链接: https://arxiv.org/abs/2411.18658
作者: Dianze Li,Jianing Li,Xu Liu,Zhaokun Zhou,Xiaopeng Fan,Yonghong Tian
关键词-EN: Artificial Neural Network, Combining the complementary, independent Artificial Neural, object detection, challenging scenarios
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures
点击查看摘要
Abstract:Combining the complementary benefits of frames and events has been widely used for object detection in challenging scenarios. However, most object detection methods use two independent Artificial Neural Network (ANN) branches, limiting cross-modality information interaction across the two visual streams and encountering challenges in extracting temporal cues from event streams with low power consumption. To address these challenges, we propose HDI-Former, a Hybrid Dynamic Interaction ANN-SNN Transformer, marking the first trial to design a directly trained hybrid ANN-SNN architecture for high-accuracy and energy-efficient object detection using frames and events. Technically, we first present a novel semantic-enhanced self-attention mechanism that strengthens the correlation between image encoding tokens within the ANN Transformer branch for better performance. Then, we design a Spiking Swin Transformer branch to model temporal cues from event streams with low power consumption. Finally, we propose a bio-inspired dynamic interaction mechanism between ANN and SNN sub-networks for cross-modality information interaction. The results demonstrate that our HDI-Former outperforms eleven state-of-the-art methods and our four baselines by a large margin. Our SNN branch also shows comparable performance to the ANN with the same architecture while consuming 10.57 \times less energy on the DSEC-Detection dataset. Our open-source code is available in the supplementary material.
zh
[CV-159] AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward
【速读】: 该论文试图解决文本到动作生成模型在事件级文本描述与生成动作之间对齐的问题。解决方案的关键在于引入AToM框架,通过利用GPT-4Vision的奖励机制来增强生成动作与文本提示之间的对齐。具体步骤包括:构建包含完整性、时间关系和动作频率的MotionPrefer数据集;设计基于GPT-4Vision的详细动作注释范式,包括视觉数据格式化、任务特定指令和评分规则;以及使用强化学习对现有文本到动作模型进行微调。实验结果表明,AToM显著提高了文本到动作生成的事件级对齐质量。
链接: https://arxiv.org/abs/2411.18654
作者: Haonan Han,Xiangzuo Wu,Huan Liao,Zunnan Xu,Zhongyuan Hu,Ronghui Li,Yachao Zhang,Xiu Li
关键词-EN: creating realistic human, realistic human motion, efficiency and flexibility, opened new possibilities, possibilities for creating
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, text-to-motion models have opened new possibilities for creating realistic human motion with greater efficiency and flexibility. However, aligning motion generation with event-level textual descriptions presents unique challenges due to the complex relationship between textual prompts and desired motion outcomes. To address this, we introduce AToM, a framework that enhances the alignment between generated motion and text prompts by leveraging reward from GPT-4Vision. AToM comprises three main stages: Firstly, we construct a dataset MotionPrefer that pairs three types of event-level textual prompts with generated motions, which cover the integrity, temporal relationship and frequency of motion. Secondly, we design a paradigm that utilizes GPT-4Vision for detailed motion annotation, including visual data formatting, task-specific instructions and scoring rules for each sub-task. Finally, we fine-tune an existing text-to-motion model using reinforcement learning guided by this paradigm. Experimental results demonstrate that AToM significantly improves the event-level alignment quality of text-to-motion generation.
zh
[CV-160] Surf-NeRF: Surface Regularised Neural Radiance Fields
【速读】: 该论文试图解决神经辐射场 (NeRF) 在表示复杂场景几何时存在的形状-辐射模糊性问题,即 NeRF 难以收敛到与真实几何一致的表示。解决方案的关键在于采用课程学习 (curriculum learning) 方法,通过引入四个额外的正则化项来增强几何平滑性、法线一致性以及在场景几何中分离朗伯反射和镜面反射。这些正则化项基于物理模型,有助于 NeRF 更准确地表示场景几何,从而在位置编码 NeRF 和基于网格的模型上分别提升了 14.4% 和 9.2% 的法线精度。此外,该方法还实现了视点依赖的外观分离,使得 NeRF 的几何表示与捕捉到的场景更加一致,并且兼容现有的 NeRF 变体,为几何关键应用中的辐射场表示提供了重要支持。
链接: https://arxiv.org/abs/2411.18652
作者: Jack Naylor,Viorela Ila,Donald G. Dansereau
关键词-EN: Neural Radiance Fields, Neural Radiance, realistically represent complex, represent complex behaviour, Radiance Fields
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 17 figures, 9 tables, project page can be found at this http URL
点击查看摘要
Abstract:Neural Radiance Fields (NeRFs) provide a high fidelity, continuous scene representation that can realistically represent complex behaviour of light. Despite recent works like Ref-NeRF improving geometry through physics-inspired models, the ability for a NeRF to overcome shape-radiance ambiguity and converge to a representation consistent with real geometry remains limited. We demonstrate how curriculum learning of a surface light field model helps a NeRF converge towards a more geometrically accurate scene representation. We introduce four additional regularisation terms to impose geometric smoothness, consistency of normals and a separation of Lambertian and specular appearance at geometry in the scene, conforming to physical models. Our approach yields improvements of 14.4% to normals on positionally encoded NeRFs and 9.2% on grid-based models compared to current reflection-based NeRF variants. This includes a separated view-dependent appearance, conditioning a NeRF to have a geometric representation consistent with the captured scene. We demonstrate compatibility of our method with existing NeRF variants, as a key step in enabling radiance-based representations for geometry critical applications.
zh
[CV-161] RoMo: Robust Motion Segmentation Improves Structure from Motion
【速读】: 该论文试图解决单目视频中动态场景的相机姿态估计问题,特别是如何从视频中区分静态和动态部分以提高结构从运动 (SfM) 的准确性。解决方案的关键在于提出了一种名为 RoMo 的新型视频运动分割方法,该方法结合了光流 (optical flow) 和极线约束 (epipolar cues) 以及预训练的视频分割模型。RoMo 通过迭代优化,能够有效地识别相对于固定世界坐标系的运动部分,从而显著提升 SfM 相机校准管道的性能,尤其是在包含动态内容的场景中,达到了新的技术水平。
链接: https://arxiv.org/abs/2411.18650
作者: Lily Goli,Sara Sabour,Mark Matthews,Marcus Brubaker,Dmitry Lagun,Alec Jacobson,David J. Fleet,Saurabh Saxena,Andrea Tagliasacchi
关键词-EN: monocular casually-captured video, extensive progress, reconstruction and generation, monocular casually-captured, casually-captured video
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
zh
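下面用 OpenCV 示意"光流 + 对极约束"这一步的基本做法(非官方实现,且省略了与预训练视频分割模型的迭代结合):静态背景的光流对应点应满足同一基础矩阵的对极约束,RANSAC 判为外点的位置更可能属于运动物体。

```python
import cv2
import numpy as np

def motion_mask_from_flow(img1_gray, img2_gray, grid_step=8, ransac_thr=1.0):
    """光流 + 对极约束的运动分割示意:
    1) 计算稠密光流,得到两帧间的对应点;
    2) 用 RANSAC 估计基础矩阵,静态背景的对应点应满足对极约束;
    3) 不满足约束(RANSAC 外点)的位置标记为疑似运动。"""
    h, w = img1_gray.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    flow = cv2.calcOpticalFlowFarneback(img1_gray, img2_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    ys, xs = np.mgrid[0:h:grid_step, 0:w:grid_step]
    pts1 = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    pts2 = pts1 + flow[ys.ravel(), xs.ravel()]            # 光流给出的对应点
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                            ransac_thr, 0.99)
    if inlier_mask is None:
        return mask                                       # 估计失败时返回空掩码
    dynamic = (inlier_mask.ravel() == 0)                  # 外点视为疑似运动
    mask[pts1[dynamic, 1].astype(int), pts1[dynamic, 0].astype(int)] = 255
    return mask                                           # 稀疏运动种子,可再交给分割模型细化

img1 = np.random.randint(0, 255, (120, 160), dtype=np.uint8)
img2 = np.roll(img1, 2, axis=1)                           # 整体平移,近似纯相机运动
print(motion_mask_from_flow(img1, img2).sum())
```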
[CV-162] Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings
【速读】: 该论文试图解决在大规模图像分类任务中,如何实现内解释性(Inner Interpretability)的问题。解决方案的关键在于提出了一个概念框架,并引入了双向概念与输入嵌入交互模块(Bi-directional Interaction between Concept and Input Embeddings, Bi-ICE)。Bi-ICE模块通过在计算、算法和实现层面上促进解释性,增强了模型的透明度。具体来说,该模块通过生成基于人类可理解概念的预测、量化这些概念的贡献,并在输入中定位这些概念,从而实现了对图像分类任务的增强透明性。此外,该方法还展示了概念学习过程及其收敛性,突出了算法层面的解释性。
链接: https://arxiv.org/abs/2411.18645
作者: Jinyung Hong,Yearim Kim,Keun Hee Park,Sangyu Han,Nojun Kwak,Theodore P. Pavlic
关键词-EN: promising field focused, developing scalable, automated methods, promising field, field focused
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: The first two authors equally contributed to this work, 27 pages, 19 figures, 9 tables
点击查看摘要
Abstract:Inner interpretability is a promising field focused on uncovering the internal mechanisms of AI systems and developing scalable, automated methods to understand these systems at a mechanistic level. While significant research has explored top-down approaches starting from high-level problems or algorithmic hypotheses and bottom-up approaches building higher-level abstractions from low-level or circuit-level descriptions, most efforts have concentrated on analyzing large language models. Moreover, limited attention has been given to applying inner interpretability to large-scale image tasks, primarily focusing on architectural and functional levels to visualize learned concepts. In this paper, we first present a conceptual framework that supports inner interpretability and multilevel analysis for large-scale image classification tasks. We introduce the Bi-directional Interaction between Concept and Input Embeddings (Bi-ICE) module, which facilitates interpretability across the computational, algorithmic, and implementation levels. This module enhances transparency by generating predictions based on human-understandable concepts, quantifying their contributions, and localizing them within the inputs. Finally, we showcase enhanced transparency in image classification, measuring concept contributions and pinpointing their locations within the inputs. Our approach highlights algorithmic interpretability by demonstrating the process of concept learning and its convergence.
zh
[CV-163] Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop
【速读】: 该论文试图解决视频生成中存在的时序不一致和物理规律违反等伪影问题。解决方案的关键在于利用3D场景来提供对场景实体的精确控制,从而从根本上解决这些问题。论文提出了Scene Copilot框架,该框架结合了大型语言模型(LLMs)与程序化3D场景生成器,具体包括Scene Codex、BlenderGPT和Human in the loop三个组件。Scene Codex负责将用户输入的文本转换为3D场景生成器可理解的命令,BlenderGPT提供用户直观且直接的方式来精确控制生成的3D场景和最终输出视频,用户还可以通过Blender UI获得即时视觉反馈。此外,论文还构建了一个程序化对象数据集以增强系统能力。这些组件协同工作,支持用户生成所需的3D场景和视频。
链接: https://arxiv.org/abs/2411.18644
作者: Zhaofang Qian,Abolfazl Sharifi,Tucker Carroll,Ser-Nam Lim
关键词-EN: achieved impressive quality, impressive quality, physical laws, achieved impressive, suffers from artifacts
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Videos are available at our project page: this https URL
点击查看摘要
Abstract:Video generation has achieved impressive quality, but it still suffers from artifacts such as temporal inconsistency and violation of physical laws. Leveraging 3D scenes can fundamentally resolve these issues by providing precise control over scene entities. To facilitate the easy generation of diverse photorealistic scenes, we propose Scene Copilot, a framework combining large language models (LLMs) with a procedural 3D scene generator. Specifically, Scene Copilot consists of Scene Codex, BlenderGPT, and Human in the loop. Scene Codex is designed to translate textual user input into commands understandable by the 3D scene generator. BlenderGPT provides users with an intuitive and direct way to precisely control the generated 3D scene and the final output video. Furthermore, users can utilize Blender UI to receive instant visual feedback. Additionally, we have curated a procedural dataset of objects in code format to further enhance our system’s capabilities. Each component works seamlessly together to support users in generating desired 3D scenes. Extensive experiments demonstrate the capability of our framework in customizing 3D scenes and video generation.
zh
[CV-164] Volume Rendering of Human Hand Anatomy
【速读】: 该论文试图解决在磁共振成像(MRI)数据集的体积渲染中,如何清晰地展示人手内部复杂解剖结构的问题。解决方案的关键在于设计有效的传递函数(transfer functions),这些函数能够针对手部不同组织(如骨骼、肌肉、肌腱、韧带、皮下脂肪等)进行精细控制,从而在体积渲染过程中突出显示感兴趣的组织,同时保持手部整体视觉上下文的清晰度。论文提出了两种传递函数族,用于强调不同的手部组织,并通过减少标准体积光线投射中的伪影来提高渲染质量。实验结果表明,该方法相较于传统的表面和体积渲染技术,显著提升了手部解剖结构的视觉呈现效果。
链接: https://arxiv.org/abs/2411.18630
作者: Jingtao Huang,Bohan Wang,Zhiyuan Gao,Mianlun Zheng,George Matcuk,Jernej Barbic
关键词-EN: magnetic resonance imaging, volumetric rendering, human hands, hand, resonance imaging
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
点击查看摘要
Abstract:We study the design of transfer functions for volumetric rendering of magnetic resonance imaging (MRI) datasets of human hands. Human hands are anatomically complex, containing various organs within a limited space, which presents challenges for volumetric rendering. We focus on hand musculoskeletal organs because they are volumetrically the largest inside the hand, and most important for the hand’s main function, namely manipulation of objects. While volumetric rendering is a mature field, the choice of the transfer function for the different organs is arguably just as important as the choice of the specific volume rendering algorithm; we demonstrate that it significantly influences the clarity and interpretability of the resulting images. We assume that the hand MRI scans have already been segmented into the different organs (bones, muscles, tendons, ligaments, subcutaneous fat, etc.). Our method uses the hand MRI volume data, and the geometry of its inner organs and their known segmentation, to produce high-quality volume rendering images of the hand, and permits fine control over the appearance of each tissue. We contribute two families of transfer functions to emphasize different hand tissues of interest, while preserving the visual context of the hand. We also discuss and reduce artifacts present in standard volume ray-casting of human hands. We evaluate our volumetric rendering on five challenging hand motion sequences. Our experimental results demonstrate that our method improves hand anatomy visualization, compared to standard surface and volume rendering techniques.
zh
[CV-165] Human Motion Instruction Tuning
【速读】: 该论文试图解决传统指令调优方法在处理非语言输入(如视频或运动序列)时,通过将其转换为语言标记而丢失运动细节的问题。解决方案的关键在于提出了LLaMo(Large Language and Human Motion Assistant),这是一个多模态框架,能够在指令调优过程中保留运动数据的原始形式,从而更好地捕捉和解释复杂的人类行为。通过同时处理视频、运动数据和文本输入,LLaMo提高了模型在运动密集场景中的理解和预测能力,特别是在高复杂度领域如人类行为和专业活动中表现出色。
链接: https://arxiv.org/abs/2411.16805
作者: Lei Li,Sen Jia,Wang Jianhao,Zhongyu Jiang,Feng Zhou,Ju Dai,Tianfang Zhang,Wu Zongkai,Jenq-Neng Hwang
关键词-EN: Human Motion Assistant, Large Language, Motion Assistant, motion instruction tuning, instruction tuning
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model’s ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction. Our code and models are available on the project website: this https URL.
zh
[CV-166] Multimodal Whole Slide Foundation Model for Pathology
【速读】: 该论文试图解决在计算病理学领域中,由于罕见疾病和特定疾病群体的临床数据有限,导致基础模型在处理复杂临床挑战(如患者和切片级别的分析)时表现受限的问题。解决方案的关键在于提出了TITAN,一个多模态全切片基础模型,通过视觉自监督学习(SSL)和视觉-语言对齐,预训练了335,645张全切片图像,并结合了423,122条由多模态生成式AI(Generative AI)生成的合成病理报告。TITAN无需微调或临床标签,能够提取通用切片表示并生成病理报告,适用于资源有限的临床场景,如罕见疾病检索和癌症预后。
链接: https://arxiv.org/abs/2411.19666
作者: Tong Ding,Sophia J. Wagner,Andrew H. Song,Richard J. Chen,Ming Y. Lu,Andrew Zhang,Anurag J. Vaidya,Guillaume Jaume,Muhammad Shaban,Ahrong Kim,Drew F.K. Williamson,Bowen Chen,Cristina Almagro-Perez,Paul Doucet,Sharifa Sahai,Chengkuan Chen,Daisuke Komura,Akihiro Kawabe,Shumpei Ishikawa,Georg Gerber,Tingying Peng,Long Phi Le,Faisal Mahmood
关键词-EN: transferable feature representations, encode histopathology, field of computational, transformed with recent, recent advances
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
备注: The code is accessible at this https URL
点击查看摘要
Abstract:The field of computational pathology has been transformed with recent advances in foundation models that encode histopathology region-of-interests (ROIs) into versatile and transferable feature representations via self-supervised learning (SSL). However, translating these advancements to address complex clinical challenges at the patient and slide level remains constrained by limited clinical data in disease-specific cohorts, especially for rare clinical conditions. We propose TITAN, a multimodal whole slide foundation model pretrained using 335,645 WSIs via visual self-supervised learning and vision-language alignment with corresponding pathology reports and 423,122 synthetic captions generated from a multimodal generative AI copilot for pathology. Without any finetuning or requiring clinical labels, TITAN can extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis. We evaluate TITAN on diverse clinical tasks and find that TITAN outperforms both ROI and slide foundation models across machine learning settings such as linear probing, few-shot and zero-shot classification, rare cancer retrieval and cross-modal retrieval, and pathology report generation.
zh
[CV-167] Self-Supervised Denoiser Framework
【速读】: 该论文试图解决在工业计算机断层扫描(Computed Tomography, CT)中因数据欠采样导致的图像质量下降问题。解决方案的关键在于提出了自监督去噪框架(Self-supervised Denoiser Framework, SDF),该框架通过在高质量采样的正弦图(sinogram)数据上进行预训练,来提升从欠采样正弦图数据重建的图像质量。SDF的核心创新在于在正弦图空间中训练图像去噪器,通过预测一个正弦图子集来实现自监督学习,从而无需真实图像数据,充分利用CT中丰富的正弦图数据,显著提升从部分测量数据重建的图像质量。实验结果表明,SDF在2D扇形束和3D锥形束CT设置中均优于其他分析和自监督框架,并且在少量高质量图像数据上进行微调后,其增强效果依然显著,使其成为CT中基础图像增强模型的有力候选。
链接: https://arxiv.org/abs/2411.19593
作者: Emilien Valat,Andreas Hauptmann,Ozan Öktem
关键词-EN: Computed Tomography, Reconstructing images, industrial context leads, leads to specific, specific challenges
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Reconstructing images using Computed Tomography (CT) in an industrial context leads to specific challenges that differ from those encountered in other areas, such as clinical CT. Indeed, non-destructive testing with industrial CT will often involve scanning multiple similar objects while maintaining high throughput, requiring short scanning times, which is not a relevant concern in clinical CT. Under-sampling the tomographic data (sinograms) is a natural way to reduce the scanning time at the cost of image quality since the latter depends on the number of measurements. In such a scenario, post-processing techniques are required to compensate for the image artifacts induced by the sinogram sparsity. We introduce the Self-supervised Denoiser Framework (SDF), a self-supervised training method that leverages pre-training on highly sampled sinogram data to enhance the quality of images reconstructed from undersampled sinogram data. The main contribution of SDF is that it proposes to train an image denoiser in the sinogram space by setting the learning task as the prediction of one sinogram subset from another. As such, it does not require ground-truth image data, leverages the abundant data modality in CT, the sinogram, and can drastically enhance the quality of images reconstructed from a fraction of the measurements. We demonstrate that SDF produces better image quality, in terms of peak signal-to-noise ratio, than other analytical and self-supervised frameworks in both 2D fan-beam or 3D cone-beam CT settings. Moreover, we show that the enhancement provided by SDF carries over when fine-tuning the image denoiser on a few examples, making it a suitable pre-training technique in a context where there is little high-quality image data. Our results are established on experimental datasets, making SDF a strong candidate for being the building block of foundational image-enhancement models in CT.
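SDF 的核心是在正弦图空间里"用一个投影子集预测另一个子集",从而完全不需要无噪真值图像。下面是一个高度简化的自监督训练步骤草图(PyTorch);网络结构、角度划分方式和数据形状均为演示用假设,并非论文实现。

```python
import torch
import torch.nn as nn

class TinySinogramNet(nn.Module):
    """Placeholder denoiser operating in sinogram space (architecture is assumed)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

def sdf_step(model, sinogram, optimizer):
    """One self-supervised step: predict the odd-angle subset from the even-angle subset."""
    even = sinogram[:, :, 0::2, :]   # (B, 1, A/2, D): projections at even angles
    odd = sinogram[:, :, 1::2, :]    # target subset at odd angles
    pred = model(even)
    loss = nn.functional.mse_loss(pred, odd)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinySinogramNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
fake_sino = torch.randn(2, 1, 64, 128)   # (batch, channel, angles, detector bins)
print(sdf_step(model, fake_sino, opt))
```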
zh
[CV-168] A Comprehensive Framework for Automated Segmentation of Perivascular Spaces in Brain MRI with the nnU-Net
【速读】: 该论文试图解决脑部小血管疾病、阿尔茨海默病和帕金森病等神经退行性疾病中常见的血管周围间隙(Perivascular Spaces, PVS)扩大的检测问题。解决方案的关键在于优化一个广泛使用的深度学习模型——无新UNet(no-new-UNet, nnU-Net),以实现PVS的自动分割。通过在30名健康参与者中使用三种不同MRI扫描协议和三种扫描仪获取的T1加权MRI图像,研究人员采用稀疏标注策略手动分割PVS,并比较了11种不同模型在图像处理、预处理和半监督学习策略下的性能。结果显示,体素间距无关模型(DSC=64.3%)优于重采样模型(DSC=40.5-55%),并通过迭代标签清理和半监督学习(使用伪标签)显著提高了模型性能(DSC=85.7%)。此外,该模型还被扩展用于中脑和海马体的PVS分割。最终,该深度学习模型提供了一个稳健且全面的框架,用于脑部MRI中PVS的自动量化。
链接: https://arxiv.org/abs/2411.19564
作者: William Pham,Alexander Jarema,Donggyu Rim,Zhibin Chen,Mohamed S. H. Khlif,Vaughan G. Macefield,Luke A. Henderson,Amy Brodtmann
关键词-EN: small vessel disease, Alzheimer disease, Parkinson disease, neurodegenerative disorders including, disorders including cerebral
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 46 pages, 8 figures, 2 tables
点击查看摘要
Abstract:Background: Enlargement of perivascular spaces (PVS) is common in neurodegenerative disorders including cerebral small vessel disease, Alzheimer’s disease, and Parkinson’s disease. PVS enlargement may indicate impaired clearance pathways and there is a need for reliable PVS detection methods which are currently lacking. Aim: To optimise a widely used deep learning model, the no-new-UNet (nnU-Net), for PVS segmentation. Methods: In 30 healthy participants (mean ± SD age: 50 ± 18.9 years; 13 females), T1-weighted MRI images were acquired using three different protocols on three MRI scanners (3T Siemens Tim Trio, 3T Philips Achieva, and 7T Siemens Magnetom). PVS were manually segmented across ten axial slices in each participant. Segmentations were completed using a sparse annotation strategy. In total, 11 models were compared using various strategies for image handling, preprocessing and semi-supervised learning with pseudo-labels. Model performance was evaluated using 5-fold cross validation (5FCV). The main performance metric was the Dice Similarity Coefficient (DSC). Results: The voxel-spacing agnostic model (mean ± SD DSC=64.3 ± 3.3%) outperformed models which resampled images to a common resolution (DSC=40.5-55%). Model performance improved substantially following iterative label cleaning (DSC=85.7 ± 1.2%). Semi-supervised learning with pseudo-labels (n=12,740) from 18 additional datasets improved the agreement between raw and predicted PVS cluster counts (Lin’s concordance correlation coefficient=0.89, 95%CI=0.82-0.94). We extended the model to enable PVS segmentation in the midbrain (DSC=64.3 ± 6.5%) and hippocampus (DSC=67.8 ± 5%). Conclusions: Our deep learning models provide a robust and holistic framework for the automated quantification of PVS in brain MRI.
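文中的主要评估指标是 Dice 相似系数(DSC)。下面给出其常见的二值掩码计算示例(NumPy),仅为通用写法,并非 nnU-Net 内部实现。

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice Similarity Coefficient for binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# toy example with two 3D masks
a = np.zeros((4, 4, 4), dtype=np.uint8); a[1:3, 1:3, 1:3] = 1
b = np.zeros((4, 4, 4), dtype=np.uint8); b[1:3, 1:3, 1:4] = 1
print(round(dice_coefficient(a, b), 3))  # 2*8 / (8+12) = 0.8
```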
zh
[CV-169] Contextual Checkerboard Denoise – A Novel Neural Network-Based Approach for Classification-Aware OCT Image Denoising
【速读】: 该论文试图解决医学图像去噪中常见的两个问题:一是传统去噪方法在提高图像清晰度的同时,可能会改变图像的关键信息,从而影响分类性能和诊断质量;二是监督式去噪方法在医学图像领域不实用,因为难以获得噪声图像的真实无噪声版本。论文提出了一种基于神经网络的新方法——上下文棋盘去噪 (Contextual Checkerboard Denoising),该方法能够仅通过噪声图像数据集进行学习,同时保留对图像分类和分析至关重要的解剖细节。实验结果表明,该方法显著提高了图像质量,生成更清晰和详细的OCT图像,并增强了诊断准确性。解决方案的关键在于利用上下文信息和棋盘状的特征提取策略,确保在去噪过程中不引入新的伪影,同时保留图像的关键特征。
链接: https://arxiv.org/abs/2411.19549
作者: Md. Touhidul Islam,Md. Abtahi M. Chowdhury,Sumaiya Salekin,Aye T. Maung,Akil A. Taki,Hafiz Imtiaz
关键词-EN: denoising warrants preservation, non-medical image denoising, image denoising warrants, Contextual Checkerboard Denoising, primary goal
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under review in Springer Journal of Medical Systems. Code available: this https URL
点击查看摘要
Abstract:In contrast to non-medical image denoising, where enhancing image clarity is the primary goal, medical image denoising warrants preservation of crucial features without introduction of new artifacts. However, many denoising methods that improve the clarity of the image inadvertently alter critical information of the denoised images, potentially compromising classification performance and diagnostic quality. Additionally, supervised denoising methods are not very practical in the medical image domain, since a ground truth denoised version of a noisy medical image is often extremely challenging to obtain. In this paper, we tackle both of these problems by introducing a novel neural network based method – Contextual Checkerboard Denoising – that can learn denoising from only a dataset of noisy images, while preserving crucial anatomical details necessary for image classification/analysis. We perform our experimentation on real Optical Coherence Tomography (OCT) images, and empirically demonstrate that our proposed method significantly improves image quality, providing clearer and more detailed OCT images, while enhancing diagnostic accuracy.
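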
zh
[CV-170] Enhancing AI microscopy for foodborne bacterial classification via adversarial domain adaptation across optical and biological variability
【速读】: 该论文试图解决传统基于培养的食源性细菌检测方法所需的长时培养和复杂样品制备问题。解决方案的关键在于利用域对抗神经网络 (Domain-Adversarial Neural Networks, DANNs) 和多域对抗神经网络 (Multi-DANNs, MDANNs) 进行域适应,以提高AI驱动的显微镜技术在细菌分类中的泛化能力。通过在不同显微镜模式(相衬、明场)、放大倍数(60x、20x)和培养时间(3小时、5小时)下收集的数据进行训练和评估,DANNs和MDANNs显著提升了目标域的分类准确性,同时保持了源域的性能。该方法减少了对复杂样品制备的依赖,为在资源有限的环境中进行快速细菌检测提供了可扩展和适应性强的框架。
链接: https://arxiv.org/abs/2411.19514
作者: Siddhartha Bhattacharya,Aarham Wasit,Mason Earles,Nitin Nitin,Luyao Ma,Jiyoon Yi
关键词-EN: traditional culture-based methods, culture-based methods require, Rapid detection, methods require extended, safety and quality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Rapid detection of foodborne bacteria is critical for food safety and quality, yet traditional culture-based methods require extended incubation and specialized sample preparation. This study addresses these challenges by i) enhancing the generalizability of AI-enabled microscopy for bacterial classification using adversarial domain adaptation and ii) comparing the performance of single-target and multi-domain adaptation. Three Gram-positive (Bacillus coagulans, Bacillus subtilis, Listeria innocua) and three Gram-negative (E. coli, Salmonella Enteritidis, Salmonella Typhimurium) strains were classified. EfficientNetV2 served as the backbone architecture, leveraging fine-grained feature extraction for small targets. Few-shot learning enabled scalability, with domain-adversarial neural networks (DANNs) addressing single domains and multi-DANNs (MDANNs) generalizing across all target domains. The model was trained on source domain data collected under controlled conditions (phase contrast microscopy, 60x magnification, 3-h bacterial incubation) and evaluated on target domains with variations in microscopy modality (brightfield, BF), magnification (20x), and extended incubation to compensate for lower resolution (20x-5h). DANNs improved target domain classification accuracy by up to 54.45% (20x), 43.44% (20x-5h), and 31.67% (BF), with minimal source domain degradation (4.44%). MDANNs achieved superior performance in the BF domain and substantial gains in the 20x domain. Grad-CAM and t-SNE visualizations validated the model’s ability to learn domain-invariant features across diverse conditions. This study presents a scalable and adaptable framework for bacterial classification, reducing reliance on extensive sample preparation and enabling application in decentralized and resource-limited environments.
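DANN 的核心机制是梯度反转层(gradient reversal layer, GRL):特征提取器在最小化分类损失的同时最大化域判别器损失,从而学到域不变特征。下面是 GRL 的一个常见最小实现(PyTorch),骨干网络、类别数与域标签仅作示意,并非论文的 EfficientNetV2 完整流程。

```python
import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64), nn.ReLU())
class_head = nn.Linear(64, 6)    # 6 bacterial classes
domain_head = nn.Linear(64, 2)   # source vs. target microscopy domain

x = torch.randn(8, 1, 32, 32)          # toy micrographs
y_class = torch.randint(0, 6, (8,))
y_domain = torch.randint(0, 2, (8,))

feat = feature_extractor(x)
loss = nn.functional.cross_entropy(class_head(feat), y_class) \
     + nn.functional.cross_entropy(domain_head(grad_reverse(feat, 0.5)), y_domain)
loss.backward()
print(loss.item())
```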
zh
[CV-171] Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB
【速读】: 该论文试图解决基于RGB的重建方法在低纹理、低光照和低反射率场景中表现不佳的问题。解决方案的关键在于引入一种新型的“模糊”LiDAR(diffuse LiDAR),其通过发射漫射闪光来显著提高场景覆盖率,但同时也引入了空间模糊性。为了处理这种模糊性,论文提出了一种结合漫射LiDAR与RGB数据的策略,利用高斯面元(Gaussian surfel)渲染框架和场景自适应损失函数,动态平衡RGB和漫射LiDAR信号,从而在挑战性环境中实现鲁棒的3D扫描和精确的颜色与几何估计。
链接: https://arxiv.org/abs/2411.19474
作者: Nikhil Behari,Aaron Young,Siddharth Somasundaram,Tzofi Klinghoffer,Akshat Dave,Ramesh Raskar
关键词-EN: virtual reality, essential across applications, applications of virtual, surface reconstruction, LiDAR
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:3D surface reconstruction is essential across applications of virtual reality, robotics, and mobile scanning. However, RGB-based reconstruction often fails in low-texture, low-light, and low-albedo scenes. Handheld LiDARs, now common on mobile devices, aim to address these challenges by capturing depth information from time-of-flight measurements of a coarse grid of projected dots. Yet, these sparse LiDARs struggle with scene coverage on limited input views, leaving large gaps in depth information. In this work, we propose using an alternative class of “blurred” LiDAR that emits a diffuse flash, greatly improving scene coverage but introducing spatial ambiguity from mixed time-of-flight measurements across a wide field of view. To handle these ambiguities, we propose leveraging the complementary strengths of diffuse LiDAR with RGB. We introduce a Gaussian surfel-based rendering framework with a scene-adaptive loss function that dynamically balances RGB and diffuse LiDAR signals. We demonstrate that, surprisingly, diffuse LiDAR can outperform traditional sparse LiDAR, enabling robust 3D scanning with accurate color and geometry estimation in challenging environments.
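论文的关键是在渲染损失中动态平衡 RGB 与漫射 LiDAR 两种监督信号。下面是一个高度简化的"场景自适应加权损失"草图(PyTorch);其中的权重启发式(按纹理强弱分配权重)与 ToF 监督形式均为演示用假设,不代表论文的高斯面元渲染实现。

```python
import torch
import torch.nn.functional as F

def scene_adaptive_loss(pred_rgb, gt_rgb, pred_tof, gt_tof):
    """Blend RGB and diffuse-LiDAR (ToF) losses with a per-scene adaptive weight.

    Heuristic used here (an assumption): when the RGB image has little texture,
    rely more on the LiDAR term, and vice versa.
    """
    # texture proxy: mean local gradient magnitude of the ground-truth RGB
    dx = (gt_rgb[..., 1:, :] - gt_rgb[..., :-1, :]).abs().mean()
    dy = (gt_rgb[..., :, 1:] - gt_rgb[..., :, :-1]).abs().mean()
    texture = (dx + dy).clamp(0, 1)

    w_rgb = texture                      # low-texture scene -> small RGB weight
    w_tof = 1.0 - texture
    loss_rgb = F.l1_loss(pred_rgb, gt_rgb)
    loss_tof = F.l1_loss(pred_tof, gt_tof)
    return w_rgb * loss_rgb + w_tof * loss_tof

pred_rgb = torch.rand(1, 3, 64, 64, requires_grad=True)
gt_rgb = torch.rand(1, 3, 64, 64)
pred_tof = torch.rand(1, 1, 64, 64, requires_grad=True)
gt_tof = torch.rand(1, 1, 64, 64)
print(scene_adaptive_loss(pred_rgb, gt_rgb, pred_tof, gt_tof).item())
```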
zh
[CV-172] MCUCoder: Adaptive Bitrate Learned Video Compression for IoT Devices
【速读】: 该论文试图解决在资源受限的物联网(IoT)设备上进行高效视频压缩的问题。解决方案的关键在于引入了一种名为MCUCoder的开源自适应比特率视频压缩模型。MCUCoder的核心特点是其超轻量级编码器,仅包含10.5K参数和350KB的内存占用,非常适合边缘设备和微控制器单元(MCU)。尽管MCUCoder在能耗上与M-JPEG相当,但在MCL-JCV和UVG数据集上,其比特率分别降低了55.65%和55.59%,以MS-SSIM为衡量标准。此外,MCUCoder通过生成按重要性排序的潜在表示,支持自适应比特率流媒体传输,确保在低资源设备上网络条件波动时也能实现流畅的实时视频传输。
链接: https://arxiv.org/abs/2411.19442
作者: Ali Hojjat,Janek Haberer,Olaf Landsiedel
关键词-EN: unstable internet connections, RAM and unstable, face hardware constraints, internet connections, efficient video compression
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The rapid growth of camera-based IoT devices demands efficient video compression, particularly for edge applications where devices face hardware constraints, often with only 1 or 2 MB of RAM and unstable internet connections. Traditional and deep video compression methods are designed for high-end hardware, exceeding the capabilities of these constrained devices. Consequently, video compression in these scenarios is often limited to M-JPEG due to its high hardware efficiency and low complexity. This paper introduces MCUCoder, an open-source adaptive bitrate video compression model tailored for resource-limited IoT settings. MCUCoder features an ultra-lightweight encoder with only 10.5K parameters and a minimal 350KB memory footprint, making it well-suited for edge devices and MCUs. While MCUCoder uses a similar amount of energy as M-JPEG, it reduces bitrate by 55.65% on the MCL-JCV dataset and 55.59% on the UVG dataset, measured in MS-SSIM. Moreover, MCUCoder supports adaptive bitrate streaming by generating a latent representation that is sorted by importance, allowing transmission based on available bandwidth. This ensures smooth real-time video transmission even under fluctuating network conditions on low-resource devices. Source code available at this https URL.
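MCUCoder 的自适应码率思路可以概括为:潜在表示按重要性排序,带宽不足时只发送最重要的前若干个通道。下面用一个极简示例(NumPy)说明这种"渐进式截断"的传输逻辑;编码器、通道数与重要性排序方式均为演示用假设,并非论文实现。

```python
import numpy as np

def encode(frame, num_channels=8):
    """Stand-in encoder: produce `num_channels` latent channels ordered by importance.
    'Importance' is faked with decreasing energy; a real learned encoder is assumed."""
    h, w = frame.shape
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((num_channels, h // 4, w // 4))
    scales = np.array([1.0 / (i + 1) for i in range(num_channels)])  # channel 0 matters most
    return latent * scales[:, None, None]

def transmit(latent, bandwidth_fraction):
    """Send only the first k channels that fit the currently available bandwidth."""
    k = max(1, int(round(bandwidth_fraction * latent.shape[0])))
    return latent[:k]

latent = encode(np.zeros((64, 64)))
for frac in (1.0, 0.5, 0.25):
    sent = transmit(latent, frac)
    print(f"bandwidth={frac:.2f} -> channels sent: {sent.shape[0]}")
```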
zh
[CV-173] 3D Wasserstein generative adversarial network with dense U-Net based discriminator for preclinical fMRI denoising
【速读】: 该论文试图解决在预临床功能磁共振成像(fMRI)数据中由于生理过程、硬件和外部噪声导致的固有噪声问题。解决方案的关键在于提出了一种基于3D Wasserstein生成对抗网络(GAN)和3D密集U-Net鉴别器的结构保持算法,称为3D U-WGAN。该方法通过4D数据配置有效去噪时间和空间信息,并利用3D密集U-Net鉴别器学习全局和局部特征差异,以避免过度平滑。此外,引入对抗损失和特征空间距离测量来增强感知相似性,从而在提高信噪比的同时,避免引入过多的结构变化。实验结果表明,该方法在静息态和任务态预临床fMRI数据中显著提升了图像质量,优于现有的最先进方法。
链接: https://arxiv.org/abs/2411.19345
作者: Sima Soltanpour,Arnold Chang,Dan Madularu,Praveen Kulkarni,Craig Ferris,Chris Joslin
关键词-EN: Functional magnetic resonance, magnetic resonance imaging, Functional magnetic, study brain function, inherently noisy due
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Functional magnetic resonance imaging (fMRI) is extensively used in clinical and preclinical settings to study brain function; however, fMRI data is inherently noisy due to physiological processes, hardware, and external noise. Denoising is one of the main preprocessing steps in any fMRI analysis pipeline. This process is challenging in preclinical data in comparison to clinical data due to variations in brain geometry, image resolution, and low signal-to-noise ratios. In this paper, we propose a structure-preserved algorithm based on a 3D Wasserstein generative adversarial network with a 3D dense U-net based discriminator, called 3D U-WGAN. We apply a 4D data configuration to effectively denoise temporal and spatial information in analyzing preclinical fMRI data. GAN-based denoising methods often utilize a discriminator to identify significant differences between denoised and noise-free images, focusing on global or local features. To refine the fMRI denoising model, our method employs a 3D dense U-Net discriminator to learn both global and local distinctions. To tackle potential over-smoothing, we introduce an adversarial loss and enhance perceptual similarity by measuring feature space distances. Experiments illustrate that 3D U-WGAN significantly improves image quality in resting-state and task preclinical fMRI data, enhancing signal-to-noise ratio without introducing the excessive structural changes seen in existing methods. The proposed method outperforms state-of-the-art methods when applied to simulated and real data in an fMRI analysis pipeline.
zh
[CV-174] Generalized Gaussian Model for Learned Image Compression
【速读】: 该论文试图解决在图像压缩中,如何更灵活地建模潜在变量分布以提高压缩性能的问题。解决方案的关键在于将传统的正态分布模型扩展为广义正态分布模型(generalized Gaussian model),并引入一个额外的形状参数(beta),从而在保持模型复杂度可控的同时,更精确地拟合潜在变量的分布。此外,论文还提出了改进的训练方法,包括依赖于beta的下界约束和梯度修正,以缓解训练与测试之间的不匹配问题,从而进一步提升广义正态分布模型的性能。实验结果表明,该方法在多种图像压缩方法中均优于传统的正态分布模型和高斯混合模型。
链接: https://arxiv.org/abs/2411.19320
作者: Haotian Zhang,Li Li,Dong Liu
关键词-EN: Gaussian model, generalized Gaussian model, Gaussian mixture models, Gaussian, generalized Gaussian
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 12 figures
点击查看摘要
Abstract:In learned image compression, probabilistic models play an essential role in characterizing the distribution of latent variables. The Gaussian model with mean and scale parameters has been widely used for its simplicity and effectiveness. Probabilistic models with more parameters, such as the Gaussian mixture models, can fit the distribution of latent variables more precisely, but the corresponding complexity will also be higher. To balance between compression performance and complexity, we extend the Gaussian model to the generalized Gaussian model for more flexible latent distribution modeling, introducing only one additional shape parameter, beta, than the Gaussian model. To enhance the performance of the generalized Gaussian model by alleviating the train-test mismatch, we propose improved training methods, including beta-dependent lower bounds for scale parameters and gradient rectification. Our proposed generalized Gaussian model, coupled with the improved training methods, is demonstrated to outperform the Gaussian and Gaussian mixture models on a variety of learned image compression methods.
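广义高斯分布只比高斯多一个形状参数 beta(beta=2 对应高斯、beta=1 对应拉普拉斯)。下面给出其概率密度函数的标准写法(NumPy/SciPy),用于直观理解论文中潜变量熵模型所建模的分布族;这只是分布本身的定义,并非论文的熵编码实现。

```python
import numpy as np
from scipy.special import gamma

def generalized_gaussian_pdf(x, mu=0.0, alpha=1.0, beta=2.0):
    """Generalized Gaussian density with location mu, scale alpha, shape beta.
    beta=2 recovers the Gaussian (up to reparameterization); beta=1 gives the Laplacian."""
    coef = beta / (2.0 * alpha * gamma(1.0 / beta))
    return coef * np.exp(-(np.abs(x - mu) / alpha) ** beta)

x = np.linspace(-4, 4, 5)
for beta in (1.0, 1.5, 2.0):
    print(beta, np.round(generalized_gaussian_pdf(x, beta=beta), 4))
```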
zh
[CV-175] Skeleton Detection Using Dual Radars with Integration of Dual-View CNN Models and mmPose
【速读】: 该论文试图解决在老年人实时跌倒检测中,利用毫米波雷达(mmWave radar)采集的点云数据进行骨骼检测的问题。解决方案的关键在于:1) 通过融合PointNet和mmPose模型,处理点云数据的旋转不变性(rotation invariance)、平移不变性(translation invariance)和局部性(locality);2) 通过整合来自两个雷达的数据,弥补单雷达点云数据点数不足的问题;3) 利用点云数据的坐标、速度和信噪比(SNR)等特征,减少数据稀疏性并降低计算负荷。研究提出了三种结合PointNet和mmPose的Dual View CNN模型,并通过均方绝对误差(MAE)进行性能比较,结果显示该模型在手臂摆动检测中表现优异,但在随机行走检测中效果欠佳。
链接: https://arxiv.org/abs/2411.19251
作者: Masaharu Kodama(Department of Computer and Information Sciences, Hosei University),Runhe Huang(Hosei University)
关键词-EN: Skeleton detection, variety of situations, Skeleton, point cloud data, point cloud
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper was presented at the 16th International Conference on Advanced Applied Informatics (IIAI AAI 2024)
点击查看摘要
Abstract:Skeleton detection is a technique that can be applied to a variety of situations. It is especially critical for identifying and tracking the movements of the elderly, especially in real-time fall detection. While conventional image processing methods exist, there’s a growing preference for utilizing point cloud data collected by mmWave radars from the viewpoint of privacy protection, offering a non-intrusive approach to elevate safety and care for the elderly. Dealing with point cloud data necessitates addressing three critical considerations. Firstly, the inherent nature of point clouds – rotation invariance, translation invariance, and locality – is managed through the fusion of PointNet and mmPose. PointNet ensures rotational and translational invariance, while mmPose addresses locality. Secondly, the limited points per frame from radar require data integration from two radars to enhance skeletal detection. Lastly, inputting point cloud data into the learning model involves utilizing features like coordinates, velocity, and signal-to-noise ratio (SNR) per radar point to mitigate sparsity issues and reduce computational load. This research proposes three Dual View CNN models, combining PointNet and mmPose, employing two mmWave radars, with performance comparisons in terms of Mean Absolute Error (MAE). While the proposed model shows suboptimal results for random walking, it excels in the arm swing case.
zh
[CV-176] Voxel-based Differentiable X-ray Rendering Improves Self-Supervised 3D CBCT Reconstruction
【速读】: 该论文试图解决在锥束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)重建中,如何通过自监督学习框架提高重建质量和减少所需X射线数量的问题。解决方案的关键在于直接优化体素网格(voxelgrid)表示,并采用基于物理的微分X射线渲染技术。具体来说,论文提出了一种基于Beer-Lambert定律的精确离散化方法来模拟X射线衰减,这种方法在结合正则化的体素网格学习框架后,显著优于传统的迭代CBCT重建算法,特别是在输入视图较少的情况下。通过这种方法,论文成功实现了从更少的X射线中重建高质量的3D CBCT体积,从而潜在地减少了电离辐射的暴露。
链接: https://arxiv.org/abs/2411.19224
作者: Mohammadhossein Momeni,Vivek Gopalakrishnan,Neel Dey,Polina Golland,Sarah Frisken
关键词-EN: Cone-Beam Computed Tomography, Computed Tomography, differentiable X-ray rendering, physics-based differentiable X-ray, Cone-Beam Computed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present a self-supervised framework for Cone-Beam Computed Tomography (CBCT) reconstruction by directly optimizing a voxelgrid representation using physics-based differentiable X-ray rendering. Further, we investigate how the different formulations of X-ray image formation physics in the renderer affect the quality of 3D reconstruction and novel view synthesis. When combined with our regularized voxelgrid-based learning framework, we find that using an exact discretization of the Beer-Lambert law for X-ray attenuation in the renderer outperforms widely used iterative CBCT reconstruction algorithms, particularly when given only a few input views. As a result, we reconstruct high-fidelity 3D CBCT volumes from fewer X-rays, potentially reducing ionizing radiation exposure.
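论文强调在可微 X 射线渲染器中对 Beer-Lambert 定律做精确离散化是关键。下面用 PyTorch 给出沿射线离散化衰减的最小示例:I = I0 · exp(-Σ μᵢ · Δs),并演示该式如何对体素衰减系数保持可微;采样方式与优化流程为简化假设,仅示意物理模型,并非论文的完整重建框架。

```python
import torch

def beer_lambert_intensity(mu_samples, step_size, i0=1.0):
    """Discrete Beer-Lambert law along each ray.

    mu_samples: (num_rays, num_samples) attenuation coefficients sampled along rays.
    Returns the detected intensity per ray; fully differentiable w.r.t. mu_samples.
    """
    optical_depth = (mu_samples * step_size).sum(dim=1)   # Σ μ_i · Δs
    return i0 * torch.exp(-optical_depth)

# toy voxel optimization: push rendered intensities toward measured values
mu = torch.full((4, 16), 0.05, requires_grad=True)        # 4 rays, 16 samples each
measured = torch.tensor([0.6, 0.5, 0.4, 0.3])
opt = torch.optim.Adam([mu], lr=1e-2)
for _ in range(100):
    loss = torch.nn.functional.mse_loss(beer_lambert_intensity(mu, step_size=1.0), measured)
    opt.zero_grad(); loss.backward(); opt.step()
print(beer_lambert_intensity(mu.detach(), 1.0))
```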
zh
[CV-177] Bayesian Deconvolution of Astronomical Images with Diffusion Models: Quantifying Prior-Driven Features in Reconstructions NEURIPS2024
【速读】: 该论文试图解决天文图像的去卷积问题,以恢复天体的固有属性,特别是在地面观测条件下。解决方案的关键在于使用扩散模型 (Diffusion Models, DMs) 和扩散后验采样算法 (Diffusion Posterior Sampling, DPS) 来处理这一逆问题任务。通过在贝叶斯框架下应用基于高分辨率宇宙学模拟训练的评分扩散模型,计算给定观测数据的后验分布。该方法考虑了红移和像素尺度作为逆问题的参数,使其能够灵活适应任何数据集。论文通过在Hyper Supreme Camera (HSC) 数据上测试模型,展示了其能够达到与哈勃太空望远镜 (Hubble Space Telescope, HST) 图像相媲美的分辨率。此外,论文还量化了重建结果的不确定性,并提出了一种识别重建图像中先验驱动特征的度量方法,这对于科学应用具有重要意义。
链接: https://arxiv.org/abs/2411.19158
作者: Alessio Spagnoletti,Alexandre Boucaud,Marc Huertas-Company,Wassim Kabalan,Biswajit Biswas
关键词-EN: Deconvolution of astronomical, Diffusion Posterior Sampling, celestial objects, aspect of recovering, recovering the intrinsic
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
备注: 5+5 pages, 16 figures, Machine Learning and the Physical Sciences Workshop, NeurIPS 2024
点击查看摘要
Abstract:Deconvolution of astronomical images is a key aspect of recovering the intrinsic properties of celestial objects, especially when considering ground-based observations. This paper explores the use of diffusion models (DMs) and the Diffusion Posterior Sampling (DPS) algorithm to solve this inverse problem task. We apply score-based DMs trained on high-resolution cosmological simulations, through a Bayesian setting to compute a posterior distribution given the observations available. By considering the redshift and the pixel scale as parameters of our inverse problem, the tool can be easily adapted to any dataset. We test our model on Hyper Supreme Camera (HSC) data and show that we reach resolutions comparable to those obtained by Hubble Space Telescope (HST) images. Most importantly, we quantify the uncertainty of reconstructions and propose a metric to identify prior-driven features in the reconstructed images, which is key in view of applying these methods for scientific purposes.
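扩散后验采样(DPS)的核心是在每步反向扩散更新中,额外沿数据一致性项 ∇ₓ‖y - A(x̂₀)‖ 的方向做引导。下面给出这一引导步骤的示意性伪实现(PyTorch);其中的去噪网络、退化算子 A 与引导系数均为占位假设,也省略了真实采样器中的噪声调度,不代表论文所用的天文图像模型。

```python
import torch

def dps_guided_step(x_t, t, score_model, forward_operator, y, guidance_scale=1.0):
    """One illustrative DPS-style update: denoise, then nudge x_t toward data consistency.

    score_model(x_t, t) is assumed to return an estimate of the clean image x0_hat.
    forward_operator(x) stands in for the telescope PSF / degradation (placeholder here).
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = score_model(x_t, t)                      # denoiser-based estimate of x0
    residual = torch.norm(y - forward_operator(x0_hat))
    grad = torch.autograd.grad(residual, x_t)[0]      # ∇_{x_t} ||y - A(x0_hat)||
    # a real sampler would also apply the reverse-diffusion mean/variance update here
    return (x_t - guidance_scale * grad).detach()

# toy placeholders
score_model = lambda x, t: x * 0.9                    # pretend denoiser
blur = torch.nn.AvgPool2d(2)                          # pretend degradation A
y = torch.rand(1, 1, 16, 16)
x_t = torch.rand(1, 1, 32, 32)
print(dps_guided_step(x_t, t=10, score_model=score_model, forward_operator=blur, y=y).shape)
```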
zh
[CV-178] FAN-Unet: Enhancing Unet with vision Fourier Analysis Block for Biomedical Image Segmentation
【速读】: 该论文试图解决医学图像分割中卷积神经网络 (CNN) 难以捕捉图像长程依赖关系的问题。解决方案的关键在于提出了一种名为 FAN-UNet 的新型架构,该架构结合了基于傅里叶分析网络 (Fourier Analysis Network, FAN) 的视觉骨干网络和 U-Net 架构的优势。具体来说,FAN-UNet 通过引入 Vision-FAN 层,将 FAN 层与自注意力机制相结合,利用傅里叶分析来有效捕捉图像中的长程依赖关系和周期性特征。这种设计在保持模型复杂度可控的同时,显著提升了医学图像分割任务的性能。
链接: https://arxiv.org/abs/2411.18975
作者: Jiashu Xu
关键词-EN: Convolutional Neural Networks, modern medical research, Fourier Analysis Network, Medical image segmentation, clinical practice
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2410.02523
点击查看摘要
Abstract:Medical image segmentation is a critical aspect of modern medical research and clinical practice. Despite the remarkable performance of Convolutional Neural Networks (CNNs) in this domain, they inherently struggle to capture long-range dependencies within images. Transformers, on the other hand, are naturally adept at modeling global context but often face challenges in capturing local features effectively. Therefore, we present FAN-UNet, a novel architecture that combines the strengths of Fourier Analysis Network (FAN)-based vision backbones and the U-Net architecture, effectively addressing the challenges of long-range dependency and periodicity modeling in biomedical image segmentation tasks. The proposed Vision-FAN layer integrates the FAN layer and self-attention mechanisms, leveraging Fourier analysis to enable the model to effectively capture both long-range dependencies and periodic relationships. Extensive experiments on various medical imaging datasets demonstrate that FAN-UNet achieves a favorable balance between model complexity and performance, validating its effectiveness and practicality for medical image segmentation tasks.
zh
[CV-179] FiRe: Fixed-points of Restoration Priors for Solving Inverse Problems
【速读】: 该论文试图解决在成像逆问题中选择合适的先验(prior)以补偿测量算子导致的信息损失这一基本挑战。解决方案的关键在于引入固定点恢复(Fixed-points of Restoration, FiRe)先验作为扩展Plug-and-Play (PnP)算法中先验概念的新框架,使其超越传统的去噪模型,适用于更广泛的恢复模型。FiRe的核心洞察在于自然图像作为退化算子与相应恢复模型复合操作的固定点出现,从而通过量化图像在这种复合操作下的不变性来推导出隐式先验的显式公式。这一固定点视角不仅展示了各种恢复网络如何有效地作为先验用于解决逆问题,还支持多模型组合和采集信息引导的恢复网络,所有这些都在统一的优化框架内实现。实验结果验证了FiRe在多种逆问题中的有效性,确立了将预训练恢复模型整合到PnP类算法中的新范式。
链接: https://arxiv.org/abs/2411.18970
作者: Matthieu Terris,Ulugbek S. Kamilov,Thomas Moreau
关键词-EN: information loss due, compensate for information, information loss, loss due, fundamental challenge
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Selecting an appropriate prior to compensate for information loss due to the measurement operator is a fundamental challenge in imaging inverse problems. Implicit priors based on denoising neural networks have become central to widely-used frameworks such as Plug-and-Play (PnP) algorithms. In this work, we introduce Fixed-points of Restoration (FiRe) priors as a new framework for expanding the notion of priors in PnP to general restoration models beyond traditional denoising models. The key insight behind FiRe is that natural images emerge as fixed points of the composition of a degradation operator with the corresponding restoration model. This enables us to derive an explicit formula for our implicit prior by quantifying invariance of images under this composite operation. Adopting this fixed-point perspective, we show how various restoration networks can effectively serve as priors for solving inverse problems. The FiRe framework further enables ensemble-like combinations of multiple restoration models as well as acquisition-informed restoration networks, all within a unified optimization approach. Experimental results validate the effectiveness of FiRe across various inverse problems, establishing a new paradigm for incorporating pretrained restoration models into PnP-like algorithms.
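FiRe 的核心观察是:自然图像近似是"退化算子 D 与恢复模型 R 复合"的不动点,即 x ≈ R(D(x)),因此 ‖x - R(D(x))‖² 可以作为隐式先验项加入逆问题求解。下面给出该先验项及一步梯度更新的示意代码(PyTorch);其中的恢复模型、退化算子和超参数均为占位假设,并非论文的完整优化方案。

```python
import torch

def fire_prior(x, degrade, restore):
    """Implicit FiRe-style prior: penalize deviation of x from the fixed point of restore∘degrade."""
    return 0.5 * torch.sum((x - restore(degrade(x))) ** 2)

def fire_gradient_step(x, y, forward_op, degrade, restore, step=0.1, lam=0.5):
    """One gradient step on data fidelity + fixed-point prior (a sketch, not the paper's scheme)."""
    x = x.detach().requires_grad_(True)
    loss = 0.5 * torch.sum((forward_op(x) - y) ** 2) + lam * fire_prior(x, degrade, restore)
    (grad,) = torch.autograd.grad(loss, x)
    return (x - step * grad).detach()

# toy placeholders: a blur-like forward operator and a "restoration network"
forward_op = torch.nn.AvgPool2d(2)
degrade = lambda x: x + 0.1 * torch.randn_like(x)     # pretend degradation
restore = torch.nn.Conv2d(1, 1, 3, padding=1)         # pretend pretrained restorer
y = torch.rand(1, 1, 16, 16)
x = torch.rand(1, 1, 32, 32)
print(fire_gradient_step(x, y, forward_op, degrade, restore).shape)
```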
zh
[CV-180] Deep Plug-and-Play HIO Approach for Phase Retrieval
【速读】: 该论文试图解决相位恢复问题,即从仅包含强度的测量(如傅里叶强度)中恢复未知图像。解决方案的关键在于引入了一种基于学习的插拔式方法,通过将学习型先验与混合输入输出方法(HIO)结合,利用插拔式正则化技术进行优化。该方法的核心是通过半二次分裂技术推导出高效的更新步骤,从而在图像质量、计算效率和初始化及噪声鲁棒性方面展现出优越的性能。
链接: https://arxiv.org/abs/2411.18967
作者: Cagatay Isil,Figen S. Oktem
关键词-EN: Fourier intensity, intensity-only measurements, phase retrieval problem, Computational Optical Sensing, phase retrieval
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In the phase retrieval problem, the aim is the recovery of an unknown image from intensity-only measurements such as Fourier intensity. Although there are several solution approaches, solving this problem is challenging due to its nonlinear and ill-posed nature. Recently, learning-based approaches have emerged as powerful alternatives to the analytical methods for several inverse problems. In the context of phase retrieval, a novel plug-and-play approach that exploits a learning-based prior and efficient update steps has been presented at the Computational Optical Sensing and Imaging topical meeting, with demonstrated state-of-the-art performance. The key idea was to incorporate a learning-based prior into the hybrid input-output method (HIO) through plug-and-play regularization. In this paper, we present the mathematical development of the method including the derivation of its analytical update steps based on half-quadratic splitting and comparatively evaluate its performance through extensive simulations on a large test dataset. The results show the effectiveness of the method in terms of image quality, computational efficiency, and robustness to initialization and noise.
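下面给出"HIO + 即插即用先验"思想的极简示意(NumPy):每次迭代先施加傅里叶幅值约束,再做 HIO 的支撑域更新,并周期性地用一个去噪器对估计图像做正则化。这里用高斯滤波代替学习型去噪网络作为占位,且省略了半二次分裂的精确更新式,仅示意整体流程,细节请以论文为准。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pnp_hio(fourier_magnitude, support, n_iter=200, beta=0.9, denoise_every=20):
    """Toy plug-and-play HIO loop for phase retrieval from Fourier magnitudes.

    fourier_magnitude: measured |F(x)|; support: boolean mask of the object region.
    A Gaussian filter stands in for the learned denoising prior (an assumption).
    """
    rng = np.random.default_rng(0)
    g = rng.random(fourier_magnitude.shape) * support
    for k in range(n_iter):
        G = np.fft.fft2(g)
        G = fourier_magnitude * np.exp(1j * np.angle(G))    # enforce measured magnitudes
        g_prime = np.real(np.fft.ifft2(G))
        violated = (~support) | (g_prime < 0)
        g = np.where(violated, g - beta * g_prime, g_prime)  # classic HIO object-domain update
        if (k + 1) % denoise_every == 0:
            g = gaussian_filter(g, sigma=1.0) * support      # plug-and-play prior step
    return g

true = np.zeros((32, 32)); true[12:20, 10:22] = 1.0
mag = np.abs(np.fft.fft2(true))
supp = np.zeros((32, 32), dtype=bool); supp[8:24, 8:24] = True
print(pnp_hio(mag, supp).shape)
```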
zh
[CV-181] CovHuSeg: An Enhanced Approach for Kidney Pathology Segmentation
【速读】: 该论文试图解决传统深度学习和机器学习模型在分割任务中难以捕捉几何特征(如大小和凸性)的问题,特别是在肾脏病理图像中对肾小球(glomerulus)的分割。解决方案的关键是提出了一种名为CovHuSeg的算法,这是一种专门针对球形异常(包括肾小球)分割的后处理方法。CovHuSeg算法确保生成的分割掩码没有空洞,并且形状符合肾小球的自然形态,从而提高了分割的准确性。
链接: https://arxiv.org/abs/2411.18893
作者: Huy Trinh,Khang Tran,Nam Nguyen,Tri Cao,Binh Nguyen
关键词-EN: numerous real-world applications, computer vision due, real-world applications, long been essential, essential in computer
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
点击查看摘要
Abstract:Segmentation has long been essential in computer vision due to its numerous real-world applications. However, most traditional deep learning and machine learning models struggle to capture geometric features such as the size and convexity of the segmentation targets, resulting in suboptimal outcomes. To resolve this problem, we propose using the CovHuSeg algorithm to solve the problem of kidney glomeruli segmentation. This simple post-processing method is designed for the segmentation of ball-shaped anomalies, including the glomerulus. Unlike other post-processing methods, the CovHuSeg algorithm ensures that the output mask has no holes and does not take on unusual shapes that could not plausibly be a glomerulus. We illustrate the effectiveness of our method by experimenting with multiple deep-learning models in the context of segmentation on kidney pathology images. The results show that all models have increased accuracy when using the CovHuSeg algorithm.
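CovHuSeg 的思想是利用肾小球近似球形/凸形的先验,对分割掩码做凸包式后处理以消除孔洞和不合理形状。下面用 scikit-image 给出"按连通域取凸包"的常见写法作为示意,并非论文的官方实现。

```python
import numpy as np
from skimage.measure import label
from skimage.morphology import convex_hull_image

def convex_hull_postprocess(mask):
    """Replace every connected component of a binary mask by its convex hull.
    This removes holes and enforces roughly convex (glomerulus-like) shapes."""
    out = np.zeros_like(mask, dtype=bool)
    labeled = label(mask > 0)
    for region_id in range(1, labeled.max() + 1):
        out |= convex_hull_image(labeled == region_id)
    return out

# toy mask: a blob with a hole in it
m = np.zeros((32, 32), dtype=bool)
m[8:24, 8:24] = True
m[14:18, 14:18] = False          # hole that should disappear after post-processing
cleaned = convex_hull_postprocess(m)
print(int(m.sum()), int(cleaned.sum()))   # the hole gets filled
```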
zh
[CV-182] Multi-Task Learning for Integrated Automated Contouring and Voxel-Based Dose Prediction in Radiotherapy
【速读】: 该论文试图解决传统放射治疗计划中自动化轮廓勾画和剂量预测作为独立任务的问题,以及深度学习(DL)中这两个任务独立进行的问题。解决方案的关键在于采用多任务学习(MTL)方法,将自动化轮廓勾画和基于体素的剂量预测任务无缝集成。通过利用两个任务之间的共同信息,MTL不仅提高了自动化任务的效率,还增强了剂量预测性能,同时保持或提升了轮廓勾画的准确性。具体来说,与顺序的DL方法相比,MTL在前列腺和头颈部癌症数据集上的剂量体积直方图指标的平均绝对差异分别提高了19.82%和16.33%,并且在轮廓勾画精度上也有所提升,前列腺和头颈部数据集的Dice系数分别从0.818和0.674提高到0.824和0.716。
链接: https://arxiv.org/abs/2411.18767
作者: Sangwook Kim,Aly Khalifa,Thomas G. Purdie,Chris McIntosh
关键词-EN: treatment planning, automated treatment planning, automated contouring, Deep learning-based automated, learning-based automated contouring
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Deep learning-based automated contouring and treatment planning has been proven to improve the efficiency and accuracy of radiotherapy. However, the conventional radiotherapy treatment planning process treats automated contouring and treatment planning as separate tasks. Moreover, in deep learning (DL), the contouring and dose prediction tasks for automated treatment planning are done independently. In this study, we applied the multi-task learning (MTL) approach in order to seamlessly integrate automated contouring and voxel-based dose prediction tasks, as MTL can leverage common information between the two tasks and increase the efficiency of the automated tasks. We developed our MTL framework using two datasets: an in-house prostate cancer dataset and the publicly available head and neck cancer dataset, OpenKBP. Compared to the sequential DL contouring and treatment planning tasks, our proposed method using MTL improved the mean absolute difference of dose volume histogram metrics of prostate and head and neck sites by 19.82% and 16.33%, respectively. Our MTL model for automated contouring and dose prediction tasks demonstrated enhanced dose prediction performance while maintaining or sometimes even improving the contouring accuracy. Compared to the baseline automated contouring model with the dice score coefficients of 0.818 for prostate and 0.674 for head and neck datasets, our MTL approach achieved average scores of 0.824 and 0.716 for these datasets, respectively. Our study highlights the potential of the proposed automated contouring and planning using MTL to support the development of efficient and accurate automated treatment planning for radiotherapy.
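这里的多任务学习可以抽象为:共享编码器接两个任务头(器官轮廓分割、体素剂量预测),用加权联合损失一起训练。下面给出一个最小的双头网络与联合损失草图(PyTorch);网络结构、器官数与损失权重均为演示用假设,并非论文模型。

```python
import torch
import torch.nn as nn

class ContourDoseMTL(nn.Module):
    """Shared encoder with a contouring (segmentation) head and a dose-prediction head."""
    def __init__(self, n_organs=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU())
        self.contour_head = nn.Conv3d(16, n_organs, 1)   # per-voxel organ logits
        self.dose_head = nn.Conv3d(16, 1, 1)             # per-voxel dose (Gy)
    def forward(self, ct):
        f = self.encoder(ct)
        return self.contour_head(f), self.dose_head(f)

model = ContourDoseMTL()
ct = torch.randn(1, 1, 16, 64, 64)
gt_mask = torch.randint(0, 3, (1, 16, 64, 64))
gt_dose = torch.rand(1, 1, 16, 64, 64) * 70

contour_logits, dose_pred = model(ct)
loss = nn.functional.cross_entropy(contour_logits, gt_mask) \
     + 0.5 * nn.functional.l1_loss(dose_pred, gt_dose)   # 0.5 is an assumed task weight
loss.backward()
print(loss.item())
```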
zh
人工智能
[AI-0] Dynamic EEG-fMRI mapping: Revealing the relationship between brain connectivity and cognitive state
链接: https://arxiv.org/abs/2411.19922
作者: Guiran Liu,Binrong Zhu
关键词-EN: dynamic connectivity patterns, brain network interactions, patterns between EEG, network interactions, connectivity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
点击查看摘要
Abstract:This study investigated the dynamic connectivity patterns between EEG and fMRI modalities, contributing to our understanding of brain network interactions. By employing a comprehensive approach that integrated static and dynamic analyses of EEG-fMRI data, we were able to uncover distinct connectivity states and characterize their temporal fluctuations. The results revealed modular organization within the intrinsic connectivity networks (ICNs) of the brain, highlighting the significant roles of sensory systems and the default mode network. The use of a sliding window technique allowed us to assess how functional connectivity varies over time, further elucidating the transient nature of brain connectivity. Additionally, our findings align with previous literature, reinforcing the notion that cognitive states can be effectively identified through short-duration data, specifically within the 30-60 second timeframe. The established relationships between connectivity strength and cognitive processes, particularly during different visual states, underscore the relevance of our approach for future research into brain dynamics. Overall, this study not only enhances our understanding of the interplay between EEG and fMRI signals but also paves the way for further exploration into the neural correlates of cognitive functions and their implications in clinical settings. Future research should focus on refining these methodologies and exploring their applications in various cognitive and clinical contexts.
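文中用滑动窗口技术评估功能连接随时间的变化。下面给出"滑动窗口相关矩阵"的常见计算示例(NumPy),窗口长度对应摘要中提到的 30-60 秒量级;时间序列为随机占位数据,实际应替换为各脑区的 EEG/fMRI 信号。

```python
import numpy as np

def sliding_window_connectivity(signals, window, step):
    """Compute one correlation matrix per sliding window.

    signals: (n_timepoints, n_regions) array of region time series.
    Returns an array of shape (n_windows, n_regions, n_regions).
    """
    n_t, _ = signals.shape
    mats = []
    for start in range(0, n_t - window + 1, step):
        chunk = signals[start:start + window]
        mats.append(np.corrcoef(chunk, rowvar=False))
    return np.stack(mats)

# toy data: 300 timepoints (e.g., TR = 1 s -> 5 min), 10 regions; 45-s windows, 5-s step
rng = np.random.default_rng(0)
ts = rng.standard_normal((300, 10))
conn = sliding_window_connectivity(ts, window=45, step=5)
print(conn.shape)   # (n_windows, 10, 10)
```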
[AI-1] Handling irresolvable conflicts in the Semantic Web: an RDF-based conflict-tolerant version of the Deontic Traditional Scheme
链接: https://arxiv.org/abs/2411.19918
作者: Livio Robaldo,Gianluca Pozzato
关键词-EN: Deontic Traditional Scheme, statements prescribe conflicting, well-known Deontic Traditional, prescribe conflicting obligations, formal Deontic Logic
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper presents a new ontology that implements the well-known Deontic Traditional Scheme in RDFs and SPARQL, fit to handle irresolvable conflicts, i.e., situations in which two or more statements prescribe conflicting obligations, prohibitions, or permissions, with none of them being “stronger” than the other one(s). In our view, this paper marks a significant advancement in standard theoretical research in formal Deontic Logic. Most contemporary approaches in this field are confined to the propositional level, mainly focus on the notion of obligation, and lack implementations. The proposed framework is encoded in RDF, which is not only a first-order language but also the most widely used knowledge representation language, as it forms the foundation of the Semantic Web. Moreover, the proposed computational ontology formalizes all deontic modalities defined in the Deontic Traditional Scheme, without specifically focusing on obligations, and offers constructs to model and reason with various types of irresolvable conflicts, violations, and the interaction between deontic modalities and contextual constraints in a given state of affairs. To the best of our knowledge, no existing approach in the literature addresses all these aspects within a unified integrated framework. All examples presented and discussed in this paper, together with Java code and clear instructions to re-execute them locally, are available at this https URL
[AI-2] PDDLFuse: A Tool for Generating Diverse Planning Domains
链接: https://arxiv.org/abs/2411.19886
作者: Vedant Khandelwal,Amit Sheth,Forest Agostinelli
关键词-EN: real-world challenges require, challenges require planning, require planning algorithms, real-world challenges, challenges require
类目: Artificial Intelligence (cs.AI)
*备注: 218 Tables, 3 Figures, 4 Algorithms
点击查看摘要
Abstract:Various real-world challenges require planning algorithms that can adapt to a broad range of domains. Traditionally, the creation of planning domains has relied heavily on human implementation, which limits the scale and diversity of available domains. While recent advancements have leveraged generative AI technologies such as large language models (LLMs) for domain creation, these efforts have predominantly focused on translating existing domains from natural language descriptions rather than generating novel ones. In contrast, the concept of domain randomization, which has been highly effective in reinforcement learning, enhances performance and generalizability by training on a diverse array of randomized new domains. Inspired by this success, our tool, PDDLFuse, aims to bridge this gap in Planning Domain Definition Language (PDDL). PDDLFuse is designed to generate new, diverse planning domains that can be used to validate new planners or test foundational planning models. We have developed methods to adjust the domain generator’s parameters to modulate the difficulty of the domains it generates. This adaptability is crucial as existing domain-independent planners often struggle with more complex problems. Initial tests indicate that PDDLFuse efficiently creates intricate and varied domains, representing a significant advancement over traditional domain generation methods and making a contribution towards planning research.
[AI-3] LUMIA: Linear probing for Unimodal and MultiModal Membership Inference Attacks leveraging internal LLM states
链接: https://arxiv.org/abs/2411.19876
作者: Luis Ibanez-Lissen,Lorena Gonzalez-Manzano,Jose Maria de Fuentes,Nicolas Anciaux,Joaquin Garcia-Alfaro
关键词-EN: Large Language Models, Large Language, Membership Inference Attacks, membership inference, Language Models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used in a variety of applications, but concerns around membership inference have grown in parallel. Previous efforts focus on black-to-grey-box models, thus neglecting the potential benefit from internal LLM information. To address this, we propose the use of Linear Probes (LPs) as a method to detect Membership Inference Attacks (MIAs) by examining internal activations of LLMs. Our approach, dubbed LUMIA, applies LPs layer-by-layer to get fine-grained data on the model inner workings. We test this method across several model architectures, sizes and datasets, including unimodal and multimodal tasks. In unimodal MIA, LUMIA achieves an average gain of 15.71% in Area Under the Curve (AUC) over previous techniques. Remarkably, LUMIA reaches AUC > 60% in 65.33% of cases – an increment of 46.80% against the state of the art. Furthermore, our approach reveals key insights, such as the model layers where MIAs are most detectable. In multimodal models, LPs indicate that visual inputs can significantly contribute to detect MIAs – AUC > 60% is reached in 85.90% of experiments.
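LUMIA 的做法可以概括为:对 LLM 每一层的内部激活训练一个线性探针来区分"训练成员/非成员"样本,并用 AUC 评估。下面用 scikit-learn 给出单层线性探针的示意训练流程;激活数据为随机占位,真实场景中应替换为从模型各层抽取的隐藏状态。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_layer(activations, is_member):
    """Train a linear probe on one layer's activations and report membership-inference AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, is_member, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])

# placeholder data: 1000 samples, hidden size 256, for one hypothetical layer
rng = np.random.default_rng(0)
acts = rng.standard_normal((1000, 256))
labels = rng.integers(0, 2, size=1000)          # 1 = training member, 0 = non-member
print(f"layer AUC: {probe_layer(acts, labels):.3f}")   # ~0.5 on random data, as expected
```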
[AI-4] DeMo: Decoupled Momentum Optimization
链接: https://arxiv.org/abs/2411.19870
作者: Bowen Peng,Jeffrey Quesnelle,Diederik P. Kingma
关键词-EN: typically requires sharing, requires sharing gradients, networks typically requires, typically requires, requires sharing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary. By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers. We introduce \textbfDecoupled \textbfMomentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware. Our method is topology-agnostic and architecture-independent and supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large scale foundation models. An open source reference PyTorch implementation is published on GitHub at this https URL
[AI-5] Q-learning-based Model-free Safety Filter
链接: https://arxiv.org/abs/2411.19809
作者: Guo Ning Sue,Yogita Choudhary,Richard Desatnik,Carmel Majidi,John Dolan,Guanya Shi
关键词-EN: presents significant challenges, Ensuring safety, robotics presents significant, significant challenges, presents significant
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: *Denotes equal contribution
点击查看摘要
Abstract:Ensuring safety via safety filters in real-world robotics presents significant challenges, particularly when the system dynamics is complex or unavailable. To handle this issue, learning-based safety filters have recently gained popularity; they can be classified into model-based and model-free methods. Existing model-based approaches require various assumptions on the system model (e.g., control-affine), which limits their application in complex systems, while existing model-free approaches need substantial modifications to standard RL algorithms and lack versatility. This paper proposes a simple, plug-and-play, and effective model-free safety filter learning framework. We introduce a novel reward formulation and use Q-learning to learn Q-value functions to safeguard arbitrary task specific nominal policies via filtering out their potentially unsafe actions. The threshold used in the filtering process is supported by our theoretical analysis. Due to its model-free nature and simplicity, our framework can be seamlessly integrated with various RL algorithms. We validate the proposed approach through simulations on double integrator and Dubin’s car systems and demonstrate its effectiveness in real-world experiments with a soft robotic limb.
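该安全过滤器的思路可以概括为:先用 Q-learning 学到一个安全 Q 函数,执行名义策略的动作前检查其安全 Q 值是否高于阈值,否则替换为更安全的动作。下面给出离散动作情形下过滤逻辑的示意(NumPy);安全 Q 表与阈值均为演示用假设,并非论文的连续控制实现。

```python
import numpy as np

def safety_filter(nominal_action, state, safety_q, threshold):
    """Pass the nominal action through only if its safety Q-value clears the threshold;
    otherwise fall back to the safest available action in this state."""
    q_values = safety_q[state]                  # Q_safe(s, a) for all discrete actions a
    if q_values[nominal_action] >= threshold:
        return nominal_action
    return int(np.argmax(q_values))             # most-safe action as fallback

# toy setup: 4 states x 3 actions; the safety Q-table would normally be learned with Q-learning
safety_q = np.array([[0.9, 0.2, 0.8],
                     [0.1, 0.7, 0.3],
                     [0.6, 0.6, 0.1],
                     [0.2, 0.1, 0.9]])
threshold = 0.5
for s, a_nominal in [(0, 1), (1, 1), (3, 0)]:
    print(s, a_nominal, "->", safety_filter(a_nominal, s, safety_q, threshold))
```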
[AI-6] Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures ICASSP2025
链接: https://arxiv.org/abs/2411.19806
作者: Alain Riou,Antonin Gagneré,Gaëtan Hadjeres,Stefan Lattner,Geoffroy Peeters
关键词-EN: Joint-Embedding Predictive Architectures, musical stem retrieval, Predictive Architectures, stem retrieval, latent representations
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025
点击查看摘要
Abstract:In this paper, we tackle the task of musical stem retrieval. Given a musical mix, it consists in retrieving a stem that would fit with it, i.e., that would sound pleasant if played together. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we discover that pretraining the encoder using contrastive learning drastically improves the model’s performance. We validate the retrieval performances of our model using the MUSDB18 and MoisesDB datasets. We show that it significantly outperforms previous baselines on both datasets, showcasing its ability to support more or less precise (and possibly unseen) conditioning. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.
[AI-7] Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2411.19804
作者: Robin D. Pesl,Jerin G. Mathew,Massimo Mecella,Marco Aiello
关键词-EN: create advanced Information, advanced Information Systems, endpoint discovery, essential to create, create advanced
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle. A traditional approach is a registry that provides the API documentation of the systems’ endpoints. Large Language Models (LLMs) have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce the token count. While revealing high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results as the agent splits the tasks into multiple fine-granular subtasks, improving the overall RAG performance in the token count, precision, and F1 score.
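针对 OpenAPI 文档过长、超出 LLM 输入限制的问题,一种直观的预处理是"按端点切块":把每个 path + HTTP method 的描述抽成独立 chunk 供 RAG 检索。下面给出这种朴素切块策略的示意代码(Python);它只是众多切块方式中的一种,并非论文评测的具体实现。

```python
from typing import Dict, List

def chunk_openapi_by_endpoint(spec: Dict) -> List[Dict]:
    """Split an OpenAPI spec into one retrievable chunk per (path, HTTP method)."""
    chunks = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            chunks.append({
                "id": f"{method.upper()} {path}",
                "text": " ".join(filter(None, [
                    method.upper(), path,
                    op.get("summary", ""), op.get("description", ""),
                ])),
            })
    return chunks

# minimal hypothetical spec
spec = {"paths": {
    "/pets": {"get": {"summary": "List pets"},
              "post": {"summary": "Create a pet", "description": "Adds a pet to the store"}},
    "/pets/{id}": {"get": {"summary": "Get a pet by id"}},
}}
for c in chunk_openapi_by_endpoint(spec):
    print(c["id"], "->", c["text"])
```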
[AI-8] CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives
链接: https://arxiv.org/abs/2411.19787
作者: Armin Saghafian,Amirmohammad Izadi,Negin Hashemi Dijujin,Mahdieh Soleymani Baghshah
关键词-EN: solving language-guided goal-reaching, reinforcement learning, reinforcement learning problems, language-guided goal-reaching reinforcement, goal-reaching reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Grounding the instruction in the environment is a key step in solving language-guided goal-reaching reinforcement learning problems. In automated reinforcement learning, a key concern is to enhance the model’s ability to generalize across various tasks and environments. In goal-reaching scenarios, the agent must comprehend the different parts of the instructions within the environmental context in order to complete the overall task successfully. In this work, we propose CAREL (Cross-modal Auxiliary REinforcement Learning) as a new framework to solve this problem using auxiliary loss functions inspired by video-text retrieval literature and a novel method called instruction tracking, which automatically keeps track of progress in an environment. The results of our experiments suggest superior sample efficiency and systematic generalization for this framework in multi-modal reinforcement learning problems. Our code base is available here.
[AI-9] Stock Price Prediction using Multi-Faceted Information based on Deep Recurrent Neural Networks
链接: https://arxiv.org/abs/2411.19766
作者: Lida Shahbandari,Elahe Moradi,Mohammad Manthouri
关键词-EN: enhanced wealth creation, effective portfolio management, Convolutional Neural Networks, integrating Convolutional Neural, informed investment decisions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurate prediction of stock market trends is crucial for informed investment decisions and effective portfolio management, ultimately leading to enhanced wealth creation and risk mitigation. This study proposes a novel approach for predicting stock prices in the stock market by integrating Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, using sentiment analysis of social network data and candlestick data (price). The proposed methodology consists of two primary components: sentiment analysis of social network and candlestick data. By amalgamating candlestick data with insights gleaned from Twitter, this approach facilitates a more detailed and accurate examination of market trends and patterns, ultimately leading to more effective stock price predictions. Additionally, a Random Forest algorithm is used to classify tweets as either positive or negative, allowing for a more subtle and informed assessment of market sentiment. This study uses CNN and LSTM networks to predict stock prices. The CNN extracts short-term features, while the LSTM models long-term dependencies. The integration of both networks enables a more comprehensive analysis of market trends and patterns, leading to more accurate stock price predictions.
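摘要中的模型可以抽象为:CNN 提取短期局部特征、LSTM 建模长期依赖,再回归下一步价格。下面给出一个最小的 CNN+LSTM 组合示意(PyTorch);层数、通道数与输入特征(蜡烛图 OHLC + 推文情感分数)均为演示用假设,并非论文的完整架构。

```python
import torch
import torch.nn as nn

class CnnLstmPricer(nn.Module):
    """1D-CNN over time (features as channels), then an LSTM, then a next-step price head."""
    def __init__(self, n_features=5, cnn_channels=16, lstm_hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, cnn_channels, kernel_size=3, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(cnn_channels, lstm_hidden, batch_first=True)
        self.head = nn.Linear(lstm_hidden, 1)
    def forward(self, x):                 # x: (batch, time, features)
        z = self.cnn(x.transpose(1, 2))   # -> (batch, channels, time)
        out, _ = self.lstm(z.transpose(1, 2))
        return self.head(out[:, -1])      # predict next-step price from the last time step

# toy batch: 8 windows of 30 days, features = [open, high, low, close, tweet_sentiment]
model = CnnLstmPricer()
x = torch.randn(8, 30, 5)
print(model(x).shape)                     # torch.Size([8, 1])
```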
[AI-10] Forecasting Foreign Exchange Market Prices Using Technical Indicators with Deep Learning and Attention Mechanism
链接: https://arxiv.org/abs/2411.19763
作者: Sahabeh Saadati,Mohammad Manthouri
关键词-EN: Accurate prediction, foreign exchange market, Convolutional Neural Network, CNN networks, foreign exchange
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurate prediction of price behavior in the foreign exchange market is crucial. This paper proposes a novel approach that leverages technical indicators and deep neural networks. The proposed architecture consists of a Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), and attention mechanism. Initially, trend and oscillation technical indicators are employed to extract statistical features from Forex currency pair data, providing insights into price trends, market volatility, relative price strength, and overbought and oversold conditions. Subsequently, the LSTM and CNN networks are utilized in parallel to predict future price movements, leveraging the strengths of both recurrent and convolutional architectures. The LSTM network captures long-term dependencies and temporal patterns in the data, while the CNN network extracts local patterns. The outputs of the parallel LSTM and CNN networks are then fed into an attention mechanism, which learns to weigh the importance of each feature and temporal dependency, generating a context-aware representation of the input data. The attention-weighted output is then used to predict future price movements, enabling the model to focus on the most relevant features and temporal dependencies. Through a comprehensive evaluation of the proposed approach on multiple Forex currency pairs, we demonstrate its effectiveness in predicting price behavior and outperforming benchmark models.
[AI-11] HVAC-DPT: A Decision Pretrained Transformer for HVAC Control
链接: https://arxiv.org/abs/2411.19746
作者: Anaïs Berkes
关键词-EN: Air Conditioning, operations consume approximately, Building operations consume, consume approximately, Ventilation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 7 pages, 3 figures, 3 tables
点击查看摘要
Abstract:Building operations consume approximately 40% of global energy, with Heating, Ventilation, and Air Conditioning (HVAC) systems responsible for up to 50% of this consumption. As HVAC energy demands are expected to rise, optimising system efficiency is crucial for reducing future energy use and mitigating climate change. Existing control strategies lack generalisation and require extensive training and data, limiting their rapid deployment across diverse buildings. This paper introduces HVAC-DPT, a Decision-Pretrained Transformer using in-context Reinforcement Learning (RL) for multi-zone HVAC control. HVAC-DPT frames HVAC control as a sequential prediction task, training a causal transformer on interaction histories generated by diverse RL agents. This approach enables HVAC-DPT to refine its policy in-context, without modifying network parameters, allowing for deployment across different buildings without the need for additional training or data collection. HVAC-DPT reduces energy consumption in unseen buildings by 45% compared to the baseline controller, offering a scalable and effective approach to mitigating the increasing environmental impact of HVAC systems.
[AI-12] Amplifying human performance in combinatorial competitive programming
链接: https://arxiv.org/abs/2411.19744
作者: Petar Veličković,Alex Vitvitskyi,Larisa Markeeva,Borja Ibarz,Lars Buesing,Matej Balog,Alexander Novikov
关键词-EN: capable of performing, significant surge, surge in complex, complex AI systems, performing at admirable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Programming Languages (cs.PL)
*备注: Technical report. 18 pages, 8 figures
点击查看摘要
Abstract:Recent years have seen a significant surge in complex AI systems for competitive programming, capable of performing at admirable levels against human competitors. While steady progress has been made, the highest percentiles still remain out of reach for these methods on standard competition platforms such as Codeforces. Here we instead focus on combinatorial competitive programming, where the target is to find as-good-as-possible solutions to otherwise computationally intractable problems, over specific given inputs. We hypothesise that this scenario offers a unique testbed for human-AI synergy, as human programmers can write a backbone of a heuristic solution, after which AI can be used to optimise the scoring function used by the heuristic. We deploy our approach on previous iterations of Hash Code, a global team programming competition inspired by NP-hard software engineering problems at Google, and we leverage FunSearch to evolve our scoring functions. Our evolved solutions significantly improve the attained scores from their baseline, successfully breaking into the top percentile on all previous Hash Code online qualification rounds, and outperforming the top human teams on several. Our method is also performant on an optimisation problem that featured in a recent held-out AtCoder contest.
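A toy stand-in for the human-AI workflow described above, not FunSearch or the Hash Code setup: a hand-written greedy "backbone" solves a small knapsack-style task, while a simple evolutionary loop mutates the scoring function's weights to raise the attained score. Problem data, weights, and the mutation scheme are all illustrative assumptions.

```python
# Toy illustration: the human writes the greedy backbone; search evolves
# only the scoring function it uses.
import random

random.seed(0)
items = [(random.randint(1, 20), random.randint(1, 30)) for _ in range(50)]  # (weight, value)
CAPACITY = 100

def greedy(score_weights):
    a, b = score_weights
    order = sorted(items, key=lambda it: a * it[1] - b * it[0], reverse=True)
    total_w = total_v = 0
    for w, v in order:
        if total_w + w <= CAPACITY:
            total_w += w
            total_v += v
    return total_v

best = (1.0, 0.0)          # start: rank items purely by value
best_score = greedy(best)
for _ in range(500):       # evolve the scoring function by random mutation
    cand = (best[0] + random.gauss(0, 0.3), best[1] + random.gauss(0, 0.3))
    s = greedy(cand)
    if s > best_score:
        best, best_score = cand, s
print(best, best_score)
```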
[AI-13] Graph Neural Networks for Heart Failure Prediction on an EHR-Based Patient Similarity Graph
链接: https://arxiv.org/abs/2411.19742
作者: Heloisa Oss Boll,Ali Amirahmadi,Amira Soliman,Stefan Byttner,Mariana Recamonde-Mendoza
关键词-EN: patient similarity graph, accurately predicting diseases, modern healthcare, crucial matter, Graph Transformer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Objective: In modern healthcare, accurately predicting diseases is a crucial matter. This study introduces a novel approach using graph neural networks (GNNs) and a Graph Transformer (GT) to predict the incidence of heart failure (HF) on a patient similarity graph at the next hospital visit. Materials and Methods: We used electronic health records (EHR) from the MIMIC-III dataset and applied the K-Nearest Neighbors (KNN) algorithm to create a patient similarity graph using embeddings from diagnoses, procedures, and medications. Three models - GraphSAGE, Graph Attention Network (GAT), and Graph Transformer (GT) - were implemented to predict HF incidence. Model performance was evaluated using F1 score, AUROC, and AUPRC metrics, and results were compared against baseline algorithms. An interpretability analysis was performed to understand the model’s decision-making process. Results: The GT model demonstrated the best performance (F1 score: 0.5361, AUROC: 0.7925, AUPRC: 0.5168). Although the Random Forest (RF) baseline achieved a similar AUPRC value, the GT model offered enhanced interpretability due to the use of patient relationships in the graph structure. A joint analysis of attention weights, graph connectivity, and clinical features provided insight into model predictions across different classification groups. Discussion and Conclusion: Graph-based approaches such as GNNs provide an effective framework for predicting HF. By leveraging a patient similarity graph, GNNs can capture complex relationships in EHR data, potentially improving prediction accuracy and clinical interpretability.
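A minimal sketch of the graph-construction step (not the authors' pipeline): build a K-nearest-neighbour patient similarity graph from per-patient embeddings with scikit-learn. The random embeddings and the value of k are placeholders for the MIMIC-III-derived embeddings used in the paper.

```python
# Build a sparse KNN patient similarity graph from patient embeddings.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
patient_embeddings = rng.normal(size=(500, 128))  # 500 patients x 128-dim embedding

# Edge (i, j) if j is among the k nearest neighbours of i under cosine distance
adj = kneighbors_graph(patient_embeddings, n_neighbors=5,
                       metric="cosine", mode="connectivity", include_self=False)
adj = adj.maximum(adj.T)          # symmetrise for an undirected graph
print(adj.shape, adj.nnz)         # this adjacency could feed GraphSAGE / GAT / GT layers
```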
[AI-14] Improving generalization of robot locomotion policies via Sharpness-Aware Reinforcement Learning
链接: https://arxiv.org/abs/2411.19732
作者: Severin Bochem,Eduardo Gonzalez-Sanchez,Yves Bicker,Gabriele Fadini
关键词-EN: extensive training data, requires extensive training, training data, requires extensive, extensive training
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures
点击查看摘要
Abstract:Reinforcement learning often requires extensive training data. Simulation-to-real transfer offers a promising approach to address this challenge in robotics. While differentiable simulators offer improved sample efficiency through exact gradients, they can be unstable in contact-rich environments and may lead to poor generalization. This paper introduces a novel approach integrating sharpness-aware optimization into gradient-based reinforcement learning algorithms. Our simulation results demonstrate that our method, tested on contact-rich environments, significantly enhances policy robustness to environmental variations and action perturbations while maintaining the sample efficiency of first-order methods. Specifically, our approach improves action noise tolerance compared to standard first-order methods and achieves generalization comparable to zeroth-order methods. This improvement stems from finding flatter minima in the loss landscape, associated with better generalization. Our work offers a promising solution to balance efficient learning and robust sim-to-real transfer in robotics, potentially bridging the gap between simulation and real-world performance.
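A minimal sketch of a sharpness-aware (SAM-style) two-step update, the ingredient the paper integrates into gradient-based RL; this is generic supervised PyTorch code rather than the authors' locomotion pipeline, and the model, loss, and rho value are assumptions.

```python
# SAM-style update: take the gradient at worst-case perturbed weights, then
# step the original weights, which biases training toward flatter minima.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
rho = 0.05
x, y = torch.randn(32, 10), torch.randn(32, 1)

# 1) gradient at the current weights
loss_fn(model(x), y).backward()
grad_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))

# 2) perturb weights towards the worst-case direction within a rho-ball
eps = []
with torch.no_grad():
    for p in model.parameters():
        e = rho * p.grad / (grad_norm + 1e-12)
        p.add_(e)
        eps.append(e)

# 3) gradient at the perturbed weights, then undo the perturbation and step
opt.zero_grad()
loss_fn(model(x), y).backward()
with torch.no_grad():
    for p, e in zip(model.parameters(), eps):
        p.sub_(e)
opt.step()
```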
[AI-15] CantorNet: A Sandbox for Testing Topological and Geometrical Measures NEURIPS
链接: https://arxiv.org/abs/2411.19713
作者: Michal Lewandowski,Hamid Eghbalzadeh,Bernhard A.Moser
关键词-EN: human faces, symmetry of human, repetitive motif, Cantor set, natural phenomena
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted at the NeurIPS Workshop on Symmetry and Geometry in Neural Representations, 2024
点击查看摘要
Abstract:Many natural phenomena are characterized by self-similarity, for example the symmetry of human faces, or a repetitive motif of a song. Studying such symmetries will allow us to gain deeper insights into the underlying mechanisms of complex systems. Recognizing the importance of understanding these patterns, we propose a geometrically inspired framework to study such phenomena in artificial neural networks. To this end, we introduce CantorNet, inspired by the triadic construction of the Cantor set, which was introduced by Georg Cantor in the 19th century. In mathematics, the Cantor set is a set of points lying on a single line that is self-similar and has a counterintuitive property of being an uncountably infinite null set. Similarly, we introduce CantorNet as a sandbox for studying self-similarity by means of novel topological and geometrical complexity measures. CantorNet constitutes a family of ReLU neural networks that spans the whole spectrum of possible Kolmogorov complexities, including the two opposite descriptions (linear and exponential as measured by the description length). CantorNet’s decision boundaries can be arbitrarily ragged, yet are analytically known. Besides serving as a testing ground for complexity measures, our work may serve to illustrate potential pitfalls in geometry-ignorant data augmentation techniques and adversarial attacks.
[AI-16] CAdam: Confidence-Based Optimization for Online Learning
链接: https://arxiv.org/abs/2411.19647
作者: Shaowen Wang,Anan Liu,Jian Xiao,Huan Liu,Yuekui Yang,Cong Xu,Qianqian Pu,Suncong Zheng,Wei Zhang,Jian Li
关键词-EN: frequently employ online, freshly collected data, Modern recommendation systems, systems frequently employ, employ online learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which integrates momentum ( m_t ) and adaptive learning rate ( v_t ). However, the volatile nature of online learning data, characterized by frequent distribution shifts and the presence of noise, poses significant challenges to Adam’s standard optimization process: (1) Adam may use outdated momentum and the average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam’s performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates. If momentum and gradient are in sync, CAdam proceeds with parameter updates according to Adam’s original formulation; if not, it temporarily withholds updates and monitors potential shifts in data distribution in subsequent iterations. This method allows CAdam to distinguish between true distributional shifts and mere noise, and adapt more quickly to new data distributions. Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known optimizers, including the original Adam, in efficiency and noise robustness. Furthermore, in large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system’s gross merchandise volume (GMV).
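A rough NumPy sketch of the confidence idea described above, not the authors' implementation: per parameter dimension, apply the Adam-style update only where the momentum and the current gradient agree in sign, and withhold it elsewhere. Hyperparameters follow common Adam defaults and are assumptions.

```python
# Confidence-gated Adam-style step on a toy quadratic objective.
import numpy as np

def cadam_like_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    agree = (m * grad) > 0                      # per-dimension consistency check
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    theta = theta - np.where(agree, step, 0.0)  # withhold the update where they disagree
    return theta, m, v

target = np.array([1.0, -2.0, 0.5, 3.0])
theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 201):
    grad = 2 * (theta - target)                 # gradient of a toy quadratic loss
    theta, m, v = cadam_like_step(theta, grad, m, v, t)
print(np.round(theta, 2))                       # approaches the target parameters
```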
[AI-17] Solving Rubik’s Cube Without Tricky Sampling
链接: https://arxiv.org/abs/2411.19583
作者: Yicheng Lin,Siyu Liang
关键词-EN: sparse reward structure, vast state space, reaching rewarded states, Rubiks Cube, reward structure
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The Rubik’s Cube, with its vast state space and sparse reward structure, presents a significant challenge for reinforcement learning (RL) due to the difficulty of reaching rewarded states. Previous research addressed this by propagating cost-to-go estimates from the solved state and incorporating search techniques. These approaches differ from human strategies that start from fully scrambled cubes, which can be tricky for solving a general sparse-reward problem. In this paper, we introduce a novel RL algorithm using policy gradient methods to solve the Rubik’s Cube without relying on near solved-state sampling. Our approach employs a neural network to predict cost patterns between states, allowing the agent to learn directly from scrambled states. Our method was tested on the 2x2x2 Rubik’s Cube, where the cube was scrambled 50,000 times, and the model successfully solved it in over 99.4% of cases. Notably, this result was achieved using only the policy network without relying on tree search as in previous methods, demonstrating its effectiveness and potential for broader applications in sparse-reward problems.
[AI-18] Unimib Assistant: designing a student-friendly RAG-based chatbot for all their needs
链接: https://arxiv.org/abs/2411.19554
作者: Chiara Antico,Stefano Giordano,Cansu Koyuturk,Dimitri Ognibene
关键词-EN: Large Language Models, Natural language processing, language processing skills, Natural language, Language Models
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted for Italian Workshop on Artificial Intelligence for Human Machine Interaction (AIxHMI 2024), November 26, 2024, Bolzano, Italy
点击查看摘要
Abstract:Natural language processing skills of Large Language Models (LLMs) are unprecedented, having wide diffusion and application in different tasks. This pilot study focuses on specializing ChatGPT behavior through a Retrieval-Augmented Generation (RAG) system using the OpenAI custom GPTs feature. The purpose of our chatbot, called Unimib Assistant, is to provide information and solutions to the specific needs of University of Milano-Bicocca (Unimib) students through a question-answering approach. We provided the system with a prompt highlighting its specific purpose and behavior, as well as university-related documents and links obtained from an initial need-finding phase, interviewing six students. After a preliminary customization phase, a qualitative usability test was conducted with six other students to identify the strengths and weaknesses of the chatbot, with the goal of improving it in a subsequent redesign phase. While the chatbot was appreciated for its user-friendly experience, perceived general reliability, well-structured responses, and conversational tone, several significant technical and functional limitations emerged. In particular, the satisfaction and overall experience of the users were impaired by the system’s inability to always provide fully accurate information. Moreover, it would often neglect to report relevant information even when it was present in the uploaded materials and the given prompt. Furthermore, it sometimes generated unclickable links, undermining its trustworthiness, since providing the source of information was an important aspect for our users. Further in-depth studies and feedback from other users as well as implementation iterations are planned to refine our Unimib Assistant.
[AI-19] Quantized Delta Weight Is Safety Keeper
链接: https://arxiv.org/abs/2411.19530
作者: Yule Liu,Zhen Sun,Xinlei He,Xinyi Huang
关键词-EN: enable customized applications, high resource demands, fine-tuning proprietary language, proprietary language models, language models enable
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advancements in fine-tuning proprietary language models enable customized applications across various domains but also introduce two major challenges: high resource demands and security risks. Regarding resource demands, recent work proposes novel partial compression, such as BitDelta, to quantize the delta weights between the fine-tuned model and base model. Regarding the security risks, user-defined fine-tuning can introduce security vulnerabilities, such as alignment issues, backdoor attacks, and hallucinations. However, most of the current efforts in security assessment focus on the full-precision or full-compression models, it is not well-discussed how the partial compression methods affect security concerns. To bridge this gap, we evaluate the robustness of delta-weight quantization against these security threats. In this paper, we uncover a “free lunch” phenomenon: partial compression can enhance model security against fine-tuning-based attacks with bearable utility loss. Using Llama-2-7b-chat as a case study, we show that, with under 10% utility degradation, the partial compression mitigates alignment-breaking risks by up to 66.17%, harmful backdoor vulnerabilities by 64.46%, and targeted output manipulation risks by up to 90.53%. We further apply LogitLens to visualize internal state transformations during forward passes, suggesting mechanisms for both security failure and recovery in standard versus compressed fine-tuning. This work offers new insights into selecting effective delta compression methods for secure, resource-efficient multi-tenant services.
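A rough sketch of the kind of partial compression discussed above (in the spirit of BitDelta, but not its implementation): keep the base weights in full precision and store only the sign of the fine-tuned delta plus a single per-tensor scale. The per-tensor mean-absolute scale is an assumption for illustration.

```python
# Quantize the fine-tuning delta to one bit per weight plus one scale per tensor.
import torch

def quantize_delta(base: torch.Tensor, finetuned: torch.Tensor):
    delta = finetuned - base
    scale = delta.abs().mean()     # one scalar per tensor (illustrative choice)
    sign = torch.sign(delta)       # 1-bit direction per weight
    return sign, scale

def reconstruct(base, sign, scale):
    return base + sign * scale

base = torch.randn(4, 4)
finetuned = base + 0.01 * torch.randn(4, 4)
sign, scale = quantize_delta(base, finetuned)
approx = reconstruct(base, sign, scale)
print((approx - finetuned).abs().max())  # reconstruction error on the small delta
```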
[AI-20] A Local Information Aggregation based Multi-Agent Reinforcement Learning for Robot Swarm Dynamic Task Allocation
链接: https://arxiv.org/abs/2411.19526
作者: Yang Lv,Jinlong Lei,Peng Yi
关键词-EN: Deterministic Policy Gradient, Local Information Aggregation, Deep Deterministic Policy, emphasizing the necessity, formulating robust
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:In this paper, we explore how to optimize task allocation for robot swarms in dynamic environments, emphasizing the necessity of formulating robust, flexible, and scalable strategies for robot cooperation. We introduce a novel framework using a decentralized partially observable Markov decision process (Dec_POMDP), specifically designed for distributed robot swarm networks. At the core of our methodology is the Local Information Aggregation Multi-Agent Deep Deterministic Policy Gradient (LIA_MADDPG) algorithm, which merges centralized training with distributed execution (CTDE). During the centralized training phase, a local information aggregation (LIA) module is meticulously designed to gather critical data from neighboring robots, enhancing decision-making efficiency. In the distributed execution phase, a strategy improvement method is proposed to dynamically adjust task allocation based on changing and partially observable environmental conditions. Our empirical evaluations show that the LIA module can be seamlessly integrated into various CTDE-based MARL methods, significantly enhancing their performance. Additionally, by comparing LIA_MADDPG with six conventional reinforcement learning algorithms and a heuristic algorithm, we demonstrate its superior scalability, rapid adaptation to environmental changes, and ability to maintain both stability and convergence speed. These results underscore LIA_MADDPG’s outstanding performance and its potential to significantly improve dynamic task allocation in robot swarms through enhanced local collaboration and adaptive strategy execution.
[AI-21] RL-MILP Solver: A Reinforcement Learning Approach for Solving Mixed-Integer Linear Programs with Graph Neural Networks
链接: https://arxiv.org/abs/2411.19517
作者: Tae-Hoon Lee,Min-Soo Kim
关键词-EN: Mixed-Integer Linear Programming, Linear Programming, optimization technique widely, Mixed-Integer Linear, MILP
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Mixed-Integer Linear Programming (MILP) is an optimization technique widely used in various fields. Primal heuristics, which reduce the search space of MILP, have enabled traditional solvers (e.g., Gurobi) to efficiently find high-quality solutions. However, traditional primal heuristics rely on expert knowledge, motivating the advent of machine learning (ML)-based primal heuristics that learn repetitive patterns in MILP. Nonetheless, existing ML-based primal heuristics do not guarantee solution feasibility (i.e., satisfying all constraints) and primarily focus on prediction for binary decision variables. When addressing MILP involving non-binary integer variables using ML-based approaches, feasibility issues can become even more pronounced. Since finding an optimal solution requires satisfying all constraints, addressing feasibility is critical. To overcome these limitations, we propose a novel reinforcement learning (RL)-based solver that interacts with MILP to find feasible solutions, rather than delegating sub-problems to traditional solvers. We design reward functions tailored for MILP, which enables the RL agent to learn relationships between decision variables and constraints. Additionally, to effectively model complex relationships among decision variables, we leverage a Transformer encoder-based graph neural network (GNN). Our experimental results demonstrate that the proposed method can solve MILP problems and find near-optimal solutions without delegating the remainder to traditional solvers. The proposed method provides a meaningful step forward as an initial study in solving MILP problems end-to-end based solely on ML.
[AI-22] Knowledge-Data Fusion Based Source-Free Semi-Supervised Domain Adaptation for Seizure Subtype Classification
链接: https://arxiv.org/abs/2411.19502
作者: Ruimin Peng,Jiayu An,Dongrui Wu
关键词-EN: seizure subtype classification, clinical diagnosis efficiency, enhances clinical diagnosis, seizure subtype, subtype classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:Electroencephalogram (EEG)-based seizure subtype classification enhances clinical diagnosis efficiency. Source-free semi-supervised domain adaptation (SF-SSDA), which transfers a pre-trained model to a new dataset with no source data and limited labeled target data, can be used for privacy-preserving seizure subtype classification. This paper considers two challenges in SF-SSDA for EEG-based seizure subtype classification: 1) How to effectively fuse both raw EEG data and expert knowledge in classifier design? 2) How to align the source and target domain distributions for SF-SSDA? We propose a Knowledge-Data Fusion based SF-SSDA approach, KDF-MutualSHOT, for EEG-based seizure subtype classification. In source model training, KDF uses Jensen-Shannon Divergence to facilitate mutual learning between a feature-driven Decision Tree-based model and a data-driven Transformer-based model. To adapt KDF to a new target dataset, an SF-SSDA algorithm, MutualSHOT, is developed, which features a consistency-based pseudo-label selection strategy. Experiments on the public TUSZ and CHSZ datasets demonstrated that KDF-MutualSHOT outperformed other supervised and source-free domain adaptation approaches in cross-subject seizure subtype classification.
[AI-23] Protecting Multiple Types of Privacy Simultaneously in EEG-based Brain-Computer Interfaces
链接: https://arxiv.org/abs/2411.19498
作者: Lubin Meng,Xue Jiang,Tianwang Jia,Dongrui Wu
关键词-EN: enables direct communication, EEG data, brain-computer interface, enables direct, external device
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:A brain-computer interface (BCI) enables direct communication between the brain and an external device. Electroencephalogram (EEG) is the preferred input signal in non-invasive BCIs, due to its convenience and low cost. EEG-based BCIs have been successfully used in many applications, such as neurological rehabilitation, text input, games, and so on. However, EEG signals inherently carry rich personal information, necessitating privacy protection. This paper demonstrates that multiple types of private information (user identity, gender, and BCI-experience) can be easily inferred from EEG data, posing a serious privacy threat to BCIs. To address this issue, we design perturbations to convert the original EEG data into privacy-protected EEG data, which conceal the private information while maintaining the primary BCI task performance. Experimental results demonstrated that the privacy-protected EEG data can significantly reduce the classification accuracy of user identity, gender and BCI-experience, while having almost no effect on the classification accuracy of the primary BCI task, enabling user privacy protection in EEG-based BCIs.
[AI-24] Action Engine: An LLM-based Framework for Automatic FaaS Workflow Generation
链接: https://arxiv.org/abs/2411.19485
作者: Akiharu Esashi,Pawissanutt Lertpongrujikorn,Mohsen Amini Salehi
关键词-EN: cloud systems due, Action Engine, called Action Engine, Action Engine includes, Large Language Models
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted at Utility Cloud Computing (UCC '24) conference
点击查看摘要
Abstract:Function as a Service (FaaS) is poised to become the foundation of the next generation of cloud systems due to its inherent advantages in scalability, cost-efficiency, and ease of use. However, challenges such as the need for specialized knowledge and difficulties in building function workflows persist for cloud-native application developers. To overcome these challenges and mitigate the burden of developing FaaS-based applications, in this paper, we propose a mechanism called Action Engine, that makes use of Tool-Augmented Large Language Models (LLMs) at its kernel to interpret human language queries and automates FaaS workflow generation, thereby, reducing the need for specialized expertise and manual design. Action Engine includes modules to identify relevant functions from the FaaS repository and seamlessly manage the data dependency between them, ensuring that the developer’s query is processed and resolved. Beyond that, Action Engine can execute the generated workflow by feeding the user-provided parameters. Our evaluations show that Action Engine can generate workflows with up to 20% higher correctness without developer involvement. We notice that Action Engine can unlock FaaS workflow generation for non-cloud-savvy developers and expedite the development cycles of cloud-native applications.
[AI-25] Towards Understanding Retrieval Accuracy and Prompt Quality in RAG Systems
链接: https://arxiv.org/abs/2411.19463
作者: Shengming Zhao,Yuheng Huang,Jiayang Song,Zhijie Wang,Chengcheng Wan,Lei Ma
关键词-EN: large language models, demonstrated promising efficacy, Retrieval-Augmented Generation, RAG systems, RAG
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) is a pivotal technique for enhancing the capability of large language models (LLMs) and has demonstrated promising efficacy across a diverse spectrum of tasks. While LLM-driven RAG systems show superior performance, they face unique challenges in stability and reliability. Their complexity hinders developers’ efforts to design, maintain, and optimize effective RAG systems. Therefore, it is crucial to understand how RAG’s performance is impacted by its design. In this work, we conduct an early exploratory study toward a better understanding of the mechanism of RAG systems, covering three code datasets, three QA datasets, and two LLMs. We focus on four design factors: retrieval document type, retrieval recall, document selection, and prompt techniques. Our study uncovers how each factor impacts system correctness and confidence, providing valuable insights for developing an accurate and reliable RAG system. Based on these findings, we present nine actionable guidelines for detecting defects and optimizing the performance of RAG systems. We hope our early exploration can inspire further advancements in engineering, improving and maintaining LLM-driven intelligent software systems for greater efficiency and reliability.
[AI-26] Gradient Inversion Attack on Graph Neural Networks
链接: https://arxiv.org/abs/2411.19440
作者: Divya Anand Sinha,Yezi Liu,Ruijie Du,Yanning Shen
关键词-EN: protecting data privacy, large graph datasets, local graph data, Graph federated learning, federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Graph federated learning is of essential importance for training over large graph datasets while protecting data privacy, where each client stores a subset of local graph data, while the server collects the local gradients and broadcasts only the aggregated gradients. Recent studies reveal that a malicious attacker can steal private image data from gradient exchanging of neural networks during federated learning. However, none of the existing works have studied the vulnerability of graph data and graph neural networks under such attack. To answer this question, the present paper studies the problem of whether private data can be recovered from leaked gradients in both node classification and graph classification tasks and proposes a novel attack named Graph Leakage from Gradients (GLG). Two widely-used GNN frameworks are analyzed, namely GCN and GraphSAGE. The effects of different model settings on recovery are extensively discussed. Through theoretical analysis and empirical validation, it is shown that parts of the graph data can be leaked from the gradients.
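A compact sketch of the gradient-matching family of attacks (DLG-style) that this line of work builds on, shown on a plain linear model rather than a GNN: the attacker optimises dummy input so its gradients match the leaked ones. The model, the assumption that the label is known, and the optimiser settings are simplifications for illustration.

```python
# Gradient-matching reconstruction of a single private input from leaked gradients.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 3)
loss_fn = nn.CrossEntropyLoss()

# "Client" computes gradients on private data and shares them
x_true = torch.randn(1, 8)
y_true = torch.tensor([1])                 # label assumed known by the attacker here
true_grads = torch.autograd.grad(loss_fn(model(x_true), y_true), model.parameters())

# "Attacker" optimises dummy data so its gradients match the leaked ones
x_dummy = torch.randn(1, 8, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.1)
for _ in range(300):
    opt.zero_grad()
    dummy_grads = torch.autograd.grad(loss_fn(model(x_dummy), y_true),
                                      model.parameters(), create_graph=True)
    match = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    match.backward()
    opt.step()
print(torch.dist(x_dummy.detach(), x_true))  # typically shrinks as the match loss falls
```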
[AI-27] Proto Successor Measure: Representing the Space of All Possible Solutions of Reinforcement Learning
链接: https://arxiv.org/abs/2411.19418
作者: Siddhant Agarwal,Harshit Sikchi,Peter Stone,Amy Zhang
关键词-EN: Proto Successor Measure, transfer their knowledge, reinforcement learning, general-purpose reinforcement learning, intelligent agents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under submission, 23 pages
点击查看摘要
Abstract:Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment. Referred to as “zero-shot learning,” this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present Proto Successor Measure: the basis set for all possible solutions of Reinforcement Learning in a dynamical system. We provably show that any possible policy can be represented using an affine combination of these policy-independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights to combine these basis functions, corresponding to the optimal policy. We derive a practical algorithm to learn these basis functions using only interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions. Project page: this https URL.
[AI-28] Global Tensor Motion Planning
链接: https://arxiv.org/abs/2411.19393
作者: An T. Le,Kay Hansel,João Carvalho,Joe Watson,Julen Urain,Armin Biess,Georgia Chalvatzaki,Jan Peters
关键词-EN: dataset generation diversity, Global Tensor Motion, Tensor Motion Planning, generation diversity, increasingly crucial
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 8 pages, 4 figures
点击查看摘要
Abstract:Batch planning is increasingly crucial for the scalability of robotics tasks and dataset generation diversity. This paper presents Global Tensor Motion Planning (GTMP) – a sampling-based motion planning algorithm comprising only tensor operations. We introduce a novel discretization structure represented as a random multipartite graph, enabling efficient vectorized sampling, collision checking, and search. We provide an early theoretical investigation showing that GTMP exhibits probabilistic completeness while supporting modern GPU/TPU. Additionally, by incorporating smooth structures into the multipartite graph, GTMP directly plans smooth splines without requiring gradient-based optimization. Experiments on lidar-scanned occupancy maps and the MotionBenchMarker dataset demonstrate GTMP’s computation efficiency in batch planning compared to baselines, underscoring GTMP’s potential as a robust, scalable planner for diverse applications and large-scale robot learning tasks.
[AI-29] Zero-Forget Preservation of Semantic Communication Alignment in Distributed AI Networks
链接: https://arxiv.org/abs/2411.19385
作者: Jingzhi Hu,Geoffrey Ye Li
关键词-EN: Future communication networks, distributed artificial intelligence, connect massive distributed, massive distributed artificial, Future communication
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Future communication networks are expected to connect massive distributed artificial intelligence (AI). Exploiting aligned priori knowledge of AI pairs, it is promising to convert high-dimensional data transmission into highly-compressed semantic communications (SC). However, to accommodate the local data distribution and user preferences, AIs generally adapt to different domains, which fundamentally distorts the SC alignment. In this paper, we propose a zero-forget domain adaptation (ZFDA) framework to preserve SC alignment. To prevent the DA from changing substantial neural parameters of AI, we design sparse additive modifications (SAM) to the parameters, which can be efficiently stored and switched-off to restore the SC alignment. To optimize the SAM, we decouple it into tractable continuous variables and a binary mask, and then handle the binary mask by a score-based optimization. Experimental evaluations on a SC system for image transmissions validate that the proposed framework perfectly preserves the SC alignment with almost no loss of DA performance, even improved in some cases, at a cost of less than 1% of additional memory.
[AI-30] Marconi: Prefix Caching for the Era of Hybrid LLM s
链接: https://arxiv.org/abs/2411.19379
作者: Rui Pan,Zhuang Wang,Zhen Jia,Can Karakus,Luca Zancato,Tri Dao,Ravi Netravali,Yida Wang
关键词-EN: language modeling capabilities, Language Model serving, State Space Models, Large Language Model, practically supporting long
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4 \times higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.
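A simplified sketch of the scoring idea behind such admission/eviction policies, not Marconi's actual algorithm: rank cache entries by estimated compute savings per byte of memory with a small recency bonus, and evict the lowest-scoring entry when the cache is full. The scoring formula and weights are assumptions for illustration.

```python
# Toy prefix-cache policy: evict the entry with the worst savings-per-byte score.
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    prefix_tokens: int          # tokens whose computation a hit would skip
    memory_bytes: int           # footprint of the stored state
    last_used: float = field(default_factory=time.monotonic)

def score(entry: CacheEntry, now: float, recency_weight: float = 0.1) -> float:
    savings_per_byte = entry.prefix_tokens / max(entry.memory_bytes, 1)
    recency = 1.0 / (1.0 + now - entry.last_used)
    return savings_per_byte + recency_weight * recency

cache: dict[str, CacheEntry] = {}
CAPACITY = 3

def admit(key: str, entry: CacheEntry):
    if len(cache) >= CAPACITY:
        now = time.monotonic()
        victim = min(cache, key=lambda k: score(cache[k], now))
        del cache[victim]
    cache[key] = entry

for i in range(5):
    admit(f"seq{i}", CacheEntry(prefix_tokens=100 * (i + 1), memory_bytes=10_000))
print(sorted(cache))  # entries with the best savings-per-byte survive
```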
[AI-31] Integrating Transit Signal Priority into Multi-Agent Reinforcement Learning based Traffic Signal Control
链接: https://arxiv.org/abs/2411.19359
作者: Dickness Kakitahi Kwesiga,Suyash Chandra Vishnoi,Angshuman Guin,Michael Hunter
关键词-EN: Transit Signal Priority, integrates Transit Signal, study integrates Transit, multi-agent reinforcement learning, signal control based
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:This study integrates Transit Signal Priority (TSP) into multi-agent reinforcement learning (MARL) based traffic signal control. The first part of the study develops adaptive signal control based on MARL for a pair of coordinated intersections in a microscopic simulation environment. The two agents, one for each intersection, are centrally trained using a value decomposition network (VDN) architecture. The trained agents show slightly better performance compared to coordinated actuated signal control based on overall intersection delay at v/c of 0.95. In the second part of the study the trained signal control agents are used as background signal controllers while developing event-based TSP agents. In one variation, independent TSP agents are formulated and trained under a decentralized training and decentralized execution (DTDE) framework to implement TSP at each intersection. In the second variation, the two TSP agents are centrally trained under a centralized training and decentralized execution (CTDE) framework and VDN architecture to select and implement coordinated TSP strategies across the two intersections. In both cases the agents converge to the same bus delay value, but independent agents show high instability throughout the training process. For the test runs, the two independent agents reduce bus delay across the two intersections by 22% compared to the no TSP case while the coordinated TSP agents achieve 27% delay reduction. In both cases, there is only a slight increase in delay for a majority of the side street movements.
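A minimal sketch of the VDN-style centralised training used above: the joint action-value is the sum of the per-agent Q-values, so a single TD loss on the shared reward trains both intersection agents. Network sizes and the random toy batch are assumptions, not the paper's simulation setup.

```python
# VDN mixing: Q_tot = sum_i Q_i, trained with one joint TD loss.
import torch
import torch.nn as nn

class AgentQ(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)

agents = nn.ModuleList([AgentQ(), AgentQ()])   # one agent per intersection
opt = torch.optim.Adam(agents.parameters(), lr=1e-3)
gamma = 0.99

obs = torch.randn(32, 2, 8)                    # batch x agents x observation
actions = torch.randint(0, 4, (32, 2))
team_reward = torch.randn(32)
next_obs = torch.randn(32, 2, 8)

q_taken = torch.stack([agents[i](obs[:, i]).gather(1, actions[:, i:i+1]).squeeze(1)
                       for i in range(2)], dim=1)
q_joint = q_taken.sum(dim=1)                   # value decomposition
with torch.no_grad():
    next_q = sum(agents[i](next_obs[:, i]).max(dim=1).values for i in range(2))
    target = team_reward + gamma * next_q
loss = nn.functional.mse_loss(q_joint, target)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```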
[AI-32] Mapping Public Perception of Artificial Intelligence: Expectations Risk-Benefit Tradeoffs and Value As Determinants for Societal Acceptance
链接: https://arxiv.org/abs/2411.19356
作者: Philipp Brauner,Felix Glawe,Gian Luca Liehner,Luisa Vervier,Martina Ziefle
关键词-EN: influence innovation trajectories, shape policy decisions, successful market strategies, Understanding public perception, artificial intelligence
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:Understanding public perception of artificial intelligence (AI) and the tradeoffs between potential risks and benefits is crucial, as these perceptions might shape policy decisions, influence innovation trajectories for successful market strategies, and determine individual and societal acceptance of AI technologies. Using a representative sample of 1100 participants from Germany, this study examines mental models of AI. Participants quantitatively evaluated 71 statements about AI’s future capabilities (e.g., autonomous driving, medical care, art, politics, warfare, and societal divides), assessing the expected likelihood of occurrence, perceived risks, benefits, and overall value. We present rankings of these projections alongside visual mappings illustrating public risk-benefit tradeoffs. While many scenarios were deemed likely, participants often associated them with high risks, limited benefits, and low overall value. Across all scenarios, 96.4% ( r^2=96.4% ) of the variance in value assessment can be explained by perceived risks ( \beta=-.504 ) and perceived benefits ( \beta=+.710 ), with no significant relation to expected likelihood. Demographics and personality traits influenced perceptions of risks, benefits, and overall evaluations, underscoring the importance of increasing AI literacy and tailoring public information to diverse user needs. These findings provide actionable insights for researchers, developers, and policymakers by highlighting critical public concerns and individual factors essential to align AI development with individual values.
[AI-33] OMuleT: Orchestrating Multiple Tools for Practicable Conversational Recommendation
链接: https://arxiv.org/abs/2411.19352
作者: Se-eun Yoon,Xiaokai Wei,Yexi Jiang,Rachit Pareek,Frank Ong,Kevin Gao,Julian McAuley,Michelle Gong
关键词-EN: realistic conversational recommender, conversational recommender system, present a systematic, systematic effort, implement a realistic
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we present a systematic effort to design, evaluate, and implement a realistic conversational recommender system (CRS). The objective of our system is to allow users to input free-form text to request recommendations, and then receive a list of relevant and diverse items. While previous work on synthetic queries augments large language models (LLMs) with 1-3 tools, we argue that a more extensive toolbox is necessary to effectively handle real user requests. As such, we propose a novel approach that equips LLMs with over 10 tools, providing them access to the internal knowledge base and API calls used in production. We evaluate our model on a dataset of real users and show that it generates relevant, novel, and diverse recommendations compared to vanilla LLMs. Furthermore, we conduct ablation studies to demonstrate the effectiveness of using the full range of tools in our toolbox. We share our designs and lessons learned from deploying the system for internal alpha release. Our contribution addresses all four key aspects of a practicable CRS: (1) real user requests, (2) augmenting LLMs with a wide variety of tools, (3) extensive evaluation, and (4) deployment insights.
[AI-34] An Adversarial Learning Approach to Irregular Time-Series Forecasting NEURIPS2024
链接: https://arxiv.org/abs/2411.19341
作者: Heejeong Nam,Jihyun Kim,Jimin Yeom
关键词-EN: penalize unrealistic forecasts, significant challenges due, presents significant challenges, series presents significant, traditional error-based evaluation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to AdvML-Frontiers Workshop @ NeurIPS 2024
点击查看摘要
Abstract:Forecasting irregular time series presents significant challenges due to two key issues: the vulnerability of models to mean regression, driven by the noisy and complex nature of the data, and the limitations of traditional error-based evaluation metrics, which fail to capture meaningful patterns and penalize unrealistic forecasts. These problems result in forecasts that often misalign with human intuition. To tackle these challenges, we propose an adversarial learning framework with a deep analysis of adversarial components. Specifically, we emphasize the importance of balancing the modeling of global distribution (overall patterns) and transition dynamics (localized temporal changes) to better capture the nuances of irregular time series. Overall, this research provides practical insights for improving models and evaluation metrics, and pioneers the application of adversarial learning in the domain of irregular time-series forecasting.
[AI-35] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
链接: https://arxiv.org/abs/2411.19335
作者: Shenghui Li,Edith C.-H. Ngai,Fanghua Ye,Thiemo Voigt
关键词-EN: Pre-trained Language Models, Federated Parameter-Efficient Fine-Tuning, Federated Learning, Pre-trained Language, Federated Parameter-Efficient
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising paradigm for privacy-preserving and efficient adaptation of Pre-trained Language Models (PLMs) in Federated Learning (FL) settings. It preserves data privacy by keeping the data decentralized and training the model on local devices, ensuring that raw data never leaves the user’s device. Moreover, the integration of PEFT methods such as LoRA significantly reduces the number of trainable parameters compared to fine-tuning the entire model, thereby minimizing communication costs and computational overhead. Despite its potential, the security implications of FedPEFT remain underexplored. This paper introduces a novel security threat to FedPEFT, termed PEFT-as-an-Attack (PaaA), which exposes how PEFT can be exploited as an attack vector to circumvent PLMs’ safety alignment and generate harmful content in response to malicious prompts. Our evaluation of PaaA reveals that with less than 1% of the model’s parameters set as trainable, and a small subset of clients acting maliciously, the attack achieves an approximate 80% attack success rate using representative PEFT methods such as LoRA. To mitigate this threat, we further investigate potential defense strategies, including Robust Aggregation Schemes (RASs) and Post-PEFT Safety Alignment (PPSA). However, our empirical analysis highlights the limitations of these defenses, i.e., even the most advanced RASs, such as DnC and ClippedClustering, struggle to defend against PaaA in scenarios with highly heterogeneous data distributions. Similarly, while PPSA can reduce attack success rates to below 10%, it severely degrades the model’s accuracy on the target task. Our results underscore the urgent need for more effective defense mechanisms that simultaneously ensure security and maintain the performance of the FedPEFT paradigm.
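To make the "less than 1% trainable parameters" point above concrete, here is a generic LoRA-style adapter sketch (not the FedPEFT or attack code); the rank, layer dimensions, and scaling are assumptions for illustration.

```python
# A frozen base linear layer plus a low-rank trainable update: W x + (alpha/r) * B A x.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # well under 1% at rank 8
```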
[AI-36] Structured Object Language Modeling (SoLM): Native Structured Objects Generation Conforming to Complex Schemas with Self-Supervised Denoising
链接: https://arxiv.org/abs/2411.19301
作者: Amir Tavanaei,Kee Kiat Koo,Hayreddin Ceker,Shaobai Jiang,Qi Li,Julien Han,Karim Bouyarmane
关键词-EN: generating structured objects, Structured Object Language, Object Language Modeling, Language Modeling problem, complex schema
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we study the problem of generating structured objects that conform to a complex schema, with intricate dependencies between the different components (facets) of the object. The facets of the object (attributes, fields, columns, properties) can be a mix of short, structured, type-constrained facts, or long natural-language descriptions. The object has to be self-consistent between the different facets in the redundant information it carries (relative consistency), while being grounded with respect to world knowledge (absolute consistency). We frame the problem as a Language Modeling problem (Structured Object Language Modeling) and train an LLM to perform the task natively, without requiring instructions or prompt-engineering. We propose a self-supervised denoising method to train the model from an existing dataset of such objects. The input query can be the existing object itself (in which case the model acts as a regenerator, completing, correcting, and normalizing the input) or any unstructured blurb to be structured. We show that the self-supervised denoising training provides a strong baseline, and that additional supervised fine-tuning with a small amount of human demonstrations leads to further improvement. Experimental results show that the proposed method matches or outperforms prompt-engineered general-purpose state-of-the-art LLMs (Claude 3, Mixtral-8x7B), while being an order of magnitude more cost-efficient.
[AI-37] BPQP: A Differentiable Convex Optimization Framework for Efficient End-to-End Learning NEURIPS2024
链接: https://arxiv.org/abs/2411.19285
作者: Jianming Pan,Zeqi Ye,Xiao Yang,Xu Yang,Weiqing Liu,Lewen Wang,Jiang Bian
关键词-EN: Data-driven decision-making processes, learnable deep neural, render final decisions, decision-making processes increasingly, deep neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)
*备注: NeurIPS 2024 Spotlight
点击查看摘要
Abstract:Data-driven decision-making processes increasingly utilize end-to-end learnable deep neural networks to render final decisions. Sometimes, the output of the forward functions in certain layers is determined by the solutions to mathematical optimization problems, leading to the emergence of differentiable optimization layers that permit gradient back-propagation. However, real-world scenarios often involve large-scale datasets and numerous constraints, presenting significant challenges. Current methods for differentiating optimization problems typically rely on implicit differentiation, which necessitates costly computations on the Jacobian matrices, resulting in low efficiency. In this paper, we introduce BPQP, a differentiable convex optimization framework designed for efficient end-to-end learning. To enhance efficiency, we reformulate the backward pass as a simplified and decoupled quadratic programming problem by leveraging the structural properties of the KKT matrix. This reformulation enables the use of first-order optimization algorithms in calculating the backward pass gradients, allowing our framework to potentially utilize any state-of-the-art solver. As solver technologies evolve, BPQP can continuously adapt and improve its efficiency. Extensive experiments on both simulated and real-world datasets demonstrate that BPQP achieves a significant improvement in efficiency–typically an order of magnitude faster in overall execution time compared to other differentiable optimization layers. Our results not only highlight the efficiency gains of BPQP but also underscore its superiority over differentiable optimization layer baselines.
[AI-38] SmartLLM Sentry: A Comprehensive LLM Based Smart Contract Vulnerability Detection Framework
链接: https://arxiv.org/abs/2411.19234
作者: Oualid Zaazaa,Hanan El Bakkali
关键词-EN: managing digital assets, effective security measures, essential for managing, managing digital, digital assets
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Smart contracts are essential for managing digital assets in blockchain networks, highlighting the need for effective security measures. This paper introduces SmartLLMSentry, a novel framework that leverages large language models (LLMs), specifically ChatGPT with in-context training, to advance smart contract vulnerability detection. Traditional rule-based frameworks have limitations in integrating new detection rules efficiently. In contrast, SmartLLMSentry utilizes LLMs to streamline this process. We created a specialized dataset of five randomly selected vulnerabilities for model training and evaluation. Our results show an exact match accuracy of 91.1% with sufficient data, although GPT-4 demonstrated reduced performance compared to GPT-3 in rule generation. This study illustrates that SmartLLMSentry significantly enhances the speed and accuracy of vulnerability detection through LLM-driven rule integration, offering a new approach to improving blockchain security and addressing previously underexplored vulnerabilities in smart contracts.
[AI-39] Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG
链接: https://arxiv.org/abs/2411.19230
作者: Xinxu Wei,Kanhao Zhao,Yong Jiao,Nancy B. Carlisle,Hua Xie,Yu Zhang
关键词-EN: low-density EEG data, EEG data, Effectively utilizing extensive, EEG data presents, Graph Masked Autoencoder
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages
点击查看摘要
Abstract:Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this by framing it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled/labeled and high/low-density EEG data. To fully leverage the abundant unlabeled EEG data, we introduce a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates Graph Contrastive Pre-training and Graph Masked Autoencoder Pre-training. This approach synergistically combines contrastive and generative pre-training techniques by reconstructing contrastive samples and contrasting the reconstructions. For knowledge distillation from high-density to low-density EEG data, we propose a Graph Topology Distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data, effectively handling missing electrodes through contrastive distillation. To integrate transfer learning and distillation, we jointly pre-train the teacher and student models by contrasting their queries and keys during pre-training, enabling robust distillers for downstream tasks. We demonstrate the effectiveness of our method on four classification tasks across two clinical EEG datasets with abundant unlabeled data and limited labeled data. The experimental results show that our approach significantly outperforms contemporary methods in both efficiency and accuracy.
[AI-40] Habit Coach: Customising RAG-based chatbots to support behavior change
链接: https://arxiv.org/abs/2411.19229
作者: Arian Fooroogh Mand Arabi,Cansu Koyuturk,Michael O’Mahony,Raffaella Calati,Dimitri Ognibene
关键词-EN: GPT-based chatbot designed, Habit Coach, Cognitive Behavioral Therapy, paper presents, presents the iterative
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted for Italian Workshop on Artificial Intelligence for Human Machine Interaction (AIxHMI 2024), November 26, 2024, Bolzano, Italy
点击查看摘要
Abstract:This paper presents the iterative development of Habit Coach, a GPT-based chatbot designed to support users in habit change through personalized interaction. Employing a user-centered design approach, we developed the chatbot using a Retrieval-Augmented Generation (RAG) system, which enables behavior personalization without retraining the underlying language model (GPT-4). The system leverages document retrieval and specialized prompts to tailor interactions, drawing from Cognitive Behavioral Therapy (CBT) and narrative therapy techniques. A key challenge in the development process was the difficulty of translating declarative knowledge into effective interaction behaviors. In the initial phase, the chatbot was provided with declarative knowledge about CBT via reference textbooks and high-level conversational goals. However, this approach resulted in imprecise and inefficient behavior, as the GPT model struggled to convert static information into dynamic and contextually appropriate interactions. This highlighted the limitations of relying solely on declarative knowledge to guide chatbot behavior, particularly in nuanced, therapeutic conversations. Over four iterations, we addressed this issue by gradually transitioning towards procedural knowledge, refining the chatbot’s interaction strategies, and improving its overall effectiveness. In the final evaluation, 5 participants engaged with the chatbot over five consecutive days, receiving individualized CBT interventions. The Self-Report Habit Index (SRHI) was used to measure habit strength before and after the intervention, revealing a reduction in habit strength post-intervention. These results underscore the importance of procedural knowledge in driving effective, personalized behavior change support in RAG-based systems.
[AI-41] On the Unknowable Limits to Prediction
链接: https://arxiv.org/abs/2411.19223
作者: Jiani Yan,Charles Rahal
关键词-EN: short Correspondence critiques, short Correspondence, Correspondence critiques, irreducible components, differential speeds
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:This short Correspondence critiques the classic dichotomization of prediction error into reducible and irreducible components, noting that certain types of error can be eliminated at differential speeds. We propose an improved analytical framework that better distinguishes epistemic from aleatoric uncertainty, emphasizing that predictability depends on information sets and cautioning against premature claims of unpredictability.
[AI-42] On the Ethical Considerations of Generative Agents NEURIPS2024
链接: https://arxiv.org/abs/2411.19211
作者: N’yoma Diamond,Soumya Banerjee
关键词-EN: framework recently developed, Agents framework recently, developed by Park, Generative Agents framework, Generative Agents
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
*备注: Accepted (poster) to Socially Responsible Language Modelling Research (SoLaR) Workshop at NeurIPS 2024
点击查看摘要
Abstract:The Generative Agents framework recently developed by Park et al. has enabled numerous new technical solutions and problem-solving approaches. Academic and industrial interest in generative agents has been explosive as a result of the effectiveness of generative agents toward emulating human behaviour. However, it is necessary to consider the ethical challenges and concerns posed by this technique and its usage. In this position paper, we discuss the extant literature that evaluate the ethical considerations regarding generative agents and similar generative tools, and identify additional concerns of significant importance. We also suggest guidelines and necessary future research on how to mitigate some of the ethical issues and systemic risks associated with generative agents.
[AI-43] Convex Regularization and Convergence of Policy Gradient Flows under Safety Constraints
链接: https://arxiv.org/abs/2411.19193
作者: Pekka Malo,Lauri Viitasaari,Antti Suominen,Eeva Vilkkumaa,Olli Tahvonen
关键词-EN: studies reinforcement learning, paper studies reinforcement, infinite-horizon dynamic decision, dynamic decision processes, almost-sure safety constraints
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
*备注: 74 pages
点击查看摘要
Abstract:This paper studies reinforcement learning (RL) in infinite-horizon dynamic decision processes with almost-sure safety constraints. Such safety-constrained decision processes are central to applications in autonomous systems, finance, and resource management, where policies must satisfy strict, state-dependent constraints. We consider a doubly-regularized RL framework that combines reward and parameter regularization to address these constraints within continuous state-action spaces. Specifically, we formulate the problem as a convex regularized objective with parametrized policies in the mean-field regime. Our approach leverages recent developments in mean-field theory and Wasserstein gradient flows to model policies as elements of an infinite-dimensional statistical manifold, with policy updates evolving via gradient flows on the space of parameter distributions. Our main contributions include establishing solvability conditions for safety-constrained problems, defining smooth and bounded approximations that facilitate gradient flows, and demonstrating exponential convergence towards global solutions under sufficient regularization. We provide general conditions on regularization functions, encompassing standard entropy regularization as a special case. The results also enable a particle method implementation for practical RL applications. The theoretical insights and convergence guarantees presented here offer a robust framework for safe RL in complex, high-dimensional decision-making problems.
[AI-44] DESIRE: Dynamic Knowledge Consolidation for Rehearsal-Free Continual Learning
链接: https://arxiv.org/abs/2411.19154
作者: Haiyang Guo,Fei Zhu,Fanhu Zeng,Bing Liu,Xu-Yao Zhang
关键词-EN: Continual learning aims, previously learned knowledge, retain previously learned, Continual learning, aims to equip
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Continual learning aims to equip models with the ability to retain previously learned knowledge like a human. Recent work incorporating Parameter-Efficient Fine-Tuning has revitalized the field by introducing lightweight extension modules. However, existing methods usually overlook the issue of information leakage caused by the fact that the experiment data have been used in pre-trained models. Once these duplicate data are removed in the pre-training phase, their performance can be severely affected. In this paper, we propose a new LoRA-based rehearsal-free method named DESIRE. Our method avoids imposing additional constraints during training to mitigate catastrophic forgetting, thereby maximizing the learning of new classes. To integrate knowledge from old and new tasks, we propose two efficient post-processing modules. On the one hand, we retain only two sets of LoRA parameters for merging and propose dynamic representation consolidation to calibrate the merged feature representation. On the other hand, we propose decision boundary refinement to address classifier bias when training solely on new class data. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple datasets and strikes an effective balance between stability and plasticity. Our code will be publicly available.
[AI-45] PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
链接: https://arxiv.org/abs/2411.19114
作者: Gwangoo Yeo,Jiin Kim,Yujeong Choi,Minsoo Rhu
关键词-EN: NVIDIA Multi-Instance GPU, smaller GPU slices, multiple smaller GPU, NVIDIA Multi-Instance, GPU slices
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:NVIDIA’s Multi-Instance GPU (MIG) is a feature that enables system designers to reconfigure one large GPU into multiple smaller GPU slices. This work characterizes this emerging GPU and evaluates its effectiveness in designing high-performance AI inference servers. Our study reveals that the data preprocessing stage of AI inference causes significant performance bottlenecks to MIG. To this end, we present PREBA, which is a hardware/software co-design targeting MIG inference servers. Our first proposition is an FPGA-based data preprocessing accelerator that unlocks the full potential of MIG with domain-specific acceleration of data preprocessing. The MIG inference server unleashed from preprocessing overheads is then augmented with our dynamic batching system that enables high-performance inference. PREBA is implemented end-to-end in real systems, providing a 3.7x improvement in throughput, 3.4x reduction in tail latency, 3.5x improvement in energy-efficiency, and 3.0x improvement in cost-efficiency.
[AI-46] LADDER: Multi-objective Backdoor Attack via Evolutionary Algorithm
链接: https://arxiv.org/abs/2411.19075
作者: Dazhuang Liu,Yanqi Qiao,Rui Wang,Kaitai Liang,Georgios Smaragdakis
关键词-EN: convolutional neural networks, neural networks formulate, convolutional neural, neural networks, single domain
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Current black-box backdoor attacks in convolutional neural networks formulate attack objective(s) as single-objective optimization problems in a single domain. Designing triggers in a single domain harms semantics and trigger robustness, and introduces visual and spectral anomalies. This work proposes a multi-objective black-box backdoor attack in dual domains via evolutionary algorithm (LADDER), the first instance of achieving multiple attack objectives simultaneously by optimizing triggers without requiring prior knowledge about the victim model. In particular, we formulate LADDER as a multi-objective optimization problem (MOP) and solve it via a multi-objective evolutionary algorithm (MOEA). MOEA maintains a population of triggers with trade-offs among attack objectives and uses non-dominated sorting to drive triggers toward optimal solutions. We further apply preference-based selection to MOEA to exclude impractical triggers. LADDER also investigates a new dual-domain perspective for trigger stealthiness by minimizing the anomaly between clean and poisoned samples in the spectral domain. Lastly, robustness against preprocessing operations is achieved by pushing triggers to low-frequency regions. Extensive experiments comprehensively showcase that LADDER achieves attack effectiveness of at least 99%, attack robustness with 90.23% (50.09% higher than state-of-the-art attacks on average), superior natural stealthiness (1.12x to 196.74x improvement) and excellent spectral stealthiness (8.45x enhancement) as compared to current stealthy attacks by the average l_2-norm across 5 public datasets.
[AI-47] Using a Feedback Loop for LLM-based Infrastructure as Code Generation
链接: https://arxiv.org/abs/2411.19043
作者: Mayur Amarnath Palavalli,Mark Santolucito
关键词-EN: Large Language Models, increase software developer, software developer productivity, Language Models, Large Language
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 4 pages, submitted to and accepted by International Journal of Secondary Computing and Applications Research
点击查看摘要
Abstract:Code generation with Large Language Models (LLMs) has helped to increase software developer productivity in coding tasks, but has yet to have significant impact on the tasks of software developers that surround this code. In particular, the challenge of infrastructure management remains an open question. We investigate the ability of an LLM agent to construct infrastructure using the Infrastructure as Code (IaC) paradigm. We particularly investigate the use of a feedback loop that returns errors and warnings on the generated IaC to allow the LLM agent to improve the code. We find that, for each iteration of the loop, its effectiveness decreases exponentially until it plateaus at a certain point and becomes ineffective.
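As an illustration of the feedback loop described above, here is a minimal sketch; `llm_generate` and `validate_iac` are hypothetical stand-ins for the LLM call and the IaC validator, and this is not the authors' implementation.

```python
# Minimal sketch of the feedback-loop idea (not the authors' code).
# `llm_generate` and `validate_iac` are hypothetical stand-ins for an LLM call
# and an IaC validator (e.g., a linter or a `terraform validate`-style check).

def generate_with_feedback(task: str, llm_generate, validate_iac, max_iters: int = 5) -> str:
    """Iteratively regenerate IaC, feeding validator errors back into the prompt."""
    prompt = f"Write Infrastructure-as-Code for: {task}"
    code = llm_generate(prompt)
    for _ in range(max_iters):
        errors = validate_iac(code)          # list of error/warning strings
        if not errors:                       # stop once the IaC validates cleanly
            break
        feedback = "\n".join(errors)
        prompt = (
            f"The following IaC has problems:\n{code}\n"
            f"Errors/warnings:\n{feedback}\n"
            "Please return a corrected version."
        )
        code = llm_generate(prompt)
    return code
```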
[AI-48] Mars-PO: Multi-Agent Reasoning System Preference Optimization
链接: https://arxiv.org/abs/2411.19039
作者: Xiaoxuan Lou,Chaojie Wang,Bo An
关键词-EN: large language models, achieving high performance, language models, significant challenge, fundamental capability
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Mathematical reasoning is a fundamental capability for large language models (LLMs), yet achieving high performance in this domain remains a significant challenge. The auto-regressive generation process often makes LLMs susceptible to errors, hallucinations, and inconsistencies, particularly during multi-step reasoning. In this paper, we propose Mars-PO, a novel framework to improve the mathematical reasoning capabilities of LLMs through a multi-agent system. It combines high-quality outputs from multiple agents into a hybrid positive sample set and pairs them with agent-specific negative samples to construct robust preference pairs for training. By aligning agents with shared positive samples while addressing individual weaknesses, Mars-PO achieves substantial performance improvements on mathematical reasoning benchmarks. For example, it increases the accuracy on the MATH benchmark of the state-of-the-art instruction-tuned LLM, Llama3.1-8B-Instruct, from 50.38% to 57.82%. Experimental results further demonstrate that our method consistently outperforms other baselines, such as supervised fine-tuning, vanilla DPO, and its enhanced versions, highlighting the effectiveness of our approach.
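The pairing scheme described in the abstract can be sketched roughly as follows; `is_correct` is a hypothetical answer verifier, and the actual Mars-PO construction may differ in its details.

```python
# Hedged sketch of preference-pair construction: positives are pooled across agents,
# negatives stay agent-specific, as suggested by the abstract.

def build_preference_pairs(samples_by_agent: dict, is_correct) -> list:
    """samples_by_agent: {agent_name: [solution strings]} for one math problem."""
    # Hybrid positive set: correct solutions pooled from all agents.
    positives = [s for sols in samples_by_agent.values() for s in sols if is_correct(s)]
    pairs = []
    for agent, sols in samples_by_agent.items():
        negatives = [s for s in sols if not is_correct(s)]  # this agent's own mistakes
        # Pair each shared positive with each of this agent's negatives.
        pairs.extend(
            {"agent": agent, "chosen": pos, "rejected": neg}
            for pos in positives
            for neg in negatives
        )
    return pairs
```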
[AI-49] A Unified Platform for At-Home Post-Stroke Rehabilitation Enabled by Wearable Technologies and Artificial Intelligence
链接: https://arxiv.org/abs/2411.19000
作者: Chenyu Tang,Ruizhi Zhang,Shuo Gao,Zihe Zhao,Zibo Zhang,Jiaqi Wang,Cong Li,Junliang Chen,Yanning Dai,Shengbo Wang,Ruoyu Juan,Qiaoying Li,Ruimou Xie,Xuhang Chen,Xinkai Zhou,Yunjia Xia,Jianan Chen,Fanghao Lu,Xin Li,Ninglli Wang,Peter Smielewski,Yu Pan,Hubin Zhao,Luigi G. Occhipinti
关键词-EN: presents significant challenges, post-stroke patients presents, patients presents significant, At-home rehabilitation, significant challenges
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 5 figures, 35 references
点击查看摘要
Abstract:At-home rehabilitation for post-stroke patients presents significant challenges, as continuous, personalized care is often limited outside clinical settings. Additionally, the absence of comprehensive solutions addressing diverse rehabilitation needs in home environments complicates recovery efforts. Here, we introduce a smart home platform that integrates wearable sensors, ambient monitoring, and large language model (LLM)-powered assistance to provide seamless health monitoring and intelligent support. The system leverages machine learning enabled plantar pressure arrays for motor recovery assessment (94% classification accuracy), a wearable eye-tracking module for cognitive evaluation, and ambient sensors for precise smart home control (100% operational success, 1 s latency). Additionally, the LLM-powered agent, Auto-Care, offers real-time interventions, such as health reminders and environmental adjustments, enhancing user satisfaction by 29%. This work establishes a fully integrated platform for long-term, personalized rehabilitation, offering new possibilities for managing chronic conditions and supporting aging populations.
[AI-50] NeuroLifting: Neural Inference on Markov Random Fields at Scale
链接: https://arxiv.org/abs/2411.18954
作者: Yaomin Wang,Chaolong Ying,Xiaodong Luo,Tianshu Yu
关键词-EN: Markov Random Fields, large-scale Markov Random, Markov Random, Random Fields, challenging task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Inference in large-scale Markov Random Fields (MRFs) is a critical yet challenging task, traditionally approached through approximate methods like belief propagation and mean field, or exact methods such as the Toulbar2 solver. These strategies often fail to strike an optimal balance between efficiency and solution quality, particularly as the problem scale increases. This paper introduces NeuroLifting, a novel technique that leverages Graph Neural Networks (GNNs) to reparameterize decision variables in MRFs, facilitating the use of standard gradient descent optimization. By extending traditional lifting techniques into a non-parametric neural network framework, NeuroLifting benefits from the smooth loss landscape of neural networks, enabling efficient and parallelizable optimization. Empirical results demonstrate that, on moderate scales, NeuroLifting performs very close to the exact solver Toulbar2 in terms of solution quality, significantly surpassing existing approximate methods. Notably, on large-scale MRFs, NeuroLifting delivers superior solution quality against all baselines, as well as exhibiting linear computational complexity growth. This work presents a significant advancement in MRF inference, offering a scalable and effective solution for large-scale problems.
[AI-51] Knowledge Database or Poison Base? Detecting RAG Poisoning Attack through LLM Activations
链接: https://arxiv.org/abs/2411.18948
作者: Xue Tan,Hao Luan,Mingyu Luo,Xiaoyan Sun,Ping Chen,Jun Dai
关键词-EN: Large Language Models, Language Models, Large Language, real-world applications, ensuring the security
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) are progressively deployed across diverse fields and real-world applications, ensuring the security and robustness of LLMs has become ever more critical. Retrieval-Augmented Generation (RAG) is a cutting-edge approach designed to address the limitations of large language models (LLMs). By retrieving information from the relevant knowledge database, RAG enriches the input to LLMs, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker’s target response (also called poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge the gap in this work. Particularly, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs’ activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show our approach could achieve 98% true positive rate, while maintaining false positive rates close to 1%. We also evaluate recent backdoor detection methods specifically designed for LLMs and applicable for identifying poisoned responses in RAG. The results demonstrate that our approach significantly surpasses them.
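A rough sketch of the general idea of probing LLM activations to separate clean from poisoned responses is shown below; activation extraction is assumed to happen upstream, and this is not RevPRAG's actual pipeline.

```python
# Illustrative only: train a simple probe on hidden activations to distinguish
# clean responses from poisoned ones. Activation matrices are assumed to be
# extracted upstream from the LLM.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_activation_probe(clean_acts: np.ndarray, poisoned_acts: np.ndarray):
    """clean_acts / poisoned_acts: (n_samples, hidden_dim) activation matrices."""
    X = np.vstack([clean_acts, poisoned_acts])
    y = np.concatenate([np.zeros(len(clean_acts)), np.ones(len(poisoned_acts))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out accuracy:", probe.score(X_te, y_te))
    return probe
```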
[AI-52] Federated Continual Graph Learning
链接: https://arxiv.org/abs/2411.18919
作者: Yinlin Zhu,Xunkai Li,Miao Hu,Di Wu
关键词-EN: data poses substantial, poses substantial challenges, substantial challenges due, managing evolving graph, continual graph learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
*备注: Under Review
点击查看摘要
Abstract:In the era of big data, managing evolving graph data poses substantial challenges due to storage costs and privacy issues. Training graph neural networks (GNNs) on such evolving data usually causes catastrophic forgetting, impairing performance on earlier tasks. Despite existing continual graph learning (CGL) methods mitigating this to some extent, they predominantly operate in centralized architectures and overlook the potential of distributed graph databases to harness collective intelligence for enhanced performance optimization. To address these challenges, we present a pioneering study on Federated Continual Graph Learning (FCGL), which adapts GNNs to multiple evolving graphs within decentralized settings while adhering to storage and privacy constraints. Our work begins with a comprehensive empirical analysis of FCGL, assessing its data characteristics, feasibility, and effectiveness, and reveals two principal challenges: local graph forgetting (LGF), where local GNNs forget prior knowledge when adapting to new tasks, and global expertise conflict (GEC), where the global GNN exhibits sub-optimal performance in both adapting to new tasks and retaining old ones, arising from inconsistent client expertise during server-side parameter aggregation. To tackle these, we propose the POWER framework, which mitigates LGF by preserving and replaying experience nodes with maximum local-global coverage at each client and addresses GEC by using a pseudo prototype reconstruction strategy and trajectory-aware knowledge transfer at the central server. Extensive evaluations across multiple graph datasets demonstrate POWER’s superior performance over straightforward federated extensions of the centralized CGL algorithms and vision-focused federated continual learning algorithms. Our code is available at this https URL.
[AI-53] A Comprehensive Survey of Reinforcement Learning: From Algorithms to Practical Challenges
链接: https://arxiv.org/abs/2411.18892
作者: Majid Ghasemi,Amir Hossein Mousavi,Dariush Ebrahimi
关键词-EN: Artificial Intelligence, learn optimal behaviors, Deep Reinforcement Learning, Reinforcement Learning, paradigm in Artificial
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 79 pages
点击查看摘要
Abstract:Reinforcement Learning (RL) has emerged as a powerful paradigm in Artificial Intelligence (AI), enabling agents to learn optimal behaviors through interactions with their environments. Drawing from the foundations of trial and error, RL equips agents to make informed decisions through feedback in the form of rewards or penalties. This paper presents a comprehensive survey of RL, meticulously analyzing a wide range of algorithms, from foundational tabular methods to advanced Deep Reinforcement Learning (DRL) techniques. We categorize and evaluate these algorithms based on key criteria such as scalability, sample efficiency, and suitability. We compare the methods in the form of their strengths and weaknesses in diverse settings. Additionally, we offer practical insights into the selection and implementation of RL algorithms, addressing common challenges like convergence, stability, and the exploration-exploitation dilemma. This paper serves as a comprehensive reference for researchers and practitioners aiming to harness the full potential of RL in solving complex, real-world problems.
[AI-54] An Integrated Artificial Intelligence Operating System for Advanced Low-Altitude Aviation Applications
链接: https://arxiv.org/abs/2411.18845
作者: Minzhe Tan,Xinlin Fan,Jian He,Yi Hou,Zhan Liu,Yaopeng Jiang,YM Jiang
关键词-EN: integrating cutting-edge technologies, comprehensive artificial intelligence, artificial intelligence operating, operating system tailored, intelligence operating system
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
*备注:
点击查看摘要
Abstract:This paper introduces a comprehensive artificial intelligence operating system tailored for low-altitude aviation applications, integrating cutting-edge technologies for enhanced performance, safety, and efficiency. The system comprises six core components: OrinFlight OS, a high-performance operating system optimized for real-time task execution; UnitedVision, a versatile visual processing module supporting advanced image analysis; UnitedSense, a multi-sensor fusion module providing precise environmental modeling; UnitedNavigator, a dynamic path-planning and navigation system; UnitedMatrix, enabling multi-drone coordination and task execution; and UnitedInSight, a ground station for monitoring and management. Complemented by the UA DevKit low-code platform, the system facilitates user-friendly customization and application development. Leveraging NVIDIA Orin’s computational power and advanced AI algorithms, this system addresses complex challenges in modern aviation, offering robust solutions for navigation, perception, and collaborative operations. This work highlights the system’s architecture, features, and potential applications, demonstrating its ability to meet the demands of intelligent aviation environments.
[AI-55] Unifying Generative and Dense Retrieval for Sequential Recommendation
链接: https://arxiv.org/abs/2411.18814
作者: Liu Yang,Fabian Paischer,Kaveh Hassani,Jiacheng Li,Shuai Shao,Zhang Gabriel Li,Yun He,Xue Feng,Nima Noorshams,Sem Park,Bo Long,Robert D Nowak,Xiaoli Gao,Hamid Eghbalzadeh
关键词-EN: utilize advanced sequence, advanced sequence learning, sequence learning techniques, Sequential dense retrieval, models utilize advanced
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Sequential dense retrieval models utilize advanced sequence learning techniques to compute item and user representations, which are then used to rank relevant items for a user through inner product computation between the user and all item representations. However, this approach requires storing a unique representation for each item, resulting in significant memory requirements as the number of items grows. In contrast, the recently proposed generative retrieval paradigm offers a promising alternative by directly predicting item indices using a generative model trained on semantic IDs that encapsulate items’ semantic information. Despite its potential for large-scale applications, a comprehensive comparison between generative retrieval and sequential dense retrieval under fair conditions is still lacking, leaving open questions regarding performance and computation trade-offs. To address this, we compare these two approaches under controlled conditions on academic benchmarks and propose LIGER (LeveragIng dense retrieval for GEnerative Retrieval), a hybrid model that combines the strengths of these two widely used methods. LIGER integrates sequential dense retrieval into generative retrieval, mitigating performance differences and enhancing cold-start item recommendation in the datasets evaluated. This hybrid approach provides insights into the trade-offs between these approaches and demonstrates improvements in efficiency and effectiveness for recommendation systems in small-scale benchmarks.
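As a rough illustration of how dense and generative scores might be blended in a hybrid ranker, consider the sketch below; the weighting scheme is an assumption for illustration and is not LIGER's actual integration.

```python
# Hypothetical blend of the two paradigms the paper compares: dense inner-product
# scores and generative (semantic-ID) scores for the same candidate items.
import numpy as np

def hybrid_rank(user_emb: np.ndarray, item_embs: np.ndarray,
                generative_scores: np.ndarray, alpha: float = 0.5, k: int = 10):
    """Return indices of the top-k items under a weighted blend of the two scores."""
    dense_scores = item_embs @ user_emb                   # inner-product retrieval
    # Normalize both score vectors so the blend is scale-free.
    z = lambda s: (s - s.mean()) / (s.std() + 1e-8)
    blended = alpha * z(dense_scores) + (1 - alpha) * z(generative_scores)
    return np.argsort(-blended)[:k]
```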
[AI-56] The Performance of the LSTM-based Code Generated by Large Language Models (LLMs) in Forecasting Time Series Data
链接: https://arxiv.org/abs/2411.18731
作者: Saroj Gopali,Sima Siami-Namini,Faranak Abri,Akbar Siami Namin
关键词-EN: deep learning models, conducting automated scientific, complex deep learning, scientific data analysis, optimizing complex deep
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:An intriguing question is how well the machine and deep learning models generated by LLMs perform in automated scientific data analysis, where a data analyst may lack the expertise to manually code and optimize complex deep learning models and may therefore opt to have LLMs generate the required models. This paper investigates and compares the performance of mainstream LLMs, such as ChatGPT, PaLM, LLaMA, and Falcon, in generating deep learning models for analyzing time series data, an important and popular data type with prevalent applications in many domains, including finance and the stock market. The research conducts a set of controlled experiments in which the prompts for generating deep learning-based models are varied with respect to sensitivity levels of four criteria: 1) Clarity and Specificity, 2) Objective and Intent, 3) Contextual Information, and 4) Format and Style. While the results are relatively mixed, we observe some distinct patterns. Using LLMs, we are able to generate deep learning-based models with executable code for each dataset separately, whose performance is comparable with manually crafted and optimized LSTM models for predicting the whole time series dataset. We also notice that ChatGPT outperforms the other LLMs in generating more accurate models. Furthermore, the goodness of the generated models varies with the "temperature" parameter used when configuring the LLMs. The results can be beneficial for data analysts and practitioners who would like to leverage generative AI to produce good prediction models with acceptable quality.
[AI-57] Timing Matters: Enhancing User Experience through Temporal Prediction in Smart Homes
链接: https://arxiv.org/abs/2411.18719
作者: Shrey Ganatra,Spandan Anaokar,Pushpak Bhattacharyya
关键词-EN: Internet of Things, perform using IoT, considered the sheer, sheer volume, user
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages + 1 reference, 5 figures, 5 tables
点击查看摘要
Abstract:Have you ever considered the sheer volume of actions we perform using IoT (Internet of Things) devices within our homes, offices, and daily environments? From the mundane act of flicking a light switch to the precise adjustment of room temperatures, we are surrounded by a wealth of data, each representing a glimpse into user behaviour. While existing research has sought to decipher user behaviours from these interactions and their timestamps, a critical dimension still needs to be explored: the timing of these actions. Despite extensive efforts to understand and forecast user behaviours, the temporal dimension of these interactions has received scant attention. However, the timing of actions holds profound implications for user experience, efficiency, and overall satisfaction with intelligent systems. In our paper, we venture into the less-explored realm of human-centric AI by endeavoring to predict user actions and their timing. To achieve this, we contribute a meticulously synthesized dataset comprising 11k sequences of actions paired with their respective date and time stamps. Building upon this dataset, we propose our model, which employs advanced machine learning techniques for k-class classification over time intervals within a day. To the best of our knowledge, this is the first attempt at time prediction for smart homes. We achieve a 40% (96-class) accuracy across all datasets and an 80% (8-class) accuracy on the dataset containing exact timestamps, showcasing the efficacy of our approach in predicting the temporal dynamics of user actions within smart environments.
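The k-class setup described above is consistent with splitting the day into equal-width bins (96 classes would correspond to 15-minute bins, 8 classes to 3-hour bins); a minimal sketch of that label construction, under this assumption, is:

```python
# Sketch of the label construction implied by the abstract: a day split into k
# equal intervals. The binning itself is an assumption shown for illustration only.
from datetime import datetime

def time_to_class(ts: datetime, k: int = 96) -> int:
    """Map a timestamp to one of k equal-width intervals within the day."""
    minutes_into_day = ts.hour * 60 + ts.minute
    bin_width = 24 * 60 / k
    return int(minutes_into_day // bin_width)

# Example: 18:40 falls into bin 74 when k=96 (each bin is 15 minutes wide).
print(time_to_class(datetime(2024, 11, 30, 18, 40), k=96))
```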
[AI-58] Explainable deep learning improves human mental models of self-driving cars
链接: https://arxiv.org/abs/2411.18714
作者: Eoin M. Kenny,Akshay Dharmavaram,Sang Uk Lee,Tung Phan-Minh,Shreyas Rajesh,Yunqing Hu,Laura Major,Momchil S. Tomov,Julie A. Shah
关键词-EN: achieve human-like driving, deep neural networks, cars increasingly rely, Self-driving cars increasingly, black-box motion planners
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: * - equal contribution
点击查看摘要
Abstract:Self-driving cars increasingly rely on deep neural networks to achieve human-like driving. However, the opacity of such black-box motion planners makes it challenging for the human behind the wheel to accurately anticipate when they will fail, with potentially catastrophic consequences. Here, we introduce concept-wrapper network (i.e., CW-Net), a method for explaining the behavior of black-box motion planners by grounding their reasoning in human-interpretable concepts. We deploy CW-Net on a real self-driving car and show that the resulting explanations refine the human driver’s mental model of the car, allowing them to better predict its behavior and adjust their own behavior accordingly. Unlike previous work using toy domains or simulations, our study presents the first real-world demonstration of how to build authentic autonomous vehicles (AVs) that give interpretable, causally faithful explanations for their decisions, without sacrificing performance. We anticipate our method could be applied to other safety-critical systems with a human in the loop, such as autonomous drones and robotic surgeons. Overall, our study suggests a pathway to explainability for autonomous agents as a whole, which can help make them more transparent, their deployment safer, and their usage more ethical.
[AI-59] Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students
链接: https://arxiv.org/abs/2411.18708
作者: Tiffany Zhu,Kexun Zhang,William Yang Wang
关键词-EN: impressive essay writing, impressive essay, essay writing, writing and problem-solving, OpenAI ChatGPT
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 6 main pages, 5 figures
点击查看摘要
Abstract:The impressive essay writing and problem-solving capabilities of large language models (LLMs) like OpenAI’s ChatGPT have opened up new avenues in education. Our goal is to gain insights into the widespread use of LLMs among secondary students to inform their future development. Despite school restrictions, our survey of over 300 middle and high school students revealed that a remarkable 70% of students have utilized LLMs, higher than the usage percentage among young adults, and this percentage remains consistent across 7th to 12th grade. Students also reported using LLMs for multiple subjects, including language arts, history, and math assignments, but expressed mixed thoughts on their effectiveness due to occasional hallucinations in historical contexts and incorrect answers for lack of rigorous reasoning. The survey feedback called for LLMs better adapted for students, and also raised questions to developers and educators on how to help students from underserved communities leverage LLMs’ capabilities for equal access to advanced education resources. We propose a few ideas to address such issues, including subject-specific models, personalized learning, and AI classrooms.
[AI-60] Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
链接: https://arxiv.org/abs/2411.18688
作者: Soumya Suvra Ghosal,Souradip Chakraborty,Vaibhav Singh,Tianrui Guan,Mengdi Wang,Ahmad Beirami,Furong Huang,Alvaro Velasquez,Dinesh Manocha,Amrit Singh Bedi
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, deployment of Multimodal
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks: carefully crafted image-prompt pairs that compel the model to generate harmful content. In this work, we first highlight a critical safety gap, demonstrating that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model during decoding to defend against jailbreak attacks. Additionally, we provide a rigorous mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model’s original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.
[AI-61] Embodied Red Teaming for Auditing Robotic Foundation Models
链接: https://arxiv.org/abs/2411.18676
作者: Sathwik Karnik,Zhang-Wei Hong,Nishant Abhangi,Yen-Chen Lin,Tsun-Hsuan Wang,Pulkit Agrawal
关键词-EN: Language-conditioned robot models, Language-conditioned robot, robotic foundation models, enable robots, robotic foundation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Language-conditioned robot models (i.e., robotic foundation models) enable robots to perform a wide range of tasks based on natural language instructions. Despite strong performance on existing benchmarks, evaluating the safety and effectiveness of these models is challenging due to the complexity of testing all possible language variations. Current benchmarks have two key limitations: they rely on a limited set of human-generated instructions, missing many challenging cases, and they focus only on task performance without assessing safety, such as avoiding damage. To address these gaps, we introduce Embodied Red Teaming (ERT), a new evaluation method that generates diverse and challenging instructions to test these models. ERT uses automated red teaming techniques with Vision Language Models (VLMs) to create contextually grounded, difficult instructions. Experimental results show that state-of-the-art models frequently fail or behave unsafely on ERT tests, underscoring the shortcomings of current benchmarks in evaluating real-world performance and safety. Code and videos are available at: this https URL.
[AI-62] ScaleViz: Scaling Visualization Recommendation Models on Large Data PAKDD2024
链接: https://arxiv.org/abs/2411.18657
作者: Ghazi Shazan Ahmad,Shubham Agarwal,Subrata Mitra,Ryan Rossi,Manav Doshi,Vibhor Porwal,Syam Manoj Kumar Paila
关键词-EN: Automated visualization recommendations, derive crucial insights, automated vis-rec models, derive crucial, visualization recommendations
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
*备注: Accepted at PAKDD 2024 (Oral)
点击查看摘要
Abstract:Automated visualization recommendations (vis-rec) help users to derive crucial insights from new datasets. Typically, such automated vis-rec models first calculate a large number of statistics from the datasets and then use machine-learning models to score or classify multiple visualization choices and recommend the most effective ones, as per the statistics. However, state-of-the-art models rely on a very large number of expensive statistics, so using such models on large datasets becomes infeasible due to prohibitively long computation time, limiting the effectiveness of such techniques on most real-world complex and large datasets. In this paper, we propose a novel reinforcement-learning (RL) based framework that takes a given vis-rec model and a time-budget from the user and identifies the best set of input statistics that would be most effective while generating the visual insights within the given time budget, using the given model. Using two state-of-the-art vis-rec models applied on three large real-world datasets, we show the effectiveness of our technique in significantly reducing time-to-visualize with a very small amount of introduced error. Our approach is about 10X faster compared to the baseline approaches that introduce similar amounts of error.
[AI-63] PRSI: Privacy-Preserving Recommendation Model Based on Vector Splitting and Interactive Protocols
链接: https://arxiv.org/abs/2411.18653
作者: Xiaokai Cao,Wenjin Mo,Zhenyu He,Changdong Wang
关键词-EN: recommending interesting products, highly valuable research, valuable research topic, Federated Recommendation Systems, recommending interesting
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the development of the internet, recommending interesting products to users has become a highly valuable research topic for businesses. Recommendation systems play a crucial role in addressing this issue. To prevent the leakage of each user’s (client’s) private data, Federated Recommendation Systems (FedRec) have been proposed and widely used. However, extensive research has shown that FedRec suffers from security issues such as data privacy leakage, and it is challenging to train effective models with FedRec when each client only holds interaction information for a single user. To address these two problems, this paper proposes a new privacy-preserving recommendation system (PRSI), which includes a preprocessing module and two main phases. The preprocessing module employs split vectors and fake interaction items to protect clients’ interaction information and recommendation results. The two main phases are: (1) the collection of interaction information and (2) the sending of recommendation results. In the interaction information collection phase, each client uses the preprocessing module and random communication methods (according to the designed interactive protocol) to protect their ID information and IP addresses. In the recommendation results sending phase, the central server uses the preprocessing module and triplets to distribute recommendation results to each client under secure conditions, following the designed interactive protocol. Finally, we conducted multiple sets of experiments to verify the security, accuracy, and communication cost of the proposed method.
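The "split vector" idea can be illustrated with a generic additive-splitting sketch; this is an assumption-level illustration, not PRSI's exact protocol.

```python
# Hedged illustration: an interaction vector is split into random additive shares
# so that no single share reveals the original vector on its own.
import numpy as np

def split_vector(v: np.ndarray, n_shares: int = 2, rng=None):
    """Return n_shares arrays that sum (element-wise) back to v."""
    rng = rng or np.random.default_rng()
    shares = [rng.normal(size=v.shape) for _ in range(n_shares - 1)]
    shares.append(v - sum(shares))          # last share makes the sum exact
    return shares

v = np.array([1.0, 0.0, 3.0])
shares = split_vector(v, n_shares=3)
assert np.allclose(sum(shares), v)          # reconstruction check
```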
[AI-64] Dynamic Logistic Ensembles with Recursive Probability and Automatic Subset Splitting for Enhanced Binary Classification
链接: https://arxiv.org/abs/2411.18649
作者: Mohammad Zubair Khan,David Li
关键词-EN: paper presents, dynamic logistic ensemble, dynamic logistic, binary classification, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 Pages, 2024 IEEE 15th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). Published in the Proceedings of UEMCON 2024, ©2024 IEEE
点击查看摘要
Abstract:This paper presents a novel approach to binary classification using dynamic logistic ensemble models. The proposed method addresses the challenges posed by datasets containing inherent internal clusters that lack explicit feature-based separations. By extending traditional logistic regression, we develop an algorithm that automatically partitions the dataset into multiple subsets, constructing an ensemble of logistic models to enhance classification accuracy. A key innovation in this work is the recursive probability calculation, derived through algebraic manipulation and mathematical induction, which enables scalable and efficient model construction. Compared to traditional ensemble methods such as Bagging and Boosting, our approach maintains interpretability while offering competitive performance. Furthermore, we systematically employ maximum likelihood and cost functions to facilitate the analytical derivation of recursive gradients as functions of ensemble depth. The effectiveness of the proposed approach is validated on a custom dataset created by introducing noise and shifting data to simulate group structures, resulting in significant performance improvements as ensemble layers are added. Implemented in Python, this work balances computational efficiency with theoretical rigor, providing a robust and interpretable solution for complex classification tasks with broad implications for machine learning applications. Code at this https URL
[AI-65] MADE: Graph Backdoor Defense with Masked Unlearning
链接: https://arxiv.org/abs/2411.18648
作者: Xiao Lin,Mingjie Li,Yisen Wang
关键词-EN: social network analysis, Graph Neural Networks, Neural Networks, network analysis, social network
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 10 figures
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have garnered significant attention from researchers due to their outstanding performance in handling graph-related tasks, such as social network analysis, protein design, and so on. Despite their widespread application, recent research has demonstrated that GNNs are vulnerable to backdoor attacks, implemented by injecting triggers into the training datasets. Trained on the poisoned data, GNNs will predict target labels when attaching trigger patterns to inputs. This vulnerability poses significant security risks for applications of GNNs in sensitive domains, such as drug discovery. While there has been extensive research into backdoor defenses for images, strategies to safeguard GNNs against such attacks remain underdeveloped. Furthermore, we point out that conventional backdoor defense methods designed for images cannot work well when directly implemented on graph data. In this paper, we first analyze the key difference between image backdoor and graph backdoor attacks. Then we tackle the graph defense problem by presenting a novel approach called MADE, which devises an adversarial mask generation mechanism that selectively preserves clean sub-graphs and further leverages masks on edge weights to eliminate the influence of triggers effectively. Extensive experiments across various graph classification tasks demonstrate the effectiveness of MADE in significantly reducing the attack success rate (ASR) while maintaining a high classification accuracy.
[AI-66] Enhancing Project Performance Forecasting using Machine Learning Techniques
链接: https://arxiv.org/abs/2411.17914
作者: Soheila Sadeghi
关键词-EN: urban road reconstruction, road reconstruction project, delivering urban road, project performance metrics, Work Breakdown Structure
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:Accurate forecasting of project performance metrics is crucial for successfully managing and delivering urban road reconstruction projects. Traditional methods often rely on static baseline plans and fail to consider the dynamic nature of project progress and external factors. This research proposes a machine learning-based approach to forecast project performance metrics, such as cost variance and earned value, for each Work Breakdown Structure (WBS) category in an urban road reconstruction project. The proposed model utilizes time series forecasting techniques, including Autoregressive Integrated Moving Average (ARIMA) and Long Short-Term Memory (LSTM) networks, to predict future performance based on historical data and project progress. The model also incorporates external factors, such as weather patterns and resource availability, as features to enhance the accuracy of forecasts. By applying the predictive power of machine learning, the performance forecasting model enables proactive identification of potential deviations from the baseline plan, which allows project managers to take timely corrective actions. The research aims to validate the effectiveness of the proposed approach using a case study of an urban road reconstruction project, comparing the model’s forecasts with actual project performance data. The findings of this research contribute to the advancement of project management practices in the construction industry, offering a data-driven solution for improving project performance monitoring and control.
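A minimal illustration of the ARIMA component on a synthetic cost-variance series (using statsmodels; the paper's actual features, data, and model selection differ):

```python
# Illustration only: fit an ARIMA model to a synthetic monthly cost-variance series
# and forecast the next few periods. Orders and data are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic monthly cost-variance series: mild drift plus noise.
cv = pd.Series(np.cumsum(rng.normal(0.5, 2.0, size=36)),
               index=pd.date_range("2021-01-01", periods=36, freq="MS"))

model = ARIMA(cv, order=(1, 1, 1))          # (p, d, q) chosen for illustration
fit = model.fit()
print(fit.forecast(steps=3))                # next three periods' cost variance
```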
[AI-67] Field Assessment of Force Torque Sensors for Planetary Rover Navigation
链接: https://arxiv.org/abs/2411.04700
作者: Levin Gerdes,Carlos Pérez del Pulgar,Raúl Castilla Arquillo,Martin Azkarate
关键词-EN: Proprioceptive sensors, planetary rovers serve, serve for state, state estimation, locomotion performance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Proprioceptive sensors on planetary rovers serve for state estimation and for understanding terrain and locomotion performance. While inertial measurement units (IMUs) are widely used to this effect, force-torque sensors are less explored for planetary navigation despite their potential to directly measure interaction forces and provide insights into traction performance. This paper presents an evaluation of the performance and use cases of force-torque sensors based on data collected from a six-wheeled rover during tests over varying terrains, speeds, and slopes. We discuss challenges, such as sensor signal reliability and terrain response accuracy, and identify opportunities regarding the use of these sensors. The data is openly accessible and includes force-torque measurements from each of the six-wheel assemblies as well as IMU data from within the rover chassis. This paper aims to inform the design of future studies and rover upgrades, particularly in sensor integration and control algorithms, to improve navigation capabilities.
[AI-68] Enhanced anomaly detection in well log data through the application of ensemble GANs
链接: https://arxiv.org/abs/2411.19875
作者: Abdulrahman Al-Fakih,A. Koeshidayatullah,Tapan Mukerji,SanLinn I. Kaka
关键词-EN: generative adversarial networks, modeling data distributions, shown significant success, log data, remains relatively underexplored
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Although generative adversarial networks (GANs) have shown significant success in modeling data distributions for image datasets, their application to structured or tabular data, such as well logs, remains relatively underexplored. This study extends the ensemble GANs (EGANs) framework to capture the distribution of well log data and detect anomalies that fall outside of these distributions. The proposed approach compares the performance of traditional methods, such as Gaussian mixture models (GMMs), with EGANs in detecting anomalies outside the expected data distributions. For the gamma ray (GR) dataset, EGANs achieved a precision of 0.62 and F1 score of 0.76, outperforming GMM’s precision of 0.38 and F1 score of 0.54. Similarly, for travel time (DT), EGANs achieved a precision of 0.70 and F1 score of 0.79, surpassing GMM’s 0.56 and 0.71. In the neutron porosity (NPHI) dataset, EGANs recorded a precision of 0.53 and F1 score of 0.68, outshining GMM’s 0.47 and 0.61. For the bulk density (RHOB) dataset, EGANs achieved a precision of 0.52 and an F1 score of 0.67, slightly outperforming GMM, which yielded a precision of 0.50 and an F1 score of 0.65. This work’s novelty lies in applying EGANs to well log data analysis, showcasing their ability to learn data patterns and identify anomalies that deviate from them. This approach offers more reliable anomaly detection compared to traditional methods like GMM. The findings highlight the potential of EGANs in enhancing anomaly detection for well log data, delivering significant implications for optimizing drilling strategies and reservoir management through more accurate, data-driven insights into subsurface characterization.
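For context, the GMM baseline side of the comparison can be sketched as follows (the EGAN side is omitted; the threshold choice is illustrative, not taken from the paper):

```python
# Sketch of the GMM baseline: fit a mixture to "normal" log readings and flag
# low-likelihood samples as anomalies.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_anomaly_flags(train_logs: np.ndarray, test_logs: np.ndarray,
                      n_components: int = 3, pct: float = 1.0) -> np.ndarray:
    """Return a boolean mask marking test samples with unusually low likelihood."""
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(train_logs)
    train_ll = gmm.score_samples(train_logs)        # per-sample log-likelihood
    threshold = np.percentile(train_ll, pct)        # illustrative cutoff
    return gmm.score_samples(test_logs) < threshold
```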
[AI-69] Scaling Transformers for Low-Bitrate High-Quality Speech Coding
链接: https://arxiv.org/abs/2411.19842
作者: Julian D Parker,Anton Smirnov,Jordi Pons,CJ Carr,Zack Zukowski,Zach Evans,Xubo Liu
关键词-EN: neural audio codec, audio codec models, multimodal context, Finite Scalar Quantization, neural audio
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context. Traditionally such tokenization models have concentrated on low parameter-count architectures using only components with strong inductive biases. In this work we show that by scaling a transformer architecture with large parameter count to this problem, and applying a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bit-rates of 400 or 700 bits-per-second. The trained models strongly out-perform existing baselines in both objective and subjective tests.
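The FSQ bottleneck idea can be shown in isolation with a short sketch; the level counts and the omitted straight-through gradient are illustrative simplifications, not the paper's codec.

```python
# Sketch of a Finite Scalar Quantization (FSQ) style bottleneck: each latent channel
# is squashed to a bounded range and rounded onto a small grid of levels. An odd
# level count keeps the grid symmetric; the transformer encoder/decoder and
# straight-through training used in the paper are not shown.
import numpy as np

def fsq(z: np.ndarray, levels_per_channel: int = 5) -> np.ndarray:
    """Quantize each channel of z onto `levels_per_channel` values in [-1, 1]."""
    half = (levels_per_channel - 1) / 2.0
    z_bounded = np.tanh(z)                       # squash into (-1, 1)
    return np.round(z_bounded * half) / half     # snap to the finite grid

codes = fsq(np.random.randn(4, 6))
print(np.unique(codes))                          # at most 5 distinct values appear
```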
[AI-70] Density-Calibrated Conformal Quantile Regression
链接: https://arxiv.org/abs/2411.19523
作者: Yuan Lu
关键词-EN: Conformal Quantile Regression, Quantile Regression, Density-Calibrated Conformal Quantile, Conformal Quantile, feature space
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:This paper introduces the Density-Calibrated Conformal Quantile Regression (CQR-d) method, a novel approach for constructing prediction intervals that adapts to varying uncertainty across the feature space. Building upon conformal quantile regression, CQR-d incorporates local information through a weighted combination of local and global conformity scores, where the weights are determined by local data density. We prove that CQR-d provides valid marginal coverage at level 1 - \alpha - \epsilon, where \epsilon represents a small tolerance from numerical optimization. Through extensive simulation studies and an application to a heteroscedastic dataset available in R, we demonstrate that CQR-d maintains the desired coverage while producing substantially narrower prediction intervals compared to standard conformal quantile regression (CQR). Notably, in our application on heteroscedastic data, CQR-d achieves an 8.6% reduction in average interval width while maintaining comparable coverage. The method’s effectiveness is particularly pronounced in settings with clear local uncertainty patterns, making it a valuable tool for prediction tasks in heterogeneous data environments.
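For reference, the standard split-CQR procedure that CQR-d builds on looks roughly like the sketch below; the quantile models and split sizes are illustrative choices, and the density-based weighting itself is not reproduced.

```python
# Sketch of vanilla split-CQR: fit lower/upper quantile regressors, compute
# conformity scores on a calibration split, and widen the interval by their
# (1 - alpha)-quantile.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def cqr_interval(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_train, y_train)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_train, y_train)
    # Conformity scores on the calibration split.
    scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
    n = len(y_cal)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return lo.predict(X_test) - q, hi.predict(X_test) + q
```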
[AI-71] Concept-driven Off Policy Evaluation
链接: https://arxiv.org/abs/2411.19395
作者: Ritam Majumdar,Jack Teversham,Sonali Parbhoo
关键词-EN: Evaluating off-policy decisions, batch data poses, data poses significant, poses significant challenges, significant challenges due
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 37 pages, 10 figures
点击查看摘要
Abstract:Evaluating off-policy decisions using batch data poses significant challenges due to limited sample sizes leading to high variance. To improve Off-Policy Evaluation (OPE), we must identify and address the sources of this variance. Recent research on Concept Bottleneck Models (CBMs) shows that using human-explainable concepts can improve predictions and provide better understanding. We propose incorporating concepts into OPE to reduce variance. Our work introduces a family of concept-based OPE estimators, proving that they remain unbiased and reduce variance when concepts are known and predefined. Since real-world applications often lack predefined concepts, we further develop an end-to-end algorithm to learn interpretable, concise, and diverse parameterized concepts optimized for variance reduction. Our experiments with synthetic and real-world datasets show that both known and learned concept-based estimators significantly improve OPE performance. Crucially, we show that, unlike other OPE methods, concept-based estimators are easily interpretable and allow for targeted interventions on specific concepts, further enhancing the quality of these estimators.
[AI-72] Contrastive representations of high-dimensional structured treatments
链接: https://arxiv.org/abs/2411.19245
作者: Oriol Corcoll Andreu,Athanasios Vlontzos,Michael O’Riordan,Ciaran M. Gilligan-Lee
关键词-EN: Estimating causal effects, causal effect estimation, Estimating causal, effect estimation, decision making
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Estimating causal effects is vital for decision making. In standard causal effect estimation, treatments are usually binary- or continuous-valued. However, in many important real-world settings, treatments can be structured, high-dimensional objects, such as text, video, or audio. This provides a challenge to traditional causal effect estimation. While leveraging the shared structure across different treatments can help generalize to unseen treatments at test time, we show in this paper that using such structure blindly can lead to biased causal effect estimation. We address this challenge by devising a novel contrastive approach to learn a representation of the high-dimensional treatments, and prove that it identifies underlying causal factors and discards non-causally relevant factors. We prove that this treatment representation leads to unbiased estimates of the causal effect, and empirically validate and benchmark our results on synthetic and real-world datasets.
[AI-73] Beautimeter: Harnessing GPT for Assessing Architectural and Urban Beauty based on the 15 Properties of Living Structure
链接: https://arxiv.org/abs/2411.19094
作者: Bin Jiang
关键词-EN: generative pre-trained transformer, pre-trained transformer, designed to evaluate, powered by generative, generative pre-trained
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figure, and two tables
点击查看摘要
Abstract:Beautimeter is a new tool powered by generative pre-trained transformer (GPT) technology, designed to evaluate architectural and urban beauty. Rooted in Christopher Alexander’s theory of centers, this work builds on the idea that all environments possess, to varying degrees, an innate sense of life. Alexander identified 15 fundamental properties, such as levels of scale and thick boundaries, that characterize living structure, which Beautimeter uses as a basis for its analysis. By integrating GPT’s advanced natural language processing capabilities, Beautimeter assesses the extent to which a structure embodies these 15 properties, enabling a nuanced evaluation of architectural and urban aesthetics. Using ChatGPT, the tool helps users generate insights into the perceived beauty and coherence of spaces. We conducted a series of case studies, evaluating images of architectural and urban environments, as well as carpets, paintings, and other artifacts. The results demonstrate Beautimeter’s effectiveness in analyzing aesthetic qualities across diverse contexts. Our findings suggest that by leveraging GPT technology, Beautimeter offers architects, urban planners, and designers a powerful tool to create spaces that resonate deeply with people. This paper also explores the implications of such technology for architecture and urban design, highlighting its potential to enhance both the design process and the assessment of built environments. Keywords: Living structure, structural beauty, Christopher Alexander, AI in Design, human centered design
[AI-74] GRU-PFG: Extract Inter-Stock Correlation from Stock Factors with Graph Neural Network
链接: https://arxiv.org/abs/2411.18997
作者: Yonggai Zhuang,Haoran Chen,Kequan Wang,Teng Fei
关键词-EN: industries presents challenges, stock factors, stock, industries presents, presents challenges
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
*备注: 17pages
点击查看摘要
Abstract:The complexity of stocks and industries presents challenges for stock prediction. Currently, stock prediction models can be divided into two categories. One category, represented by GRU and ALSTM, relies solely on stock factors for prediction, with limited effectiveness. The other category, represented by HIST and TRA, incorporates not only stock factors but also industry information, industry financial reports, public sentiment, and other inputs for prediction. The second category of models can capture correlations between stocks by introducing additional information, but the extra data is difficult to standardize and generalize. Considering the current state and limitations of these two types of models, this paper proposes the GRU-PFG (Project Factors into Graph) model. This model only takes stock factors as input and extracts inter-stock correlations using graph neural networks. It achieves prediction results that not only outperform other models relying solely on stock factors, but are also comparable to the second category of models. The experimental results show that on the CSI300 dataset, the IC of GRU-PFG is 0.134, outperforming HIST’s 0.131 and significantly surpassing GRU and Transformer, achieving results better than the second category of models. Moreover, as a model that relies solely on stock factors, it has greater potential for generalization.
[AI-75] Redesigning the ensemble Kalman filter with a dedicated model of epistemic uncertainty
链接: https://arxiv.org/abs/2411.18864
作者: Chatchuea Kimchaiwong,Jeremie Houssineau,Adam M. Johansen
关键词-EN: observations received serially, ensemble Kalman filters, ensemble Kalman, standard ensemble Kalman, incorporating information
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The problem of incorporating information from observations received serially in time is widespread in the field of uncertainty quantification. Within a probabilistic framework, such problems can be addressed using standard filtering techniques. However, in many real-world problems, some (or all) of the uncertainty is epistemic, arising from a lack of knowledge, and is difficult to model probabilistically. This paper introduces a possibilistic ensemble Kalman filter designed for this setting and characterizes some of its properties. Using possibility theory to describe epistemic uncertainty is appealing from a philosophical perspective, and it is easy to justify certain heuristics often employed in standard ensemble Kalman filters as principled approaches to capturing uncertainty within it. The possibilistic approach motivates a robust mechanism for characterizing uncertainty which shows good performance with small sample sizes, and can outperform standard ensemble Kalman filters at given sample size, even when dealing with genuinely aleatoric uncertainty.
[AI-76] RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data
链接: https://arxiv.org/abs/2411.18822
作者: Maxwell A. Xu,Jaya Narain,Gregory Darnell,Haraldur Hallgrimsson,Hyewon Jeong,Darren Forde,Richard Fineman,Karthik J. Raghuram,James M. Rehg,Shirley Ren
关键词-EN: softened contrastive loss, learnable distance measure, contrastive learning approach, present RelCon
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present RelCon, a novel self-supervised Relative Contrastive learning approach that uses a learnable distance measure in combination with a softened contrastive loss for training a motion foundation model from wearable sensors. The learnable distance measure captures motif similarity and domain-specific semantic information such as rotation invariance. The learned distance provides a measurement of semantic similarity between a pair of accelerometer time-series segments, which is used to measure the distance between an anchor and various other sampled candidate segments. The self-supervised model is trained on 1 billion segments from 87,376 participants from a large wearables dataset. The model achieves strong performance across multiple downstream tasks, encompassing both classification and regression. To our knowledge, we are the first to show the generalizability of a self-supervised learning model with motion data from wearables across distinct evaluation tasks.
[AI-77] The Return of Pseudosciences in Artificial Intelligence: Have Machine Learning and Deep Learning Forgotten Lessons from Statistics and History?
链接: https://arxiv.org/abs/2411.18656
作者: Jérémie Sublime
关键词-EN: achieved seemingly exceptional, seemingly exceptional performance, Machine Learning, today world, range of tasks
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In today's world, AI programs powered by Machine Learning are ubiquitous, and have achieved seemingly exceptional performance across a broad range of tasks, from medical diagnosis and credit rating in banking, to theft detection via video analysis, and even predicting political or sexual orientation from facial images. These predominantly deep learning methods excel due to their extraordinary capacity to process vast amounts of complex data to extract complex correlations and relationships from different levels of features. In this paper, we contend that the designers and final users of these ML methods have forgotten a fundamental lesson from statistics: correlation does not imply causation. Not only do most state-of-the-art methods neglect this crucial principle, but by doing so they often produce nonsensical or flawed causal models, akin to social astrology or physiognomy. Consequently, we argue that current efforts to make AI models more ethical by merely reducing biases in the training data are insufficient. Through examples, we will demonstrate that the potential for harm posed by these methods can only be mitigated by a complete rethinking of their core models, improved quality assessment metrics and policies, and by maintaining human oversight throughout the process.
机器学习
[LG-0] Scalable Out-of-distribution Robustness in the Presence of Unobserved Confounders
链接: https://arxiv.org/abs/2411.19923
作者: Parjanya Prashant,Seyedeh Baharan Khatami,Bruno Ribeiro,Babak Salimi
关键词-EN: text, OOD generalization differs, unobserved confounder, OOD, label shift
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 3 figures
点击查看摘要
Abstract:We consider the task of out-of-distribution (OOD) generalization, where the distribution shift is due to an unobserved confounder Z affecting both the covariates X and the labels Y. In this setting, traditional assumptions of covariate and label shift are unsuitable due to the confounding, which introduces heterogeneity in the predictor, i.e., \hat{Y} = f_Z(X). OOD generalization differs from traditional domain adaptation by not assuming access to the covariate distribution X^{\text{te}} of the test samples during training. These conditions create a challenging scenario for OOD robustness: (a) Z^{\text{tr}} is an unobserved confounder during training, (b) P^{\text{te}}_Z \neq P^{\text{tr}}_Z, (c) X^{\text{te}} is unavailable during training, and (d) the posterior predictive distribution depends on P^{\text{te}}(Z), i.e., \hat{Y} = E_{P^{\text{te}}(Z)}[f_Z(X)]. In general, accurate predictions are unattainable in this scenario, and existing literature has proposed complex predictors based on identifiability assumptions that require multiple additional variables. Our work investigates a set of identifiability assumptions that tremendously simplify the predictor, whose resulting elegant simplicity outperforms existing approaches.
[LG-1] Noncommutative Model Selection and the Data-Driven Estimation of Real Cohomology Groups
链接: https://arxiv.org/abs/2411.19894
作者: Araceli Guzmán-Tristán,Antonio Rieser,Eduardo Velázquez-Richards
关键词-EN: compact metric-measure space, real cohomology groups, completely data-driven methods, metric-measure space, compact metric-measure
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注: 15 pages, sequel to “Noncommutative Model Selection for Data Clustering and Dimension Reduction Using Relative von Neumann Entropy”
点击查看摘要
Abstract:We propose three completely data-driven methods for estimating the real cohomology groups H^k(X; \mathbb{R}) of a compact metric-measure space (X, d_X, \mu_X) embedded in a metric-measure space (Y, d_Y, \mu_Y), given a finite set of points S sampled from a uniform distribution \mu_X on X, possibly corrupted with noise from Y. We present the results of several computational experiments in the case that X is embedded in \mathbb{R}^n, where two of the three algorithms performed well.
[LG-2] Open source Differentiable ODE Solving Infrastructure
链接: https://arxiv.org/abs/2411.19882
作者: Rakshit Kr. Singh,Aaron Rock Menezes,Rida Irfan,Bharath Ramsundar
关键词-EN: Ordinary Differential Equations, including reaction kinetics, Ordinary Differential, Differential Equations, including reaction
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Ordinary Differential Equations (ODEs) are widely used in physics, chemistry, and biology to model dynamic systems, including reaction kinetics, population dynamics, and biological processes. In this work, we integrate GPU-accelerated ODE solvers into the open-source DeepChem framework, making these tools easily accessible. These solvers support multiple numerical methods and are fully differentiable, enabling easy integration into more complex differentiable programs. We demonstrate the capabilities of our implementation through experiments on Lotka-Volterra predator-prey dynamics, pharmacokinetic compartment models, neural ODEs, and solving PDEs using reaction-diffusion equations. Our solvers achieved high accuracy with mean squared errors ranging from 10^{-4} to 10^{-6} and showed scalability in solving large systems with up to 100 compartments.
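The abstract describes fully differentiable solvers but shows no code. As a minimal sketch of what differentiating through an ODE solve looks like (a hand-rolled fixed-step RK4 in PyTorch on the Lotka-Volterra system, not DeepChem's actual API), the parameter values and step counts below are illustrative assumptions:

```python
# Minimal sketch: a differentiable fixed-step RK4 integrator on the
# Lotka-Volterra predator-prey system. This is NOT the DeepChem API from
# the paper, only an illustration of differentiable ODE solving.
import torch

def lotka_volterra(state, params):
    x, y = state[0], state[1]
    alpha, beta, delta, gamma = params
    dx = alpha * x - beta * x * y
    dy = delta * x * y - gamma * y
    return torch.stack([dx, dy])

def rk4_integrate(f, y0, params, t0, t1, steps=200):
    h = (t1 - t0) / steps
    y = y0
    for _ in range(steps):
        k1 = f(y, params)
        k2 = f(y + 0.5 * h * k1, params)
        k3 = f(y + 0.5 * h * k2, params)
        k4 = f(y + h * k3, params)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

params = torch.tensor([1.1, 0.4, 0.1, 0.4], requires_grad=True)  # illustrative rates
y0 = torch.tensor([10.0, 5.0])
y_final = rk4_integrate(lotka_volterra, y0, params, 0.0, 10.0)
loss = ((y_final - torch.tensor([8.0, 4.0])) ** 2).sum()
loss.backward()            # gradients flow through the whole trajectory
print(y_final.detach(), params.grad)
```

Because the whole trajectory is built from differentiable tensor operations, the gradient with respect to the rate parameters comes out of a single backward pass, which is the property that makes such solvers easy to embed in larger differentiable programs.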
[LG-3] GradAlign for Training-free Model Performance Inference
链接: https://arxiv.org/abs/2411.19819
作者: Yuxuan Li,Yunhui Guo
关键词-EN: Training-free NAS, Neural Tangent Kernel, plays an important, important role, role in deciding
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Architecture plays an important role in deciding the performance of deep neural networks. However, the search for the optimal architecture is often hindered by the vast search space, making it a time-intensive process. Recently, a novel approach known as training-free neural architecture search (NAS) has emerged, aiming to discover the ideal architecture without necessitating extensive training. Training-free NAS leverages various indicators for architecture selection, including metrics such as the count of linear regions, the density of per-sample losses, and the stability of the finite-width Neural Tangent Kernel (NTK) matrix. Despite the competitive empirical performance of current training-free NAS techniques, they suffer from certain limitations, including inconsistent performance and a lack of deep understanding. In this paper, we introduce GradAlign, a simple yet effective method designed for inferring model performance without the need for training. At its core, GradAlign quantifies the extent of conflicts within per-sample gradients during initialization, as substantial conflicts hinder model convergence and ultimately result in worse performance. We evaluate GradAlign against established training-free NAS methods using standard NAS benchmarks, showing a better overall performance. Moreover, we show that the widely adopted metric of linear region count may not suffice as a dependable criterion for selecting network architectures at initialization.
[LG-4] Rethinking the initialization of Momentum in Federated Learning with Heterogeneous Data
链接: https://arxiv.org/abs/2411.19798
作者: Chenguang Xiao,Shuo Wang
关键词-EN: Federated Learning performance, Data Heterogeneity, Momentum Federated Learning, Federated Learning, Reversed Momentum Federated
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Data heterogeneity is a major challenge for Federated Learning performance. Recently, momentum-based optimization techniques have been proven effective in mitigating the heterogeneity issue. Along with the model updates, the momentum updates are transmitted to the server side and aggregated. Therefore, the local training initialized with a global momentum is guided by the global history of the gradients. However, we spot a problem in the traditional accumulation of the momentum which is suboptimal in Federated Learning systems. The momentum weights historical gradients less and recent gradients more, which, however, engages more biased local gradients at the end of the local training. In this work, we propose a new way to calculate the estimated momentum used in local initialization. The proposed method is named Reversed Momentum Federated Learning (RMFL). The key idea is to assign exponentially decayed weights to the gradients as time moves forward, which is the opposite of the traditional momentum accumulation. The effectiveness of RMFL is evaluated on three popular benchmark datasets with different heterogeneity levels.
[LG-5] Tractable Agreement Protocols
链接: https://arxiv.org/abs/2411.19791
作者: Natalie Collina,Surbhi Goel,Varun Gupta,Aaron Roth
关键词-EN: enabling collaboration, reduction that converts, achieve consensus, Aumann Agreement Theorem, interactive protocol
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:We present an efficient reduction that converts any machine learning algorithm into an interactive protocol, enabling collaboration with another party (e.g., a human) to achieve consensus on predictions and improve accuracy. This approach imposes calibration conditions on each party, which are computationally and statistically tractable relaxations of Bayesian rationality. These conditions are sensible even in prior-free settings, representing a significant generalization of Aumann's classic "agreement theorem." In our protocol, the model first provides a prediction. The human then responds by either agreeing or offering feedback. The model updates its state and revises its prediction, while the human may adjust their beliefs. This iterative process continues until the two parties reach agreement. Initially, we study a setting that extends Aumann's Agreement Theorem, where parties aim to agree on a one-dimensional expectation by iteratively sharing their current estimates. Here, we recover the convergence theorem of Aaronson'05 under weaker assumptions. We then address the case where parties hold beliefs over distributions with d outcomes, exploring two feedback mechanisms. The first involves vector-valued estimates of predictions, while the second adopts a decision-theoretic approach: the human, needing to take an action from a finite set based on utility, communicates their utility-maximizing action at each round. In this setup, the number of rounds until agreement remains independent of d. Finally, we generalize to scenarios with more than two parties, where computational complexity scales linearly with the number of participants. Our protocols rely on simple, efficient conditions and produce predictions that surpass the accuracy of any individual party alone.
[LG-6] Riemannian Denoising Score Matching for Molecular Structure Optimization with Accurate Energy
链接: https://arxiv.org/abs/2411.19769
作者: Jeheon Woo,Seonghwan Kim,Jun Hyeong Kim,Woo Youn Kim
关键词-EN: Riemannian score matching, modified score matching, score matching, score matching method, generating molecular structures
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
点击查看摘要
Abstract:This study introduces a modified score matching method aimed at generating molecular structures with high energy accuracy. The denoising process of score matching or diffusion models mirrors molecular structure optimization, where scores act like physical force fields that guide particles toward equilibrium states. To achieve energetically accurate structures, it can be advantageous to have the score closely approximate the gradient of the actual potential energy surface. Unlike conventional methods that simply design the target score based on structural differences in Euclidean space, we propose a Riemannian score matching approach. This method represents molecular structures on a manifold defined by physics-informed internal coordinates to efficiently mimic the energy landscape, and performs noising and denoising within this space. Our method has been evaluated by refining several types of starting structures on the QM9 and GEOM datasets, demonstrating that the proposed Riemannian score matching method significantly improves the accuracy of the generated molecular structures, attaining chemical accuracy. The implications of this study extend to various applications in computational chemistry, offering a robust tool for accurate molecular structure prediction.
[LG-7] A Note on Small Percolating Sets on Hypercubes via Generative AI
链接: https://arxiv.org/abs/2411.19734
作者: Gergely Bérczi,Adam Zsolt Wagner
关键词-EN: pattern-recognition technique called, technique called PatternBoost, study bootstrap percolation, apply a generative, generative AI pattern-recognition
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注:
点击查看摘要
Abstract:We apply a generative AI pattern-recognition technique called PatternBoost to study bootstrap percolation on hypercubes. With this, we slightly improve the best existing upper bound for the size of percolating subsets of the hypercube.
[LG-8] Risk-Averse Certification of Bayesian Neural Networks
链接: https://arxiv.org/abs/2411.19729
作者: Xiyue Zhang,Zifan Wang,Yulong Gao,Licio Romao,Alessandro Abate,Marta Kwiatkowska
关键词-EN: deep learning models, incorporating risk measures, real-world environments, learning models, dynamic nature
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In light of the inherently complex and dynamic nature of real-world environments, incorporating risk measures is crucial for the robustness evaluation of deep learning models. In this work, we propose a Risk-Averse Certification framework for Bayesian neural networks called RAC-BNN. Our method leverages sampling and optimisation to compute a sound approximation of the output set of a BNN, represented using a set of template polytopes. To enhance robustness evaluation, we integrate a coherent distortion risk measure–Conditional Value at Risk (CVaR)–into the certification framework, providing probabilistic guarantees based on empirical distributions obtained through sampling. We validate RAC-BNN on a range of regression and classification benchmarks and compare its performance with a state-of-the-art method. The results show that RAC-BNN effectively quantifies robustness under worst-performing risky scenarios, and achieves tighter certified bounds and higher efficiency in complex tasks.
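CVaR at level alpha is just the mean of the worst alpha-fraction of outcomes, which is easy to estimate from samples. The snippet below is a generic, self-contained illustration of that empirical estimator, not the RAC-BNN certification pipeline; the sample distribution is made up for the example:

```python
# Generic empirical CVaR (expected shortfall) of a loss sample -- an
# illustration of the risk measure used in the paper, not the RAC-BNN code.
import numpy as np

def empirical_cvar(losses, alpha=0.1):
    """Mean of the worst alpha-fraction of losses (higher loss = worse)."""
    losses = np.sort(np.asarray(losses))[::-1]       # descending order
    k = max(1, int(np.ceil(alpha * len(losses))))
    return losses[:k].mean()

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=10_000)  # e.g. sampled robustness losses
print("mean loss :", samples.mean())
print("CVaR(0.1) :", empirical_cvar(samples, alpha=0.1))
```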
[LG-9] Relative Representations of Latent Spaces enable Efficient Semantic Channel Equalization
链接: https://arxiv.org/abs/2411.19719
作者: Tomás Hüttebräucker,Simone Fiorellino,Mohamed Sana,Paolo Di Lorenzo,Emilio Calvanese Strinati
关键词-EN: language mismatche poses, trained agents interact, independently trained agents, mismatche poses, poses a significant
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In multi-user semantic communication, language mismatch poses a significant challenge when independently trained agents interact. We present a novel semantic equalization algorithm that enables communication between agents with different languages without additional retraining. Our algorithm is based on relative representations, a framework that enables different agents employing different neural network models to have a unified representation. It proceeds by projecting the latent vectors of different models into a common space defined relative to a set of data samples called anchors, whose number equals the dimension of the resulting space. Communication between different agents then translates to exchanging semantic symbols sampled from this relative space. This approach, in addition to aligning the semantic representations of different agents, allows compressing the amount of information being exchanged, by appropriately selecting the number of anchors. Finally, we introduce a novel anchor selection strategy, which advantageously determines prototypical anchors, capturing the most relevant information for the downstream task. Our numerical results show the effectiveness of the proposed approach allowing seamless communication between agents with radically different models, including differences in terms of neural network architecture and datasets used for initial training.
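The core projection step, re-expressing each latent vector by its similarity to a set of anchor samples, can be sketched independently of any particular model. The cosine-similarity choice and the rotation test below are illustrative assumptions, not necessarily the exact construction used in the paper:

```python
# Sketch of a relative-representation projection: each latent vector is
# re-expressed by its cosine similarity to a fixed set of anchor latents.
# Illustrative only; the paper's exact similarity/anchor selection may differ.
import numpy as np

def relative_representation(latents, anchors):
    """latents: (n, d), anchors: (k, d) -> (n, k) relative coordinates."""
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return z @ a.T                          # cosine similarity to each anchor

rng = np.random.default_rng(0)
# Two "agents" whose latent spaces differ by a random rotation.
d, k, n = 16, 8, 5
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
anchors_a = rng.normal(size=(k, d))
latents_a = rng.normal(size=(n, d))
anchors_b = anchors_a @ rotation            # same data, rotated latent space
latents_b = latents_a @ rotation

rel_a = relative_representation(latents_a, anchors_a)
rel_b = relative_representation(latents_b, anchors_b)
print(np.allclose(rel_a, rel_b))            # True: relative coords match across agents
```

The final check illustrates why the shared relative space helps with mismatched models: an orthogonal change of latent basis leaves the anchor similarities unchanged.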
[LG-10] Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems COLING2025
链接: https://arxiv.org/abs/2411.19710
作者: Rafael Teixeira de Lima(1),Shubham Gupta(1),Cesar Berrospi(2),Lokesh Mishra(2),Michele Dolfi(2),Peter Staar(2),Panagiotis Vagenas(2) ((1) IBM Research Paris-Saclay, (2) IBM Research Zurich)
关键词-EN: Large Language Models, Language Models, Large Language, Retrieval Augmented Generation, application of Large
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: to be published in the 31st International Conference on Computational Linguistics (COLING 2025)
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist empowering developers to build their own systems, measuring their performance locally, with datasets reflective of the system’s use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (QA) datasets to assess retrieval performance can lead to non-optimal systems design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate QA datasets. We believe that these observations are invaluable to the know-your-data step of RAG systems development.
[LG-11] Fast Mutual Information Computation for Large Binary Datasets
链接: https://arxiv.org/abs/2411.19702
作者: Andre O. Falcao
关键词-EN: quantifies shared information, natural language processing, Mutual Information, powerful statistical measure, high-dimensional data analysis
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:Mutual Information (MI) is a powerful statistical measure that quantifies shared information between random variables, particularly valuable in high-dimensional data analysis across fields like genomics, natural language processing, and network science. However, computing MI becomes computationally prohibitive for large datasets, where a pairwise computational approach, comparing each column to the others, is typically required. This work introduces a matrix-based algorithm that accelerates MI computation by leveraging vectorized operations and optimized matrix calculations. By transforming traditional pairwise computational approaches into bulk matrix operations, the proposed method enables efficient MI calculation across all variable pairs. Experimental results demonstrate significant performance improvements, with computation times reduced up to 50,000 times in the largest dataset using optimized implementations, particularly when utilizing hardware-optimized frameworks. The approach promises to expand MI's applicability in data-driven research by overcoming previous computational limitations.
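As a rough sketch of the matrix-based idea (bulk matrix products replacing an explicit loop over column pairs), the NumPy snippet below computes mutual information between all column pairs of a binary matrix at once. It illustrates the approach only and is not the authors' optimized implementation:

```python
# Sketch: pairwise mutual information for all columns of a binary matrix
# via bulk matrix products instead of an explicit double loop.
# Illustrative; not the authors' optimized implementation.
import numpy as np

def pairwise_mi_binary(X):
    """X: (n_samples, n_features) binary matrix -> (n_features, n_features) MI in nats."""
    X = X.astype(np.float64)
    n = X.shape[0]
    counts = {
        (1, 1): X.T @ X,
        (1, 0): X.T @ (1 - X),
        (0, 1): (1 - X).T @ X,
        (0, 0): (1 - X).T @ (1 - X),
    }
    p1 = X.mean(axis=0)                        # marginal P(column = 1)
    marg = {1: p1, 0: 1 - p1}
    mi = np.zeros((X.shape[1], X.shape[1]))
    for (a, b), c in counts.items():
        p_ab = c / n
        denom = np.outer(marg[a], marg[b])
        with np.errstate(divide="ignore", invalid="ignore"):
            term = p_ab * np.log(p_ab / denom)
        mi += np.where(p_ab > 0, term, 0.0)    # convention: 0 * log 0 = 0
    return mi

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50))
M = pairwise_mi_binary(X)
print(M.shape, M[0, 0])                        # diagonal entries ~ column entropies
```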
[LG-12] Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation WSDM2025
链接: https://arxiv.org/abs/2411.19678
作者: Siqing Zhang,Yuchen Ding,Wei Tang,Wei Sun,Yong Liao,Peng Yuan Zhou
关键词-EN: inadequately explored question, federated recommendation systems, stringent privacy constraints, explored question, federated recommendation
类目: Machine Learning (cs.LG)
*备注: accepted by WSDM 2025
点击查看摘要
Abstract:Under stringent privacy constraints, whether federated recommendation systems can achieve group fairness remains an inadequately explored question. Taking gender fairness as a representative issue, we identify three phenomena in federated recommendation systems: performance difference, data imbalance, and preference disparity. We discover that the state-of-the-art methods only focus on the first phenomenon. Consequently, their imposition of inappropriate fairness constraints detrimentally affects the model training. Moreover, due to insufficient sensitive attribute protection of existing works, we can infer the gender of all users with 99.90% accuracy even with the addition of maximal noise. In this work, we propose Privacy-Preserving Orthogonal Aggregation (PPOA), which employs the secure aggregation scheme and quantization technique, to prevent the suppression of minority groups by the majority and preserve the distinct preferences for better group fairness. PPOA can assist different groups in obtaining their respective model aggregation results through a designed orthogonal mapping while keeping their attributes private. Experimental results on three real-world datasets demonstrate that PPOA enhances recommendation effectiveness for both females and males by up to 8.25% and 6.36%, respectively, with a maximum overall improvement of 7.30%, and achieves optimal fairness in most cases. Extensive ablation experiments and visualizations indicate that PPOA successfully maintains preferences for different gender groups.
[LG-13] On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
链接: https://arxiv.org/abs/2411.19671
作者: Xianliang Li,Jun Luo,Zhiwei Zheng,Hanxiao Wang,Li Luo,Lingkun Wen,Linlong Wu,Sheng Xu
关键词-EN: training neural networks, neural networks, Momentum-based optimizers, widely adopted, momentum
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training; preserving the original gradient in the early stages, and gradually amplifying low-frequency gradient components during training both enhance generalization performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.
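The filtering view can be made concrete with a tiny calculation: the EMA-style momentum update m_t = beta * m_{t-1} + (1 - beta) * g_t is a first-order IIR low-pass filter with transfer function H(z) = (1 - beta) / (1 - beta z^{-1}). The snippet below only evaluates that magnitude response for a few beta values; it does not implement the proposed FSGDM optimizer:

```python
# Frequency response of EMA momentum  m_t = beta * m_{t-1} + (1 - beta) * g_t,
# viewed as a first-order low-pass filter over the gradient sequence.
# Illustrates the frequency-domain perspective only; FSGDM itself is not shown.
import numpy as np

def momentum_magnitude_response(beta, omegas):
    """|H(e^{i w})| for H(z) = (1 - beta) / (1 - beta * z^{-1})."""
    z_inv = np.exp(-1j * omegas)
    return np.abs((1 - beta) / (1 - beta * z_inv))

omegas = np.linspace(0.0, np.pi, 5)            # 0 = DC ... pi = highest frequency
for beta in (0.5, 0.9, 0.99):
    resp = momentum_magnitude_response(beta, omegas)
    print(f"beta={beta}: " + " ".join(f"{r:.3f}" for r in resp))
# Larger beta attenuates high-frequency gradient components more strongly,
# while the DC (low-frequency) gain stays at 1.
```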
[LG-14] Learned Random Label Predictions as a Neural Network Complexity Metric
链接: https://arxiv.org/abs/2411.19640
作者: Marlon Becker,Benjamin Risse
关键词-EN: randomly generated labels, deep neural networks, learning randomly generated, empirically investigate, investigate the impact
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We empirically investigate the impact of learning randomly generated labels in parallel to class labels in supervised learning on memorization, model complexity, and generalization in deep neural networks. To this end, we introduce a multi-head network architecture as an extension of standard CNN architectures. Inspired by methods used in fair AI, our approach allows for the unlearning of random labels, preventing the network from memorizing individual samples. Based on the concept of Rademacher complexity, we first use our proposed method as a complexity metric to analyze the effects of common regularization techniques and challenge the traditional understanding of feature extraction and classification in CNNs. Second, we propose a novel regularizer that effectively reduces sample memorization. However, contrary to the predictions of classical statistical learning theory, we do not observe improvements in generalization.
[LG-15] PACMANN: Point Adaptive Collocation Method for Artificial Neural Networks
链接: https://arxiv.org/abs/2411.19632
作者: Coen Visser,Alexander Heinlein,Bianca Giovanardi
关键词-EN: Partial Differential Equations, Differential Equations, Partial Differential, collocation points, adaptive collocation point
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 22 pages, 9 figures
点击查看摘要
Abstract:Physics-Informed Neural Networks (PINNs) are an emerging tool for approximating the solution of Partial Differential Equations (PDEs) in both forward and inverse problems. PINNs minimize a loss function which includes the PDE residual determined for a set of collocation points. Previous work has shown that the number and distribution of these collocation points have a significant influence on the accuracy of the PINN solution. Therefore, the effective placement of these collocation points is an active area of research. Specifically, adaptive collocation point sampling methods have been proposed, which have been reported to scale poorly to higher dimensions. In this work, we address this issue and present the Point Adaptive Collocation Method for Artificial Neural Networks (PACMANN). Inspired by classic optimization problems, this approach incrementally moves collocation points toward regions of higher residuals using gradient-based optimization algorithms guided by the gradient of the squared residual. We apply PACMANN for forward and inverse problems, and demonstrate that this method matches the performance of state-of-the-art methods in terms of the accuracy/efficiency tradeoff for the low-dimensional problems, while outperforming available approaches for high-dimensional problems; the best performance is observed for the Adam optimizer. Key features of the method include its low computational cost and simplicity of integration in existing physics-informed neural network pipelines.
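A toy illustration of the point-moving idea, gradient ascent of collocation points on the squared residual, is sketched below. A fixed analytic function stands in for a real PINN residual, and the learning rate and step count are made-up values; the actual PACMANN algorithm interleaves this with PINN training:

```python
# Toy sketch of adaptive collocation points: move points uphill on the
# squared residual via its gradient. A fixed analytic "residual" stands in
# for a real PINN residual; PACMANN couples this with PINN training.
import torch

def residual(x):
    # Stand-in residual with a sharp feature near x = 0.5 (a real residual
    # would come from the governing PDE and the current network parameters).
    return torch.exp(-50.0 * (x - 0.5) ** 2) + 0.05 * torch.sin(6.0 * x)

torch.manual_seed(0)
x0 = torch.rand(64)                              # initial collocation points in [0, 1]
x = x0.clone().requires_grad_(True)
lr = 2e-3
for step in range(200):
    sq_res = residual(x) ** 2
    grad, = torch.autograd.grad(sq_res.sum(), x)
    with torch.no_grad():
        x += lr * grad                           # gradient *ascent* toward high residual
        x.clamp_(0.0, 1.0)                       # keep points inside the domain

near = lambda pts: ((pts - 0.5).abs() < 0.1).float().mean().item()
print(f"fraction near the sharp feature: before {near(x0):.2f}, after {near(x.detach()):.2f}")
```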
[LG-16] Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT
链接: https://arxiv.org/abs/2411.19584
作者: Hemal Mahmud,Hasan Mahmud
关键词-EN: user complex emotions, process of identifying, identifying the emotional, emotional tone, uncover the user
类目: Machine Learning (cs.LG)
*备注: 13 pages, 12 figures
点击查看摘要
Abstract:Sentiment analysis (SA) is the process of identifying the emotional tone or polarity within a given text and aims to uncover the user's complex emotions and inner feelings. While sentiment analysis has been extensively studied for languages like English, research in Bengali remains limited, particularly for fine-grained sentiment categorization. This work aims to bridge this gap by developing a novel approach that integrates rule-based algorithms with pre-trained language models. We developed a dataset from scratch, comprising over 15,000 manually labeled reviews. Next, we constructed a Lexicon Data Dictionary, assigning polarity scores to the reviews. We developed a novel rule-based algorithm, Bangla Sentiment Polarity Score (BSPS), an approach capable of generating sentiment scores and classifying reviews into nine distinct sentiment categories. To assess the performance of this method, we evaluated the classified sentiments using BanglaBERT, a pre-trained transformer-based language model. We also performed sentiment classification directly with BanglaBERT on the original data and evaluated this model's results. Our analysis revealed that the BSPS + BanglaBERT hybrid approach outperformed the standalone BanglaBERT model, achieving higher accuracy, precision, and nuanced classification across the nine sentiment categories. The results of our study emphasize the value and effectiveness of combining rule-based and pre-trained language model approaches for enhanced sentiment analysis in Bengali and suggest pathways for future research and application in languages with similar linguistic complexities.
[LG-17] Differentiable Causal Discovery For Latent Hierarchical Causal Models
链接: https://arxiv.org/abs/2411.19556
作者: Parjanya Prashant,Ignavier Ng,Kun Zhang,Biwei Huang
关键词-EN: Discovering causal, Discovering causal structures, fundamental challenge, Discovering, causal
类目: Machine Learning (cs.LG)
*备注: 25 pages with references, 7 figures
点击查看摘要
Abstract:Discovering causal structures with latent variables from observational data is a fundamental challenge in causal discovery. Existing methods often rely on constraint-based, iterative discrete searches, limiting their scalability to large numbers of variables. Moreover, these methods frequently assume linearity or invertibility, restricting their applicability to real-world scenarios. We present new theoretical results on the identifiability of nonlinear latent hierarchical causal models, relaxing previous assumptions in literature about the deterministic nature of latent variables and exogenous noise. Building on these insights, we develop a novel differentiable causal discovery algorithm that efficiently estimates the structure of such models. To the best of our knowledge, this is the first work to propose a differentiable causal discovery method for nonlinear latent hierarchical models. Our approach outperforms existing methods in both accuracy and scalability. We demonstrate its practical utility by learning interpretable hierarchical latent structures from high-dimensional image data and demonstrate its effectiveness on downstream tasks.
[LG-18] Analysis of High-dimensional Gaussian Labeled-unlabeled Mixture Model via Message-passing Algorithm
链接: https://arxiv.org/abs/2411.19553
作者: Xiaosi Gu,Tomoyuki Obuchi
关键词-EN: machine learning methodology, Semi-supervised learning, Gaussian Mixture Model, machine learning, learning methodology
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Semi-supervised learning (SSL) is a machine learning methodology that leverages unlabeled data in conjunction with a limited amount of labeled data. Although SSL has been applied in various applications and its effectiveness has been empirically demonstrated, it is still not fully understood when and why SSL performs well. Some existing theoretical studies have attempted to address this issue by modeling classification problems using the so-called Gaussian Mixture Model (GMM). These studies provide notable and insightful interpretations. However, their analyses are focused on specific purposes, and a thorough investigation of the properties of GMM in the context of SSL has been lacking. In this paper, we conduct such a detailed analysis of the properties of the high-dimensional GMM for binary classification in the SSL setting. To this end, we employ the approximate message passing and state evolution methods, which are widely used in high-dimensional settings and originate from statistical mechanics. We deal with two estimation approaches: the Bayesian one and the l2-regularized maximum likelihood estimation (RMLE). We conduct a comprehensive comparison between these two approaches, examining aspects such as the global phase diagram, estimation error for the parameters, and prediction error for the labels. A specific comparison is made between the Bayes-optimal (BO) estimator and RMLE, as the BO setting provides optimal estimation performance and is ideal as a benchmark. Our analysis shows that with appropriate regularizations, RMLE can achieve near-optimal performance in terms of both the estimation error and prediction error, especially when there is a large amount of unlabeled data. These results demonstrate that the l2 regularization term plays an effective role in estimation and prediction in SSL approaches.
[LG-19] Development of Low-Cost IoT Units for Thermal Comfort Measurement and AC Energy Consumption Prediction System
链接: https://arxiv.org/abs/2411.19536
作者: Yutong Chen,Daisuke Sumiyoshi,Riki Sakai,Takahiro Yamamoto,Takahiro Ueno,Jewon Oh
关键词-EN: Japanese government initiated, Behavioral Insights, Insights X Technology, Japanese government, substantial energy consumption
类目: Machine Learning (cs.LG)
*备注: RoomVent2024 conference
点击查看摘要
Abstract:In response to the substantial energy consumption in buildings, the Japanese government initiated the BI-Tech (Behavioral Insights X Technology) project in 2019, aimed at promoting voluntary energy-saving behaviors through the utilization of AI and IoT technologies. Our study, aimed at small and medium-sized office buildings, introduces a cost-effective IoT-based BI-Tech system, utilizing the Raspberry Pi 4B+ platform for real-time monitoring of indoor thermal conditions and air conditioner (AC) set-point temperature. Employing machine learning and image recognition, the system analyzes data to calculate the PMV index and predict energy consumption changes due to temperature adjustments. The integration of mobile and desktop applications conveys this information to users, encouraging energy-efficient behavior modifications. The machine learning model achieved an R2 value of 97%, demonstrating the system's efficiency in promoting energy-saving habits among users.
[LG-20] ContextGNN: Beyond Two-Tower Recommendation Systems
链接: https://arxiv.org/abs/2411.19513
作者: Yiwen Yuan,Zecheng Zhang,Xinwei He,Akihiro Nitta,Weihua Hu,Dong Wang,Manan Shah,Shenyang Huang,Blaž Stojanovič,Alan Krumholz,Jan Eric Lenssen,Jure Leskovec,Matthias Fey
关键词-EN: systems predominantly utilize, predominantly utilize two-tower, evaluate user-item rankings, respective embeddings, predominantly utilize
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 14 pages, 1 figure, 5 tables
点击查看摘要
Abstract:Recommendation systems predominantly utilize two-tower architectures, which evaluate user-item rankings through the inner product of their respective embeddings. However, one key limitation of two-tower models is that they learn a pair-agnostic representation of users and items. In contrast, pair-wise representations either scale poorly due to their quadratic complexity or are too restrictive on the candidate pairs to rank. To address these issues, we introduce Context-based Graph Neural Networks (ContextGNNs), a novel deep learning architecture for link prediction in recommendation systems. The method employs a pair-wise representation technique for familiar items situated within a user’s local subgraph, while leveraging two-tower representations to facilitate the recommendation of exploratory items. A final network then predicts how to fuse both pair-wise and two-tower recommendations into a single ranking of items. We demonstrate that ContextGNN is able to adapt to different data characteristics and outperforms existing methods, both traditional and GNN-based, on a diverse set of practical recommendation tasks, improving performance by 20% on average.
[LG-21] Graph-Enhanced EEG Foundation Model
链接: https://arxiv.org/abs/2411.19507
作者: Limin Wang,Toyotaro Suzumura,Hiroki Kanezashi
关键词-EN: provide critical insights, signals provide critical, diagnosis and healthcare, provide critical, critical insights
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Electroencephalography (EEG) signals provide critical insights for applications in disease diagnosis and healthcare. However, the scarcity of labeled EEG data poses a significant challenge. Foundation models offer a promising solution by leveraging large-scale unlabeled data through pre-training, enabling strong performance across diverse tasks. While both temporal dynamics and inter-channel relationships are vital for understanding EEG signals, existing EEG foundation models primarily focus on the former, overlooking the latter. To address this limitation, we propose a novel foundation model for EEG that integrates both temporal and inter-channel information. Our architecture combines Graph Neural Networks (GNNs), which effectively capture relational structures, with a masked autoencoder to enable efficient pre-training. We evaluated our approach using three downstream tasks and experimented with various GNN architectures. The results demonstrate that our proposed model, particularly when employing the GCN architecture with optimized configurations, consistently outperformed baseline methods across all tasks. These findings suggest that our model serves as a robust foundation model for EEG analysis.
[LG-22] SANGO: Socially Aware Navigation through Grouped Obstacles
链接: https://arxiv.org/abs/2411.19497
作者: Rahath Malladi,Amol Harsh,Arshia Sangwan,Sunita Chauhan,Sandeep Manjanna
关键词-EN: dynamically grouping obstacles, Proximal Policy Optimization, paper introduces SANGO, Socially Aware Navigation, Grouped Obstacles
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Indian Control Conference 2024 (ICC-10)
点击查看摘要
Abstract:This paper introduces SANGO (Socially Aware Navigation through Grouped Obstacles), a novel method that ensures socially appropriate behavior by dynamically grouping obstacles and adhering to social norms. Using deep reinforcement learning, SANGO trains agents to navigate complex environments leveraging the DBSCAN algorithm for obstacle clustering and Proximal Policy Optimization (PPO) for path planning. The proposed approach improves safety and social compliance by maintaining appropriate distances and reducing collision rates. Extensive experiments conducted in custom simulation environments demonstrate SANGO’s superior performance in significantly reducing discomfort (by up to 83.5%), reducing collision rates (by up to 29.4%) and achieving higher successful navigation in dynamic and crowded scenarios. These findings highlight the potential of SANGO for real-world applications, paving the way for advanced socially adept robotic navigation systems.
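The obstacle-grouping step maps directly onto off-the-shelf clustering. A minimal example of grouping 2-D obstacle points with scikit-learn's DBSCAN is shown below; the eps and min_samples values are illustrative, not SANGO's tuned parameters:

```python
# Minimal example of grouping 2-D obstacle points with DBSCAN
# (eps / min_samples are illustrative, not SANGO's tuned values).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(20, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.2, size=(15, 2))
outlier = np.array([[10.0, -5.0]])
obstacles = np.vstack([group_a, group_b, outlier])

labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(obstacles)
print("cluster labels:", np.unique(labels))     # -1 marks noise / isolated obstacles
```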
[LG-23] Diffusion Models Meet Network Management: Improving Traffic Matrix Analysis with Diffusion-based Approach
链接: https://arxiv.org/abs/2411.19493
作者: Xinyu Yuan,Yan Qiao,Zhenchun Wei,Zeyu Zhang,Minyue Li,Pei Zhao,Rongyao Hu,Wenjing Li
关键词-EN: maintenance relying heavily, network management related, management related tasks, network traffic monitoring, traffic matrix analysis
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Because network operation and maintenance rely heavily on network traffic monitoring, traffic matrix analysis has been one of the most crucial issues for network management related tasks. However, it is challenging to reliably obtain precise measurements in computer networks because of the high measurement cost and the unavoidable transmission loss. Although some methods proposed in recent years allowed estimating network traffic from partial flow-level or link-level measurements, they often perform poorly for traffic matrix estimation nowadays. Despite strong assumptions like low-rank structure and the prior distribution, existing techniques are usually task-specific and tend to be significantly worse as modern network communication is extremely complicated and dynamic. To address the dilemma, this paper proposes a diffusion-based traffic matrix analysis framework named Diffusion-TM, which leverages problem-agnostic diffusion to notably elevate the estimation performance in both traffic distribution and accuracy. The novel framework not only takes advantage of the powerful generative ability of diffusion models to produce realistic network traffic, but also leverages the denoising process to unbiasedly estimate all end-to-end traffic in a plug-and-play manner under theoretical guarantee. Moreover, taking into account that compiling an intact traffic dataset is usually infeasible, we also propose a two-stage training scheme to make our framework insensitive to missing values in the dataset. With extensive experiments with real-world datasets, we illustrate the effectiveness of Diffusion-TM on several tasks. Moreover, the results also demonstrate that our method can obtain promising results even with 5% known values left in the datasets.
[LG-24] Random Feature Models with Learnable Activation Functions
链接: https://arxiv.org/abs/2411.19468
作者: Zailin Ma,Jiansheng Yang,Yaodong Yang
关键词-EN: Current random feature, random feature models, capture diverse patterns, random feature, Current random
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Current random feature models typically rely on fixed activation functions, limiting their ability to capture diverse patterns in data. To address this, we introduce the Random Feature model with Learnable Activation Functions (RFLAF), a novel model that significantly enhances the expressivity and interpretability of traditional random feature (RF) models. We begin by studying the RF model with a single radial basis function, where we discover a new kernel and provide the first theoretical analysis on it. By integrating the basis functions with learnable weights, we show that RFLAF can represent a broad class of random feature models whose activation functions belong to C_c(\mathbb{R}). Theoretically, we prove that the model requires only about twice the number of parameters of a traditional RF model to achieve the significant leap in expressivity. Experimentally, RFLAF demonstrates two key advantages: (1) it performs better across various tasks compared to the traditional RF model with the same number of parameters, and (2) the optimized weights offer interpretability, as the learned activation function can be directly inferred from these weights. Our model paves the way for developing more expressive and interpretable frameworks within random feature models.
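A minimal reading of the construction, fixed random features followed by an activation parameterized as a learnable combination of radial basis functions, can be sketched as follows. The basis grid, bandwidth, and output head are illustrative assumptions rather than the paper's exact configuration:

```python
# Sketch of a random-feature model with a learnable activation expressed as
# a weighted sum of RBF basis functions (illustrative hyperparameters only).
import torch
import torch.nn as nn

class RFLAFSketch(nn.Module):
    def __init__(self, d_in, n_features=256, n_basis=16, span=3.0, gamma=4.0):
        super().__init__()
        self.register_buffer("W", torch.randn(d_in, n_features))   # fixed random weights
        self.register_buffer("b", torch.rand(n_features) * 2 * torch.pi)
        self.register_buffer("centers", torch.linspace(-span, span, n_basis))
        self.gamma = gamma
        self.basis_coef = nn.Parameter(torch.randn(n_basis) * 0.1)  # learnable activation
        self.head = nn.Linear(n_features, 1)

    def activation(self, z):
        # phi(z) = sum_k c_k * exp(-gamma * (z - center_k)^2)
        diff = z.unsqueeze(-1) - self.centers                        # (..., n_basis)
        return (self.basis_coef * torch.exp(-self.gamma * diff ** 2)).sum(-1)

    def forward(self, x):
        z = x @ self.W + self.b                                      # random projection
        return self.head(self.activation(z))

model = RFLAFSketch(d_in=8)
x = torch.randn(32, 8)
print(model(x).shape)   # torch.Size([32, 1])
# Only basis_coef and the linear head are trained; the random projection W stays fixed.
```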
[LG-25] Multi-task CNN Behavioral Embedding Model For Transaction Fraud Detection ICDM
链接: https://arxiv.org/abs/2411.19457
作者: Bo Qu,Zhurong Wang,Minghao Gu,Daisuke Yagi,Yang Zhao,Yinan Shan,Frank Zahradnik
关键词-EN: burgeoning e-Commerce sector, e-Commerce sector requires, sector requires advanced, requires advanced solutions, Transaction Fraud Detection
类目: Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, ICDMW 2024
点击查看摘要
Abstract:The burgeoning e-Commerce sector requires advanced solutions for the detection of transaction fraud. With an increasing risk of financial information theft and account takeovers, deep learning methods have become integral to the embedding of behavior sequence data in fraud detection. However, these methods often struggle to balance modeling capabilities and efficiency and incorporate domain knowledge. To address these issues, we introduce the multitask CNN behavioral Embedding Model for Transaction Fraud Detection. Our contributions include 1) introducing a single-layer CNN design featuring multirange kernels which outperform LSTM and Transformer models in terms of scalability and domain-focused inductive bias, and 2) the integration of positional encoding with CNN to introduce sequence-order signals enhancing overall performance, and 3) implementing multitask learning with randomly assigned label weights, thus removing the need for manual tuning. Testing on real-world data reveals our model’s enhanced performance of downstream transaction models and comparable competitiveness with the Transformer Time Series (TST) model.
[LG-26] Autocorrelation Matters: Understanding the Role of Initialization Schemes for State Space Models
链接: https://arxiv.org/abs/2411.19455
作者: Fusheng Liu,Qianxiao Li
关键词-EN: SSM kernel basis, parameters primarily rely, state space model, initializing state space, SSM state matrix
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Current methods for initializing state space model (SSM) parameters primarily rely on the HiPPO framework \citep{gu2023how}, which is based on online function approximation with the SSM kernel basis. However, the HiPPO framework does not explicitly account for the effects of the temporal structures of input sequences on the optimization of SSMs. In this paper, we take a further step to investigate the roles of SSM initialization schemes by considering the autocorrelation of input sequences. Specifically, we: (1) rigorously characterize the dependency of the SSM timescale on sequence length based on sequence autocorrelation; (2) find that with a proper timescale, allowing a zero real part for the eigenvalues of the SSM state matrix mitigates the curse of memory while still maintaining stability at initialization; (3) show that the imaginary part of the eigenvalues of the SSM state matrix determines the conditioning of SSM optimization problems, and uncover an approximation-estimation tradeoff when training SSMs with a specific class of target functions.
[LG-27] A Simple Sparse Matrix Vector Multiplication Approach to Padded Convolution
链接: https://arxiv.org/abs/2411.19419
作者: Zan Chaudhry
关键词-EN: efficiently representing convolution, efficiently representing, vectorized input, sparse matrix-vector multiplication, sparse transformation matrix
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 10 pages, 2 figures, 2 tables
点击查看摘要
Abstract:We introduce an algorithm for efficiently representing convolution with zero-padding and stride as a sparse transformation matrix, applied to a vectorized input through sparse matrix-vector multiplication (SpMV). We provide a theoretical contribution with an explicit expression for the number of non-zero multiplications in convolutions with stride and padding, offering insight into the potential for leveraging sparsity in convolution operations. A proof-of-concept implementation is presented in Python, demonstrating the performance of our method on both CPU and GPU architectures. This work contributes to the broader exploration of sparse matrix techniques in convolutional algorithms, with a particular focus on leveraging matrix multiplications for parallelization. Our findings lay the groundwork for future advancements in exploiting sparsity to improve the efficiency of convolution operations in fields such as machine learning and signal processing.
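A 1-D version of the idea, building the padded, strided convolution as an explicit sparse matrix and applying it with a sparse matrix-vector product, is easy to sketch and check against a direct computation. The kernel, stride, and padding below are arbitrary example values:

```python
# Sketch: 1-D convolution with zero padding and stride expressed as a sparse
# matrix applied to the vectorized input (SpMV). Verified against a direct loop.
import numpy as np
from scipy.sparse import lil_matrix

def conv1d_as_sparse_matrix(n_in, kernel, stride=1, pad=1):
    k = len(kernel)
    n_padded = n_in + 2 * pad
    n_out = (n_padded - k) // stride + 1
    A = lil_matrix((n_out, n_in))
    for i in range(n_out):
        for j in range(k):
            col = i * stride + j - pad          # index into the *unpadded* input
            if 0 <= col < n_in:                 # zero-padding -> simply omit the entry
                A[i, col] = kernel[j]
    return A.tocsr(), n_out

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])
A, n_out = conv1d_as_sparse_matrix(len(x), kernel, stride=2, pad=1)
y_spmv = A @ x                                   # sparse matrix-vector product

# Direct reference computation with explicit zero padding.
xp = np.pad(x, 1)
y_ref = np.array([np.dot(xp[i * 2:i * 2 + 3], kernel) for i in range(n_out)])
print(np.allclose(y_spmv, y_ref))                # True
```

The padded positions never produce matrix entries, which is exactly where the non-zero-multiplication counting discussed in the paper comes from.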
[LG-28] On the effectiveness of discrete representations in sparse mixture of experts
链接: https://arxiv.org/abs/2411.19402
作者: Giang Do,Kha Pham,Hung Le,Truyen Tran
关键词-EN: Sparse mixture, computational costs, effective solution, solution for scaling, capacity without increasing
类目: Machine Learning (cs.LG)
*备注: 17 pages
点击查看摘要
Abstract:Sparse mixture of experts (SMoE) is an effective solution for scaling up model capacity without increasing the computational costs. A crucial component of SMoE is the router, responsible for directing the input to relevant experts; however, it also presents a major weakness, leading to routing inconsistencies and representation collapse issues. Instead of fixing the router like previous works, we propose an alternative that assigns experts to input via indirection, which employs the discrete representation of input that points to the expert. The discrete representations are learnt via vector quantization, resulting in a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). We provide theoretical support and empirical evidence demonstrating the VQMoE’s ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks for pre-training and fine-tuning, we show that VQMoE achieves a 28% improvement in robustness compared to other SMoE routing methods, while maintaining strong performance in fine-tuning tasks.
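The routing-by-quantization idea can be sketched in a few lines: each token is snapped to its nearest codebook vector, and the code index selects the expert. The sketch below omits the quantization losses, straight-through gradients, and capacity handling that a real SMoE layer needs, and the dimensions are illustrative:

```python
# Sketch of routing-by-vector-quantization: the nearest codebook entry picks
# the expert. Training details (VQ losses, straight-through gradients,
# capacity limits) are omitted; this only illustrates the routing mechanism.
import torch
import torch.nn as nn

class VQRoutingSketch(nn.Module):
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_experts, d_model))
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

    def forward(self, x):                        # x: (n_tokens, d_model)
        dists = torch.cdist(x, self.codebook)    # (n_tokens, n_experts)
        codes = dists.argmin(dim=-1)             # discrete expert assignment
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = codes == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out, codes

layer = VQRoutingSketch()
tokens = torch.randn(16, 64)
out, codes = layer(tokens)
print(out.shape, codes.bincount(minlength=4))    # output shape and tokens per expert
```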
[LG-29] Scale Invariance of Graph Neural Networks
链接: https://arxiv.org/abs/2411.19392
作者: Qin Jiang,Chengjia Wang,Michael Lones,Wei Pang
关键词-EN: Graph Neural Networks, unified model capable, graph learning, Graph Neural, Neural Networks
类目: Machine Learning (cs.LG)
*备注: 13 pages,. arXiv admin note: substantial text overlap with arXiv:2411.08758
点击查看摘要
Abstract:We address two fundamental challenges in Graph Neural Networks (GNNs): (1) the lack of theoretical support for invariance learning, a critical property in image processing, and (2) the absence of a unified model capable of excelling on both homophilic and heterophilic graph datasets. To tackle these issues, we establish and prove scale invariance in graphs, extending this key property to graph learning, and validate it through experiments on real-world datasets. Leveraging directed multi-scaled graphs and an adaptive self-loop strategy, we propose ScaleNet, a unified network architecture that achieves state-of-the-art performance across four homophilic and two heterophilic benchmark datasets. Furthermore, we show that through graph transformation based on scale invariance, uniform weights can replace computationally expensive edge weights in digraph inception networks while maintaining or improving performance. For another popular GNN approach to digraphs, we demonstrate the equivalence between Hermitian Laplacian methods and GraphSAGE with incidence normalization. ScaleNet bridges the gap between homophilic and heterophilic graph learning, offering both theoretical insights into scale invariance and practical advancements in unified graph learning. Our implementation is publicly available at this https URL.
[LG-30] Parameter-Efficient Transfer Learning for Music Foundation Models
链接: https://arxiv.org/abs/2411.19371
作者: Yiwei Ding,Alexander Lerch
关键词-EN: task independent encoding, music foundation models, promising a general, recently being released, musical information
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 6+2 pages
点击查看摘要
Abstract:More music foundation models have recently been released, promising a general, mostly task-independent encoding of musical information. Common ways of adapting music foundation models to downstream tasks are probing and fine-tuning. These common transfer learning approaches, however, face challenges. Probing might lead to suboptimal performance because the pre-trained weights are frozen, while fine-tuning is computationally expensive and is prone to overfitting. Our work investigates the use of parameter-efficient transfer learning (PETL) for music foundation models, which integrates the advantages of probing and fine-tuning. We introduce three types of PETL methods: adapter-based methods, prompt-based methods, and reparameterization-based methods. These methods train only a small number of parameters, and therefore do not require significant computational resources. Results show that PETL methods outperform both probing and fine-tuning on music auto-tagging. On key detection and tempo estimation, they achieve similar results as fine-tuning with significantly less training cost. However, the usefulness of the current generation of foundation models on key and tempo tasks is questioned by the similar results achieved by training a small model from scratch. Code available at this https URL
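Of the three PETL families listed, the adapter-based one is the simplest to sketch: a small bottleneck module with a residual connection is trained while the backbone stays frozen. The layer sizes below are illustrative and the backbone is a stand-in, not an actual music foundation model:

```python
# Sketch of an adapter module for parameter-efficient transfer learning:
# a small bottleneck with a residual connection, trained while the backbone
# stays frozen. Dimensions are illustrative; the backbone is a stand-in.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)           # start as a (near-)identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

backbone = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False                      # frozen foundation-model layer
adapter = Adapter()

x = torch.randn(2, 100, 768)                     # (batch, frames, features)
h = adapter(backbone(x))                         # only the adapter is trainable
n_trainable = sum(p.numel() for p in adapter.parameters())
print(h.shape, n_trainable)
```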
[LG-31] Perspective of Software Engineering Researchers on Machine Learning Practices Regarding Research, Review, and Education
链接: https://arxiv.org/abs/2411.19304
作者: Anamaria Mojica-Hanke,David Nader Palacio,Denys Poshyvanyk,Mario Linares-Vásquez,Steffen Herbold
关键词-EN: Toggle, significantly impacts Software, impacts Software Engineering, Machine Learning, Software Engineering
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: under review
点击查看摘要
Abstract:Context: Machine Learning (ML) significantly impacts Software Engineering (SE), but studies mainly focus on practitioners, neglecting researchers. This overlooks practices and challenges in teaching, researching, or reviewing ML applications in SE. Objective: This study aims to contribute to the knowledge about the synergy between ML and SE from the perspective of SE researchers, by providing insights into the practices followed when researching, teaching, and reviewing SE studies that apply ML. Method: We analyzed SE researchers familiar with ML or who authored SE articles using ML, along with the articles themselves. We examined practices, SE tasks addressed with ML, challenges faced, and reviewers' and educators' perspectives using grounded theory coding and qualitative analysis. Results: We found diverse practices focusing on data collection, model training, and evaluation. Some recommended practices (e.g., hyperparameter tuning) appeared in less than 20% of literature. Common challenges involve data handling, model evaluation (incl. non-functional properties), and involving human expertise in evaluation. Hands-on activities are common in education, though traditional methods persist. Conclusion: Despite accepted practices in applying ML to SE, significant gaps remain. By enhancing guidelines, adopting diverse teaching methods, and emphasizing underrepresented practices, the SE community can bridge these gaps and advance the field.
[LG-32] Controlling Participation in Federated Learning with Feedback
链接: https://arxiv.org/abs/2411.19242
作者: Michael Cummins,Guner Dilsad Er,Michael Muehlebach
关键词-EN: traditional methods typically, methods typically rely, client participation, training round, address the problem
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:We address the problem of client participation in federated learning, where traditional methods typically rely on a random selection of a small subset of clients for each training round. In contrast, we propose FedBack, a deterministic approach that leverages control-theoretic principles to manage client participation in ADMM-based federated learning. FedBack models client participation as a discrete-time dynamical system and employs an integral feedback controller to adjust each client’s participation rate individually, based on the client’s optimization dynamics. We provide global convergence guarantees for our approach by building on the recent federated learning research. Numerical experiments on federated image classification demonstrate that FedBack achieves up to 50% improvement in communication and computational efficiency over algorithms that rely on a random selection of clients.
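下面给出一个极简的数值示意(并非 FedBack 的原始实现;增益、目标残差与参与率上下限均为假设),用来说明"用离散时间积分反馈控制器按各客户端的优化残差调节其参与率"这一思路:

```python
import numpy as np

# 增量式积分反馈控制:参与率按(残差 - 目标)逐轮累积调整,并限幅
def update_participation(rates, errors, k_i=0.05, target=0.1, low=0.05, high=1.0):
    """rates: 当前参与率向量; errors: 各客户端的优化残差(如本地梯度范数)"""
    return np.clip(rates + k_i * (errors - target), low, high)

# 用法示例:残差持续偏大的客户端,其参与率会被逐步调高
rates = np.full(10, 0.5)
for _ in range(20):
    errors = np.random.rand(10)
    rates = update_participation(rates, errors)
print(rates)
```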
[LG-33] Large width penalization for neural network-based prediction interval estimation
链接: https://arxiv.org/abs/2411.19181
作者: Worachit Amnuaypongsa,Jitkomut Songsiri
关键词-EN: highly uncertain environments, nature of systems, large PI widths, accuracy in highly, highly uncertain
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 28 pages, 12 figures
点击查看摘要
Abstract:Forecasting accuracy in highly uncertain environments is challenging due to the stochastic nature of systems. Deterministic forecasting provides only point estimates and cannot capture potential outcomes. Therefore, probabilistic forecasting has gained significant attention due to its ability to quantify uncertainty, where one of the approaches is to express it as a prediction interval (PI), which explicitly shows upper and lower bounds of predictions associated with a confidence level. High-quality PI is characterized by a high PI coverage probability (PICP) and a narrow PI width. In many real-world applications, the PI width is generally used in risk management to prepare resources that improve reliability and effectively manage uncertainty. A wider PI width results in higher costs for backup resources as decision-making processes often focus on the worst-case scenarios arising with large PI widths under extreme conditions. This study aims to reduce the large PI width from the PI estimation method by proposing a new PI loss function that penalizes the average of the large PI widths more heavily. The proposed formulation is compatible with gradient-based algorithms, the standard approach to training neural networks (NNs), and integrates with state-of-the-art NNs and existing deep learning techniques. Experiments with the synthetic dataset reveal that our formulation significantly reduces the large PI width while effectively maintaining the PICP to achieve the desired probability. The practical implementation of our proposed loss function is demonstrated in solar irradiance forecasting, highlighting its effectiveness in minimizing the large PI width in data with high uncertainty and showcasing its compatibility with more complex neural network models. Therefore, reducing large PI widths with our method can lead to significant cost savings by reducing the over-allocation of reserve resources.
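作为直观示意,下面用 PyTorch 写一个简化的区间预测损失(并非论文提出的损失函数;alpha、beta 及各项的组合方式均为假设),思路是:在区间宽度与覆盖惩罚之外,对超过平均宽度的"大宽度"样本额外加重惩罚:

```python
import torch

def pi_loss(lower, upper, y, alpha=0.1, beta=4.0):
    """lower/upper: 预测区间上下界; y: 真实值; alpha: 允许的未覆盖比例; beta: 大宽度惩罚系数(假设值)"""
    width = upper - lower
    mean_w = width.mean()
    large_penalty = beta * torch.relu(width - mean_w).mean()   # 更重地惩罚超过平均宽度的部分
    below = torch.relu(lower - y)                              # 真实值落在下界之下的程度
    above = torch.relu(y - upper)                              # 真实值落在上界之上的程度
    coverage_penalty = (below + above).mean() / alpha          # 软化的覆盖率(PICP)约束
    return mean_w + large_penalty + coverage_penalty

# 用法示例
y = torch.randn(32)
lower, upper = y - 0.5, y + torch.rand(32)
print(pi_loss(lower, upper, y))
```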
[LG-34] Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
链接: https://arxiv.org/abs/2411.19146
作者: Akhiad Bercovich,Tomer Ronen,Talor Abramovich,Nir Ailon,Nave Assaf,Mohammad Dabbah,Ido Galil,Amnon Geifman,Yonatan Geifman,Izhak Golan,Netanel Haber,Ehud Karpas,Itay Levy,Shahar Mor,Zach Moshe,Najeeb Nabwani,Omri Puny,Ran Rubin,Itamar Schen,Ido Shahaf,Oren Tropp,Omer Ullman Argov,Ran Zilberstein,Ran El-Yaniv
关键词-EN: demonstrated remarkable capabilities, demonstrated remarkable, adoption is limited, limited by high, capabilities
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but their adoption is limited by high computational costs during inference. While increasing parameter counts enhances accuracy, it also widens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a framework to accelerate LLM inference on specific hardware while preserving their capabilities. Through an innovative application of neural architecture search (NAS) at an unprecedented scale, Puzzle systematically optimizes models with tens of billions of parameters under hardware constraints. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We demonstrate the real-world impact of our framework through Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B), a publicly available model derived from Llama-3.1-70B-Instruct. Nemotron-51B achieves a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while preserving 98.4% of the original model’s capabilities. Nemotron-51B currently stands as the most accurate language model capable of inference on a single GPU with large batch sizes. Remarkably, this transformation required just 45B training tokens, compared to over 15T tokens used for the 70B model it was derived from. This establishes a new paradigm where powerful models can be optimized for efficient deployment with only negligible compromise of their capabilities, demonstrating that inference performance, not parameter count alone, should guide model selection. With the release of Nemotron-51B and the presentation of the Puzzle framework, we provide practitioners immediate access to state-of-the-art language modeling capabilities at significantly reduced computational costs.
[LG-35] TEA: Trajectory Encoding Augmentation for Robust and Transferable Policies in Offline Reinforcement Learning
链接: https://arxiv.org/abs/2411.19133
作者: Batıkan Bora Ormancı,Phillip Swazinna,Steffen Udluft,Thomas A. Runkler
关键词-EN: offline reinforcement learning, investigate offline reinforcement, Trajectory Encoding Augmentation, reinforcement learning, investigate offline
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this paper, we investigate offline reinforcement learning (RL) with the goal of training a single robust policy that generalizes effectively across environments with unseen dynamics. We propose a novel approach, Trajectory Encoding Augmentation (TEA), which extends the state space by integrating latent representations of environmental dynamics obtained from sequence encoders, such as AutoEncoders. Our findings show that incorporating these encodings with TEA improves the transferability of a single policy to novel environments with new dynamics, surpassing methods that rely solely on unmodified states. These results indicate that TEA captures critical, environment-specific characteristics, enabling RL agents to generalize effectively across dynamic conditions.
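下面是一个极简示意(假设代码,并非 TEA 的原实现;encoder 的接口与占位编码器均为假设),展示"把序列编码器得到的环境动力学潜变量拼接到状态上"这一思路:

```python
import numpy as np

def augment_state(state, trajectory, encoder):
    """state: 当前状态向量; trajectory: 近期 (s, a) 片段; encoder: 把序列映射为潜向量的函数(假设接口)"""
    z = encoder(trajectory)              # 环境动力学的潜表示
    return np.concatenate([state, z])    # 扩展后的状态再交给离线 RL 策略使用

# 用法示例:用序列按维度取均值充当"编码器"占位
toy_encoder = lambda traj: np.asarray(traj).mean(axis=0)
s_aug = augment_state(np.zeros(4), np.random.randn(8, 4), toy_encoder)
print(s_aug.shape)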
[LG-36] Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures
链接: https://arxiv.org/abs/2411.19128
作者: Yicheng Zhang,Zhen Qin,Zhaomin Wu,Shuiguang Deng
关键词-EN: pre-trained large language, large language models, large amount, pre-trained large, large language
类目: Machine Learning (cs.LG)
*备注: Ongoing work. Codes are released at this https URL
点击查看摘要
Abstract:A large amount of instructional text data is essential to enhance the performance of pre-trained large language models (LLMs) for downstream tasks. This data can contain sensitive information and therefore cannot be shared in practice, resulting in data silos that limit the effectiveness of LLMs on various tasks. Federated learning (FL) enables collaborative fine-tuning across different clients without sharing their data. Nonetheless, in practice, this instructional text data is highly heterogeneous in both quantity and distribution across clients, necessitating distinct model structures to best accommodate the variations. However, existing federated fine-tuning approaches either enforce the same model structure or rely on predefined ad-hoc architectures unaware of data distribution, resulting in suboptimal performance. To address this challenge, we propose FedAMoLE, a lightweight personalized federated fine-tuning framework that leverages data-driven heterogeneous model architectures. FedAMoLE introduces the Adaptive Mixture of LoRA Experts (AMoLE) module, which facilitates model heterogeneity with minimal communication overhead by allocating varying numbers of LoRA-based domain experts to each client. Furthermore, we develop a reverse selection-based expert assignment (RSEA) strategy, which enables data-driven model architecture adjustment during fine-tuning by allowing domain experts to select clients that best align with their knowledge domains. Extensive experiments across six different scenarios of data heterogeneity demonstrate that FedAMoLE significantly outperforms existing methods for federated LLM fine-tuning, achieving superior accuracy while maintaining good scalability.
[LG-37] Advancing Generalization in PINNs through Latent-Space Representations
链接: https://arxiv.org/abs/2411.19125
作者: Honghui Wang,Yifan Pu,Shiji Song,Gao Huang
关键词-EN: made significant strides, Physics-informed neural networks, partial differential equations, made significant, significant strides
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Physics-informed neural networks (PINNs) have made significant strides in modeling dynamical systems governed by partial differential equations (PDEs). However, their generalization capabilities across varying scenarios remain limited. To overcome this limitation, we propose PIDO, a novel physics-informed neural PDE solver designed to generalize effectively across diverse PDE configurations, including varying initial conditions, PDE coefficients, and training time horizons. PIDO exploits the shared underlying structure of dynamical systems with different properties by projecting PDE solutions into a latent space using auto-decoding. It then learns the dynamics of these latent representations, conditioned on the PDE coefficients. Despite its promise, integrating latent dynamics models within a physics-informed framework poses challenges due to the optimization difficulties associated with physics-informed losses. To address these challenges, we introduce a novel approach that diagnoses and mitigates these issues within the latent space. This strategy employs straightforward yet effective regularization techniques, enhancing both the temporal extrapolation performance and the training stability of PIDO. We validate PIDO on a range of benchmarks, including 1D combined equations and 2D Navier-Stokes equations. Additionally, we demonstrate the transferability of its learned representations to downstream applications such as long-term integration and inverse problems.
[LG-38] Deep Learning for GWP Prediction: A Framework Using PCA Quantile Transformation and Ensemble Modeling
链接: https://arxiv.org/abs/2411.19124
作者: Navin Rajapriya,Kotaro Kawajiri
关键词-EN: anthropogenic greenhouse gases, Developing environmentally sustainable, Developing environmentally, global warming potential, critical for mitigating
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
*备注: 10 pages, 5 figures, 2 tables
点击查看摘要
Abstract:Developing environmentally sustainable refrigerants is critical for mitigating the impact of anthropogenic greenhouse gases on global warming. This study presents a predictive modeling framework to estimate the 100-year global warming potential (GWP 100) of single-component refrigerants using a fully connected neural network implemented on the Multi-Sigma platform. Molecular descriptors from RDKit, Mordred, and alvaDesc were utilized to capture various chemical features. The RDKit-based model achieved the best performance, with a Root Mean Square Error (RMSE) of 481.9 and an R2 score of 0.918, demonstrating superior predictive accuracy and generalizability. Dimensionality reduction through Principal Component Analysis (PCA) and quantile transformation were applied to address the high-dimensional and skewed nature of the dataset, enhancing model stability and performance. Factor analysis identified vital molecular features, including molecular weight, lipophilicity, and functional groups, such as nitriles and allylic oxides, as significant contributors to GWP values. These insights provide actionable guidance for designing environmentally sustainable refrigerants. Integrating RDKit descriptors with Multi-Sigma’s framework, which includes PCA, quantile transformation, and neural networks, provides a scalable solution for the rapid virtual screening of low-GWP refrigerants. This approach can potentially accelerate the identification of eco-friendly alternatives, directly contributing to climate mitigation by enabling the design of next-generation refrigerants aligned with global sustainability objectives.
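以下用 scikit-learn 给出一个流程示意(并非论文所用的 Multi-Sigma 平台实现;数据为占位,超参数均为假设),把分位数变换、PCA 与全连接网络串成回归管线:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

X, y = np.random.rand(200, 50), np.random.rand(200) * 1000  # 占位数据:分子描述符与 GWP100

model = make_pipeline(
    QuantileTransformer(output_distribution="normal", n_quantiles=100),  # 缓解偏态分布
    PCA(n_components=20),                                                # 降维
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000),            # 全连接网络回归
)
model.fit(X, y)
print(model.predict(X[:3]))
```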
[LG-39] Introducing Three New Benchmark Datasets for Hierarchical Text Classification
链接: https://arxiv.org/abs/2411.19119
作者: Jaco du Toit,Herman Redelinghuys,Marcel Dunaiski
关键词-EN: Hierarchical Text Classification, Hierarchical Text, natural language processing, language processing task, classify text documents
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 16 pages, 11 figures
点击查看摘要
Abstract:Hierarchical Text Classification (HTC) is a natural language processing task with the objective to classify text documents into a set of classes from a structured class hierarchy. Many HTC approaches have been proposed which attempt to leverage the class hierarchy information in various ways to improve classification performance. Machine learning-based classification approaches require large amounts of training data and are most-commonly compared through three established benchmark datasets, which include the Web Of Science (WOS), Reuters Corpus Volume 1 Version 2 (RCV1-V2) and New York Times (NYT) datasets. However, apart from the RCV1-V2 dataset which is well-documented, these datasets are not accompanied with detailed description methodologies. In this paper, we introduce three new HTC benchmark datasets in the domain of research publications which comprise the titles and abstracts of papers from the Web of Science publication database. We first create two baseline datasets which use existing journal-and citation-based classification schemas. Due to the respective shortcomings of these two existing schemas, we propose an approach which combines their classifications to improve the reliability and robustness of the dataset. We evaluate the three created datasets with a clustering-based analysis and show that our proposed approach results in a higher quality dataset where documents that belong to the same class are semantically more similar compared to the other datasets. Finally, we provide the classification performance of four state-of-the-art HTC approaches on these three new datasets to provide baselines for future studies on machine learning-based techniques for scientific publication classification.
[LG-40] Neural Window Decoder for SC-LDPC Codes
链接: https://arxiv.org/abs/2411.19092
作者: Dae-Young Yun,Hee-Youl Kwak,Yongjune Kim,Sang-Hyo Kim,Jong-Seon No
关键词-EN: coupled low-density parity-check, spatially coupled low-density, neural window decoder, low-density parity-check, spatially coupled
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 12 pages, 16 figures
点击查看摘要
Abstract:In this paper, we propose a neural window decoder (NWD) for spatially coupled low-density parity-check (SC-LDPC) codes. The proposed NWD retains the conventional window decoder (WD) process but incorporates trainable neural weights. To train the weights of NWD, we introduce two novel training strategies. First, we restrict the loss function to target variable nodes (VNs) of the window, which prunes the neural network and accordingly enhances training efficiency. Second, we employ the active learning technique with a normalized loss term to prevent the training process from biasing toward specific training regions. Next, we develop a systematic method to derive non-uniform schedules for the NWD based on the training results. We introduce trainable damping factors that reflect the relative importance of check node (CN) updates. By skipping updates with less importance, we can omit 41% of CN updates without performance degradation compared to the conventional WD. Lastly, we address the error propagation problem inherent in SC-LDPC codes by deploying a complementary weight set, which is activated when an error is detected in the previous window. This adaptive decoding strategy effectively mitigates error propagation without requiring modifications to the code and decoder structures.
[LG-41] Improving sub-seasonal wind-speed forecasts in Europe with a non-linear model
链接: https://arxiv.org/abs/2411.19077
作者: Ganglin Tian(1),Camille Le Coz(1),Anastase Alexandre Charantonis(1, 2),Alexis Tantet(1),Naveen Goutham(1, 3),Riwal Plougonven(1) ((1) LMD/IPSL, École Polytechnique, Palaiseau, France, (2) INRIA, Paris, France, (3) EDF R&D, Palaiseau, France)
关键词-EN: provide valuable guidance, power system planning, wind power system, winds decrease sharply, surface wind speed
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
点击查看摘要
Abstract:Sub-seasonal wind speed forecasts provide valuable guidance for wind power system planning and operations, yet the forecasting skills of surface winds decrease sharply after two weeks. However, large-scale variables exhibit greater predictability on this time scale. This study explores the potential of leveraging non-linear relationships between 500 hPa geopotential height (Z500) and surface wind speed to improve sub-seasonal wind speed forecasting skills in Europe. Our proposed framework uses a Multiple Linear Regression (MLR) or a Convolutional Neural Network (CNN) to regress surface wind speed from Z500. Evaluations on ERA5 reanalysis indicate that the CNN performs better due to its non-linearity. Applying these models to sub-seasonal forecasts from the European Centre for Medium-Range Weather Forecasts, various verification metrics demonstrate the advantages of non-linearity. Yet, this is partly explained by the fact that these statistical models are under-dispersive since they explain only a fraction of the target variable variance. Introducing stochastic perturbations to represent the stochasticity of the unexplained part from the signal helps compensate for this issue. Results show that the perturbed CNN performs better than the perturbed MLR only in the first weeks, while the perturbed MLR’s performance converges towards that of the perturbed CNN after two weeks. The study finds that introducing stochastic perturbations can address the issue of insufficient spread in these statistical models, with improvements from the non-linearity varying with the lead time of the forecasts.
[LG-42] Aggregating Data for Optimal and Private Learning
链接: https://arxiv.org/abs/2411.19045
作者: Sushant Agarwal,Yukti Makhija,Rishi Saket,Aravindan Raghuveer
关键词-EN: Multiple Instance Regression, learning frameworks arising, Multiple Instance, Label Proportions, learning frameworks
类目: Machine Learning (cs.LG)
*备注: 36 pages
点击查看摘要
Abstract:Multiple Instance Regression (MIR) and Learning from Label Proportions (LLP) are learning frameworks arising in many applications, where the training data is partitioned into disjoint sets or bags, and only an aggregate label i.e., bag-label for each bag is available to the learner. In the case of MIR, the bag-label is the label of an undisclosed instance from the bag, while in LLP, the bag-label is the mean of the bag’s labels. In this paper, we study for various loss functions in MIR and LLP, what is the optimal way to partition the dataset into bags such that the utility for downstream tasks like linear regression is maximized. We theoretically provide utility guarantees, and show that in each case, the optimal bagging strategy (approximately) reduces to finding an optimal clustering of the feature vectors or the labels with respect to natural objectives such as k -means. We also show that our bagging mechanisms can be made label-differentially private, incurring an additional utility error. We then generalize our results to the setting of Generalized Linear Models (GLMs). Finally, we experimentally validate our theoretical results.
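作为直观示意(假设代码,并非论文给出的最优分袋或差分隐私机制),下面用 k-means 对特征向量聚类来构造袋,并取每个袋的标签均值作为 LLP 的袋标签:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(300, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(300)

k = 10
bags = KMeans(n_clusters=k, n_init=10).fit_predict(X)            # 以特征聚类作为分袋策略
bag_labels = np.array([y[bags == b].mean() for b in range(k)])   # LLP:每个袋只保留标签均值
print(bag_labels)
```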
[LG-43] Pilot Contamination Aware Transformer for Downlink Power Control in Cell-Free Massive MIMO Networks
链接: https://arxiv.org/abs/2411.19020
作者: Atchutaram K. Kocharlakota,Sergiy A. Vorobyov,Robert W. Heath Jr
关键词-EN: online iterative steps, conventional iterative optimization, massive multiple-input multiple-output, cell-free massive multiple-input, computationally intensive due
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 13 pages (double-column), 10 figures, 3 tables
点击查看摘要
Abstract:Learning-based downlink power control in cell-free massive multiple-input multiple-output (CFmMIMO) systems offers a promising alternative to conventional iterative optimization algorithms, which are computationally intensive due to online iterative steps. Existing learning-based methods, however, often fail to exploit the intrinsic structure of channel data and neglect pilot allocation information, leading to suboptimal performance, especially in large-scale networks with many users. This paper introduces the pilot contamination-aware power control (PAPC) transformer neural network, a novel approach that integrates pilot allocation data into the network, effectively handling pilot contamination scenarios. PAPC employs the attention mechanism with a custom masking technique to utilize structural information and pilot data. The architecture includes tailored preprocessing and post-processing stages for efficient feature extraction and adherence to power constraints. Trained in an unsupervised learning framework, PAPC is evaluated against the accelerated proximal gradient (APG) algorithm, showing comparable spectral efficiency fairness performance while significantly improving computational efficiency. Simulations demonstrate PAPC’s superior performance over fully connected networks (FCNs) that lack pilot information, its scalability to large-scale CFmMIMO networks, and its computational efficiency improvement over APG. Additionally, by employing padding techniques, PAPC adapts to the dynamically varying number of users without retraining.
[LG-44] Neural Operators for Predictor Feedback Control of Nonlinear Delay Systems
链接: https://arxiv.org/abs/2411.18964
作者: Luke Bhan,Peijia Qin,Miroslav Krstic,Yuanyuan Shi
关键词-EN: neural operator, neural operator approximation, critical for delay-compensating, Predictor, operator
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注: 22 pages, 2 figures
点击查看摘要
Abstract:Predictor feedback designs are critical for delay-compensating controllers in nonlinear systems. However, these designs are limited in practical applications as predictors cannot be directly implemented, but require numerical approximation schemes. These numerical schemes, typically combining finite difference and successive approximations, become computationally prohibitive when the dynamics of the system are expensive to compute. To alleviate this issue, we propose approximating the predictor mapping via a neural operator. In particular, we introduce a new perspective on predictor designs by recasting the predictor formulation as an operator learning problem. We then prove the existence of an arbitrarily accurate neural operator approximation of the predictor operator. Under the approximated-predictor, we achieve semiglobal practical stability of the closed-loop nonlinear system. The estimate is semiglobal in a unique sense - namely, one can increase the set of initial states as large as desired but this will naturally increase the difficulty of training a neural operator approximation which appears practically in the stability estimate. Furthermore, we emphasize that our result holds not just for neural operators, but any black-box predictor satisfying a universal approximation error bound. From a computational perspective, the advantage of the neural operator approach is clear as it requires training once, offline and then is deployed with very little computational cost in the feedback controller. We conduct experiments controlling a 5-link robotic manipulator with different state-of-the-art neural operator architectures demonstrating speedups on the magnitude of 10^2 compared to traditional predictor approximation schemes.
[LG-45] ICLERB: In-Context Learning Embedding and Reranker Benchmark
链接: https://arxiv.org/abs/2411.18947
作者: Marie Al Ghossein,Emile Contal,Alexandre Robicquet
关键词-EN: enables Large Language, Large Language Models, Large Language, relevant information, Language Models
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:In-Context Learning (ICL) enables Large Language Models (LLMs) to perform new tasks by conditioning on prompts with relevant information. Retrieval-Augmented Generation (RAG) enhances ICL by incorporating retrieved documents into the LLM’s context at query time. However, traditional retrieval methods focus on semantic relevance, treating retrieval as a search problem. In this paper, we propose reframing retrieval for ICL as a recommendation problem, aiming to select documents that maximize utility in ICL tasks. We introduce the In-Context Learning Embedding and Reranker Benchmark (ICLERB), a novel evaluation framework that compares retrievers based on their ability to enhance LLM accuracy in ICL settings. Additionally, we propose a novel Reinforcement Learning-to-Rank from AI Feedback (RLRAIF) algorithm, designed to fine-tune retrieval models using minimal feedback from the LLM. Our experimental results reveal notable differences between ICLERB and existing benchmarks, and demonstrate that small models fine-tuned with our RLRAIF algorithm outperform large state-of-the-art retrieval models. These findings highlight the limitations of existing evaluation methods and the need for specialized benchmarks and training strategies adapted to ICL.
[LG-46] FedRGL: Robust Federated Graph Learning for Label Noise
链接: https://arxiv.org/abs/2411.18905
作者: De Li,Haodong Qian,Qiyu Li,Zhou Tan,Zemin Gan,Jinyan Wang,Xianxian Li
关键词-EN: distributed machine learning, machine learning paradigm, learning paradigm based, graph neural networks, local graph data
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated Graph Learning (FGL) is a distributed machine learning paradigm based on graph neural networks, enabling secure and collaborative modeling of local graph data among clients. However, label noise can degrade the global model’s generalization performance. Existing federated label noise learning methods, primarily focused on computer vision, often yield suboptimal results when applied to FGL. To address this, we propose a robust federated graph learning method with label noise, termed FedRGL. FedRGL introduces dual-perspective consistency noise node filtering, leveraging both the global model and subgraph structure under class-aware dynamic thresholds. To enhance client-side training, we incorporate graph contrastive learning, which improves encoder robustness and assigns high-confidence pseudo-labels to noisy nodes. Additionally, we measure model quality via predictive entropy of unlabeled nodes, enabling adaptive robust aggregation of the global model. Comparative experiments on multiple real-world graph datasets show that FedRGL outperforms 12 baseline methods across various noise rates, types, and numbers of clients.
[LG-47] Swarm Intelligence-Driven Client Selection for Federated Learning in Cybersecurity applications
链接: https://arxiv.org/abs/2411.18877
作者: Koffka Khan,Wayne Goodridge
关键词-EN: Swarm Intelligence Optimization, Particle Swarm Optimization, Federated Learning, selection in Federated, Swarm Intelligence
类目: Machine Learning (cs.LG)
*备注: 21 pages, 1 figure, 15 tables
点击查看摘要
Abstract:This study addresses a critical gap in the literature regarding the use of Swarm Intelligence Optimization (SI) algorithms for client selection in Federated Learning (FL), with a focus on cybersecurity applications. Existing research primarily explores optimization techniques for centralized machine learning, leaving the unique challenges of client diversity, non-IID data distributions, and adversarial noise in decentralized FL largely unexamined. To bridge this gap, we evaluate nine SI algorithms-Grey Wolf Optimization (GWO), Particle Swarm Optimization (PSO), Cuckoo Search, Bat Algorithm, Bee Colony, Ant Colony Optimization, Fish Swarm, Glow Worm, and Intelligent Water Droplet-across four experimental scenarios: fixed client participation, dynamic participation patterns, heterogeneous non-IID data distributions, and adversarial noise conditions. Results indicate that GWO exhibits superior adaptability and robustness, achieving the highest accuracy, recall, and F1-scores across all configurations, while PSO and Cuckoo Search also demonstrate strong performance. These findings underscore the potential of SI algorithms to address decentralized and adversarial FL challenges, offering scalable and resilient solutions for cybersecurity applications, including intrusion detection in IoT and large-scale networks.
[LG-48] Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach
链接: https://arxiv.org/abs/2411.18873
作者: Yijia Zhang,Zhihong Gou,Shijie Cao,Weigang Feng,Sicheng Zhang,Guohao Dai,Ningyi Xu
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, significant energy consumption, revolutionized various fields
类目: Performance (cs.PF); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep Neural Networks (DNNs) have revolutionized various fields, but their deployment on GPUs often leads to significant energy consumption. Unlike existing methods for reducing GPU energy consumption, which are either hardware-inflexible or limited by workload constraints, this paper addresses the problem at the GPU kernel level. We propose a novel search-based compilation method to generate energy-efficient GPU kernels by incorporating energy efficiency into the search process. To accelerate the energy evaluation process, we develop an accurate energy cost model based on high-level kernel features. Furthermore, we introduce a dynamic updating strategy for the energy cost model, reducing the need for on-device energy measurements and accelerating the search process. Our evaluation demonstrates that the proposed approach can generate GPU kernels with up to 21.69% reduced energy consumption while maintaining low latency.
[LG-49] A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems
链接: https://arxiv.org/abs/2411.18872
作者: Roozbeh Yousefzadeh,Xuenan Cao
关键词-EN: IMO problems, formal proofs, proofs, problems, IMO
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Using AI to write formal proofs for mathematical problems is a challenging task that has seen some advancements in recent years. Automated systems such as Lean can verify the correctness of proofs written in formal language, yet writing the proofs in formal language can be challenging for humans and machines. The miniF2F benchmark has 20 IMO problems in its testing set, yet formal proofs are available only for 7 of these problems (3 of which are written only by mathematicians). The model with the best accuracy can only prove 4 of these 20 IMO problems, from the 1950s and 60s, while its training set is a secret. In this work, we write complete, original formal proofs for the remaining 13 IMO problems in Lean along with 3 extra problems from IMO 2022 and 2023. This effort expands the availability of proofs currently in the public domain by creating 5,150 lines of Lean proof. The goal of the paper is to pave the way for developing AI models that can automatically write the formal proofs for all the IMO problems in miniF2F and beyond. In this pursuit, we devise a method to decompose the proof of these problems into their building blocks, constructing a dataset of about 900 lemmas with 25,500 lines of Lean code. These lemmas are not trivial, yet they are approachable, providing the opportunity to evaluate and diagnose the failures and successes of AI models. We then evaluate the ability of GPT-4 in writing formal proofs for these lemmas with zero-shot prompting, CoT reasoning and lemma retrieval. In evaluating the responses, we also analyze the confounding factor of LLM’s ability to write the proofs in natural language vs Lean language.
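为了说明"在 Lean 中写形式化证明"的含义,这里给出一个与论文数据集无关的极小 Lean 4 示例(仅作演示,直接调用标准库引理,并非 miniF2F 中的题目):

```lean
-- 极小的 Lean 4 证明示例:自然数加法交换律
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```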
[LG-50] ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics
链接: https://arxiv.org/abs/2411.18825
作者: Letian Chen,Matthew Gombolay
关键词-EN: Large Language Models, hoc reward functions, demonstrated compelling performance, design of complex, compelling performance
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot represent the problem properly with only text-based descriptions. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to align robot behavior with user intentions better. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and match the demonstrated behaviors optimally. ELEMENTAL also introduces an iterative feedback-loop through self-reflection to improve feature, reward, and policy learning. Our experiment results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success, and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in LfD.
[LG-51] One-Step Early Stopping Strategy using Neural Tangent Kernel Theory and Rademacher Complexity
链接: https://arxiv.org/abs/2411.18806
作者: Daniel Martin Xavier,Ludovic Chamoin,Jawher Jerray,Laurent Fribourg
关键词-EN: early stopping strategy, stopping strategy consists, strategy consists, training error, initial training error
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 7 pages, 2 figures
点击查看摘要
Abstract:The early stopping strategy consists in stopping the training process of a neural network (NN) on a set S of input data before training error is minimal. The advantage is that the NN then retains good generalization properties, i.e. it gives good predictions on data outside S , and a good estimate of the statistical error ("population loss") is obtained. We give here an analytical estimation of the optimal stopping time involving basically the initial training error vector and the eigenvalues of the "neural tangent kernel". This yields an upper bound on the population loss which is well-suited to the underparameterized context (where the number of parameters is moderate compared with the number of data). Our method is illustrated on the example of an NN simulating the MPC control of a Van der Pol oscillator.
[LG-52] Stratified Non-Negative Tensor Factorization
链接: https://arxiv.org/abs/2411.18805
作者: Alexander Sietsema,Zerrin Vural,James Chapman,Yotam Yaniv,Deanna Needell
关键词-EN: Non-negative matrix factorization, decompose non-negative high-dimensional, non-negative low-rank components, non-negative tensor factorization, non-negative high-dimensional data
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 5 pages. Will appear in IEEE Asilomar Conference on Signals, Systems, and Computers 2024
点击查看摘要
Abstract:Non-negative matrix factorization (NMF) and non-negative tensor factorization (NTF) decompose non-negative high-dimensional data into non-negative low-rank components. NMF and NTF methods are popular for their intrinsic interpretability and effectiveness on large-scale data. Recent work developed Stratified-NMF, which applies NMF to regimes where data may come from different sources (strata) with different underlying distributions, and seeks to recover both strata-dependent information and global topics shared across strata. Applying Stratified-NMF to multi-modal data requires flattening across modes, and therefore loses geometric structure contained implicitly within the tensor. To address this problem, we extend Stratified-NMF to the tensor setting by developing a multiplicative update rule and demonstrating the method on text and image data. We find that Stratified-NTF can identify interpretable topics with lower memory requirements than Stratified-NMF. We also introduce a regularized version of the method and demonstrate its effects on image data.
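作为背景示意,下面给出经典 NMF 的乘法更新规则(注意:这只是标准 Frobenius-NMF,并非论文针对张量与分层结构提出的 Stratified-NTF 更新规则):

```python
import numpy as np

def nmf_multiplicative(X, rank=5, iters=200, eps=1e-9):
    """标准 NMF 乘法更新:X ≈ W @ H,所有因子保持非负"""
    m, n = X.shape
    W = np.abs(np.random.rand(m, rank))
    H = np.abs(np.random.rand(rank, n))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # 更新 H
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # 更新 W
    return W, H

W, H = nmf_multiplicative(np.abs(np.random.rand(50, 30)))
print(W.shape, H.shape)
```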
[LG-53] Graph-Based Biomarker Discovery and Interpretation for Alzheimers Disease
链接: https://arxiv.org/abs/2411.18796
作者: Maryam Khalid,Fadeel Sher Khan,John Broussard,Arko Barman
关键词-EN: Alzheimer Disease, Early diagnosis, therapeutic drug targets, crucial objectives, management of Alzheimer
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 9 pages, 7 figures
点击查看摘要
Abstract:Early diagnosis and discovery of therapeutic drug targets are crucial objectives for the effective management of Alzheimer’s Disease (AD). Current approaches for AD diagnosis and treatment planning are based on radiological imaging and largely inaccessible for population-level screening due to prohibitive costs and limited availability. Recently, blood tests have shown promise in diagnosing AD and highlighting possible biomarkers that can be used as drug targets for AD management. Blood tests are significantly more accessible to disadvantaged populations, cost-effective, and minimally invasive. However, biomarker discovery in the context of AD diagnosis is complex as there exist important associations between various biomarkers. Here, we introduce BRAIN (Biomarker Representation, Analysis, and Interpretation Network), a novel machine learning (ML) framework to jointly optimize the diagnostic accuracy and biomarker discovery processes to identify all relevant biomarkers that contribute to AD diagnosis. Using a holistic graph-based representation for biomarkers, we highlight their inter-dependencies and explain why different ML models identify different discriminative biomarkers. We apply BRAIN to a publicly available blood biomarker dataset, revealing three novel biomarker sub-networks whose interactions vary between the control and AD groups, offering a new paradigm for drug discovery and biomarker analysis for AD.
[LG-54] Investigating Plausibility of Biologically Inspired Bayesian Learning in ANNs
链接: https://arxiv.org/abs/2411.18788
作者: Ram Zaveri
关键词-EN: Catastrophic forgetting, artificial systems, biologically inspired Bayesian, Current artificial systems, systems
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Catastrophic forgetting has been the leading issue in the domain of lifelong learning in artificial systems. Current artificial systems are reasonably good at learning domains they have seen before; however, as soon as they encounter something new, they either go through a significant performance deterioration or if you try to teach them the new distribution of data, they forget what they have learned before. Additionally, they are also prone to being overly confident when performing inference on seen as well as unseen data, causing significant reliability issues when lives are at stake. Therefore, it is extremely important to dig into this problem and formulate an approach that will be continually adaptable as well as reliable. If we move away from the engineering domain of such systems and look into biological systems, we can realize that these very systems are very efficient at computing the reliance as well as the uncertainty of accurate predictions that further help them refine the inference in a life-long setting. These systems are not perfect; however, they do give us a solid understanding of the reasoning under uncertainty which takes us to the domain of Bayesian reasoning. We incorporate this Bayesian inference with thresholding mechanism as to mimic more biologically inspired models, but only at spatial level. Further, we reproduce a recent study on Bayesian Inference with Spiking Neural Networks for Continual Learning to compare against it as a suitable biologically inspired Bayesian framework. Overall, we investigate the plausibility of biologically inspired Bayesian Learning in artificial systems on a vision dataset, MNIST, and show relative performance improvement under the conditions when the model is forced to predict VS when the model is not.
[LG-55] Classification of Deceased Patients from Non-Deceased Patients using Random Forest and Support Vector Machine Classifiers
链接: https://arxiv.org/abs/2411.18759
作者: Dheeman Saha,Aaron Segura,Biraj Tiwari
关键词-EN: Analyzing large datasets, Analyzing large, large datasets, datasets and summarizing, data mining process
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Analyzing large datasets and summarizing them into useful information is the heart of the data mining process. In healthcare, information can be converted into knowledge about patient historical patterns and possible future trends. During the COVID-19 pandemic, data mining COVID-19 patient information poses an opportunity to discover patterns that may signal that the patient is at high risk for death. COVID-19 patients die from sepsis, a complex disease process involving multiple organ systems. We extracted the variables physicians are most concerned about regarding viral septic infections. With the aim of distinguishing COVID-19 patients who survive their hospital stay from those who do not, the authors of this study utilize the Support Vector Machine (SVM) and the Random Forest (RF) classification techniques to classify patients according to their demographics, laboratory test results, and preexisting health conditions. After conducting a 10-fold validation procedure, we assessed the performance of the classification through a Receiver Operating Characteristic (ROC) curve, and a Confusion Matrix was used to determine the accuracy of the classifiers. We also performed a cluster analysis on the binary factors, such as if the patient had a preexisting condition and if sepsis was identified, and the numeric values from patient demographics and laboratory test results as predictors.
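以下是一个通用的 scikit-learn 流程示意(数据为随机占位,并非论文使用的真实 COVID-19 数据;特征含义与阈值均为假设),展示用 10 折交叉验证比较 SVM 与随机森林并输出混淆矩阵:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, roc_auc_score

X = np.random.randn(500, 12)                                  # 占位:人口学 + 化验指标特征
y = (X[:, 0] + 0.5 * np.random.randn(500) > 0).astype(int)    # 占位:是否死亡

for name, clf in [("SVM", SVC(probability=True)), ("RF", RandomForestClassifier())]:
    proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]  # 10 折交叉验证
    pred = (proba >= 0.5).astype(int)
    print(name, "AUC:", roc_auc_score(y, proba))
    print(confusion_matrix(y, pred))
```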
[LG-56] Locally Differentially Private Online Federated Learning With Correlated Noise
链接: https://arxiv.org/abs/2411.18752
作者: Jiaojiao Zhang,Linglingzhi Zhu,Dominik Fay,Mikael Johansson
关键词-EN: locally differentially private, employs temporally correlated, temporally correlated noise, online federated learning, correlated noise
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
*备注: arXiv admin note: text overlap with arXiv:2403.16542
点击查看摘要
Abstract:We introduce a locally differentially private (LDP) algorithm for online federated learning that employs temporally correlated noise to improve utility while preserving privacy. To address challenges posed by the correlated noise and local updates with streaming non-IID data, we develop a perturbed iterate analysis that controls the impact of the noise on the utility. Moreover, we demonstrate how the drift errors from local updates can be effectively managed for several classes of nonconvex loss functions. Subject to an (\epsilon,\delta) -LDP budget, we establish a dynamic regret bound that quantifies the impact of key parameters and the intensity of changes in the dynamic environment on the learning performance. Numerical experiments confirm the efficacy of the proposed algorithm.
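下面是"时间相关噪声"这一思路的极简示意(假设采用 AR(1) 型相关高斯噪声,这只是常见的一种构造方式,并非论文的具体机制,也未做隐私参数换算):

```python
import numpy as np

def correlated_noise_stream(shape, steps, rho=0.9, sigma=1.0, seed=0):
    """生成时间相关(AR(1))的高斯噪声序列,用于在每轮本地更新上叠加扰动"""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma, shape)
    for _ in range(steps):
        yield z
        z = rho * z + np.sqrt(1 - rho ** 2) * rng.normal(0.0, sigma, shape)

# 用法示例:对一串本地梯度逐轮加上相关噪声
grads = [np.ones(4) for _ in range(3)]
noisy = [g + n for g, n in zip(grads, correlated_noise_stream((4,), 3))]
print(noisy)
```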
[LG-57] Inference Privacy: Properties and Mechanisms
链接: https://arxiv.org/abs/2411.18746
作者: Fengwei Tian,Ravi Tandon
关键词-EN: reconstructing users’ private, Ensuring privacy, privacy, users’ private inputs, stage is crucial
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Ensuring privacy during inference stage is crucial to prevent malicious third parties from reconstructing users’ private inputs from outputs of public models. Despite a large body of literature on privacy preserving learning (which ensures privacy of training data), there is no existing systematic framework to ensure the privacy of users’ data during inference. Motivated by this problem, we introduce the notion of Inference Privacy (IP), which can allow a user to interact with a model (for instance, a classifier, or an AI-assisted chat-bot) while providing a rigorous privacy guarantee for the users’ data at inference. We establish fundamental properties of the IP privacy notion and also contrast it with the notion of Local Differential Privacy (LDP). We then present two types of mechanisms for achieving IP: namely, input perturbations and output perturbations which are customizable by the users and can allow them to navigate the trade-off between utility and privacy. We also demonstrate the usefulness of our framework via experiments and highlight the resulting trade-offs between utility and privacy during inference.
[LG-58] Foundation Models in Radiology: What How When Why and Why Not
链接: https://arxiv.org/abs/2411.18730
作者: Magdalini Paschali,Zhihong Chen,Louis Blankemeier,Maya Varma,Alaa Youssef,Christian Bluethgen,Curtis Langlotz,Sergios Gatidis,Akshay Chaudhari
关键词-EN: large-scale deep learning, deep learning models, learning models capable, foundation models, Recent advances
类目: Machine Learning (cs.LG)
*备注: This pre-print has been accepted for publication in Radiology
点击查看摘要
Abstract:Recent advances in artificial intelligence have witnessed the emergence of large-scale deep learning models capable of interpreting and generating both textual and imaging data. Such models, typically referred to as foundation models, are trained on extensive corpora of unlabeled data and demonstrate high performance across various tasks. Foundation models have recently received extensive attention from academic, industry, and regulatory bodies. Given the potentially transformative impact that foundation models can have on the field of radiology, this review aims to establish a standardized terminology concerning foundation models, with a specific focus on the requirements of training data, model training paradigms, model capabilities, and evaluation strategies. We further outline potential pathways to facilitate the training of radiology-specific foundation models, with a critical emphasis on elucidating both the benefits and challenges associated with such models. Overall, we envision that this review can unify technical advances and clinical needs in the training of foundation models for radiology in a safe and responsible manner, for ultimately benefiting patients, providers, and radiologists.
[LG-59] Addressing bias in Recommender Systems: A Case Study on Data Debiasing Techniques in Mobile Games RECSYS2024 RECSYS
链接: https://arxiv.org/abs/2411.18716
作者: Yixiong Wang,Maria Paskevich,Hui Wang
关键词-EN: experiences rapid growth, rapid growth, experiences rapid, mobile gaming industry, mobile gaming
类目: Machine Learning (cs.LG)
*备注: RobustRecSys workshop @ RecSys 2024
点击查看摘要
Abstract:The mobile gaming industry, particularly the free-to-play sector, has been around for more than a decade, yet it still experiences rapid growth. The concept of games-as-service requires game developers to pay much more attention to recommendations of content in their games. With recommender systems (RS), the inevitable problem of bias in the data comes hand in hand. A lot of research has been done on the case of bias in RS for online retail or services, but much less is available for the specific case of the game industry. Also, in previous works, various debiasing techniques were tested on explicit feedback datasets, while it is much more common in mobile gaming data to only have implicit feedback. This case study aims to identify and categorize potential bias within datasets specific to model-based recommendations in mobile games, review debiasing techniques in the existing literature, and assess their effectiveness on real-world data gathered through implicit feedback. The effectiveness of these methods is then evaluated based on their debiasing quality, data requirements, and computational demands.
[LG-60] Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits
链接: https://arxiv.org/abs/2411.18704
作者: Daniel Morales-Brotons,Thijs Vogels,Hadrien Hendrikx
关键词-EN: Stochastic Gradient Descent, Gradient Descent, Stochastic Gradient, Exponential Moving Average, popular method
类目: Machine Learning (cs.LG)
*备注: 27 pages, 9 figures. Accepted at TMLR, April 2024
点击查看摘要
Abstract:Weight averaging of Stochastic Gradient Descent (SGD) iterates is a popular method for training deep learning models. While it is often used as part of complex training pipelines to improve generalization or serve as a `teacher’ model, weight averaging lacks proper evaluation on its own. In this work, we present a systematic study of the Exponential Moving Average (EMA) of weights. We first explore the training dynamics of EMA, give guidelines for hyperparameter tuning, and highlight its good early performance, partly explaining its success as a teacher. We also observe that EMA requires less learning rate decay compared to SGD since averaging naturally reduces noise, introducing a form of implicit regularization. Through extensive experiments, we show that EMA solutions differ from last-iterate solutions. EMA models not only generalize better but also exhibit improved i) robustness to noisy labels, ii) prediction consistency, iii) calibration and iv) transfer learning. Therefore, we suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
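EMA 权重平均本身实现很简单,下面是一个与框架无关的示意(衰减率与"训练"更新均为假设的占位):

```python
def ema_update(ema_params, params, decay=0.999):
    """对每个参数做指数滑动平均:ema <- decay * ema + (1 - decay) * param"""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# 用法示例:训练循环中每步更新一次 EMA 参数,评估/部署时改用 ema_params
params = [1.0, -2.0]
ema_params = list(params)
for step in range(100):
    params = [p - 0.01 * p for p in params]   # 占位:一步"训练"更新
    ema_params = ema_update(ema_params, params)
print(params, ema_params)
```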
[LG-61] Another look at inference after prediction
链接: https://arxiv.org/abs/2411.19908
作者: Jessica Gronsbell,Jianhui Gao,Yaqi Shi,Zachary R. McCaw,David Cheng
关键词-EN: difficult to obtain, predictors are readily, inference, Prediction-based, partially observed outcome
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Prediction-based (PB) inference is increasingly used in applications where the outcome of interest is difficult to obtain, but its predictors are readily available. Unlike traditional inference, PB inference performs statistical inference using a partially observed outcome and a set of covariates by leveraging a prediction of the outcome generated from a machine learning (ML) model. Motwani and Witten (2023) recently revisited two innovative PB inference approaches for ordinary least squares. They found that the method proposed by Wang et al. (2020) yields a consistent estimator for the association of interest when the ML model perfectly captures the underlying regression function. Conversely, the prediction-powered inference (PPI) method proposed by Angelopoulos et al. (2023) yields valid inference regardless of the model’s accuracy. In this paper, we study the statistical efficiency of the PPI estimator. Our analysis reveals that a more efficient estimator, proposed 25 years ago by Chen and Chen (2000), can be obtained by simply adding a weight to the PPI estimator. We also contextualize PB inference with methods from the economics and statistics literature dating back to the 1960s. Our extensive theoretical and numerical analyses indicate that the Chen and Chen (CC) estimator offers a balance between robustness to ML model specification and statistical efficiency, making it the preferred choice for use in practice.
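以均值估计为例,下面给出一个带权重的预测辅助(PB)估计量的数值示意(加权系数粗略取控制变量式的 Cov(Y, f)/Var(f),忽略有限样本修正;这只是该类思路的示范,并非 Chen and Chen (2000) 或 PPI 估计量的精确公式):

```python
import numpy as np

def weighted_pb_mean(y_lab, f_lab, f_unlab):
    """y_lab: 有标签样本的真实结局; f_lab/f_unlab: ML 模型在有/无标签样本上的预测"""
    lam = np.cov(y_lab, f_lab)[0, 1] / np.var(f_lab)          # 控制变量式权重(粗略)
    return y_lab.mean() - lam * (f_lab.mean() - f_unlab.mean())

rng = np.random.default_rng(0)
x_lab, x_unlab = rng.normal(size=200), rng.normal(size=5000)
f = lambda x: 0.8 * x                                         # 占位 ML 模型
y_lab = x_lab + 0.3 * rng.normal(size=200)
print(weighted_pb_mean(y_lab, f(x_lab), f(x_unlab)))
```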
[LG-62] Noncommutative Model Selection for Data Clustering and Dimension Reduction Using Relative von Neumann Entropy
链接: https://arxiv.org/abs/2411.19902
作者: Araceli Guzmán-Tristán,Antonio Rieser
关键词-EN: completely data-driven algorithms, data, data sets, dimension reduction algorithm, propose a pair
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Other Statistics (stat.OT)
*备注: 20 pages
点击查看摘要
Abstract:We propose a pair of completely data-driven algorithms for unsupervised classification and dimension reduction, and we empirically study their performance on a number of data sets, both simulated data in three dimensions and images from the COIL-20 data set. The algorithms take as input a set of points sampled from a uniform distribution supported on a metric space, the latter embedded in an ambient metric space, and they output a clustering or reduction of dimension of the data. They work by constructing a natural family of graphs from the data and selecting the graph which maximizes the relative von Neumann entropy of certain normalized heat operators constructed from the graphs. Once the appropriate graph is selected, the eigenvectors of the graph Laplacian may be used to reduce the dimension of the data, and clusters in the data may be identified with the kernel of the associated graph Laplacian. Notably, these algorithms do not require information about the size of a neighborhood or the desired number of clusters as input, in contrast to popular algorithms such as k -means, and even more modern spectral methods such as Laplacian eigenmaps, among others. In our computational experiments, our clustering algorithm outperforms k -means clustering on data sets with non-trivial geometry and topology, in particular data whose clusters are not concentrated around a specific point, and our dimension reduction algorithm is shown to work well in several simple examples.
[LG-63] Efficient quantum-enhanced classical simulation for patches of quantum landscapes
链接: https://arxiv.org/abs/2411.19896
作者: Sacha Lerch,Ricard Puig,Manuel S. Rudolph,Armando Angrisani,Tyson Jones,M. Cerezo,Supanut Thanasilp,Zoë Holmes
关键词-EN: Understanding the capabilities, classical simulation methods, methods is key, key to identifying, computers are advantageous
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 + 47 pages, 4 figures
点击查看摘要
Abstract:Understanding the capabilities of classical simulation methods is key to identifying where quantum computers are advantageous. Not only does this ensure that quantum computers are used only where necessary, but also one can potentially identify subroutines that can be offloaded onto a classical device. In this work, we show that it is always possible to generate a classical surrogate of a sub-region (dubbed a “patch”) of an expectation landscape produced by a parameterized quantum circuit. That is, we provide a quantum-enhanced classical algorithm which, after simple measurements on a quantum device, allows one to classically simulate approximate expectation values of a subregion of a landscape. We provide time and sample complexity guarantees for a range of families of circuits of interest, and further numerically demonstrate our simulation algorithms on an exactly verifiable simulation of a Hamiltonian variational ansatz and long-time dynamics simulation on a 127-qubit heavy-hex topology.
[LG-64] Machine learning force-field model for kinetic Monte Carlo simulations of itinerant Ising magnets
链接: https://arxiv.org/abs/2411.19780
作者: Alexa Tyberg,Yunhao Fan,Gia-Wei Chern
关键词-EN: kinetic Monte Carlo, Monte Carlo, large-scale kinetic Monte, kinetic Monte, scalable machine learning
类目: Statistical Mechanics (cond-mat.stat-mech); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注: 11 pages, 7 figures
点击查看摘要
Abstract:We present a scalable machine learning (ML) framework for large-scale kinetic Monte Carlo (kMC) simulations of itinerant electron Ising systems. As the effective interactions between Ising spins in such itinerant magnets are mediated by conducting electrons, the calculation of energy change due to a local spin update requires solving an electronic structure problem. Such repeated electronic structure calculations could be overwhelmingly prohibitive for large systems. Assuming the locality principle, a convolutional neural network (CNN) model is developed to directly predict the effective local field and the corresponding energy change associated with a given spin update based on Ising configuration in a finite neighborhood. As the kernel size of the CNN is fixed at a constant, the model can be directly scalable to kMC simulations of large lattices. Our approach is reminiscent of the ML force-field models widely used in first-principles molecular dynamics simulations. Applying our ML framework to a square-lattice double-exchange Ising model, we uncover unusual coarsening of ferromagnetic domains at low temperatures. Our work highlights the potential of ML methods for large-scale modeling of similar itinerant systems with discrete dynamical variables.
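A toy PyTorch sketch of the locality idea: a small CNN maps a fixed-size spin patch around a candidate flip to a predicted energy change, which then drives a single-flip acceptance step as a stand-in for the paper's kMC update. The patch radius, architecture, and untrained weights are placeholders, not the published model.

```python
import math
import torch
import torch.nn as nn

class LocalEnergyCNN(nn.Module):
    """Map a (2R+1)x(2R+1) Ising patch to the predicted energy change of flipping its center spin."""
    def __init__(self, radius=4):
        super().__init__()
        k = 2 * radius + 1
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * k * k, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, patch):                     # patch: (batch, 1, k, k), spins in {-1, +1}
        return self.net(patch).squeeze(-1)

def single_flip_step(model, spins, beta, radius=4):
    """One toy acceptance step driven by the surrogate instead of an electronic-structure solve."""
    L = spins.shape[0]
    i, j = torch.randint(0, L, (2,))
    rows = (torch.arange(-radius, radius + 1) + i) % L   # periodic-boundary patch indices
    cols = (torch.arange(-radius, radius + 1) + j) % L
    patch = spins[rows][:, cols].float()[None, None]
    dE = model(patch).item()
    if dE < 0 or torch.rand(1).item() < math.exp(-beta * dE):
        spins[i, j] *= -1
    return spins

if __name__ == "__main__":
    torch.manual_seed(0)
    model = LocalEnergyCNN(radius=4)      # in practice trained on exact dE from small lattices
    spins = torch.where(torch.rand(32, 32) < 0.5, 1, -1)
    for _ in range(100):
        spins = single_flip_step(model, spins, beta=2.0)
    print(spins.float().mean().item())
```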
[LG-65] Nonparametric Instrumental Regression via Kernel Methods is Minimax Optimal
链接: https://arxiv.org/abs/2411.19653
作者: Dimitri Meunier,Zhu Li,Tim Christensen,Arthur Gretton
关键词-EN: kernel instrumental variable, strong empirical performance, demonstrated strong empirical, instrumental variable algorithm, kernel NPIV
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study the kernel instrumental variable algorithm of Singh et al. (2019), a nonparametric two-stage least squares (2SLS) procedure which has demonstrated strong empirical performance. We provide a convergence analysis that covers both the identified and unidentified settings: when the structural function cannot be identified, we show that the kernel NPIV estimator converges to the IV solution with minimum norm. Crucially, our convergence is with respect to the strong \( L_2 \)-norm, rather than a pseudo-norm. Additionally, we characterize the smoothness of the target function without relying on the instrument, instead leveraging a new description of the projected subspace size (this being closely related to the link condition in inverse learning literature). With the subspace size description and under standard kernel learning assumptions, we derive, for the first time, the minimax optimal learning rate for kernel NPIV in the strong \( L_2 \)-norm. Our result demonstrates that the strength of the instrument is essential to achieve efficient learning. We also improve the original kernel NPIV algorithm by adopting a general spectral regularization in stage 1 regression. The modified regularization can overcome the saturation effect of Tikhonov regularization.
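A minimal two-stage least squares sketch in the spirit of kernel NPIV, using random Fourier features as an explicit approximation to the kernel feature maps; the single-sample setup, ridge penalties, and feature dimension are simplifications rather than the estimator analyzed in the paper.

```python
import numpy as np

def rff(x, W, b):
    """Random Fourier features approximating an RBF kernel feature map."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(x @ W + b)

def ridge(A, B, lam):
    """Solve min ||A w - B||^2 + lam ||w||^2."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ B)

rng = np.random.default_rng(0)
n, D = 2000, 200

# Simulated IV problem: instrument Z, endogenous regressor X, unobserved confounder U.
Z = rng.normal(size=(n, 1))
U = rng.normal(size=(n, 1))
X = Z + U + 0.1 * rng.normal(size=(n, 1))
Y = np.sin(X) + U + 0.1 * rng.normal(size=(n, 1))        # structural function: sin

Wx, bx = rng.normal(size=(1, D)), rng.uniform(0, 2 * np.pi, D)
Wz, bz = rng.normal(size=(1, D)), rng.uniform(0, 2 * np.pi, D)
PhiX, PsiZ = rff(X, Wx, bx), rff(Z, Wz, bz)

# Stage 1: regress the X-features on the Z-features (conditional mean embedding surrogate).
C1 = ridge(PsiZ, PhiX, lam=1e-3 * n)
PhiX_hat = PsiZ @ C1

# Stage 2: regress Y on the predicted X-features.
w2 = ridge(PhiX_hat, Y, lam=1e-3 * n)

x_test = np.linspace(-2, 2, 5)[:, None]
print(rff(x_test, Wx, bx) @ w2)      # should roughly track sin(x_test)
```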
[LG-66] Non-linear Equalization in 112 Gb/s PONs Using Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2411.19631
作者: Rodrigo Fischer,Patrick Matalla,Sebastian Randel,Laurent Schmalen
关键词-EN: passive optical networks, investigate Kolmogorov-Arnold networks, passive optical, investigate Kolmogorov-Arnold, non-linear equalization
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted for possible publication at Optical Fiber Communication Conference (OFC) 2025
点击查看摘要
Abstract:We investigate Kolmogorov-Arnold networks (KANs) for non-linear equalization of 112 Gb/s PAM4 passive optical networks (PONs). Using pruning and extensive hyperparameter search, we outperform linear equalizers and convolutional neural networks at low computational complexity.
[LG-67] OpenQDC: Open Quantum Data Commons
链接: https://arxiv.org/abs/2411.19629
作者: Cristian Gabellini,Nikhil Shenoy,Stephan Thaler,Semih Canturk,Daniel McNeela,Dominique Beaini,Michael Bronstein,Prudencio Tossou
关键词-EN: Machine Learning Interatomic, Learning Interatomic Potentials, Machine Learning, Interatomic Potentials, highly promising alternative
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Machine Learning Interatomic Potentials (MLIPs) are a highly promising alternative to force-fields for molecular dynamics (MD) simulations, offering precise and rapid energy and force calculations. However, Quantum-Mechanical (QM) datasets, crucial for MLIPs, are fragmented across various repositories, hindering accessibility and model development. We introduce the openQDC package, consolidating 37 QM datasets from over 250 quantum methods and 400 million geometries into a single, accessible resource. These datasets are meticulously preprocessed, and standardized for MLIP training, covering a wide range of chemical elements and interactions relevant in organic chemistry. OpenQDC includes tools for normalization and integration, easily accessible via Python. Experiments with well-known architectures like SchNet, TorchMD-Net, and DimeNet reveal challenges for those architectures and constitute a leaderboard to accelerate benchmarking and guide novel algorithms development. Continuously adding datasets to OpenQDC will democratize QM dataset access, foster more collaboration and innovation, enhance MLIP development, and support their adoption in the MD field.
[LG-68] Materials Learning Algorithms (MALA): Scalable Machine Learning for Electronic Structure Calculations in Large-Scale Atomistic Simulations
链接: https://arxiv.org/abs/2411.19617
作者: Attila Cangi,Lenz Fiedler,Bartosz Brzoza,Karan Shah,Timothy J. Callow,Daniel Kotik,Steve Schmerler,Matthew C. Barry,James M. Goff,Andrew Rohskopf,Dayton J. Vogel,Normand Modine,Aidan P. Thompson,Sivasankaran Rajamanickam
关键词-EN: Materials Learning Algorithms, machine learning framework, learning framework designed, Learning Algorithms, large-scale atomistic simulations
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present the Materials Learning Algorithms (MALA) package, a scalable machine learning framework designed to accelerate density functional theory (DFT) calculations suitable for large-scale atomistic simulations. Using local descriptors of the atomic environment, MALA models efficiently predict key electronic observables, including local density of states, electronic density, density of states, and total energy. The package integrates data sampling, model training and scalable inference into a unified library, while ensuring compatibility with standard DFT and molecular dynamics codes. We demonstrate MALA’s capabilities with examples including boron clusters, aluminum across its solid-liquid phase boundary, and predicting the electronic structure of a stacking fault in a large beryllium slab. Scaling analyses reveal MALA’s computational efficiency and identify bottlenecks for future optimization. With its ability to model electronic structures at scales far beyond standard DFT, MALA is well suited for modeling complex material systems, making it a versatile tool for advanced materials research.
[LG-69] Topology-Preserving Scaling in Data Augmentation
链接: https://arxiv.org/abs/2411.19512
作者: Vu-Anh Le,Mehmet Dik
关键词-EN: non-uniform scaling transformations, scaling transformations, stability under non-uniform, scaling
类目: Algebraic Topology (math.AT); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 20 pages
点击查看摘要
Abstract:We propose an algorithmic framework for dataset normalization in data augmentation pipelines that preserves topological stability under non-uniform scaling transformations. Given a finite metric space \( X \subset \mathbb{R}^n \) with Euclidean distance \( d_X \), we consider scaling transformations defined by scaling factors \( s_1, s_2, \ldots, s_n > 0 \). Specifically, we define a scaling function \( S \) that maps each point \( x = (x_1, x_2, \ldots, x_n) \in X \) to \[ S(x) = (s_1 x_1, s_2 x_2, \ldots, s_n x_n). \] Our main result establishes that the bottleneck distance \( d_B(D, D_S) \) between the persistence diagrams \( D \) of \( X \) and \( D_S \) of \( S(X) \) satisfies \[ d_B(D, D_S) \leq (s_{\max} - s_{\min}) \cdot \operatorname{diam}(X), \] where \( s_{\min} = \min_{1 \leq i \leq n} s_i \), \( s_{\max} = \max_{1 \leq i \leq n} s_i \), and \( \operatorname{diam}(X) \) is the diameter of \( X \). Based on this theoretical guarantee, we formulate an optimization problem to minimize the scaling variability \( \Delta_s = s_{\max} - s_{\min} \) under the constraint \( d_B(D, D_S) \leq \epsilon \), where \( \epsilon > 0 \) is a user-defined tolerance. We develop an algorithmic solution to this problem, ensuring that data augmentation via scaling transformations preserves essential topological features. We further extend our analysis to higher-dimensional homological features, alternative metrics such as the Wasserstein distance, and iterative or probabilistic scaling scenarios. Our contributions provide a rigorous mathematical framework for dataset normalization in data augmentation pipelines, ensuring that essential topological characteristics are maintained despite scaling transformations.
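A small NumPy sketch of the stated guarantee: compute diam(X), bound the bottleneck distance by (s_max - s_min) * diam(X), and contract the scaling factors toward their midpoint until the bound meets a tolerance ε. The contraction rule is one simple choice consistent with the bound, not necessarily the paper's algorithm.

```python
import numpy as np
from scipy.spatial.distance import pdist

def bottleneck_bound(X, s):
    """Upper bound (s_max - s_min) * diam(X) on d_B between persistence diagrams of X and S(X)."""
    diam = pdist(X).max()
    return (s.max() - s.min()) * diam, diam

def constrain_scales(X, s, eps):
    """Shrink the scaling factors toward their midpoint until the bound is <= eps (illustrative rule)."""
    bound, diam = bottleneck_bound(X, s)
    if bound <= eps:
        return s
    target_spread = eps / diam
    mid = 0.5 * (s.max() + s.min())
    alpha = target_spread / (s.max() - s.min())   # contraction factor in (0, 1)
    return mid + alpha * (s - mid)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
s = np.array([0.5, 1.0, 2.0])
s_safe = constrain_scales(X, s, eps=0.1)
print(bottleneck_bound(X, s)[0], bottleneck_bound(X, s_safe)[0])
```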
[LG-70] Real-time Anomaly Detection at the L1 Trigger of CMS Experiment
链接: https://arxiv.org/abs/2411.19506
作者: Abhijith Gandrakota(on behalf of CMS collaboration)
关键词-EN: CMS experiment Global, experiment Global Trigger, LHC Run, experiment Global, test crate FPGAs
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Contribution to 42nd International Conference on High Energy Physics (ICHEP 2024)
点击查看摘要
Abstract:We present the preparation, deployment, and testing of an autoencoder trained for unbiased detection of new physics signatures in the CMS experiment Global Trigger (GT) test crate FPGAs during LHC Run 3. The GT makes the final decision whether to read out or discard the data from each LHC collision, which occurs at a rate of 40 MHz, within a 50 ns latency. The Neural Network makes a prediction for each event within these constraints, which can be used to select anomalous events for further analysis. The GT test crate is a copy of the main GT system, receiving the same input data, but whose output is not used to trigger the readout of CMS, providing a platform for thorough testing of new trigger algorithms on live data, but without interrupting data taking. We describe the methodology to achieve ultra-low-latency anomaly detection, and present the integration of the DNN into the GT test crate, as well as the monitoring, testing, and validation of the algorithm during proton collisions.
[LG-71] Unsupervised Learning Approach to Anomaly Detection in Gravitational Wave Data
链接: https://arxiv.org/abs/2411.19450
作者: Ammar Fayad
关键词-EN: Einstein General Theory, Theory of Relativity, Einstein General, General Theory, Gravitational waves
类目: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Gravitational waves (GW), predicted by Einstein’s General Theory of Relativity, provide a powerful probe of astrophysical phenomena and fundamental physics. In this work, we propose an unsupervised anomaly detection method using variational autoencoders (VAEs) to analyze GW time-series data. By training on noise-only data, the VAE accurately reconstructs noise inputs while failing to reconstruct anomalies, such as GW signals, which results in measurable spikes in the reconstruction error. The method was applied to data from the LIGO H1 and L1 detectors. Evaluation on testing datasets containing both noise and GW events demonstrated reliable detection, achieving an area under the ROC curve (AUC) of 0.89. This study introduces VAEs as a robust, unsupervised approach for identifying anomalies in GW data, which offers a scalable framework for detecting known and potentially new phenomena in physics.
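A compact PyTorch sketch of the train-on-noise-only idea: a plain autoencoder (deterministic here, rather than the paper's VAE) is fit to noise windows, and test windows whose reconstruction error exceeds a high percentile of the noise scores are flagged as candidate anomalies. Window length, architecture, and the injected toy signal are placeholders.

```python
import torch
import torch.nn as nn

WIN = 256  # samples per strain window (placeholder)

class AE(nn.Module):
    def __init__(self, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(WIN, 128), nn.ReLU(), nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, WIN))

    def forward(self, x):
        return self.dec(self.enc(x))

def train_on_noise(model, noise_windows, epochs=20, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(noise_windows), noise_windows)
        loss.backward()
        opt.step()
    return model

def anomaly_scores(model, windows):
    with torch.no_grad():
        return ((model(windows) - windows) ** 2).mean(dim=1)   # per-window reconstruction error

if __name__ == "__main__":
    torch.manual_seed(0)
    noise = torch.randn(512, WIN)                        # stand-in for detector noise segments
    model = train_on_noise(AE(), noise)
    test = torch.randn(64, WIN)
    t = torch.linspace(0, 1, WIN)
    test[0] += 3 * torch.sin(2 * torch.pi * 30 * t**2)   # injected chirp-like toy signal
    scores = anomaly_scores(model, test)
    threshold = torch.quantile(anomaly_scores(model, noise), 0.99)
    print((scores > threshold).nonzero().flatten())
```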
[LG-72] Machine learning the Ising transition: A comparison between discriminative and generative approaches
链接: https://arxiv.org/abs/2411.19370
作者: Difei Zhang,Frank Schäfer,Julian Arnold
关键词-EN: many-body physics, central task, task in many-body, Abstract, task
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 11+5 pages, 4+4 figures
点击查看摘要
Abstract:The detection of phase transitions is a central task in many-body physics. To automate this process, the task can be phrased as a classification problem. Classification problems can be approached in two fundamentally distinct ways: through either a discriminative or a generative method. In general, it is unclear which of these two approaches is most suitable for a given problem. The choice is expected to depend on factors such as the availability of system knowledge, dataset size, desired accuracy, computational resources, and other considerations. In this work, we answer the question of how one should approach the solution of phase-classification problems by performing a numerical case study on the thermal phase transition in the classical two-dimensional square-lattice ferromagnetic Ising model.
[LG-73] LD-EnSF: Synergizing Latent Dynamics with Ensemble Score Filters for Fast Data Assimilation with Sparse Observations
链接: https://arxiv.org/abs/2411.19305
作者: Pengpeng Xiao,Phillip Si,Peng Chen
关键词-EN: Ensemble Score Filter, Data assimilation, Latent Ensemble Score, Data assimilation techniques, modeling complex physical
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
点击查看摘要
Abstract:Data assimilation techniques are crucial for correcting the trajectory when modeling complex physical systems. A recently developed data assimilation method, Latent Ensemble Score Filter (Latent-EnSF), has shown great promise in addressing the key limitation of EnSF for highly sparse observations in high-dimensional and nonlinear data assimilation problems. It performs data assimilation in a latent space for encoded states and observations in every assimilation step, and requires costly full dynamics to be evolved in the original space. In this paper, we introduce Latent Dynamics EnSF (LD-EnSF), a novel methodology that completely avoids the full dynamics evolution and significantly accelerates the data assimilation process, which is especially valuable for complex dynamical problems that require fast data assimilation in real time. To accomplish this, we introduce a novel variant of Latent Dynamics Networks (LDNets) to effectively capture and preserve the system’s dynamics within a very low-dimensional latent space. Additionally, we propose a new method for encoding sparse observations into the latent space using Long Short-Term Memory (LSTM) networks, which leverage not only the current step’s observations, as in Latent-EnSF, but also all previous steps, thereby improving the accuracy and robustness of the observation encoding. We demonstrate the robustness, accuracy, and efficiency of the proposed method for two challenging dynamical systems with highly sparse (in both space and time) and noisy observations.
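A minimal PyTorch sketch of the observation-encoding idea: an LSTM consumes the full history of sparse observations and outputs a latent-state estimate, in contrast to encoders that use only the current step. Dimensions and architecture are placeholders, not the LD-EnSF implementation.

```python
import torch
import torch.nn as nn

class SparseObsEncoder(nn.Module):
    """LSTM encoder: map the full history of sparse observations to a latent-state estimate."""
    def __init__(self, obs_dim, latent_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, obs_seq):                  # obs_seq: (batch, time, obs_dim)
        out, _ = self.lstm(obs_seq)
        return self.head(out[:, -1])             # last hidden state has seen all previous steps

if __name__ == "__main__":
    enc = SparseObsEncoder(obs_dim=8, latent_dim=16)
    z = enc(torch.randn(4, 20, 8))               # 4 trajectories, 20 assimilation steps, 8 sensors
    print(z.shape)                               # torch.Size([4, 16])
```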
[LG-74] The role of data-induced randomness in quantum machine learning classification tasks
链接: https://arxiv.org/abs/2411.19281
作者: Berta Casas,Xavier Bonet-Monroig,Adrián Pérez-Salinas
关键词-EN: Quantum machine learning, Quantum machine, classical machine learning, machine learning models, machine learning
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 23 pages, 6 figures
点击查看摘要
Abstract:Quantum machine learning (QML) has surged as a prominent area of research with the objective to go beyond the capabilities of classical machine learning models. A critical aspect of any learning task is the process of data embedding, which directly impacts model performance. Poorly designed data-embedding strategies can significantly impact the success of a learning task. Despite its importance, rigorous analyses of data-embedding effects are limited, leaving many cases without effective assessment methods. In this work, we introduce a metric for binary classification tasks, the class margin, by merging the concepts of average randomness and classification margin. This metric analytically connects data-induced randomness with classification accuracy for a given data-embedding map. We benchmark a range of data-embedding strategies through class margin, demonstrating that data-induced randomness imposes a limit on classification performance. We expect this work to provide a new approach to evaluate QML models by their data-embedding processes, addressing gaps left by existing analytical tools.
[LG-75] Quantum feedback control with a transformer neural network architecture
链接: https://arxiv.org/abs/2411.19253
作者: Pranav Vaidhyanathan,Florian Marquardt,Mark T. Mitchison,Natalia Ares
关键词-EN: natural language processing, Attention-based neural networks, Attention-based neural, language processing, revolutionized various fields
类目: Quantum Physics (quant-ph); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
*备注: 9 pages, 4 figures
点击查看摘要
Abstract:Attention-based neural networks such as transformers have revolutionized various fields such as natural language processing, genomics, and vision. Here, we demonstrate the use of transformers for quantum feedback control through a supervised learning approach. In particular, due to the transformer’s ability to capture long-range temporal correlations and training efficiency, we show that it can surpass some of the limitations of previous control approaches, e.g.~those based on recurrent neural networks trained using a similar approach or reinforcement learning. We numerically show, for the example of state stabilization of a two-level system, that our bespoke transformer architecture can achieve unit fidelity to a target state in a short time even in the presence of inefficient measurement and Hamiltonian perturbations that were not included in the training set. We also demonstrate that this approach generalizes well to the control of non-Markovian systems. Our approach can be used for quantum error correction, fast control of quantum states in the presence of colored noise, as well as real-time tuning, and characterization of quantum devices.
[LG-76] ABROCA Distributions For Algorithmic Bias Assessment: Considerations Around Interpretation
链接: https://arxiv.org/abs/2411.19090
作者: Conrad Borchers,Ryan S. Baker
关键词-EN: Algorithmic bias continues, ROC curves, Algorithmic bias, learning analytics, key concern
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted to Learning Analytics and Knowledge (LAK 2025)
点击查看摘要
Abstract:Algorithmic bias continues to be a key concern of learning analytics. We study the statistical properties of the Absolute Between-ROC Area (ABROCA) metric. This fairness measure quantifies group-level differences in classifier performance through the absolute difference in ROC curves. ABROCA is particularly useful for detecting nuanced performance differences even when overall Area Under the ROC Curve (AUC) values are similar. We sample ABROCA under various conditions, including varying AUC differences and class distributions. We find that ABROCA distributions exhibit high skewness dependent on sample sizes, AUC differences, and class imbalance. When assessing whether a classifier is biased, this skewness inflates ABROCA values by chance, even when data is drawn (by simulation) from populations with equivalent ROC curves. These findings suggest that ABROCA requires careful interpretation given its distributional properties, especially when used to assess the degree of bias and when classes are imbalanced.
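A short sketch of how ABROCA can be computed from classifier scores for two groups: interpolate each group's ROC curve onto a common false-positive-rate grid and integrate the absolute difference. The simulated data and grid size are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def abroca(y_true, y_score, group):
    """Absolute Between-ROC Area between two groups' ROC curves."""
    grid = np.linspace(0.0, 1.0, 1001)
    tprs = []
    for g in np.unique(group):
        m = group == g
        fpr, tpr, _ = roc_curve(y_true[m], y_score[m])
        tprs.append(np.interp(grid, fpr, tpr))
    # Uniform grid on [0, 1], so the mean absolute gap approximates the integral.
    return np.abs(tprs[0] - tprs[1]).mean()

rng = np.random.default_rng(0)
n = 4000
group = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
# Scores are slightly less informative for group 1 (simulated bias).
sep = np.where(group == 0, 1.5, 1.0)
score = y * sep + rng.normal(size=n)
print(abroca(y, score, group))
```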
[LG-77] Intrinsic Wrapped Gaussian Process Regression Modeling for Manifold-valued Response Variable
链接: https://arxiv.org/abs/2411.18989
作者: Zhanfeng Wang,Xinyu Li,Jian Qing Shi
关键词-EN: Gaussian process regression, wrapped Gaussian process, response variable measured, intrinsic wrapped Gaussian, process regression model
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this paper, we propose a novel intrinsic wrapped Gaussian process regression model for response variable measured on Riemannian manifold. We apply the parallel transport operator to define an intrinsic covariance structure addressing a critical aspect of constructing a well defined Gaussian process regression model. We show that the posterior distribution of regression function is invariant to the choice of orthonormal frames for the coordinate representations of the covariance function. This method can be applied to data situated not only on Euclidean submanifolds but also on manifolds without a natural ambient space. The asymptotic properties for estimating the posterior distribution is established. Numerical studies, including simulation and real-world examples, indicate that the proposed method delivers strong performance.
[LG-78] MSEMG: Surface Electromyography Denoising with a Mamba-based Efficient Network ICASSP
链接: https://arxiv.org/abs/2411.18902
作者: Yu-Tung Liu,Kuan-Chen Wang,Rong Chao,Sabato Marco Siniscalchi,Ping-Cheng Yeh,Yu Tsao
关键词-EN: Surface electromyography, contaminated by electrocardiogram, monitored muscle, muscle is closed, Mamba State Space
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper is under review of 2025 ICASSP
点击查看摘要
Abstract:Surface electromyography (sEMG) recordings can be contaminated by electrocardiogram (ECG) signals when the monitored muscle is close to the heart. Traditional signal-processing-based approaches, such as high-pass filtering and template subtraction, have been used to remove ECG interference but are often limited in their effectiveness. Recently, neural-network-based methods have shown greater promise for sEMG denoising, but they still struggle to balance both efficiency and effectiveness. In this study, we introduce MSEMG, a novel system that integrates the Mamba State Space Model with a convolutional neural network to serve as a lightweight sEMG denoising model. We evaluated MSEMG using sEMG data from the Non-Invasive Adaptive Prosthetics database and ECG signals from the MIT-BIH Normal Sinus Rhythm Database. The results show that MSEMG outperforms existing methods, generating higher-quality sEMG signals with fewer parameters. The source code for MSEMG is available at this https URL.
[LG-79] Graph Max Shift: A Hill-Climbing Method for Graph Clustering
链接: https://arxiv.org/abs/2411.18794
作者: Ery Arias-Castro,Elizabeth Coda,Wanli Qiao
关键词-EN: gradient ascent methods, ascent methods previously, methods previously proposed, points in space, analogous with gradient
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present a method for graph clustering that is analogous with gradient ascent methods previously proposed for clustering points in space. We show that, when applied to a random geometric graph with data iid from some density with Morse regularity, the method is asymptotically consistent. Here, consistency is understood with respect to a density-level clustering defined by the partition of the support of the density induced by the basins of attraction of the density modes.
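A hill-climbing sketch on a random geometric graph in the spirit described above, using node degree as a proxy for local density: each node points to its strictly higher-degree neighbor of maximal degree, pointers are followed to a fixed point, and nodes sharing a fixed point form one cluster. The density proxy and tie handling are assumptions for illustration, not the paper's exact rule.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def graph_max_shift(X, r):
    """Cluster a random geometric graph by hill climbing on node degree (a density proxy)."""
    D = squareform(pdist(X))
    A = (D > 0) & (D <= r)
    deg = A.sum(axis=1)

    # Each node points to its highest-degree neighbor, but only if that degree is strictly larger.
    parent = np.arange(len(X))
    for i in range(len(X)):
        nbrs = np.flatnonzero(A[i])
        if nbrs.size and deg[nbrs].max() > deg[i]:
            parent[i] = nbrs[np.argmax(deg[nbrs])]

    # Follow pointers until every node reaches a fixed point (a local mode of the degree).
    labels = parent.copy()
    while not np.array_equal(labels, labels[labels]):
        labels = labels[labels]

    _, labels = np.unique(labels, return_inverse=True)   # relabel modes as 0..k-1
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0], 0.3, (150, 2)), rng.normal([3, 0], 0.3, (150, 2))])
    print(np.bincount(graph_max_shift(X, r=0.5)))
```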
[LG-80] A quantum inspired predictor of Parkinsons disease built on a diverse multimodal dataset
链接: https://arxiv.org/abs/2411.18640
作者: Diya Vatsavai,Anya Iyer,Ashwin A. Nair
关键词-EN: neurodegenerative disorder globally, fastest growing neurodegenerative, growing neurodegenerative disorder, disorder globally, fastest growing
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 20 pages, 3 figures, 1 table
点击查看摘要
Abstract:Parkinson's disease, the fastest growing neurodegenerative disorder globally, has seen a 50 percent increase in cases within just two years. As speech, memory, and motor symptoms worsen over time, early diagnosis is crucial for preserving patients' quality of life. While machine-learning-based detection has shown promise, relying on a single feature for classification can be error-prone due to the variability of symptoms between patients. To address this limitation, we utilized the mPower database, which includes 150,000 samples across four key biomarkers: voice, gait, tapping, and demographic data. From these measurements, we extracted 64 features and trained a baseline Random Forest model to select the features above the 80th percentile. For classification, we designed a simulatable quantum support vector machine (qSVM) that detects high-dimensional patterns, leveraging recent advancements in quantum machine learning. With a novel, simulatable architecture that can be run on standard hardware rather than resource-intensive quantum computers, our model achieves an accuracy of 90 percent and an AUC of 0.98, surpassing benchmark models. By utilizing an innovative classification framework built on a diverse set of features, our model offers a pathway for accessible global Parkinson's screening.
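A classical sketch of the feature-selection step the abstract describes: Random Forest importances, keep features at or above the 80th percentile, then fit an SVM (an ordinary RBF SVC here as a stand-in for the paper's simulatable quantum SVM) on synthetic placeholder data rather than the mPower biomarkers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 2000, 64                               # 64 features, as in the abstract
X = rng.normal(size=(n, d))
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)   # toy labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: Random Forest importances; keep features at or above the 80th percentile.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
keep = rf.feature_importances_ >= np.percentile(rf.feature_importances_, 80)

# Step 2: SVM on the selected features (classical stand-in for the qSVM).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_tr[:, keep], y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te[:, keep])[:, 1]))
```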
[LG-81] Topological Approach for Data Assimilation
链接: https://arxiv.org/abs/2411.18627
作者: Max M. Chumley,Firas A. Khasawneh
关键词-EN: high fidelity physics, fidelity physics based, difficult or impossible, high fidelity, fidelity physics
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注: 16 pages, 13 figures
点击查看摘要
Abstract:Many dynamical systems are difficult or impossible to model using high-fidelity physics-based models. Consequently, researchers are relying more on data-driven models to make predictions and forecasts. Based on limited training data, machine learning models often deviate from the true system states over time and need to be continually updated as new measurements are taken using data assimilation. Classical data assimilation algorithms typically require knowledge of the measurement noise statistics, which may be unknown. In this paper, we introduce a new data assimilation algorithm with a foundation in topological data analysis. By leveraging the differentiability of functions of persistence, gradient descent optimization is used to minimize topological differences between measurements and forecast predictions by tuning data-driven model coefficients without using noise information from the measurements. We describe the method and focus on its capabilities and performance using the chaotic Lorenz system as an example.
信息检索
[IR-0] Cross-Domain Recommendation Meets Large Language Models
链接: https://arxiv.org/abs/2411.19862
作者: Ajay Krishna Vajjala,Dipak Meher,Ziwei Zhu,David S. Rosenblum
关键词-EN: cold-start problem, faced by single-domain, promising solution, single-domain recommender systems, CDR
类目: Information Retrieval (cs.IR)
*备注: 12 pages
点击查看摘要
Abstract:Cross-domain recommendation (CDR) has emerged as a promising solution to the cold-start problem, faced by single-domain recommender systems. However, existing CDR models rely on complex neural architectures, large datasets, and significant computational resources, making them less effective in data-scarce scenarios or when simplicity is crucial. In this work, we leverage the reasoning capabilities of large language models (LLMs) and explore their performance in the CDR domain across multiple domain pairs. We introduce two novel prompt designs tailored for CDR and demonstrate that LLMs, when prompted effectively, outperform state-of-the-art CDR baselines across various metrics and domain combinations in the rating prediction and ranking tasks. This work bridges the gap between LLMs and recommendation systems, showcasing their potential as effective cross-domain recommenders.
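One plausible way to phrase a cross-domain rating-prediction prompt of the kind the paper studies; the template wording and the downstream LLM call are hypothetical, not the authors' prompt designs.

```python
PROMPT_TEMPLATE = """You are a recommender system.
A user rated these items in the source domain ({source_domain}):
{source_history}

Predict, on a 1-5 scale, how the user would rate this item
in the target domain ({target_domain}): "{target_item}".
Answer with a single number."""

def build_cdr_prompt(source_domain, source_history, target_domain, target_item):
    """Fill the cross-domain rating-prediction template with the user's source-domain history."""
    history = "\n".join(f"- {title}: {rating}/5" for title, rating in source_history)
    return PROMPT_TEMPLATE.format(
        source_domain=source_domain,
        source_history=history,
        target_domain=target_domain,
        target_item=target_item,
    )

if __name__ == "__main__":
    prompt = build_cdr_prompt(
        source_domain="Books",
        source_history=[("Dune", 5), ("The Hobbit", 4), ("Foundation", 5)],
        target_domain="Movies",
        target_item="Blade Runner 2049",
    )
    print(prompt)   # pass this to an LLM client of choice (hypothetical call_llm(prompt))
```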
[IR-1] A Review of LLM -based Explanations in Recommender Systems
链接: https://arxiv.org/abs/2411.19576
作者: Alan Said
关键词-EN: Large Language Models, Language Models, Large Language, rise of Large, improved explainability
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:The rise of Large Language Models (LLMs), such as LLaMA and ChatGPT, has opened new opportunities for enhancing recommender systems through improved explainability. This paper provides a systematic literature review focused on leveraging LLMs to generate explanations for recommendations – a critical aspect for fostering transparency and user trust. We conducted a comprehensive search within the ACM Guide to Computing Literature, covering publications from the launch of ChatGPT (November 2022) to the present (November 2024). Our search yielded 232 articles, but after applying inclusion criteria, only six were identified as directly addressing the use of LLMs in explaining recommendations. This scarcity highlights that, despite the rise of LLMs, their application in explainable recommender systems is still in an early stage. We analyze these select studies to understand current methodologies, identify challenges, and suggest directions for future research. Our findings underscore the potential of LLMs improving explanations of recommender systems and encourage the development of more transparent and user-centric recommendation explanation solutions.
[IR-2] Zero-Indexing Internet Search Augmented Generation for Large Language Models
链接: https://arxiv.org/abs/2411.19478
作者: Guangxin He,Zonghong Dai,Jiangcheng Zhu,Binqiang Zhao,Chenyue Li,You Peng,Chen Wang,Binhang Yuan
关键词-EN: enhance large language, large language model, language model performance, enhance large, large language
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Retrieval augmented generation has emerged as an effective method to enhance large language model performance. This approach typically relies on an internal retrieval module that uses various indexing mechanisms to manage a static pre-processed corpus. However, such a paradigm often falls short when it is necessary to integrate the most up-to-date information that has not been updated into the corpus during generative inference time. In this paper, we explore an alternative approach that leverages standard search engine APIs to dynamically integrate the latest online information (without maintaining any index for any fixed corpus), thereby improving the quality of generated content. We design a collaborative LLM-based paradigm, where we include: (i) a parser-LLM that determines if the Internet augmented generation is demanded and extracts the search keywords if so with a single inference; (ii) a mixed ranking strategy that re-ranks the retrieved HTML files to eliminate bias introduced from the search engine API; and (iii) an extractor-LLM that can accurately and efficiently extract relevant information from the fresh content in each HTML file. We conduct extensive empirical studies to evaluate the performance of this Internet search augmented generation paradigm. The experimental results demonstrate that our method generates content with significantly improved quality. Our system has been successfully deployed in a production environment to serve this http URL’s generative inference requests.
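A skeleton of the three-role pipeline the abstract outlines (parser-LLM, mixed re-ranking, extractor-LLM). Every function body here is a hypothetical stub; the production system's prompts, ranking rule, and search API are not specified in this listing.

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    url: str
    html: str
    engine_rank: int

def parser_llm(query: str) -> list[str] | None:
    """Decide whether Internet augmentation is needed; if so, return search keywords (stub)."""
    needs_fresh_info = "latest" in query or "today" in query
    return query.split() if needs_fresh_info else None

def search_api(keywords: list[str]) -> list[SearchResult]:
    """Call a standard search-engine API (stub returning canned results)."""
    return [SearchResult(url=f"https://example.com/{i}", html="<html>...</html>", engine_rank=i)
            for i in range(5)]

def mixed_rerank(results: list[SearchResult], keywords: list[str]) -> list[SearchResult]:
    """Re-rank to reduce engine bias, e.g. mix engine rank with a keyword-overlap score (stub)."""
    def score(r):
        overlap = sum(k.lower() in r.html.lower() for k in keywords)
        return 0.5 * (-r.engine_rank) + 0.5 * overlap
    return sorted(results, key=score, reverse=True)

def extractor_llm(html: str, query: str) -> str:
    """Extract query-relevant snippets from a fresh HTML page (stub)."""
    return html[:200]

def answer(query: str) -> str:
    keywords = parser_llm(query)
    if keywords is None:
        return f"[generate directly]: {query}"
    docs = mixed_rerank(search_api(keywords), keywords)
    context = "\n".join(extractor_llm(d.html, query) for d in docs[:3])
    return f"[generate with retrieved context]\n{context}"

if __name__ == "__main__":
    print(answer("What is the latest news on LLM inference?"))
```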
[IR-3] Parallel and Mini-Batch Stable Matching for Large-Scale Reciprocal Recommender Systems
链接: https://arxiv.org/abs/2411.19214
作者: Kento Nakada,Kazuki Kawamura,Ryosuke Furukawa
关键词-EN: Reciprocal recommender systems, online two-sided matching, two-sided matching platforms, online two-sided, online job
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Reciprocal recommender systems (RRSs) are crucial in online two-sided matching platforms, such as online job or dating markets, as they need to consider the preferences of both sides of the match. The concentration of recommendations to a subset of users on these platforms undermines their match opportunities and reduces the total number of matches. To maximize the total number of expected matches among market participants, stable matching theory with transferable utility has been applied to RRSs. However, computational complexity and memory efficiency quadratically increase with the number of users, making it difficult to implement stable matching algorithms for several users. In this study, we propose novel methods using parallel and mini-batch computations for reciprocal recommendation models to improve the computational time and space efficiency of the optimization process for stable matching. Experiments on both real and synthetic data confirmed that our stable matching theory-based RRS increased the computation speed and enabled tractable large-scale data processing of up to one million samples with a single graphics processing unit graphics board, without losing the match count.
[IR-4] Headache to Overstock? Promoting Long-tail Items through Debiased Product Bundling
链接: https://arxiv.org/abs/2411.19107
作者: Shuo Xu,Haokai Ma,Yunshan Ma,Xiaohao Liu,Lei Meng,Xiangxu Meng,Tat-Seng Chua
关键词-EN: Product bundling aims, thematically related items, long-tail bundling scenario, long-tail product bundling, product bundling scenario
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Product bundling aims to organize a set of thematically related items into a combined bundle for shipment facilitation and item promotion. To increase the exposure of fresh or overstocked products, sellers typically bundle these items with popular products for inventory clearance. This specific task can be formulated as a long-tail product bundling scenario, which leverages the user-item interactions to define the popularity of each item. The inherent popularity bias in the pre-extracted user feedback features and the insufficient utilization of other popularity-independent knowledge may force the conventional bundling methods to find more popular items, thereby struggling with this long-tail bundling scenario. Through intuitive and empirical analysis, we navigate the core solution for this challenge, which is maximally mining the popularity-free features and effectively incorporating them into the bundling process. To achieve this, we propose a Distilled Modality-Oriented Knowledge Transfer framework (DieT) to effectively counter the popularity bias misintroduced by the user feedback features and adhere to the original intent behind the real-world bundling behaviors. Specifically, DieT first proposes the Popularity-free Collaborative Distribution Modeling module (PCD) to capture the popularity-independent information from the bundle-item view, which is proven most effective in the long-tail bundling scenario to enable the directional information transfer. With the tailored Unbiased Bundle-aware Knowledge Transferring module (UBT), DieT can highlight the significance of popularity-free features while mitigating the negative effects of user feedback features in the long-tail scenario via the knowledge distillation paradigm. Extensive experiments on two real-world datasets demonstrate the superiority of DieT over a list of SOTA methods in the long-tail bundling scenario.
[IR-5] Counterfactual Learning-Driven Representation Disentanglement for Search-Enhanced Recommendation
链接: https://arxiv.org/abs/2411.18631
作者: Jiajun Cui,Xu Chen,Shuai Xiao,Chen Ju,Jinsong Lan,Qingwen Liu,Wei Zhang
关键词-EN: activities provide additional, provide additional insights, enhancing personalized recommendation, search activities provide, internet platforms
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:For recommender systems in internet platforms, search activities provide additional insights into user interest through query-click interactions with items, and are thus widely used for enhancing personalized recommendation. However, these interacted items not only have transferable features matching users’ interest helpful for the recommendation domain, but also have features related to users’ unique intents in the search domain. Such domain gap of item features is neglected by most current search-enhanced recommendation methods. They directly incorporate these search behaviors into recommendation, and thus introduce partial negative transfer. To address this, we propose a Counterfactual learning-driven representation disentanglement framework for search-enhanced recommendation, based on the common belief that a user would click an item under a query not solely because of the item-query match but also due to the item’s query-independent general features (e.g., color or style) that interest the user. These general features exclude the reflection of search-specific intents contained in queries, ensuring a pure match to users’ underlying interest to complement recommendation. According to counterfactual thinking, how would user preferences and query match change for items if we removed their query-related features in search, we leverage search queries to construct counterfactual signals to disentangle item representations, isolating only query-independent general features. These representations subsequently enable feature augmentation and data augmentation for the recommendation scenario. Comprehensive experiments on real datasets demonstrate ClardRec is effective in both collaborative filtering and sequential recommendation scenarios.
附件下载
点击下载今日全部论文列表