This blog post presents the latest paper list retrieved from arXiv.org on 2024-12-10. The list is updated automatically and is organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a regular schedule, please leave your email address in the comments.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2024-12-10)
A total of 763 papers were updated today, including:
- Natural Language Processing: 93 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 190 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 257 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 229 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models
[Quick Read]: This paper addresses the hallucination problem of large vision-language models (LVLMs), where the generated text does not accurately correspond to the input visual content. The key idea is to calibrate the model's responses via contrastive decoding, constructing new contrastive samples by altering the visual content (e.g., image downsampling and editing). By analyzing how different contrastive samples affect the model's output probability distributions (e.g., entropy and distribution distance), the paper proposes a simple yet effective method for fusing contrastive samples, so that contrastive decoding can be applied across diverse scenarios to mitigate hallucinations.
Link: https://arxiv.org/abs/2412.06775
Authors: Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu
Keywords (EN): shown impressive capabilities, generated text inaccurately, text inaccurately reflects, large vision-language models, generating plausible responses
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review. Project pages: this https URL
Abstract:While large vision-language models (LVLMs) have shown impressive capabilities in generating plausible responses correlated with input visual contents, they still suffer from hallucinations, where the generated text inaccurately reflects visual contents. To address this, recent approaches apply contrastive decoding to calibrate the model’s response via contrasting output distributions with original and visually distorted samples, demonstrating promising hallucination mitigation in a training-free manner. However, the potential of changing information in visual inputs is not well-explored, so a deeper investigation into the behaviors of visual contrastive decoding is of great interest. In this paper, we first explore various methods for contrastive decoding to change visual contents, including image downsampling and editing. Downsampling images reduces the detailed textual information while editing yields new contents in images, providing new aspects as visual contrastive samples. To further study benefits by using different contrastive samples, we analyze probability-level metrics, including entropy and distribution distance. Interestingly, the effect of these samples in mitigating hallucinations varies a lot across LVLMs and benchmarks. Based on our analysis, we propose a simple yet effective method to combine contrastive samples, offering a practical solution for applying contrastive decoding across various scenarios. Extensive experiments are conducted to validate the proposed fusion method among different benchmarks.
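As a rough illustration of the contrastive-decoding idea described above (not the authors' released code), the sketch below contrasts next-token logits computed from the original image with logits from a visually distorted (e.g., downsampled) image; the weight `alpha` and the function name are assumptions made only for this illustration.

```python
import torch

def visual_contrastive_decode(logits_original: torch.Tensor,
                              logits_distorted: torch.Tensor,
                              alpha: float = 1.0) -> torch.Tensor:
    """Combine next-token logits from the original and a visually distorted
    input, amplifying what the original image supports and the distorted
    image does not (a common contrastive-decoding form)."""
    return (1 + alpha) * logits_original - alpha * logits_distorted

# Toy usage with random logits over a 10-token vocabulary.
logits_orig = torch.randn(10)
logits_dist = torch.randn(10)
next_token = torch.argmax(visual_contrastive_decode(logits_orig, logits_dist))
print(next_token)
```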
[NLP-1] Training Large Language Models to Reason in a Continuous Latent Space
[Quick Read]: This paper addresses the limitations of making large language models (LLMs) reason in "language space": for complex reasoning tasks, conventional chain-of-thought (CoT) may lock onto a single deterministic path too early, making reasoning inefficient. The key is a new reasoning paradigm, Coconut (Chain of Continuous Thought), which uses the LLM's last hidden state as a continuous representation of the reasoning state (a "continuous thought") and feeds it back to the model directly as the next input embedding, so that reasoning proceeds in a continuous space. This allows the model to encode multiple possible next reasoning steps at once, enabling a breadth-first search (BFS) over reasoning paths, avoiding CoT's premature commitment to a single deterministic path, and markedly improving performance on logical reasoning tasks that require substantial backtracking.
Link: https://arxiv.org/abs/2412.06769
Authors: Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian
Keywords (EN): Large language models, Large language, reasoning, Continuous Thought, restricted to reason
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are restricted to reason in the “language space”, where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed “continuous thought”). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
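A minimal sketch, under my own simplifying assumptions rather than the paper's code, of the continuous-thought loop described in the abstract: instead of decoding the last hidden state into a token, it is appended back to the input as the next embedding for a few latent steps before normal decoding resumes. The `ToyLM` stand-in model is hypothetical.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Tiny stand-in for an LLM: embeds tokens, runs a GRU, returns hidden states."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, input_embeds):
        hidden, _ = self.rnn(input_embeds)
        return hidden  # (batch, seq, dim)

def coconut_style_rollout(model, prompt_ids, num_latent_steps=3):
    """Feed the last hidden state back as the next input embedding
    (a 'continuous thought') instead of decoding a token at each step."""
    embeds = model.embed(prompt_ids)                  # (1, T, dim)
    for _ in range(num_latent_steps):
        hidden = model(embeds)                        # run the model
        thought = hidden[:, -1:, :]                   # last hidden state
        embeds = torch.cat([embeds, thought], dim=1)  # append as next input
    # After the latent steps, decode a token normally from the final state.
    logits = model.lm_head(model(embeds)[:, -1, :])
    return logits.argmax(dim=-1)

model = ToyLM()
print(coconut_style_rollout(model, torch.tensor([[1, 2, 3]])))
```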
[NLP-2] Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
[Quick Read]: This paper addresses how to make safe and reliable language models refuse certain instructions or questions appropriately. The key idea is refusal tokens, which are prepended to the model's responses during training. At inference time, the probability of generating the refusal token for each category can be adjusted to flexibly steer the model's refusal behavior, without retraining or fine-tuning. This avoids the computational cost of training multiple models for different user preferences and allows the refusal rates of a single model to be tuned dynamically.
Link: https://arxiv.org/abs/2412.06748
Authors: Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein
Keywords (EN): reliable language models, refusal, key component, component of building, building safe
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 19 pages
Abstract:A key component of building safe and reliable language models is enabling the models to appropriately refuse to follow certain instructions or answer certain questions. We may want models to output refusal messages for various categories of user queries, for example, ill-posed questions, instructions for committing illegal acts, or queries which require information past the model’s knowledge horizon. Engineering models that refuse to answer such questions is complicated by the fact that an individual may want their model to exhibit varying levels of sensitivity for refusing queries of various categories, and different users may want different refusal rates. The current default approach involves training multiple models with varying proportions of refusal messages from each category to achieve the desired refusal rates, which is computationally expensive and may require training a new model to accommodate each user’s desired preference over refusal rates. To address these challenges, we propose refusal tokens, one such token for each refusal category or a single refusal token, which are prepended to the model’s responses during training. We then show how to increase or decrease the probability of generating the refusal token for each category during inference to steer the model’s refusal behavior. Refusal tokens enable controlling a single model’s refusal rates without the need of any further fine-tuning, but only by selectively intervening during generation.
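A hedged sketch of the kind of inference-time steering the abstract describes (my own illustration, not the authors' implementation): a bias is added to the logit of a hypothetical [REFUSE] token before sampling the first response token, raising or lowering the chance that the response starts with a refusal.

```python
import torch

def steer_refusal(logits: torch.Tensor, refusal_token_id: int, bias: float) -> torch.Tensor:
    """Shift the logit of the refusal token: a positive bias makes refusals
    more likely, a negative bias makes them less likely."""
    steered = logits.clone()
    steered[refusal_token_id] += bias
    return steered

# Toy example over a 5-token vocabulary where id 4 stands for the [REFUSE] token.
logits = torch.tensor([1.0, 0.5, 0.2, 0.1, 0.3])
probs = torch.softmax(steer_refusal(logits, refusal_token_id=4, bias=2.0), dim=-1)
print(probs)
```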
[NLP-3] ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
[Quick Read]: This paper addresses the shortcomings of traditional fixed test sets for evaluating the open-ended capabilities of foundation models. The key is a new testing paradigm, ONEBench (OpeN-Ended Benchmarking), which consolidates individual evaluation datasets into a unified, ever-expanding sample pool from which users can generate custom, open-ended benchmarks targeting specific capabilities of interest. By aggregating samples across test sets, ONEBench can assess diverse capabilities beyond those covered by the original test sets while mitigating overfitting and dataset bias. Its core innovation is framing model evaluation as a collective process of selecting and aggregating sample-level tests, with algorithms that handle heterogeneity and incompleteness so that models can still be ranked accurately even when measurements are sparse.
Link: https://arxiv.org/abs/2412.06745
Authors: Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge
Keywords (EN): Traditional fixed test, Traditional fixed, sets fall short, fixed test sets, test sets fall
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. To address this, we propose ONEBench(OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests. The shift from task-specific benchmarks to ONEBench introduces two challenges: (1)heterogeneity and (2)incompleteness. Heterogeneity refers to the aggregation over diverse metrics, while incompleteness describes comparing models evaluated on different data subsets. To address these challenges, we explore algorithms to aggregate sparse measurements into reliable model scores. Our aggregation algorithm ensures identifiability(asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data. On homogenous datasets, we show our aggregation algorithm provides rankings that highly correlate with those produced by average scores. We also demonstrate robustness to ~95% of measurements missing, reducing evaluation cost by up to 20x with little-to-no change in model rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains. Overall, we present a technique for open-ended evaluation, which can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside the rapidly developing foundation models.
[NLP-4] JAPAGEN: Efficient Few/Zero-shot Learning via Japanese Training Dataset Generation with LLM ACL
[Quick Read]: This paper asks whether large language models (LLMs) can serve as proficient generators of supervised training data for tasks in languages other than English. The key is JAPAGEN, which uses LLMs to synthesize supervised training data under few-shot and zero-shot learning scenarios for six diverse Japanese downstream tasks and then trains compact models (e.g., BERT) on the synthesized data. Experimental results show that JAPAGEN achieves robust performance on classification tasks that require formal text inputs, with results competitive with conventional LLM prompting strategies.
Link: https://arxiv.org/abs/2412.06738
Authors: Takuro Fujii, Satoru Katsumata
Keywords (EN): Large Language Models, enhanced inference efficiency, potential of Large, Large Language, offering advantages
Subjects: Computation and Language (cs.CL)
Comments: Accepted by PACLIC38 (2024)
Abstract:Recently some studies have highlighted the potential of Large Language Models (LLMs) as effective generators of supervised training data, offering advantages such as enhanced inference efficiency and reduced costs associated with data collection. However, these studies have predominantly focused on English language tasks. In this paper, we address the fundamental research question: Can LLMs serve as proficient training data generators for other language tasks? Specifically, we leverage LLMs to synthesize supervised training data under few-shot and zero-shot learning scenarios across six diverse Japanese downstream tasks. Subsequently, we utilize this synthesized data to train compact models (e.g., BERT). This novel methodology is termed JAPAGEN. Our experimental findings underscore that JAPAGEN achieves robust performance in classification tasks that necessitate formal text inputs, demonstrating competitive results compared to conventional LLM prompting strategies.
[NLP-5] AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
[Quick Read]: This paper studies how to use large language models (LLMs) to automatically generate data-cleaning workflows. The key is an LLM-based automatic data-cleaning pipeline (AutoDCWorkflow) that prompts LLMs to perform cleaning operations for three types of data-quality issues: duplicates, missing values, and inconsistent data formats. The pipeline involves three main LLM-driven components: (1) Select Target Columns, which identifies the columns relevant to the purpose; (2) Inspect Column Quality, which assesses each column's data quality and produces a data-quality report as operation objectives; and (3) Generate Operation Arguments, which predicts the next operation and its arguments from the report. The paper also proposes a data-cleaning benchmark for evaluating how well LLM agents generate workflows for cleaning tasks of varying difficulty. Experiments show that LLMs can plan and generate data-cleaning workflows effectively without fine-tuning.
Link: https://arxiv.org/abs/2412.06724
Authors: Lan Li, Liri Fang, Vetle I. Torvik
Keywords (EN): large language models, Data Cleaning, Data Cleaning Workflow, Data, data quality
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Comments:
Abstract:We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs’ ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), this pipeline generates a minimal, clean table sufficient to address the purpose and the data cleaning workflow used to produce the table. The planning process involves three main LLM-driven components: (1) Select Target Columns: Identifies a set of target columns related to the purpose. (2) Inspect Column Quality: Assesses the data quality for each target column and generates a Data Quality Report as operation objectives. (3) Generate Operation Arguments: Predicts the next operation and arguments based on the data quality report results. Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels. The benchmark comprises the annotated datasets as a collection of purpose, raw table, clean table, data cleaning workflow, and answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows. The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.
[NLP-6] VP-MEL: Visual Prompts Guided Multimodal Entity Linking
[Quick Read]: This paper tackles the over-reliance of multimodal entity linking (MEL) methods on mention words, in particular their inability to exploit image-text pairs for entity linking when mention words are absent. The key is the Visual Prompts guided Multimodal Entity Linking (VP-MEL) task, which marks specific regions within an image (visual prompts) and, without mention words, uses the marked image-text pairs to align visual prompts with specific entities in knowledge bases. The paper also introduces a new dataset for the task, VPWiki, and a framework named FBMEL that enhances the significance of visual prompts and fully leverages the information in image-text pairs, substantially improving performance on the VP-MEL task.
Link: https://arxiv.org/abs/2412.06720
Authors: Hongze Mi, Jinyuan Li, Xuying Zhang, Haoran Cheng, Jiahao Wang, Di Sun, Gang Pan
Keywords (EN): Multimodal Entity Linking, Entity Linking, Multimodal Entity, MEL, mention words
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Multimodal Entity Linking (MEL) is extensively utilized in the domains of information retrieval. However, existing MEL methods typically utilize mention words as mentions for retrieval. This results in a significant dependence of MEL on mention words, thereby constraining its capacity to effectively leverage information from both images and text. In situations where mention words are absent, MEL methods struggle to leverage image-text pairs for entity linking. To solve these issues, we introduce a Visual Prompts guided Multimodal Entity Linking (VP-MEL) task. VP-MEL directly marks specific regions within the image. These markers are referred to as visual prompts in VP-MEL. Without mention words, VP-MEL aims to utilize marked image-text pairs to align visual prompts with specific entities in the knowledge bases. A new dataset for the VP-MEL task, VPWiki, is proposed in this paper. Moreover, we propose a framework named FBMEL, which enhances the significance of visual prompts and fully leverages the information in image-text pairs. Experimental results on the VPWiki dataset demonstrate that FBMEL outperforms baseline methods across multiple benchmarks for the VP-MEL task.
[NLP-7] How to Merge Your Multimodal Models Over Time?
[Quick Read]: This paper addresses how to progressively integrate the knowledge of expert models into a single, more capable model as new tasks and domains emerge over time in real-world applications. The key is TIME (Temporal Integration of Model Expertise), a unified framework that defines temporal model merging along three axes: (1) the Initialization Phase, (2) the Deployment Phase, and (3) the Merging Technique. Through extensive experiments on the FoMo-in-Flux benchmark, TIME uncovers the key challenges and best practices of temporal model merging, offering a deeper understanding of how to merge models effectively over time.
Link: https://arxiv.org/abs/2412.06712
Authors: Sebastian Dziadzio, Vishaal Udandarao, Karsten Roth, Ameya Prabhu, Zeynep Akata, Samuel Albanie, Matthias Bethge
Keywords (EN): temporal model merging, Model merging, Model, merging combines multiple, combines multiple expert
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. Code at this https URL
Abstract:Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME - Temporal Integration of Model Expertise - which defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to uncover key insights for temporal model merging, offering a better understanding of current challenges and best practices for effective temporal model merging.
[NLP-8] OmniEvalKit: A Modular Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions
[Quick Read]: This paper addresses the limitation that existing benchmarks typically evaluate large language models (LLMs) along a single aspect. The key is OmniEvalKit, a modular, lightweight, and automated evaluation toolbox that comprehensively evaluates LLMs and their omni-extensions across multilingual, multidomain, and multimodal capabilities. Built on a modular architecture comprising a Static Builder and a Dynamic Data Flow, OmniEvalKit supports over 100 LLMs and 50 evaluation datasets, covering thousands of model-dataset combinations and enabling comprehensive, flexible, and efficient evaluation.
Link: https://arxiv.org/abs/2412.06693
Authors: Yi-Kai Zhang, Xu-Xiang Zhong, Shiyin Lu, Qing-Guo Chen, De-Chuan Zhan, Han-Jia Ye
Keywords (EN): Large Language Models, Large Language, advancements in Large, Language Models, rapid advancements
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
Abstract:The rapid advancements in Large Language Models (LLMs) have significantly expanded their applications, ranging from multilingual support to domain-specific tasks and multimodal integration. In this paper, we present OmniEvalKit, a novel benchmarking toolbox designed to evaluate LLMs and their omni-extensions across multilingual, multidomain, and multimodal capabilities. Unlike existing benchmarks that often focus on a single aspect, OmniEvalKit provides a modular, lightweight, and automated evaluation system. It is structured with a modular architecture comprising a Static Builder and Dynamic Data Flow, promoting the seamless integration of new models and datasets. OmniEvalKit supports over 100 LLMs and 50 evaluation datasets, covering comprehensive evaluations across thousands of model-dataset combinations. OmniEvalKit is dedicated to creating an ultra-lightweight and fast-deployable evaluation framework, making downstream applications more convenient and versatile for the AI community.
[NLP-9] I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token NEURIPS2024
[Quick Read]: This paper addresses the tendency of large language models to hallucinate, i.e., to emit inaccurate or factually incorrect text. The key is a novel calibration method that adds a special [IDK] ("I don't know") token to the model's vocabulary and introduces an objective function that shifts probability mass to the [IDK] token for incorrect predictions. This lets the model express uncertainty in its output explicitly when it is unsure, reducing erroneous outputs while suffering only a small loss of encoded knowledge.
Link: https://arxiv.org/abs/2412.06676
Authors: Roi Cohen, Konstantin Dobler, Eden Biran, Gerard de Melo
Keywords (EN): Large Language Models, Large Language, capture real-world knowledge, Language Models, capture real-world
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Published at NeurIPS 2024
Abstract:Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we propose a novel calibration method that can be used to combat hallucinations. We add a special [IDK] (“I don’t know”) token to the model’s vocabulary and introduce an objective function that shifts probability mass to the [IDK] token for incorrect predictions. This approach allows the model to express uncertainty in its output explicitly. We evaluate our proposed method across multiple model architectures and factual downstream tasks. We find that models trained with our method are able to express uncertainty in places where they would previously make mistakes while suffering only a small loss of encoded knowledge. We further perform extensive ablation studies of multiple variations of our approach and provide a detailed analysis of the precision-recall tradeoff of our method.
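A rough sketch of the kind of objective described above (my own reconstruction, not the paper's exact loss): when the model's argmax prediction is wrong, part of the target probability mass is moved onto a hypothetical [IDK] token; otherwise standard cross-entropy is used. The `idk_mass` parameter is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def idk_loss(logits, targets, idk_token_id, idk_mass=0.5):
    """Cross-entropy against a soft target that shifts `idk_mass` of the
    probability onto the [IDK] token wherever the prediction is wrong."""
    vocab = logits.size(-1)
    soft = F.one_hot(targets, vocab).float()
    wrong = logits.argmax(dim=-1) != targets       # which positions are mispredicted
    soft[wrong] *= (1.0 - idk_mass)                # keep some mass on the gold token
    soft[wrong, idk_token_id] += idk_mass          # move the rest onto [IDK]
    return -(soft * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# Toy usage: 4 positions over a 10-token vocabulary, [IDK] is token id 9.
logits = torch.randn(4, 10, requires_grad=True)
targets = torch.tensor([1, 2, 3, 4])
print(idk_loss(logits, targets, idk_token_id=9))
```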
[NLP-10] GEAR: A Simple GENERATE EMBED AVERAGE AND RANK Approach for Unsupervised Reverse Dictionary COLING2025
[Quick Read]: This paper addresses the Reverse Dictionary (RD) task: given a textual description or dictionary definition, retrieve the most relevant word or set of words. The key is a simple approach that combines large language models (LLMs) with embedding models; despite its simplicity, it outperforms supervised baselines on well-studied RD datasets while showing less over-fitting. The paper also runs experiments on different dictionaries and analyzes how style, register, and target audience affect the quality of RD systems, concluding that, on average, untuned embeddings alone fall far below an LLM-only baseline (though they are competitive on highly technical dictionaries) but are crucial for boosting performance in the combined method.
Link: https://arxiv.org/abs/2412.06654
Authors: Fatemah Almeman, Luis Espinosa-Anke
Keywords (EN): Reverse Dictionary, task of obtaining, textual description, dictionary definition, description or dictionary
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, Accepted at COLING 2025
Abstract:Reverse Dictionary (RD) is the task of obtaining the most relevant word or set of words given a textual description or dictionary definition. Effective RD methods have applications in accessibility, translation or writing support systems. Moreover, in NLP research we find RD to be used to benchmark text encoders at various granularities, as it often requires word, definition and sentence embeddings. In this paper, we propose a simple approach to RD that leverages LLMs in combination with embedding models. Despite its simplicity, this approach outperforms supervised baselines in well studied RD datasets, while also showing less over-fitting. We also conduct a number of experiments on different dictionaries and analyze how different styles, registers and target audiences impact the quality of RD systems. We conclude that, on average, untuned embeddings alone fare way below an LLM-only baseline (although they are competitive in highly technical dictionaries), but are crucial for boosting performance in combined methods.
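The method name suggests a generate-embed-average-rank pipeline; below is a hedged sketch under my own assumptions, where `generate_candidates` and `embed` are stand-ins for an LLM call and an embedding model rather than the authors' code.

```python
import numpy as np

def generate_candidates(definition: str) -> list[str]:
    """Stand-in for prompting an LLM: return candidate words for a definition."""
    return ["glossary", "lexicon", "dictionary"]

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model: deterministic pseudo-embedding per string."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def reverse_dictionary(definition: str, vocabulary: list[str], top_k: int = 3) -> list[str]:
    # GENERATE candidate words with the LLM, EMBED them, AVERAGE the embeddings,
    # then RANK vocabulary words by similarity to the averaged vector.
    candidates = generate_candidates(definition)
    query = np.mean([embed(w) for w in candidates], axis=0)
    scores = {w: float(np.dot(embed(w), query)) for w in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(reverse_dictionary("a reference book listing words and their meanings",
                         ["dictionary", "bicycle", "thesaurus", "cloud"]))
```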
[NLP-11] Copyright-Protected Language Generation via Adaptive Model Fusion
[Quick Read]: This paper addresses the risk that language models reproduce copyrighted material from their training data during generation. The key is Copyright-Protecting Model Fusion (CP-Fuse), a novel approach that combines, at inference time, models trained on disjoint sets of copyrighted material and adaptively aggregates their outputs to minimize the reproduction of copyrighted content. At its core is a balancing property that prevents the regurgitation of memorized data while preserving the quality of text and code generation. Thanks to its post-hoc nature, CP-Fuse can be seamlessly integrated with other protective measures and is robust against common training-data extraction techniques.
Link: https://arxiv.org/abs/2412.06619
Authors: Javier Abad, Konstantin Donhauser, Francesco Pinto, Fanny Yang
Keywords (EN): language models reproducing, risk of language, models reproducing copyrighted, reproducing copyrighted material, reproducing copyrighted
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 47 pages, 21 Figures. arXiv admin note: substantial text overlap with arXiv:2407.20105
Abstract:The risk of language models reproducing copyrighted material from their training data has led to the development of various protective measures. Among these, inference-time strategies that impose constraints via post-processing have shown promise in addressing the complexities of copyright regulation. However, they often incur prohibitive computational costs or suffer from performance trade-offs. To overcome these limitations, we introduce Copyright-Protecting Model Fusion (CP-Fuse), a novel approach that combines models trained on disjoint sets of copyrighted material during inference. In particular, CP-Fuse adaptively aggregates the model outputs to minimize the reproduction of copyrighted content, adhering to a crucial balancing property that prevents the regurgitation of memorized data. Through extensive experiments, we show that CP-Fuse significantly reduces the reproduction of protected material without compromising the quality of text and code generation. Moreover, its post-hoc nature allows seamless integration with other protective measures, further enhancing copyright safeguards. Lastly, we show that CP-Fuse is robust against common techniques for extracting training data.
[NLP-12] Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey
[Quick Read]: This survey addresses controllability in text-to-speech (TTS) generation, i.e., fine-grained control over attributes of synthesized speech such as emotion, prosody, timbre, and duration. The key is leveraging advances in deep learning, such as diffusion models and large language models, to enhance controllable TTS. The paper comprehensively reviews existing controllable TTS methods, from basic control techniques to approaches that use natural language prompts, providing a clear taxonomy and an overview of the state of research. It also summarizes datasets and evaluation metrics in detail and sheds light on applications and future directions, serving as a useful resource for both academic researchers and industry practitioners.
Link: https://arxiv.org/abs/2412.06602
Authors: Tianxin Xie, Yan Rong, Pengfei Zhang, Li Liu
Keywords (EN): generate natural-sounding human, natural-sounding human speech, controllable TTS, prominent research area, TTS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: A comprehensive survey on controllable TTS, 23 pages, 6 tables, 4 figures, 280 references
Abstract:Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural-sounding human speech from text. Recently, with the increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation. This includes fine-grained control over various attributes of synthesized speech such as emotion, prosody, timbre, and duration. Besides, advancements in deep learning, such as diffusion and large language models, have significantly enhanced controllable TTS over the past several years. In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research. We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, offering a comprehensive and clear taxonomy of existing methods. Additionally, we provide a detailed summary of datasets and evaluation metrics and shed some light on the applications and future directions of controllable TTS. To the best of our knowledge, this survey paper provides the first comprehensive review of emerging controllable TTS methods, which can serve as a beneficial resource for both academic researchers and industry practitioners.
[NLP-13] Anchoring Bias in Large Language Models: An Experimental Study
[Quick Read]: This paper examines cognitive biases in large language models (LLMs), in particular anchoring bias, where initial information disproportionately influences subsequent judgments. The key to mitigation is collecting hints from comprehensive angles so that the model does not anchor on a single piece of information; the experiments show that simple techniques such as Chain-of-Thought, Thoughts of Principles, Ignoring Anchor Hints, and Reflection are not sufficient on their own to mitigate anchoring bias.
Link: https://arxiv.org/abs/2412.06593
Authors: Jiaxu Lou
Keywords (EN): Large Language Models, Large Language, Language Models, comprehend human-like text, significantly advanced artificial
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) like GPT-4 and Gemini have significantly advanced artificial intelligence by enabling machines to generate and comprehend human-like text. Despite their impressive capabilities, LLMs are not immune to limitations, including various biases. While much research has explored demographic biases, the cognitive biases in LLMs have not been equally scrutinized. This study delves into anchoring bias, a cognitive bias where initial information disproportionately influences judgment. Utilizing an experimental dataset, we examine how anchoring bias manifests in LLMs and verify the effectiveness of various mitigation strategies. Our findings highlight the sensitivity of LLM responses to biased hints. At the same time, our experiments show that, to mitigate anchoring bias, one needs to collect hints from comprehensive angles to prevent the LLMs from being anchored to individual pieces of information, while simple algorithms such as Chain-of-Thought, Thoughts of Principles, Ignoring Anchor Hints, and Reflection are not sufficient.
[NLP-14] Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered Difficult and Noisy COLING2025
[Quick Read]: This paper addresses the fact that the classification accuracy of large language models (LLMs) has not yet universally surpassed that of smaller models on text classification, and proposes a data quality enhancement (DQE) method based on LLMs. The key is to select data with a greedy algorithm, dividing the dataset into sampled and unsampled subsets, and to fine-tune the LLM on the sampled data. The fine-tuned model is then used to predict the unsampled data, and incorrectly predicted examples are categorized as uncovered, difficult, or noisy data. This effectively improves the performance of LLMs on text classification and significantly improves training efficiency, saving nearly half of the training time, with state-of-the-art results on several open-source classification tasks.
Link: https://arxiv.org/abs/2412.06575
Authors: Min Zeng, Caiquan Liu, Shiqi Zhang, Li Xie, Chen Sang, Xiaoxin Chen, Xiaoxin Chen
Keywords (EN): attracted widespread attention, large language models, text classification, classification, recent years
Subjects: Computation and Language (cs.CL)
Comments: Accepted by COLING 2025 (main, long paper)
Abstract:In recent years, the use of large language models (LLMs) for text classification has attracted widespread attention. Despite this, the classification accuracy of LLMs has not yet universally surpassed that of smaller models. LLMs can enhance their performance in text classification through fine-tuning. However, existing data quality research based on LLMs is challenging to apply directly to solve text classification problems. To further improve the performance of LLMs in classification tasks, this paper proposes a data quality enhancement (DQE) method for text classification based on LLMs. This method starts by using a greedy algorithm to select data, dividing the dataset into sampled and unsampled subsets, and then performing fine-tuning of the LLMs using the sampled data. Subsequently, this model is used to predict the outcomes for the unsampled data, categorizing incorrectly predicted data into uncovered, difficult, and noisy data. Experimental results demonstrate that our method effectively enhances the performance of LLMs in text classification tasks and significantly improves training efficiency, saving nearly half of the training time. Our method has achieved state-of-the-art performance in several open-source classification tasks.
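The abstract does not spell out the greedy selection criterion; purely as a hedged illustration, the sketch below uses greedy farthest-point selection on sentence embeddings, one common way to pick a diverse subset for fine-tuning. The embedding matrix here is random toy data, not the paper's setup.

```python
import numpy as np

def greedy_diverse_subset(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy farthest-point selection: repeatedly add the example that is
    farthest from the already-selected set (one plausible diversity criterion)."""
    selected = [0]                                   # start from an arbitrary example
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())                     # farthest remaining point
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Toy usage: 100 random 16-dim "sentence embeddings", pick 10 diverse ones.
emb = np.random.default_rng(0).normal(size=(100, 16))
sampled_idx = greedy_diverse_subset(emb, k=10)
unsampled_idx = [i for i in range(len(emb)) if i not in set(sampled_idx)]
print(sampled_idx)
```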
[NLP-15] ProcessBench: Identifying Process Errors in Mathematical Reasoning
[Quick Read]: This paper addresses the fact that language models regularly make mistakes when solving math problems, making automated identification of erroneous reasoning steps increasingly important for scalable oversight. The key is ProcessBench, a benchmark for measuring the ability to identify erroneous steps in mathematical reasoning. It contains 3,400 test cases, mainly competition- and Olympiad-level math problems, each with a step-by-step solution whose error location is annotated by human experts. Models must identify the earliest step that contains an error or conclude that all steps are correct. Comparing process reward models (PRMs) with critic models, the paper observes that existing PRMs typically fail to generalize to more challenging problems, whereas critic models and a straightforwardly fine-tuned PRM identify erroneous steps better. The authors hope ProcessBench will foster research on reasoning-process assessment and pave the way toward scalable oversight of language models.
Link: https://arxiv.org/abs/2412.06559
Authors: Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Keywords (EN): regularly make mistakes, solving math problems, models regularly make, automated identification, math problems
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated the critique capability competitive with the proprietary model GPT-4o, despite that it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
[NLP-16] Understanding Factual Recall in Transformers via Associative Memories
[Quick Read]: This paper asks how shallow (one-layer) transformer models can achieve efficient storage of factual information. The key result is that shallow transformers can combine associative memories to reach near-optimal storage capacity. The paper first proves that the storage capacities of both linear and MLP associative memories scale linearly with parameter count, then introduces a synthetic factual recall task and proves that a transformer with a single layer of self-attention followed by an MLP achieves 100% accuracy whenever either the number of self-attention parameters or the number of MLP parameters scales linearly with the number of facts. An analysis of the gradient-flow trajectory of a simplified linear attention model further shows that the model exhibits sequential learning behavior during training.
Link: https://arxiv.org/abs/2412.06538
Authors: Eshaan Nichani, Jason D. Lee, Alberto Bietti
Keywords (EN): Large language models, Large language, perform factual recall, factual recall, factual recall task
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:
Abstract:Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task, and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices or the MLP as an associative memory to store the dataset of facts. We complement these expressivity results with an analysis of the gradient flow trajectory of a simplified linear attention model trained on our factual recall task, where we show that the model exhibits sequential learning behavior.
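As a minimal illustration of the linear associative memory mentioned in the abstract (a textbook construction, not the paper's transformer analysis): with orthonormal key vectors, a single weight matrix built from outer products stores the key-value pairs and recalls each value exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_facts = 64, 32

# Orthonormal keys (rows of a random orthogonal matrix) and random values.
keys, _ = np.linalg.qr(rng.normal(size=(d, d)))
keys = keys[:n_facts]                       # (n_facts, d), rows are orthonormal
values = rng.normal(size=(n_facts, d))

# Linear associative memory: W = sum_i v_i k_i^T, stored in one matrix.
W = values.T @ keys                         # (d, d)

# Recall: W @ k_i returns v_i exactly when the keys are orthonormal.
recalled = W @ keys[3]
print(np.allclose(recalled, values[3]))     # True
```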
[NLP-17] The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap
[Quick Read]: This position paper addresses the challenge that large language models (LLMs) can produce unreliable outputs, and proposes combining them with formal methods (FMs) to improve reliability and trustworthiness. The key is mutual enhancement: FMs contribute rigorous reasoning and certification techniques that help LLMs generate more reliable and formally certified outputs, while the advanced learning capabilities and adaptability of LLMs can significantly improve the usability, efficiency, and scalability of existing FM tools. The ultimate goal is to unify the two paradigms to develop trustworthy AI software systems, improving both trust and efficiency in software engineering practice and fostering intelligent FM tools capable of addressing complex real-world challenges.
Link: https://arxiv.org/abs/2412.06512
Authors: Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong
Keywords (EN): Large Language Models, profoundly influencing daily, influencing daily life, exceptional language understanding, Large Language
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 24 pages, 4 figures
Abstract:Large Language Models (LLMs) have emerged as a transformative AI paradigm, profoundly influencing daily life through their exceptional language understanding and contextual generation capabilities. Despite their remarkable performance, LLMs face a critical challenge: the propensity to produce unreliable outputs due to the inherent limitations of their learning-based nature. Formal methods (FMs), on the other hand, are a well-established computation paradigm that provides mathematically rigorous techniques for modeling, specifying, and verifying the correctness of systems. FMs have been extensively applied in mission-critical software engineering, embedded systems, and cybersecurity. However, the primary challenge impeding the deployment of FMs in real-world settings lies in their steep learning curves, the absence of user-friendly interfaces, and issues with efficiency and adaptability. This position paper outlines a roadmap for advancing the next generation of trustworthy AI systems by leveraging the mutual enhancement of LLMs and FMs. First, we illustrate how FMs, including reasoning and certification techniques, can help LLMs generate more reliable and formally certified outputs. Subsequently, we highlight how the advanced learning capabilities and adaptability of LLMs can significantly enhance the usability, efficiency, and scalability of existing FM tools. Finally, we show that unifying these two computation paradigms – integrating the flexibility and intelligence of LLMs with the rigorous reasoning abilities of FMs – has transformative potential for the development of trustworthy AI software systems. We acknowledge that this integration has the potential to enhance both the trustworthiness and efficiency of software engineering practices while fostering the development of intelligent FM tools capable of addressing complex yet real-world challenges.
[NLP-18] Small Languages Big Models: A Study of Continual Training on Languages of Norway
[Quick Read]: This paper addresses the shortage of training data for less widely spoken languages such as Norwegian, and even more so for truly low-resource languages like Sámi, when training large language models. The key is a novel three-stage continual training approach, combined with experiments mixing causal and masked language modeling to obtain more flexible models. Based on these findings, the authors train, evaluate, and openly release NorMistral-11B, a new large generative language model with 11.4 billion parameters for Norwegian Bokmål, Nynorsk, and Northern Sámi.
Link: https://arxiv.org/abs/2412.06484
Authors: David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov
Keywords (EN): requires vast amounts, models requires vast, widely spoken languages, amounts of data, posing a challenge
Subjects: Computation and Language (cs.CL)
Comments: pre-print, under review
Abstract:Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Sámi. To address this issue, we present a novel three-stage continual training approach. We also experiment with combining causal and masked language modeling to get more flexible models. Based on our findings, we train, evaluate, and openly release a new large generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
[NLP-19] SafeWorld: Geo-Diverse Safety Alignment NEURIPS2024
[Quick Read]: This paper addresses the difficulty, for large language models (LLMs), of ensuring that generated content is not only helpful but also culturally sensitive and legally compliant given the geo-diverse cultural and legal standards around the world. The key is the SafeWorld benchmark, which comprises 2,342 test queries grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races, together with a multi-dimensional automatic safety evaluation framework that assesses the contextual appropriateness, accuracy, and comprehensiveness of responses. To further align LLMs with geo-diverse safety standards, the authors synthesize helpful preference pairs for Direct Preference Optimization (DPO) alignment training, constructed to encourage appropriate behavior and precise references to relevant cultural norms and policies when necessary. The resulting SafeWorldLM outperforms all competing models, including GPT-4o, by a large margin on all three evaluation dimensions.
Link: https://arxiv.org/abs/2412.06483
Authors: Da Yin, Haoyi Qiu, Kung-Hsiang Huang, Kai-Wei Chang, Nanyun Peng
Keywords (EN): widely discussed topic, rapidly evolving field, Large Language Models, Language Models, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2024
Abstract:In the rapidly evolving field of Large Language Models (LLMs), ensuring safety is a crucial and widely discussed topic. However, existing works often overlook the geo-diversity of cultural and legal standards across the world. To demonstrate the challenges posed by geo-diverse safety standards, we introduce SafeWorld, a novel benchmark specifically designed to evaluate LLMs’ ability to generate responses that are not only helpful but also culturally sensitive and legally compliant across diverse global contexts. SafeWorld encompasses 2,342 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. On top of it, we propose a multi-dimensional automatic safety evaluation framework that assesses the contextual appropriateness, accuracy, and comprehensiveness of responses. Our evaluations reveal that current LLMs struggle to meet these criteria. To enhance LLMs’ alignment with geo-diverse safety standards, we synthesize helpful preference pairs for Direct Preference Optimization (DPO) alignment training. The preference pair construction aims to encourage LLMs to behave appropriately and provide precise references to relevant cultural norms and policies when necessary. Our trained SafeWorldLM outperforms all competing models, including GPT-4o on all three evaluation dimensions by a large margin. Global human evaluators also note a nearly 20% higher winning rate in helpfulness and harmfulness evaluation. Our code and data can be found here: this https URL.
[NLP-20] Gated Delta Networks: Improving Mamba2 with Delta Rule
[Quick Read]: This paper addresses the limited performance of linear Transformers on retrieval and long-context tasks. The key is the complementary combination of gating and the delta update rule (the gated delta rule), together with a parallel training algorithm optimized for modern hardware: gating enables rapid memory erasure while the delta rule enables precise, targeted memory updates. The resulting architecture, Gated DeltaNet, consistently surpasses existing models such as Mamba2 and DeltaNet across multiple benchmarks, and hybrid architectures that combine Gated DeltaNet layers with sliding-window attention or Mamba2 layers further improve training efficiency and task performance.
Link: https://arxiv.org/abs/2412.06464
Authors: Songlin Yang, Jan Kautz, Ali Hatamizadeh
Keywords (EN): Linear Transformers, standard Transformers, Transformers have gained, Transformers, efficient alternatives
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint
Abstract:Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
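A hedged, sequential (non-parallel) sketch of what a gated delta-rule state update can look like, based on the abstract's description of combining a decay gate with the delta rule; the exact parameterization in the paper may differ, and the alpha/beta values below are arbitrary.

```python
import torch

def gated_delta_step(S, k, v, q, alpha, beta):
    """One recurrent step of a gated delta rule on a matrix-valued state S.

    S : (d_v, d_k) associative state, k and q : (d_k,) key/query, v : (d_v,) value
    alpha in (0, 1): decay gate (erases old memory)
    beta  in (0, 1): write strength of the delta-rule update
    """
    k = k / k.norm()                                   # unit-norm key for stability
    # S <- alpha * S (I - beta k k^T) + beta v k^T
    S = alpha * (S - beta * torch.outer(S @ k, k)) + beta * torch.outer(v, k)
    out = S @ q                                        # read-out for the query
    return S, out

d_k, d_v = 8, 8
S = torch.zeros(d_v, d_k)
for _ in range(5):                                     # toy sequence of 5 tokens
    k, v, q = torch.randn(d_k), torch.randn(d_v), torch.randn(d_k)
    S, out = gated_delta_step(S, k, v, q, alpha=0.9, beta=0.5)
print(out)
```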
[NLP-21] BoRA: Bi-dimensional Weight-Decomposed Low-Rank Adaptation
[Quick Read]: This paper addresses an asymmetry in existing parameter-efficient fine-tuning methods (such as LoRA and DoRA) when adjusting the weight matrix, in particular the fact that DoRA's improvements are limited to the vertical dimension. The key is BoRA, which optimizes the weight matrix symmetrically along both the horizontal and vertical dimensions by adjusting column-wise and row-wise magnitudes. Extensive experiments show that BoRA surpasses state-of-the-art parameter-efficient fine-tuning methods, including LoRA and DoRA, across various benchmarks.
Link: https://arxiv.org/abs/2412.06441
Authors: Qiushi Wang, Yuchen Fan, Junwei Bao, Hongfei Jiang, Yang Song
Keywords (EN): large-scale pre-trained models, Weight-Decomposed Low-Rank Adaptation, Low-Rank Adaptation, Parameter-Efficient Fine-Tuning, recent years
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In recent years, Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) have significantly enhanced the adaptability of large-scale pre-trained models. Weight-Decomposed Low-Rank Adaptation (DoRA) improves upon LoRA by separating the magnitude and direction components of the weight matrix, leading to superior performance. However, DoRA’s improvements are limited to the vertical dimension, resulting in an asymmetrical pattern between horizontal and vertical dimensions. This paper introduces BoRA, an innovative extension of LoRA and DoRA, characterized by symmetrical properties across horizontal and vertical dimensions. Our approach optimizes the weight matrix symmetrically by adjusting both column-wise and row-wise magnitudes. Extensive experiments demonstrate that BoRA surpasses state-of-the-art PEFT methods, including LoRA and DoRA, achieving superior results across various benchmarks.
[NLP-22] Integrating Expert Labels into LLM-based Emission Goal Detection: Example Selection vs Automatic Prompt Design
[Quick Read]: This paper addresses the detection of emission reduction goals in corporate reports, an important task for monitoring companies' progress on climate change, and focuses on how to integrate expert feedback in the form of labeled example passages into LLM-based pipelines. The key is a comparison of two strategies: (1) dynamic selection of few-shot examples and (2) automatic optimization of the prompt by the LLM itself. Findings on a public dataset of 769 climate-related passages from real-world business reports indicate that automatic prompt optimization is the superior approach, while combining the two provides only limited additional benefit. Qualitative results show that the optimized prompts do capture many intricacies of the targeted emission-goal extraction task.
Link: https://arxiv.org/abs/2412.06432
Authors: Marco Wrzalik, Adrian Ulges, Anne Uersfeld, Florian Faust
Keywords (EN): addressing climate change, monitoring companies’ progress, climate change, emission reduction goals, address the detection
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:We address the detection of emission reduction goals in corporate reports, an important task for monitoring companies’ progress in addressing climate change. Specifically, we focus on the issue of integrating expert feedback in the form of labeled example passages into LLM-based pipelines, and compare the two strategies of (1) a dynamic selection of few-shot examples and (2) the automatic optimization of the prompt by the LLM itself. Our findings on a public dataset of 769 climate-related passages from real-world business reports indicate that automatic prompt optimization is the superior approach, while combining both methods provides only limited benefit. Qualitative results indicate that optimized prompts do indeed capture many intricacies of the targeted emission goal extraction task.
[NLP-23] LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation
[Quick Read]: This paper addresses the challenge that the large size and high computational cost of large language models (LLMs) impede their deployment. The key is a structured pruning method based on block-wise importance score propagation (LLM-BIP). LLM-BIP evaluates connection importance precisely by gauging its influence on the output of the corresponding transformer block, which can be approximated efficiently in a single forward pass through an upper bound derived from a Lipschitz-continuity assumption. This avoids both the inaccurate, near-zero gradient estimates used by global pruning methods and the error accumulation of layer-wise pruning. Evaluated with LLaMA-7B, Vicuna-7B, and LLaMA-13B on common zero-shot tasks, the approach improves accuracy on reasoning tasks and substantially reduces perplexity on the WikiText2 and PTB datasets.
Link: https://arxiv.org/abs/2412.06419
Authors: Haihang Wu
Keywords (EN): Large language models, high computational costs, demonstrated remarkable performance, Large language, large size
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have demonstrated remarkable performance across various language tasks, but their widespread deployment is impeded by their large size and high computational costs. Structural pruning is a prevailing technique used to introduce sparsity into pre-trained models and facilitate direct hardware acceleration during inference by removing redundant connections (structurally-grouped parameters), such as channels and attention heads. Existing structural pruning approaches often employ either global or layer-wise pruning criteria; however, they are hindered by ineffectiveness stemming from inaccurate evaluation of connection importance. Global pruning methods typically assess component importance using near-zero and unreliable gradients, while layer-wise pruning approaches encounter significant pruning error accumulation issues. To this end, we propose a more accurate pruning metric based on the block-wise importance score propagation, termed LLM-BIP. Specifically, LLM-BIP precisely evaluates connection importance by gauging its influence on the respective transformer block output, which can be efficiently approximated in a single forward pass through an upper bound derived from the assumption of Lipschitz continuity. We evaluate the proposed method using LLaMA-7B, Vicuna-7B, and LLaMA-13B across common zero-shot tasks. The results demonstrate that our approach achieves an average of 3.26% increase in accuracy for common reasoning tasks compared to previous best baselines. It also reduces perplexity by 14.09 and 68.76 on average for the WikiText2 dataset and PTB dataset, respectively.
[NLP-24] GameArena: Evaluating LLM Reasoning through Live Computer Games
[Quick Read]: This paper addresses the limitations of existing benchmarks for evaluating the reasoning abilities of large language models (LLMs): static datasets are vulnerable to data contamination and saturation, while binary live human feedback conflates reasoning with other abilities. The key is GameArena, a dynamic benchmark that evaluates specific LLM reasoning capabilities (e.g., deductive and inductive reasoning) through interactive gameplay with humans, while keeping participants entertained and engaged. Retrospective analysis of the gaming data uncovers the models' underlying reasoning processes and measures their fine-grained reasoning capabilities. The study collects over 2,000 game sessions and provides detailed assessments of five state-of-the-art LLMs, enabling, for the first time, the collection of step-by-step LLM reasoning data in the wild.
Link: https://arxiv.org/abs/2412.06394
Authors: Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, Hao Zhang
Keywords (EN): large language models, reasoning, reasoning capabilities, language models, large language
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing benchmarks often depend on static datasets, which are vulnerable to data contamination and may get saturated over time, or on binary live human feedback that conflates reasoning with other abilities. As the most prominent dynamic benchmark, Chatbot Arena evaluates open-ended questions in real-world settings, but lacks the granularity in assessing specific reasoning capabilities. We introduce GameArena, a dynamic benchmark designed to evaluate LLM reasoning capabilities through interactive gameplay with humans. GameArena consists of three games designed to test specific reasoning capabilities (e.g., deductive and inductive reasoning), while keeping participants entertained and engaged. We analyze the gaming data retrospectively to uncover the underlying reasoning processes of LLMs and measure their fine-grained reasoning capabilities. We collect over 2000 game sessions and provide detailed assessments of various reasoning capabilities for five state-of-the-art LLMs. Our user study with 100 participants suggests that GameArena improves user engagement compared to Chatbot Arena. For the first time, GameArena enables the collection of step-by-step LLM reasoning data in the wild.
[NLP-25] Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer's Disease Detection
[Quick Read]: This paper investigates how automatic speech recognition (ASR) errors propagate into speech-based automatic detection of Alzheimer's disease (AD). Prior studies have revealed a non-linear relationship between word error rate (WER) and AD detection performance: ASR transcriptions with notable errors can still yield detection accuracy equivalent to that based on manual transcriptions. The key contribution is identifying which ASR errors matter most for BERT-based detection systems: stopwords, despite accounting for a large proportion of errors, play a limited role in distinguishing AD, whereas keywords related to the diagnosis task are significantly more important than other words. These findings provide insight into the interplay between ASR errors and the downstream detection model.
Link: https://arxiv.org/abs/2412.06332
Authors: Jiawen Kang, Junan Li, Jinchao Li, Xixin Wu, Helen Meng
Keywords (EN): Automatic Speech Recognition, Automatic Speech, Alzheimer disease, speech-based automatic detection, Speech Recognition
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: Accepted by IEEE ISCSLP 2024
Abstract:Automatic Speech Recognition (ASR) plays an important role in speech-based automatic detection of Alzheimer’s disease (AD). However, recognition errors could propagate downstream, potentially impacting the detection decisions. Recent studies have revealed a non-linear relationship between word error rates (WER) and AD detection performance, where ASR transcriptions with notable errors could still yield AD detection accuracy equivalent to that based on manual transcriptions. This work presents a series of analyses to explore the effect of ASR transcription errors in BERT-based AD detection systems. Our investigation reveals that not all ASR errors contribute equally to detection performance. Certain words, such as stopwords, despite constituting a large proportion of errors, are shown to play a limited role in distinguishing AD. In contrast, the keywords related to diagnosis tasks exhibit significantly greater importance relative to other words. These findings provide insights into the interplay between ASR errors and the downstream detection model.
[NLP-26] PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models
[Quick Read]: This paper addresses the lack of a standard Chinese dataset for comprehensively evaluating the question-answering ability of large language models (LLMs) in pediatrics. The key is PediaBench, the first Chinese pediatric dataset for LLM evaluation, containing 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess an LLM's proficiency in instruction following, knowledge understanding, clinical case analysis, and more. Extensive experiments on 20 open-source and commercial LLMs validate the effectiveness of PediaBench, and an in-depth analysis of the results offers insight into LLMs' ability to answer pediatric questions in the Chinese context, highlighting their limitations for further improvement.
Link: https://arxiv.org/abs/2412.06287
Authors: Qian Zhang, Panfeng Chen, Jiali Li, Linkun Feng, Shuyu Liu, Mei Chen, Hui Li, Yanhao Wang
Keywords (EN): Large Language Models, Language Models, Large Language, emergence of Large, evaluate their question-answering
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 12 figures
Abstract:The emergence of Large Language Models (LLMs) in the medical domain has stressed a compelling need for standard datasets to evaluate their question-answering (QA) performance. Although there have been several benchmark datasets for medical QA, they either cover common knowledge across different departments or are specific to another department rather than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capacity of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations for further improvements. Our code and data are published at this https URL.
[NLP-27] Methods for Legal Citation Prediction in the Age of LLMs: An Australian Law Case Study
[Quick Read]: This paper addresses hallucination by large language models (LLMs) in legal citation prediction in the Australian law context, where correctly identifying and citing relevant legislation or precedents is critical. The key finding is that task-specific instruction tuning on a dedicated dataset dramatically boosts citation accuracy, achieving the best results across all settings, whereas domain-specific pre-training alone is insufficient. The paper also highlights that database granularity and the type of embeddings play a critical role in retrieval performance, and that hybrid methods combining LLMs with retrieval augmentation, query expansion, or voting ensembles consistently outperform retrieval-only pipelines, with ensemble voting delivering the best result by combining the predictive quality of instruction-tuned LLMs with the strengths of the retrieval system.
Link: https://arxiv.org/abs/2412.06272
Authors: Ehsan Shareghi, Jiuzhou Han, Paul Burgess
Keywords (EN): Large Language Models, Large Language, Language Models, shown great potential, recent years
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: For code, data, and models see this https URL
Abstract:In recent years, Large Language Models (LLMs) have shown great potential across a wide range of legal tasks. Despite these advances, mitigating hallucination remains a significant challenge, with state-of-the-art LLMs still frequently generating incorrect legal references. In this paper, we focus on the problem of legal citation prediction within the Australian law context, where correctly identifying and citing relevant legislations or precedents is critical. We compare several approaches: prompting general purpose and law-specialised LLMs, retrieval-only pipelines with both generic and domain-specific embeddings, task-specific instruction-tuning of LLMs, and hybrid strategies that combine LLMs with retrieval augmentation, query expansion, or voting ensembles. Our findings indicate that domain-specific pre-training alone is insufficient for achieving satisfactory citation accuracy even after law-specialised pre-training. In contrast, instruction tuning on our task-specific dataset dramatically boosts performance reaching the best results across all settings. We also highlight that database granularity along with the type of embeddings play a critical role in the performance of retrieval systems. Among retrieval-based approaches, hybrid methods consistently outperform retrieval-only setups, and among these, ensemble voting delivers the best result by combining the predictive quality of instruction-tuned LLMs with the retrieval system.
zh
[NLP-28] Optimizing Multi-Task Learning for Enhanced Performance in Large Language Models
【速读】: 该论文旨在解决如何基于GPT-4在多任务学习框架下提升大型语言模型的性能问题。解决方案的关键在于通过共享特征提取器和任务特定模块的结合设计,实现多任务间的知识共享与优化。实验结果表明,该多任务学习模型在文本分类准确率和自动摘要生成的ROUGE值上均优于单任务GPT-4、多任务GPT-3、BERT基础模型和经典的Bi-LSTM with Attention模型,验证了多任务学习在提升模型泛化能力和任务间协同学习方面的优势。
链接: https://arxiv.org/abs/2412.06249
作者: Zhen Qi,Jiajing Chen,Shuo Wang,Bingying Liu,Hongye Zheng,Chihang Wang
关键词-EN: performance improvement method, multi-task learning, multi-task learning framework, large language models, multi-task learning model
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This study aims to explore the performance improvement method of large language models based on GPT-4 under the multi-task learning framework and conducts experiments on two tasks: text classification and automatic summary generation. Through the combined design of shared feature extractors and task-specific modules, we achieve knowledge-sharing and optimization of multiple tasks in the same model. The experiment uses multiple subtasks of the GLUE dataset to compare the performance of the multi-task model with the single-task GPT-4, the multi-task version of GPT-3, the BERT basic model, and the classic Bi-LSTM with Attention model. The results show that the proposed multi-task learning model outperforms other comparison models in terms of text classification accuracy and ROUGE value of summary generation, demonstrating the advantages of multi-task learning in improving model generalization ability and collaborative learning between tasks. The model maintains a stable loss convergence rate during training, showing good learning efficiency and adaptability to the test set. This study verifies the applicability of the multi-task learning framework in large language models, especially in improving the model’s ability to balance different tasks. In the future, with the combination of large language models and multimodal data and the application of dynamic task adjustment technology, the framework based on multi-task learning is expected to play a greater role in practical applications across fields and provide new ideas for the development of general artificial intelligence.
zh
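作为补充,下面用 PyTorch 给出一个"共享特征提取器 + 任务特定模块"这类多任务结构的极简示意;其中的网络结构、维度与任务头均为演示用的假设,并非上文论文的原始实现(论文中的骨干为 GPT-4 等大模型)。

```python
import torch
import torch.nn as nn

class SharedMultiTaskModel(nn.Module):
    """共享编码器 + 任务特定头的多任务学习骨架(示意用)。"""
    def __init__(self, vocab_size=30522, hidden=256, num_classes=4):
        super().__init__()
        # 共享特征提取器:embedding + 双向 LSTM(实际工作中可替换为 GPT/BERT 等)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        # 任务特定模块:文本分类头与摘要生成头(此处简化为逐位置词表预测)
        self.cls_head = nn.Linear(2 * hidden, num_classes)
        self.sum_head = nn.Linear(2 * hidden, vocab_size)

    def forward(self, input_ids, task):
        h, _ = self.encoder(self.embed(input_ids))      # [B, T, 2H]
        if task == "classification":
            return self.cls_head(h.mean(dim=1))         # 句级表示 -> 类别 logits
        return self.sum_head(h)                          # 逐位置词表 logits(摘要示意)

model = SharedMultiTaskModel()
x = torch.randint(0, 30522, (2, 16))
loss = nn.CrossEntropyLoss()(model(x, "classification"), torch.tensor([1, 3]))
loss.backward()   # 两个任务共享编码器梯度,从而实现知识共享
```

两个任务头共用同一套编码器参数,分类与摘要损失交替(或加权)回传,即可实现摘要中所说的"知识共享与协同优化"。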
[NLP-29] A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension
【速读】: 该论文试图解决的问题是探究监督微调 (Supervised Fine-Tuning, SFT) 和上下文学习 (In-Context Learning, ICL) 这两种学习范式对大型语言模型 (LLMs) 隐藏表示的影响。解决方案的关键在于使用内在维度 (Intrinsic Dimension, ID) 来估计模型在执行特定自然语言任务时提取的表示之间的自由度。通过分析ID在SFT和ICL过程中的变化,研究发现ICL相较于SFT在嵌入空间中诱导出更高的ID,表明ICL生成的表示位于更高维度的流形上。
链接: https://arxiv.org/abs/2412.06245
作者: Saahith Janapati,Yangfeng Ji
关键词-EN: Large Language Models, performance of Large, natural language tasks, Large Language, supervised fine-tuning
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The performance of Large Language Models (LLMs) on natural language tasks can be improved through both supervised fine-tuning (SFT) and in-context learning (ICL), which operate via distinct mechanisms. Supervised fine-tuning updates the model’s weights by minimizing loss on training data, whereas in-context learning leverages task demonstrations embedded in the prompt, without changing the model’s parameters. This study investigates the effects of these learning paradigms on the hidden representations of LLMs using Intrinsic Dimension (ID). We use ID to estimate the number of degrees of freedom between representations extracted from LLMs as they perform specific natural language tasks. We first explore how the ID of LLM representations evolves during SFT and how it varies due to the number of demonstrations in ICL. We then compare the IDs induced by SFT and ICL and find that ICL consistently induces a higher ID compared to SFT, suggesting that representations generated during ICL reside in higher dimensional manifolds in the embedding space.
zh
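内在维度 (ID) 的一种常用估计方法是 TwoNN,即利用每个点最近邻与次近邻距离之比做近似最大似然估计。下面是一个仅依赖 numpy 的示意实现,用随机向量代替真实的 LLM 隐藏表示;论文实际采用的估计器可能不同,此处仅作方法说明。

```python
import numpy as np

def twonn_intrinsic_dimension(X):
    """TwoNN 内在维度估计:d ≈ N / Σ log(r2 / r1),X 形状为 [N, D]。"""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # 两两欧氏距离
    np.fill_diagonal(dists, np.inf)                                  # 排除自身
    sorted_d = np.sort(dists, axis=1)
    r1, r2 = sorted_d[:, 0], sorted_d[:, 1]                          # 最近邻与次近邻距离
    mu = r2 / r1
    return len(X) / np.sum(np.log(mu))

# 用随机"隐藏表示"演示:真实场景中 X 应取自 LLM 某一层在一批样本上的表示
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 64))  # 嵌入在 64 维空间中的约 8 维流形
print(round(twonn_intrinsic_dimension(X), 2))              # 估计值应接近 8
```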
[NLP-30] LLMs as Debate Partners: Utilizing Genetic Algorithms and Adversarial Search for Adaptive Arguments
【速读】: 该论文试图解决传统大型语言模型 (LLMs) 在辩论中缺乏战略规划能力的问题,关键解决方案是引入 DebateBrawl 平台,该平台通过整合遗传算法 (GA) 和对抗搜索 (AS) 来优化辩论策略,从而实现自适应和更具挑战性的辩论体验。这种方法不仅提高了 AI 生成论点的连贯性和上下文相关性,还通过实时调整策略增强了 AI 的适应能力。实验结果表明,DebateBrawl 在保持高事实准确性(92%)的同时,能够生成多样化的论点,显著提升了辩论教育的效果,并有助于通过 AI 辅助的论证改善公共话语质量。
链接: https://arxiv.org/abs/2412.06229
作者: Prakash Aryan
关键词-EN: Large Language Models, integrates Large Language, Genetic Algorithms, Language Models, Adversarial Search
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Neural and Evolutionary Computing (cs.NE)
备注:
点击查看摘要
Abstract:This paper introduces DebateBrawl, an innovative AI-powered debate platform that integrates Large Language Models (LLMs), Genetic Algorithms (GA), and Adversarial Search (AS) to create an adaptive and engaging debating experience. DebateBrawl addresses the limitations of traditional LLMs in strategic planning by incorporating evolutionary optimization and game-theoretic techniques. The system demonstrates remarkable performance in generating coherent, contextually relevant arguments while adapting its strategy in real-time. Experimental results involving 23 debates show balanced outcomes between AI and human participants, with the AI system achieving an average score of 2.72 compared to the human average of 2.67 out of 10. User feedback indicates significant improvements in debating skills and a highly satisfactory learning experience, with 85% of users reporting improved debating abilities and 78% finding the AI opponent appropriately challenging. The system’s ability to maintain high factual accuracy (92% compared to 78% in human-only debates) while generating diverse arguments addresses critical concerns in AI-assisted discourse. DebateBrawl not only serves as an effective educational tool but also contributes to the broader goal of improving public discourse through AI-assisted argumentation. The paper discusses the ethical implications of AI in persuasive contexts and outlines the measures implemented to ensure responsible development and deployment of the system, including robust fact-checking mechanisms and transparency in decision-making processes.
zh
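为说明"进化式策略优化"的基本流程(选择、交叉、变异),下面给出一个极简遗传算法示意:个体是一组论点策略权重,适应度函数在真实系统中应由辩论评分或对抗搜索给出,这里用一个假设的目标函数代替,与 DebateBrawl 的具体实现无关。

```python
import random

TARGET = [0.6, 0.2, 0.9, 0.4]   # 假设的"理想策略权重",仅作演示

def fitness(ind):
    # 真实系统中应来自辩论评分 / 对抗搜索结果,这里用与目标的负平方距离代替
    return -sum((a - b) ** 2 for a, b in zip(ind, TARGET))

def evolve(pop_size=30, genes=4, generations=50, mut_rate=0.2):
    pop = [[random.random() for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                      # 选择
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, genes)
            child = p1[:cut] + p2[cut:]                     # 单点交叉
            if random.random() < mut_rate:                  # 变异
                child[random.randrange(genes)] = random.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print([round(x, 2) for x in evolve()])   # 进化后的策略权重应逐渐接近 TARGET
```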
[NLP-31] SiReRAG: Indexing Similar and Related Information for Multihop Reasoning
【速读】: 该论文试图解决现有检索增强生成 (RAG) 系统中索引方法仅基于语义相似性 (similarity) 或相关性 (relatedness) 的问题,这两类方法均未能全面涵盖两个视角,导致在需要多跳推理的复杂任务中表现不佳。解决方案的关键在于提出了一种新的索引方法 SiReRAG,该方法同时考虑相似信息与相关信息。具体来说,SiReRAG 通过递归摘要构建相似性树,并通过提取命题和实体、基于共享实体对命题进行分组以及生成递归摘要来构建相关性树。最终,将这两种树索引并展平为一个统一的检索池,从而在三个多跳数据集上稳定超越现有索引方法,平均 F1 分数提高了 1.9%;同时也能显著增强现有重排序方法,平均 F1 分数最高提升 7.8%。
链接: https://arxiv.org/abs/2412.06206
作者: Nan Zhang,Prafulla Kumar Choubey,Alexander Fabbri,Gabriel Bernadett-Shapiro,Rui Zhang,Prasenjit Mitra,Caiming Xiong,Chien-Sheng Wu
关键词-EN: retrieval-augmented generation, important step, step towards strong, related information, strong performance
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Indexing is an important step towards strong performance in retrieval-augmented generation (RAG) systems. However, existing methods organize data based on either semantic similarity (similarity) or related information (relatedness), but do not cover both perspectives comprehensively. Our analysis reveals that modeling only one perspective results in insufficient knowledge synthesis, leading to suboptimal performance on complex tasks requiring multihop reasoning. In this paper, we propose SiReRAG, a novel RAG indexing approach that explicitly considers both similar and related information. On the similarity side, we follow existing work and explore some variances to construct a similarity tree based on recursive summarization. On the relatedness side, SiReRAG extracts propositions and entities from texts, groups propositions via shared entities, and generates recursive summaries to construct a relatedness tree. We index and flatten both similarity and relatedness trees into a unified retrieval pool. Our experiments demonstrate that SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets (MuSiQue, 2WikiMultiHopQA, and HotpotQA), with an average 1.9% improvement in F1 scores. As a reasonably efficient solution, SiReRAG enhances existing reranking methods significantly, with up to 7.8% improvement in average F1 scores.
zh
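SiReRAG "相关性树"的核心一步是按共享实体对命题分组。下面是一个与该流程近似的玩具示意:实体抽取在真实实现中应由 NER 或 LLM 完成,此处直接以人工给定的实体列表代替,分组后的每个实体组再做递归摘要即可构成树的上层节点。

```python
from collections import defaultdict

# 假设已从文本中抽取出 (命题, 实体列表);真实系统中由 LLM / NER 产生
propositions = [
    ("Marie Curie won the Nobel Prize in Physics in 1903.", ["Marie Curie", "Nobel Prize"]),
    ("Marie Curie was born in Warsaw.",                      ["Marie Curie", "Warsaw"]),
    ("Warsaw is the capital of Poland.",                     ["Warsaw", "Poland"]),
    ("The Nobel Prize is awarded in Stockholm.",             ["Nobel Prize", "Stockholm"]),
]

# 按共享实体分组:同一实体下的命题在"相关性"维度上被聚到一起
groups = defaultdict(list)
for text, entities in propositions:
    for ent in entities:
        groups[ent].append(text)

# 每个实体组可进一步递归摘要,构成相关性树的一层节点
for ent, props in groups.items():
    print(f"[{ent}] {len(props)} propositions -> summarize into one tree node")
```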
[NLP-32] SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs
【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 在扩展到更长上下文窗口时,传统注意力机制 (attention mechanisms) 计算成本随输入长度呈二次增长的问题。解决方案的关键是引入了一种动态稀疏注意力方法,称为SparseAccelerate,它根据输入特征动态调整稀疏模式,从而有效平滑注意力复杂度曲线。该方法适用于从16K起步、最长至128K tokens的输入,在32K tokens输入下可将Time-To-First-Token (TTFT) 延迟降低至多1.04倍,并带来可观的内存节省,尤其适用于中端硬件上的长上下文任务和内存密集型应用。SparseAccelerate不仅减少了延迟,还改变了复杂度随上下文长度增长的趋势,显示出相较于其他方法更小的TTFT增长梯度,从而为高效、实时的大上下文LLM推理提供了重要进展。
链接: https://arxiv.org/abs/2412.06198
作者: James Vo
关键词-EN: Large Language Models, Language Models, traditionally grows quadratically, Large Language, longer context windows
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) scale to longer context windows, the computational cost of attention mechanisms, which traditionally grows quadratically with input length, presents a critical challenge for real-time and memory-constrained deployments. Existing sparse attention techniques have sought to reduce this complexity, but they often incur significant overhead or compromise accuracy, making them less practical for large contexts on mid-range hardware. In this paper, we introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics, effectively flattening the attention complexity curve. Our approach is effective for input lengths starting at 16K tokens and scales efficiently up to 128K tokens on dual NVIDIA A5000 GPUs (24GB each). Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in Time-To-First-Token (TTFT) latency at 32K tokens, while also providing substantial memory savings. These improvements yield practical gains for memory-intensive applications and long-context tasks that were previously infeasible with standard attention. Beyond latency reductions, SparseAccelerate fundamentally shifts the scaling trend, demonstrating the smallest TTFT growth gradient relative to context length among competing methods. Ongoing evaluations on diverse benchmarks confirm its scalability, positioning SparseAccelerate as a critical advancement toward efficient, real-time, and large-context LLM inference on accessible hardware.
zh
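作为背景说明,下面给出"滑动窗口 + 全局 token"式稀疏注意力掩码的通用示意(许多稀疏注意力方法的基本构件),并非 SparseAccelerate 的动态稀疏模式本身;窗口大小与全局 token 数量均为假设参数。

```python
import torch

def sparse_attention_mask(seq_len, window=4, num_global=2):
    """返回布尔掩码 [seq_len, seq_len]:True 表示允许注意。"""
    idx = torch.arange(seq_len)
    # 滑动窗口:每个 query 只看局部邻域
    mask = (idx[:, None] - idx[None, :]).abs() <= window
    # 全局 token:前 num_global 个位置与所有位置互相可见
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

def sparse_attention(q, k, v, mask):
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))   # 掩掉不允许的位置
    return torch.softmax(scores, dim=-1) @ v

L, D = 16, 8
q = k = v = torch.randn(L, D)
mask = sparse_attention_mask(L)
out = sparse_attention(q, k, v, mask)
print(out.shape, mask.float().mean().item())  # 非零比例即实际参与计算的注意力占比
```

动态稀疏方法的差异在于掩码(或块选择)随输入特征变化,而不是像上面这样固定;但"只计算被掩码允许的注意力"这一核心思想是一致的。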
[NLP-33] Annotations for Exploring Food Tweets From Multiple Aspects
【速读】: 该论文旨在通过扩展拉脱维亚推特食客语料库(Latvian Twitter Eater Corpus, LTEC)来解决机器翻译、命名实体识别、时间线平衡的情感分析以及文本与图像关系分类等任务中的数据需求问题。解决方案的关键在于手动注释的评估数据子集的引入,这些数据为不同任务提供了高质量的标注,从而支持基线模型的实验和未来建模方法的挑战识别。
链接: https://arxiv.org/abs/2412.06179
作者: Matīss Rikters,Edison Marrese-Taylor,Rinalds Vīksna
关键词-EN: Twitter Eater Corpus, Latvian Twitter Eater, Eater Corpus, Latvian Twitter, Twitter Eater
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This research builds upon the Latvian Twitter Eater Corpus (LTEC), which is focused on the narrow domain of tweets related to food, drinks, eating and drinking. LTEC has been collected for more than 12 years and reaching almost 3 million tweets with the basic information as well as extended automatically and manually annotated metadata. In this paper we supplement the LTEC with manually annotated subsets of evaluation data for machine translation, named entity recognition, timeline-balanced sentiment analysis, and text-image relation classification. We experiment with each of the data sets using baseline models and highlight future challenges for various modelling approaches.
zh
[NLP-34] Query-Efficient Planning with Language Models
【速读】: 该论文试图解决在复杂环境中高效规划的问题,关键在于利用大型语言模型 (Large Language Models, LLMs) 提升规划的查询效率。论文提出了两种竞争性框架:第一种是将LLM作为启发式算法嵌入搜索型规划器中,用于选择有前景的节点和动作;第二种是将LLM作为生成式规划器,直接生成从起点到目标的完整动作序列,并通过世界模型反馈进行调整。研究表明,尽管两种方法都优于基线模型,但生成式规划器显著减少了交互次数,主要因为LLM作为规划器能更迅速地根据即时反馈调整规划策略。
链接: https://arxiv.org/abs/2412.06162
作者: Gonzalo Gonzalez-Pumariega,Wayne Chen,Kushal Kedia,Sanjiban Choudhury
关键词-EN: complex environments requires, Large Language Models, world model, complex environments, environments requires
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages (not including references or appendix); 13 figures (9 main paper, 4 appendix); (v1) preprint
点击查看摘要
Abstract:Planning in complex environments requires an agent to efficiently query a world model to find a feasible sequence of actions from start to goal. Recent work has shown that Large Language Models (LLMs), with their rich prior knowledge and reasoning capabilities, can potentially help with planning by searching over promising states and adapting to feedback from the world. In this paper, we propose and study two fundamentally competing frameworks that leverage LLMs for query-efficient planning. The first uses LLMs as a heuristic within a search-based planner to select promising nodes to expand and propose promising actions. The second uses LLMs as a generative planner to propose an entire sequence of actions from start to goal, query a world model, and adapt based on feedback. We show that while both approaches improve upon comparable baselines, using an LLM as a generative planner results in significantly fewer interactions. Our key finding is that the LLM as a planner can more rapidly adapt its planning strategies based on immediate feedback than LLM as a heuristic. We present evaluations and ablations on Robotouille and PDDL planning benchmarks and discuss connections to existing theory on query-efficient planning algorithms. Code is available at this https URL
zh
[NLP-35] he Computational Limits of State-Space Models and Mamba via the Lens of Circuit Complexity
【速读】: 该论文试图解决的问题是验证Mamba和状态空间模型(State-space Models, SSMs)在计算复杂性上的局限性,特别是它们是否在理论上比Transformer更具计算表达能力。解决方案的关键在于通过电路复杂性框架(circuit complexity framework)证明,即使Mamba和SSMs具有多项式精度(poly(n)-precision)和常数深度层(constant-depth layers),它们仍然属于DLOGTIME-uniform TC^0复杂度类。这一结果表明,Mamba在理论上与Transformer具有相同的计算能力,无法解决TC^0之外的问题(如算术公式问题、布尔公式值问题和排列组合问题),从而挑战了Mamba比Transformer更具计算表达能力的假设。论文的核心贡献是通过严格的证明展示了Mamba和选择性SSM架构可以被DLOGTIME-uniform TC^0电路模拟,并且它们无法解决TC^0之外的问题。
链接: https://arxiv.org/abs/2412.06148
作者: Yifang Chen,Xiaoyu Li,Yingyu Liang,Zhenmei Shi,Zhao Song
关键词-EN: State-space Models, mathsf, circuit complexity framework, Mamba, complexity framework
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In this paper, we analyze the computational limitations of Mamba and State-space Models (SSMs) by using the circuit complexity framework. Despite Mamba's stateful design and recent attention as a strong candidate to outperform Transformers, we have demonstrated that both Mamba and SSMs with poly(n)-precision and constant-depth layers reside within the DLOGTIME-uniform TC^0 complexity class. This result indicates Mamba has the same computational capabilities as Transformer theoretically, and it cannot solve problems like arithmetic formula problems, boolean formula value problems, and permutation composition problems if TC^0 ≠ NC^1. Therefore, it challenges the assumption that Mamba is more computationally expressive than Transformers. Our contributions include rigorous proofs showing that Selective SSM and Mamba architectures can be simulated by DLOGTIME-uniform TC^0 circuits, and that they cannot solve problems outside TC^0.
zh
[NLP-36] Hate Speech According to the Law: An Analysis for Effective Detection
【速读】: 该论文试图解决在线平台上仇恨言论(hate speech)的分类和法律框架适用性问题。其关键解决方案在于利用预训练模型和大型语言模型(如Qwen2-7B-Instruct和Meta-Llama-3-70B)处理仇恨言论数据集,并通过伪标签(pseudo-labeling)技术提升模型性能。此外,论文强调了法律专家意见的重要性,特别是在法律注释(annotations)的辅助下,能够更有效地分类可起诉的仇恨言论。然而,解决方案的核心在于更多地关注不同国家法律之间的差异,以确保法律框架的适用性和有效性。
链接: https://arxiv.org/abs/2412.06144
作者: Katerina Korre,John Pavlopoulos,Paolo Gajo,Alberto Barrón-Cedeño
关键词-EN: hate speech, prosecutable hate speech, hate, hate speech extends, speech
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The issue of hate speech extends beyond the confines of the online realm. It is a problem with real-life repercussions, prompting most nations to formulate legal frameworks that classify hate speech as a punishable offence. These legal frameworks differ from one country to another, contributing to the big chaos that online platforms have to face when addressing reported instances of hate speech. With the definitions of hate speech falling short in introducing a robust framework, we turn our gaze onto hate speech laws. We consult the opinion of legal experts on a hate speech dataset and we experiment by employing various approaches such as pretrained models both on hate speech and legal data, as well as exploiting two large language models (Qwen2-7B-Instruct and Meta-Llama-3-70B). Due to the time-consuming nature of data acquisition for prosecutable hate speech, we use pseudo-labeling to improve our pretrained models. This study highlights the importance of amplifying research on prosecutable hate speech and provides insights into effective strategies for combating hate speech within the parameters of legal frameworks. Our findings show that legal knowledge in the form of annotations can be useful when classifying prosecutable hate speech, yet more focus should be paid on the differences between the laws.
zh
[NLP-37] MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization
【速读】: 该论文试图解决医学领域中大型视觉-语言模型(Medical Large Vision-Language Models, Med-LVLMs)在处理多模态数据时因模态对齐问题导致的幻觉现象,即模型优先处理文本知识而忽视视觉信息,从而产生与医学图像信息相矛盾的错误输出。解决方案的关键在于提出了一种新的多模态医学偏好优化方法(MMedPO),通过引入两种类型的非偏好样本(plausible hallucinations 和 lesion region neglect)来增强模型的模态对齐能力。MMedPO通过计算每个样本的临床相关性得分,并将其作为权重整合到偏好优化过程中,从而显著提高了Med-LVLMs的事实准确性,在医学视觉问答(Med-VQA)和报告生成任务中分别取得了14.2%和51.7%的性能提升。
链接: https://arxiv.org/abs/2412.06141
作者: Kangyu Zhu,Peng Xia,Yun Li,Hongtu Zhu,Sheng Wang,Huaxiu Yao
关键词-EN: Large Vision-Language Models, advancement of Large, Large Vision-Language, Vision-Language Models, propelled their application
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately mitigated clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods by averaging 14.2% and 51.7% across the Med-VQA and report generation tasks. Our code is available at this https URL.
zh
[NLP-38] AIDE: Task-Specific Fine Tuning with Attribute Guided Multi-Hop Data Expansion
【速读】: 该论文试图解决在特定任务中微调大型语言模型(LLMs)时,高质量、多样化训练数据的获取问题。现有方法要么依赖大规模种子数据集,要么难以同时保证生成数据的任务相关性和多样性。论文提出的解决方案是AIDE,一种新颖的数据合成框架,通过多跳过程扩展10个种子数据点,确保生成数据的多样性和任务相关性。AIDE通过提取种子数据的主要主题和关键知识属性来指导合成过程,并在每一步跳跃中继续提取新数据的主题和属性,重复此过程直至K跳。为防止随着跳跃深度增加而生成无关数据,AIDE引入了残差连接机制和自反思来提升数据质量。
链接: https://arxiv.org/abs/2412.06136
作者: Jiayu Li,Xuan Zhu,Fang Liu,Yanjun Qi
关键词-EN: tasks requires high-quality, specific tasks requires, diverse training data, Fine-tuning large language, training data relevant
类目: Computation and Language (cs.CL)
备注: 19 pages
点击查看摘要
Abstract:Fine-tuning large language models (LLMs) for specific tasks requires high-quality, diverse training data relevant to the task. Recent research has leveraged LLMs to synthesize training data, but existing approaches either depend on large seed datasets or struggle to ensure both task relevance and data diversity in the generated outputs. To address these challenges, we propose AIDE, a novel data synthesis framework that uses a multi-hop process to expand 10 seed data points while ensuring diversity and task relevance. AIDE extracts the main topic and key knowledge attributes from the seed data to guide the synthesis process. In each subsequent hop, it extracts the topic and attributes from the newly generated data and continues guided synthesis. This process repeats for a total of K hops. To prevent irrelevant data generation as the hop depth increases, AIDE incorporates a residual connection mechanism and uses self-reflection to improve data quality. Our empirical results demonstrate that fine-tuning Mistral-7B, Llama-3.1-8B and Llama-3.2-3B with AIDE achieves more than 10% accuracy improvements over the base models across 13 tasks from 5 different benchmarks, while outperforming the models fine-tuned with state-of-the-art data synthesis methods like Evol-Instruct, DataTune and Prompt2Model.
zh
[NLP-39] Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings
【速读】: 该论文试图解决当前大型语言模型(LLMs)在社会偏见基准测试中主要依赖预定义的多项选择题格式,无法充分反映现实世界交互复杂性和开放性问题。解决方案的关键在于扩展现有的BBQ数据集,引入填空题和简答题类型,以在开放式环境中评估偏见。研究发现,LLMs在某些受保护属性(如年龄和社会经济地位)上表现出更强的偏见,但这些偏见输出可作为去偏见的宝贵上下文和思维链。论文提出的去偏见方法结合了零样本、少样本和思维链技术,能显著降低偏见水平至接近零。通过开源评估和去偏见代码,论文旨在促进对LLMs中偏见和刻板印象的进一步测量和缓解。
链接: https://arxiv.org/abs/2412.06134
作者: Zhao Liu
关键词-EN: Large Language Models, Language Models, Current social bias, Large Language, pre-defined question formats
类目: Computation and Language (cs.CL)
备注: 12 panges
点击查看摘要
Abstract:Current social bias benchmarks for Large Language Models (LLMs) primarily rely on pre-defined question formats like multiple-choice, limiting their ability to reflect the complexity and open-ended nature of real-world interactions. To address this gap, we extend the existing BBQ dataset by incorporating fill-in-the-blank and short-answer question types, designed to evaluate biases in an open-ended setting. Our findings reveal that LLMs tend to produce responses that are more biased against certain protected attributes, like age and socio-economic status. On the other hand, these biased outputs produced by LLMs can serve as valuable contexts and chains of thought for debiasing. Our debiasing approach, which combines zero-shot, few-shot, and chain-of-thought prompting, can reduce the level of bias to almost zero. We open-source our evaluation and debiasing code, hoping to encourage further measurement and mitigation of bias and stereotypes in LLMs.
zh
[NLP-40] Infusing Prompts with Syntax and Semantics
【速读】: 该论文试图解决语言模型在生成输出时经常出现的语言结构缺陷问题。解决方案的关键在于直接将各种类型的句法(syntactic)和语义(semantic)信息注入到大型语言模型中,并通过自然语言查询到SQL的翻译任务,特别是在资源较少的语言上,验证了低成本的句法和语义信息对模型性能的显著提升作用。研究表明,这种注入方式能够大幅提升语言模型的表现,超越了之前的最优系统。
链接: https://arxiv.org/abs/2412.06107
作者: Anton Bulle Labate,Fabio Gagliardi Cozman
关键词-EN: flawed linguistic structure, impressive success, generate outputs, outputs with flawed, language models
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite impressive success, language models often generate outputs with flawed linguistic structure. We analyze the effect of directly infusing various kinds of syntactic and semantic information into large language models. To demonstrate the value of our proposals, we focus on the translation of natural language queries to SQL, in particular dealing with languages with less resources than English, to better investigate how much help we can get from low cost syntactic and semantic information. We show that linguistic analysis can significantly boost language models, to the point that we have surpassed previous best systems.
zh
[NLP-41] Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling
【速读】: 该论文试图解决Transformer架构中注意力机制的二次复杂度问题,特别是在处理长序列时的高计算成本。解决方案的关键在于提出了一种基于PerceiverAR架构的改进方法,通过引入三种不同的架构增强方案,并在其中一种方案中借鉴了Long-LoRA的高效注意力计算方法,设计了名为Long LoRA Perceiver (LLP)的新架构。该架构不仅能够有效降低计算复杂度,还能在保持高性能的同时,作为大型语言模型(LLMs)的基础架构使用,而非仅作为微调的附加组件。实验结果表明,LLP在多个基准测试中相较于现有的Transformer模型有显著的性能提升。
链接: https://arxiv.org/abs/2412.06106
作者: Kaleel Mahmood,Shaoyi Huang
关键词-EN: Natural Language Processing, Large Language Models, Language Processing field, Natural Language, Large Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The Transformer architecture has revolutionized the Natural Language Processing field and is the backbone of Large Language Models (LLMs). The Transformer uses the attention mechanism that computes the pair-wise similarity between its input tokens to produce latent vectors that are able to understand the semantic meaning of the input text. One of the challenges in the Transformer architecture is the quadratic complexity of the attention mechanism that prohibits the efficient processing of long sequence lengths. While many recent research works have attempted to provide a reduction from O(n^2) time complexity of attention to semi-linear complexity, it remains an unsolved problem in the sense of maintaining a high performance when such complexity is reduced. One of the important works in this respect is the Perceiver class of architectures that have demonstrated excellent performance while reducing the computation complexity. In this paper, we use the PerceiverAR that was proposed for Auto-Regressive modeling as a baseline, and provide three different architectural enhancements to it with varying computation overhead tradeoffs. Inspired by the recently proposed efficient attention computation approach of Long-LoRA, we then present an equally efficient Perceiver-based architecture (termed Long LoRA Perceiver - LLP) that can be used as the base architecture in LLMs instead of just a fine-tuning add-on. Our results on different benchmarks indicate impressive improvements compared to recent Transformer-based models.
zh
[NLP-42] Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates Mean Length of Utterances and Annotation Invariance
【速读】: 该论文试图解决从小规模语料库(treebanks)中准确估计语法结构多样性的问题。解决方案的关键在于提出了一个新的度量标准——推导熵率(derivational entropy rate),并证明了平均话语长度(MLU, Mean Length of Utterance)与语法复杂性之间的根本联系。通过结合MLU和推导熵率,论文提供了一种无需理论假设的语法复杂性评估方法。此外,论文引入了平滑诱导树库熵(SITE, Smoothed Induced Treebank Entropy)作为工具,用于从小规模树库中准确估计这些度量,从而解决了小规模语料库带来的限制问题。
链接: https://arxiv.org/abs/2412.06095
作者: Fermin Moscoso del Prado Martin
关键词-EN: derivational entropy rate, derivational entropy, study of aging, historical linguistics, type of speakers
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Information Theory (cs.IT)
备注:
点击查看摘要
Abstract:In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used for estimating the diversity of grammatical structures that are produced during a period by an individual, community, or type of speakers. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate – theoretically, and empirically – that a grammar’s derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the most practical index of syntactic complexity; I demonstrate that MLU is not a mere proxy, but a fundamental measure of syntactic diversity. In combination with the new derivational entropy rate measure, it provides a theory-free assessment of grammatical complexity. The derivational entropy rate indexes the rate at which different grammatical annotation frameworks determine the grammatical complexity of treebanks. I introduce the Smoothed Induced Treebank Entropy (SITE) as a tool for estimating these measures accurately, even from very small treebanks. I conclude by discussing important implications of these results for both NLP and human language processing.
zh
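下面用一个小例子示意两个相关量的计算:话语平均长度 (MLU) 和对树库中句法产生式分布做加一平滑后的香农熵。这只是对"推导熵 / MLU"思路的通俗近似,并非论文中 SITE 估计量的实现;玩具树库与产生式均为虚构数据。

```python
import math
from collections import Counter

# 玩具"树库":每句话由 (词数, 该句用到的产生式列表) 表示,产生式为假设数据
treebank = [
    (5, ["S->NP VP", "NP->Det N", "VP->V NP"]),
    (7, ["S->NP VP", "NP->Det Adj N", "VP->V PP", "PP->P NP"]),
    (4, ["S->NP VP", "NP->Pron", "VP->V"]),
]

def mean_length_of_utterance(tb):
    return sum(n for n, _ in tb) / len(tb)

def smoothed_rule_entropy(tb, alpha=1.0):
    """对产生式使用频率做加 alpha 平滑后计算香农熵(单位:bit)。"""
    counts = Counter(r for _, rules in tb for r in rules)
    total = sum(counts.values()) + alpha * len(counts)
    probs = [(c + alpha) / total for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs)

mlu = mean_length_of_utterance(treebank)
h = smoothed_rule_entropy(treebank)
print(f"MLU = {mlu:.2f} words, smoothed rule entropy = {h:.2f} bits")
print(f"entropy per word (rough rate) = {h / mlu:.2f} bits/word")
```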
[NLP-43] KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models
【速读】: 该论文试图解决大型语言模型(LLMs)在特定任务或领域中适应时,由于模型规模增大导致的计算开销和内存使用问题。现有的参数高效微调(PEFT)方法,如LoRA及其变体,虽然简化了微调过程,但忽略了与目标任务无关或噪声的知识,影响了模型性能。论文提出的解决方案是知识感知的奇异值适应(KaSA),其关键在于利用奇异值分解(SVD)结合知识感知的奇异值,动态激活与当前任务相关的知识,从而提高模型在自然语言理解(NLU)、生成(NLG)、指令跟随和常识推理等任务中的表现。实验结果表明,KaSA在多个基准测试和合成数据集上均优于现有的PEFT方法。
链接: https://arxiv.org/abs/2412.06071
作者: Fan Wang,Juyong Jiang,Chansung Park,Sunghun Kim,Jing Tang
关键词-EN: significant computational overhead, increasing sizes, sizes of large, significant computational, computational overhead
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The increasing sizes of large language models (LLMs) result in significant computational overhead and memory usage when adapting these models to specific tasks or domains. Various parameter-efficient fine-tuning (PEFT) methods have been devised to mitigate these challenges by training a small set of parameters for the task-specific updates of the model weights. Among PEFT methods, LoRA stands out for its simplicity and efficiency, inspiring the development of a series of variants. However, LoRA and its successors disregard the knowledge that is noisy or irrelevant to the targeted task, detrimentally impacting model performance and leading to suboptimality. To address this limitation, we introduce Knowledge-aware Singular-value Adaptation (KaSA), a PEFT method that leverages singular value decomposition (SVD) with knowledge-aware singular values to dynamically activate knowledge based on its relevance to the task at hand. We conduct extensive experiments across a range of LLMs on tasks spanning natural language understanding (NLU), generation (NLG), instruction following, and commonsense reasoning. The experimental results demonstrate that KaSA consistently outperforms FFT and 14 popular PEFT baselines across 16 benchmarks and 4 synthetic datasets, underscoring our method’s efficacy and adaptability. The source code of our method is available at this https URL.
zh
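为帮助理解"基于 SVD 的奇异值门控"这一思路,下面给出一个示意:对权重增量做截断 SVD,并用可学习的门控向量缩放各奇异方向。门控的参数化与训练目标均为假设,并非 KaSA 的原始算法,仅展示"按奇异方向选择性激活知识"的机制。

```python
import torch
import torch.nn as nn

class SVDGatedAdapter(nn.Module):
    """W_eff = W0 + U diag(g * s) V^T:按奇异方向门控的低秩更新(示意)。"""
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        self.register_buffer("w0", weight)                 # 冻结的预训练权重
        delta = 0.01 * torch.randn_like(weight)            # 初始增量(演示用随机矩阵)
        U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
        self.U = nn.Parameter(U[:, :rank])
        self.Vh = nn.Parameter(Vh[:rank, :])
        self.s = nn.Parameter(S[:rank])
        self.gate = nn.Parameter(torch.ones(rank))         # 任务相关性门控(可学习)

    def forward(self, x):
        delta_w = self.U @ torch.diag(self.gate * self.s) @ self.Vh
        return x @ (self.w0 + delta_w).T

layer = SVDGatedAdapter(torch.randn(64, 32), rank=4)
y = layer(torch.randn(5, 32))
print(y.shape)  # torch.Size([5, 64])
```

门控 g 学到接近 0 的分量即对应"与任务无关或含噪的方向",从而被抑制;这与摘要中"根据任务相关性动态激活知识"的描述在机制上是同一类做法。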
[NLP-44] Steering Large Language Models to Evaluate and Amplify Creativity NEURIPS2024
【速读】: 该论文试图解决大型语言模型(LLMs)在判断文本创造性方面的不足,并提出了一种基于模型内部状态差异的解决方案。关键在于通过对比模型在“无聊”和“创造性”提示下的内部状态变化,提取出与人类判断高度一致的创造性度量,并利用这些内部状态差异在推理时增强生成文本的创造性。
链接: https://arxiv.org/abs/2412.06060
作者: Matthew Lyle Olson,Neale Ratzlaff,Musashi Hinck,Shao-yen Tseng,Vasudev Lal
关键词-EN: Large Language Models, Large Language, Language Models, generating creative text, capable of generating
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: (Spotlight) NeurIPS 2024 Workshop on Creativity Generative AI. Authors 1 and 2 contributed equally
点击查看摘要
Abstract:Although capable of generating creative text, Large Language Models (LLMs) are poor judges of what constitutes “creativity”. In this work, we show that we can leverage this knowledge of how to write creatively in order to better judge what is creative. We take a mechanistic approach that extracts differences in the internal states of an LLM when prompted to respond “boringly” or “creatively” to provide a robust measure of creativity that corresponds strongly with human judgment. We also show these internal state differences can be applied to enhance the creativity of generated text at inference time.
zh
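"从'无聊'与'创造性'两种提示下的内部状态差异中提取方向,既用于打分又在推理时加回隐藏状态"这一思路,可用下面的极简示意表达;其中的隐藏状态用随机向量代替,缩放系数 alpha 为假设超参数,并非论文的具体实现。

```python
import numpy as np

rng = np.random.default_rng(42)
hidden_dim = 128

# 假设:分别在 "creatively" 与 "boringly" 提示下收集同一层的隐藏状态(此处用随机数代替)
creative_states = rng.normal(0.5, 1.0, size=(64, hidden_dim))
boring_states   = rng.normal(0.0, 1.0, size=(64, hidden_dim))

# 1) 提取"创造性方向":两组内部状态均值之差
steering_vec = creative_states.mean(axis=0) - boring_states.mean(axis=0)
steering_vec /= np.linalg.norm(steering_vec)

def creativity_score(h):
    """用隐藏状态在创造性方向上的投影作为简单的创造性度量。"""
    return float(h @ steering_vec)

def steer(h, alpha=2.0):
    """2) 推理时增强:把方向按系数 alpha 加回当前隐藏状态。"""
    return h + alpha * steering_vec

h = rng.normal(size=hidden_dim)
print(creativity_score(h), creativity_score(steer(h)))  # 加入方向后得分上升
```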
[NLP-45] 1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval (LeSeR) for Regulatory Question Answering COLING2025
【速读】: 该论文旨在解决监管领域中的信息检索与答案生成问题(Regulatory Information Retrieval and Answer Generation, RIRAG),通过结合先进的检索和生成技术来提升在监管文本中的信息获取和答案生成的准确性。解决方案的关键在于采用了多种嵌入模型(如Stella, BGE, CDE, Mpnet),并通过微调(fine-tuning)和重排序(reranking)技术优化检索结果。特别是,论文提出的LeSeR方法在检索任务中表现出色,达到了0.8201的召回率(recall@10)和0.6655的平均精度(map@10),展示了自然语言处理技术在监管应用中的巨大潜力。
链接: https://arxiv.org/abs/2412.06009
作者: Jebish Purbey,Drishti Sharma,Siddhant Gupta,Khawaja Murad,Siddartha Pullakhandam,Ram Mohan Rao Kadiyala
关键词-EN: leveraging advanced information, advanced information retrieval, Regulatory Information Retrieval, answer generation techniques, RegNLP RIRAG
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 5 pages, Accepted to RegNLP @ COLING 2025
点击查看摘要
Abstract:This paper presents the system description of our entry for the COLING 2025 RegNLP RIRAG (Regulatory Information Retrieval and Answer Generation) challenge, focusing on leveraging advanced information retrieval and answer generation techniques in regulatory domains. We experimented with a combination of embedding models, including Stella, BGE, CDE, and Mpnet, and leveraged fine-tuning and reranking for retrieving relevant documents in top ranks. We utilized a novel approach, LeSeR, which achieved competitive results with a recall@10 of 0.8201 and map@10 of 0.6655 for retrievals. This work highlights the transformative potential of natural language processing techniques in regulatory applications, offering insights into their capabilities for implementing a retrieval augmented generation system while identifying areas for future improvement in robustness and domain adaptation.
zh
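"语义检索 + 词面重排序"的基本流程可以用下面的示意表达:先用向量相似度取 top-k,再按查询与文档的词汇重叠重新排序。这里的"嵌入"用词袋向量代替(真实系统中应使用 Stella / BGE / Mpnet 等句向量模型),打分方式也是简化假设,并非 LeSeR 系统的原始实现。

```python
import numpy as np

docs = [
    "firms must report suspicious transactions to the regulator",
    "capital adequacy requirements for licensed banks",
    "rules on client asset segregation and reporting",
]
query = "reporting obligations for suspicious transactions"

vocab = sorted({w for text in docs + [query] for w in text.lower().split()})

def embed(text):
    # 演示用的词袋向量;实际应替换为稠密句向量模型
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        vec[vocab.index(w)] += 1
    return vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 第一阶段:语义检索(向量相似度)取 top-k
q_vec = embed(query)
semantic = sorted(range(len(docs)), key=lambda i: cosine(q_vec, embed(docs[i])), reverse=True)
top_k = semantic[:3]

# 第二阶段:词面重排序(这里用简单的词重叠率代替 BM25 等打分)
def lexical_overlap(q, d):
    q_set, d_set = set(q.lower().split()), set(d.lower().split())
    return len(q_set & d_set) / len(q_set)

reranked = sorted(top_k, key=lambda i: lexical_overlap(query, docs[i]), reverse=True)
print([docs[i] for i in reranked])
```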
[NLP-46] Does RLHF Scale? Exploring the Impacts From Data Model and Method
【速读】: 该论文试图解决强化学习从人类反馈 (Reinforcement Learning from Human Feedback, RLHF) 在大语言模型 (Large Language Models, LLMs) 中的扩展性问题。研究的关键在于系统分析 RLHF 框架中的关键组件——模型规模、数据组成和推理预算——及其对性能的影响。研究发现,增加数据多样性和数据量能提升奖励模型的性能,从而有助于过程监督模型的扩展;策略训练中,每个提示的响应样本数量在初期能提升性能,但很快达到瓶颈;较大的奖励模型对策略训练的提升有限;而较大的策略模型在固定奖励模型下从 RLHF 中获益较少。总体而言,RLHF 的扩展效率低于预训练,且随着计算资源的增加,收益递减。基于这些观察,论文提出了在计算资源限制下优化 RLHF 性能的策略。
链接: https://arxiv.org/abs/2412.06000
作者: Zhenyu Hou,Pengfan Du,Yilin Niu,Zhengxiao Du,Aohan Zeng,Xiao Liu,Minlie Huang,Hongning Wang,Jie Tang,Yuxiao Dong
关键词-EN: Large Language Models, Human Feedback, Reinforcement Learning, Learning from Human, Large Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This study explores the scaling properties of Reinforcement Learning from Human Feedback (RLHF) in Large Language Models (LLMs). Although RLHF is considered an important step in post-training of LLMs, its scaling potential is still largely unknown. We systematically analyze key components in the RLHF framework–model size, data composition, and inference budget–and their impacts on performance. Our findings show that increasing data diversity and volume improves reward model performance, helping process-supervision models scale better. For policy training, more response samples per prompt boost performance initially but quickly plateau. And larger reward models offer modest gains in policy training. In addition, larger policy models benefit less from RLHF with a fixed reward model. Overall, RLHF scales less efficiently than pretraining, with diminishing returns from additional computational resources. Based on these observations, we propose strategies to optimize RLHF performance within computational limits.
zh
[NLP-47] Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt
【速读】: 该论文试图解决在增强语言模型新能力(如工具使用)时,提示(prompting)和微调(fine-tuning)方法存在的局限性问题。提示方法虽然设置快速,但依赖于在模型提示中提供每个工具使用的显式演示,导致工具使用与任务紧密耦合,限制了泛化能力;而微调方法虽然去除了任务运行时对工具使用演示的需求,但将新能力绑定到单一模型,增加了重复的设置成本。论文提出的解决方案是引入语言钩子(language hooks),这是一种新颖的框架,能够将新能力的增强与模型的任务特定提示和模型本身解耦。关键在于语言钩子算法通过在基础模型的文本生成过程中插入模块化程序的执行,这些程序根据现有文本和可用能力条件触发,能够调用外部工具、辅助语言模型(如使用工具特定提示)并修改现有上下文,从而实现更强的泛化能力和任务适应性。
链接: https://arxiv.org/abs/2412.05967
作者: Damien de Mijolla,Wen Yang,Philippa Duckett,Christopher Frye,Mark Worrall
关键词-EN: augmenting language models, competing paradigms, language models, model, tool usage
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work was conducted during Summer 2023. Experimental results and references reflect the state of the field at that time and may not account for subsequent developments
点击查看摘要
Abstract:Prompting and fine-tuning have emerged as two competing paradigms for augmenting language models with new capabilities, such as the use of tools. Prompting approaches are quick to set up but rely on providing explicit demonstrations of each tool’s usage in the model’s prompt, thus coupling tool use to the task at hand and limiting generalisation. Fine-tuning removes the need for task-specific demonstrations of tool usage at runtime; however, this ties new capabilities to a single model, thus making already-heavier setup costs a recurring expense. In this paper, we introduce language hooks, a novel framework for augmenting language models with new capabilities that is decoupled both from the model’s task-specific prompt and from the model itself. The language hook algorithm interleaves text generation by the base model with the execution of modular programs that trigger conditionally based on the existing text and the available capabilities. Upon triggering, programs may call external tools, auxiliary language models (e.g. using tool specific prompts), and modify the existing context. We benchmark our method against state-of-the-art baselines, find that it outperforms task-aware approaches, and demonstrate its ability to generalise to novel tasks.
zh
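"语言钩子"的核心是把基础模型的逐步生成与条件触发的模块化程序交替执行。下面是一个高度简化的示意:基础模型与工具都用桩函数代替,触发条件用字符串匹配表示,仅用于说明控制流,并非论文框架的真实接口。

```python
import re

def base_model_step(context: str) -> str:
    """桩函数:真实场景中应调用基础语言模型生成下一段文本。"""
    if "CALC(" not in context:
        return " The total cost is CALC(12*7) dollars."
    return " Thanks for asking."

def calculator_hook(context: str) -> str | None:
    """钩子程序:检测到 CALC(...) 模式时触发,调用外部工具并改写上下文。"""
    match = re.search(r"CALC\(([^)]+)\)", context)
    if match is None:
        return None                                          # 条件不满足,不触发
    result = eval(match.group(1), {"__builtins__": {}})      # 演示用;生产环境勿用 eval
    return context.replace(match.group(0), str(result))

def generate_with_hooks(prompt: str, hooks, max_steps: int = 2) -> str:
    context = prompt
    for _ in range(max_steps):
        context += base_model_step(context)     # 基础模型生成
        for hook in hooks:                      # 依条件触发的模块化程序
            updated = hook(context)
            if updated is not None:
                context = updated               # 钩子可修改已有上下文
    return context

print(generate_with_hooks("Q: price of 7 items at $12 each? A:", [calculator_hook]))
```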
[NLP-48] A Cross-Validation Study of Turkish Sentiment Analysis Datasets and Tools
【速读】: 该论文试图解决土耳其语情感分析研究中数据集有限和多样性不足的问题。解决方案的关键在于对2012年至2022年间发表的31篇研究文章进行严格审查,收集并标注了23个土耳其语情感分析数据集,并使用分类法(taxonomy)对其进行分类。此外,论文还对这些数据集应用了最先进的情感分析工具,分析了其在不同数据集上的性能表现,发现工具性能显著依赖于目标文本的特征。通过这些步骤,论文促进了土耳其语情感分析的更细致理解。
链接: https://arxiv.org/abs/2412.05964
作者: Şevval Çakıcı,Dilara Karaduman,Mehmet Akif Çırlan,Ali Hürriyetoğlu
关键词-EN: gained increasing significance, increasing significance, prompting researchers, Turkish, gained increasing
类目: Computation and Language (cs.CL)
备注: 16 pages, 4 tables, no figures. Preprint version. To be submitted to the Language Resources and Evaluation journal
点击查看摘要
Abstract:In recent years, sentiment analysis has gained increasing significance, prompting researchers to explore datasets in various languages, including Turkish. However, the limited availability of Turkish datasets has led to their multifaceted usage in different studies, yielding diverse outcomes. To overcome this challenge, a rigorous review was conducted of research articles published between 2012 and 2022. 31 studies were listed, and 23 Turkish datasets obtained from publicly available sources and email requests used in these studies were collected. We labeled these 31 studies using a taxonomy. We provide a map of sentiment analysis datasets according to this taxonomy in Turkish over 10 years. Moreover, we run state-of-the-art sentiment analysis tools on these datasets and analyzed performance across popular Turkish sentiment datasets. We observed that the performance of the sentiment analysis tools significantly depends on the characteristics of the target text. Our study fosters a more nuanced understanding of sentiment analysis in the Turkish language.
zh
[NLP-49] Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
【速读】: 该论文试图解决多模态大语言模型(MLLMs)在视觉-语言任务中仅依赖粗粒度概念标注(如图像描述)的局限性问题。解决方案的关键在于引入细粒度概念标注(如对象标签和对象区域),并通过构建新的多模态多粒度概念标注数据集(MMGiC)来实现。该数据集整合了不同粒度的概念标注,通过结构化模板和通用MLLM框架,使得模型能够在多个粒度上对齐视觉和语言信息,从而提升模型在概念定位和学习上的能力。实验结果表明,MMGiC与图像描述数据的适当结合在多个基准测试中显著提升了性能,验证了细粒度标注的有效性。
链接: https://arxiv.org/abs/2412.05939
作者: Xiao Xu,Tianhao Niu,Yuxi Xie,Libo Qin,Wanxiang Che,Min-Yen Kan
关键词-EN: Multimodal Large Language, Large Language Models, coarse-grained concept annotations, concept annotations, Multimodal Large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: A manuscript that should have been Arxived in May :)
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) excel in vision–language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We clearly explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image–caption data on 12 multimodal comprehension and generation benchmarks, e.g., their appropriate combination achieve 3.95% and 2.34% absolute improvements over image–caption data alone on POPE and SEED-Bench. Code, data and models will be available at this https URL.
zh
[NLP-50] Paraphrase-Aligned Machine Translation
【速读】: 该论文试图解决大语言模型(LLMs)在机器翻译中生成的输出可能偏离目标语言母语者常用表达的问题,这一问题通常源于语言系统间句子结构的差异。解决方案的关键是提出了ParaAlign Translator方法,通过微调LLMs以重述(paraphrase)句子,使其结构与目标语言系统对齐,从而提升后续翻译的性能。实验结果表明,该方法在资源丰富和低资源场景下均提升了LLaMA-3-8B模型的表现,并达到了与更大规模的LLaMA-3-70B模型相当或更优的效果。
链接: https://arxiv.org/abs/2412.05916
作者: Ke-Ching Chang,Chung-Chi Chen,An-Zi Yen
关键词-EN: demonstrated significant capabilities, Large Language Models, Large Language, demonstrated significant, significant capabilities
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated significant capabilities in machine translation. However, their translation quality is sometimes questioned, as the generated outputs may deviate from expressions typically used by native speakers. These deviations often arise from differences in sentence structure between language systems. To address this issue, we propose ParaAlign Translator, a method that fine-tunes LLMs to paraphrase sentences, aligning their structures with those of the target language systems. This approach improves the performance of subsequent translations. Experimental results demonstrate that the proposed method enhances the LLaMA-3-8B model’s performance in both resource-rich and low-resource scenarios and achieves parity with or surpassing the much larger LLaMA-3-70B model.
zh
[NLP-51] XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
【速读】: 该论文试图解决生成式大型语言模型(LLM)在长上下文推理任务中由于KV-Cache框架导致的内存需求快速增长问题。解决方案的关键在于通过量化各层缓存数据对推理精度的不同影响,提出了一种个性化的缓存大小分配策略。具体来说,论文将缓存分配问题建模为组合优化问题,并通过实验和理论验证了这种个性化分配能够显著减少内存消耗(平均减少61.6%),同时保持相当的推理精度,从而提高计算效率(提升2.1倍)和吞吐量(最高提升5.5倍)。
链接: https://arxiv.org/abs/2412.05896
作者: Weizhuo Li,Zhigang Wang,Yu Gu,Ge Yu
关键词-EN: generative Large Language, Large Language Model, Large Language, achieved remarkable success, Recently the generative
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache framework makes a compromise between time and space complexities. However, caching data generates the increasingly growing memory demand, that can quickly exhaust the limited memory capacity of the modern accelerator like GPUs, particularly in long-context inference tasks. Existing studies reduce memory consumption by evicting some of cached data that have less important impact on inference accuracy. But the benefit in practice is far from ideal due to the static cache allocation across different LLM network layers. This paper observes that the layer-specific cached data have very different impacts on accuracy. We quantify this difference, and give experimental and theoretical validation. We accordingly make a formal analysis and shows that customizing the cache size for each layer in a personalized manner can yield a significant memory reduction, while still providing comparable accuracy. We simulate the cache allocation as a combinatorial optimization problem and give a global optimal solution. In particular, we devise a mini- and sampling-based inference over a lightweight variant of the LLM model, so as to quickly capture the difference and then feed it into the personalized algorithms. Extensive experiments on real-world datasets demonstrate that our proposals can reduce KV cache memory consumption by 61.6% on average, improve computational efficiency by 2.1x and then increase the throughput by up to 5.5x.
zh
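"按层敏感度个性化分配缓存大小"可以抽象为一个带总内存约束的配额分配问题。下面是一个示意:假设已经通过轻量采样推理测得每层的重要性分数,然后按比例分配缓存预算并保证每层的最小保留量;分数与约束均为虚构数据,并非 XKV 的原始求解方法。

```python
import numpy as np

def allocate_kv_budget(layer_scores, total_budget, min_per_layer=64):
    """按层重要性比例分配 KV cache 条目数,并保证每层最小预算。"""
    scores = np.asarray(layer_scores, dtype=float)
    n = len(scores)
    base = np.full(n, min_per_layer)                          # 先满足最小保留量
    remaining = total_budget - base.sum()
    assert remaining >= 0, "total budget too small"
    extra = np.floor(remaining * scores / scores.sum()).astype(int)
    alloc = base + extra
    alloc[np.argmax(scores)] += total_budget - alloc.sum()    # 取整余数分给最重要的层
    return alloc

# 假设:8 层模型,重要性分数来自各层缓存对推理精度影响的估计
layer_scores = [0.9, 0.7, 0.4, 0.2, 0.15, 0.1, 0.1, 0.05]
alloc = allocate_kv_budget(layer_scores, total_budget=4096)
print(alloc, alloc.sum())   # 每层缓存条目数,总和等于预算
```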
[NLP-52] Automated Extraction and Creation of FBS Design Reasoning Knowledge Graphs from Structured Data in Product Catalogues Lacking Contextual Information
【速读】: 该论文试图解决从传统结构化数据(如规格表和产品目录)中自动提取知识并构建基于本体(Ontology)的知识图谱(KG)的问题。解决方案的关键在于开发了一种基于规则的技术和数字工作流程,用于从这些结构化数据中提取上下文信息,并基于上下文分类规则生成功能-行为-结构(FBS)本体。该解决方案包括两个主要组件:一是用于推导上下文和基于上下文的分类规则的过程,二是用于填充和检索FBS本体知识图谱的工作流程。通过结合知识图谱和自然语言处理(NLP)技术,实现了知识的自动化提取、表示和检索。
链接: https://arxiv.org/abs/2412.05868
作者: Vijayalaxmi Sahadevan,Sushil Mario,Yash Jaiswal,Divyanshu Bajpai,Vishal Singh,Hiralal Aggarwal,Suhas Suresh,Manjunath Maigur
关键词-EN: decision making scenarios, effective knowledge management, Ontology-based knowledge graphs, making scenarios, desirable for effective
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 31 pages, with 17 figures and 10 tables
点击查看摘要
Abstract:Ontology-based knowledge graphs (KG) are desirable for effective knowledge management and reuse in various decision making scenarios, including design. Creating and populating extensive KG based on specific ontological models can be highly labour- and time-intensive unless automated processes are developed for knowledge extraction and graph creation. Most research and development on automated extraction and creation of KG is based on extensive unstructured data sets that provide contextual information. However, some of the most useful information about the products and services of a company has traditionally been recorded as structured data. Such structured data sets rarely follow a standard ontology, do not capture explicit mapping of relationships between the entities, and provide no contextual information. Therefore, this research reports a method and digital workflow developed to address this gap. The developed method and workflow employ rule-based techniques to extract and create a Function-Behaviour-Structure (FBS) ontology-based KG from legacy structured data, especially specification sheets and product catalogues. The solution approach consists of two main components: a process for deriving context and context-based classification rules for FBS ontology concepts and a workflow for populating and retrieving the FBS ontology-based KG. KG and Natural Language Processing (NLP) are used to automate knowledge extraction, representation, and retrieval. The workflow's effectiveness is demonstrated via pilot implementation in an industrial context. Insights gained from the pilot study are reported regarding the challenges and opportunities, including discussing the FBS ontology and concepts.
zh
[NLP-53] Domain-Specific Translation with Open-Source Large Language Models : Resource-Oriented Analysis
【速读】: 该论文试图解决在特定领域(如医疗领域)中,开源的自回归解码器大型语言模型(LLMs)与面向任务的机器翻译(MT)模型在翻译质量上的差异问题。解决方案的关键在于,尽管通过微调LLMs(如Mistral和Llama)可以提升其在医疗翻译任务中的表现,但专门设计的编码器-解码器MT模型(如NLLB-200)在大多数情况下仍表现出更高的翻译质量,尤其是在中低资源语言对中。因此,论文强调了在特定领域中,继续开发和使用专门的MT模型的重要性,并建议通过预训练领域特定的中等规模语言模型来提高翻译任务的质量和效率。
链接: https://arxiv.org/abs/2412.05862
作者: Aman Kassahun Wassie,Mahdi Molaei,Yasmin Moslem
关键词-EN: open-source autoregressive decoder-only, autoregressive decoder-only large, decoder-only large language, task-oriented machine translation, open-source autoregressive
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this work, we compare the domain-specific translation performance of open-source autoregressive decoder-only large language models (LLMs) with task-oriented machine translation (MT) models. Our experiments focus on the medical domain and cover four language pairs with varied resource availability: English-to-French, English-to-Portuguese, English-to-Swahili, and Swahili-to-English. Despite recent advancements, LLMs exhibit a clear gap in specialized translation quality compared to multilingual encoder-decoder MT models such as NLLB-200. In three out of four language directions in our study, NLLB-200 3.3B outperforms all LLMs in the size range of 8B parameters in medical translation. While fine-tuning LLMs such as Mistral and Llama improves their performance at medical translation, these models still fall short compared to fine-tuned NLLB-200 3.3B models. Our findings highlight the ongoing need for specialized MT models to achieve higher-quality domain-specific translation, especially in medium-resource and low-resource settings. As larger LLMs outperform their 8B variants, this also encourages pre-training domain-specific medium-sized LMs to improve quality and efficiency in specialized translation tasks.
zh
[NLP-54] Depression detection from Social Media Bangla Text Using Recurrent Neural Networks
【速读】: 该论文试图解决从社交媒体文本中识别抑郁情绪的问题,这是导致心理健康问题的根源之一,并且与自杀事件密切相关。解决方案的关键在于应用自然语言处理技术对Facebook文本进行情感分析,特别是针对抑郁情绪的检测。通过预处理步骤(如词干提取、停用词移除)和特征提取技术(如文体特征、TF-IDF、词嵌入)对收集的983条社交媒体文本进行处理,并使用LSTM、GRU、支持向量机和朴素贝叶斯分类器进行分类预测。最终通过F1-score和准确率等主要分类指标评估模型性能,旨在帮助心理学家通过分析社交媒体上的情感表达,减少抑郁个体的负面行为,从而实现诊断和治疗。
链接: https://arxiv.org/abs/2412.05861
作者: Sultan Ahmed,Salman Rakin,Mohammad Washeef Ibn Waliur,Nuzhat Binte Islam,Billal Hossain,Md. Mostofa Akbar
关键词-EN: Emotion artificial intelligence, social media posts, social media, artificial intelligence, field of study
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Initial version with Bangla text. arXiv admin note: substantial text overlap with arXiv:2411.04542
点击查看摘要
Abstract:Emotion artificial intelligence is a field of study that focuses on figuring out how to recognize emotions, especially in the area of text mining. Today is the age of social media which has opened a door for us to share our individual expressions, emotions, and perspectives on any event. We can analyze sentiment on social media posts to detect positive, negative, or emotional behavior toward society. One of the key challenges in sentiment analysis is to identify depressed text from social media text that is a root cause of mental ill-health. Furthermore, depression leads to severe impairment in day-to-day living and is a major source of suicide incidents. In this paper, we apply natural language processing techniques on Facebook texts for conducting emotion analysis focusing on depression using multiple machine learning algorithms. Preprocessing steps like stemming, stop word removal, etc. are used to clean the collected data, and feature extraction techniques like stylometric feature, TF-IDF, word embedding, etc. are applied to the collected dataset which consists of 983 texts collected from social media posts. In the process of class prediction, LSTM, GRU, support vector machine, and Naive-Bayes classifiers have been used. We have presented the results using the primary classification metrics including F1-score, and accuracy. This work focuses on depression detection from social media posts to help psychologists to analyze sentiment from shared posts which may reduce the undesirable behaviors of depressed individuals through diagnosis and treatment.
zh
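论文使用的"TF-IDF 特征 + 传统分类器"流程非常标准,下面用 scikit-learn 给出一个最小可运行示意;示例文本为英文占位数据(论文中的真实数据为孟加拉语社交媒体文本),标签与样本均为虚构。

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 虚构的占位样本:1 = 抑郁倾向文本, 0 = 普通文本(真实数据为孟加拉语 Facebook 帖子)
texts = [
    "i feel empty and hopeless every day",
    "nothing matters anymore, so tired of everything",
    "had a great dinner with friends tonight",
    "excited about the football match this weekend",
]
labels = [1, 1, 0, 0]

# TF-IDF 特征(含 bigram)+ 朴素贝叶斯分类器
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["i am so tired and hopeless"]))   # 在这组玩具数据上预测为 [1]
```

将分类器替换为 SVM、或把 TF-IDF 特征换成词嵌入后接 LSTM/GRU,即对应论文比较的其余几种模型。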
[NLP-55] Cooperative SQL Generation for Segmented Databases By Using Multi-functional LLM Agents
【速读】: 该论文试图解决文本到SQL任务(Text-to-SQL task),即根据用户文本问题自动生成SQL查询。解决方案的关键在于提出了一个基于多功能代理(Multi-functional Agents)的协同SQL生成框架(CSMA),通过大型语言模型(LLM)代理之间的信息交互来实现。CSMA框架包括三个阶段:问题相关模式收集、问题对应SQL查询生成和SQL查询正确性检查。通过代理之间的协作,每个代理仅使用其掌握的部分数据库模式信息来生成和验证SQL查询,从而在保持代理私有数据的同时,实现了高效的SQL生成。实验结果表明,CSMA在Spider和Bird基准测试中达到了与最先进方法相当的高性能水平。
链接: https://arxiv.org/abs/2412.05850
作者: Zhiguang Wu,Fengbin Zhu,Xuequn Shang,Yupei Zhang,Pan Zhou
关键词-EN: automatically yield SQL, yield SQL queries, Cooperative SQL Generation, SQL query, user text questions
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The Text-to-SQL task aims to automatically yield SQL queries according to user text questions. To address this problem, we propose a Cooperative SQL Generation framework based on Multi-functional Agents (CSMA) through information interaction among large language model (LLM) based agents who each own part of the database schema separately. Inspired by the collaboration in human teamwork, CSMA consists of three stages: 1) Question-related schema collection, 2) Question-corresponding SQL query generation, and 3) SQL query correctness check. In the first stage, agents analyze their respective schema and communicate with each other to collect the schema information relevant to the question. In the second stage, agents try to generate the corresponding SQL query for the question using the collected information. In the third stage, agents check if the SQL query is created correctly according to their known information. This interaction-based method allows the question-relevant part of the database schema from each agent to be used for SQL generation and checking. Experiments on the Spider and Bird benchmarks demonstrate that CSMA achieves a high performance level comparable to the state of the art, while keeping the private data within these individual agents.
zh
[NLP-56] Are Clinical T5 Models Better for Clinical Text? ML4H
【速读】: 该论文试图解决临床领域中基于Transformer的编码器/解码器架构(如T5模型)在不同临床任务和领域中的表现问题。关键在于评估临床T5模型与经过FLAN调优的通用T5模型相比,是否在临床任务中表现更优,以及它们在新临床领域中的泛化能力。研究结果表明,临床T5模型在现有模型基础上仅提供了边际改进,并且在不同领域中的表现较差,这为未来开发临床大语言模型(LLMs)提供了重要参考。
链接: https://arxiv.org/abs/2412.05845
作者: Yahan Li,Keith Harrigian,Ayah Zirikly,Mark Dredze
关键词-EN: Large language models, Large language, decoder architecture, transformer-based encoder, models
类目: Computation and Language (cs.CL)
备注: Proceedings of Machine Learning for Health (ML4H) Symposium 2024, December 15th, 2024, Vancouver, Canada, 32 pages
点击查看摘要
Abstract:Large language models with a transformer-based encoder/decoder architecture, such as T5, have become standard platforms for supervised tasks. To bring these technologies to the clinical domain, recent work has trained new or adapted existing models to clinical data. However, the evaluation of these clinical T5 models and comparison to other models has been limited. Are the clinical T5 models better choices than FLAN-tuned generic T5 models? Do they generalize better to new clinical domains that differ from the training sets? We comprehensively evaluate these models across several clinical tasks and domains. We find that clinical T5 models provide marginal improvements over existing models, and perform worse when evaluated on different domains. Our results inform future choices in developing clinical LLMs.
zh
[NLP-57] A Self-Learning Multimodal Approach for Fake News Detection
【速读】: 该论文试图解决社交媒体中虚假新闻的检测问题,特别是在缺乏标注数据的情况下。解决方案的关键在于引入了一种自学习的跨模态模型,该模型结合了对比学习(contrastive learning)和大型语言模型(Large Language Models, LLMs)的优势,能够在无需标注数据的情况下提取特征,并同时分析文本和图像信息。通过对比学习,模型能够有效提取特征,而LLMs则利用其处理多样化语言数据的能力,提升了跨模态分析的效果。实验结果表明,该模型在公共数据集上表现优异,准确率、精确率、召回率和F1分数均超过85%,显著提升了跨模态虚假新闻检测的性能。
链接: https://arxiv.org/abs/2412.05843
作者: Hao Chen,Hui Guo,Baochen Hu,Shu Hu,Jinrong Hu,Siwei Lyu,Xi Wu,Xin Wang
关键词-EN: online news content, false information, rapid growth, growth of social, social media
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The rapid growth of social media has resulted in an explosion of online news content, leading to a significant increase in the spread of misleading or false information. While machine learning techniques have been widely applied to detect fake news, the scarcity of labeled datasets remains a critical challenge. Misinformation frequently appears as paired text and images, where a news article or headline is accompanied by related visuals. In this paper, we introduce a self-learning multimodal model for fake news classification. The model leverages contrastive learning, a robust method for feature extraction that operates without requiring labeled data, and integrates the strengths of Large Language Models (LLMs) to jointly analyze both text and image features. LLMs excel at this task due to their ability to process diverse linguistic data drawn from extensive training corpora. Our experimental results on a public dataset demonstrate that the proposed model outperforms several state-of-the-art classification approaches, achieving over 85% accuracy, precision, recall, and F1-score. These findings highlight the model's effectiveness in tackling the challenges of multimodal fake news detection.
zh
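作为补充,下面给出图文对比学习中常用的对称式 InfoNCE 损失的最小 PyTorch 示意,仅用于说明"无需标签即可对齐文本与图像特征"这一思路;批大小、特征维度与温度系数均为假设值,不代表该论文的具体模型结构。

```python
import torch
import torch.nn.functional as F

def clip_style_infonce(text_emb: torch.Tensor, image_emb: torch.Tensor, temperature: float = 0.07):
    """对称式 InfoNCE:同一条新闻的文本与配图视为正样本对,批内其余样本为负样本。"""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature        # [B, B] 相似度矩阵
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2i = F.cross_entropy(logits, targets)            # 文本 -> 图像方向
    loss_i2t = F.cross_entropy(logits.t(), targets)        # 图像 -> 文本方向
    return (loss_t2i + loss_i2t) / 2

# 用法示意:假设批大小为 8、特征维度为 256(随机张量代替真实编码器输出)
loss = clip_style_infonce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```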
[NLP-58] An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism
【速读】: 该论文试图解决多模态多跳问答(multimodal multi-hop question answering)中的两个主要问题:1) 检索到的证据包含大量冗余信息,导致无关信息误导预测,从而显著降低性能;2) 推理过程缺乏可解释的推理步骤,使得模型难以发现处理复杂问题时的逻辑错误。解决方案的关键在于提出了一种基于大规模语言模型(LLMs)但不完全依赖它们的统一方法,将多模态多跳问答视为联合蕴涵树生成和问答问题。具体而言,设计了一个多任务学习框架,通过专家混合(mixture of experts)防止任务特定错误相互干扰,并促进可解释性和预测任务之间的共同知识共享。此外,通过迭代反馈机制,将联合训练的结果反馈给LLM以重新生成蕴涵树,从而迭代优化潜在答案。
链接: https://arxiv.org/abs/2412.05821
作者: Qing Zhang,Haocheng Lv,Jie Liu,Zhiyun Chen,Jianyong Duan,Hao Wang,Li He,Mingying Xv
关键词-EN: multi-hop question answering, multimodal multi-hop question, large-scale language models, question answering, convert multimodal information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages
点击查看摘要
Abstract:With the rise of large-scale language models (LLMs), it is currently popular and effective to convert multimodal information into text descriptions for multimodal multi-hop question answering. However, we argue that current methods of multimodal multi-hop question answering still mainly face two challenges: 1) The retrieved evidence contains a large amount of redundant information, which inevitably leads to a significant drop in performance as irrelevant information misleads the prediction. 2) A reasoning process without interpretable reasoning steps makes it difficult for the model to discover logical errors when handling complex questions. To solve these problems, we propose a unified LLM-based approach that does not rely heavily on the LLMs themselves, given their potential errors, and innovatively treat multimodal multi-hop question answering as a joint entailment tree generation and question answering problem. Specifically, we design a multi-task learning framework with a focus on facilitating common knowledge sharing across interpretability and prediction tasks while preventing task-specific errors from interfering with each other via a mixture of experts. Afterward, we design an iterative feedback mechanism to further enhance both tasks by feeding back the results of the joint training to the LLM for regenerating entailment trees, aiming to iteratively refine the potential answer. Notably, our method has won first place on the official leaderboard of WebQA (since April 10, 2024) and achieves competitive results on MultimodalQA.
zh
[NLP-59] SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
【速读】: 该论文试图解决大型多模态模型 (Large Multimodal Models, LMMs) 在组合式 (compositional) 场景下文本与图像对齐的挑战。现有方法依赖于提示工程(prompt engineering)、昂贵的人工标注和持续升级,限制了灵活性和可扩展性。论文提出的解决方案是一个模型无关的迭代自改进框架(SILMM),通过直接偏好优化(Direct Preference Optimization, DPO)实现自反馈和文本-图像对齐的优化。对于使用离散视觉标记的LMMs,DPO可以直接应用;而对于使用连续视觉特征的LMMs,论文提出了多样性机制和基于核的连续DPO来适应这一框架。实验结果表明,SILMM在多个文本到图像生成基准上显著提升了性能,验证了其有效性和优越性。
链接: https://arxiv.org/abs/2412.05818
作者: Leigang Qu,Haochuan Li,Wenjie Wang,Xiang Liu,Juncheng Li,Liqiang Nie,Tat-Seng Chua
关键词-EN: Large Multimodal Models, Large Multimodal, Multimodal Models, pushing forward advancements, demonstrated impressive capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: project page: this https URL
点击查看摘要
Abstract:Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual upgrading, limiting flexibility and scalability. In this work, we introduce a model-agnostic iterative self-improvement framework (SILMM) that can enable LMMs to provide helpful and scalable self-feedback and optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations, while it is less suitable for LMMs with continuous visual features, as obtaining generation probabilities is challenging. To adapt SILMM to LMMs with continuous features, we propose a diversity mechanism to obtain diverse representations and a kernel-based continuous DPO for alignment. Extensive experiments on three compositional text-to-image generation benchmarks validate the effectiveness and superiority of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20% on DPG-Bench.
zh
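下面是直接偏好优化(DPO)损失的最小 PyTorch 示意,用于说明 SILMM 中"以自反馈构造偏好对并优化文本-图像对齐"所依赖的核心计算。输入为策略模型与参考模型在获胜/落败样本上的对数似然,beta 等超参数为假设值,并非论文的具体实现。

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta: float = 0.1):
    """DPO:最大化策略模型相对参考模型在偏好样本上的对数几率差。
    四个输入均为形状 [B] 的对数似然(整条回复/图像 token 序列的对数概率之和)。"""
    pi_ratio = policy_logp_win - policy_logp_lose        # 策略模型的偏好差
    ref_ratio = ref_logp_win - ref_logp_lose             # 参考模型的偏好差
    return -F.logsigmoid(beta * (pi_ratio - ref_ratio)).mean()

# 用法示意(随机数代替真实对数似然,批大小 4)
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(loss.item())
```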
[NLP-60] Speech Is Not Enough: Interpreting Nonverbal Indicators of Common Knowledge and Engagement AAAI2025
【速读】: 该论文旨在开发一种能够支持群体问题解决和社会动态的AI伙伴。其关键解决方案在于利用多模态分析(multimodal analytics)技术,识别和追踪群体成员的非语言互动,结合其语言参与,形成对协作和参与度的全面理解,从而为AI伙伴提供必要的上下文信息。该研究在课堂环境中展示了其在检测和追踪学生任务导向互动中的非语言行为方面的当前能力,并探讨了这对追踪共识和参与度的影响。
链接: https://arxiv.org/abs/2412.05797
作者: Derek Palmer,Yifan Zhu,Kenneth Lai,Hannah VanderHoeven,Mariah Bradford,Ibrahim Khebour,Carlos Mabrey,Jack Fitzgerald,Nikhil Krishnaswamy,Martha Palmer,James Pustejovsky
关键词-EN: group problem solving, social dynamics, problem solving, solving and social, group problem
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 3 pages, 2 figures, appearing at AAAI 2025 Demos Track
点击查看摘要
Abstract:Our goal is to develop an AI Partner that can provide support for group problem solving and social dynamics. In multi-party working group environments, multimodal analytics is crucial for identifying non-verbal interactions of group members. In conjunction with their verbal participation, this creates a holistic understanding of collaboration and engagement that provides necessary context for the AI Partner. In this demo, we illustrate our present capabilities at detecting and tracking nonverbal behavior in student task-oriented interactions in the classroom, and the implications for tracking common ground and engagement.
zh
[NLP-61] Uncovering Uncertainty in Transformer Inference NEURIPS2024
【速读】: 该论文试图解决的问题是理解基于transformer的语言模型中,模型潜在表示如何逐步精炼,以及在正确和错误生成之间是否存在可观察的差异。解决方案的关键在于验证迭代推理假设(Iterative Inference Hypothesis, IIH),即残差流中的第n个token嵌入遵循损失递减的轨迹,并通过交叉熵检测嵌入收敛速度以反映生成过程中的不确定性,从而区分正确与错误的token生成。
链接: https://arxiv.org/abs/2412.05768
作者: Greyson Brothers,Willa Mannering,Amber Tien,John Winder
关键词-EN: Iterative Inference Hypothesis, transformer-based language models, Inference Hypothesis, Iterative Inference, model latent representations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted poster at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Workshop on Foundation Model Interventions
点击查看摘要
Abstract:We explore the Iterative Inference Hypothesis (IIH) within the context of transformer-based language models, aiming to understand how a model’s latent representations are progressively refined and whether observable differences are present between correct and incorrect generations. Our findings provide empirical support for the IIH, showing that the nth token embedding in the residual stream follows a trajectory of decreasing loss. Additionally, we observe that the rate at which residual embeddings converge to a stable output representation reflects uncertainty in the token generation process. Finally, we introduce a method utilizing cross-entropy to detect this uncertainty and demonstrate its potential to distinguish between correct and incorrect token generations on a dataset of idioms.
zh
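下面给出"用交叉熵衡量残差流中间表示收敛程度"这一思路的示意性实现(常被称作 logit lens 的做法):把各层隐藏状态经最终的 LayerNorm 与 LM head 投影成词表分布,再与最后一层输出分布计算交叉熵,差距越大可视作该层"尚未收敛"的不确定性越高。模型名(GPT-2)与层的选取均为假设,并非论文的完整方法。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # 假设使用 GPT-2,仅作示意
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "The capital of France is"
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)
    hidden_states = out.hidden_states                       # (层数+1) 个 [1, seq, d] 张量
    final_probs = torch.softmax(out.logits[0, -1], dim=-1)  # 最后位置的最终输出分布

    # 把每一层最后位置的隐藏状态投影到词表空间(最后一个元素即最终输出对应的表示,略过)
    for layer_idx, h in enumerate(hidden_states[:-1]):
        h_last = model.transformer.ln_f(h[0, -1])           # 复用模型最终的 LayerNorm
        layer_logprobs = torch.log_softmax(model.lm_head(h_last), dim=-1)
        # 交叉熵 H(final, layer):该层分布与最终分布的差距
        ce = -(final_probs * layer_logprobs).sum().item()
        print(f"layer {layer_idx:2d}  cross-entropy to final output: {ce:.3f}")
```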
[NLP-62] A Comparative Study on Code Generation with Transformers
【速读】: 该论文试图解决在自然语言处理(NLP)广泛应用的背景下,如何通过基于Transformer架构的自动化系统替代传统手动编码技术的问题。解决方案的关键在于通过对比研究评估不同复杂度的Transformer模型在生成C++源代码方面的鲁棒性,涵盖从基础算术到复杂计算的多样化问题集。
链接: https://arxiv.org/abs/2412.05749
作者: Namrata Das,Rakshya Panta,Neelam Karki,Ruchi Manandhar,Dinesh Baniya Kshatri
关键词-EN: Natural Language Processing, generating solutions autonomously, supplant traditional manual, traditional manual coding, manual coding techniques
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In an era of widespread influence of Natural Language Processing (NLP), there have been multiple research efforts to supplant traditional manual coding techniques with automated systems capable of generating solutions autonomously. With rapid research on code generation and a sole focus on large language models, there emerges a need to compare and evaluate the performance of transformer architectures across several model complexities. This paper presents "A Comparative Study on Code Generation with Transformers," which applies Transformer-based models and NLP methodologies to automatically generate C++ source code for different varieties of problems. A comparative study is performed to evaluate the robustness of transformer-based models on the basis of their architectural complexity and their capability to handle diverse problem sets, from basic arithmetic to complex computations.
zh
[NLP-63] PromptRefine: Enhancing Few-Shot Performance on Low-Resource Indic Languages with Example Selection from Related Example Banks
【速读】: 该论文试图解决在低资源印度语言(Indic languages)中,由于真实数据稀缺导致的选择最优少样本示例(few-shot demonstrations)的难题。解决方案的关键是提出了PromptRefine,一种新颖的交替最小化方法(Alternating Minimization approach),用于示例选择。PromptRefine通过利用相关高资源印度语言的辅助示例库,并采用多任务学习技术来对齐语言特定的检索器,从而实现有效的跨语言检索。此外,PromptRefine还通过引入多样性来增强泛化能力并减少偏差。
链接: https://arxiv.org/abs/2412.05710
作者: Soumya Suvra Ghosal,Soumyabrata Pal,Koyel Mukherjee,Dinesh Manocha
关键词-EN: Large Language Models, recently demonstrated impressive, demonstrated impressive few-shot, Large Language, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have recently demonstrated impressive few-shot learning capabilities through in-context learning (ICL). However, ICL performance is highly dependent on the choice of few-shot demonstrations, making the selection of the most optimal examples a persistent research challenge. This issue is further amplified in low-resource Indic languages, where the scarcity of ground-truth data complicates the selection process. In this work, we propose PromptRefine, a novel Alternating Minimization approach for example selection that improves ICL performance on low-resource Indic languages. PromptRefine leverages auxiliary example banks from related high-resource Indic languages and employs multi-task learning techniques to align language-specific retrievers, enabling effective cross-language retrieval. Additionally, we incorporate diversity in the selected examples to enhance generalization and reduce bias. Through comprehensive evaluations on four text generation tasks (Cross-Lingual Question Answering, Multilingual Question Answering, Machine Translation, and Cross-Lingual Summarization) using state-of-the-art LLMs such as LLAMA-3.1-8B, LLAMA-2-7B, Qwen-2-7B, and Qwen-2.5-7B, we demonstrate that PromptRefine significantly outperforms existing frameworks for retrieving examples.
zh
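PromptRefine 的核心是交替最小化与跨语言检索器对齐,细节较多;下面仅给出一个"相关性 + 多样性"示例选择的通用示意(MMR 式贪心),帮助理解"检索少样本示例时引入多样性以增强泛化"这一思想,并非 PromptRefine 本身的算法;嵌入维度、候选数量与权重系数均为假设值。

```python
import numpy as np

def select_diverse_examples(query_emb, example_embs, k=4, lambda_relevance=0.7):
    """MMR 式贪心:在与查询相关的前提下,惩罚与已选示例过于相似的候选。
    query_emb: [d];example_embs: [N, d];返回选中示例的下标列表。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    relevance = [cos(query_emb, e) for e in example_embs]
    selected, candidates = [], list(range(len(example_embs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # 与已选示例的最大相似度作为多样性惩罚项
            penalty = max((cos(example_embs[i], example_embs[j]) for j in selected), default=0.0)
            return lambda_relevance * relevance[i] - (1 - lambda_relevance) * penalty
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# 用法示意:64 个候选示例、嵌入维度 128,选出 4 个
rng = np.random.default_rng(0)
idx = select_diverse_examples(rng.normal(size=128), rng.normal(size=(64, 128)), k=4)
print(idx)
```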
[NLP-64] On the effective transfer of knowledge from English to Hindi Wikipedia COLING
【速读】: 该论文试图解决维基百科中高资源语言(HRLs)和低资源语言(LRLs)之间内容质量不均衡的问题,特别是英语和印地语之间的知识差距。解决方案的关键在于提出一个轻量级框架,通过利用大型语言模型的上下文学习能力,从外部资源(如英语书籍)提取相关信息,并将其适配为符合维基百科的中立观点(NPOV)政策和风格。如果英语维基百科页面不完整或过时,该框架会生成新内容并机器翻译为印地语;如果英语页面内容全面且最新,则直接将知识从英语转移到印地语。该框架通过自动和人工评估,显著提升了印地语维基百科文章的内容质量,分别提高了65%和62%。
链接: https://arxiv.org/abs/2412.05708
作者: Paramita Das,Amartya Roy,Ritabrata Chakraborty,Animesh Mukherjee
关键词-EN: largest multilingual encyclopedia, remains inherently incomplete, multilingual encyclopedia, inherently incomplete, English
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: accepted at COLING Industry Track 2025
点击查看摘要
Abstract:Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. In case the English Wikipedia page is not up-to-date, our framework extracts relevant information from external resources readily available (such as English books) and adapts it to align with Wikipedia's distinctive style, including its neutral point of view (NPOV) policy, using in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles respectively by 65% and 62% according to automatic and human judgment-based evaluations.
zh
[NLP-65] Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression
【速读】: 该论文试图解决在GPU内存有限且输入上下文长度超过生成长度的情况下,如何提高推理效率的问题。解决方案的关键在于在输入处理阶段对键值对缓存(KV cache)进行压缩,从而允许使用更大的批处理大小,显著提升吞吐量,同时保持原始模型的准确性。
链接: https://arxiv.org/abs/2412.05693
作者: Michael R. Metel,Boxing Chen,Mehdi Rezagholizadeh
关键词-EN: developed eviction policies, remove key-value, efficient inference, works have developed, developed eviction
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used, resulting in significantly higher throughput while still maintaining the original model's accuracy.
zh
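下面给出"在输入处理(prefill)阶段压缩 KV cache"这一思路的简化示意:对每个注意力头,依据累积注意力权重保留得分最高的若干 key-value 对,同时强制保留最近若干 token。这只是常见 KV 驱逐策略的通用草图,保留比例与打分方式均为假设,并非 Batch-Max 的具体实现。

```python
import torch

def compress_kv_cache(keys, values, attn_weights, keep_ratio=0.5, keep_recent=16):
    """keys/values: [heads, seq, d];attn_weights: [heads, q_len, seq](prefill 阶段的注意力权重)。
    依据每个位置收到的累积注意力保留 top-k,并强制保留最近 keep_recent 个 token。"""
    heads, seq, _ = keys.shape
    budget = max(int(seq * keep_ratio), keep_recent)
    scores = attn_weights.sum(dim=1)                     # [heads, seq] 每个位置的累积注意力
    scores[:, -keep_recent:] = float("inf")              # 最近的 token 永远保留
    keep_idx = scores.topk(budget, dim=-1).indices.sort(dim=-1).values  # 排序以保持原有顺序
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, keys.size(-1))
    return keys.gather(1, idx), values.gather(1, idx)

# 用法示意:8 个头、1024 个输入 token、头维度 64
h, s, d = 8, 1024, 64
k, v = torch.randn(h, s, d), torch.randn(h, s, d)
attn = torch.rand(h, s, s).softmax(dim=-1)
k_small, v_small = compress_kv_cache(k, v, attn)
print(k.shape, "->", k_small.shape)   # seq 维度被压缩,节省的显存可用于更大的 batch
```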
[NLP-66] Graph with Sequence: Broad-Range Semantic Modeling for Fake News Detection
【速读】: 该论文试图解决社交媒体上虚假新闻检测中语义深度不足的问题,提出了一种名为BREAK的广义语义模型。解决方案的关键在于通过全连接图捕获全面的语义信息,并采用双降噪模块来减少结构噪声和特征噪声。具体来说,语义结构降噪模块通过在序列结构和全连接图之间迭代优化图的连通性,揭示标签相关的语义相互关系结构;而语义特征降噪模块通过多样化表示,利用KL散度对齐降噪图和序列编码器的不同输出,在高维空间中实现特征多样化。这两个模块在双层框架中联合优化,增强了降噪语义的整合,从而提升了虚假新闻检测的性能。
链接: https://arxiv.org/abs/2412.05672
作者: Junwei Yin,Min Gao,Kai Shu,Wentao Li,Yinqiu Huang,Zongwei Wang
关键词-EN: threatens social stability, social media threatens, media threatens social, social stability, social media
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The rapid proliferation of fake news on social media threatens social stability, creating an urgent demand for more effective detection methods. While many promising approaches have emerged, most rely on content analysis with limited semantic depth, leading to suboptimal comprehension of news content. To address this limitation, capturing broader-range semantics is essential yet challenging, as it introduces two primary types of noise: fully connecting sentences in news graphs often adds unnecessary structural noise, while highly similar but authenticity-irrelevant sentences introduce feature noise, complicating the detection process. To tackle these issues, we propose BREAK, a broad-range semantics model for fake news detection that leverages a fully connected graph to capture comprehensive semantics while employing dual denoising modules to minimize both structural and feature noise. The semantic structure denoising module balances the graph's connectivity by iteratively refining it between two bounds: a sequence-based structure as a lower bound and a fully connected graph as the upper bound. This refinement uncovers label-relevant semantic interrelation structures. Meanwhile, the semantic feature denoising module reduces noise from similar semantics by diversifying representations, aligning distinct outputs from the denoised graph and sequence encoders using KL-divergence to achieve feature diversification in high-dimensional space. The two modules are jointly optimized in a bi-level framework, enhancing the integration of denoised semantics into a comprehensive representation for detection. Extensive experiments across four datasets demonstrate that BREAK significantly outperforms existing methods in identifying fake news. Code is available at this https URL.
zh
[NLP-67] Shifting NER into High Gear: The Auto-AdvER Approach
【速读】: 该论文试图解决汽车广告领域中命名实体识别(Named Entity Recognition, NER)的问题,旨在开发一个专门针对汽车广告文本的NER模式和数据集Auto-AdvER。解决方案的关键在于设计了一个包含“Condition”、“Historic”和“Sales Options”三个标签的标注模式,并通过行业需求驱动的标注原则和方法确保了数据集的独特性和实用性。研究展示了92%的标注一致性(F1-Score),并通过对比BERT、DeBERTaV3等编码器模型和Llama、Qwen、GPT-4等大语言模型(LLMs)的性能,发现LLMs在该任务中表现更优,尽管其成本较高且并非完美。Auto-AdvER的开发为广告分析、市场动态分析和数据驱动的预测性维护等应用提供了基础,适用于汽车领域及其他专业领域的NER需求。
链接: https://arxiv.org/abs/2412.05655
作者: Filippos Ventirozos,Ioanna Nteka,Tania Nandy,Jozef Baca,Peter Appleby,Matthew Shardlow
关键词-EN: car advertisement genre, unique NER dataset, Large Language Models, specialised named entity, linguistically unique NER
类目: Computation and Language (cs.CL)
备注: 11 pages, 1 figure, 3 tables
点击查看摘要
Abstract:This paper presents a case study on the development of Auto-AdvER, a specialised named entity recognition schema and dataset for text in the car advertisement genre. Developed with industry needs in mind, Auto-AdvER is designed to enhance text mining analytics in this domain and contributes a linguistically unique NER dataset. We present a schema consisting of three labels: “Condition”, “Historic” and “Sales Options”. We outline the guiding principles for annotation, describe the methodology for schema development, and show the results of an annotation study demonstrating inter-annotator agreement of 92% F1-Score. Furthermore, we compare the performance by using encoder-only models: BERT, DeBERTaV3 and decoder-only open and closed source Large Language Models (LLMs): Llama, Qwen, GPT-4 and Gemini. Our results show that the class of LLMs outperforms the smaller encoder-only models. However, the LLMs are costly and far from perfect for this task. We present this work as a stepping stone toward more fine-grained analysis and discuss Auto-AdvER’s potential impact on advertisement analytics and customer insights, including applications such as the analysis of market dynamics and data-driven predictive maintenance. Our schema, as well as our associated findings, are suitable for both private and public entities considering named entity recognition in the automotive domain, or other specialist domains.
zh
[NLP-68] Mixture of Hidden-Dimensions Transformer
【速读】: 该论文试图解决Transformer模型在扩展隐藏维度时面临的效率问题,即均匀增加隐藏维度会导致计算和内存成本的膨胀,同时未能有效强调每个token最相关的特征。解决方案的关键在于提出了一种名为MoHD(Mixture of Hidden Dimensions)的稀疏条件激活架构。MoHD通过共享子维度来捕捉通用token特征,并利用路由机制动态激活特定于每个token的子维度,从而在扩展隐藏维度时几乎不增加计算或参数,同时通过激活缩放和组融合机制来保留激活流,确保训练和推理的高效性及性能。实验结果表明,MoHD在参数效率和任务性能上均优于传统的Vanilla Transformers。
链接: https://arxiv.org/abs/2412.05644
作者: Yilong Chen,Junyuan Shang,Zhengyu Zhang,Jiawei Sheng,Tingwen Liu,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang
关键词-EN: hidden dimensions efficiently, Transformer models encounter, models encounter challenges, hidden dimensions, hidden dimension sparsity
类目: Computation and Language (cs.CL)
备注: 16 pages, 10 figures, 5 tables
点击查看摘要
Abstract:Transformer models encounter challenges in scaling hidden dimensions efficiently, as uniformly increasing them inflates computational and memory costs while failing to emphasize the most relevant features for each token. For further understanding, we study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions, revealing an "activation flow" pattern. Notably, there are shared sub-dimensions with sustained activation across multiple consecutive tokens and specialized sub-dimensions uniquely activated for each token. To better model token-relevant sub-dimensions, we propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. Particularly, MoHD employs shared sub-dimensions for common token features and a routing mechanism to dynamically activate specialized sub-dimensions. To mitigate potential information loss from sparsity, we design activation scaling and group fusion mechanisms to preserve activation flow. In this way, MoHD expands hidden dimensions with negligible increases in computation or parameters, enabling efficient training and inference while maintaining performance. Evaluations across 10 NLP tasks show that MoHD surpasses Vanilla Transformers in parameter efficiency and task performance. It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost. MoHD offers a new perspective on scaling the model, showcasing the potential of hidden dimension sparsity to boost efficiency.
zh
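下面用一个极简的 PyTorch 模块示意"共享子维度 + 按 token 路由激活专用子维度"的思路:隐藏维度被切分为若干组,其中一部分组对所有 token 共享,其余组由路由器按 token 选 top-k 激活。分组数量、门控形式与 0/1 掩码均为示意性假设,并非 MoHD 的原始实现(例如未包含论文中的激活缩放与组融合机制)。

```python
import torch
import torch.nn as nn

class ToyHiddenDimensionRouter(nn.Module):
    def __init__(self, d_model=512, n_groups=8, n_shared=2, top_k=2):
        super().__init__()
        assert d_model % n_groups == 0
        self.group_size = d_model // n_groups
        self.n_groups, self.n_shared, self.top_k = n_groups, n_shared, top_k
        self.router = nn.Linear(d_model, n_groups - n_shared)  # 只为“专用组”打分

    def forward(self, x):                                       # x: [batch, seq, d_model]
        b, s, d = x.shape
        groups = x.view(b, s, self.n_groups, self.group_size)
        gate_logits = self.router(x)                            # [b, s, 专用组数]
        topk = gate_logits.topk(self.top_k, dim=-1).indices
        mask = torch.zeros_like(gate_logits).scatter_(-1, topk, 1.0)
        # 共享组恒为 1,专用组按路由结果置 0/1
        shared = torch.ones(b, s, self.n_shared, device=x.device)
        full_mask = torch.cat([shared, mask], dim=-1).unsqueeze(-1)  # [b, s, n_groups, 1]
        return (groups * full_mask).view(b, s, d)

# 用法示意
m = ToyHiddenDimensionRouter()
out = m(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512]),但每个 token 只有部分子维度被激活
```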
[NLP-69] CharacterBox: Evaluating the Role-Playing Capabilities of LLM s in Text-Based Virtual Worlds
【速读】: 该论文试图解决大语言模型(LLMs)在角色扮演能力评估中的挑战,特别是如何全面捕捉角色在复杂情节中的细微行为和特质。解决方案的关键在于提出了CharacterBox,这是一个模拟沙盒,能够生成情境化的细粒度角色行为轨迹。CharacterBox包含两个主要组件:基于心理学和行为科学的角色代理(character agent)和协调角色间互动及环境变化的叙述代理(narrator agent)。通过这些行为轨迹,CharacterBox能够更深入地评估角色扮演能力,并引入了两种基于轨迹的方法来提升LLM的性能。此外,为了降低成本和促进社区应用,论文还微调了两个较小模型(CharacterNR和CharacterRM)作为GPT API的替代,展示了其与高级GPT API相当的性能。
链接: https://arxiv.org/abs/2412.05631
作者: Lei Wang,Jianxun Lian,Yi Huang,Yanqi Dai,Haoxuan Li,Xu Chen,Xing Xie,Ji-Rong Wen
关键词-EN: Large Language Models, Large Language, including intelligent non-player, intelligent non-player characters, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Role-playing is a crucial capability of Large Language Models (LLMs), enabling a wide range of practical applications, including intelligent non-player characters, digital twins, and emotional companions. Evaluating this capability in LLMs is challenging due to the complex dynamics involved in role-playing, such as maintaining character fidelity throughout a storyline and navigating open-ended narratives without a definitive ground truth. Current evaluation methods, which primarily focus on question-answering or conversational snapshots, fall short of adequately capturing the nuanced character traits and behaviors essential for authentic role-playing. In this paper, we propose CharacterBox, which is a simulation sandbox designed to generate situational fine-grained character behavior trajectories. These behavior trajectories enable a more comprehensive and in-depth evaluation of role-playing capabilities. CharacterBox consists of two main components: the character agent and the narrator agent. The character agent, grounded in psychological and behavioral science, exhibits human-like behaviors, while the narrator agent coordinates interactions between character agents and environmental changes. Additionally, we introduce two trajectory-based methods that leverage CharacterBox to enhance LLM performance. To reduce costs and facilitate the adoption of CharacterBox by public communities, we fine-tune two smaller models, CharacterNR and CharacterRM, as substitutes for GPT API calls, and demonstrate their competitive performance compared to advanced GPT APIs.
zh
[NLP-70] BERTCaps: BERT Capsule for Persian Multi-Domain Sentiment Analysis
【速读】: 该论文试图解决多领域情感分析(Multidomain Sentiment Analysis)中,现有方法在处理与训练数据不同领域时表现不佳的问题。解决方案的关键在于提出了一种新的基于深度学习的方法——BERTCapsules,该方法结合了BERT模型和Capsule网络。BERT用于实例表示,而Capsule结构则用于学习提取的图结构,从而在多领域情感分析中实现了更高的准确性。实验结果表明,该方法在情感分类和领域分类任务中分别达到了0.9712和0.8509的准确率。
链接: https://arxiv.org/abs/2412.05591
作者: Mohammadali Memari,Soghra Mikaeyl Nejad,Amir Parsa Rabiei,Mehrshad Eisaei,Saba Hesaraki
关键词-EN: analysis involves estimating, domain specific information, exploiting domain specific, specific information, sentiment analysis involves
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multidomain sentiment analysis involves estimating the polarity of an unstructured text by exploiting domain-specific information. One of the main issues common to the approaches discussed in the literature is their poor applicability to domains that differ from those used to construct opinion models. This paper aims to present a new method for Persian multidomain sentiment analysis using deep learning approaches. The proposed BERTCapsules approach consists of a combination of BERT and Capsule models. In this approach, BERT was used for instance representation, and the Capsule structure was used to learn the extracted graphs. The Digikala dataset, including ten domains with both positive and negative polarity, was used to evaluate this approach. The BERTCaps model achieved an accuracy of 0.9712 in binary sentiment classification and 0.8509 in domain classification.
zh
[NLP-71] UNet and LSTM combined approach for Breast Ultrasound Image Segmentation
【速读】: 该论文试图解决现有基于UNet和UNet++网络的乳腺超声图像(BUSI)分割模型在处理时间维度信息上的不足问题。解决方案的关键在于通过在UNet++架构中集成长短期记忆网络(LSTM)层和自注意力机制(self-attention mechanisms),以捕捉图像中的时间特征。此外,引入多尺度特征提取模块(Multiscale Feature Extraction Module)来增强对不同尺度特征的提取能力。通过结合这些改进与数据增强技术,论文在BUSI with GT数据集上实现了高精度的分割性能,具体表现为98.88%的准确率、99.53%的特异性、95.34%的精确率、91.20%的敏感性、93.74的F1分数和92.74%的Dice系数。
链接: https://arxiv.org/abs/2412.05585
作者: Saba Hesaraki,Morteza Akbari,Ramin Mousa
关键词-EN: prompt detection playing, Breast cancer stands, diminishing mortality rates, Breast cancer, fatality among females
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Breast cancer stands as a prevalent cause of fatality among females on a global scale, with prompt detection playing a pivotal role in diminishing mortality rates. The utilization of ultrasound scans in the BUSI dataset for medical imagery pertaining to breast cancer has exhibited commendable segmentation outcomes through the application of UNet and UNet++ networks. Nevertheless, a notable drawback of these models resides in their inattention towards the temporal aspects embedded within the images. This research endeavors to enrich the UNet++ architecture by integrating LSTM layers and self-attention mechanisms to exploit temporal characteristics for segmentation purposes. Furthermore, the incorporation of a Multiscale Feature Extraction Module aims to grasp varied scale features within the UNet++. Through the amalgamation of our proposed methodology with data augmentation on the BUSI with GT dataset, an accuracy rate of 98.88%, specificity of 99.53%, precision of 95.34%, sensitivity of 91.20%, F1-score of 93.74, and Dice coefficient of 92.74% are achieved. These findings demonstrate competitiveness with cutting-edge techniques outlined in existing literature.
zh
[NLP-72] LLM s-as-Judges: A Comprehensive Survey on LLM -based Evaluation Methods DATE
【速读】: 该论文试图解决如何系统性地理解和应用大语言模型(LLMs)作为基于自然语言响应的评估者(LLMs-as-judges)的问题。解决方案的关键在于从功能性、方法论、应用领域、元评估和局限性五个关键角度进行全面调研。具体来说,论文首先定义了LLMs-as-Judges的概念,并探讨了其功能和构建评估系统的方法;接着,分析了其在不同领域的应用潜力和评估方法;最后,详细讨论了LLMs-as-judges的局限性及未来发展方向。通过这种结构化的分析,论文旨在为LLMs-as-judges在研究和实践中的发展和应用提供深入的见解。
链接: https://arxiv.org/abs/2412.05579
作者: Haitao Li,Qian Dong,Junjie Chen,Huixue Su,Yujia Zhou,Qingyao Ai,Ziyi Ye,Yiqun Liu
关键词-EN: Large Language Models, advancement of Large, Language Models, Large Language, LLM judges
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 60 pages, comprehensive and continuously updated
点击查看摘要
Abstract:The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to its excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address the methodology for constructing an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at this https URL.
zh
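作为该综述所讨论范式的一个直观示例,下面给出"逐点(pointwise)打分式 LLM 评审"的最小提示构造与结果解析草图;`call_llm` 为假设的模型调用接口,1-5 分的评分量表与回复格式均为示意设定,不代表综述中任何具体方法。

```python
import re

def call_llm(prompt: str) -> str:
    """假设的 LLM 接口;实际使用时替换为任意可用模型的调用。"""
    return "Score: 4. The answer is mostly correct but omits one edge case."

def pointwise_judge(question: str, answer: str) -> int:
    prompt = (
        "You are an impartial judge. Rate the answer to the question on a 1-5 scale "
        "(5 = fully correct and helpful). Reply in the form 'Score: <n>. <reason>'.\n\n"
        f"Question: {question}\nAnswer: {answer}\n"
    )
    reply = call_llm(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else -1   # -1 表示解析失败

print(pointwise_judge("什么是 KV cache?", "KV cache 缓存注意力层的键值对以加速自回归解码。"))
```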
[NLP-73] A polar coordinate system represents syntax in large language models
【速读】: 该论文试图解决现有结构探针(Structural Probe)在神经激活中仅能表示语法关系的存在,而无法准确表示语法关系的类型和方向的问题。解决方案的关键在于引入极坐标探针(Polar Probe),通过同时分析词嵌入之间的距离和方向来恢复语法关系的类型和方向。这一方法不仅显著提升了语法关系识别的准确性,还揭示了大型语言模型(LLMs)中间层中存在的低维子空间中的极坐标系统,并展示了该系统在最新前沿模型中的精确性提升。此外,通过新的基准测试,证明了相似的语法关系在嵌套的语法树层次中以相似的方式编码。
链接: https://arxiv.org/abs/2412.05571
作者: Pablo Diego-Simón,Stéphane D’Ascoli,Emmanuel Chemla,Yair Lakretz,Jean-Rémi King
关键词-EN: Structural Probe, Originally formalized, Structural Probe word, syntactic relations, large language models
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Originally formalized with symbolic representations, syntactic trees may also be effectively represented in the activations of large language models (LLMs). Indeed, a 'Structural Probe' can find a subspace of neural activations where syntactically related words are relatively close to one another. However, this syntactic code remains incomplete: the distance between the Structural Probe word embeddings can represent the existence but not the type and direction of syntactic relations. Here, we hypothesize that syntactic relations are, in fact, coded by the relative direction between nearby embeddings. To test this hypothesis, we introduce a 'Polar Probe' trained to read syntactic relations from both the distance and the direction between word embeddings. Our approach reveals three main findings. First, our Polar Probe successfully recovers the type and direction of syntactic relations, and substantially outperforms the Structural Probe by nearly two-fold. Second, we confirm that this polar coordinate system exists in a low-dimensional subspace of the intermediate layers of many LLMs and becomes increasingly precise in the latest frontier models. Third, we demonstrate with a new benchmark that similar syntactic relations are coded similarly across the nested levels of syntactic trees. Overall, this work shows that LLMs spontaneously learn a geometry of neural activations that explicitly represents the main symbolic structures of linguistic theory.
zh
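下面给出一个简化的"极坐标探针"训练草图:在一个低维投影空间中,用词对向量差的长度对应"是否存在句法关系"(以句法树距离回归作示意),用其方向经线性分类器预测关系的类型与方向。投影维度、关系数目与损失组合均为假设,仅用于示意论文的核心想法,并非原始实现。

```python
import torch
import torch.nn as nn

class PolarProbe(nn.Module):
    def __init__(self, d_model=768, d_probe=64, n_relations=40):
        super().__init__()
        self.proj = nn.Linear(d_model, d_probe, bias=False)      # 低维子空间投影
        self.rel_classifier = nn.Linear(d_probe, n_relations)    # 依据“方向”判别关系类型/方向

    def forward(self, head_emb, dep_emb):
        diff = self.proj(head_emb) - self.proj(dep_emb)          # 词对在探针空间中的相对向量
        distance = diff.norm(dim=-1)                             # “距离”:是否存在句法关系
        unit = diff / (diff.norm(dim=-1, keepdim=True) + 1e-8)   # 归一化后只保留“方向”
        return distance, self.rel_classifier(unit)

# 训练示意:距离用回归逼近句法树距离,方向用交叉熵预测关系标签(随机数据)
probe = PolarProbe()
head, dep = torch.randn(32, 768), torch.randn(32, 768)
tree_dist = torch.ones(32)                  # 假设这些词对在句法树上相邻
rel_label = torch.randint(0, 40, (32,))
dist, logits = probe(head, dep)
loss = nn.functional.mse_loss(dist, tree_dist) + nn.functional.cross_entropy(logits, rel_label)
loss.backward()
print(float(loss))
```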
[NLP-74] A Survey on Uncertainty Quantification of Large Language Models : Taxonomy Open Research Challenges and Future Directions
【速读】: 该论文试图解决大语言模型(LLMs)在生成内容时可能产生的幻觉(hallucinations)问题,即生成看似合理但事实错误的响应,并伴随高度的自信。解决方案的关键在于量化LLMs的不确定性(uncertainty quantification),通过评估模型对特定提示的响应不确定性来检测这些非事实性响应。论文综述了现有的不确定性量化方法,并提出了一个统一的分类框架,以帮助理解当前的技术水平。此外,论文还强调了这些方法在聊天机器人、文本应用以及机器人领域的具体应用,并指出了未来研究中的开放挑战。
链接: https://arxiv.org/abs/2412.05563
作者: Ola Shorinwa,Zhiting Mei,Justin Lidard,Allen Z. Ren,Anirudha Majumdar
关键词-EN: large language models, spurred widespread integration, language models, content generation, facets of society
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The remarkable performance of large language models (LLMs) in content generation, coding, and common-sense reasoning has spurred widespread integration into many facets of society. However, integration of LLMs raises valid questions on their reliability and trustworthiness, given their propensity to generate hallucinations: plausible, factually-incorrect responses, which are expressed with striking confidence. Previous work has shown that hallucinations and other non-factual responses generated by LLMs can be detected by examining the uncertainty of the LLM in its response to the pertinent prompt, driving significant research efforts devoted to quantifying the uncertainty of LLMs. This survey seeks to provide an extensive review of existing uncertainty quantification methods for LLMs, identifying their salient features, along with their strengths and weaknesses. We present existing methods within a relevant taxonomy, unifying ostensibly disparate methods to aid understanding of the state of the art. Furthermore, we highlight applications of uncertainty quantification methods for LLMs, spanning chatbot and textual applications to embodied artificial intelligence applications in robotics. We conclude with open research challenges in uncertainty quantification of LLMs, seeking to motivate future research.
zh
[NLP-75] On the Expressive Power of Modern Hopfield Networks
【速读】: 该论文试图解决现代Hopfield网络(Modern Hopfield Networks, MHNs)在计算表达能力上的理论极限问题。解决方案的关键在于通过电路复杂度理论(circuit complexity theory)建立了MHNs计算能力的严格理论界限,证明MHNs属于DLOGTIME-uniform TC^0类。论文的核心贡献在于指出,除非TC^0 = NC^1,否则具有多项式精度、常数层数和O(n)隐藏维度的MHNs无法解决NC^1难的问题,如无向图连通性问题和树同构问题。此外,研究还将结果扩展到核化Hopfield网络(Kernelized Hopfield Networks),进一步揭示了现代Hopfield网络在表达能力上的局限性,并为开发新的基于Hopfield的架构提供了理论指导。
链接: https://arxiv.org/abs/2412.05562
作者: Xiaoyu Li,Yuanpeng Li,Yingyu Liang,Zhenmei Shi,Zhao Song
关键词-EN: Modern Hopfield networks, Hopfield networks, capable of replacing, attention mechanisms, Modern Hopfield
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Modern Hopfield networks (MHNs) have emerged as powerful tools in deep learning, capable of replacing components such as pooling layers, LSTMs, and attention mechanisms. Recent advancements have enhanced their storage capacity, retrieval speed, and error rates. However, the fundamental limits of their computational expressiveness remain unexplored. Understanding the expressive power of MHNs is crucial for optimizing their integration into deep learning architectures. In this work, we establish rigorous theoretical bounds on the computational capabilities of MHNs using circuit complexity theory. Our key contribution is to show that MHNs are DLOGTIME-uniform TC^0. Hence, unless TC^0 = NC^1, poly(n)-precision modern Hopfield networks with a constant number of layers and O(n) hidden dimension cannot solve NC^1-hard problems such as the undirected graph connectivity problem and the tree isomorphism problem. We also extended our results to Kernelized Hopfield Networks. These results demonstrate the limitations in the expressive power of modern Hopfield networks. Moreover, our theoretical analysis provides insights to guide the development of new Hopfield-based architectures.
zh
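现代 Hopfield 网络的检索更新本身非常紧凑:给定存储模式矩阵 X 与查询状态 ξ,更新为 ξ ← X·softmax(β·Xᵀξ),与注意力机制同构。下面用 NumPy 给出这一更新的最小示意,便于对照上文讨论的计算能力上界;β、模式数量与迭代步数均为假设值。

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_retrieve(X, xi, beta=4.0, n_steps=3):
    """X: [d, N] 每列是一条存储模式;xi: [d] 查询/初始状态。
    更新规则 xi <- X softmax(beta * X^T xi),通常一步即近似收敛。"""
    for _ in range(n_steps):
        xi = X @ softmax(beta * (X.T @ xi))
    return xi

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))                    # 存储 10 条 64 维模式
query = X[:, 3] + 0.3 * rng.normal(size=64)      # 带噪声的第 4 条模式作为查询
retrieved = hopfield_retrieve(X, query)
cos = retrieved @ X[:, 3] / (np.linalg.norm(retrieved) * np.linalg.norm(X[:, 3]))
print("与目标模式的余弦相似度:", float(cos))
```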
[NLP-76] SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
【速读】: 该论文试图解决学习引导视觉导航领域中,不同粒度语言指令(high-level category-specific search 和 low-level language-guided navigation)之间的统一框架问题。解决方案的关键在于提出了一个新颖的状态自适应混合专家模型 (State-Adaptive Mixture of Experts, SAME),该模型能够根据不同粒度的语言指令和动态环境观察,有效推断出导航决策。通过这一模型,论文展示了一个能够同时处理七种导航任务的多功能智能体,其性能优于或与特定任务的智能体相当。
链接: https://arxiv.org/abs/2412.05552
作者: Gengze Zhou,Yicong Hong,Zun Wang,Chongyang Zhao,Mohit Bansal,Qi Wu
关键词-EN: detailed textual commands, high-level category-specific search, learning instruction-guided visual, instruction-guided visual navigation, low-level language-guided navigation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework – we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. Powered by SAME, we present a versatile agent capable of addressing seven navigation tasks simultaneously that outperforms or achieves highly comparable performance to task-specific agents.
zh
[NLP-77] SplaXBERT: Leveraging Mixed Precision Training and Context Splitting for Question Answering
【速读】: 该论文试图解决在处理长文本时,传统BERT模型在问答任务中的效率和准确性问题。解决方案的关键在于采用了基于ALBERT-xlarge的SplaXBERT模型,并通过上下文分割(context-splitting)和混合精度训练(mixed precision training)技术,显著提升了模型在SQuAD v1.1数据集上的表现,实现了85.95%的精确匹配率和92.97%的F1分数,同时在资源效率上也优于传统BERT模型。
链接: https://arxiv.org/abs/2412.05499
作者: Zhu Yufan,Hao Zeyu,Li Siqi,Niu Boqian
关键词-EN: mixed precision training, achieves high efficiency, built on ALBERT-xlarge, precision training, achieves high
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:SplaXBERT, built on ALBERT-xlarge with context-splitting and mixed precision training, achieves high efficiency in question-answering tasks on lengthy texts. Tested on SQuAD v1.1, it attains an Exact Match of 85.95% and an F1 Score of 92.97%, outperforming traditional BERT-based models in both accuracy and resource efficiency.
zh
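"上下文分割"的思路可以用一个与具体模型无关的滑动窗口草图来说明:把超长文本切成带重叠的片段,分别做抽取式问答,再取得分最高的答案。下面使用 Hugging Face 的 question-answering pipeline 做示意,其中模型名、窗口与步长均为假设,并非 SplaXBERT 的原始配置。

```python
from transformers import pipeline

# 假设使用一个公开的抽取式 QA 模型,仅作示意
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def split_context(text: str, window: int = 200, stride: int = 100):
    """按词切分出带重叠的片段,保证最后一个片段覆盖到文本末尾。"""
    words = text.split()
    start = 0
    while True:
        yield " ".join(words[start:start + window])
        if start + window >= len(words):
            break
        start += stride

def answer_long_context(question: str, long_context: str):
    best = {"score": -1.0, "answer": ""}
    for chunk in split_context(long_context):
        result = qa(question=question, context=chunk)
        if result["score"] > best["score"]:
            best = result
    return best

# 用法示意:人为拼出一段超长上下文
ctx = "SplaXBERT is built on ALBERT-xlarge. " * 100 + " It was tested on SQuAD v1.1."
print(answer_long_context("What dataset was it tested on?", ctx))
```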
[NLP-78] Knowledge Graphs are all you need: Leveraging KGs in Physics Question Answering
【速读】: 该论文试图解决高中物理问题分解为子问题时的逻辑一致性问题,解决方案的关键在于利用大型语言模型(LLMs)生成的知识图谱来捕捉问题的内在逻辑,并以此指导子问题的生成。通过这种方法,生成的子问题在逻辑上与原问题更加一致,相较于传统分解技术显著提高了对原问题逻辑的忠实度。这一方法不仅提升了学习体验,还展示了LLMs在教育方法学上的潜在变革能力。
链接: https://arxiv.org/abs/2412.05453
作者: Krishnasai Addala,Kabir Dev Paul Baghel,Dhruv Jain,Chhavi Kirtani,Avinash Anand,Rajiv Ratn Shah
关键词-EN: decompose high school-level, high school-level physics, large language models, school-level physics questions, knowledge graphs generated
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This study explores the effectiveness of using knowledge graphs generated by large language models to decompose high school-level physics questions into sub-questions. We introduce a pipeline aimed at enhancing model response quality for Question Answering tasks. By employing LLMs to construct knowledge graphs that capture the internal logic of the questions, these graphs then guide the generation of subquestions. We hypothesize that this method yields sub-questions that are more logically consistent with the original questions compared to traditional decomposition techniques. Our results show that sub-questions derived from knowledge graphs exhibit significantly improved fidelity to the original question’s logic. This approach not only enhances the learning experience by providing clearer and more contextually appropriate sub-questions but also highlights the potential of LLMs to transform educational methodologies. The findings indicate a promising direction for applying AI to improve the quality and effectiveness of educational content.
zh
[NLP-79] owards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications
【速读】: 该论文试图解决多智能体协作(multi-agent collaboration)在企业应用中的设计与评估挑战,特别是如何有效协调和路由智能体之间的通信。解决方案的关键在于提出了一个新颖的多智能体协作框架,并通过两种关键操作模式进行评估:(1) 协调模式,通过并行通信和负载引用(payload referencing)实现复杂任务的完成;(2) 路由模式,用于智能体之间的高效消息转发。研究结果表明,多智能体协作显著提升了目标成功率(最高可达70%),负载引用机制在代码密集型任务中提升了23%的性能,而路由机制则大幅降低了通信延迟。这些发现为企业在多智能体系统的部署提供了重要指导,并推动了高效、可扩展的多智能体协作框架的发展。
链接: https://arxiv.org/abs/2412.05449
作者: Raphael Shu,Nilaksh Das,Michelle Yuan,Monica Sunkara,Yi Zhang
关键词-EN: large language models, shown strong capabilities, language models, powered by large, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Technical report for multi-agent collaboration on AWS Bedrock Agents
点击查看摘要
Abstract:AI agents powered by large language models (LLMs) have shown strong capabilities in problem solving. Through combining many intelligent agents, multi-agent collaboration has emerged as a promising approach to tackle complex, multi-faceted problems that exceed the capabilities of single AI agents. However, designing the collaboration protocols and evaluating the effectiveness of these systems remains a significant challenge, especially for enterprise applications. This report addresses these challenges by presenting a comprehensive evaluation of coordination and routing capabilities in a novel multi-agent collaboration framework. We evaluate two key operational modes: (1) a coordination mode enabling complex task completion through parallel communication and payload referencing, and (2) a routing mode for efficient message forwarding between agents. We benchmark on a set of handcrafted scenarios from three enterprise domains, which are publicly released with the report. For coordination capabilities, we demonstrate the effectiveness of inter-agent communication and payload referencing mechanisms, achieving end-to-end goal success rates of 90%. Our analysis yields several key findings: multi-agent collaboration enhances goal success rates by up to 70% compared to single-agent approaches in our benchmarks; payload referencing improves performance on code-intensive tasks by 23%; latency can be substantially reduced with a routing mechanism that selectively bypasses agent orchestration. These findings offer valuable guidance for enterprise deployments of multi-agent systems and advance the development of scalable, efficient multi-agent collaboration frameworks.
zh
[NLP-80] Diversity Over Quantity: A Lesson From Few Shot Relation Classification
【速读】: 该论文试图解决少样本关系分类 (Few-shot Relation Classification, FSRC) 中模型对新关系的泛化能力问题。研究表明,尽管数据规模扩大对自然语言处理 (NLP) 的进展有显著贡献,但在 FSRC 中,关系类型的多样性 (diversity in relation types) 更为关键。论文提出的关键解决方案是引入 REBEL-FS 基准,该基准包含的关系类型数量远超现有数据集,并通过系统实验证明,增加训练数据中关系类型的多样性能够显著提升模型在各种少样本学习场景下的性能,尤其是在高负样本设置下。这一发现挑战了“更多数据必然带来更好性能”的常见假设,表明通过专注于多样性的数据策划可以大幅减少对大规模数据集的依赖。
链接: https://arxiv.org/abs/2412.05434
作者: Amir DN Cohen,Shauli Ravfogel,Shaltiel Shmidman,Yoav Goldberg
关键词-EN: few-shot relation classification, FSRC, relation types, relation classification, relation
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In few-shot relation classification (FSRC), models must generalize to novel relations with only a few labeled examples. While much of the recent progress in NLP has focused on scaling data size, we argue that diversity in relation types is more crucial for FSRC performance. In this work, we demonstrate that training on a diverse set of relations significantly enhances a model's ability to generalize to unseen relations, even when the overall dataset size remains fixed. We introduce REBEL-FS, a new FSRC benchmark that incorporates an order of magnitude more relation types than existing datasets. Through systematic experiments, we show that increasing the diversity of relation types in the training data leads to consistent gains in performance across various few-shot learning scenarios, including high-negative settings. Our findings challenge the common assumption that more data alone leads to better performance and suggest that targeted data curation focused on diversity can substantially reduce the need for large-scale datasets in FSRC.
zh
[NLP-81] CALICO: Conversational Agent Localization via Synthetic Data Generation NEURIPS2023 NEURIPS
【速读】: 该论文试图解决将大型语言模型(LLMs)微调以实现对话代理训练数据的跨语言本地化问题。解决方案的关键在于CALICO方法,它支持三种操作:逐字复制、字面翻译和本地化生成(如生成目标语言中更合适的城市和机场名称)。此外,CALICO设计了一个迭代过滤机制来去除噪声样本,从而提升下游对话代理的性能。通过构建并发布新的多语言本地化(HL)版本的MultiATIS++测试集,论文证明了CALICO在字面翻译和本地化生成方面均优于现有的LINGUIST方法。
链接: https://arxiv.org/abs/2412.05388
作者: Andy Rosenbaum,Pegah Kharazmi,Ershad Banijamali,Lu Zeng,Christopher DiPersio,Pan Wei,Gokmen Oz,Clement Chung,Karolina Owczarzak,Fabian Triefenbach,Wael Hamza
关键词-EN: Large Language Models, fine-tune Large Language, agent training data, fine-tune Large, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to The 37th International Conference on Neural Information Processing Systems (NeurIPS 2023) December 10-16, 2023 - SyntheticData4ML Workshop, New Orleans, United States this https URL
点击查看摘要
Abstract:We present CALICO, a method to fine-tune Large Language Models (LLMs) to localize conversational agent training data from one language to another. For slots (named entities), CALICO supports three operations: verbatim copy, literal translation, and localization, i.e. generating slot values more appropriate in the target language, such as city and airport names located in countries where the language is spoken. Furthermore, we design an iterative filtering mechanism to discard noisy generated samples, which we show boosts the performance of the downstream conversational agent. To prove the effectiveness of CALICO, we build and release a new human-localized (HL) version of the MultiATIS++ travel information test set in 8 languages. Compared to the original human-translated (HT) version of the test set, we show that our new HL version is more challenging. We also show that CALICO outperforms state-of-the-art LINGUIST (which relies on literal slot translation out of context) both on the HT case, where CALICO generates more accurate slot translations, and on the HL case, where CALICO generates localized slots which are closer to the HL test set.
zh
[NLP-82] Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models
【速读】: 该论文试图解决自回归变换器语言模型(Autoregressive transformer language models, LMs)在处理花园路径句(garden path sentences)时的机制问题,具体包括:(1) LMs是否使用句法特征或浅层启发式方法进行增量句处理;(2) LMs是否只表示一种潜在解释,还是多种;(3) LMs是否重新分析或修复其初始错误表示。解决方案的关键在于使用稀疏自编码器(sparse autoencoders)来识别决定LM偏好花园路径句继续方式的可解释特征。研究发现,尽管许多重要特征与句法结构相关,但也有一些特征反映了与句法无关的启发式方法。此外,虽然大多数活跃特征对应于一种句子的解释,但也有一些特征对应于另一种解释,表明LMs同时对两种可能性赋予权重。最后,LMs在处理花园路径句时使用的特征并未用于回答后续问题。
链接: https://arxiv.org/abs/2412.05353
作者: Michael Hanna,Aaron Mueller
关键词-EN: Autoregressive transformer language, successfully handling phenomena, NPI licensing, transformer language models, Autoregressive transformer
类目: Computation and Language (cs.CL)
备注: Code and data available at this https URL
点击查看摘要
Abstract:Autoregressive transformer language models (LMs) possess strong syntactic abilities, often successfully handling phenomena from agreement to NPI licensing. However, the features they use to incrementally process language inputs are not well understood. In this paper, we fill this gap by studying the mechanisms underlying garden path sentence processing in LMs. We ask: (1) Do LMs use syntactic features or shallow heuristics to perform incremental sentence processing? (2) Do LMs represent only one potential interpretation, or multiple? and (3) Do LMs reanalyze or repair their initial incorrect representations? To address these questions, we use sparse autoencoders to identify interpretable features that determine which continuation - and thus which reading - of a garden path sentence the LM prefers. We find that while many important features relate to syntactic structure, some reflect syntactically irrelevant heuristics. Moreover, while most active features correspond to one reading of the sentence, some features correspond to the other, suggesting that LMs assign weight to both possibilities simultaneously. Finally, LMs do not re-use features from garden path sentence processing to answer follow-up questions.
zh
[NLP-83] Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation
【速读】: 该论文试图解决大型语言模型(LLM)在多方对话(MPD)场景中的适应性问题,特别是其在多方对话中的表现不佳,限制了其在多人会议、讨论和日常交流等场景中的应用。解决方案的关键在于设计了一个多方对话微调框架(MuPaS),通过在多方对话数据集上对LLM进行微调,使其能够高效且有效地适应多方对话风格。此外,论文还提出了两种训练策略,将MuPaS转化为多方对话模拟器,从而进一步提升模型在多方对话中的表现,包括生成高质量的对话内容、准确预测下一个发言者,并能够在分布外场景、话题和角色描述下合理生成对话。
链接: https://arxiv.org/abs/2412.05342
作者: Xiaoyu Wang,Ningyuan Xi,Teng Chen,Qingqing Gu,Yue Zhao,Xiaokai Chen,Zhonglin Jiang,Yong Chen,Luo Ji
关键词-EN: Large Language Models, Large Language, Language Models, including multi-personal meetings, scenarios including multi-personal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are usually fine-tuned to participate in dyadic or two-party dialogues, and thus cannot adapt well to multi-party dialogues (MPD), which hinders their applications in scenarios such as multi-personal meetings, discussions and daily communication. Previous LLM-based research has mainly focused on the multi-agent framework, while the base LLMs are still pairwisely fine-tuned. In this work, we design a multi-party fine-tuning framework (MuPaS) for LLMs on multi-party dialogue datasets, and show that such a straightforward framework can let the LLM align with the multi-party conversation style efficiently and effectively. We also design two training strategies which can convert MuPaS into an MPD simulator. Substantial experiments show that MuPaS can achieve state-of-the-art multi-party responses, higher accuracy of next-speaker prediction, higher human- and automatically-evaluated utterance quality, and can even generate reasonably with out-of-distribution scene, topic and role descriptions. The MuPaS framework bridges LLM training with more complicated multi-party applications, such as conversation generation, virtual rehearsal or the metaverse.
zh
[NLP-84] PyTerrier-GenRank: The PyTerrier Plugin for Reranking with Large Language Models
【速读】: 该论文试图解决使用大型语言模型(LLMs)作为重排序器时,需要实验各种超参数(如提示格式、模型选择和重构策略)的问题。解决方案的关键是引入了PyTerrier-GenRank,这是一个PyTerrier插件,旨在简化与LLMs的无缝重排序实验,支持点式(pointwise)和列表式(listwise)提示等流行排序策略,并通过HuggingFace和OpenAI的托管端点进行验证。
链接: https://arxiv.org/abs/2412.05339
作者: Kaustubh D. Dhole
关键词-EN: rerankers requires experimenting, model choice, prompt formats, rerankers requires, requires experimenting
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Using LLMs as rerankers requires experimenting with various hyperparameters, such as prompt formats, model choice, and reformulation strategies. We introduce PyTerrier-GenRank, a PyTerrier plugin to facilitate seamless reranking experiments with LLMs, supporting popular ranking strategies like pointwise and listwise prompting. We validate our plugin through HuggingFace and OpenAI hosted endpoints.
zh
[NLP-85] xt Is Not All You Need: Multimodal Prompting Helps LLM s Understand Humor
【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 在理解幽默方面的挑战,因为幽默通常是多模态的,依赖于语音歧义、节奏和时机来传达意义。解决方案的关键在于采用一种简单的多模态提示方法,通过同时提供笑话的文本和语音形式(使用现成的文本转语音 (Text-to-Speech, TTS) 系统生成),来增强模型对幽默的理解和解释。实验结果表明,与仅使用文本提示相比,多模态提示在所有测试数据集上显著提升了幽默解释的准确性。
链接: https://arxiv.org/abs/2412.05315
作者: Ashwin Baluja
关键词-EN: Large Language Models, impressive natural language, Language Models, Large Language, demonstrated impressive natural
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) have demonstrated impressive natural language understanding capabilities across various text-based tasks, understanding humor has remained a persistent challenge. Humor is frequently multimodal, relying on phonetic ambiguity, rhythm and timing to convey meaning. In this study, we explore a simple multimodal prompting approach to humor understanding and explanation. We present an LLM with both the text and the spoken form of a joke, generated using an off-the-shelf text-to-speech (TTS) system. Using multimodal cues improves the explanations of humor compared to textual prompts across all tested datasets.
zh
[NLP-86] DocEDA: Automated Extraction and Design of Analog Circuits from Documents with Large Language Model
【速读】: 该论文试图解决电子设计自动化 (EDA) 中电路设计文档中电气参数提取效率低、易出错的问题。解决方案的关键在于引入 DocEDA 系统,该系统结合了先进的计算机视觉技术和大型语言模型 (LLMs),通过专门设计的布局分析模型对文档进行分类,并利用 LLMs 的链式思维推理能力自动提取电子元件参数。此外,DocEDA 还通过改进的 GAM-YOLO 模型与拓扑识别相结合,将电路图解析为电路网表,并采用空间映射增强的优化框架对文档布局进行优化。实验结果表明,DocEDA 显著提高了处理电路设计文档的效率和电气参数提取的准确性,具有广泛的适应性和潜在的革命性影响。
链接: https://arxiv.org/abs/2412.05301
作者: Hong Cai Chen,Longchang Wu,Ming Gao,Lingrui Shen,Jiarui Zhong,Yipin Xu
关键词-EN: Electronic Design Automation, Efficient and accurate, Design Automation, Large Language Models, critical for accelerating
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Efficient and accurate extraction of electrical parameters from circuit datasheets and design documents is critical for accelerating circuit design in Electronic Design Automation (EDA). Traditional workflows often rely on engineers manually searching for and extracting these parameters, which is time-consuming and prone to human error. To address these challenges, we introduce DocEDA, an automated system that leverages advanced computer vision techniques and Large Language Models (LLMs) to extract electrical parameters seamlessly from documents. A layout analysis model specifically designed for datasheets is proposed to classify documents into circuit-related parts. Utilizing the inherent Chain-of-Thought reasoning capabilities of LLMs, DocEDA automates the extraction of electronic component parameters from documents. For circuit diagram parsing, an improved GAM-YOLO model is combined with topology identification to transform diagrams into circuit netlists. Then, a space-mapping-enhanced optimization framework is invoked to optimize the layout in the document. Experimental evaluations demonstrate that DocEDA significantly enhances the efficiency of processing circuit design documents and the accuracy of electrical parameter extraction. It exhibits adaptability to various circuit design scenarios and document formats, offering a novel solution for EDA with the potential to transform traditional methodologies.
zh
[NLP-87] Specifications: The missing link to making the development of LLM systems an engineering discipline
【速读】: 该论文试图解决生成式 AI (Generative AI) 在构建模块化和可靠系统方面的挑战,特别是如何为基于大语言模型 (LLM) 的组件(如代理)定义明确的规范 (specification)。解决方案的关键在于通过改进规范的定义,利用结构化输出、过程监督和测试时计算等技术,推动模块化和可靠的 LLM 系统的发展。
链接: https://arxiv.org/abs/2412.05299
作者: Ion Stoica,Matei Zaharia,Joseph Gonzalez,Ken Goldberg,Hao Zhang,Anastasios Angelopoulos,Shishir G. Patil,Lingjiao Chen,Wei-Lin Chiang,Jared Q. Davis
关键词-EN: significant strides made, short years, significant strides, robust systems, strides made
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite the significant strides made by generative AI in just a few short years, its future progress is constrained by the challenge of building modular and robust systems. This capability has been a cornerstone of past technological revolutions, which relied on combining components to create increasingly sophisticated and reliable systems. Cars, airplanes, computers, and software consist of components-such as engines, wheels, CPUs, and libraries-that can be assembled, debugged, and replaced. A key tool for building such reliable and modular systems is specification: the precise description of the expected behavior, inputs, and outputs of each component. However, the generality of LLMs and the inherent ambiguity of natural language make defining specifications for LLM-based components (e.g., agents) both a challenging and urgent problem. In this paper, we discuss the progress the field has made so far-through advances like structured outputs, process supervision, and test-time compute-and outline several future directions for research to enable the development of modular and reliable LLM-based systems through improved specifications.
zh
[NLP-88] StackEval: Benchmarking LLMs in Coding Assistance
【速读】: 该论文试图解决语言模型在编码辅助任务中的性能评估问题,涵盖代码编写、调试、代码审查和概念理解等方面。解决方案的关键在于提出了两个全面的基准测试:StackEval 和 StackUnseen。StackEval 是一个基于 Stack Overflow 问题的大规模基准,而 StackUnseen 则是一个动态基准,包含最新的 Stack Overflow 内容。这些基准不仅评估了语言模型(LLMs)在处理新内容和新兴内容方面的能力,还通过一个经过人工标注的数据集评估了 LLMs 作为编码任务评判者的能力及其潜在偏见。论文的核心贡献在于通过这些基准提供了对 LLMs 在编码辅助任务中的能力和局限性的深入洞察,并强调了这些基准在推动 LLM 开发和应用中的潜力。
链接: https://arxiv.org/abs/2412.05288
作者: Nidhish Shah,Zulkuf Genc,Dogu Araci
关键词-EN: Stack Overflow content, covering code writing, Stack Overflow questions, Stack Overflow, recent Stack Overflow
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present two comprehensive benchmarks to evaluate the performance of language models in coding assistance tasks, covering code writing, debugging, code review, and conceptual understanding. Our main contribution includes two curated datasets: StackEval, a large-scale benchmark derived from Stack Overflow questions, and StackUnseen, a dynamic benchmark featuring the most recent Stack Overflow content. These benchmarks offer novel insights into the capabilities and limitations of LLMs, particularly in handling new and emerging content. Additionally, we assess LLMs’ proficiency as judges for coding tasks using a curated, human-annotated dataset, exploring their evaluation capabilities and potential biases, including whether they favor their own generated solutions. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance. To ensure reproducibility, we publicly share our datasets and evaluation code at this https URL .
zh
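StackEval 还评估了 LLM 作为编码任务评审(LLM-as-judge)的能力。下面是一个让评审 LLM 对照参考答案给候选回答打分的最小示意(并非 StackEval 的官方评测代码;`llm_complete` 接口与 1-5 分制均为假设)。

```python
def judge_coding_answer(question: str, reference: str, candidate: str, llm_complete) -> int:
    """让评审 LLM 参照被采纳答案给候选回答打 1-5 分(示意)。"""
    prompt = (
        "You are grading a coding-assistance answer.\n"
        f"Question:\n{question}\n\nAccepted answer (reference):\n{reference}\n\n"
        f"Candidate answer:\n{candidate}\n\n"
        "Rate the candidate from 1 (unacceptable) to 5 (as good as or better than the reference). "
        "Reply with a single integer."
    )
    reply = llm_complete(prompt).strip()
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # 解析失败时保守给最低分

if __name__ == "__main__":
    fake_judge = lambda p: "4"
    print(judge_coding_answer("How to reverse a list in Python?",
                              "Use lst[::-1] or lst.reverse().",
                              "Call reversed(lst) and wrap it with list().",
                              fake_judge))
```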
[NLP-89] StarWhisper Telescope: Agent-Based Observation Assistant System to Approach AI Astrophysicist
链接: https://arxiv.org/abs/2412.06412
作者: Cunshi Wang,Xinjie Hu,Yu Zhang,Xunhao Chen,Pengliang Du,Yiming Mao,Rui Wang,Yuyang Li,Ying Wu,Hang Yang,Yansong Li,Beichuan Wang,Haiyang Mu,Zheng Wang,Jianfeng Tian,Liang Ge,Yongna Mao,Shengming Li,Xiaomeng Lu,Jinhang Zou,Yang Huang,Ningchen Sun,Jie Zheng,Min He,Yu Bai,Junjie Jin,Hong Wu,Chaohui Shang,Jifeng Liu
关键词-EN:
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 18 figures
[NLP-90] Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective
链接: https://arxiv.org/abs/2412.06033
作者: Andrew Jesson,Nicolas Beltran-Velez,David Blei
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
[NLP-91] Training-Free Bayesianization for Low-Rank Adapters of Large Language Models
链接: https://arxiv.org/abs/2412.05723
作者: Haizhou Shi,Yibin Wang,Ligong Han,Huan Zhang,Hao Wang
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Pre-print, work in progress
[NLP-92] Comprehensive Evaluation of Multimodal AI Models in Medical Imaging Diagnosis: From Data Augmentation to Preference-Based Comparison
链接: https://arxiv.org/abs/2412.05536
作者: Cailian Ruan,Chengyue Huang,Yahe Yang
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
计算机视觉
[CV-0] [MASK] is All You Need
【速读】: 该论文试图解决生成式模型中两种主要范式(即基于掩码生成模型和基于非自回归扩散模型)之间的连接问题,并探索其在视觉领域的可扩展性。解决方案的关键在于提出了一种名为“离散插值”(Discrete Interpolants)的框架,通过在离散状态模型中引入[MASK]机制,将掩码生成模型与非自回归扩散模型以及生成任务与判别任务进行桥接。具体来说,该框架通过统一的分析空间对两种模型进行逐步分析,并将典型的判别任务(如图像分割)重新表述为从[MASK]标记中解码的过程,从而实现灵活的条件采样。这种方法不仅提升了模型在多个基准数据集上的性能,还展示了其在不同任务间的通用性和扩展性。
链接: https://arxiv.org/abs/2412.06787
作者: Vincent Tao Hu,Björn Ommer
关键词-EN: next-set prediction-based Masked, next-noise prediction-based Non-Autoregressive, prediction-based Masked Generative, prediction-based Non-Autoregressive Models, next-set prediction-based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report (WIP), Project Page(code, model, dataset): this https URL
点击查看摘要
Abstract:In generative models, two paradigms have gained attraction in various applications: next-set prediction-based Masked Generative Models and next-noise prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this work, we propose using discrete-state models to connect them and explore their scalability in the vision domain. First, we conduct a step-by-step analysis in a unified design space across two types of models including timestep-independence, noise schedule, temperature, guidance strength, etc in a scalable manner. Second, we re-cast typical discriminative tasks, e.g., image segmentation, as an unmasking process from [MASK] tokens on a discrete-state model. This enables us to perform various sampling processes, including flexible conditional sampling by only training once to model the joint distribution. All aforementioned explorations lead to our framework named Discrete Interpolants, which enables us to achieve state-of-the-art or competitive performance compared to previous discrete-state based methods in various benchmarks, like ImageNet256, MS COCO, and video dataset FaceForensics. In summary, by leveraging [MASK] in discrete-state models, we can bridge Masked Generative and Non-autoregressive Diffusion models, as well as generative and discriminative tasks.
zh
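为了直观说明"从全 [MASK] 序列逐步去掩码"的采样过程,下面用 PyTorch 写一个简化示意(并非 Discrete Interpolants 的官方实现;`model` 的接口、温度与揭开调度均为假设)。

```python
import torch

@torch.no_grad()
def iterative_unmask(model, seq_len, mask_id, steps=8, temperature=1.0, device="cpu"):
    """从全 [MASK] 序列出发,按置信度分批揭开 token(示意)。
    model(tokens) 假设返回形状 (B, L, vocab_size) 的 logits。"""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens) / temperature
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)                      # 每个位置最可能的 token 及其置信度
        still_masked = tokens.eq(mask_id)
        conf = conf.masked_fill(~still_masked, -1.0)    # 已揭开的位置不再参与
        # 线性调度:本步需要揭开的 token 数
        n_keep = int(seq_len * (step + 1) / steps) - int(seq_len * step / steps)
        idx = conf.topk(max(n_keep, 1), dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens

if __name__ == "__main__":
    V, MASK = 16, 16  # 假设词表含 16 个普通 token,外加一个 [MASK]
    dummy = lambda t: torch.randn(t.shape[0], t.shape[1], V)
    print(iterative_unmask(dummy, seq_len=12, mask_id=MASK))
```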
[CV-1] Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis
【速读】: 该论文试图解决现有神经系统在生成语义丰富的伴随手势(co-speech gestures)方面的挑战,这些系统虽然能够生成节奏性的节拍手势,但在生成语义上有意义的手势时表现不佳。解决方案的关键在于提出了RAG-Gesture,一种基于扩散模型的手势生成方法,该方法利用检索增强生成(Retrieval Augmented Generation, RAG)技术,通过显式的语言知识从伴随手势数据库中检索示例动作,并在推理时通过DDIM反演和检索引导将这些语义示例手势注入到手势生成流程中,无需额外训练。此外,该方法还提出了一种控制范式,允许用户调节每个检索插入对生成序列的影响程度。
链接: https://arxiv.org/abs/2412.06786
作者: M. Hamza Mughal,Rishabh Dabral,Merel C.J. Scholman,Vera Demberg,Christian Theobalt
关键词-EN: Non-verbal communication, semantically rich gestures, gesture generation, gesture generation approach, communication often comprises
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Project page: this https URL
点击查看摘要
Abstract:Non-verbal communication often comprises of semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for the existing neural systems that can generate rhythmic beat gestures, but struggle to produce semantically meaningful gestures. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, we then inject these semantic exemplar gestures into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at the inference time without any need of training. Further, we propose a control paradigm for guidance, that allows the users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.
zh
[CV-2] Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation NEURIPS2024
【速读】: 该论文试图解决现有3D生成方法在几何细节上的不足,即生成的3D资产表面过于光滑或几何细节不准确地嵌入到反照率图(albedo maps)中。解决方案的关键在于引入触觉(touch)作为额外的模态,通过设计轻量级的3D纹理场(3D texture field)来合成视觉和触觉纹理,并利用2D扩散模型先验在视觉和触觉域的引导。具体方法包括:视觉纹理生成受高分辨率触觉法线(tactile normals)的调节,触觉纹理的局部优化则通过定制的TextureDreambooth进行。此外,论文提出了一种多部分生成管道(multi-part generation pipeline),以在不同区域合成不同的纹理。通过这些创新,论文首次利用高分辨率触觉传感来增强3D生成任务中的几何细节,实验结果表明该方法在保持视觉与触觉模态准确对齐的同时,提供了定制化和逼真的精细几何纹理。
链接: https://arxiv.org/abs/2412.06785
作者: Ruihan Gao,Kangle Deng,Gengshan Yang,Wenzhen Yuan,Jun-Yan Zhu
关键词-EN: shown visually compelling, visually compelling results, compelling results powered, shown visually, visually compelling
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to NeurIPS 2024. Project webpage: this https URL Code: this https URL
点击查看摘要
Abstract:3D generation methods have shown visually compelling results powered by diffusion image priors. However, they often fail to produce realistic geometric details, resulting in overly smooth surfaces or geometric details inaccurately baked in albedo maps. To address this, we introduce a new method that incorporates touch as an additional modality to improve the geometric details of generated 3D assets. We design a lightweight 3D texture field to synthesize visual and tactile textures, guided by 2D diffusion model priors on both visual and tactile domains. We condition the visual texture generation on high-resolution tactile normals and guide the patch-based tactile texture refinement with a customized TextureDreambooth. We further present a multi-part generation pipeline that enables us to synthesize different textures across various regions. To our knowledge, we are the first to leverage high-resolution tactile sensing to enhance geometric details for 3D generation tasks. We evaluate our method in both text-to-3D and image-to-3D settings. Our experiments demonstrate that our method provides customized and realistic fine geometric textures while maintaining accurate alignment between two modalities of vision and touch.
zh
[CV-3] P3-PO: Prescriptive Point Priors for Visuo-Spatial Generalization of Robot Policies
【速读】: 该论文试图解决机器人学习中策略泛化能力不足的问题,特别是在面对不同环境条件和物体实例时,现有方法往往难以实现鲁棒的泛化。解决方案的关键在于提出了一个名为P3-PO(Prescriptive Point Priors for Policies)的新框架,通过利用计算机视觉和机器人学习的最新进展,构建了一种独特的环境状态表示。具体来说,该方法首先由人类标注者在单个演示帧上标注一组语义上有意义的点,然后利用现有的视觉模型将这些点在整个数据集中传播,最终将这些点作为输入用于最先进的策略架构进行策略学习。实验结果表明,P3-PO在四个真实世界任务中相较于先前方法在相同训练设置下提升了43%,并在新物体实例和更复杂环境中分别取得了58%和80%的性能提升。
链接: https://arxiv.org/abs/2412.06784
作者: Mara Levy,Siddhant Haldar,Lerrel Pinto,Abhinav Shirivastava
关键词-EN: robustly handle varied, handle varied environmental, varied environmental conditions, Developing generalizable robot, generalizable robot policies
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Developing generalizable robot policies that can robustly handle varied environmental conditions and object instances remains a fundamental challenge in robot learning. While considerable efforts have focused on collecting large robot datasets and developing policy architectures to learn from such data, naively learning from visual inputs often results in brittle policies that fail to transfer beyond the training data. This work presents Prescriptive Point Priors for Policies or P3-PO, a novel framework that constructs a unique state representation of the environment leveraging recent advances in computer vision and robot learning to achieve improved out-of-distribution generalization for robot manipulation. This representation is obtained through two steps. First, a human annotator prescribes a set of semantically meaningful points on a single demonstration frame. These points are then propagated through the dataset using off-the-shelf vision models. The derived points serve as an input to state-of-the-art policy architectures for policy learning. Our experiments across four real-world tasks demonstrate an overall 43% absolute improvement over prior methods when evaluated in identical settings as training. Further, P3-PO exhibits 58% and 80% gains across tasks for new object instances and more cluttered environments respectively. Videos illustrating the robot’s performance are best viewed at this http URL.
zh
[CV-4] CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
【速读】: 该论文试图解决机器人视觉运动策略学习中扩散模型(diffusion-based models)在动作轨迹生成方面存在的效率低下和灵活性受限的问题。解决方案的关键在于提出了粗到细自回归策略(Coarse-to-Fine AutoRegressive Policy, CARP),通过将动作生成过程解耦为两个阶段:首先,动作自编码器(action autoencoder)学习整个动作序列的多尺度表示;然后,GPT风格的Transformer通过粗到细的自回归过程对序列预测进行精细化。CARP不仅在精度和平滑性上与扩散模型相当甚至更优,而且在效率上与传统自回归模型相当,同时提供了更高的灵活性和更快的推理速度。
链接: https://arxiv.org/abs/2412.06782
作者: Zhefei Gong,Pengxiang Ding,Shangke Lyu,Siteng Huang,Mingyang Sun,Wei Zhao,Zhaoxin Fan,Donglin Wang
关键词-EN: visuomotor policy learning, traditional autoregressive models, achieved significant success, achieved significant, improving the accuracy
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
zh
[CV-5] Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
【速读】: 该论文试图解决传统视觉地理定位方法的确定性问题,即现有方法忽略了图像定位中的固有模糊性。解决方案的关键在于提出了首个基于扩散 (diffusion) 和黎曼流形匹配 (Riemannian flow matching) 的生成式地理定位方法,该方法通过在地球表面直接进行去噪过程,实现了对图像可能位置的概率分布预测,而非单一的确定性定位。这一方法不仅在多个基准测试中达到了最先进的性能,还引入了概率性视觉地理定位任务及其相应的评估指标和基线,展示了扩散方法在处理模糊性方面的优势。
链接: https://arxiv.org/abs/2412.06781
作者: Nicolas Dufour,David Picard,Vicky Kalogeiton,Loic Landrieu
关键词-EN: Global visual geolocation, Global visual, Global, visual geolocation, captured on Earth
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL
点击查看摘要
Abstract:Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth’s surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.
zh
[CV-6] Diverse Score Distillation
【速读】: 该论文试图解决现有评分蒸馏(score distillation)方法在3D优化中输出多样性不足的问题。尽管底层扩散模型能够生成多样化的样本,但现有的评分蒸馏方法在模式寻求(mode-seeking)优化过程中限制了输出的多样性。论文提出的解决方案关键在于引入一种新的评分公式,该公式受去噪扩散采样过程的启发,通过随机初始种子定义生成路径,从而确保多样性。此外,论文还提出了一种近似方法,以适应优化过程中可能无法精确遵循生成路径的情况(如3D表示的渲染在共依赖方式下演化)。通过这种“多样化评分蒸馏”(Diverse Score Distillation, DSD)方法,论文在2D优化、基于文本的3D推理和单视图重建等任务中展示了其应用,并验证了其在提高样本多样性同时保持保真度的有效性。
链接: https://arxiv.org/abs/2412.06780
作者: Yanbo Xu,Jayanth Srinivasa,Gaowen Liu,Shubham Tulsiani
关键词-EN: Score distillation, score distillation formulations, Diverse Score Distillation, powerful mechanism, existing score distillation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Score distillation of 2D diffusion models has proven to be a powerful mechanism to guide 3D optimization, for example enabling text-based 3D generation or single-view reconstruction. A common limitation of existing score distillation formulations, however, is that the outputs of the (mode-seeking) optimization are limited in diversity despite the underlying diffusion model being capable of generating diverse samples. In this work, inspired by the sampling process in denoising diffusion, we propose a score formulation that guides the optimization to follow generation paths defined by random initial seeds, thus ensuring diversity. We then present an approximation to adopt this formulation for scenarios where the optimization may not precisely follow the generation paths (e.g. a 3D representation whose renderings evolve in a co-dependent manner). We showcase the applications of our `Diverse Score Distillation’ (DSD) formulation across tasks such as 2D optimization, text-based 3D inference, and single-view reconstruction. We also empirically validate DSD against prior score distillation formulations and show that it significantly improves sample diversity while preserving fidelity.
zh
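为帮助理解上文所说的评分蒸馏(score distillation)更新,下面给出一次标准 SDS 梯度计算的简化示意,并粗略示意"为每个样本固定随机种子以锚定生成路径"的想法(并非 DSD 的官方实现;`unet` 接口与权重 w(t) 的取法均为假设)。

```python
import torch

def sds_step(x, unet, alphas_cumprod, seed, device="cpu"):
    """计算一次 score-distillation 的更新方向(示意)。
    unet(x_t, t) 假设返回噪声预测 eps_hat;seed 固定该样本的噪声,
    粗略对应"沿固定初始种子定义的生成路径优化"这一思想。"""
    g = torch.Generator(device=device).manual_seed(seed)  # 每个样本一个固定种子
    t = torch.randint(20, 980, (1,), generator=g, device=device)
    eps = torch.randn(x.shape, generator=g, device=device)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * eps          # 前向加噪
    eps_hat = unet(x_t, t)                                  # 可含 classifier-free guidance
    w = 1 - a_t                                             # 常见的加权选择,这里仅作示例
    grad = w * (eps_hat - eps)                              # SDS 对 x 的梯度方向
    return grad.detach()

if __name__ == "__main__":
    x = torch.zeros(1, 3, 8, 8, requires_grad=True)
    acp = torch.linspace(0.999, 0.01, 1000)
    dummy_unet = lambda xt, t: torch.randn_like(xt)
    x.data -= 0.1 * sds_step(x, dummy_unet, acp, seed=42)
```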
[CV-7] Driv3R: Learning Dense 4D Reconstruction for Autonomous Driving
【速读】: 该论文试图解决动态场景的实时4D重建问题,特别是在自动驾驶感知中的应用。解决方案的关键在于提出了一种基于DUSt3R框架的Driv3R方法,该方法通过多视角图像序列直接回归每帧的点云图,并利用记忆池来推理传感器间的空间关系和动态时间上下文,以增强多视角3D一致性和时间整合。此外,通过4D流预测器识别场景中的移动物体,使网络更专注于重建这些动态区域,并以无优化的方式将所有帧的点云图对齐到世界坐标系。
链接: https://arxiv.org/abs/2412.06777
作者: Xin Fei,Wenzhao Zheng,Yueqi Duan,Wei Zhan,Masayoshi Tomizuka,Kurt Keutzer,Jiwen Lu
关键词-EN: autonomous driving perception, driving perception, remains a crucial, crucial challenge, challenge for autonomous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code is available at: this https URL
点击查看摘要
Abstract:Realtime 4D reconstruction for dynamic scenes remains a crucial challenge for autonomous driving perception. Most existing methods rely on depth estimation through self-supervision or multi-modality sensor fusion. In this paper, we propose Driv3R, a DUSt3R-based framework that directly regresses per-frame point maps from multi-view image sequences. To achieve streaming dense reconstruction, we maintain a memory pool to reason both spatial relationships across sensors and dynamic temporal contexts to enhance multi-view 3D consistency and temporal integration. Furthermore, we employ a 4D flow predictor to identify moving objects within the scene to direct our network focus more on reconstructing these dynamic regions. Finally, we align all per-frame pointmaps consistently to the world coordinate system in an optimization-free manner. We conduct extensive experiments on the large-scale nuScenes dataset to evaluate the effectiveness of our method. Driv3R outperforms previous frameworks in 4D dynamic scene reconstruction, achieving 15x faster inference speed compared to methods requiring global alignment. Code: this https URL.
zh
[CV-8] Visual Lexicon: Rich Image Features in Language Space
【速读】: 该论文试图解决在自然语言处理中难以同时捕捉高层次语义和精细视觉细节的问题。解决方案的关键在于提出了一种名为 Visual Lexicon (ViLex) 的新型视觉语言,它能够在词汇标记的文本空间中编码丰富的图像信息,同时保留复杂的视觉细节。ViLex 通过自监督学习生成优化后的标记,利用冻结的文本到图像 (T2I) 扩散模型来重建输入图像,从而实现高保真度的语义级重建。此外,ViLex 标记可以独立使用或与自然语言标记结合,以提示预训练的 T2I 模型,增强视觉和文本输入的交互,提升视觉语言模型的性能。
链接: https://arxiv.org/abs/2412.06774
作者: XuDong Wang,Xingyi Zhou,Alireza Fathi,Trevor Darrell,Cordelia Schmid
关键词-EN: present Visual Lexicon, retaining intricate visual, Visual Lexicon, intricate visual details, retaining intricate
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Tech report. 16 pages, 10 figures
点击查看摘要
Abstract:We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as “text tokens” or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings–even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.
zh
[CV-9] Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
【速读】: 该论文试图解决生成式 AI 模型(Generative AI models)中用户提示(user prompts)不明确导致生成结果不理想的问题,尤其是在文本到图像(Text-to-Image, T2I)生成领域。解决方案的关键在于设计一种主动式 T2I 代理(proactive T2I agents),该代理具备两个核心功能:(1) 在不确定用户意图时主动提出澄清问题;(2) 以可编辑的信念图(belief graph)形式展示其对用户意图的理解,使用户能够直观地调整和确认。通过构建原型并进行人类研究和自动化评估,研究验证了该方法的有效性,表明这些代理显著提升了用户与模型之间的意图对齐效果,并在多个基准测试中取得了至少两倍的视觉问答分数(VQAScore)提升。
链接: https://arxiv.org/abs/2412.06771
作者: Meera Hahn,Wenjun Zeng,Nithish Kannen,Rich Galt,Kartikeya Badola,Been Kim,Zi Wang
关键词-EN: leading to sub-optimal, sub-optimal responses, User, agents, users commonly struggle
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:User prompts for generative AI models are often underspecified, leading to sub-optimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user’s vision and the model’s interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an understandable belief graph that a user can edit. We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents, one with a ground truth image and the other tries to ask as few questions as possible to align with the ground truth. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard single-turn T2I generation. Demo: this https URL.
zh
[CV-10] Dynamic EventNeRF: Reconstructing General Dynamic Scenes from Multi-view Event Cameras
【速读】: 该论文试图解决动态场景的体积重建问题,特别是在光照条件差和快速运动情况下的挑战。解决方案的关键在于利用事件相机(event cameras)的特性,即它们能够异步记录像素亮度的变化,从而减少对光照的依赖,更适合捕捉快速运动。论文提出了一种从稀疏的多视角事件流和稀疏RGB帧中进行时空重建的方法,通过训练一系列交叉淡化的时序条件NeRF模型(time-conditioned NeRF models),每个模型对应一个短时间段的记录。该方法通过事件和RGB数据的双重监督以及稀疏视角正则化来实现高质量的重建,并在实际多视角相机系统中验证了其有效性,超越了基于RGB的基准方法,达到了最先进的性能。
链接: https://arxiv.org/abs/2412.06770
作者: Viktor Rudnev,Gereon Fox,Mohamed Elgharib,Christian Theobalt,Vladislav Golyanik
关键词-EN: Volumetric reconstruction, computer vision, important problem, problem in computer, Volumetric
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 11 figures, 6 tables
点击查看摘要
Abstract:Volumetric reconstruction of dynamic scenes is an important problem in computer vision. It is especially challenging in poor lighting and with fast motion. It is partly due to the limitations of RGB cameras: To capture fast motion without much blur, the framerate must be increased, which in turn requires more lighting. In contrast, event cameras, which record changes in pixel brightness asynchronously, are much less dependent on lighting, making them more suitable for recording fast motion. We hence propose the first method to spatiotemporally reconstruct a scene from sparse multi-view event streams and sparse RGB frames. We train a sequence of cross-faded time-conditioned NeRF models, one per short recording segment. The individual segments are supervised with a set of event- and RGB-based losses and sparse-view regularisation. We assemble a real-world multi-view camera rig with six static event cameras around the object and record a benchmark multi-view event stream dataset of challenging motions. Our work outperforms RGB-based baselines, producing state-of-the-art results, and opens up the topic of multi-view event-based reconstruction as a new path for fast scene capture beyond RGB cameras. The code and the data will be released soon at this https URL
zh
[CV-11] MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views ATC
【速读】: 该论文试图解决从稀疏视角样本中同时实现高质量的3D表面网格恢复和逼真的新视角合成的难题。解决方案的关键在于提出了一种新颖的外观模型,称为MAtCha(MAtCha Gaussians),它通过将场景几何建模为图集(Atlas of Charts)并在其上渲染二维高斯面元(Gaussian surfels)来实现这一目标。MAtCha从现成的单目深度估计器中提取高频表面细节,并通过高斯面元渲染进行细化,从而在神经体积渲染的逼真度和网格模型的清晰几何之间实现了平衡。核心创新包括一种新颖的神经变形模型和结构损失,这些技术在保留从单目深度中提取的精细表面细节的同时,解决了其固有的尺度模糊问题。实验结果表明,MAtCha在表面重建质量和逼真度方面达到了最先进水平,同时显著减少了输入视角数量和计算时间。
链接: https://arxiv.org/abs/2412.06767
作者: Antoine Guédon,Tomoki Ichikawa,Kohei Yamashita,Ko Nishino
关键词-EN: sparse view samples, realizes explicit high-quality, simultaneously realizes explicit, surface mesh recovery, simultaneously realizes
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Webpage: this https URL
点击查看摘要
Abstract:We present a novel appearance model that simultaneously realizes explicit high-quality 3D surface mesh recovery and photorealistic novel view synthesis from sparse view samples. Our key idea is to model the underlying scene geometry Mesh as an Atlas of Charts which we render with 2D Gaussian surfels (MAtCha Gaussians). MAtCha distills high-frequency scene surface details from an off-the-shelf monocular depth estimator and refines it through Gaussian surfel rendering. The Gaussian surfels are attached to the charts on the fly, satisfying photorealism of neural volumetric rendering and crisp geometry of a mesh model, i.e., two seemingly contradicting goals in a single model. At the core of MAtCha lies a novel neural deformation model and a structure loss that preserve the fine surface details distilled from learned monocular depths while addressing their fundamental scale ambiguities. Results of extensive experimental validation demonstrate MAtCha’s state-of-the-art quality of surface reconstruction and photorealism on-par with top contenders but with dramatic reduction in the number of input views and computational time. We believe MAtCha will serve as a foundational tool for any visual application in vision, graphics, and robotics that require explicit geometry in addition to photorealism. Our project page is the following: this https URL
zh
[CV-12] Ranking-aware adapter for text-driven image ordering with CLIP
【速读】: 该论文试图解决现有视觉-语言模型(VLMs)在处理多图像场景时依赖单一图像和文本提示的局限性,特别是在图像排序和检索任务中缺乏对多图像的综合理解能力。解决方案的关键在于将CLIP模型重新构建为学习排序任务,并通过引入轻量级适配器(adapter)来增强CLIP的文本引导图像排序能力。具体来说,该方法通过可学习的提示(learnable prompts)适应新的排序指令,并利用带有排序感知注意力(ranking-aware attention)的辅助分支,结合文本条件下的视觉差异进行额外监督,从而在图像排序任务中实现更全面的理解。该方法在多个任务中表现优于微调后的CLIP模型,并在特定任务如面部年龄估计和图像质量评估中与最先进的模型竞争。
链接: https://arxiv.org/abs/2412.06760
作者: Wei-Hsiang Yu,Yen-Yu Lin,Ming-Hsuan Yang,Yi-Hsuan Tsai
关键词-EN: made significant progress, require quantitative concepts, Recent advances, enabling VLMs, image quality assessment
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: github link: this https URL
点击查看摘要
Abstract:Recent advances in vision-language models (VLMs) have made significant progress in downstream tasks that require quantitative concepts such as facial age estimation and image quality assessment, enabling VLMs to explore applications like image ranking and retrieval. However, existing studies typically focus on the reasoning based on a single image and heavily depend on text prompting, limiting their ability to learn comprehensive understanding from multiple images. To address this, we propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task and introduces a lightweight adapter to augment CLIP for text-guided image ranking. Specifically, our approach incorporates learnable prompts to adapt to new instructions for ranking purposes and an auxiliary branch with ranking-aware attention, leveraging text-conditioned visual differences for additional supervision in image ranking. Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks and achieves competitive results compared to state-of-the-art models designed for specific tasks like facial age estimation and image quality assessment. Overall, our approach primarily focuses on ranking images with a single instruction, which provides a natural and generalized way of learning from visual differences across images, bypassing the need for extensive text prompts tailored to individual tasks. Code is available: this https URL.
zh
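针对上文"把 CLIP 重构为学习排序任务并加轻量 adapter"的做法,下面给出一个在冻结 CLIP 特征之上训练成对排序损失的极简示意(并非论文官方实现;特征维度、adapter 结构与损失形式均为假设)。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankingAdapter(nn.Module):
    """叠加在冻结 CLIP 图像特征上的小型 adapter(示意)。"""
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, img_feat):
        return F.normalize(img_feat + self.net(img_feat), dim=-1)  # 残差式调整

def pairwise_rank_loss(img_feats, text_feat, gt_scores, adapter, margin=0.05):
    """按排序指令文本打分,并用 margin ranking loss 监督图像对的相对顺序(示意)。
    img_feats: (N, D) 冻结 CLIP 图像特征;text_feat: (D,) 指令文本特征;
    gt_scores: (N,) 真实的可比属性(如年龄、质量分)。"""
    feats = adapter(img_feats)
    sim = feats @ F.normalize(text_feat, dim=-1)                    # 文本条件下的打分
    i, j = torch.triu_indices(len(sim), len(sim), offset=1)
    sign = torch.sign(gt_scores[i] - gt_scores[j])                   # 谁应排在前面
    return F.margin_ranking_loss(sim[i], sim[j], sign, margin=margin)

if __name__ == "__main__":
    adapter = RankingAdapter()
    imgs, txt = torch.randn(6, 512), torch.randn(512)
    loss = pairwise_rank_loss(imgs, txt, torch.tensor([1., 3., 2., 5., 4., 0.]), adapter)
    loss.backward()
```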
[CV-13] InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention
【速读】: 该论文试图解决面部图像恢复中的几个关键问题,包括多样化的退化类型、实时处理需求以及最重要的身份特征保留。现有方法在处理速度和恢复质量上存在不足,尤其是在严重退化情况下,难以准确重建细粒度的身份细节。解决方案的关键在于引入了一种名为InstantRestore的新框架,该框架利用单步图像扩散模型和注意力共享机制,实现了快速且个性化的面部恢复。此外,InstantRestore通过引入一种新颖的landmark attention loss(地标注意力损失),对齐关键面部地标以优化注意力图,从而增强身份特征的保留。在推理阶段,InstantRestore通过一次前向传播即可实现近实时性能,无需依赖全扩散过程或针对每个身份的模型调优,提供了适用于大规模应用的可扩展解决方案。
链接: https://arxiv.org/abs/2412.06753
作者: Howard Zhang,Yuval Alaluf,Sizhuo Ma,Achuta Kadambi,Jian Wang,Kfir Aberman
关键词-EN: diverse degradation types, real-time processing demands, identity-specific features, aims to enhance, addressing challenges
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (~4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.
zh
[CV-14] 3D Graph Attention Networks for High Fidelity Pediatric Glioma Segmentation
【速读】: 该论文试图解决儿童脑肿瘤(尤其是胶质瘤)在神经影像数据中的早期、准确分割问题,以支持有效的诊断和治疗规划。解决方案的关键在于提出了一种新型的3D UNet架构,结合空间注意力机制(spatial attention mechanism),能够自动分割儿童胶质瘤。该模型通过多参数MRI数据,捕捉多尺度特征并选择性地关注肿瘤相关区域,从而提高分割精度并减少周围组织的干扰。通过Dice相似系数和HD95等指标的定量评估,证明了该方法在复杂胶质瘤结构分割中的有效性,为自动化儿童胶质瘤分割提供了有前景的进展,有望改善临床决策和治疗效果。
链接: https://arxiv.org/abs/2412.06743
作者: Harish Thangaraj,Diya Katariya,Eshaan Joshi,Sangeetha N
关键词-EN: cancer related mortality, infiltrative growth patterns, complex infiltrative growth, represent a significant, complicate treatment
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures
点击查看摘要
Abstract:Pediatric brain tumors, particularly gliomas, represent a significant cause of cancer related mortality in children with complex infiltrative growth patterns that complicate treatment. Early, accurate segmentation of these tumors in neuroimaging data is crucial for effective diagnosis and intervention planning. This study presents a novel 3D UNet architecture with a spatial attention mechanism tailored for automated segmentation of pediatric gliomas. Using the BraTS pediatric glioma dataset with multiparametric MRI data, the proposed model captures multi-scale features and selectively attends to tumor relevant regions, enhancing segmentation precision and reducing interference from surrounding tissue. The model’s performance is quantitatively evaluated using the Dice similarity coefficient and HD95, demonstrating improved delineation of complex glioma structured. This approach offers a promising advancement in automating pediatric glioma segmentation, with the potential to improve clinical decision making and outcomes.
zh
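下面是一个可插在 3D UNet 跳跃连接上的空间注意力门的简化示意,用于说明"选择性关注肿瘤相关区域"的机制(并非论文官方实现;通道数与门控形式均为假设)。

```python
import torch
import torch.nn as nn

class SpatialAttentionGate3D(nn.Module):
    """根据门控特征 g 为跳跃特征 x 生成体素级注意力权重(示意)。"""
    def __init__(self, in_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv3d(in_ch, inter_ch, kernel_size=1)
        self.phi = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, g):
        # x: 编码器跳跃特征 (B, in_ch, D, H, W);g: 解码器门控特征(同空间尺寸)
        attn = torch.sigmoid(self.psi(self.act(self.theta(x) + self.phi(g))))
        return x * attn          # 抑制周围组织、突出肿瘤相关区域的响应

if __name__ == "__main__":
    gate = SpatialAttentionGate3D(in_ch=32, gate_ch=64, inter_ch=16)
    x, g = torch.randn(1, 32, 16, 16, 16), torch.randn(1, 64, 16, 16, 16)
    print(gate(x, g).shape)   # torch.Size([1, 32, 16, 16, 16])
```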
[CV-15] ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet
【速读】: 该论文试图解决深度学习模型在数据需求量大的问题,特别是在图像合成领域中,通过生成式 AI (Generative AI) 减少对真实数据的依赖。解决方案的关键在于提出了基于 Stable Diffusion 模型的 ContRail 框架,并结合 ControlNet 和多模态条件化方法,以生成高质量的合成铁路图像。这种方法不仅提升了铁路特定任务(如铁路语义分割)的性能,还通过丰富数据集来增强模型的表现。
链接: https://arxiv.org/abs/2412.06742
作者: Andrei-Robert Alexandrescu,Razvan-Gabriel Petec,Alexandru Manole,Laura-Silvia Diosan
关键词-EN: Deep Learning, ubiquitous paradigm due, numerous domains, extraordinary effectiveness, effectiveness and applicability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 2 tables
点击查看摘要
Abstract:Deep Learning became an ubiquitous paradigm due to its extraordinary effectiveness and applicability in numerous domains. However, the approach suffers from the high demand of data required to achieve the potential of this type of model. An ever-increasing sub-field of Artificial Intelligence, Image Synthesis, aims to address this limitation through the design of intelligent models capable of creating original and realistic images, endeavour which could drastically reduce the need for real data. The Stable Diffusion generation paradigm recently propelled state-of-the-art approaches to exceed all previous benchmarks. In this work, we propose the ContRail framework based on the novel Stable Diffusion model ControlNet, which we empower through a multi-modal conditioning method. We experiment with the task of synthetic railway image generation, where we improve the performance in rail-specific tasks, such as rail semantic segmentation by enriching the dataset with realistic synthetic images.
zh
[CV-16] Convolution goes higher-order: a biologically inspired mechanism empowers image classification
【速读】: 该论文试图解决传统卷积神经网络 (CNN) 在处理复杂视觉信息时可能存在的局限性,特别是对于高阶相关性的捕捉不足的问题。解决方案的关键在于引入可学习的高阶卷积操作,通过类似于Volterra级数扩展的方式,捕捉生物视觉处理中的乘性交互作用。这种扩展能够有效处理自然图像中像素强度的高阶相关性,并在多个标准基准数据集上显著超越传统CNN的性能。通过系统性扰动分析和表示相似性分析,研究验证了不同阶数卷积对视觉信息处理的不同贡献,揭示了网络层间不同的几何结构,从而为构建更有效的生物启发式计算机视觉模型提供了新的路径。
链接: https://arxiv.org/abs/2412.06740
作者: Simone Azeglio,Olivier Marre,Peter Neri,Ulisse Ferrari
关键词-EN: classical convolutional neural, nonlinear biological visual, biological visual processing, learnable higher-order convolutions, complex nonlinear biological
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:We propose a novel approach to image classification inspired by complex nonlinear biological visual processing, whereby classical convolutional neural networks (CNNs) are equipped with learnable higher-order convolutions. Our model incorporates a Volterra-like expansion of the convolution operator, capturing multiplicative interactions akin to those observed in early and advanced stages of biological visual processing. We evaluated this approach on synthetic datasets by measuring sensitivity to testing higher-order correlations and performance in standard benchmarks (MNIST, FashionMNIST, CIFAR10, CIFAR100 and Imagenette). Our architecture outperforms traditional CNN baselines, and achieves optimal performance with expansions up to 3rd/4th order, aligning remarkably well with the distribution of pixel intensities in natural images. Through systematic perturbation analysis, we validate this alignment by isolating the contributions of specific image statistics to model performance, demonstrating how different orders of convolution process distinct aspects of visual information. Furthermore, Representational Similarity Analysis reveals distinct geometries across network layers, indicating qualitatively different modes of visual information processing. Our work bridges neuroscience and deep learning, offering a path towards more effective, biologically inspired computer vision models. It provides insights into visual information processing and lays the groundwork for neural networks that better capture complex visual patterns, particularly in resource-constrained scenarios.
zh
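为说明"Volterra 式高阶卷积"中乘性交互项的形态,下面给出一个带低秩二阶项的卷积层示意(并非论文官方实现;用低秩分解近似完整二阶核,通道数与秩均为假设)。

```python
import torch
import torch.nn as nn

class VolterraConv2d(nn.Module):
    """一阶卷积 + 低秩二阶(乘性)项,粗略对应 Volterra 级数的前两阶(示意)。"""
    def __init__(self, in_ch, out_ch, k=3, rank=2):
        super().__init__()
        self.linear = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        # 二阶项:sum_r (W_r^a * x) ⊙ (W_r^b * x),以低秩方式近似完整二阶核
        self.quad_a = nn.ModuleList(nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for _ in range(rank))
        self.quad_b = nn.ModuleList(nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for _ in range(rank))

    def forward(self, x):
        out = self.linear(x)
        for a, b in zip(self.quad_a, self.quad_b):
            out = out + a(x) * b(x)     # 乘性交互项
        return out

if __name__ == "__main__":
    layer = VolterraConv2d(3, 8)
    print(layer(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 8, 32, 32])
```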
[CV-17] Take Fake as Real: Realistic-like Robust Black-box Adversarial Attack to Evade AIGC Detection
【速读】: 该论文旨在解决现有对抗攻击方法在生成式 AI (AIGC) 检测中的局限性,特别是针对基于生成对抗网络 (GAN) 和扩散模型的多类别自然图像检测器的攻击效果不佳以及攻击不可见性差的问题。解决方案的关键在于提出了一种名为 Realistic-like Robust Black-box Adversarial attack (R²BA) 的新型对抗攻击方法,该方法通过融合真实世界的后处理操作(如高斯模糊、JPEG 压缩、高斯噪声和光斑)来生成对抗样本。R²BA 利用随机粒子群算法优化后处理融合强度,并根据检测器的假概率动态调整检测器脆弱或鲁棒的后处理强度,从而在对抗性和不可见性之间取得平衡。实验结果表明,R²BA 在反检测性能、不可见性和鲁棒性方面均优于现有的白盒和黑盒攻击方法,为实际应用中的 AIGC 检测安全性提供了重要见解。
链接: https://arxiv.org/abs/2412.06727
作者: Caiyun Xie,Dengpan Ye,Yunming Zhang,Long Tang,Yunna Lv,Jiacheng Deng,Jiawei Song
关键词-EN: AI-generated content, multimedia content, AIGC detection, developing AIGC detection, AIGC
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The security of AI-generated content (AIGC) detection based on GANs and diffusion models is closely related to the credibility of multimedia content. Malicious adversarial attacks can evade these developing AIGC detection. However, most existing adversarial attacks focus only on GAN-generated facial images detection, struggle to be effective on multi-class natural images and diffusion-based detectors, and exhibit poor invisibility. To fill this gap, we first conduct an in-depth analysis of the vulnerability of AIGC detectors and discover the feature that detectors vary in vulnerability to different post-processing. Then, considering the uncertainty of detectors in real-world scenarios, and based on the discovery, we propose a Realistic-like Robust Black-box Adversarial attack (R²BA) with post-processing fusion optimization. Unlike typical perturbations, R²BA uses real-world post-processing, i.e., Gaussian blur, JPEG compression, Gaussian noise and light spot to generate adversarial examples. Specifically, we use a stochastic particle swarm algorithm with inertia decay to optimize post-processing fusion intensity and explore the detector’s decision boundary. Guided by the detector’s fake probability, R²BA enhances/weakens the detector-vulnerable/detector-robust post-processing intensity to strike a balance between adversariality and invisibility. Extensive experiments on popular/commercial AIGC detectors and datasets demonstrate that R²BA exhibits impressive anti-detection performance, excellent invisibility, and strong robustness in GAN-based and diffusion-based cases. Compared to state-of-the-art white-box and black-box attacks, R²BA shows significant improvements of 15% and 21% in anti-detection performance under the original and robust scenario respectively, offering valuable insights for the security of AIGC detection in real-world applications.
zh
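下面示意"把真实世界后处理(高斯模糊、JPEG 压缩、高斯噪声)按强度向量融合叠加,再交给检测器评估"的流程(并非 R²BA 官方实现;强度取值范围与检测器接口均为假设,粒子群搜索部分从略)。

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def apply_postprocessing(img: Image.Image, blur_sigma: float, jpeg_q: int, noise_std: float) -> Image.Image:
    """依次施加高斯模糊、JPEG 压缩、高斯噪声(示意;光斑项省略)。"""
    out = img.filter(ImageFilter.GaussianBlur(radius=blur_sigma))
    buf = io.BytesIO()
    out.save(buf, format="JPEG", quality=int(jpeg_q))
    out = Image.open(io.BytesIO(buf.getvalue())).convert("RGB")
    arr = np.asarray(out, dtype=np.float32)
    arr = np.clip(arr + np.random.normal(0, noise_std, arr.shape), 0, 255)
    return Image.fromarray(arr.astype(np.uint8))

def attack_score(img, intensities, detector):
    """detector(img) 假设返回"是 AI 生成"的概率;概率越低说明越能逃避检测。"""
    blur_sigma, jpeg_q, noise_std = intensities
    adv = apply_postprocessing(img, blur_sigma, jpeg_q, noise_std)
    return detector(adv), adv

if __name__ == "__main__":
    demo = Image.fromarray(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
    fake_detector = lambda im: 0.5          # 占位检测器
    score, adv = attack_score(demo, intensities=(1.0, 60, 4.0), detector=fake_detector)
    print(score, adv.size)
```

实际攻击中,三个强度分量即粒子群算法要搜索的变量,以 detector 输出的假概率作为适应度来迭代更新。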
[CV-18] Toward Non-Invasive Diagnosis of Bankart Lesions with Deep Learning
【速读】: 该论文试图解决Bankart损伤在标准磁共振成像(MRI)上的诊断难题,因其微妙的影像特征常需依赖侵入性的MRI关节造影(MRI arthrograms)。解决方案的关键在于开发深度学习(DL)模型,利用Swin Transformer架构在标准MRI和MRI关节造影上进行Bankart损伤的检测。通过整合来自矢状面、轴面和冠状面的预测结果,模型在标准MRI和MRI关节造影上的表现分别达到了0.87和0.90的AUC值,匹配或超越了放射科医生的表现,尤其是在非侵入性标准MRI上的表现。这表明DL模型能够提高诊断信心,减少对侵入性影像的依赖,并增强医疗的可及性。
链接: https://arxiv.org/abs/2412.06717
作者: Sahil Sethi,Sai Reddy,Mansi Sakarvadia,Jordan Serotte,Darlington Nwaudo,Nicholas Maassen,Lewis Shi
关键词-EN: glenoid labral tears, anterior-inferior glenoid labral, Bankart lesions, standard MRIs, standard
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for presentation at SPIE Medical Imaging 2025: Computer-Aided Diagnosis. The manuscript is expected to appear in the conference proceedings
点击查看摘要
Abstract:Bankart lesions, or anterior-inferior glenoid labral tears, are diagnostically challenging on standard MRIs due to their subtle imaging features-often necessitating invasive MRI arthrograms (MRAs). This study develops deep learning (DL) models to detect Bankart lesions on both standard MRIs and MRAs, aiming to improve diagnostic accuracy and reduce reliance on MRAs. We curated a dataset of 586 shoulder MRIs (335 standard, 251 MRAs) from 558 patients who underwent arthroscopy. Ground truth labels were derived from intraoperative findings, the gold standard for Bankart lesion diagnosis. Separate DL models for MRAs and standard MRIs were trained using the Swin Transformer architecture, pre-trained on a public knee MRI dataset. Predictions from sagittal, axial, and coronal views were ensembled to optimize performance. The models were evaluated on a 20% hold-out test set (117 MRIs: 46 MRAs, 71 standard MRIs). Bankart lesions were identified in 31.9% of MRAs and 8.6% of standard MRIs. The models achieved AUCs of 0.87 (86% accuracy, 83% sensitivity, 86% specificity) and 0.90 (85% accuracy, 82% sensitivity, 86% specificity) on standard MRIs and MRAs, respectively. These results match or surpass radiologist performance on our dataset and reported literature metrics. Notably, our model’s performance on non-invasive standard MRIs matched or surpassed the radiologists interpreting MRAs. This study demonstrates the feasibility of using DL to address the diagnostic challenges posed by subtle pathologies like Bankart lesions. Our models demonstrate potential to improve diagnostic confidence, reduce reliance on invasive imaging, and enhance accessibility to care.
zh
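论文把矢状面、轴面、冠状面三个视图模型的预测做集成,下面用最简单的概率平均给出示意(并非论文官方实现;模型接口与集成权重均为假设)。

```python
import torch

@torch.no_grad()
def ensemble_views(models, views, threshold=0.5):
    """models: {"sagittal": m1, "axial": m2, "coronal": m3},各模型输出病灶 logit。
    views: 对应视图的输入张量字典。返回集成概率与二分类结果(示意)。"""
    probs = [torch.sigmoid(models[k](views[k])) for k in ("sagittal", "axial", "coronal")]
    p = torch.stack(probs).mean(0)          # 简单平均;也可按验证集 AUC 加权
    return p, p >= threshold

if __name__ == "__main__":
    dummy = lambda x: torch.randn(x.shape[0], 1)
    models = {k: dummy for k in ("sagittal", "axial", "coronal")}
    views = {k: torch.randn(2, 3, 224, 224) for k in models}
    p, pred = ensemble_views(models, views)
    print(p.squeeze(-1), pred.squeeze(-1))
```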
[CV-19] Parkinsons Disease Diagnosis Through Deep Learning: A Novel LSTM-Based Approach for Freezing of Gait Detection
【速读】: 该论文试图解决帕金森病(Parkinson’s disease, PD)早期诊断的难题,特别是在症状不明显的初期阶段,区分健康个体与帕金森病患者的行为差异。解决方案的关键在于提出了一种基于长短期记忆网络(LSTM)的深度学习架构,用于自动检测帕金森病患者中的冻结步态(Freezing of Gait, FOG)事件。与传统机器学习算法不同,该方法无需手动特征工程,能够有效捕捉步态模式中的长期时间依赖性,从而提高诊断准确性。通过使用记忆块替代自连接隐藏单元,LSTM解决了梯度消失问题,并结合dropout和L2正则化技术防止过拟合,采用Adam优化器进行优化。实验结果表明,该方法在FOG事件检测上达到了97.71%的准确率、99%的敏感性、98%的精确度和96%的特异性,显著优于现有模型。
链接: https://arxiv.org/abs/2412.06709
作者: Aqib Nazir Mir,Iqra Nissar,Mumtaz Ahmed,Sarfaraz Masood,Danish Raza Rizvi
关键词-EN: Parkinson disease, extensive clinical datasets, learning holds tremendous, holds tremendous potential, Parkinson
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Deep learning holds tremendous potential in healthcare for uncovering hidden patterns within extensive clinical datasets, aiding in the diagnosis of various diseases. Parkinson’s disease (PD) is a neurodegenerative condition characterized by the deterioration of brain function. In the initial stages of PD, automatic diagnosis poses a challenge due to the similarity in behavior between individuals with PD and those who are healthy. Our objective is to propose an effective model that can aid in the early detection of Parkinson’s disease. We employed the VGRF gait signal dataset sourced from Physionet for distinguishing between healthy individuals and those diagnosed with Parkinson’s disease. This paper introduces a novel deep learning architecture based on the LSTM network for automatically detecting freezing of gait episodes in Parkinson’s disease patients. In contrast to conventional machine learning algorithms, this method eliminates manual feature engineering and proficiently captures prolonged temporal dependencies in gait patterns, thereby improving the diagnosis of Parkinson’s disease. The LSTM network resolves the issue of vanishing gradients by employing memory blocks in place of self-connected hidden units, allowing for optimal information assimilation. To prevent overfitting, dropout and L2 regularization techniques have been employed. Additionally, the stochastic gradient-based optimizer Adam is used for the optimization process. The results indicate that our proposed approach surpasses current state-of-the-art models in FOG episode detection, achieving an accuracy of 97.71%, sensitivity of 99%, precision of 98%, and specificity of 96%. This demonstrates its potential as a superior classification method for Parkinson’s disease detection.
zh
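下面给出一个带 dropout、L2 正则(weight decay)并用 Adam 优化的 LSTM 步态分类器的最小示意(并非论文官方实现;层数、维度与超参数均为假设)。

```python
import torch
import torch.nn as nn

class FOGClassifier(nn.Module):
    """对 VGRF 步态序列做二分类(健康 / PD)的 LSTM 示意模型。"""
    def __init__(self, in_dim=16, hidden=64, layers=2, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                            batch_first=True, dropout=dropout)
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(hidden, 1))

    def forward(self, x):                 # x: (B, T, in_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # 取最后时刻的隐状态做分类

if __name__ == "__main__":
    model = FOGClassifier()
    # weight_decay 对应 L2 正则
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    x, y = torch.randn(8, 100, 16), torch.randint(0, 2, (8, 1)).float()
    loss = nn.BCEWithLogitsLoss()(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))
```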
[CV-20] FlexEvent: Event Camera Object Detection at Arbitrary Frequencies
【速读】: 该论文试图解决现有基于事件相机(event-based cameras)的目标检测方法在固定频率模式下无法充分利用事件相机高时间分辨率和自适应性的问题。解决方案的关键在于提出了FlexEvent框架,该框架包含两个核心组件:FlexFuser,一个自适应的事件-帧融合模块,用于整合高频事件数据与RGB帧的丰富语义信息;以及FAL(Frequency-Adaptive Learning),一种频率自适应学习机制,生成频率调整的标签以增强模型在不同操作频率下的泛化能力。这两个组件的结合使得FlexEvent能够在动态环境中实现高精度的目标检测,并适应从20 Hz到180 Hz的广泛频率范围,显著优于现有方法。
链接: https://arxiv.org/abs/2412.06708
作者: Dongyue Lu,Lingdong Kong,Gim Hee Lee,Camille Simon Chane,Wei Tsang Ooi
关键词-EN: offer unparalleled advantages, microsecond-level temporal resolution, cameras offer unparalleled, Event cameras offer, asynchronous operation
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Preprint; 29 pages, 12 figures, 9 tables; Project Page at this https URL
点击查看摘要
Abstract:Event cameras offer unparalleled advantages for real-time perception in dynamic environments, thanks to their microsecond-level temporal resolution and asynchronous operation. Existing event-based object detection methods, however, are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event cameras. To address these limitations, we propose FlexEvent, a novel event camera object detection framework that enables detection at arbitrary frequencies. Our approach consists of two key components: FlexFuser, an adaptive event-frame fusion module that integrates high-frequency event data with rich semantic information from RGB frames, and FAL, a frequency-adaptive learning mechanism that generates frequency-adjusted labels to enhance model generalization across varying operational frequencies. This combination allows our method to detect objects with high accuracy in both fast-moving and static scenarios, while adapting to dynamic environments. Extensive experiments on large-scale event camera datasets demonstrate that our approach surpasses state-of-the-art methods, achieving significant improvements in both standard and high-frequency settings. Notably, our method maintains robust performance when scaling from 20 Hz to 90 Hz and delivers accurate detection up to 180 Hz, proving its effectiveness in extreme conditions. Our framework sets a new benchmark for event-based object detection and paves the way for more adaptable, real-time vision systems.
zh
[CV-21] You See it You Got it: Learning 3D Creation on Pose-Free Videos at Scale
【速读】: 该论文试图解决现有3D生成模型在3D内容创建中受限于有限规模的3D标签(gold-labels)或2D扩散先验的问题。解决方案的关键在于提出了See3D,一种视觉条件的多视图扩散模型,该模型通过大规模互联网视频数据进行训练,以实现开放世界的3D生成。See3D通过自动筛选多视图不一致性和不足观察的数据筛选管道,构建了一个高质量、多样化的多视图图像数据集WebVi3D,包含3.2亿帧来自1600万视频片段。为了克服缺乏显式3D几何或相机姿态标注的问题,论文引入了一种创新的视觉条件,即通过在掩码视频数据上添加时间依赖噪声生成的纯2D归纳视觉信号。最终,通过将See3D集成到基于扭曲的管道中,实现了高保真度的3D生成。该方法在零样本和开放世界生成任务中表现出色,显著优于依赖昂贵且受限3D数据集的模型。
链接: https://arxiv.org/abs/2412.06699
作者: Baorui Ma,Huachen Gao,Haoge Deng,Zhengxiong Luo,Tiejun Huang,Lulu Tang,Xinlong Wang
关键词-EN: models typically rely, rely on limited-scale, typically rely, Recent, generation models typically
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Recent 3D generation models typically rely on limited-scale 3D `gold-labels’ or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data – You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition - a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: this https URL
zh
[CV-22] Gen-3Diffusion: Realistic Image-to-3D Generation via 2D 3D Diffusion Synergy
【速读】: 该论文试图解决从单张RGB图像生成逼真的3D物体和穿着服装的虚拟形象的问题。解决方案的关键在于提出了Gen-3Diffusion,通过2D和3D扩散模型的协同工作来实现这一目标。具体来说,论文利用预训练的2D扩散模型和3D扩散模型,通过精心设计的同步过程在训练和采样阶段协同工作。这种协同带来了两个主要优势:1) 2D扩散模型通过其强大的泛化能力为3D扩散模型提供形状先验;2) 3D扩散模型增强了2D多视角采样过程的3D一致性,从而生成更准确的多视角图像。实验结果表明,该方法能够生成具有高保真几何和纹理的逼真3D物体和虚拟形象。
链接: https://arxiv.org/abs/2412.06698
作者: Yuxuan Xue,Xianghui Xie,Riccardo Marin,Gerard Pons-Moll
关键词-EN: single RGB image, single RGB, Creating realistic, RGB image, diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Creating realistic 3D objects and clothed avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful prior from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot guarantee the generated multi-view images are 3D consistent. In this paper, we propose Gen-3Diffusion: Realistic Image-to-3D Generation via 2D 3D Diffusion Synergy. We leverage a pre-trained 2D diffusion model and a 3D diffusion model via our elegantly designed process that synchronizes two diffusion models at both training and sampling time. The synergy between the 2D and 3D diffusion models brings two major advantages: 1) 2D helps 3D in generalization: the pretrained 2D model has strong generalization ability to unseen images, providing strong shape priors for the 3D diffusion model; 2) 3D helps 2D in multi-view consistency: the 3D diffusion model enhances the 3D consistency of 2D multi-view sampling process, resulting in more accurate multi-view generation. We validate our idea through extensive experiments in image-based objects and clothed avatar generation tasks. Results show that our method generates realistic 3D objects and avatars with high-fidelity geometry and texture. Extensive ablations also validate our design choices and demonstrate the strong generalization ability to diverse clothing and compositional shapes. Our code and pretrained models will be publicly released on this https URL.
zh
[CV-23] EMOv2: Pushing 5M Vision Model Frontier
【速读】: 该论文旨在解决在密集预测任务中开发参数高效且轻量级模型的问题,同时权衡参数数量、浮点运算次数(FLOPs)和模型性能。解决方案的关键在于重新思考轻量级模型的基础设施,特别是将卷积神经网络(CNN)中的倒残差块(Inverted Residual Block, IRB)扩展到基于注意力机制的模型,并抽象出一个统一的残差元移动块(Meta Mobile Block, MMBlock)。通过这一设计,论文提出了一种现代化的改进倒残差移动块(Improved Inverted Residual Mobile Block, i2RMB),并改进了分层的高效模型(Efficient MOdel, EMOv2),避免了复杂的结构设计。实验结果表明,EMOv2在多种视觉识别、密集预测和图像生成任务中表现优异,尤其是在5M参数规模下,EMOv2-5M在分类任务中达到了82.9%的Top-1准确率,并在目标检测任务中实现了41.5 mAP,显著超越了现有的CNN和注意力机制模型。
链接: https://arxiv.org/abs/2412.06674
作者: Jiangning Zhang,Teng Hu,Haoyang He,Zhucun Xue,Yabiao Wang,Chengjie Wang,Yong Liu,Xiangtai Li,Dacheng Tao
关键词-EN: Residual Mobile Block, Inverted Residual Block, Meta Mobile Block, Mobile Block, trading off parameters
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of the 5M magnitude lightweight model on various downstream tasks. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterparts have been recognized by attention-based design. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i2RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth and ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 that surpass equal-order CNN-/Attention-based models significantly. At the same time, EMOv2-5M equipped RetinaNet achieves 41.5 mAP for object detection tasks that surpasses the previous EMO-5M by +2.6. When employing the more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M magnitude models to a new level. Code is available at this https URL.
zh
[CV-24] ILLUME: Illuminating Your LLMs to See, Draw and Self-Enhance
【速读】: 该论文试图解决多模态大语言模型(MLLM)在图像-文本对齐任务中对大规模数据集的依赖问题,并提升模型在理解和生成能力上的协同增强。解决方案的关键在于设计了一种包含语义信息的视觉分词器(vision tokenizer)和渐进式多阶段训练过程,从而将预训练所需的数据集大小减少至15M,显著低于传统需求。此外,论文提出了一种新颖的自增强多模态对齐方案,通过监督模型自我评估文本描述与生成的图像之间的一致性,提升了图像解释的准确性,并避免了因对齐错误导致的生成不现实或错误预测的问题。
链接: https://arxiv.org/abs/2412.06673
作者: Chunwei Wang,Guansong Lu,Junwei Yang,Runhui Huang,Jianhua Han,Lu Hou,Wei Zhang,Hang Xu
关键词-EN: single large language, large language model, large language, multimodal large language, next-token prediction formulation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model through a unified next-token prediction formulation. To address the large dataset size typically required for image-text alignment, we propose to enhance data efficiency through the design of a vision tokenizer that incorporates semantic information and a progressive multi-stage training procedure. This approach reduces the dataset size to just 15M for pretraining – over four times fewer than what is typically needed – while achieving competitive or even superior performance with existing unified MLLMs, such as Janus. Additionally, to promote synergistic enhancement between understanding and generation capabilities, which is under-explored in previous works, we introduce a novel self-enhancing multimodal alignment scheme. This scheme supervises the MLLM to self-assess the consistency between text descriptions and self-generated images, facilitating the model to interpret images more accurately and avoid unrealistic and incorrect predictions caused by misalignment in image generation. Based on extensive experiments, our proposed ILLUME stands out and competes with state-of-the-art unified MLLMs and specialized models across various benchmarks for multimodal understanding, generation, and editing.
zh
[CV-25] Knowledge Transfer and Domain Adaptation for Fine-Grained Remote Sensing Image Segmentation
【速读】: 该论文试图解决细粒度遥感图像分割问题,特别是如何准确识别遥感图像中的详细对象。解决方案的关键在于提出了一种结合知识引导和领域细化的端到端学习范式,并通过两个核心模块实现:特征对齐模块 (Feature Alignment Module, FAM) 和特征调制模块 (Feature Modulation Module, FMM)。FAM 通过通道转换和空间插值将基于 CNN 的骨干网络特征与预训练的视觉变换器模型 (Vision Transformer Model, VTM) 编码器特征对齐,并通过 KL 散度和 L2 归一化约束进行知识传递。FMM 则进一步将知识适应到特定领域以解决领域偏移问题。实验结果表明,该方法在草地和云数据集上分别实现了 2.57 和 3.73 mIoU 的显著提升,展示了知识迁移和领域适应结合的潜力。
链接: https://arxiv.org/abs/2412.06664
作者: Shun Zhang,Xuechao Zou,Kai Li,Congyan Lang,Shiying Wang,Pin Tao,Tengfei Cao
关键词-EN: remote sensing image, Fine-grained remote sensing, accurately identifying detailed, remote sensing, sensing image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, 6 tables
点击查看摘要
Abstract:Fine-grained remote sensing image segmentation is essential for accurately identifying detailed objects in remote sensing images. Recently, vision transformer models (VTM) pretrained on large-scale datasets have shown strong zero-shot generalization, indicating that they have learned the general knowledge of object understanding. We introduce a novel end-to-end learning paradigm combining knowledge guidance with domain refinement to enhance performance. We present two key components: the Feature Alignment Module (FAM) and the Feature Modulation Module (FMM). FAM aligns features from a CNN-based backbone with those from the pretrained VTM’s encoder using channel transformation and spatial interpolation, and transfers knowledge via KL divergence and L2 normalization constraint. FMM further adapts the knowledge to the specific domain to address domain shift. We also introduce a fine-grained grass segmentation dataset and demonstrate, through experiments on two datasets, that our method achieves a significant improvement of 2.57 mIoU on the grass dataset and 3.73 mIoU on the cloud dataset. The results highlight the potential of combining knowledge transfer and domain adaptation to overcome domain-related challenges and data limitations. The project page is available at this https URL.
zh
[CV-26] Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion
【速读】: 该论文试图解决Stable Diffusion模型在推理过程中因重复去噪过程导致的计算密集性问题,特别是在需要低延迟和高可扩展性的实际应用场景中。解决方案的关键在于提出了一种高效的量化框架,通过串行到并行的校准流程(Serial-to-Parallel calibration pipeline)确保量化模型与浮点模型在生成结果上的一致性,并引入了混合精度量化策略、多时间步激活量化以及时间信息预计算技术,以保证生成图像的高保真度。实验结果表明,该方法在W4A8量化设置下,将分布相似性和视觉相似性提升了45%-60%,显著优于当前最先进的技术。
链接: https://arxiv.org/abs/2412.06661
作者: Shuaiting Li,Juncan Deng,Zeyu Wang,Hong Gu,Kedong Xu,Haibin Shen,Kejie Huang
关键词-EN: Stable Diffusion models, achieved notable success, notable success due, Diffusion models, Stable Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-to-image generation of Stable Diffusion models has achieved notable success due to its remarkable generation ability. However, the repetitive denoising process is computationally intensive during inference, which renders Diffusion models less suitable for real-world applications that require low latency and scalability. Recent studies have employed post-training quantization (PTQ) and quantization-aware training (QAT) methods to compress Diffusion models. Nevertheless, prior research has often neglected to examine the consistency between results generated by quantized models and those from floating-point models. This consistency is crucial in fields such as content creation, design, and edge deployment, as it can significantly enhance both efficiency and system stability for practitioners. To ensure that quantized models generate high-quality and consistent images, we propose an efficient quantization framework for Stable Diffusion models. Our approach features a Serial-to-Parallel calibration pipeline that addresses the consistency of both the calibration and inference processes, as well as ensuring training stability. Based on this pipeline, we further introduce a mix-precision quantization strategy, multi-timestep activation quantization, and time information precalculation techniques to ensure high-fidelity generation in comparison to floating-point models. Through extensive experiments with Stable Diffusion v1-4, v2-1, and XL 1.0, we have demonstrated that our method outperforms the current state-of-the-art techniques when tested on prompts from the COCO validation dataset and the Stable-Diffusion-Prompts dataset. Under W4A8 quantization settings, our approach enhances both distribution similarity and visual similarity by 45%-60%.
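To make the W4A8 setting above concrete, here is a hedged sketch of symmetric per-tensor fake quantization (round to a low-bit grid, then dequantize); the paper's serial-to-parallel calibration, mixed-precision strategy, and multi-timestep activation handling are not reproduced here.

```python
# Minimal sketch (assumptions, not the paper's pipeline): symmetric "fake"
# quantization used to simulate a W4A8 setting, i.e. 4-bit weights and
# 8-bit activations, by rounding to a uniform grid and dequantizing to float.
import torch

def fake_quantize(t: torch.Tensor, num_bits: int) -> torch.Tensor:
    qmax = 2 ** (num_bits - 1) - 1                 # symmetric signed range
    scale = t.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale from calibration
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)
    return q * scale                               # dequantized values fed downstream

weights = torch.randn(128, 128)
acts = torch.randn(4, 128)
w_q = fake_quantize(weights, num_bits=4)   # "W4"
a_q = fake_quantize(acts, num_bits=8)      # "A8"
print((weights - w_q).abs().mean(), (acts - a_q).abs().mean())
```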
zh
[CV-27] Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset
【速读】: 该论文试图解决事件流中目标检测的问题,特别是在低光、运动模糊和快速运动场景下的性能挑战。解决方案的关键在于引入了一种基于MoE(Mixture of Experts)热传导的目标检测算法,该算法通过创新的MoE-HCO块来平衡准确性和计算效率。MoE-HCO块集成了多个专家模块,模拟事件流中的热传导过程,并通过IoU(Intersection over Union)查询选择模块进行高效的令牌提取,最终传递到检测头进行目标检测。此外,论文还引入了新的EvDET200K基准数据集,为未来的研究和比较提供了坚实的基础。
链接: https://arxiv.org/abs/2412.06647
作者: Xiao Wang,Yu Jin,Wentao Wu,Wei Zhang,Lin Zhu,Bo Jiang,Yonghong Tian
关键词-EN: demonstrating superior performance, cutting-edge research area, demonstrating superior, low-light conditions, scenarios with motion
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: In Peer Review
点击查看摘要
Abstract:Object detection in event streams has emerged as a cutting-edge research area, demonstrating superior performance in low-light conditions, scenarios with motion blur, and rapid movements. Current detectors leverage spiking neural networks, Transformers, or convolutional neural networks as their core architectures, each with its own set of limitations including restricted performance, high computational overhead, or limited local receptive fields. This paper introduces a novel MoE (Mixture of Experts) heat conduction-based object detection algorithm that strikingly balances accuracy and computational efficiency. Initially, we employ a stem network for event data embedding, followed by processing through our innovative MoE-HCO blocks. Each block integrates various expert modules to mimic heat conduction within event streams. Subsequently, an IoU-based query selection module is utilized for efficient token extraction, which is then channeled into a detection head for the final object detection process. Furthermore, we are pleased to introduce EvDET200K, a novel benchmark dataset for event-based object detection. Captured with a high-definition Prophesee EVK4-HD event camera, this dataset encompasses 10 distinct categories, 200,000 bounding boxes, and 10,054 samples, each spanning 2 to 5 seconds. We also provide comprehensive results from over 15 state-of-the-art detectors, offering a solid foundation for future research and comparison. The source code of this paper will be released on: this https URL
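A toy sketch of the mixture-of-experts routing pattern the MoE-HCO block relies on, with plain MLP experts standing in for the paper's heat-conduction experts; the dense softmax gating shown here is an assumption for illustration only.

```python
# Illustrative sketch only: a mixture-of-experts block where a router produces
# per-token weights over several expert modules and outputs are combined as a
# weighted sum. The paper's heat-conduction experts are replaced by plain MLPs.
import torch
import torch.nn as nn

class MoEBlock(nn.Module):
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, tokens):                                # tokens: (B, N, C)
        gate = self.router(tokens).softmax(dim=-1)            # (B, N, E)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, N, C, E)
        return (expert_out * gate.unsqueeze(2)).sum(dim=-1)   # (B, N, C)

moe = MoEBlock(dim=64)
print(moe(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```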
zh
[CV-28] The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
【速读】: 该论文试图解决视觉-语言模型 (VLMs) 在处理图像理解任务时,如何将视觉信息传递到文本域的问题。解决方案的关键在于研究多模态输出模型与仅输出文本模型在信息流上的差异。研究发现,多模态输出模型中图像和文本嵌入在残差流中更为分离,且视觉信息通过单一的“窄门”(single token)传递到文本域,而仅输出文本的模型则通过多个图像标记进行分布式信息交换。通过消融实验,论文证明了这一单一标记对图像理解任务的重要性,并展示了通过局部干预该标记可以有效控制模型全局行为的潜力。
链接: https://arxiv.org/abs/2412.06646
作者: Alessandro Serra,Francesco Ortu,Emanuele Panizon,Lucrezia Valeriani,Lorenzo Basile,Alessio Ansuini,Diego Doimo,Alberto Cazzaniga
关键词-EN: Recent advances, improved the integration, Recent, image, information
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, specifically focusing on how visual information is processed and transferred to the textual domain. We compare VLMs that generate both images and text with those that output only text, highlighting key differences in information flow. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. Additionally, models vary in how information is exchanged from visual to textual tokens. VLMs that only output text exhibit a distributed communication pattern, where information is exchanged through multiple image tokens. In contrast, models trained for image and text generation rely on a single token that acts as a narrow gate for the visual information. We demonstrate that ablating this single token significantly deteriorates performance on image understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model’s global behavior.
zh
[CV-29] Detecting Facial Image Manipulations with Multi-Layer CNN Models
【速读】: 该论文试图解决数字图像篡改技术快速发展带来的内容验证挑战,特别是针对由生成式模型(如Stable Diffusion和MidJourney)生成的高度逼真但合成的图像。解决方案的关键在于设计和评估专门用于检测这些篡改图像的卷积神经网络(CNNs)。研究通过比较三种逐步复杂的CNN架构,系统地引入正则化和优化技术,以提高特征提取和分类性能。最终,所提出的模型在区分篡改图像与真实图像方面达到了高达76%的准确率,超越了传统方法。该研究不仅展示了CNN在增强数字媒体验证工具鲁棒性方面的潜力,还为低计算环境下的有效架构调整和训练策略提供了见解。
链接: https://arxiv.org/abs/2412.06643
作者: Alejandro Marco Montejano,Angela Sanchez Perez,Javier Barrachina,David Ortiz-Perez,Manuel Benavent-Lledo,Jose Garcia-Rodriguez
关键词-EN: producing highly realistic, deceive human perception, poses significant challenges, mid-journey producing highly, techniques poses significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The rapid evolution of digital image manipulation techniques poses significant challenges for content verification, with models such as stable diffusion and mid-journey producing highly realistic, yet synthetic, images that can deceive human perception. This research develops and evaluates convolutional neural networks (CNNs) specifically tailored for the detection of these manipulated images. The study implements a comparative analysis of three progressively complex CNN architectures, assessing their ability to classify and localize manipulations across various facial image modifications. Regularization and optimization techniques were systematically incorporated to improve feature extraction and performance. The results indicate that the proposed models achieve an accuracy of up to 76% in distinguishing manipulated images from genuine ones, surpassing traditional approaches. This research not only highlights the potential of CNNs in enhancing the robustness of digital media verification tools, but also provides insights into effective architectural adaptations and training strategies for low-computation environments. Future work will build on these findings by extending the architectures to handle more diverse manipulation techniques and integrating multi-modal data for improved detection capabilities.
zh
[CV-30] Class Balance Matters to Active Class-Incremental Learning ACM-MM2024
【速读】: 该论文试图解决Few-Shot Class-Incremental Learning(少样本类增量学习)中由于启发式少样本标注可能未覆盖最具信息量的样本,从而限制增量学习器能力的问题。解决方案的关键在于引入Active Class-Incremental Learning (ACIL),并通过Class-Balanced Selection (CBS)策略从大规模未标注数据池中选择最具信息量的样本,以确保类别的平衡性和样本的信息量。具体来说,CBS通过将未标注图像的特征聚类,并在每个聚类中采用贪心选择策略,使得采样特征的高斯分布与该聚类中所有未标注特征的高斯分布紧密匹配,从而实现高效的增量学习。
链接: https://arxiv.org/abs/2412.06642
作者: Zitong Huang,Ze Chen,Yuanze Li,Bowen Dong,Erjin Zhou,Yong Liu,Rick Siow Mong Goh,Chun-Mei Feng,Wangmeng Zuo
关键词-EN: shown remarkable efficacy, Few-Shot Class-Incremental Learning, Learning, shown remarkable, remarkable efficacy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACM MM 2024
点击查看摘要
Abstract:Few-Shot Class-Incremental Learning has shown remarkable efficacy in efficient learning new concepts with limited annotations. Nevertheless, the heuristic few-shot annotations may not always cover the most informative samples, which largely restricts the capability of incremental learner. We aim to start from a pool of large-scale unlabeled data and then annotate the most informative samples for incremental learning. Based on this premise, this paper introduces the Active Class-Incremental Learning (ACIL). The objective of ACIL is to select the most informative samples from the unlabeled pool to effectively train an incremental learner, aiming to maximize the performance of the resulting model. Note that vanilla active learning algorithms suffer from class-imbalanced distribution among annotated samples, which restricts the ability of incremental learning. To achieve both class balance and informativeness in chosen samples, we propose Class-Balanced Selection (CBS) strategy. Specifically, we first cluster the features of all unlabeled images into multiple groups. Then for each cluster, we employ greedy selection strategy to ensure that the Gaussian distribution of the sampled features closely matches the Gaussian distribution of all unlabeled features within the cluster. Our CBS can be plugged and played into those CIL methods which are based on pretrained models with prompts tunning technique. Extensive experiments under ACIL protocol across five diverse datasets demonstrate that CBS outperforms both random selection and other SOTA active learning approaches. Code is publicly available at this https URL.
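A rough sketch of the selection loop under a simplifying assumption: features are clustered with k-means and, within each cluster, samples are greedily added so that the mean of the selected features stays close to the cluster mean. The paper matches full Gaussian statistics; this first-moment version only illustrates the mechanism.

```python
# Hedged simplification, not the authors' code: cluster unlabeled features,
# then greedily pick samples per cluster so the mean of selected features stays
# close to the cluster mean (a first-moment stand-in for Gaussian matching).
import numpy as np
from sklearn.cluster import KMeans

def class_balanced_selection(features: np.ndarray, n_clusters: int, per_cluster: int):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    picked = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        target_mean = features[idx].mean(axis=0)
        chosen = []
        for _ in range(min(per_cluster, len(idx))):
            best, best_dist = None, np.inf
            for i in idx:
                if i in chosen:
                    continue
                cand_mean = features[chosen + [i]].mean(axis=0)
                d = np.linalg.norm(cand_mean - target_mean)
                if d < best_dist:
                    best, best_dist = i, d
            chosen.append(best)
        picked.extend(chosen)
    return picked

feats = np.random.randn(500, 32).astype(np.float32)
print(len(class_balanced_selection(feats, n_clusters=5, per_cluster=4)))  # 20
```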
zh
[CV-31] Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers
【速读】: 该论文试图解决现有对齐分析方法在量化特征空间差异时仅依赖单一标量值的问题,这掩盖了不同表示之间共同特征和独特特征的区别。解决方案的关键在于结合对齐分析与概念发现,将特征空间中的对齐分解为单个概念,从而实现细粒度的比较。具体来说,论文定义了概念为特征空间中任意流形的最一般结构,并使用广义Rand指数来测量两个表示之间概念接近度的距离,从而揭示不同表示中的普遍概念和独特概念,以及每个表示内部概念的内部结构。这种方法在验证中显示出优于现有线性基线的性能,并揭示了增加监督与学习表示的语义结构减少之间的相关性。
链接: https://arxiv.org/abs/2412.06639
作者: Johanna Vielhaben,Dilyara Bareeva,Jim Berend,Wojciech Samek,Nils Strodthoff
关键词-EN: Vision transformers, learning paradigms, supervised to self-supervised, alignment, fully supervised
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 17 figures, code: this https URL
点击查看摘要
Abstract:Vision transformers (ViTs) can be trained using various learning paradigms, from fully supervised to self-supervised. Diverse training protocols often result in significantly different feature spaces, which are usually compared through alignment analysis. However, current alignment measures quantify this relationship in terms of a single scalar value, obscuring the distinctions between common and unique features in pairs of representations that share the same scalar alignment. We address this limitation by combining alignment analysis with concept discovery, which enables a breakdown of alignment into single concepts encoded in feature space. This fine-grained comparison reveals both universal and unique concepts across different representations, as well as the internal structure of concepts within each of them. Our methodological contributions address two key prerequisites for concept-based alignment: 1) For a description of the representation in terms of concepts that faithfully capture the geometry of the feature space, we define concepts as the most general structure they can possibly form - arbitrary manifolds, allowing hidden features to be described by their proximity to these manifolds. 2) To measure distances between concept proximity scores of two representations, we use a generalized Rand index and partition it for alignment between pairs of concepts. We confirm the superiority of our novel concept definition for alignment analysis over existing linear baselines in a sanity check. The concept-based alignment analysis of representations from four different ViTs reveals that increased supervision correlates with a reduction in the semantic structure of learned representations.
zh
[CV-32] MAVias: Mitigate any Visual Bias
【速读】: 该论文试图解决计算机视觉模型中存在的多种潜在且未知的偏差问题,现有的偏差缓解方法通常仅针对预定义的一小部分偏差,限制了其在包含多种复杂偏差的视觉数据集中的适用性。解决方案的关键在于提出了MAVias,一种开放集偏差缓解方法。MAVias通过利用基础模型(foundation models)来发现视觉属性和目标类别之间的虚假关联。具体而言,MAVias首先通过基础图像标注模型捕捉广泛的视觉特征,并利用大型语言模型(large language model)筛选出定义目标类别的视觉特征,从而生成一组语言编码的潜在视觉偏差。随后,这些潜在偏差被转换为视觉-语言嵌入(vision-language embeddings),并通过一种内处理偏差缓解方法,防止模型编码与这些偏差相关的信息。实验结果表明,MAVias在多个数据集上有效检测并缓解了广泛的偏差,优于当前最先进的方法。
链接: https://arxiv.org/abs/2412.06632
作者: Ioannis Sarridis,Christos Koutlis,Symeon Papadopoulos,Christos Diou
关键词-EN: computer vision models, artificial intelligence models, Mitigating biases, computer vision, essential step
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Mitigating biases in computer vision models is an essential step towards the trustworthiness of artificial intelligence models. Existing bias mitigation methods focus on a small set of predefined biases, limiting their applicability in visual datasets where multiple, possibly unknown biases exist. To address this limitation, we introduce MAVias, an open-set bias mitigation approach leveraging foundation models to discover spurious associations between visual attributes and target classes. MAVias first captures a wide variety of visual features in natural language via a foundation image tagging model, and then leverages a large language model to select those visual features defining the target class, resulting in a set of language-coded potential visual biases. We then translate this set of potential biases into vision-language embeddings and introduce an in-processing bias mitigation approach to prevent the model from encoding information related to them. Our experiments on diverse datasets, including CelebA, Waterbirds, ImageNet, and UrbanCars, show that MAVias effectively detects and mitigates a wide range of biases in visual recognition tasks outperforming current state-of-the-art.
zh
[CV-33] MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences
【速读】: 该论文试图解决3D内容生成领域中自动评估方法与人类偏好不一致的问题,特别是文本驱动和图像驱动方法在混合比较中导致的不公平评估。解决方案的关键在于提出一个综合框架,通过收集和过滤标准化的图像提示集,生成多视角资产,并利用系统化的排名流程获取专家对这些资产的成对比较数据,进而训练一个名为MVReward的奖励模型来有效编码人类偏好。此外,论文还提出了Multi-View Preference Learning (MVP),一种即插即用的多视角扩散模型调优策略,以进一步提升模型与人类偏好的对齐。实验结果表明,MVReward作为可靠的评估指标,能够公平透明地比较不同方法,而MVP则显著增强了多视角扩散模型与人类偏好的对齐效果。
链接: https://arxiv.org/abs/2412.06614
作者: Weitao Wang,Haoran Xu,Yuxiao Yang,Zhifang Liu,Jun Meng,Haoqian Wang
关键词-EN: witnessed remarkable progress, Recent years, content generation, years have witnessed, witnessed remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent years have witnessed remarkable progress in 3D content generation. However, corresponding evaluation methods struggle to keep pace. Automatic approaches have proven challenging to align with human preferences, and the mixed comparison of text- and image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. To begin with, we first collect and filter a standardized image prompt set from DALL \cdot E and Objaverse, which we then use to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset with 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other in a more fair and transparent manner. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward can serve as a reliable metric and MVP consistently enhances the alignment of multi-view diffusion models with human preferences.
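Training a reward model from the 16k pairwise expert comparisons can be sketched with a Bradley-Terry style objective; the feature dimension, scorer architecture, and loss form below are assumptions for illustration, since the abstract does not pin these down.

```python
# Hedged sketch (assumed formulation): a reward model trained from pairwise
# comparisons with a Bradley-Terry loss, pushing the score of the
# human-preferred multi-view asset above the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

def pairwise_reward_loss(feat_preferred, feat_rejected):
    r_pos = reward_model(feat_preferred)   # (B, 1) score of preferred asset
    r_neg = reward_model(feat_rejected)    # (B, 1) score of rejected asset
    # -log sigmoid(r_pos - r_neg): preferred assets should score higher
    return -F.logsigmoid(r_pos - r_neg).mean()

loss = pairwise_reward_loss(torch.randn(8, 512), torch.randn(8, 512))
loss.backward()
print(float(loss))
```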
zh
[CV-34] 3D Spatial Understanding in MLLM s: Disambiguation and Evaluation
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂3D环境中定位和区分目标对象时遇到的困难,特别是在目标对象被相似对象(干扰物)包围的情况下。解决方案的关键在于提出简单而有效的技术,以增强模型在上下文对象定位与消歧(contextual object localization and disambiguation)方面的能力。这些技术不仅在传统的句子相似性评估指标上达到了最先进水平,还通过3D视觉定位模型展示了更强的3D空间理解能力。
链接: https://arxiv.org/abs/2412.06613
作者: Chun-Peng Chang,Alain Pagani,Didier Stricker
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, made significant progress
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have made significant progress in tasks such as image captioning and question answering. However, while these models can generate realistic captions, they often struggle with providing precise instructions, particularly when it comes to localizing and disambiguating objects in complex 3D environments. This capability is critical as MLLMs become more integrated with collaborative robotic systems. In scenarios where a target object is surrounded by similar objects (distractors), robots must deliver clear, spatially-aware instructions to guide humans effectively. We refer to this challenge as contextual object localization and disambiguation, which imposes stricter constraints than conventional 3D dense captioning, especially regarding ensuring target exclusivity. In response, we propose simple yet effective techniques to enhance the model’s ability to localize and disambiguate target objects. Our approach not only achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity, but also demonstrates improved 3D spatial understanding through 3D visual grounding model.
zh
[CV-35] PrEditor3D: Fast and Precise 3D Shape Editing ATC WWW
【速读】: 该论文试图解决3D编辑中的快速、高质量编辑问题,特别是如何在几分钟内对单个3D形状进行编辑,同时确保未修改区域保持不变。解决方案的关键在于提出了一种无需训练的方法,通过将3D对象投影到4视图图像上,并结合用户引导的文本提示和粗略的掩码进行同步的多视图图像编辑。为了解决从3D到2D投影带来的编辑区域模糊问题,论文开发了一个3D分割流程,用于在3D空间中检测编辑区域,并通过合并算法将编辑后的3D区域与原始输入无缝集成。这种方法在实验中表现出优于先前方法的性能,实现了快速、高质量的编辑,同时保留了未修改区域的一致性。
链接: https://arxiv.org/abs/2412.06592
作者: Ziya Erkoç,Can Gümeli,Chaoyang Wang,Matthias Nießner,Angela Dai,Peter Wonka,Hsin-Ying Lee,Peiye Zhuang
关键词-EN: propose a training-free, training-free approach, single shape, editing, regions
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL Video: this https URL
点击查看摘要
Abstract:We propose a training-free approach to 3D editing that enables the editing of a single shape within a few minutes. The edited 3D mesh aligns well with the prompts, and remains identical for regions that are not intended to be altered. To this end, we first project the 3D object onto 4-view images and perform synchronized multi-view image editing along with user-guided text prompts and user-provided rough masks. However, the targeted regions to be edited are ambiguous due to projection from 3D to 2D. To ensure precise editing only in intended regions, we develop a 3D segmentation pipeline that detects edited areas in 3D space, followed by a merging algorithm to seamlessly integrate edited 3D regions with the original input. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling fast, high-quality editing while preserving unintended regions.
zh
[CV-36] Bridging the Divide: Reconsidering Softmax and Linear Attention NEURIPS2024
【速读】: 该论文试图解决线性注意力(linear attention)在处理高分辨率图像时性能不如Softmax注意力(Softmax attention)的问题。解决方案的关键在于通过理论分析揭示了线性注意力性能不佳的核心原因,并提出了两个关键视角来理解和缓解这些限制:1)线性注意力缺乏单射性(injective property),即它容易为不同的查询向量分配相同的注意力权重,导致语义混淆;2)线性注意力在局部建模能力上的不足,而有效的局部建模是Softmax注意力成功的关键。通过赋予线性注意力这两个特性,论文证明了在保持较低计算复杂度的同时,线性注意力可以在各种任务中超越Softmax注意力。
链接: https://arxiv.org/abs/2412.06590
作者: Dongchen Han,Yifan Pu,Zhuofan Xia,Yizeng Han,Xuran Pan,Xiu Li,Jiwen Lu,Shiji Song,Gao Huang
关键词-EN: Vision Transformer designs, modern Vision Transformer, Vision Transformer, long-range visual information, effectively capture long-range
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2024
点击查看摘要
Abstract:Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step forward to close the gap between the linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind the performance deviations. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. Firstly, we prove that linear attention is not injective, which is prone to assign identical attention weights to different query vectors, thus adding to severe semantic confusion since different queries correspond to the same outputs. Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, in which linear attention falls short. The aforementioned two fundamental differences significantly contribute to the disparities between these two attention paradigms, which is demonstrated by our substantial empirical validation in the paper. In addition, more experiment results indicate that linear attention, as long as endowed with these two properties, can outperform Softmax attention across various tasks while maintaining lower computation complexity. Code is available at this https URL.
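For reference, a minimal sketch contrasting the two attention paradigms discussed above; the `elu(x) + 1` feature map is one common choice for linear attention and is an assumption here, not necessarily the formulation used in the paper.

```python
# Minimal sketch: softmax attention (quadratic in sequence length) versus
# kernelized linear attention (linear in sequence length, by exploiting the
# associativity of matrix products).
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):                              # (B, N, d) each
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (B, N, N)
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    q, k = F.elu(q) + 1, F.elu(k) + 1                        # positive feature map
    kv = k.transpose(-2, -1) @ v                             # (B, d, d), computed once
    z = q @ k.sum(dim=1, keepdim=True).transpose(-2, -1) + eps  # normalizer (B, N, 1)
    return (q @ kv) / z

q, k, v = (torch.randn(2, 196, 64) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```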
zh
[CV-37] MoViE: Mobile Diffusion for Video Editing
【速读】: 该论文试图解决基于扩散的视频编辑方法在移动设备上部署的高成本和复杂性问题。解决方案的关键在于一系列优化措施:首先,通过优化现有图像编辑模型的架构并引入轻量级自编码器(autoencoder)来提升效率;其次,将无分类器引导蒸馏(classifier-free guidance distillation)扩展到多模态处理,从而实现三倍的设备端加速;最后,通过引入一种新的对抗性蒸馏方案,将采样步骤减少到一次,同时保持编辑过程的可控性。这些优化共同实现了在移动设备上以每秒12帧的速度进行高质量视频编辑。
链接: https://arxiv.org/abs/2412.06578
作者: Adil Karjauv,Noor Fathima,Ioannis Lelekas,Fatih Porikli,Amir Ghodrati,Amirhossein Habibian
关键词-EN: shown remarkable potential, Recent progress, practical applications, progress in diffusion-based, shown remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
点击查看摘要
Abstract:Recent progress in diffusion-based video editing has shown remarkable potential for practical applications. However, these methods remain prohibitively expensive and challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that render mobile video editing feasible. Building upon the existing image editing model, we first optimize its architecture and incorporate a lightweight autoencoder. Subsequently, we extend classifier-free guidance distillation to multiple modalities, resulting in a threefold on-device speedup. Finally, we reduce the number of sampling steps to one by introducing a novel adversarial distillation scheme which preserves the controllability of the editing process. Collectively, these optimizations enable video editing at 12 frames per second on mobile devices, while maintaining high quality. Our results are available at this https URL
zh
[CV-38] Prediction of Occluded Pedestrians in Road Scenes using Human-like Reasoning: Insights from the OccluRoads Dataset
【速读】: 该论文试图解决自动驾驶中行人检测的难题,特别是在存在遮挡(尤其是完全不可见的行人)情况下的检测问题。解决方案的关键在于提出了Occlusion-Rich Road Scenes with Pedestrians (OccluRoads)数据集,并通过结合知识图谱 (Knowledge Graph, KG)、知识图谱嵌入 (Knowledge Graph Embedding, KGE) 和贝叶斯推理过程,构建了一个预测遮挡行人存在的管道。该方法显著提升了检测性能,F1分数达到0.91,相比传统机器学习模型提高了42%。
链接: https://arxiv.org/abs/2412.06549
作者: Melo Castillo Angie Nataly,Martin Serrano Sergio,Salinas Carlota,Sotelo Miguel Angel
关键词-EN: autonomous driving, aimed at enhancing, critical task, task in autonomous, enhancing safety
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Pedestrian detection is a critical task in autonomous driving, aimed at enhancing safety and reducing risks on the road. Over recent years, significant advancements have been made in improving detection performance. However, these achievements still fall short of human perception, particularly in cases involving occluded pedestrians, especially entirely invisible ones. In this work, we present the Occlusion-Rich Road Scenes with Pedestrians (OccluRoads) dataset, which features a diverse collection of road scenes with partially and fully occluded pedestrians in both real and virtual environments. All scenes are meticulously labeled and enriched with contextual information that encapsulates human perception in such scenarios. Using this dataset, we developed a pipeline to predict the presence of occluded pedestrians, leveraging Knowledge Graph (KG), Knowledge Graph Embedding (KGE), and a Bayesian inference process. Our approach achieves a F1 score of 0.91, representing an improvement of up to 42% compared to traditional machine learning models.
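A toy illustration of the Bayesian inference step: the probability that an occluded pedestrian is present is updated from contextual evidence such as a crosswalk or a stopped vehicle. The likelihood values below are invented for illustration; in the paper they would come from the KG/KGE pipeline.

```python
# Toy numbers, not the paper's learned values: a Bayesian update of the
# probability that an occluded pedestrian is present, given contextual evidence.
def bayes_update(prior: float, p_evidence_given_ped: float, p_evidence_given_none: float) -> float:
    num = p_evidence_given_ped * prior
    den = num + p_evidence_given_none * (1.0 - prior)
    return num / den

p = 0.10                          # prior: an occluded pedestrian is present
p = bayes_update(p, 0.8, 0.3)     # evidence: crosswalk ahead
p = bayes_update(p, 0.7, 0.2)     # evidence: vehicle stopped at the curb
print(round(p, 3))                # posterior after both observations
```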
zh
[CV-39] Inverting Visual Representations with Detection Transformers
【速读】: 该论文试图解决在基于Transformer的视觉模型中理解深度神经网络机制的问题。解决方案的关键在于应用逆向模型训练方法,从Detection Transformer的中间层重建输入图像。通过定性和定量评估重建图像,研究展示了Detection Transformer的关键特性,如上下文形状保持、层间相关性和对颜色扰动的鲁棒性,从而深入理解了Transformer架构中的这些特性如何产生。
链接: https://arxiv.org/abs/2412.06534
作者: Jan Rathjens,Shirin Reyhanian,David Kappel,Laurenz Wiskott
关键词-EN: deep neural networks, mechanisms underlying deep, computer vision remains, underlying deep neural, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many prior approaches have focused on visualizing intermediate representations within deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we apply the approach of training inverse models to reconstruct input images from intermediate layers within a Detection Transformer, showing that this approach is efficient and feasible for transformer-based vision models. Through qualitative and quantitative evaluations of reconstructed images across model stages, we demonstrate critical properties of Detection Transformers, including contextual shape preservation, inter-layer correlation, and robustness to color perturbations, illustrating how these characteristics emerge within the model’s architecture. Our findings contribute to a deeper understanding of transformer-based vision models. The code for reproducing our experiments will be made available at this http URL.
zh
[CV-40] Fitting Spherical Gaussians to Dynamic HDRI Sequences WWW SIGGRAPH
【速读】: 该论文试图解决高动态范围光照序列 (HDRI) 的拟合问题,特别是在使用各向异性球面高斯函数 (Anisotropic Spherical Gaussians, ASGs) 进行压缩时保持时间一致性。解决方案的关键在于引入一个优化网络,通过迭代最小化复合损失函数(包括重建损失和漫反射损失)来同时优化ASGs的方向、锐度和强度,从而以少量ASGs表示全频信号。此外,为了在时间域上扩展优化并确保整个HDRI序列的一致性,论文还引入了时间一致性损失。
链接: https://arxiv.org/abs/2412.06511
作者: Pascal Clausen,Li Ma,Mingming He,Ahmet Levent Tasel,Oliver Pilarski,Paul Debevec
关键词-EN: anisotropic spherical Gaussians, dynamic range illumination, fitting high dynamic, high dynamic range, compressed HDRI maps
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 3 pages, 4 figures, SIGGRAPH Asia 2024 poster, this https URL
点击查看摘要
Abstract:We present a technique for fitting high dynamic range illumination (HDRI) sequences using anisotropic spherical Gaussians (ASGs) while preserving temporal consistency in the compressed HDRI maps. Our approach begins with an optimization network that iteratively minimizes a composite loss function, which includes both reconstruction and diffuse losses. This allows us to represent all-frequency signals with a small number of ASGs, optimizing their directions, sharpness, and intensity simultaneously for an individual HDRI. To extend this optimization into the temporal domain, we introduce a temporal consistency loss, ensuring a consistent approximation across the entire HDRI sequence.
zh
[CV-41] AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis
【速读】: 该论文试图解决现有文本到图像异常合成方法在捕捉现实异常复杂特征(如细粒度视觉模式)方面的不足,导致生成异常样本的真实性和泛化能力受限的问题。解决方案的关键在于提出了一种名为AnomalyControl的新型异常合成框架,通过学习跨模态语义特征作为指导信号,从文本-图像参考提示中编码广义异常线索,从而提升合成异常样本的真实性。具体来说,AnomalyControl采用灵活且非匹配的提示对(即文本-图像参考提示和目标文本提示),并设计了跨模态语义建模(CSM)模块和异常语义增强注意力(ASEA)机制,以提取和聚焦于异常的特定视觉模式,增强生成特征的真实性和上下文相关性。此外,通过语义引导适配器(SGA)将跨模态语义特征作为先验,实现有效的可控合成过程。
链接: https://arxiv.org/abs/2412.06510
作者: Shidan He,Lei Liu,Shen Zhao
关键词-EN: advancing anomaly inspection, cross-modal semantic features, augment abnormal data, cross-modal semantic, Anomaly synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Anomaly synthesis is a crucial approach to augment abnormal data for advancing anomaly inspection. Based on the knowledge from the large-scale pre-training, existing text-to-image anomaly synthesis methods predominantly focus on textual information or coarse-aligned visual features to guide the entire generation process. However, these methods often lack sufficient descriptors to capture the complicated characteristics of realistic anomalies (e.g., the fine-grained visual pattern of anomalies), limiting the realism and generalization of the generation process. To this end, we propose a novel anomaly synthesis framework called AnomalyControl to learn cross-modal semantic features as guidance signals, which could encode the generalized anomaly cues from text-image reference prompts and improve the realism of synthesized abnormal samples. Specifically, AnomalyControl adopts a flexible and non-matching prompt pair (i.e., a text-image reference prompt and a targeted text prompt), where a Cross-modal Semantic Modeling (CSM) module is designed to extract cross-modal semantic features from the textual and visual descriptors. Then, an Anomaly-Semantic Enhanced Attention (ASEA) mechanism is formulated to allow CSM to focus on the specific visual patterns of the anomaly, thus enhancing the realism and contextual relevance of the generated anomaly features. Treating cross-modal semantic features as the prior, a Semantic Guided Adapter (SGA) is designed to encode effective guidance signals for the adequate and controllable synthesis process. Extensive experiments indicate that AnomalyControl can achieve state-of-the-art results in anomaly synthesis compared with existing methods while exhibiting superior performance for downstream tasks.
zh
[CV-42] Hybrid Attention Network: An efficient approach for anatomy-free landmark detection
【速读】: 该论文试图解决医学图像中解剖标志点检测的准确性与计算效率之间的平衡问题,尤其是在处理高分辨率图像时。解决方案的关键在于引入了一种名为混合注意力网络 (Hybrid Attention Network, HAN) 的新型混合架构,该架构结合了卷积神经网络 (CNNs) 和 Transformer。其核心模块是 BiFormer,利用双层路由注意力 (Bi-Level Routing Attention, BRA) 机制来高效地关注相关图像区域,同时通过卷积注意力块 (Convolutional Attention Blocks, CAB) 结合通道注意力机制 (CBAM) 进行局部特征的精确细化。此外,特征融合校正模块 (Feature Fusion Correction Module, FFCM) 整合多尺度特征,减少分辨率损失。通过在多分辨率热图上使用均方误差 (MSE) 损失进行深度监督,进一步优化了模型性能。实验结果表明,HAN 在多个数据集上实现了最先进的性能,显著提升了检测的准确性、鲁棒性和效率。
链接: https://arxiv.org/abs/2412.06499
作者: Xiaoqian Zhou,Zhen Huang,Heqin Zhu,Qingsong Yao,S.Kevin Zhou
关键词-EN: clinical applications, crucial for clinical, Hybrid Attention Network, Convolutional Attention Blocks, Attention
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate anatomical landmark detection in medical images is crucial for clinical applications. Existing methods often struggle to balance global context with computational efficiency, particularly with high-resolution images. This paper introduces the Hybrid Attention Network(HAN), a novel hybrid architecture integrating CNNs and Transformers. Its core is the BiFormer module, utilizing Bi-Level Routing Attention (BRA) for efficient attention to relevant image regions. This, combined with Convolutional Attention Blocks (CAB) enhanced by CBAM, enables precise local feature refinement guided by the global context. A Feature Fusion Correction Module (FFCM) integrates multi-scale features, mitigating resolution loss. Deep supervision with MSE loss on multi-resolution heatmaps optimizes the model. Experiments on five diverse datasets demonstrate state-of-the-art performance, surpassing existing methods in accuracy, robustness, and efficiency. The HAN provides a promising solution for accurate and efficient anatomical landmark detection in complex medical images. Our codes and data will be released soon at: \urlthis https URL.
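A small sketch of the deep-supervision objective mentioned above: MSE between predicted and ground-truth heatmaps at several resolutions, with targets resized to each prediction. The landmark count and resolutions are placeholders, not values from the paper.

```python
# Illustrative sketch: deep supervision for landmark detection by summing MSE
# losses between predicted and target heatmaps at several resolutions.
import torch
import torch.nn.functional as F

def multi_resolution_heatmap_loss(preds, target):
    # preds: list of (B, K, h_i, w_i) heatmaps from coarse to fine
    # target: (B, K, H, W) full-resolution ground-truth heatmaps
    loss = 0.0
    for p in preds:
        t = F.interpolate(target, size=p.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + F.mse_loss(p, t)
    return loss

target = torch.rand(2, 19, 256, 256)                      # e.g. 19 landmarks (placeholder)
preds = [torch.rand(2, 19, s, s) for s in (64, 128, 256)]
print(float(multi_resolution_heatmap_loss(preds, target)))
```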
zh
[CV-43] PPT: Pre-Training with Pseudo-Labeled Trajectories for Motion Forecasting
【速读】: 该论文试图解决自动驾驶中复杂城市场景下的运动预测(Motion Forecasting, MF)问题,特别是如何在标注数据有限的情况下提高预测性能。解决方案的关键在于提出了一种混合训练策略,即先在伪标注数据上进行预训练(Pseudo-Labeling Pre-Training, PPT),然后在标注数据上进行微调。伪标注数据的生成依赖于现有的单帧3D目标检测器和非学习型跟踪器,这种预训练策略显著提升了模型在多样化测试集上的表现,尤其是在标注数据有限的情况下,并增强了跨数据集的泛化能力。
链接: https://arxiv.org/abs/2412.06491
作者: Yihong Xu,Yuan Yin,Tuan-Hung Vu,Alexandre Boulch,Éloi Zablocki,Matthieu Cord
关键词-EN: complex urban scenarios, urban scenarios, aims at anticipating, surrounding agents, agents in complex
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Motion forecasting (MF) for autonomous driving aims at anticipating trajectories of surrounding agents in complex urban scenarios. In this work, we investigate a mixed strategy in MF training that first pre-train motion forecasters on pseudo-labeled data, then fine-tune them on annotated data. To obtain pseudo-labeled trajectories, we propose a simple pipeline that leverages off-the-shelf single-frame 3D object detectors and non-learning trackers. The whole pre-training strategy including pseudo-labeling is coined as PPT. Our extensive experiments demonstrate that: (1) combining PPT with supervised fine-tuning on annotated data achieves superior performance on diverse testbeds, especially under annotation-efficient regimes, (2) scaling up to multiple datasets improves the previous state-of-the-art and (3) PPT helps enhance cross-dataset generalization. Our findings showcase PPT as a promising pre-training solution for robust motion forecasting in diverse autonomous driving contexts.
zh
[CV-44] An Efficient Scene Coordinate Encoding and Relocalization Method
【速读】: 该论文试图解决场景坐标回归 (Scene Coordinate Regression, SCR) 技术在处理重复纹理和无意义区域时面临的挑战,这些问题源于现有方法对隐式三角测量的依赖。解决方案的关键在于设计了一种统一的架构,用于场景编码和显著关键点检测,使系统能够专注于编码信息丰富的区域,从而显著提高效率。此外,论文引入了一种利用序列信息的机制,在地图编码和重定位过程中增强了隐式三角测量,特别是在重复纹理环境中。实验结果表明,该系统在室内外数据集上均优于现有的最先进 (SOTA) SCR 方法,单帧重定位模式提高了召回率并提升了运行速度,而基于序列的模式进一步提高了召回率并保持了原有效率。
链接: https://arxiv.org/abs/2412.06488
作者: Kuan Xu,Zeyu Jiang,Haozhi Cao,Shenghai Yuan,Chen Wang,Lihua Xie
关键词-EN: Scene Coordinate Regression, deep neural networks, camera pose estimation, visual localization technique, utilizes deep neural
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
点击查看摘要
Abstract:Scene Coordinate Regression (SCR) is a visual localization technique that utilizes deep neural networks (DNN) to directly regress 2D-3D correspondences for camera pose estimation. However, current SCR methods often face challenges in handling repetitive textures and meaningless areas due to their reliance on implicit triangulation. In this paper, we propose an efficient scene coordinate encoding and relocalization method. Compared with the existing SCR methods, we design a unified architecture for both scene encoding and salient keypoint detection, enabling our system to focus on encoding informative regions, thereby significantly enhancing efficiency. Additionally, we introduce a mechanism that leverages sequential information during both map encoding and relocalization, which strengthens implicit triangulation, particularly in repetitive texture environments. Comprehensive experiments conducted across indoor and outdoor datasets demonstrate that the proposed system outperforms other state-of-the-art (SOTA) SCR methods. Our single-frame relocalization mode improves the recall rate of our baseline by 6.4% and increases the running speed from 56Hz to 90Hz. Furthermore, our sequence-based mode increases the recall rate by 11% while maintaining the original efficiency.
zh
[CV-45] From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding
【速读】: 该论文试图解决大视觉-语言模型 (Large vision-language models, LVLMs) 在多模态任务中容易误解视觉输入,导致幻觉 (hallucinations) 和不可靠输出的问题。解决方案的关键是提出了一种名为“Dropout Decoding”的新型推理时方法,通过量化视觉标记的不确定性并选择性地屏蔽不确定的标记来改进解码过程。具体而言,该方法通过将视觉标记投影到文本空间,并将其分解为偶然不确定性 (aleatoric uncertainty) 和认知不确定性 (epistemic uncertainty),重点处理能够更有效捕捉感知相关错误的认知不确定性。基于dropout正则化的思想,引入不确定性引导的标记dropout,在推理过程中对输入视觉标记而非模型参数应用dropout,并通过聚合来自多个掩码解码上下文的预测来增强模型的鲁棒性,从而显著减少对象幻觉 (object hallucinations, OH) 并提高输出的可靠性和质量。
链接: https://arxiv.org/abs/2412.06474
作者: Yixiong Fang,Ziran Yang,Zhaorun Chen,Zhuokai Zhao,Jiawei Zhou
关键词-EN: Large vision-language models, Large vision-language, demonstrate remarkable capabilities, remarkable capabilities, capabilities in multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code is released at this https URL
点击查看摘要
Abstract:Large vision-language models (LVLMs) demonstrate remarkable capabilities in multimodal tasks but are prone to misinterpreting visual inputs, often resulting in hallucinations and unreliable outputs. To address these challenges, we propose Dropout Decoding, a novel inference-time approach that quantifies the uncertainty of visual tokens and selectively masks uncertain tokens to improve decoding. Our method measures the uncertainty of each visual token by projecting it onto the text space and decomposing it into aleatoric and epistemic components. Specifically, we focus on epistemic uncertainty, which captures perception-related errors more effectively. Inspired by dropout regularization, we introduce uncertainty-guided token dropout, which applies the dropout principle to input visual tokens instead of model parameters, and during inference rather than training. By aggregating predictions from an ensemble of masked decoding contexts, Dropout Decoding robustly mitigates errors arising from visual token misinterpretations. Evaluations on benchmarks including CHAIR, THRONE, and MMBench demonstrate that Dropout Decoding significantly reduces object hallucinations (OH) and enhances both reliability and quality of LVLM outputs across diverse visual contexts.
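An approximate sketch of the decoding idea, under assumptions: visual tokens are projected to the text space, scored by entropy, and the most uncertain ones are randomly dropped across several masked contexts whose predictions are averaged. The real method decomposes uncertainty into aleatoric and epistemic parts, which is not modeled here.

```python
# Hedged approximation, not the authors' code: entropy-scored visual tokens,
# randomized dropout of the most uncertain ones, and an ensemble over masked
# decoding contexts.
import torch

def token_entropy(visual_tokens, text_projection):
    # visual_tokens: (N, d); text_projection: (d, V) maps tokens to vocab logits
    probs = (visual_tokens @ text_projection).softmax(dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)       # (N,)

def dropout_decode(visual_tokens, text_projection, model_fn, drop_ratio=0.3, n_samples=5):
    ent = token_entropy(visual_tokens, text_projection)
    n_drop = int(drop_ratio * len(ent))
    uncertain = ent.topk(n_drop).indices                              # most uncertain tokens
    outputs = []
    for _ in range(n_samples):
        keep = torch.ones(len(ent), dtype=torch.bool)
        dropped = uncertain[torch.rand(n_drop) < 0.5]                 # randomized masking
        keep[dropped] = False
        outputs.append(model_fn(visual_tokens[keep]))
    return torch.stack(outputs).mean(dim=0)                           # ensemble of contexts

tokens = torch.randn(196, 64)                     # placeholder visual tokens
proj = torch.randn(64, 1000)                      # placeholder text projection
answer = dropout_decode(tokens, proj, model_fn=lambda t: t.mean(dim=0))
print(answer.shape)                               # torch.Size([64])
```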
zh
[CV-46] Active Learning with Context Sampling and One-vs-Rest Entropy for Semantic Segmentation WACV2025
【速读】: 该论文试图解决多类别语义分割(multi-class semantic segmentation)中数据集创建耗时且费力的问题,特别是在特定领域。解决方案的关键在于提出了一种名为OREAL的新型基于块的主动学习(Active Learning, AL)方法。OREAL通过最大聚合像素级不确定性分数来增强边界检测,并引入了一种新的不确定性评分函数——one-vs-rest熵(one-vs-rest entropy),该函数在计算类别不确定性的同时实现了数据集创建过程中的隐式类别平衡。
链接: https://arxiv.org/abs/2412.06470
作者: Fei Wu,Pablo Marquez-Neila,Hedyeh Rafi-Tarii,Raphael Sznitman
关键词-EN: computer vision, Multi-class semantic segmentation, Multi-class semantic, Abstract, cornerstone challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: WACV 2025, 8 pages
点击查看摘要
Abstract:Multi-class semantic segmentation remains a cornerstone challenge in computer vision. Yet, dataset creation remains excessively demanding in time and effort, especially for specialized domains. Active Learning (AL) mitigates this challenge by selecting data points for annotation strategically. However, existing patch-based AL methods often overlook boundary pixels critical information, essential for accurate segmentation. We present OREAL, a novel patch-based AL method designed for multi-class semantic segmentation. OREAL enhances boundary detection by employing maximum aggregation of pixel-wise uncertainty scores. Additionally, we introduce one-vs-rest entropy, a novel uncertainty score function that computes class-wise uncertainties while achieving implicit class balancing during dataset creation. Comprehensive experiments across diverse datasets and model architectures validate our hypothesis.
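One plausible reading of the one-vs-rest entropy score, sketched in NumPy: each class's softmax probability is treated as a binary one-vs-rest problem, its binary entropy is computed per pixel, and patch scores are aggregated by the maximum; the exact formulation in the paper may differ.

```python
# Hedged sketch of a one-vs-rest entropy score with max aggregation per patch.
import numpy as np

def one_vs_rest_entropy(probs: np.ndarray) -> np.ndarray:
    # probs: (H, W, C) per-pixel softmax probabilities
    p = np.clip(probs, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))    # (H, W, C) binary entropy per class

def patch_score(probs: np.ndarray) -> float:
    # max-aggregation over pixels and classes, emphasizing uncertain boundary pixels
    return float(one_vs_rest_entropy(probs).max())

probs = np.random.dirichlet(alpha=np.ones(8), size=(32, 32))  # (32, 32, 8)
print(patch_score(probs))
```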
zh
[CV-47] Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation CVPR2025
【速读】: 该论文试图解决在视觉与语言导航 (Vision-and-Language Navigation, VLN) 任务中,基于自然语言指令在未见环境中导航的困难。现有方法主要依赖于RGB图像进行环境表示,往往忽略了语义知识和空间线索。论文提出的解决方案关键在于引入了一个多功能的语义理解与空间感知 (Semantic Understanding and Spatial Awareness, SUSA) 架构。该架构包括两个核心模块:文本语义理解 (Textual Semantic Understanding, TSU) 模块,通过生成和关联环境地标的描述来缩小指令与环境之间的模态差距;以及基于深度的空间感知 (Depth-based Spatial Perception, DSP) 模块,通过逐步构建深度探索地图来增强对环境布局的细致理解。实验结果表明,SUSA的混合语义-空间表示显著提升了导航性能,在三个VLN基准测试中达到了新的最先进水平。
链接: https://arxiv.org/abs/2412.06465
作者: Xuesong Zhang,Yunbo Xu,Jia Li,Zhenzhen Hu,Richang Hong
关键词-EN: Navigating unseen environments, Navigating unseen, unseen environments based, natural language instructions, language instructions remains
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: under review at CVPR 2025
点击查看摘要
Abstract:Navigating unseen environments based on natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN). While recent advancements have yielded promising outcomes, they primarily rely on RGB images for environmental representation, often overlooking the underlying semantic knowledge and spatial cues. Intuitively, humans inherently ground textual semantics within the spatial layout during indoor navigation. Inspired by this, we propose a versatile Semantic Understanding and Spatial Awareness (SUSA) architecture to facilitate navigation. SUSA includes a Textual Semantic Understanding (TSU) module, which narrows the modality gap between instructions and environments by generating and associating the descriptions of environmental landmarks in the agent’s immediate surroundings. Additionally, a Depth-based Spatial Perception (DSP) module incrementally constructs a depth exploration map, enabling a more nuanced comprehension of environmental layouts. Experimental results demonstrate that SUSA hybrid semantic-spatial representations effectively enhance navigation performance, setting new state-of-the-art performance across three VLN benchmarks (REVERIE, R2R, and SOON). The source code will be publicly available.
zh
[CV-48] Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels
【速读】: 该论文试图解决大规模多模态模型(LMMs)在实际应用中缺乏适应性和通用性评估方法的问题。传统评估方法依赖于固定的标注数据集和监督指标,资源消耗大且难以泛化到新场景。论文的关键解决方案是利用模型的不确定性信号(如softmax概率)进行无监督模型排序。通过分析这些不确定性指标在视觉问答基准上的表现,研究发现基于softmax分布的不确定性评分能够为模型在不同任务中的表现提供稳健且一致的排序依据,从而在不依赖手动标注的情况下,实现对真实世界无标签数据的模型选择。
链接: https://arxiv.org/abs/2412.06461
作者: Weijie Tu,Weijian Deng,Dylan Campbell,Yu Yao,Jiyang Zheng,Tom Gedeon,Tongliang Liu
关键词-EN: large multimodal models, large multimodal, increasingly deployed, diverse applications, ranking
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As large multimodal models (LMMs) are increasingly deployed across diverse applications, the need for adaptable, real-world model ranking has become paramount. Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics, which are resource-intensive and may lack generalizability to novel scenarios, highlighting the importance of unsupervised ranking. In this work, we explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities. We evaluate state-of-the-art LMMs (e.g., LLaVA) across visual question answering benchmarks, analyzing how uncertainty-based metrics can reflect model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust, consistent basis for ranking models across varied tasks. This finding enables the ranking of LMMs on real-world, unlabeled data for visual question answering, providing a practical approach for selecting models across diverse domains without requiring manual annotation.
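A minimal sketch of uncertainty-based unsupervised ranking: each model is scored by the average maximum softmax probability of its answers on unlabeled questions and models are sorted by that score; this is only one of the uncertainty signals the paper studies.

```python
# Illustrative sketch: rank candidate models on unlabeled VQA data by the mean
# maximum softmax probability over answer options (higher = more confident).
import numpy as np

def confidence_score(softmax_outputs: np.ndarray) -> float:
    # softmax_outputs: (num_questions, num_answer_options) per-model predictions
    return float(softmax_outputs.max(axis=1).mean())

def rank_models(model_outputs: dict) -> list:
    scores = {name: confidence_score(p) for name, p in model_outputs.items()}
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(0)
outputs = {f"model_{i}": rng.dirichlet(np.ones(4) * (i + 1), size=200) for i in range(3)}
print(rank_models(outputs))
```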
zh
[CV-49] Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
【速读】: 该论文试图解决大规模视觉-语言模型 (Large Vision-Language Models, LVLMs) 在推理过程中高计算成本的问题。解决方案的关键在于提出了一种名为“Pruning All-Rounder (PAR)”的新框架,该框架通过引入元路由器 (meta-router) 来自适应地组织跨层和跨标记的剪枝流程。与传统依赖参数或标记的剪枝策略不同,PAR 采用自监督学习方式,能够在性能和效率之间实现更优的平衡,并提供多种剪枝版本以适应不同的剪枝场景。
链接: https://arxiv.org/abs/2412.06458
作者: Wei Suo,Ji Ma,Mengyang Sun,Lin Yuanbo Wu,Peng Wang,Yanning Zhang
关键词-EN: Large Vision-Language Models, achieved impressive results, high computational cost, computational cost poses, Vision-Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational cost poses a significant barrier to wider application. To enhance inference efficiency, most existing approaches depend on parameter-dependent or token-dependent strategies to reduce computational demands. However, these methods typically require complex training processes and struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Different from previous works, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. With a self-supervised learning manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios. The code for this work will be made publicly available.
zh
[CV-50] Adaptive Graph Learning from Spatial Information for Surgical Workflow Anticipation DATE
【速读】: 该论文试图解决手术流程预测中的关键问题,特别是在机器人辅助手术(Robotic-Assisted Surgery, RAS)中,如何从实时视频数据中准确预测手术事件的时间。当前方法主要依赖于手术器械的静态交互和固定时间范围的预测,无法有效捕捉动态交互和适应不同时间范围的需求。论文提出的解决方案基于三个关键创新:首先,引入基于手术器械和目标的边界框及其检测置信度的新空间表示;其次,设计自适应图学习方法以捕捉动态交互;最后,开发多时间范围目标函数,实现不受限的预测。这些创新显著提升了短期到中期预测的准确性,减少了手术阶段预测误差约3%和剩余手术时间预测误差约9%,从而提高了手术安全性和手术室使用效率。
链接: https://arxiv.org/abs/2412.06454
作者: Francis Xiatian Zhang,Jingjing Deng,Robert Lieck,Hubert P. H. Shum
关键词-EN: live video data, Robotic-Assisted Surgery, Surgical, video data, relevant surgical events
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by IEEE Transactions on Medical Robotics and Bionics, the direct link to the IEEE page will be updated upon publication
点击查看摘要
Abstract:Surgical workflow anticipation is the task of predicting the timing of relevant surgical events from live video data, which is critical in Robotic-Assisted Surgery (RAS). Accurate predictions require the use of spatial information to model surgical interactions. However, current methods focus solely on surgical instruments, assume static interactions between instruments, and only anticipate surgical events within a fixed time horizon. To address these challenges, we propose an adaptive graph learning framework for surgical workflow anticipation based on a novel spatial representation, featuring three key innovations. First, we introduce a new representation of spatial information based on bounding boxes of surgical instruments and targets, including their detection confidence levels. These are trained on additional annotations we provide for two benchmark datasets. Second, we design an adaptive graph learning method to capture dynamic interactions. Third, we develop a multi-horizon objective that balances learning objectives for different time horizons, allowing for unconstrained predictions. Evaluations on two benchmarks reveal superior performance in short-to-mid-term anticipation, with an error reduction of approximately 3% for surgical phase anticipation and 9% for remaining surgical duration anticipation. These performance improvements demonstrate the effectiveness of our method and highlight its potential for enhancing preparation and coordination within the RAS team. This can improve surgical safety and the efficiency of operating room usage.
zh
[CV-51] Local Attention Transformers for High-Detail Optical Flow Upsampling
【速读】: 该论文试图解决当前广泛采用的凸上采样(convex upsampling)方法在光流计算中存在的问题和局限性。解决方案的关键在于提出一系列改进措施:首先,通过解耦最终凸上采样器的权重,使其更容易找到正确的凸组合;其次,为凸上采样器提供额外的上下文特征;然后,通过基于注意力机制的替代凸上采样器(Transformers for Convex Upsampling)来增加凸掩码(convex mask)的大小,利用局部注意力掩码替代凸掩码以提高掩码尺寸,并提供经验证据表明更大的掩码尺寸增加了凸组合存在的可能性;最后,提出一种替代训练方案以消除模型输出中的双线性插值伪影。这些改进理论上可以应用于几乎所有当前最先进的光流架构,并在实验中显著降低了Sintel Clean训练集的端点误差。
链接: https://arxiv.org/abs/2412.06439
作者: Alexander Gielisse,Nergis Tömen,Jan van Gemert
关键词-EN: convex upsampling, convex, obtain high-resolution flow, convex upsampler, step to obtain
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Note; this work is an extension of my Master’s thesis, available as “Optical Flow Upsamplers Ignore Details: Neighborhood Attention Transformers for Convex Upsampling”
点击查看摘要
Abstract:Most recent works on optical flow use convex upsampling as the last step to obtain high-resolution flow. In this work, we show and discuss several issues and limitations of this currently widely adopted convex upsampling approach. We propose a series of changes, in an attempt to resolve current issues. First, we propose to decouple the weights for the final convex upsampler, making it easier to find the correct convex combination. For the same reason, we also provide extra contextual features to the convex upsampler. Then, we increase the convex mask size by using an attention-based alternative convex upsampler; Transformers for Convex Upsampling. This upsampler is based on the observation that convex upsampling can be reformulated as attention, and we propose to use local attention masks as a drop-in replacement for convex masks to increase the mask size. We provide empirical evidence that a larger mask size increases the likelihood of the existence of the convex combination. Lastly, we propose an alternative training scheme to remove bilinear interpolation artifacts from the model output. Our proposed ideas could theoretically be applied to almost every current state-of-the-art optical flow architecture. On the FlyingChairs + FlyingThings3D training setting we reduce the Sintel Clean training end-point-error of RAFT from 1.42 to 1.26, GMA from 1.31 to 1.18, and that of FlowFormer from 0.94 to 0.90, by solely adapting the convex upsampler.
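For context, a RAFT-style convex upsampling sketch: each fine-resolution pixel of the flow field is a softmax-weighted (hence convex) combination of its 3x3 coarse neighborhood, with the weights predicted as a mask. This is the baseline operation the paper's changes build on, simplified here.

```python
# Simplified RAFT-style convex upsampling of a coarse optical-flow field.
import torch
import torch.nn.functional as F

def convex_upsample(flow, mask, factor=8):
    # flow: (B, 2, H, W) coarse flow; mask: (B, 9 * factor**2, H, W) predicted weights
    b, _, h, w = flow.shape
    mask = mask.view(b, 1, 9, factor, factor, h, w).softmax(dim=2)  # convex weights
    patches = F.unfold(flow * factor, kernel_size=3, padding=1)     # (B, 2*9, H*W)
    patches = patches.view(b, 2, 9, 1, 1, h, w)
    up = (mask * patches).sum(dim=2)                                 # (B, 2, f, f, H, W)
    up = up.permute(0, 1, 4, 2, 5, 3)                                # (B, 2, H, f, W, f)
    return up.reshape(b, 2, h * factor, w * factor)

flow = torch.randn(1, 2, 46, 62)
mask = torch.randn(1, 9 * 64, 46, 62)
print(convex_upsample(flow, mask).shape)  # torch.Size([1, 2, 368, 496])
```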
zh
[CV-52] Deblur4DGS: 4D Gaussian Splatting from Blurry Monocular Video
【速读】: 该论文试图解决从模糊单目视频中重建高质量4D模型的问题。现有方法在处理因相机抖动和物体运动导致的视频模糊时,往往生成模糊的4D重建结果。尽管基于NeRF的方法尝试解决这一问题,但由于在曝光时间内估计连续动态表示的不准确性,效果不佳。论文的关键解决方案是提出首个4D高斯飞溅框架(Deblur4DGS),通过将曝光时间内的连续动态表示估计转化为曝光时间估计,并引入曝光正则化、多帧和多分辨率一致性约束来避免平凡解和减少伪影。此外,论文还提出了模糊感知可变标准高斯模型,以更好地表示具有大运动的物体。该方法不仅适用于新视角合成,还可用于模糊视频的多方面改进,如去模糊、帧插值和视频稳定。
链接: https://arxiv.org/abs/2412.06424
作者: Renlong Wu,Zhilu Zhang,Mingyang Chen,Xiaopeng Fan,Zifei Yan,Wangmeng Zuo
关键词-EN: yielded impressive results, yielded impressive, rely on sharp, Gaussian Splatting, impressive results
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages
点击查看摘要
Abstract:Recent 4D reconstruction methods have yielded impressive results but rely on sharp videos as supervision. However, motion blur often occurs in videos due to camera shake and object movement, while existing methods render blurry results when using such videos for reconstructing 4D models. Although a few NeRF-based approaches attempted to address the problem, they struggled to produce high-quality results, due to the inaccuracy in estimating continuous dynamic representations within the exposure time. Encouraged by recent works in 3D motion trajectory modeling using 3D Gaussian Splatting (3DGS), we suggest taking 3DGS as the scene representation manner, and propose the first 4D Gaussian Splatting framework to reconstruct a high-quality 4D model from blurry monocular video, named Deblur4DGS. Specifically, we transform continuous dynamic representations estimation within an exposure time into the exposure time estimation. Moreover, we introduce exposure regularization to avoid trivial solutions, as well as multi-frame and multi-resolution consistency ones to alleviate artifacts. Furthermore, to better represent objects with large motion, we suggest blur-aware variable canonical Gaussians. Beyond novel-view synthesis, Deblur4DGS can be applied to improve blurry video from multiple perspectives, including deblurring, frame interpolation, and video stabilization. Extensive experiments on the above four tasks show that Deblur4DGS outperforms state-of-the-art 4D reconstruction methods. The codes are available at this https URL.
zh
[CV-53] Continual Learning for Segment Anything Model Adaptation
【速读】: 该论文试图解决在动态场景下,数据以流式方式输入时,现有的单步适应(one-step adaptation)方法在持续分割(continual segmentation)任务中的局限性问题。解决方案的关键在于提出了一种新的简单而有效的域适配器混合算法(Mixture of Domain Adapters, MoDA),该算法通过利用全局特征标记(Global Feature Tokens, GFT)和全局辅助标记(Global Assistant Tokens, GAT)模块,帮助SAM编码器提取不同任务域的分离特征,并提供准确的特定任务信息,从而实现持续学习。实验结果表明,MoDA在持续分割任务中显著优于现有的经典持续学习方法以及基于提示和适配器的方法,并且在知识保留方面表现出卓越的能力。
链接: https://arxiv.org/abs/2412.06418
作者: Jinglong Yang,Yichen Wu,Jun Cen,Wenjian Huang,Hong Wang,Jianguo Zhang
关键词-EN: Continual SAM adaptation, one-step adaptation paradigm, SAM one-step adaptation, one-step adaptation methods, SAM adaptation methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at \url{ this https URL }
点击查看摘要
Abstract:Although the current different types of SAM adaptation methods have achieved promising performance for various downstream tasks, such as prompt-based ones and adapter-based ones, most of them belong to the one-step adaptation paradigm. In real-world scenarios, we are generally confronted with the dynamic scenario where the data comes in a streaming manner. Driven by the practical need, in this paper, we first propose a novel Continual SAM adaptation (CoSAM) benchmark with 8 different task domains and carefully analyze the limitations of the existing SAM one-step adaptation methods in the continual segmentation scenario. Then we propose a novel simple-yet-effective Mixture of Domain Adapters (MoDA) algorithm which utilizes the Global Feature Tokens (GFT) and Global Assistant Tokens (GAT) modules to help the SAM encoder extract well-separated features for different task domains, and then provide the accurate task-specific information for continual learning. Extensive experiments demonstrate that our proposed MoDA obviously surpasses the existing classic continual learning methods, as well as prompt-based and adapter-based approaches for continual segmentation. Moreover, after sequential learning on the CoSAM benchmark with diverse data distributions, our MoDA maintains highly competitive results in the natural image domain, approaching the zero-shot performance of the original SAM, demonstrating its superior capability in knowledge preservation. Notably, the proposed MoDA can be seamlessly integrated into various one-step adaptation methods of SAM, which can consistently bring obvious performance gains. Code is available at this https URL
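摘要中的域适配器混合(MoDA)可以用"路由器 + 多个轻量适配器"的结构来直观理解。下面是一个示意性的 PyTorch 草图:其中用 token 的全局平均代替论文中的全局特征标记(GFT/GAT),路由与加权方式均为笔者的简化假设,并非论文实现。

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """轻量级瓶颈适配器:down -> GELU -> up,带残差连接。"""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class MixtureOfAdapters(nn.Module):
    """按任务域路由的适配器混合(示意):router 对全局特征打分,
    用 softmax 权重加权各域适配器的输出。"""
    def __init__(self, dim, num_domains=8):
        super().__init__()
        self.adapters = nn.ModuleList([Adapter(dim) for _ in range(num_domains)])
        self.router = nn.Linear(dim, num_domains)

    def forward(self, tokens):                      # tokens: (B, N, C)
        global_feat = tokens.mean(dim=1)            # 用全局平均简化代替 GFT/GAT
        weights = torch.softmax(self.router(global_feat), dim=-1)      # (B, D)
        outs = torch.stack([a(tokens) for a in self.adapters], dim=1)  # (B, D, N, C)
        return (weights[:, :, None, None] * outs).sum(dim=1)

tokens = torch.randn(2, 196, 256)
print(MixtureOfAdapters(256)(tokens).shape)   # torch.Size([2, 196, 256])
```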
zh
[CV-54] World-Consistent Data Generation for Vision-and-Language Navigation
【速读】: 该论文试图解决视觉与语言导航 (Vision-and-Language Navigation, VLN) 任务中的数据稀缺问题,导致模型在未见环境中的泛化性能较差。解决方案的关键在于提出了世界一致性数据生成 (World-Consistent Data Generation, WCGEN) 框架,该框架通过两个阶段确保数据生成的多样性和世界一致性:首先是轨迹阶段,利用基于点云的技术确保视角间的空间连贯性;其次是视角阶段,采用新颖的角度合成方法保证整个观察空间的空间和环绕一致性。通过精确预测视角变化并结合3D知识,该方法在生成过程中保持了世界一致性,从而显著提升了模型在未见环境中的泛化能力。
链接: https://arxiv.org/abs/2412.06413
作者: Yu Zhong,Rui Zhang,Zihao Zhang,Shuo Wang,Chuan Fang,Xishan Zhang,Jiaming Guo,Shaohui Peng,Di Huang,Yanyang Yan,Xing Hu,Ping Tan,Qi Guo
关键词-EN: natural-language instructions, navigate through photorealistic, VLN, generate VLN data, photorealistic environments
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-and-Language Navigation (VLN) is a challenging task that requires an agent to navigate through photorealistic environments following natural-language instructions. One main obstacle existing in VLN is data scarcity, leading to poor generalization performance over unseen environments. Though data augmentation is a promising way of scaling up the dataset, how to generate VLN data that is both diverse and world-consistent remains problematic. To cope with this issue, we propose the world-consistent data generation (WCGEN), an efficacious data-augmentation framework satisfying both diversity and world-consistency, targeting at enhancing the generalizations of agents to novel environments. Roughly, our framework consists of two stages, the trajectory stage which leverages a point-cloud based technique to ensure spatial coherency among viewpoints, and the viewpoint stage which adopts a novel angle synthesis method to guarantee spatial and wraparound consistency within the entire observation. By accurately predicting viewpoint changes with 3D knowledge, our approach maintains the world-consistency during the generation procedure. Experiments on a wide range of datasets verify the effectiveness of our method, demonstrating that our data augmentation strategy enables agents to achieve new state-of-the-art results on all navigation tasks, and is capable of enhancing the VLN agents’ generalization ability to unseen environments.
zh
[CV-55] Generative Lines Matching Models
【速读】: 该论文旨在解决去噪模型训练过程中出现的奇点问题,该奇点导致去噪器的预测结果趋向于源分布或目标分布的均值,从而产生虚假的吸引盆,扭曲去噪轨迹并增加采样步骤。解决方案的关键在于利用基于确定性常微分方程 (ODE) 的采样器,这些采样器由某些去噪扩散模型和得分匹配模型提供,能够在源分布和目标分布之间建立明确的变量变换关系。基于此,论文提出了一种新的概率流模型——直线匹配模型 (Lines Matching Model, LMM),该模型通过全局直线插值来匹配两个分布。LMM 生成的流场表现出显著的时间一致性,从而产生具有优异直线度评分的轨迹。此外,LMM 通过集成领域特定的重建损失和对抗损失,并针对采样过程优化训练,进一步提高了生成样本的保真度。实验结果表明,LMM 在多个基准数据集上实现了最先进的 FID 分数,同时最小化了采样步骤数。
链接: https://arxiv.org/abs/2412.06403
作者: Ori Matityahu,Raanan Fattal
关键词-EN: key denoising models, Lines Matching Model, paper we identify, loss of key, denoiser predictions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper we identify the source of a singularity in the training loss of key denoising models, that causes the denoiser’s predictions to collapse towards the mean of the source or target distributions. This degeneracy creates false basins of attraction, distorting the denoising trajectories and ultimately increasing the number of steps required to sample these models. We circumvent this artifact by leveraging the deterministic ODE-based samplers, offered by certain denoising diffusion and score-matching models, which establish a well-defined change-of-variables between the source and target distributions. Given this correspondence, we propose a new probability flow model, the Lines Matching Model (LMM), which matches globally straight lines interpolating the two distributions. We demonstrate that the flow fields produced by the LMM exhibit notable temporal consistency, resulting in trajectories with excellent straightness scores. Beyond its sampling efficiency, the LMM formulation allows us to enhance the fidelity of the generated samples by integrating domain-specific reconstruction and adversarial losses, and by optimizing its training for the sampling procedure used. Overall, the LMM achieves state-of-the-art FID scores with minimal NFEs on established benchmark datasets: 1.57/1.39 (NFE=1/2) on CIFAR-10, 1.47/1.17 on ImageNet 64x64, and 2.68/1.54 on AFHQ 64x64. Finally, we provide a theoretical analysis showing that the use of optimal transport to relate the two distributions suffers from a curse of dimensionality, where the pairing set size (mini-batch) must scale exponentially with the signal dimension.
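摘要中"匹配连接两个分布的全局直线"的训练目标,可以用类似流匹配的形式写成一个很短的损失函数。下面是一个示意性草图(PyTorch):在 x0 与 x1 的直线插值点上回归恒定速度 x1 - x0;论文中的样本配对策略、重建与对抗损失等均未包含,网络结构也是随意假设的,仅用于说明目标函数的基本形式。

```python
import torch
import torch.nn as nn

def lines_matching_loss(model, x0, x1):
    """直线匹配式训练目标(示意):取直线插值 x_t = (1 - t) * x0 + t * x1,
    让网络在 (x_t, t) 上回归恒定速度 x1 - x0(对应全局直线轨迹)。"""
    B = x0.shape[0]
    t = torch.rand(B, 1, 1, 1, device=x0.device)   # 每个样本随机采样时间
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0                              # 直线轨迹对应的恒定速度场
    v_pred = model(x_t, t.flatten())
    return ((v_pred - v_target) ** 2).mean()

class TinyVelocityNet(nn.Module):
    """极简的速度场网络,仅作占位示例。"""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch + 1, 32, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(32, ch, 3, padding=1))
    def forward(self, x, t):
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, t_map], dim=1))

x0 = torch.randn(4, 3, 32, 32)   # 源分布(噪声)样本
x1 = torch.randn(4, 3, 32, 32)   # 目标分布(数据)样本
print(lines_matching_loss(TinyVelocityNet(), x0, x1).item())
```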
zh
[CV-56] Is Self-Supervision Enough? Benchmarking Foundation Models Against End-to-End Training for Mitotic Figure Classification
【速读】: 该论文试图解决在组织病理学领域中,基于基础模型(Foundation Models, FMs)的迁移学习是否能够有效提升有丝分裂图像分类的性能和鲁棒性问题。解决方案的关键在于通过线性探测(linear probing)比较五种公开的基础模型与ImageNet预训练模型以及端到端训练的ResNet50基线模型在有丝分裂图像分类任务中的表现。研究结果表明,端到端训练的基线模型在所有数据量下均优于基础模型,且基础模型并未表现出更强的领域鲁棒性,从而否定了基础模型在减少标注数据需求和提升领域鲁棒性方面的假设。
链接: https://arxiv.org/abs/2412.06365
作者: Jonathan Ganz,Jonas Ammeling,Emely Rosbach,Ludwig Lausser,Christof A. Bertram,Katharina Breininger,Marc Aubreville
关键词-EN: typically unlabeled data, Foundation models, typically unlabeled, vast amount, amount of typically
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures
点击查看摘要
Abstract:Foundation models (FMs), i.e., models trained on a vast amount of typically unlabeled data, have become popular and available recently for the domain of histopathology. The key idea is to extract semantically rich vectors from any input patch, allowing for the use of simple subsequent classification networks potentially reducing the required amounts of labeled data, and increasing domain robustness. In this work, we investigate to which degree this also holds for mitotic figure classification. Utilizing two popular public mitotic figure datasets, we compared linear probing of five publicly available FMs against models trained on ImageNet and a simple ResNet50 end-to-end-trained baseline. We found that the end-to-end-trained baseline outperformed all FM-based classifiers, regardless of the amount of data provided. Additionally, we did not observe the FM-based classifiers to be more robust against domain shifts, rendering both of the above assumptions incorrect.
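论文比较的"线性探测"本身非常简单:冻结基础模型,仅在其提取的特征上训练一个线性分类头。下面给出一个通用的示意实现(PyTorch),其中假设 foundation_model 直接返回 (B, D) 的特征向量,数据加载器与超参数均为示例设置,并非论文的实验配置。

```python
import torch
import torch.nn as nn

def linear_probe(foundation_model, train_loader, num_classes=2, epochs=10, device="cpu"):
    """线性探测(示意):冻结基础模型,只训练一个线性分类头。"""
    foundation_model.eval()
    for p in foundation_model.parameters():
        p.requires_grad = False

    with torch.no_grad():  # 用一个 batch 推断特征维度
        feat_dim = foundation_model(next(iter(train_loader))[0].to(device)).shape[1]

    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = foundation_model(images)   # 特征提取不反传梯度
            loss = ce(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```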
zh
[CV-57] On-Device Self-Supervised Learning of Low-Latency Monocular Depth from Only Events
【速读】: 该论文试图解决在资源受限的敏捷机器人(如小型飞行无人机)中,基于事件相机(event cameras)的视觉感知问题。解决方案的关键在于通过自监督学习(self-supervised learning)中的对比最大化(contrast maximization)方法,提升深度估计任务的计算效率和内存效率,从而实现实时在线学习。具体来说,论文改进了对比最大化学习流程的时间和内存效率,并通过基准测试和实际飞行实验验证了其在深度估计和障碍物规避中的有效性,同时展示了在飞行过程中进行少量预训练和微调的可行性。
链接: https://arxiv.org/abs/2412.06359
作者: Jesse Hagenaars,Yilun Wu,Federico Paredes-Vallés,Stein Stroobants,Guido de Croon
关键词-EN: cameras provide low-latency, Event cameras provide, provide low-latency perception, milliwatts of power, cameras provide
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Event cameras provide low-latency perception for only milliwatts of power. This makes them highly suitable for resource-restricted, agile robots such as small flying drones. Self-supervised learning based on contrast maximization holds great potential for event-based robot vision, as it foregoes the need to high-frequency ground truth and allows for online learning in the robot’s operational environment. However, online, onboard learning raises the major challenge of achieving sufficient computational efficiency for real-time learning, while maintaining competitive visual perception performance. In this work, we improve the time and memory efficiency of the contrast maximization learning pipeline. Benchmarking experiments show that the proposed pipeline achieves competitive results with the state of the art on the task of depth estimation from events. Furthermore, we demonstrate the usability of the learned depth for obstacle avoidance through real-world flight experiments. Finally, we compare the performance of different combinations of pre-training and fine-tuning of the depth estimation networks, showing that on-board domain adaptation is feasible given a few minutes of flight.
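摘要提到的"对比度最大化"自监督目标可以用一个极简的数值例子来说明:将事件按候选运动参数扭正后累积成图像,运动补偿越准确,累积图像越锐利(方差越大)。下面的 NumPy 草图只考虑全局平移速度这一最简单情形,与论文面向深度估计的完整流程相比大幅简化,事件的合成方式与各参数均为假设。

```python
import numpy as np

def iwe_variance(xs, ys, ts, flow, H, W):
    """对比度最大化(简化示意):以候选速度 flow=(vx, vy) 将事件反向扭正到参考时刻,
    累积成扭正事件图(IWE),返回其方差;补偿越准确,事件越集中,方差越大。"""
    t_ref = ts.min()
    wx = np.clip(np.round(xs - flow[0] * (ts - t_ref)).astype(int), 0, W - 1)
    wy = np.clip(np.round(ys - flow[1] * (ts - t_ref)).astype(int), 0, H - 1)
    iwe = np.zeros((H, W))
    np.add.at(iwe, (wy, wx), 1.0)   # 按扭正后的像素位置累积事件
    return iwe.var()

# 合成一组沿速度 (5, 2) 像素/秒运动的"边缘点"事件
rng = np.random.default_rng(0)
pts = rng.uniform(10, 50, size=(50, 2))          # 50 个源点
ts = rng.uniform(0, 1, size=(50, 100))           # 每个点触发 100 个事件
xs = (pts[:, :1] + 5 * ts).ravel()
ys = (pts[:, 1:] + 2 * ts).ravel()
ts = ts.ravel()
# 正确的候选速度应给出更大的对比度(方差)
print(iwe_variance(xs, ys, ts, (5, 2), 64, 64) > iwe_variance(xs, ys, ts, (0, 0), 64, 64))
```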
zh
[CV-58] SeFENet: Robust Deep Homography Estimation via Semantic-Driven Feature Enhancement
【速读】: 该论文试图解决在恶劣环境下拍摄的图像因模糊、对比度降低和色彩失真等问题,导致特征检测和匹配困难,从而影响单应性估计(homography estimation)的准确性和鲁棒性的问题。解决方案的关键在于提出了一种语义驱动的特征增强网络(semantic-driven feature enhancement network),称为SeFENet。该网络通过引入层次化的尺度感知模块(hierarchical scale-aware module)来扩展感受野,聚合多尺度信息,从而在不同恶劣条件下有效提取图像特征。同时,结合语义引导的约束模块(semantic-guided constraint module)和高层次感知框架,实现了对图像退化的容忍性。通过元学习(meta-learning)训练策略,缓解了语义特征与结构特征之间的差异。最终,通过内外交替优化,网络实现了隐式的语义特征增强,增强了局部特征理解和上下文信息提取,从而提高了在恶劣环境下单应性估计的鲁棒性。
链接: https://arxiv.org/abs/2412.06352
作者: Zeru Shi,Zengxi Zhang,Zhiying Jiang,Ruizhe An,Jinyuan Liu
关键词-EN: exhibit blurred details, hinder feature detection, blurred details, color distortion, detection and matching
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Images captured in harsh environments often exhibit blurred details, reduced contrast, and color distortion, which hinder feature detection and matching, thereby affecting the accuracy and robustness of homography estimation. While visual enhancement can improve contrast and clarity, it may introduce visual-tolerant artifacts that obscure the structural integrity of images. Considering the resilience of semantic information against environmental interference, we propose a semantic-driven feature enhancement network for robust homography estimation, dubbed SeFENet. Concretely, we first introduce an innovative hierarchical scale-aware module to expand the receptive field by aggregating multi-scale information, thereby effectively extracting image features under diverse harsh conditions. Subsequently, we propose a semantic-guided constraint module combined with a high-level perceptual framework to achieve degradation-tolerant with semantic feature. A meta-learning-based training strategy is introduced to mitigate the disparity between semantic and structural features. By internal-external alternating optimization, the proposed network achieves implicit semantic-wise feature enhancement, thereby improving the robustness of homography estimation in adverse environments by strengthening the local feature comprehension and context information extraction. Experimental results under both normal and harsh conditions demonstrate that SeFENet significantly outperforms SOTA methods, reducing point match error by at least 41% on the large-scale datasets.
zh
[CV-59] Elastic-DETR: Making Image Resolution Learnable with Content-Specific Network Prediction
【速读】: 该论文试图解决现代目标检测器(如 DETR)中多尺度图像分辨率选择的手动超参数调整问题,这种手动选择限制了分辨率的灵活性,并需要依赖先验知识。解决方案的关键在于提出了一个名为 Elastic-DETR 的新策略,通过引入可学习的分辨率机制,使网络能够根据图像内容自适应地调整分辨率。具体实现包括一个紧凑的尺度预测模块(2 GFLOPs),并通过两个损失函数——尺度损失(scale loss)和分布损失(distribution loss)来优化分辨率选择,从而在不依赖先验知识的情况下实现分辨率的灵活调整。实验结果表明,该方法在保持计算复杂度降低的同时,显著提升了模型在 MS COCO 数据集上的准确性。
链接: https://arxiv.org/abs/2412.06341
作者: Daeun Seo,Hoeseok Yang,Sihyeong Park,Hyungshin Kim
关键词-EN: modern object detectors, facto standard approach, Multi-scale image resolution, Multi-scale image, object detectors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multi-scale image resolution is a de facto standard approach in modern object detectors, such as DETR. This technique allows for the acquisition of various scale information from multiple image resolutions. However, manual hyperparameter selection of the resolution can restrict its flexibility, which is informed by prior knowledge, necessitating human intervention. This work introduces a novel strategy for learnable resolution, called Elastic-DETR, enabling elastic utilization of multiple image resolutions. Our network provides an adaptive scale factor based on the content of the image with a compact scale prediction module ( 2 GFLOPs). The key aspect of our method lies in how to determine the resolution without prior knowledge. We present two loss functions derived from identified key components for resolution optimization: scale loss, which increases adaptiveness according to the image, and distribution loss, which determines the overall degree of scaling based on network performance. By leveraging the resolution’s flexibility, we can demonstrate various models that exhibit varying trade-offs between accuracy and computational complexity. We empirically show that our scheme can unleash the potential of a wide spectrum of image resolutions without constraining flexibility. Our models on MS COCO establish a maximum accuracy gain of 3.5%p or 26% decrease in computation than MS-trained DN-DETR.
zh
[CV-60] UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts
【速读】: 该论文试图解决视频修复(video inpainting)和视频插值(video interpolation)作为两个独立任务的传统方法问题,提出了一种统一的时空视频修复框架UniPaint。解决方案的关键在于引入了一个即插即用的时空视频修复适配器,并通过提出混合专家注意力机制(Mixture of Experts, MoE)来覆盖多种任务,使得这两个任务能够相互增强。此外,论文还设计了一种时空掩码策略,在训练阶段通过相互增强来提升性能。
链接: https://arxiv.org/abs/2412.06340
作者: Zhen Wan,Yue Ma,Chenyang Qi,Zhiheng Liu,Tao Gui
关键词-EN: space-time video inpainting, unified generative space-time, generative space-time video, unified inpainting framework, video inpainting framework
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we present UniPaint, a unified generative space-time video inpainting framework that enables spatial-temporal inpainting and interpolation. Different from existing methods that treat video inpainting and video interpolation as two distinct tasks, we leverage a unified inpainting framework to tackle them and observe that these two tasks can mutually enhance synthesis performance. Specifically, we first introduce a plug-and-play space-time video inpainting adapter, which can be employed in various personalized models. The key insight is to propose a Mixture of Experts (MoE) attention to cover various tasks. Then, we design a spatial-temporal masking strategy during the training stage to mutually enhance each other and improve performance. UniPaint produces high-quality and aesthetically pleasing results, achieving the best quantitative results across various tasks and scale setups. The code and checkpoints will be available soon.
zh
[CV-61] riDi: Trilateral Diffusion of 3D Humans Objects and Interactions
【速读】: 该论文试图解决三维人-物交互 (3D human-object interaction, HOI) 建模问题,特别是现有方法只能单向处理(如根据物体恢复人体交互或根据人体姿态恢复物体姿态)的局限性。解决方案的关键在于提出了一个统一的模型——TriDi,该模型通过三向扩散过程同时生成人体、物体和交互模态,能够在一个网络中建模七种分布。TriDi采用Transformer结构,通过处理不同模态的token来发现它们之间的条件关系,并允许用户通过文本描述或接触图来控制交互。该模型将文本描述和接触图嵌入共享的潜在空间,结合了文本描述的实用性和接触图的表现力,从而统一了先前工作的特殊情况并扩展到新的应用场景。
链接: https://arxiv.org/abs/2412.06334
作者: Ilya A. Petrov,Riccardo Marin,Julian Chibane,Gerard Pons-Moll
关键词-EN: mixed-reality applications, problem of great, great interest, interest for computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Modeling 3D human-object interaction (HOI) is a problem of great interest for computer vision and a key enabler for virtual and mixed-reality applications. Existing methods work in a one-way direction: some recover plausible human interactions conditioned on a 3D object; others recover the object pose conditioned on a human pose. Instead, we provide the first unified model - TriDi which works in any direction. Concretely, we generate Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process, allowing to model seven distributions with one network. We implement TriDi as a transformer attending to the various modalities’ tokens, thereby discovering conditional relations between them. The user can control the interaction either as a text description of HOI or a contact map. We embed these two representations into a shared latent space, combining the practicality of text descriptions with the expressiveness of contact maps. Using a single network, TriDi unifies all the special cases of prior work and extends to new ones, modeling a family of seven distributions. Remarkably, despite using a single model, TriDi generated samples surpass one-way specialized baselines on GRAB and BEHAVE in terms of both qualitative and quantitative metrics, and demonstrating better diversity. We show the applicability of TriDi to scene population, generating objects for human-contact datasets, and generalization to unseen object geometry. The project page is available at: this https URL.
zh
[CV-62] Normalizing Flows are Capable Generative Models
【速读】: 该论文旨在解决归一化流 (Normalizing Flows, NFs) 在生成建模任务中的性能问题,特别是其在图像生成和密度估计方面的潜力。论文提出了一个名为 TarFlow 的新架构,该架构基于 Transformer 的掩码自回归流 (Masked Autoregressive Flows, MAFs) 变体,通过在图像块上堆叠自回归 Transformer 块,并在层之间交替自回归方向,实现了高效的端到端训练和像素直接建模与生成。关键解决方案包括三个提升样本质量的技术:训练期间的 Gaussian 噪声增强、训练后的去噪过程以及适用于类别条件和无条件设置的有效引导方法。这些技术的结合使得 TarFlow 在图像的似然估计上达到了新的最先进水平,并在样本质量和多样性上媲美扩散模型,首次实现了独立归一化流模型的这一性能。
链接: https://arxiv.org/abs/2412.06329
作者: Shuangfei Zhai,Ruixiang Zhang,Preetum Nakkiran,David Berthelot,Jiatao Gu,Huangjie Zheng,Tianrong Chen,Miguel Angel Bautista,Navdeep Jaitly,Josh Susskind
关键词-EN: Normalizing Flows, likelihood-based models, density estimation, generative modeling, Masked Autoregressive Flows
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model. We make our code available at this https URL.
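摘要中提到的两个提升样本质量的技巧(训练期高斯噪声增强与训练后去噪)本身可以写成很短的代码。下面是一个示意草图(PyTorch):去噪部分采用 Tweedie 公式、利用模型自身的 log 似然梯度(score)做单步去噪,这只是笔者对"post training denoising"的一种常见理解,函数名与 sigma 取值均为假设,并非论文官方实现。

```python
import torch

def noise_augment(x, sigma=0.05):
    """训练期高斯噪声增强(示意):对已归一化到 [0,1] 的像素加入小幅噪声,
    让流模型在加噪数据上最大化似然,以缓解离散像素带来的退化。"""
    return x + sigma * torch.randn_like(x)

def tweedie_denoise(x_noisy, log_prob_fn, sigma=0.05):
    """训练后去噪(示意):Tweedie 公式 E[x | x_noisy] ≈ x_noisy + sigma^2 * ∇log p(x_noisy),
    其中 log_prob_fn 为在加噪数据上训练好的模型给出的逐样本 log 似然。"""
    x_noisy = x_noisy.detach().requires_grad_(True)
    (score,) = torch.autograd.grad(log_prob_fn(x_noisy).sum(), x_noisy)
    return x_noisy + sigma ** 2 * score

# 用标准正态的 log 密度(忽略常数)做占位,验证流程可运行
x = torch.rand(2, 3, 8, 8)
x_noisy = noise_augment(x)
dummy_log_prob = lambda z: -0.5 * (z ** 2).flatten(1).sum(dim=1)
print(tweedie_denoise(x_noisy, dummy_log_prob).shape)
```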
zh
[CV-63] World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving
【速读】: 该论文试图解决多模态大语言模型(MLLMs)在自动驾驶中面对感知受限区域(如动态或静态遮挡区域)时,难以有效整合感知能力与世界知识进行推理的问题。解决方案的关键在于提出一个即插即用的指令引导交互模块,该模块能够弥合模态间的差距并显著减少输入序列长度,从而适应多视角视频输入。此外,通过收集和精炼一个大规模多模态数据集,包含200万自然语言问答对和170万定位任务数据,以及引入一个包含20万问答对的对象级风险评估数据集,进一步提升了模型在驾驶相关任务中整合世界知识的能力。
链接: https://arxiv.org/abs/2412.06324
作者: Mingliang Zhai,Cheng Li,Zengyuan Guo,Ningrui Yang,Xiameng Qin,Yuwei Wu,Sanyuan Zhao,Junyu Han,Ji Tao,Yunde Jia
关键词-EN: Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, world knowledge
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages. Supplementary Material
点击查看摘要
Abstract:The Multi-modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, when faced with perception-limited areas (dynamic or static occlusion regions), MLLMs struggle to effectively integrate perception ability with world knowledge for reasoning. These perception-limited regions can conceal crucial safety information, especially for vulnerable road users. In this paper, we propose a framework, which aims to improve autonomous driving performance under perceptionlimited conditions by enhancing the integration of perception capabilities and world knowledge. Specifically, we propose a plug-and-play instruction-guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi-view video inputs. Furthermore, to better integrate world knowledge with driving-related tasks, we have collected and refined a large-scale multi-modal dataset that includes 2 million natural language QA pairs, 1.7 million grounding task data. To evaluate the model’s utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, where the questions necessitate multi-step reasoning leveraging world knowledge for resolution. Extensive experiments validate the effectiveness of our proposed method.
zh
[CV-64] HAIFAI: Human-AI Collaboration for Mental Face Reconstruction
【速读】: 该论文试图解决从用户心理图像中重建面部视觉表示的挑战性任务。解决方案的关键在于提出了一种新颖的人机协作系统HAIFAI,通过用户对AI系统呈现的图像进行迭代排序,系统能够提取相关图像特征并融合成统一的特征向量,进而利用生成模型重建心理图像。此外,HAIFAI-X扩展允许用户通过滑块界面手动精炼重建结果,以进一步提高重建质量。为避免繁琐的人类数据收集,论文还引入了计算用户模型来模拟人类排序行为,并通过在线众包研究收集了一个小规模面部排序数据集。实验结果表明,HAIFAI在重建质量、可用性、感知工作负荷和重建速度方面优于现有技术,而HAIFAI-X在牺牲部分可用性和增加重建时间的情况下,实现了更高的重建质量和新的最先进识别率。
链接: https://arxiv.org/abs/2412.06323
作者: Florian Strohm,Mihai Bâce,Andreas Bulling
关键词-EN: person mind, tackle the challenging, challenging task, visual representation, mental image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present HAIFAI - a novel collaborative human-AI system to tackle the challenging task of reconstructing a visual representation of a face that exists only in a person’s mind. Users iteratively rank images presented by the AI system based on their resemblance to a mental image. These rankings, in turn, allow the system to extract relevant image features, fuse them into a unified feature vector, and use a generative model to reconstruct the mental image. We also propose an extension called HAIFAI-X that allows users to manually refine and further improve the reconstruction using an easy-to-use slider interface. To avoid the need for tedious human data collection for model training, we introduce a computational user model of human ranking behaviour. For this, we collected a small face ranking dataset through an online crowd-sourcing study containing data from 275 participants. We evaluate HAIFAI and HAIFAI-X in a 12-participant user study and show that HAIFAI outperforms the previous state of the art regarding reconstruction quality, usability, perceived workload, and reconstruction speed. HAIFAI-X achieves even better reconstruction quality at the cost of reduced usability, perceived workload, and increased reconstruction time. We further validate the reconstructions in a subsequent face ranking study with 18 participants and show that HAIFAI-X achieves a new state-of-the-art identification rate of 60.6%. These findings represent a significant advancement towards developing new collaborative intelligent systems capable of reliably and effortlessly reconstructing a user’s mental image.
zh
[CV-65] LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations WACV2025
【速读】: 该论文试图解决场景图生成 (Scene Graph Generation, SGG) 中现有模型忽视空间关系和在开放词汇环境下泛化能力不足的问题。解决方案的关键在于提出了LLaVA-SpaceSGG,一种多模态大语言模型 (Multimodal Large Language Model, MLLM),专门用于开放词汇的SGG任务,并通过增强空间关系建模来提升性能。为了训练该模型,研究者构建了SpaceSGG数据集,结合了公开数据集和合成数据,涵盖了对象位置、对象关系和深度信息,并以空间SGG描述、问答和对话三种格式呈现。此外,论文还引入了一个两阶段的训练范式,以更好地将MLLM的固有能力迁移到SGG任务中,实验结果表明LLaVA-SpaceSGG在开放词汇SGG任务中显著优于其他方法。
链接: https://arxiv.org/abs/2412.06322
作者: Mingjie Xu,Mengyang Wu,Yuzhi Zhao,Jason Chun Lok Li,Weifeng Ou
关键词-EN: Scene Graph Generation, structured graph representations, converts visual scenes, providing deeper scene, deeper scene understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the WACV 2025, including supplementary material
点击查看摘要
Abstract:Scene Graph Generation (SGG) converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks. However, existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. To address these limitations, we propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling. To train it, we collect the SGG instruction-tuning dataset, named SpaceSGG. This dataset is constructed by combining publicly available datasets and synthesizing data using open-source models within our data construction pipeline. It combines object locations, object relations, and depth information, resulting in three data formats: spatial SGG description, question-answering, and conversation. To enhance the transfer of MLLMs’ inherent capabilities to the SGG task, we introduce a two-stage training paradigm. Experiments show that LLaVA-SpaceSGG outperforms other open-vocabulary SGG methods, boosting recall by 8.6% and mean recall by 28.4% compared to the baseline. Our codebase, dataset, and trained models are publicly accessible on GitHub at the following URL: this https URL.
zh
[CV-66] Vision-Based Deep Reinforcement Learning of UAV Autonomous Navigation Using Privileged Information
【速读】: 该论文试图解决无人机在复杂和未知环境中高效自主导航及避障的问题,特别是在部分可观测环境下的高速度自主导航挑战。解决方案的关键在于提出了DPRL(Distributed Privileged Reinforcement Learning)导航算法,该算法结合了深度强化学习(Deep Reinforcement Learning)与特权学习(Privileged Learning),通过不对称的Actor-Critic架构在训练过程中为智能体提供特权信息,从而增强其感知能力。此外,采用多智能体探索策略在多样环境中加速经验收集,进一步加快模型收敛。这些方法共同提升了算法的飞行效率、鲁棒性和整体成功率。
链接: https://arxiv.org/abs/2412.06313
作者: Junqiao Wang,Zhongliang Yu,Dong Zhou,Jiaqi Shi,Runran Deng
关键词-EN: efficient autonomous navigation, autonomous UAV navigation, Distributed Privileged Reinforcement, high-speed autonomous UAV, Privileged Reinforcement Learning
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 9 figures
点击查看摘要
Abstract:The capability of UAVs for efficient autonomous navigation and obstacle avoidance in complex and unknown environments is critical for applications in agricultural irrigation, disaster relief and logistics. In this paper, we propose the DPRL (Distributed Privileged Reinforcement Learning) navigation algorithm, an end-to-end policy designed to address the challenge of high-speed autonomous UAV navigation under partially observable environmental conditions. Our approach combines deep reinforcement learning with privileged learning to overcome the impact of observation data corruption caused by partial observability. We leverage an asymmetric Actor-Critic architecture to provide the agent with privileged information during training, which enhances the model’s perceptual capabilities. Additionally, we present a multi-agent exploration strategy across diverse environments to accelerate experience collection, which in turn expedites model convergence. We conducted extensive simulations across various scenarios, benchmarking our DPRL algorithm against the state-of-the-art navigation algorithms. The results consistently demonstrate the superior performance of our algorithm in terms of flight efficiency, robustness and overall success rate.
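DPRL 的"非对称 Actor-Critic"思想可以用一个极简的网络结构草图来说明:Critic 在训练时额外接收特权信息,而 Actor 只依赖机载可得的部分观测,部署时只保留 Actor。以下 PyTorch 代码为示意,维度与网络结构均为假设,并未包含强化学习训练循环。

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """非对称 Actor-Critic(示意):Actor 只看部分观测,Critic 额外使用特权信息。"""
    def __init__(self, obs_dim, priv_dim, act_dim, hidden=128):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, act_dim))
        self.critic = nn.Sequential(nn.Linear(obs_dim + priv_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1))

    def act(self, obs):                     # 部署阶段:仅使用机载观测
        return torch.tanh(self.actor(obs))

    def value(self, obs, privileged):       # 训练阶段:价值估计可利用特权信息
        return self.critic(torch.cat([obs, privileged], dim=-1))

model = AsymmetricActorCritic(obs_dim=64, priv_dim=16, act_dim=4)
obs, priv = torch.randn(8, 64), torch.randn(8, 16)
print(model.act(obs).shape, model.value(obs, priv).shape)
```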
zh
[CV-67] Self-Paced Learning Strategy with Easy Sample Prior Based on Confidence for the Flying Bird Object Detection Model Training
【速读】: 该论文旨在解决飞行鸟类目标检测模型(FBOD模型)在训练过程中受到难样本(hard samples)影响的问题。解决方案的关键在于提出了一种基于自步学习策略(Self-Paced Learning, SPL)的新型训练策略,称为基于信心(Confidence)的简单样本优先自步学习策略(SPL-ESP-BC)。该策略首先改进了SPL中的基于损失的最小化函数,提出了基于信心的最小化函数,使其更适合于单类目标检测任务。其次,通过引入简单样本优先(ESP)策略,使模型在训练初期能够区分简单和难样本。最终,通过结合ESP策略和基于信心的最小化函数,提出了SPL-ESP-BC训练策略,使得FBOD模型能够从简单到复杂逐步学习飞行鸟类的特征,从而提高了检测性能。实验结果表明,与标准训练策略和其他基于损失的SPL策略相比,SPL-ESP-BC策略训练的FBOD模型在综合检测性能上表现最佳。
链接: https://arxiv.org/abs/2412.06306
作者: Zi-Wei Sun,Ze-Xi hua,Heng-Chao Li,Yan Li
关键词-EN: Easy Sample Prior, FBOD model, Flying Bird Object, Self-Paced Learning strategy, Sample Prior Based
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In order to avoid the impact of hard samples on the training process of the Flying Bird Object Detection model (FBOD model, in our previous work, we designed the FBOD model according to the characteristics of flying bird objects in surveillance video), the Self-Paced Learning strategy with Easy Sample Prior Based on Confidence (SPL-ESP-BC), a new model training strategy, is proposed. Firstly, the loss-based Minimizer Function in Self-Paced Learning (SPL) is improved, and the confidence-based Minimizer Function is proposed, which makes it more suitable for one-class object detection tasks. Secondly, to give the model the ability to judge easy and hard samples at the early stage of training by using the SPL strategy, an SPL strategy with Easy Sample Prior (ESP) is proposed. The FBOD model is trained using the standard training strategy with easy samples first, then the SPL strategy with all samples is used to train it. Combining the strategy of the ESP and the Minimizer Function based on confidence, the SPL-ESP-BC model training strategy is proposed. Using this strategy to train the FBOD model can make it to learn the characteristics of the flying bird object in the surveillance video better, from easy to hard. The experimental results show that compared with the standard training strategy that does not distinguish between easy and hard samples, the AP50 of the FBOD model trained by the SPL-ESP-BC is increased by 2.1%, and compared with other loss-based SPL strategies, the FBOD model trained with SPL-ESP-BC strategy has the best comprehensive detection performance.
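摘要中"基于置信度的最小化函数"在最简单的硬阈值形式下,相当于只让置信度高于阈值的"简单样本"参与当前轮训练,并随训练推进逐步放宽阈值。下面给出一个示意草图(NumPy),其中阈值的调度方式为笔者假设,仅说明由易到难的课程思路。

```python
import numpy as np

def spl_weights_by_confidence(confidences, lam):
    """基于置信度的硬最小化函数(示意):置信度不低于阈值 lam 的样本视为"简单样本",
    权重为 1 参与训练,其余样本权重为 0。"""
    return (confidences >= lam).astype(np.float32)

def lam_schedule(epoch, total_epochs, lam_start=0.8, lam_end=0.1):
    """阈值随训练线性衰减(假设的调度方式),逐步把更难的样本纳入训练。"""
    ratio = min(epoch / max(total_epochs - 1, 1), 1.0)
    return lam_start + (lam_end - lam_start) * ratio

conf = np.array([0.95, 0.6, 0.3, 0.85])
for epoch in [0, 5, 9]:
    lam = lam_schedule(epoch, 10)
    print(epoch, round(lam, 2), spl_weights_by_confidence(conf, lam))
```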
zh
[CV-68] 4D Gaussian Splatting with Scale-aware Residual Field and Adaptive Optimization for Real-time Rendering of Temporally Complex Dynamic Scenes
【速读】: 该论文试图解决动态场景重建中的两个关键问题:渲染速度慢和处理时间复杂性(temporal complexities)。解决方案的关键在于提出了一种新的动态场景表示方法SaRO-GS,该方法通过采用基于高斯基元(Gaussian primitive)的表示,并在4D空间中优化高斯基元,结合3D高斯光栅化(3D Gaussian Splatting)实现实时渲染。此外,引入了一种尺度感知残差场(Scale-aware Residual Field),该场考虑了每个高斯基元的大小信息,并编码其残差特征,与高斯基元的自分裂行为相一致,从而有效处理动态场景中的时间复杂性。最后,通过自适应优化调度(Adaptive Optimization Schedule),根据高斯基元的时间特性分配不同的优化策略,加速动态区域的重建。
链接: https://arxiv.org/abs/2412.06299
作者: Jinbo Yan,Rui Peng,Luyang Tang,Ronggang Wang
关键词-EN: highly promising task, Reconstructing dynamic scenes, Reconstructing dynamic, multimedia domain, video sequences
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Reconstructing dynamic scenes from video sequences is a highly promising task in the multimedia domain. While previous methods have made progress, they often struggle with slow rendering and managing temporal complexities such as significant motion and object appearance/disappearance. In this paper, we propose SaRO-GS as a novel dynamic scene representation capable of achieving real-time rendering while effectively handling temporal complexities in dynamic scenes. To address the issue of slow rendering speed, we adopt a Gaussian primitive-based representation and optimize the Gaussians in 4D space, which facilitates real-time rendering with the assistance of 3D Gaussian Splatting. Additionally, to handle temporally complex dynamic scenes, we introduce a Scale-aware Residual Field. This field considers the size information of each Gaussian primitive while encoding its residual feature and aligns with the self-splitting behavior of Gaussian primitives. Furthermore, we propose an Adaptive Optimization Schedule, which assigns different optimization strategies to Gaussian primitives based on their distinct temporal properties, thereby expediting the reconstruction of dynamic regions. Through evaluations on monocular and multi-view datasets, our method has demonstrated state-of-the-art performance. Please see our project page at this https URL.
zh
[CV-69] See Further When Clear: Curriculum Consistency Model
【速读】: 该论文试图解决扩散模型和流匹配模型中不同时间步长学习复杂度不一致、导致学生模型性能不佳的问题。解决方案的关键是提出了课程一致性模型 (Curriculum Consistency Model, CCM),通过将每个时间步长的蒸馏过程视为一个课程,并引入基于峰值信噪比 (PSNR) 的度量来量化学习复杂度,确保在噪声强度较低时教师模型迭代更多步骤,从而稳定并平衡不同时间步长的学习复杂度。
链接: https://arxiv.org/abs/2412.06295
作者: Yunpeng Liu,Boxiao Liu,Yi Zhang,Xingzhong Hou,Guanglu Song,Yu Liu,Haihang You
关键词-EN: Curriculum Consistency Model, Significant advances, flow matching models, learning complexity, Stable Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Significant advances have been made in the sampling efficiency of diffusion models and flow matching models, driven by Consistency Distillation (CD), which trains a student model to mimic the output of a teacher model at a later timestep. However, we found that the learning complexity of the student model varies significantly across different timesteps, leading to suboptimal performance. To address this issue, we propose the Curriculum Consistency Model (CCM), which stabilizes and balances the learning complexity across timesteps. Specifically, we regard the distillation process at each timestep as a curriculum and introduce a metric based on Peak Signal-to-Noise Ratio (PSNR) to quantify the learning complexity of this curriculum, then ensure that the curriculum maintains consistent learning complexity across different timesteps by having the teacher model iterate more steps when the noise intensity is low. Our method achieves competitive single-step sampling Fréchet Inception Distance (FID) scores of 1.64 on CIFAR-10 and 2.18 on ImageNet 64x64. Furthermore, we have extended our method to large-scale text-to-image models and confirmed that it generalizes well to both diffusion models (Stable Diffusion XL) and flow matching models (Stable Diffusion 3). The generated samples demonstrate improved image-text alignment and semantic structure, since CCM enlarges the distillation step at large timesteps and reduces the accumulated error.
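CCM 用 PSNR 量化各时间步"课程"的学习难度,并据此让教师模型在低噪声时多迭代几步。下面是一个示意性草图(PyTorch):PSNR 的定义是标准的,但由 PSNR 映射到教师迭代步数的阈值与线性规则均为笔者假设,仅用于说明思路。

```python
import torch

def psnr(x, y, max_val=1.0):
    """峰值信噪比:PSNR = 10 * log10(MAX^2 / MSE)。"""
    mse = torch.mean((x - y) ** 2)
    return 10 * torch.log10(max_val ** 2 / (mse + 1e-12))

def teacher_steps_for_timestep(noisy_x, clean_x, base_steps=1, max_steps=4,
                               psnr_lo=10.0, psnr_hi=30.0):
    """示意:用当前时间步输入与干净图像之间的 PSNR 粗略刻画课程难度,
    噪声越弱(PSNR 越高)教师迭代越多步;阈值与步数映射均为假设。"""
    p = psnr(noisy_x, clean_x).item()
    ratio = min(max((p - psnr_lo) / (psnr_hi - psnr_lo), 0.0), 1.0)
    return int(round(base_steps + ratio * (max_steps - base_steps)))

clean = torch.rand(1, 3, 32, 32)
for sigma in [0.8, 0.3, 0.05]:
    noisy = (clean + sigma * torch.randn_like(clean)).clamp(0, 1)
    print(sigma, teacher_steps_for_timestep(noisy, clean))
```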
zh
[CV-70] Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness Uniqueness and Representativeness
【速读】: 该论文试图解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在指令微调过程中由于视觉指令数据集的快速扩展导致的计算成本过高和数据冗余问题。解决方案的关键在于提出了一个协作框架DataTailor,该框架通过信息性(informativeness)、唯一性(uniqueness)和代表性(representativeness)三个原则进行有效的数据选择。具体来说,一个有价值的样本应能反映任务信息、避免冗余,并且代表样本分布而非异常值。DataTailor通过为每个原则设计评分方法,自动适应给定数据集,无需繁琐的超参数调整,从而在显著减少计算成本的同时,保持甚至超越全数据微调的性能。
链接: https://arxiv.org/abs/2412.06293
作者: Qifan Yu,Zhebei Shen,Zhongqi Yue,Yang Wu,Wenqiao Zhang,Yunfei Li,Juncheng Li,Siliang Tang,Yueting Zhuang
关键词-EN: Large Language Models, Multi-modal Large Language, pre-trained Multi-modal Large, fine-tunes pre-trained Multi-modal, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles–informativeness, uniqueness, and representativeness–for effective data selection. We argue that a valuable sample should be informative of the task, non-redundant, and represent the sample distribution (i.e., not an outlier). We further propose practical ways to score against each principle, which automatically adapts to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the “Less is More” philosophy in MLLM development.
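DataTailor 的三条选择原则(信息性、唯一性、代表性)可以用一个很短的"打分后取 Top-K"流程来示意。下面的 NumPy 草图中,三项得分的具体计算方式与等权相加都是笔者的简化假设,并非论文的评分函数,仅用于说明整体流程。

```python
import numpy as np

def select_samples(embeddings, task_losses, keep_ratio=0.15):
    """按"信息性 / 唯一性 / 代表性"三项打分选样本(示意,打分方式为简化假设):
    - 信息性:用样本的任务损失近似,损失越大越有信息量;
    - 唯一性:与其余样本的平均余弦相似度越低越独特;
    - 代表性:与数据中心的相似度越高越具代表性(排除离群点)。"""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T                                              # 余弦相似度矩阵
    informativeness = (task_losses - task_losses.min()) / (np.ptp(task_losses) + 1e-8)
    uniqueness = 1.0 - (sim.sum(axis=1) - 1.0) / (len(emb) - 1)    # 去掉自身相似度
    center = emb.mean(axis=0)
    center /= np.linalg.norm(center)
    representativeness = emb @ center
    score = informativeness + uniqueness + representativeness      # 等权相加(假设)
    k = max(1, int(len(emb) * keep_ratio))
    return np.argsort(-score)[:k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
losses = rng.uniform(size=100)
print(select_samples(emb, losses))   # 选出约 15% 的样本下标
```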
zh
[CV-71] ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models
【速读】: 该论文试图解决3D形状关键点检测中的标注依赖问题,特别是传统方法需要大量标注数据和监督训练的局限性。解决方案的关键在于利用多模态大语言模型 (Multi-Modal Large Language Models, MLLMs) 中嵌入的丰富知识,首次展示了无需任何3D关键点标注或监督的情况下,通过像素级标注训练的MLLMs可以用于提取和命名3D模型上的显著关键点。这种方法不仅在标准基准测试中表现出与监督方法相当的性能,还为跨模态学习和3D计算机视觉任务提供了新的研究方向。
链接: https://arxiv.org/abs/2412.06292
作者: Bingchen Gong,Diego Gomez,Abdullah Hamdi,Abdelrahman Eldesokey,Ahmed Abdelreheem,Peter Wonka,Maks Ovsjanikov
关键词-EN: DINO or CLIP, Large Language Models, keypoint detection, Language Models, models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website is accessible at this https URL
点击查看摘要
Abstract:We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to supervised methods, despite not requiring any 3D keypoint annotations during training. Our results highlight the potential of integrating language models for localized 3D shape understanding. This work opens new avenues for cross-modal learning and underscores the effectiveness of MLLMs in contributing to 3D computer vision challenges.
zh
[CV-72] No Annotations for Object Detection in Art through Stable Diffusion WACV2025
【速读】: 该论文试图解决艺术图像中物体检测的挑战,特别是在标注这些图像时需要专业领域知识的难题。解决方案的关键是提出了NADA(no annotations for detection in art)管道,该管道利用扩散模型(diffusion models)的艺术相关知识,在不需要完整边界框监督的情况下进行绘画中的物体检测。NADA支持弱监督和零样本场景,并且不需要对其预训练组件进行微调,其核心组件包括基于大规模视觉-语言模型的类别提议器和基于Stable Diffusion的类别条件检测器。该方法在ArtDL 2.0和IconArt两个艺术数据集上进行了评估,表现优于先前的弱监督检测工作,并且是首个在艺术领域实现零样本物体检测的方法。
链接: https://arxiv.org/abs/2412.06286
作者: Patrick Ramos,Nicolas Gonthier,Selina Khan,Yuta Nakashima,Noa Garcia
关键词-EN: historical images compared, digital humanities, compared to humans, Object detection, valuable tool
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures, to be published in WACV 2025
点击查看摘要
Abstract:Object detection in art is a valuable tool for the digital humanities, as it allows for faster identification of objects in artistic and historical images compared to humans. However, annotating such images poses significant challenges due to the need for specialized domain expertise. We present NADA (no annotations for detection in art), a pipeline that leverages diffusion models’ art-related knowledge for object detection in paintings without the need for full bounding box supervision. Our method, which supports both weakly-supervised and zero-shot scenarios and does not require any fine-tuning of its pretrained components, consists of a class proposer based on large vision-language models and a class-conditioned detector based on Stable Diffusion. NADA is evaluated on two artwork datasets, ArtDL 2.0 and IconArt, outperforming prior work in weakly-supervised detection, while being the first work for zero-shot object detection in art. Code is available at this https URL
zh
[CV-73] Neural Garment Dynamic Super-Resolution
【速读】: 该论文试图解决高分辨率服装模拟在计算资源受限设备上的高效实现问题。解决方案的关键在于提出了一种轻量级的基于学习的方法,用于服装动态超分辨率(garment dynamic super-resolution)。该方法通过从低分辨率服装模拟和底层身体运动出发,利用网格图网络(mesh-graph-net)计算超分辨率特征,并结合服装与身体的交互信息。随后,通过超网络(hyper-net)构建每个粗网格三角形的隐式函数,用于预测细节皱纹残差。最终,通过将这些细节皱纹残差应用于修正后的粗服装形状,生成高分辨率的服装几何结构。该方法不仅在训练数据有限的情况下表现出强大的泛化能力,还能通过迭代预测实现帧间连续的精细皱纹细节生成,显著提升了低分辨率模拟的质量。
链接: https://arxiv.org/abs/2412.06285
作者: Meng Zhang,Jun Li
关键词-EN: Achieving efficient, garment, low-resolution garment simulation, garment simulation, coarse garment
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Achieving efficient, high-fidelity, high-resolution garment simulation is challenging due to its computational demands. Conversely, low-resolution garment simulation is more accessible and ideal for low-budget devices like smartphones. In this paper, we introduce a lightweight, learning-based method for garment dynamic super-resolution, designed to efficiently enhance high-resolution, high-frequency details in low-resolution garment simulations. Starting with low-resolution garment simulation and underlying body motion, we utilize a mesh-graph-net to compute super-resolution features based on coarse garment dynamics and garment-body interactions. These features are then used by a hyper-net to construct an implicit function of detailed wrinkle residuals for each coarse mesh triangle. Considering the influence of coarse garment shapes on detailed wrinkle performance, we correct the coarse garment shape and predict detailed wrinkle residuals using these implicit functions. Finally, we generate detailed high-resolution garment geometry by applying the detailed wrinkle residuals to the corrected coarse garment. Our method enables roll-out prediction by iteratively using its predictions as input for subsequent frames, producing fine-grained wrinkle details to enhance the low-resolution simulation. Despite training on a small dataset, our network robustly generalizes to different body shapes, motions, and garment types not present in the training data. We demonstrate significant improvements over state-of-the-art alternatives, particularly in enhancing the quality of high-frequency, fine-grained wrinkle details.
zh
[CV-74] Your Data Is Not Perfect: Towards Cross-Domain Out-of-Distribution Detection in Class-Imbalanced Data
【速读】: 该论文试图解决在类不平衡和跨域场景下的异常检测 (Out-of-Distribution Detection, OOD) 问题,即类不平衡跨域异常检测 (Class-imbalanced Cross-Domain OOD Detection, CCOD)。解决方案的关键在于提出了一种新颖的不确定性感知的自适应语义对齐网络 (Uncertainty-aware Adaptive Semantic Alignment, UASA)。该方法通过在源域中构建标签驱动的原型 (prototype) 来缩小域间差距 (domain gap),并利用这些原型进行目标分类。同时,UASA 采用自适应样本级阈值来处理语义差距 (semantic gap),并通过不确定性感知的聚类方法来缓解类不平衡差距 (class-imbalance gap)。实验结果表明,UASA 在多个具有挑战性的基准测试中显著优于现有最先进的方法。
链接: https://arxiv.org/abs/2412.06284
作者: Xiang Fang,Arvind Easwaran,Blaise Genest,Ponnuthurai Nagaratnam Suganthan
关键词-EN: Previous OOD detection, Previous OOD, OOD detection, OOD, semantic gap
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Expert Systems with Applications
点击查看摘要
Abstract:Previous OOD detection systems only focus on the semantic gap between ID and OOD samples. Besides the semantic gap, we are faced with two additional gaps: the domain gap between source and target domains, and the class-imbalance gap between different classes. In fact, similar objects from different domains should belong to the same class. In this paper, we introduce a realistic yet challenging setting: class-imbalanced cross-domain OOD detection (CCOD), which contains a well-labeled (but usually small) source set for training and conducts OOD detection on an unlabeled (but usually larger) target set for testing. We do not assume that the target domain contains only OOD classes or that it is class-balanced: the distribution among classes of the target dataset need not be the same as the source dataset. To tackle this challenging setting with an OOD detection system, we propose a novel uncertainty-aware adaptive semantic alignment (UASA) network based on a prototype-based alignment strategy. Specifically, we first build label-driven prototypes in the source domain and utilize these prototypes for target classification to close the domain gap. Rather than utilizing fixed thresholds for OOD detection, we generate adaptive sample-wise thresholds to handle the semantic gap. Finally, we conduct uncertainty-aware clustering to group semantically similar target samples to relieve the class-imbalance gap. Extensive experiments on three challenging benchmarks demonstrate that our proposed UASA outperforms state-of-the-art methods by a large margin.
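UASA 中"源域原型分类 + 样本级自适应阈值"的思路可以用下面的 PyTorch 草图来示意:原型取各类特征均值,OOD 判定采用"最高相似度与次高相似度之差小于 margin 即判为 OOD"这一简化规则;该规则与论文中自适应阈值的生成方式并不相同,仅作说明。

```python
import torch
import torch.nn.functional as F

def build_prototypes(features, labels, num_classes):
    """用源域有标注特征构建每类原型(类内特征均值,示意)。"""
    protos = torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

def classify_with_adaptive_threshold(target_feats, prototypes, margin=0.1):
    """原型分类 + 样本级判定(示意):取每个目标样本与各原型的余弦相似度,
    若最高分与次高分之差小于 margin(模型对该样本不确定),判为 OOD(-1)。"""
    sims = F.normalize(target_feats, dim=-1) @ prototypes.T          # (N, C)
    top2 = sims.topk(2, dim=-1).values
    preds = sims.argmax(dim=-1)
    preds[top2[:, 0] - top2[:, 1] < margin] = -1                     # 样本级 OOD 判定
    return preds

feats = torch.randn(20, 128)
labels = torch.arange(20) % 4          # 保证 4 个类别都有样本
protos = build_prototypes(feats, labels, 4)
print(classify_with_adaptive_threshold(torch.randn(5, 128), protos))
```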
zh
[CV-75] Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction
【速读】: 该论文试图解决在自动驾驶场景中,基于像素的 Gaussian 表示方法在稀疏视角重建时面临的挑战,特别是由于视角重叠不足、物体遮挡和视锥截断等问题导致的深度估计不准确。解决方案的关键在于引入 Omni-Gaussian 表示,并结合定制的网络设计,以弥补传统像素表示的不足,从而在 ego-centric(自中心)重建中实现显著的性能提升,同时在 scene-centric(场景中心)重建中保持与现有方法相当的性能。此外,论文还通过结合扩散模型,开创了前馈式多模态生成 3D 驾驶场景的新方法。
链接: https://arxiv.org/abs/2412.06273
作者: Dongxu Wei,Zhiqi Li,Peidong Liu
关键词-EN: employing pixel-based Gaussian, pixel-based Gaussian representation, Gaussian representation, feed-forward sparse-view reconstruction, pixel-based Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Under Review
点击查看摘要
Abstract:Prior works employing pixel-based Gaussian representation have demonstrated efficacy in feed-forward sparse-view reconstruction. However, such representation necessitates cross-view overlap for accurate depth estimation, and is challenged by object occlusions and frustum truncations. As a result, these methods require scene-centric data acquisition to maintain cross-view overlap and complete scene visibility to circumvent occlusions and truncations, which limits their applicability to scene-centric reconstruction. In contrast, in autonomous driving scenarios, a more practical paradigm is ego-centric reconstruction, which is characterized by minimal cross-view overlap and frequent occlusions and truncations. The limitations of pixel-based representation thus hinder the utility of prior works in this task. In light of this, this paper conducts an in-depth analysis of different representations, and introduces Omni-Gaussian representation with tailored network design to complement their strengths and mitigate their drawbacks. Experiments show that our method significantly surpasses state-of-the-art methods, pixelSplat and MVSplat, in ego-centric reconstruction, and achieves comparable performance to prior works in scene-centric reconstruction. Furthermore, we extend our method with diffusion models, pioneering feed-forward multi-modal generation of 3D driving scenes.
zh
[CV-76] Open-Vocabulary High-Resolution 3D (OVHR3D) Data Segmentation and Annotation Framework
【速读】: 该论文试图解决3D数据标注的高成本和低效率问题,特别是在军事模拟训练中的应用。解决方案的关键在于设计并开发一个高效的3D分割框架,该框架集成了Grounding DINO和Segment Anything Model,并通过3D网格增强的2D图像渲染来提升标注效率。此外,论文还开发了一个用户友好的界面,提供直观的渲染图像和3D点云可视化,以简化3D数据标注过程。
链接: https://arxiv.org/abs/2412.06268
作者: Jiuyi Xu,Meida Chen,Andrew Feng,Yangming Shi,Zifan Yu
关键词-EN: creating virtual environments, high quality annotated, Army modeling, modeling and simulation, availability of high
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In the domain of the U.S. Army modeling and simulation, the availability of high quality annotated 3D data is pivotal to creating virtual environments for training and simulations. Traditional methodologies for 3D semantic and instance segmentation, such as KpConv, RandLA, Mask3D, etc., are designed to train on extensive labeled datasets to obtain satisfactory performance in practical tasks. This requirement presents a significant challenge, given the inherent scarcity of manually annotated 3D datasets, particularly for the military use cases. Recognizing this gap, our previous research leverages the One World Terrain data repository manually annotated databases, as showcased at IITSEC 2019 and 2021, to enrich the training dataset for deep learning models. However, collecting and annotating large scale 3D data for specific tasks remains costly and inefficient. To this end, the objective of this research is to design and develop a comprehensive and efficient framework for 3D segmentation tasks to assist in 3D data annotation. This framework integrates Grounding DINO and Segment anything Model, augmented by an enhancement in 2D image rendering via 3D mesh. Furthermore, the authors have also developed a user friendly interface that facilitates the 3D annotation process, offering intuitive visualization of rendered images and the 3D point cloud.
zh
[CV-77] LLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
【速读】: 该论文试图解决在大规模视觉-语言模型(Large Vision-Language Models, LVLMs)中提高推理吞吐量的问题,同时尽量减少性能损失。解决方案的关键在于两个设计:首先,iLLaVA不仅加速了大规模语言模型(Large Language Models, LLMs)的前向传播,还加速了图像编码器的前向传播,这两者在推理过程中占据了大量时间;其次,iLLaVA通过精确且快速的算法逐步合并冗余的tokens,并将被修剪tokens中的有益信息回收至现有tokens中,避免了直接丢弃上下文tokens导致的性能损失。这种方法在几乎不损失模型性能的情况下,将吞吐量提升了近2倍,并将内存成本降低了一半。
链接: https://arxiv.org/abs/2412.06263
作者: Lianyu Hu,Fanhua Shang,Liang Wan,Wei Feng
关键词-EN: current Large Vision-Language, Large Vision-Language Models, Large Language Models, requirement to train, lossless model performance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we introduce iLLaVA, a simple method that can be seamlessly deployed upon current Large Vision-Language Models (LVLMs) to greatly increase the throughput with nearly lossless model performance, without a further requirement to train. iLLaVA achieves this by finding and gradually merging the redundant tokens with an accurate and fast algorithm, which can merge hundreds of tokens within only one step. While some previous methods have explored directly pruning or merging tokens in the inference stage to accelerate models, our method excels in both performance and throughput by two key designs. First, while most previous methods only try to save the computations of Large Language Models (LLMs), our method accelerates the forward pass of both image encoders and LLMs in LVLMs, which both occupy a significant part of time during inference. Second, our method recycles the beneficial information from the pruned tokens into existing tokens, which avoids directly dropping context tokens like previous methods to cause performance loss. iLLaVA can nearly 2 \times the throughput, and reduce the memory costs by half with only a 0.2% - 0.5% performance drop across models of different scales including 7B, 13B and 34B. On tasks across different domains including single-image, multi-images and videos, iLLaVA demonstrates strong generalizability with consistently promising efficiency. We finally offer abundant visualizations to show the merging processes of iLLaVA in each step, which show insights into the distribution of computing resources in LVLMs. Code is available at this https URL.
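iLLaVA 的核心是在推理中合并冗余的视觉 token,并把被剪枝 token 的信息"回收"进保留 token。下面是一个示意性的 PyTorch 草图:用与全局均值的相似度选出保留 token,把其余 token 归并到最相近的保留 token 上取均值;打分与归并规则均为笔者的简化假设,并非论文的合并算法。

```python
import torch
import torch.nn.functional as F

def merge_redundant_tokens(tokens, keep_ratio=0.3):
    """token 合并(示意):冗余 token 不直接丢弃,而是归并到最相近的保留 token 上取均值,
    从而把被剪枝 token 的信息回收进保留 token。"""
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    norm = F.normalize(tokens, dim=-1)
    saliency = (norm * F.normalize(tokens.mean(1, keepdim=True), dim=-1)).sum(-1)
    keep_idx = saliency.topk(k, dim=1).indices                       # (B, k) 被保留的 token

    merged = []
    for b in range(B):
        keep = tokens[b, keep_idx[b]]                                # (k, C)
        drop_mask = torch.ones(N, dtype=torch.bool)
        drop_mask[keep_idx[b]] = False
        drop = tokens[b, drop_mask]                                  # (N-k, C) 冗余 token
        assign = (F.normalize(drop, dim=-1) @ F.normalize(keep, dim=-1).T).argmax(dim=1)
        out, cnt = keep.clone(), torch.ones(k, 1)
        out.index_add_(0, assign, drop)                              # 冗余 token 累加到最近的保留 token
        cnt.index_add_(0, assign, torch.ones(len(drop), 1))
        merged.append(out / cnt)                                     # 取均值完成信息回收
    return torch.stack(merged)

tokens = torch.randn(2, 100, 64)
print(merge_redundant_tokens(tokens).shape)   # torch.Size([2, 30, 64])
```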
zh
[CV-78] A Lightweight U-like Network Utilizing Neural Memory Ordinary Differential Equations for Slimming the Decoder
【速读】: 该论文试图解决现有U型网络在医学图像分割任务中存在的参数过多、计算复杂度高和推理速度慢的问题,特别是在计算资源有限的场景中。解决方案的关键在于提出了三种即插即用的解码器,通过采用神经记忆常微分方程(nmODEs)的不同离散化方法来实现。这些解码器通过处理来自跳跃连接的信息并执行数值运算,能够在保持性能的同时显著减少参数数量和浮点运算次数(FLOPs)。实验结果表明,这些解码器可以减少约20% ~ 50%的参数和高达74%的FLOPs,并且具有适应所有U型网络的潜力。
链接: https://arxiv.org/abs/2412.06262
作者: Quansong He,Xiaojun Yao,Jun Wu,Zhang Yi,Tao He
关键词-EN: image segmentation tasks, medical image segmentation, advanced U-like networks, demonstrated remarkable performance, U-like networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:In recent years, advanced U-like networks have demonstrated remarkable performance in medical image segmentation tasks. However, their drawbacks, including excessive parameters, high computational complexity, and slow inference speed, pose challenges for practical implementation in scenarios with limited computational resources. Existing lightweight U-like networks have alleviated some of these problems, but they often have pre-designed structures and consist of inseparable modules, limiting their application scenarios. In this paper, we propose three plug-and-play decoders by employing different discretization methods of the neural memory Ordinary Differential Equations (nmODEs). These decoders integrate features at various levels of abstraction by processing information from skip connections and performing numerical operations on the upward path. Through experiments on the PH2, ISIC2017, and ISIC2018 datasets, we embed these decoders into different U-like networks, demonstrating their effectiveness in significantly reducing the number of parameters and FLOPs while maintaining performance. In summary, the proposed discretized nmODEs decoders are capable of reducing the number of parameters by about 20% ~ 50% and FLOPs by up to 74%, while possessing the potential to adapt to all U-like networks. Our code is available at this https URL.
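作为理解上的参考,下面给出一个用前向欧拉法离散化 nmODE 风格更新的最小示意模块。方程形式 dy/dt = -y + sin²(y + x) 参考 nmODE 相关文献,步长、步数与接入 U 型网络的方式均为本文假设,并非论文中三种解码器的确切设计。

```python
# 示意性代码:欧拉离散化的 nmODE 风格解码块(几乎没有可学习参数)
import torch
import torch.nn as nn

class EulerNmODEBlock(nn.Module):
    """把跳跃连接特征 x 作为外部输入,对隐状态 y 做若干步前向欧拉积分。"""
    def __init__(self, steps: int = 4, step_size: float = 0.5):
        super().__init__()
        self.steps, self.h = steps, step_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.zeros_like(x)            # 隐状态从 0 出发
        for _ in range(self.steps):
            # 前向欧拉:y <- y + h * (-y + sin^2(y + x))
            y = y + self.h * (-y + torch.sin(y + x) ** 2)
        return y

# 用法示例:作为 U 型网络上行路径中的一个无参数解码块
feat = torch.randn(2, 64, 32, 32)          # 来自跳跃连接的特征
print(EulerNmODEBlock()(feat).shape)       # torch.Size([2, 64, 32, 32])
```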
zh
[CV-79] Enhanced Multi-Object Tracking Using Pose-based Virtual Markers in 3x3 Basketball
【速读】: 该论文试图解决团队运动(如篮球)中多目标跟踪 (Multi-object tracking, MOT) 的挑战,特别是由于球员不可预测的运动、频繁的近距离互动、视觉相似性导致的姿态标注困难以及由此产生的遮挡、频繁的ID切换和高昂的手动标注成本。解决方案的关键在于提出了一种基于姿态的虚拟标记 (Virtual Marker, VM) 方法,称为Sports-vmTracking。该方法通过构建篮球姿态数据集并应用主动学习来增强模型生成虚拟标记的能力,从而在视频中识别球员并提取其姿态信息,转换为边界框进行跟踪。实验结果表明,该方法在保持高准确性的同时,显著减少了手动校正和标注的需求,提高了时间和成本效率,并在HOTA评分上比其他无虚拟标记的先进方法高出10个百分点,且实现了零ID切换。
链接: https://arxiv.org/abs/2412.06258
作者: Li Yin,Calvin Yeung,Qingrui Hu,Jun Ichikawa,Hirotsugu Azechi,Susumu Takahashi,Keisuke Fujii
关键词-EN: team sports tactics, evaluating team sports, Multi-object tracking, team sports, multi-agent analyses
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-object tracking (MOT) is crucial for various multi-agent analyses such as evaluating team sports tactics and player movements and performance. While pedestrian tracking has advanced with Tracking-by-Detection MOT, team sports like basketball pose unique challenges. These challenges include players’ unpredictable movements, frequent close interactions, and visual similarities that complicate pose labeling and lead to significant occlusions, frequent ID switches, and high manual annotation costs. To address these challenges, we propose a novel pose-based virtual marker (VM) MOT method for team sports, named Sports-vmTracking. This method builds on the vmTracking approach developed for multi-animal tracking with active learning. First, we constructed a 3x3 basketball pose dataset for VMs and applied active learning to enhance model performance in generating VMs. Then, we overlaid the VMs on video to identify players, extract their poses with unique IDs, and convert these into bounding boxes for comparison with automated MOT methods. Using our 3x3 basketball dataset, we demonstrated that our VM configuration has been highly effective, and reduced the need for manual corrections and labeling during pose model training while maintaining high accuracy. Our approach achieved an average HOTA score of 72.3%, over 10 points higher than other state-of-the-art methods without VM, and resulted in 0 ID switches. Beyond improving performance in handling occlusions and minimizing ID switches, our framework could substantially increase the time and cost efficiency compared to traditional manual annotation.
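论文将带唯一 ID 的球员姿态转换为包围框,以便与自动化 MOT 方法对比评测;下面是这一“姿态转包围框”步骤的示意实现。关键点格式 (x, y, 置信度)、置信度阈值与外扩比例均为本文假设。

```python
# 示意性代码:把每名球员的 2D 姿态关键点转换为包围框
import numpy as np

def pose_to_bbox(keypoints: np.ndarray, conf_thresh: float = 0.3, margin: float = 0.05):
    """keypoints: (K, 3) 数组,每行为 (x, y, 置信度);
    返回 (x1, y1, x2, y2),若有效关键点不足则返回 None。"""
    valid = keypoints[keypoints[:, 2] >= conf_thresh]
    if len(valid) < 2:
        return None
    x1, y1 = valid[:, 0].min(), valid[:, 1].min()
    x2, y2 = valid[:, 0].max(), valid[:, 1].max()
    # 适当外扩,补偿关键点未覆盖到的身体轮廓
    w, h = x2 - x1, y2 - y1
    return (x1 - margin * w, y1 - margin * h, x2 + margin * w, y2 + margin * h)

# 用法示例:三个关键点(含一个低置信度点,会被过滤)
kpts = np.array([[100, 200, 0.9], [140, 380, 0.8], [90, 150, 0.1]])
print(pose_to_bbox(kpts))
```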
zh
[CV-80] Advancing Extended Reality with 3D Gaussian Splatting: Innovations and Prospects
【速读】: 该论文试图解决3D Gaussian Splatting (3DGS) 在扩展现实 (Extended Reality, XR) 领域的应用研究不足的问题。解决方案的关键在于通过综合分析现有3DGS文献,特别是那些涉及XR相关概念的研究,提出一个分类体系来识别和强调3DGS在XR中的创新应用。基于这些分析,论文进一步提出了几个有前景的XR研究方向,这些方向目前尚未得到充分探索,但可以通过前沿的3DGS技术推动XR领域的发展。
链接: https://arxiv.org/abs/2412.06257
作者: Shi Qiu,Binzhu Xie,Qixuan Liu,Pheng-Ann Heng
关键词-EN: Gaussian Splatting, attracted significant attention, attracted significant, significant attention, Extended Reality
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: IEEE AIxVR 2025
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has attracted significant attention for its potential to revolutionize 3D representation, rendering, and interaction. Despite the rapid growth of 3DGS research, its direct application to Extended Reality (XR) remains underexplored. Although many studies recognize the potential of 3DGS for XR, few have explicitly focused on or demonstrated its effectiveness within XR environments. In this paper, we aim to synthesize innovations in 3DGS that show specific potential for advancing XR research and development. We conduct a comprehensive review of publicly available 3DGS papers, with a focus on those referencing XR-related concepts. Additionally, we perform an in-depth analysis of innovations explicitly relevant to XR and propose a taxonomy to highlight their significance. Building on these insights, we propose several prospective XR research areas where 3DGS can make promising contributions, yet remain rarely touched. By investigating the intersection of 3DGS and XR, this paper provides a roadmap to push the boundaries of XR using cutting-edge 3DGS techniques.
zh
[CV-81] Splatter-360: Generalizable 360° Gaussian Splatting for Wide-baseline Panoramic Images
【速读】: 该论文试图解决从宽基线全景图像中实时合成新视图的挑战,特别是在处理高分辨率和固有畸变的全景图像时。解决方案的关键在于提出了一个名为 Splatter-360 的端到端可泛化 3D Gaussian Splatting (3DGS) 框架。该框架通过在球面域中直接进行多视图匹配,利用球面扫描算法构建球面代价体积,从而增强网络的深度感知和几何估计能力。此外,引入了一种 3D 感知的双向投影编码器来缓解全景图像的固有畸变,并结合跨视图注意力机制以改善多视点间的特征交互,从而实现鲁棒的 3D 感知特征表示和实时渲染能力。
链接: https://arxiv.org/abs/2412.06250
作者: Zheng Chen,Chenming Wu,Zhelun Shen,Chen Zhao,Weicai Ye,Haocheng Feng,Errui Ding,Song-Hai Zhang
关键词-EN: Wide-baseline panoramic images, minimize capturing labor, panoramic images, Wide-baseline panoramic, capturing labor costs
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL . Code: this https URL
点击查看摘要
Abstract:Wide-baseline panoramic images are frequently used in applications like VR and simulations to minimize capturing labor costs and storage needs. However, synthesizing novel views from these panoramic images in real time remains a significant challenge, especially due to panoramic imagery's high resolution and inherent distortions. Although existing 3D Gaussian splatting (3DGS) methods can produce photo-realistic views under narrow baselines, they often overfit the training views when dealing with wide-baseline panoramic images due to the difficulty in learning precise geometry from sparse 360° views. This paper presents Splatter-360, a novel end-to-end generalizable 3DGS framework designed to handle wide-baseline panoramic images. Unlike previous approaches, Splatter-360 performs multi-view matching directly in the spherical domain by constructing a spherical cost volume through a spherical sweep algorithm, enhancing the network's depth perception and geometry estimation. Additionally, we introduce a 3D-aware bi-projection encoder to mitigate the distortions inherent in panoramic images and integrate cross-view attention to improve feature interactions across multiple viewpoints. This enables robust 3D-aware feature representations and real-time rendering capabilities. Experimental results on the HM3D and Replica datasets demonstrate that Splatter-360 significantly outperforms state-of-the-art NeRF and 3DGS methods (e.g., PanoGRF, MVSplat, DepthSplat, and HiSplat) in both synthesis quality and generalization performance for wide-baseline panoramic images. Code and trained models are available at this https URL.
zh
[CV-82] Rendering-Refined Stable Diffusion for Privacy Compliant Synthetic Data
【速读】: 该论文试图解决图像数据集中隐私保护与数据实用性之间的平衡问题。传统方法如遮挡和模糊处理虽然保护了隐私,但降低了图像质量和关键信息的可用性,尤其是在以人为中心的图像中。论文提出的解决方案是引入Rendering-Refined Stable Diffusion (RefSD),这是一个结合了3D渲染与Stable Diffusion的流程,能够在保留人体姿态的同时,通过提示控制人物属性,实现更真实且可定制的伪匿名化。与标准扩散模型和GAN相比,RefSD在姿态保留、真实性和属性控制方面实现了更好的平衡。此外,论文还提出了HumanGenAI框架,用于评估伪匿名化数据的人类感知和实用性,实验结果表明,基于RefSD训练的模型在检测任务中优于基于真实数据训练的模型,并且在分类任务中与真实数据结合使用时表现出持续的性能提升。
链接: https://arxiv.org/abs/2412.06248
作者: Kartik Patwari,David Schneider,Xiaoxiao Sun,Chen-Nee Chuah,Lingjuan Lyu,Vivek Sharma
关键词-EN: Growing privacy concerns, CCPA necessitate pseudonymization, GDPR and CCPA, necessitate pseudonymization techniques, Growing privacy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Growing privacy concerns and regulations like GDPR and CCPA necessitate pseudonymization techniques that protect identity in image datasets. However, retaining utility is also essential. Traditional methods like masking and blurring degrade quality and obscure critical context, especially in human-centric images. We introduce Rendering-Refined Stable Diffusion (RefSD), a pipeline that combines 3D-rendering with Stable Diffusion, enabling prompt-based control over human attributes while preserving posture. Unlike standard diffusion models that fail to retain posture or GANs that lack realism and flexible attribute control, RefSD balances posture preservation, realism, and customization. We also propose HumanGenAI, a framework for human perception and utility evaluation. Human perception assessments reveal attribute-specific strengths and weaknesses of RefSD. Our utility experiments show that models trained on RefSD pseudonymized data outperform those trained on real data in detection tasks, with further performance gains when combining RefSD with real data. For classification tasks, we consistently observe performance improvements when using RefSD data with real data, confirming the utility of our pseudonymized data.
zh
[CV-83] DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction
【速读】: 该论文试图解决预训练视觉-语言模型(VLMs)在密集预测任务中表现不佳的问题,特别是由于“前景偏差”(foreground bias)导致的背景区域被错误识别为前景对象的现象。解决方案的关键在于提出了DenseVLM框架,通过利用预训练的VLM来检索未标记区域的类别,从而有效解耦前景和背景区域特征之间的干扰,确保每个区域与其对应类别准确对齐。这一方法能够无缝集成到开放词汇对象检测和图像分割任务中,显著提升性能,并在更广泛和多样化的数据集上展现出良好的零样本扩展能力。
链接: https://arxiv.org/abs/2412.06244
作者: Yunheng Li,Yuxuan Li,Quansheng Zeng,Wenhai Wang,Qibin Hou,Ming-Ming Cheng
关键词-EN: dense prediction tasks, zero-shot recognition capability, demonstrated impressive zero-shot, impressive zero-shot recognition, recognition capability
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant 'foreground bias', where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. By leveraging the pre-trained VLM to retrieve categories for unlabeled regions, DenseVLM effectively decouples the interference between foreground and background region features, ensuring that each region is accurately aligned with its corresponding category. We show that DenseVLM can be seamlessly integrated into open-vocabulary object detection and image segmentation tasks, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets.
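下面用一个简化示例说明“利用预训练 VLM 为未标注区域检索类别”的基本做法:区域视觉特征与类别文本特征做余弦相似度,取最大值作为该区域的类别。示例中编码器用随机张量占位,实际应替换为 CLIP 等 VLM 的区域特征与文本嵌入;温度等超参数为假设值,并非 DenseVLM 的原始实现。

```python
# 示意性代码:基于余弦相似度的区域-类别检索
import torch
import torch.nn.functional as F

def retrieve_region_categories(region_feats, text_feats, temperature: float = 0.01):
    """region_feats: (R, D) 区域视觉特征;text_feats: (C, D) 类别文本特征。
    返回每个区域的类别索引及对应置信度。"""
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    probs = (r @ t.T / temperature).softmax(dim=-1)   # (R, C) 余弦相似度经温度缩放后归一化
    conf, labels = probs.max(dim=-1)
    return labels, conf

# 用法示例:8 个候选区域、20 个类别(类别表中可显式包含 "background" 一项)
region_feats = torch.randn(8, 512)    # 实际中应为 VLM 提取的区域特征
text_feats = torch.randn(20, 512)     # 实际中应为类别名经文本编码器得到的嵌入
labels, conf = retrieve_region_categories(region_feats, text_feats)
print(labels.tolist())
print([round(float(c), 3) for c in conf])
```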
zh
[CV-84] U-Know-DiffPAN: An Uncertainty-aware Knowledge Distillation Diffusion Framework with Details Enhancement for PAN-Sharpening
【速读】: 该论文试图解决传统全色锐化方法在恢复精细细节方面的不足,特别是由于难以有效利用高频信息以及基于扩散的方法缺乏足够的条件来充分利用全色(PAN)图像和低分辨率多光谱(LRMS)输入的问题。解决方案的关键在于提出了一个不确定性感知的知识蒸馏扩散框架(U-Know-DiffPAN),通过频率选择性注意力机制捕获频率细节,并利用不确定性图引导轻量级学生模型关注难以锐化的图像区域。该框架通过在编码器中条件化PAN和LRMS的紧凑向量表示,并在解码器中使用小波变换,实现了丰富的频率利用,从而提升了全色锐化的性能。
链接: https://arxiv.org/abs/2412.06243
作者: Sungpyo Kim,Jeonghyeok Do,Jaehyup Lee,Munchurl Kim
关键词-EN: leveraging high-frequency information, fine details due, restore fine details, high-frequency information, Conventional methods
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Please visit our project page at this https URL
点击查看摘要
Abstract:Conventional methods for PAN-sharpening often struggle to restore fine details due to limitations in leveraging high-frequency information. Moreover, diffusion-based approaches lack sufficient conditioning to fully utilize Panchromatic (PAN) images and low-resolution multispectral (LRMS) inputs effectively. To address these challenges, we propose an uncertainty-aware knowledge distillation diffusion framework with details enhancement for PAN-sharpening, called U-Know-DiffPAN. The U-Know-DiffPAN incorporates uncertainty-aware knowledge distillation for effective transfer of feature details from our teacher model to a student one. The teacher model in our U-Know-DiffPAN captures frequency details through frequency selective attention, facilitating accurate reverse process learning. By conditioning the encoder on compact vector representations of PAN and LRMS and the decoder on Wavelet transforms, we enable rich frequency utilization. In this way, the high-capacity teacher model distills frequency-rich features into a lightweight student model aided by an uncertainty map. The uncertainty map allows the teacher model to guide the student model to focus on image regions that are difficult to sharpen. Extensive experiments on diverse datasets demonstrate the robustness and superior performance of our U-Know-DiffPAN over very recent state-of-the-art PAN-sharpening methods.
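下面给出“用不确定性图对蒸馏损失做逐像素加权”的最小示意,说明教师模型如何引导学生关注难以锐化的区域。权重形式 (1 + u) 为本文假设,并非论文中的确切公式。

```python
# 示意性代码:不确定性加权的特征蒸馏损失
import torch
import torch.nn.functional as F

def uncertainty_weighted_distill(student_feat, teacher_feat, uncertainty):
    """student_feat / teacher_feat: (B, C, H, W);
    uncertainty: (B, 1, H, W),取值越大表示该像素越难、越不确定。"""
    per_pixel = F.mse_loss(student_feat, teacher_feat, reduction="none").mean(dim=1, keepdim=True)
    weight = 1.0 + uncertainty            # 不确定(难锐化)区域获得更大权重
    return (weight * per_pixel).mean()

# 用法示例
s, t = torch.randn(2, 32, 64, 64), torch.randn(2, 32, 64, 64)
u = torch.rand(2, 1, 64, 64)
print(uncertainty_weighted_distill(s, t, u))
```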
zh
[CV-85] VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition
【速读】: 该论文试图解决使用大规模网络爬取数据集训练人脸识别模型时所面临的隐私和偏见问题。解决方案的关键在于提出了一种名为VariFace的两阶段扩散模型管道,用于创建公平且多样化的人工合成人脸数据集。具体来说,VariFace引入了三种方法:人脸识别一致性(Face Recognition Consistency)用于优化人口统计标签,人脸Vendi评分引导(Face Vendi Score Guidance)以提高类间多样性,以及分歧评分条件(Divergence Score Conditioning)来平衡身份保持与类内多样性之间的权衡。这些方法使得VariFace在相同数据集规模下显著优于以往的合成数据集,并在不受限制的情况下首次在多个评估数据集上超越了真实数据集的性能,达到了新的最先进水平。
链接: https://arxiv.org/abs/2412.06235
作者: Michael Yeung,Toya Teramoto,Songtao Wu,Tatsuo Fujiwara,Kenji Suzuki,Tamaki Kojima
关键词-EN: raised significant privacy, face recognition, face recognition models, train face recognition, Real Gap
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The use of large-scale, web-scraped datasets to train face recognition models has raised significant privacy and bias concerns. Synthetic methods mitigate these concerns and provide scalable and controllable face generation to enable fair and accurate face recognition. However, existing synthetic datasets display limited intraclass and interclass diversity and do not match the face recognition performance obtained using real datasets. Here, we propose VariFace, a two-stage diffusion-based pipeline to create fair and diverse synthetic face datasets to train face recognition models. Specifically, we introduce three methods: Face Recognition Consistency to refine demographic labels, Face Vendi Score Guidance to improve interclass diversity, and Divergence Score Conditioning to balance the identity preservation-intraclass diversity trade-off. When constrained to the same dataset size, VariFace considerably outperforms previous synthetic datasets (0.9200 → 0.9405) and achieves comparable performance to face recognition models trained with real data (Real Gap = -0.0065). In an unconstrained setting, VariFace not only consistently achieves better performance compared to previous synthetic methods across dataset sizes but also, for the first time, outperforms the real dataset (CASIA-WebFace) across six evaluation datasets. This sets a new state-of-the-art performance with an average face verification accuracy of 0.9567 (Real Gap = +0.0097) across LFW, CFP-FP, CPLFW, AgeDB, and CALFW datasets and 0.9366 (Real Gap = +0.0380) on the RFW dataset.
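论文用 Face Vendi Score 引导生成以提升类间多样性;下面示意 Vendi Score 本身的计算方式(相似度核特征值熵的指数),帮助理解该分数衡量多样性的方式。示例中相似度核取余弦相似度,属于本文假设,引导扩散采样的部分未包含在内。

```python
# 示意性代码:计算一组嵌入的 Vendi Score 作为多样性度量
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """embeddings: (n, d)。返回 exp(相似度核 K/n 特征值的香农熵),取值范围约为 [1, n]。"""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = x @ x.T                            # 余弦相似度核,对角线为 1
    lam = np.linalg.eigvalsh(k / len(x))   # 特征值非负且和为 1
    lam = lam[lam > 1e-12]
    return float(np.exp(-np.sum(lam * np.log(lam))))

# 用法示例:多样性高的嵌入得分更接近样本数 n,塌缩的嵌入得分接近 1
diverse = np.random.randn(64, 128)
collapsed = np.tile(np.random.randn(1, 128), (64, 1)) + 1e-3 * np.random.randn(64, 128)
print(vendi_score(diverse), vendi_score(collapsed))
```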
zh
[CV-86] Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
【速读】: 该论文试图解决传统前馈高斯模型在稀疏视角三维重建中难以表示高频细节的问题。解决方案的关键在于提出了一种名为“生成式稠密化 (Generative Densification)”的方法,通过从前馈模型中上采样特征表示,并在单次前向传播中生成对应的精细高斯分布,从而利用嵌入的先验知识增强模型的泛化能力。与传统的3D高斯喷射 (3D-GS) 稠密化策略不同,该方法避免了迭代分裂和克隆原始高斯参数的过程,显著提升了在物体级和场景级重建任务中对细节的表示能力。
链接: https://arxiv.org/abs/2412.06234
作者: Seungtae Nam,Xiangyu Sun,Gyeongjin Kang,Younggeun Lee,Seungjun Oh,Eunbyung Park
关键词-EN: large multi-view datasets, achieved significant progress, progress in sparse-view, multi-view datasets, achieved significant
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Generalized feed-forward Gaussian models have achieved significant progress in sparse-view 3D reconstruction by leveraging prior knowledge from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be adapted to the feed-forward models, it may not be ideally suited for generalized scenarios. In this paper, we propose Generative Densification, an efficient and generalizable method to densify Gaussians generated by feed-forward models. Unlike the 3D-GS densification strategy, which iteratively splits and clones raw Gaussian parameters, our method up-samples feature representations from the feed-forward models and generates their corresponding fine Gaussians in a single forward pass, leveraging the embedded prior knowledge for enhanced generalization. Experimental results on both object-level and scene-level reconstruction tasks demonstrate that our method outperforms state-of-the-art approaches with comparable or smaller model sizes, achieving notable improvements in representing fine details.
zh
[CV-87] Attention-Enhanced Lightweight Hourglass Network for Human Pose Estimation
【速读】: 该论文试图解决现有姿态估计方法计算复杂度高和模型架构复杂的问题。解决方案的关键在于提出了一种轻量级的基于注意力机制的姿态估计网络,该网络采用了深度可分离卷积 (depthwise separable convolution) 和卷积块注意力模块 (Convolutional Block Attention Module),并结合了沙漏网络 (hourglass backbone)。通过这些技术,模型显著降低了计算复杂度(浮点运算次数)和模型大小(参数量),仅包含原始八层沙漏网络约10%的参数量,同时在COCO和MPII数据集上实现了与六种其他轻量级姿态估计模型相比具有竞争力的平均精度(72.07),参数量仅为2.3M,浮点运算次数为3.7G FLOPs。
链接: https://arxiv.org/abs/2412.06227
作者: Marsha Mariya Kappan,Eduardo Benitez Sandoval,Erik Meijering,Francisco Cruz
关键词-EN: human-robot interaction, critical task, task in computer, computer vision, wide range
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pose estimation is a critical task in computer vision with a wide range of applications from activity monitoring to human-robot interaction. However, most of the existing methods are computationally expensive or have complex architectures. Here we propose a lightweight attention-based pose estimation network that utilizes depthwise separable convolution and the Convolutional Block Attention Module on an hourglass backbone. The network significantly reduces the computational complexity (floating point operations) and the model size (number of parameters), containing only about 10% of the parameters of the original eight-stack Hourglass network. Experiments were conducted on the COCO and MPII datasets using a two-stack hourglass backbone. The results showed that our model performs well in comparison to six other lightweight pose estimation models with an average precision of 72.07. The model achieves this performance with only 2.3M parameters and 3.7G FLOPs.
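下面的示意代码对比了标准 3×3 卷积与深度可分离卷积的参数量,并给出一个简化的 CBAM 通道注意力分支,说明这种组合为何能显著压缩模型;通道数、压缩比等均为假设值,并非论文原始网络结构。

```python
# 示意性代码:深度可分离卷积 + 简化的 CBAM 通道注意力
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)  # 逐通道空间卷积
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)                               # 1x1 卷积融合通道

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class ChannelAttention(nn.Module):
    """CBAM 的通道分支:平均池化与最大池化共享一个小 MLP,再用 sigmoid 重标定通道。"""
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(), nn.Linear(ch // reduction, ch))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

# 参数量对比:标准 3x3 卷积 vs 深度可分离卷积
std = nn.Conv2d(256, 256, 3, padding=1)
dsc = DepthwiseSeparableConv(256, 256)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(dsc))   # 约 59 万 vs 约 6.8 万
```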
zh
[CV-88] Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
【速读】: 该论文试图解决现有具身导航模型在处理多样化任务时受限于特定任务配置或预定义地图的问题。解决方案的关键在于提出了Uni-NaVid,这是首个基于视频的视觉-语言-动作 (Vision-Language-Action, VLA) 模型,旨在统一多种具身导航任务,并在未见过的真实环境中实现无缝导航。Uni-NaVid通过协调所有常见具身导航任务的输入和输出数据配置,将这些任务整合到一个模型中,并通过从四个基本导航子任务中收集的360万条导航数据样本进行训练,从而实现了跨任务的学习协同效应。实验结果表明,Uni-NaVid在综合导航基准测试中展现了统一建模的优势,并达到了最先进的性能,同时在真实世界实验中验证了其有效性和泛化能力。
链接: https://arxiv.org/abs/2412.06224
作者: Jiazhao Zhang,Kunyu Wang,Shaoan Wang,Minghan Li,Haoran Liu,Songlin Wei,Zhongyuan Wang,Zhizheng Zhang,He Wang
关键词-EN: searching objects, answering questions, tracking people, interaction demands, practical navigation agent
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model’s effectiveness and efficiency, shedding light on its strong generalizability.
zh
[CV-89] Data Free Backdoor Attacks NEURIPS2024
【速读】: 该论文试图解决现有后门攻击(backdoor attacks)在缺乏干净数据、模型规模较大时效率低下以及因架构修改导致隐蔽性差的问题。解决方案的关键在于提出了一种无需重新训练(retraining-free)且无需数据(data-free)的后门攻击方法DFBA,通过修改分类器的少量参数来注入后门,而不改变模型架构。该方法在理论分析和实验评估中均证明了其不可检测性和不可移除性,并且在多个数据集上实现了接近100%的攻击成功率,同时绕过了六种现有的最先进防御机制,且分类准确率损失极小。
链接: https://arxiv.org/abs/2412.06219
作者: Bochuan Cao,Jinyuan Jia,Chuxuan Hu,Wenbo Guo,Zhen Xiang,Jinghui Chen,Bo Li,Dawn Song
关键词-EN: attacker-chosen target class, attacker-chosen backdoor trigger, attacker-chosen target, target class, Backdoor
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 8 figures, accepted by NeurIPS 2024
点击查看摘要
Abstract:Backdoor attacks aim to inject a backdoor into a classifier such that it predicts any input with an attacker-chosen backdoor trigger as an attacker-chosen target class. Existing backdoor attacks require either retraining the classifier with some clean data or modifying the model’s architecture. As a result, they are 1) not applicable when clean data is unavailable, 2) less efficient when the model is large, and 3) less stealthy due to architecture changes. In this work, we propose DFBA, a novel retraining-free and data-free backdoor attack without changing the model architecture. Technically, our proposed method modifies a few parameters of a classifier to inject a backdoor. Through theoretical analysis, we verify that our injected backdoor is provably undetectable and unremovable by various state-of-the-art defenses under mild assumptions. Our evaluation on multiple datasets further demonstrates that our injected backdoor: 1) incurs negligible classification loss, 2) achieves 100% attack success rates, and 3) bypasses six existing state-of-the-art defenses. Moreover, our comparison with a state-of-the-art non-data-free backdoor attack shows our attack is more stealthy and effective against various defenses while achieving less classification accuracy loss.
zh
[CV-90] A Real-Time Defense Against Object Vanishing Adversarial Patch Attacks for Object Detection in Autonomous Vehicles
【速读】: 该论文试图解决自动驾驶车辆(Autonomous Vehicles, AVs)中基于深度神经网络(DNN)的目标检测模型在面对对抗性补丁(Adversarial Patches)攻击时,可能导致物体消失(Object Vanishing)的问题。解决方案的关键是提出了一种名为ADAV(Adversarial Defense for Autonomous Vehicles)的新型防御方法,该方法能够在实时环境中运行,并利用自动驾驶车辆视频流中的上下文信息。ADAV通过检查目标帧与参考帧之间的输出是否具有时间一致性来检测对抗性补丁的存在,并使用基于梯度的归因(Gradient-based Attribution)来定位破坏时间一致性的对抗性像素。这种两阶段的处理流程不仅能够高效处理干净输入,还通过优化实现了低延迟。
链接: https://arxiv.org/abs/2412.06215
作者: Jaden Mu
关键词-EN: Autonomous vehicles, DNN-based object detection, increasingly use DNN-based, vision-based perception, object detection models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Autonomous vehicles (AVs) increasingly use DNN-based object detection models in vision-based perception. Correct detection and classification of obstacles is critical to ensure safe, trustworthy driving decisions. Adversarial patches aim to fool a DNN with intentionally generated patterns concentrated in a localized region of an image. In particular, object vanishing patch attacks can cause object detection models to fail to detect most or all objects in a scene, posing a significant practical threat to AVs. This work proposes ADAV (Adversarial Defense for Autonomous Vehicles), a novel defense methodology against object vanishing patch attacks specifically designed for autonomous vehicles. Unlike existing defense methods which have high latency or are designed for static images, ADAV runs in real-time and leverages contextual information from prior frames in an AV's video feed. ADAV checks if the object detector's output for the target frame is temporally consistent with the output from a previous reference frame to detect the presence of a patch. If the presence of a patch is detected, ADAV uses gradient-based attribution to localize adversarial pixels that break temporal consistency. This two stage procedure allows ADAV to efficiently process clean inputs, and both stages are optimized to be low latency. ADAV is evaluated using real-world driving data from the Berkeley Deep Drive BDD100K dataset, and demonstrates high adversarial and clean performance.
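下面示意 ADAV 第一阶段“时间一致性检查”的核心思想:把当前帧检测结果与前一参考帧做 IoU 匹配,若参考帧中大量目标在当前帧“消失”,则判定可能存在目标消失型对抗补丁。阈值与判定比例均为本文假设,基于梯度归因的第二阶段未包含在内。

```python
# 示意性代码:基于 IoU 匹配的跨帧时间一致性检查
import numpy as np

def iou(a, b):
    """a, b: (x1, y1, x2, y2)"""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def patch_suspected(ref_boxes, cur_boxes, iou_thresh=0.3, vanish_ratio=0.5):
    """若参考帧中超过 vanish_ratio 比例的框在当前帧找不到 IoU 匹配,则返回 True。"""
    if not ref_boxes:
        return False
    missed = sum(1 for r in ref_boxes if all(iou(r, c) < iou_thresh for c in cur_boxes))
    return missed / len(ref_boxes) > vanish_ratio

# 用法示例:参考帧有 3 个目标,当前帧只检出 1 个,触发告警
ref = [(10, 10, 50, 80), (100, 40, 160, 120), (200, 30, 240, 90)]
cur = [(12, 11, 49, 79)]
print(patch_suspected(ref, cur))   # True
```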
zh
[CV-91] MSCrackMamba: Leveraging Vision Mamba for Crack Detection in Fused Multispectral Imagery
【速读】: 该论文试图解决在基于视觉的裂缝检测中,红外(IR)和红绿蓝(RGB)通道分辨率不一致导致的细节丢失问题,以及传统图像分割网络感受野有限和高计算复杂度的问题。解决方案的关键在于提出了一种两阶段范式,称为MSCrackMamba。首先,通过超分辨率网络将IR通道的分辨率提升至与RGB通道匹配,以实现数据融合;其次,采用Vision Mamba作为骨干网络,并结合UperNet作为解码器进行裂缝检测。该方法在Crack900数据集上验证,相较于最佳基线方法,mIoU提升了3.55%。
链接: https://arxiv.org/abs/2412.06211
作者: Qinfeng Zhu,Yuan Fang,Lei Fan
关键词-EN: structural health monitoring, prevent potential failures, Crack detection, RGB channels, structural health
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Crack detection is a critical task in structural health monitoring, aimed at assessing the structural integrity of bridges, buildings, and roads to prevent potential failures. Vision-based crack detection has become the mainstream approach due to its ease of implementation and effectiveness. Fusing infrared (IR) channels with red, green and blue (RGB) channels can enhance feature representation and thus improve crack detection. However, IR and RGB channels often differ in resolution. To align them, higher-resolution RGB images typically need to be downsampled to match the IR image resolution, which leads to the loss of fine details. Moreover, crack detection performance is restricted by the limited receptive fields and high computational complexity of traditional image segmentation networks. Inspired by the recently proposed Mamba neural architecture, this study introduces a two-stage paradigm called MSCrackMamba, which leverages Vision Mamba along with a super-resolution network to address these challenges. Specifically, to align IR and RGB channels, we first apply super-resolution to IR channels to match the resolution of RGB channels for data fusion. Vision Mamba is then adopted as the backbone network, while UperNet is employed as the decoder for crack detection. Our approach is validated on the large-scale Crack Detection dataset Crack900, demonstrating an improvement of 3.55% in mIoU compared to the best-performing baseline methods.
zh
[CV-92] Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
【速读】: 该论文试图解决从多样化自然声音中生成视觉场景图像的问题,这是一个由于听觉和视觉信号之间显著信息差异而具有挑战性的跨模态生成任务。解决方案的关键在于设计一个模型,通过丰富音频特征并将其转换到视觉潜在空间来对齐音频-视觉模态。具体来说,该模型利用声音源定位来选择具有强跨模态关联的音频-视觉对,并将这些特征输入预训练的图像生成器以生成图像。此外,通过分析学习到的嵌入空间的几何特性,证明了该方法能够有效对齐音频-视觉信号,并且具有通用性,能够集成多种模型架构和不同类型的音频-视觉数据。
链接: https://arxiv.org/abs/2412.06209
作者: Kim Sung-Bin,Arda Senocak,Hyunwoo Ha,Tae-Hyun Oh
关键词-EN: describe the world, audio describe, visual, audio-visual, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Under-review
点击查看摘要
Abstract:How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns audio-visual modalities by enriching audio features with visual information and translating them into the visual latent space. These features are then fed into the pre-trained image generator to produce images. To enhance image quality, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. Our method achieves substantially better results on the VEGAS and VGGSound datasets compared to previous work and demonstrates control over the generation process through simple manipulations to the input waveform or latent space. Furthermore, we analyze the geometric properties of the learned embedding space and demonstrate that our learning approach effectively aligns audio-visual signals for cross-modal generation. Based on this analysis, we show that our method is agnostic to specific design choices, showing its generalizability by integrating various model architectures and different types of audio-visual data.
zh
[CV-93] Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization
【速读】: 该论文试图解决多模态语义通信中的关键问题,特别是如何有效处理动态物理信道和噪声,以及如何应对多模态数据流(如视频和音频)的语义增强和识别任务。现有方法主要依赖模拟信道和假设恒定的信道状态(完美CSI),无法应对现实场景中的动态信道和噪声。此外,现有方法通常只处理单一模态任务,忽略了多模态语义增强的需求。论文提出的解决方案是一个引导框架,专门针对音频-视觉事件定位任务。该框架利用数字引导码和信道模块来指导现实场景中模拟信道的状态,并设计了基于欧拉的多模态语义编码和解码,考虑了动态信道状态下的时频特性。这一方法有效处理了多模态数据流,特别是在音频-视觉事件定位任务中表现出色,实验结果表明其在信道变化中的鲁棒性以及在信号噪声比(SNR)方面的优势。
链接: https://arxiv.org/abs/2412.06208
作者: Fei Yu,Zhe Xiang,Nan Che,Zhuoran Zhang,Yuandi Li,Junxiao Xue,Zhiguo Wan
关键词-EN: significantly enhances communication, enhances communication efficiency, Multimodal semantic, Multimodal semantic communication, significantly enhances
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Multimodal semantic communication, which integrates various data modalities such as text, images, and audio, significantly enhances communication efficiency and reliability. It has broad application prospects in fields such as artificial intelligence, autonomous driving, and smart homes. However, current research primarily relies on analog channels and assumes constant channel states (perfect CSI), which is inadequate for addressing dynamic physical channels and noise in real-world scenarios. Existing methods often focus on single modality tasks and fail to handle multimodal stream data, such as video and audio, and their corresponding tasks. Furthermore, current semantic encoding and decoding modules mainly transmit single modality features, neglecting the need for multimodal semantic enhancement and recognition tasks. To address these challenges, this paper proposes a pilot-guided framework for multimodal semantic communication specifically tailored for audio-visual event localization tasks. This framework utilizes digital pilot codes and channel modules to guide the state of analog channels in real-world scenarios and designs Euler-based multimodal semantic encoding and decoding that consider time-frequency characteristics based on dynamic channel state. This approach effectively handles multimodal stream source data, especially for audio-visual event localization tasks. Extensive numerical experiments demonstrate the robustness of the proposed framework in channel changes and its support for various communication scenarios. The experimental results show that the framework outperforms existing benchmark methods in terms of Signal-to-Noise Ratio (SNR), highlighting its advantage in semantic communication quality.
zh
[CV-94] You KAN Do It in a Single Shot: Plug-and-Play Methods with Single-Instance Priors
【速读】: 该论文试图解决在逆问题求解中,传统去噪方法通常需要大量数据集的问题。解决方案的关键在于引入KAN-PnP优化框架,该框架利用Kolmogorov-Arnold Networks (KANs)作为去噪器,能够在仅有一个噪声观测的情况下有效工作。KANs基于Kolmogorov-Arnold表示定理,提供了鲁棒的去噪方法,并且证明了KAN去噪器的Lipschitz连续性,确保了在PnP-ADMM等优化算法中的稳定性和收敛性。此外,论文提供了KAN-PnP的理论保证,证明了在数据保真项的凸性、去噪器的Lipschitz连续性和正则化泛函的有界性等关键条件下,KAN-PnP能够稳定且可靠地优化。实验结果表明,KAN-PnP在超分辨率和联合优化任务中优于现有方法,在单次学习中表现出优异的性能,且具有较强的收敛特性。
链接: https://arxiv.org/abs/2412.06204
作者: Yanqi Cheng,Carola-Bibiane Schönlieb,Angelica I Aviles-Rivero
关键词-EN: solving inverse problems, clean solution, serving as regularising, incorporates Kolmogorov-Arnold Networks, inverse problems
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The use of Plug-and-Play (PnP) methods has become a central approach for solving inverse problems, with denoisers serving as regularising priors that guide optimisation towards a clean solution. In this work, we introduce KAN-PnP, an optimisation framework that incorporates Kolmogorov-Arnold Networks (KANs) as denoisers within the Plug-and-Play (PnP) paradigm. KAN-PnP is specifically designed to solve inverse problems with single-instance priors, where only a single noisy observation is available, eliminating the need for large datasets typically required by traditional denoising methods. We show that KANs, based on the Kolmogorov-Arnold representation theorem, serve effectively as priors in such settings, providing a robust approach to denoising. We prove that the KAN denoiser is Lipschitz continuous, ensuring stability and convergence in optimisation algorithms like PnP-ADMM, even in the context of single-shot learning. Additionally, we provide theoretical guarantees for KAN-PnP, demonstrating its convergence under key conditions: the convexity of the data fidelity term, Lipschitz continuity of the denoiser, and boundedness of the regularisation functional. These conditions are crucial for stable and reliable optimisation. Our experimental results show, on super-resolution and joint optimisation, that KAN-PnP outperforms existing methods, delivering superior performance in single-shot learning with minimal data. The method exhibits strong convergence properties, achieving high accuracy with fewer iterations.
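下面给出一个通用的 PnP-ADMM 迭代骨架,说明去噪器如何作为先验插入优化循环。示例中数据项取最简单的二次保真项(其近端算子有闭式解),去噪器用高斯平滑占位,实际应替换为论文中基于单样本先验训练的 KAN 去噪器;参数 rho 与迭代次数为假设值。

```python
# 示意性代码:通用 PnP-ADMM 框架,去噪器即插即用的先验
import numpy as np
from scipy.ndimage import gaussian_filter

def pnp_admm(y, denoiser, rho=1.0, iters=30):
    """y: 含噪观测;denoiser: 任意去噪函数 z = D(v);返回重建结果。"""
    x, z, u = y.copy(), y.copy(), np.zeros_like(y)
    for _ in range(iters):
        x = (y + rho * (z - u)) / (1.0 + rho)   # 数据保真项 0.5||x - y||^2 的近端步(闭式解)
        z = denoiser(x + u)                      # 去噪器充当先验(可替换为 KAN 去噪器)
        u = u + x - z                            # 对偶变量(缩放形式)更新
    return x

# 用法示例:对一幅合成的含噪图像做 PnP 重建,并打印去噪前后相对干净图的 MSE
clean = np.zeros((64, 64)); clean[16:48, 16:48] = 1.0
noisy = clean + 0.3 * np.random.randn(64, 64)
recon = pnp_admm(noisy, denoiser=lambda v: gaussian_filter(v, sigma=1.0))
print(float(np.mean((noisy - clean) ** 2)), float(np.mean((recon - clean) ** 2)))
```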
zh
[CV-95] Size-Variable Virtual Try-On with Physical Clothes Size
【速读】: 该论文试图解决在图像领域中将任意尺寸的衣物适配到参考人物上的虚拟试衣问题。传统基于图像的虚拟试衣方法虽然能生成自然的试衣效果,但未考虑衣物与人物之间的物理尺寸关系。论文提出的解决方案关键在于实现尺寸可变的虚拟试衣,即根据衣物与人物的物理尺寸关系动态调整试衣图像中衣物的尺寸。为此,研究重点放在参考图像与试衣图像中衣物轮廓的残差上,并通过构建包含1,524张图像的尺寸可变虚拟试衣数据集以及提出相应的评估指标,验证了该方法在尺寸可变虚拟试衣上的优越性。
链接: https://arxiv.org/abs/2412.06201
作者: Yohei Yamashita,Chihiro Nakatani,Norimichi Ukita
关键词-EN: size-variable virtual try-on, virtual try-on, virtual try-on methods, virtual try-on problem, try-on
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:This paper addresses a new virtual try-on problem of fitting any size of clothes to a reference person in the image domain. While previous image-based virtual try-on methods can produce highly natural try-on images, these methods fit the clothes on the person without considering the relative relationship between the physical sizes of the clothes and the person. Different from these methods, our method achieves size-variable virtual try-on in which the image size of the try-on clothes is changed depending on this relative relationship of the physical sizes. To relieve the difficulty in maintaining the physical size of the clothes while synthesizing the high-fidelity image of the whole clothes, our proposed method focuses on the residual between the silhouettes of the clothes in the reference and try-on images. We also develop a size-variable virtual try-on dataset consisting of 1,524 images provided by 26 subjects. Furthermore, we propose an evaluation metric for size-variable virtual try-on. Quantitative and qualitative experimental results show that our method can achieve size-variable virtual try-on better than general virtual try-on methods.
zh
[CV-96] Adaptive Resolution Residual Networks – Generalizing Across Resolutions Easily and Efficiently
【速读】: 该论文试图解决现有深度学习架构在处理多分辨率信号数据时的局限性,即固定分辨率(fixed-resolution)方法无法充分利用多样化的信号数据,而自适应分辨率(adaptive-resolution)方法虽然提升了鲁棒性和计算效率,但设计复杂且难以广泛应用。论文提出的解决方案是引入自适应分辨率残差网络(Adaptive Resolution Residual Networks, ARRNs),其关键在于利用拉普拉斯残差(Laplacian residuals)作为通用自适应分辨率适配器,能够在推理时通过省略高分辨率拉普拉斯残差来降低低分辨率信号的计算成本,同时保持性能。此外,论文还引入了拉普拉斯丢弃(Laplacian dropout)来增强对低分辨率分布的鲁棒性,并通过神经算子(neural operators)的理论分析为ARRNs的优势提供了坚实的基础。
链接: https://arxiv.org/abs/2412.06195
作者: Léa Demeule,Mahtab Sandhu,Glen Berseth
关键词-EN: signal data captured, real world, world uses numerous, numerous sensors, Laplacian residuals
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The majority of signal data captured in the real world uses numerous sensors with different resolutions. In practice, however, most deep learning architectures are fixed-resolution; they consider a single resolution at training time and inference time. This is convenient to implement but fails to fully take advantage of the diverse signal data that exists. In contrast, other deep learning architectures are adaptive-resolution; they directly allow various resolutions to be processed at training time and inference time. This benefits robustness and computational efficiency but introduces difficult design constraints that hinder mainstream use. In this work, we address the shortcomings of both fixed-resolution and adaptive-resolution methods by introducing Adaptive Resolution Residual Networks (ARRNs), which inherit the advantages of adaptive-resolution methods and the ease of use of fixed-resolution methods. We construct ARRNs from Laplacian residuals, which serve as generic adaptive-resolution adapters for fixed-resolution layers, and which allow casting high-resolution ARRNs into low-resolution ARRNs at inference time by simply omitting high-resolution Laplacian residuals, thus reducing computational cost on low-resolution signals without compromising performance. We complement this novel component with Laplacian dropout, which regularizes for robustness to a distribution of lower resolutions, and which also regularizes for errors that may be induced by approximate smoothing kernels in Laplacian residuals. We provide a solid grounding for the advantageous properties of ARRNs through a theoretical analysis based on neural operators, and empirically show that ARRNs embrace the challenge posed by diverse resolutions with greater flexibility, robustness, and computational efficiency.
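下面用拉普拉斯金字塔的分解与重建示意“在推理时省略高分辨率残差即可降低低分辨率信号计算量”这一核心思想:丢弃最细一级残差,输出分辨率随之降低而内容基本保留。层数与上下采样方式均为本文假设,并非 ARRN 中可学习拉普拉斯残差模块的实现。

```python
# 示意性代码:拉普拉斯金字塔分解/重建,可选择性省略高分辨率残差
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 3):
    """x: (B, C, H, W)。返回 (最粗糙的低频图, 由粗到细排列的各级残差)。"""
    residuals = []
    for _ in range(levels):
        down = F.avg_pool2d(x, 2)
        up = F.interpolate(down, scale_factor=2, mode="bilinear", align_corners=False)
        residuals.append(x - up)       # 当前分辨率下的高频残差
        x = down
    return x, residuals[::-1]

def reconstruct(low, residuals, keep_levels=None):
    """keep_levels=None 使用全部残差;较小的值表示省略最细的若干级残差以降低计算量。"""
    used = residuals if keep_levels is None else residuals[:keep_levels]
    x = low
    for r in used:
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False) + r
    return x

# 用法示例:完整重建几乎无损;省略最细一级残差则得到 32x32 的低分辨率近似
img = torch.rand(1, 3, 64, 64)
low, res = laplacian_pyramid(img)
full = reconstruct(low, res)
coarse = reconstruct(low, res, keep_levels=2)
print(full.shape, coarse.shape, float((full - img).abs().max()))
```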
zh
[CV-97] Event fields: Capturing light fields at high speed resolution and dynamic range
【速读】: 该论文试图解决在高速度和高动态范围场景下,传统基于帧的相机在延迟、带宽需求和动态范围方面的局限性问题。解决方案的关键在于提出了一种名为“事件场 (Event Fields)”的新方法,通过创新的光学设计,利用事件相机 (event cameras) 捕捉高速光场 (light fields)。具体实现包括两个基础框架:空间复用 (spatial multiplexing) 用于捕捉时间导数,时间复用 (temporal multiplexing) 用于捕捉角度导数。论文设计了两种互补的光学装置:一种使用万花筒 (kaleidoscope) 进行空间复用,另一种使用检流计 (galvanometer) 进行时间复用。通过模拟器和硬件原型评估,展示了这两种设计的独特优势,从而在高速度和高动态范围场景下实现了光场的完整优势,如后期重聚焦和深度估计。
链接: https://arxiv.org/abs/2412.06191
作者: Ziyuan Qu,Zihao Zou,Vivek Boominathan,Praneeth Chakravarthula,Adithya Pediredla
关键词-EN: reduced bandwidth requirements, traditional frame-based cameras, dynamic range compared, Event cameras, enhanced dynamic range
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Event cameras, which feature pixels that independently respond to changes in brightness, are becoming increasingly popular in high-speed applications due to their lower latency, reduced bandwidth requirements, and enhanced dynamic range compared to traditional frame-based cameras. Numerous imaging and vision techniques have leveraged event cameras for high-speed scene understanding by capturing high-framerate, high-dynamic range videos, primarily utilizing the temporal advantages inherent to event cameras. Additionally, imaging and vision techniques have utilized the light field, a complementary dimension to temporal information, for enhanced scene understanding. In this work, we propose "Event Fields", a new approach that utilizes innovative optical designs for event cameras to capture light fields at high speed. We develop the underlying mathematical framework for Event Fields and introduce two foundational frameworks to capture them practically: spatial multiplexing to capture temporal derivatives and temporal multiplexing to capture angular derivatives. To realize these, we design two complementary optical setups: one using a kaleidoscope for spatial multiplexing and another using a galvanometer for temporal multiplexing. We evaluate the performance of both designs using a custom-built simulator and real hardware prototypes, showcasing their distinct benefits. Our event fields unlock the full advantages of typical light fields, such as post-capture refocusing and depth estimation, now supercharged for high-speed and high-dynamic range scenes. This novel light-sensing paradigm opens doors to new applications in photography, robotics, and AR/VR, and presents fresh challenges in rendering and machine learning.
zh
[CV-98] Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition
【速读】: 该论文试图解决开放词汇多标签识别(OV-MLR)任务中,由于未见类别和不同类别间判别区域数量差异导致的语义关联捕捉不足的问题。解决方案的关键在于提出了一种新颖的类别自适应跨模态语义精炼与迁移(C²SRT)框架,该框架包含两个互补模块:类别内语义精炼(ISR)模块和类别间语义迁移(IST)模块。ISR模块通过利用视觉语言预训练(VLP)模型的跨模态知识,自适应地找到最能代表目标类别语义的局部判别区域;IST模块则通过利用大语言模型(LLMs)的常识能力,构建类别自适应的相关图,并将语义知识从已见类别迁移到未见类别。实验结果表明,该框架在OV-MLR基准测试中显著优于当前最先进的算法。
链接: https://arxiv.org/abs/2412.06190
作者: Haijing Liu,Tao Pu,Hefeng Wu,Keze Wang,Liang Lin
关键词-EN: recent vision language, vision language pre-training, capability of CLIP, recent vision, language pre-training
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages
点击查看摘要
Abstract:Benefiting from the generalization capability of CLIP, recent vision language pre-training (VLP) models have demonstrated an impressive ability to capture virtually any visual concept in daily images. However, due to the presence of unseen categories in open-vocabulary settings, existing algorithms struggle to effectively capture strong semantic correlations between categories, resulting in sub-optimal performance on open-vocabulary multi-label recognition (OV-MLR). Furthermore, the substantial variation in the number of discriminative areas across diverse object categories is misaligned with the fixed-number patch matching used in current methods, introducing noisy visual cues that hinder the accurate capture of target semantics. To tackle these challenges, we propose a novel category-adaptive cross-modal semantic refinement and transfer (C²SRT) framework to explore the semantic correlation both within each category and across different categories, in a category-adaptive manner. The proposed framework consists of two complementary modules, i.e., intra-category semantic refinement (ISR) module and inter-category semantic transfer (IST) module. Specifically, the ISR module leverages the cross-modal knowledge of the VLP model to adaptively find a set of local discriminative regions that best represent the semantics of the target category. The IST module adaptively discovers a set of most correlated categories for a target category by utilizing the commonsense capabilities of LLMs to construct a category-adaptive correlation graph and transfers semantic knowledge from the correlated seen categories to unseen ones. Extensive experiments on OV-MLR benchmarks clearly demonstrate that the proposed C²SRT framework outperforms current state-of-the-art algorithms.
zh
[CV-99] Evaluating Model Perception of Color Illusions in Photorealistic Scenes
【速读】: 该论文试图解决的问题是探究视觉-语言模型 (Vision-Language Models, VLMs) 在面对颜色错觉 (color illusions) 时是否表现出与人类视觉相似的感知偏差。解决方案的关键在于提出了一个自动化的框架来生成颜色错觉图像,并构建了RCID (Realistic Color Illusion Dataset) 数据集,包含19,000张逼真的错觉图像。通过实验验证,所有研究的VLMs均表现出与人类视觉相似的感知偏差,并进一步训练了一个模型来区分人类感知与实际像素差异。
链接: https://arxiv.org/abs/2412.06184
作者: Lingjun Mao,Zineng Tang,Alane Suhr
关键词-EN: Color illusion, color, Realistic Color Illusion, Color Illusion Dataset, color illusion images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We study the perception of color illusions by vision-language models. Color illusion, where a person's visual system perceives color differently from actual color, is well-studied in human vision. However, it remains underexplored whether vision-language models (VLMs), trained on large-scale human data, exhibit similar perceptual biases when confronted with such color illusions. We propose an automated framework for generating color illusion images, resulting in RCID (Realistic Color Illusion Dataset), a dataset of 19,000 realistic illusion images. Our experiments show that all studied VLMs exhibit perceptual biases similar to human vision. Finally, we train a model to distinguish between human perception and actual pixel differences.
zh
[CV-100] Towards Long Video Understanding via Fine-detailed Video Story Generation
【速读】: 该论文试图解决长视频理解中的两个关键问题:复杂的长时间上下文关系建模和冗余信息的干扰。解决方案的关键在于提出了细粒度视频故事生成 (Fine-Detailed Video Story generation, FDVS) 方法,通过两种机制实现:一是自底向上的视频解释机制 (Bottom-up Video Interpretation Mechanism),逐步从视频片段到整体视频进行细粒度建模;二是语义冗余减少机制 (Semantic Redundancy Reduction mechanism),在视觉和文本层面去除冗余信息。最终,FDVS 将长视频转化为包含多粒度信息的分层文本表示,适用于多种任务且无需微调,展示了其有效性和通用性。
链接: https://arxiv.org/abs/2412.06182
作者: Zeng You,Zhiquan Wen,Yaofo Chen,Xin Li,Runhao Zeng,Yaowei Wang,Mingkui Tan
关键词-EN: Long video understanding, video understanding, video, computer vision, driving advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing video understanding methods suffer from two challenges when dealing with long video understanding: intricate long-context relationship modeling and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long videos into hierarchical textual representations that contain multi-granularity information of the video. With these representations, FDVS is applicable to various tasks without any fine-tuning. We evaluate the proposed method across eight datasets spanning three tasks. The performance demonstrates the effectiveness and versatility of our method.
zh
[CV-101] One-shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing
【速读】: 该论文试图解决在单次人体运动迁移中,由于运动和关节复杂性的大幅度变化,导致基于2D身体标志点、骨架和语义掩码的方法难以准确捕捉源图像与驱动姿态之间对应关系的问题。此外,DensePose的准确性和精度下降也影响了基于神经渲染方法的图像质量。解决方案的关键在于提出了一种统一的框架,结合多尺度特征扭曲(multi-scale feature warping)和神经纹理映射(neural texture mapping),以恢复更好的2D外观和2.5D几何结构。该框架通过联合训练和融合多种模态,利用DensePose的信息并适应其固有的有限准确性,从而生成鲁棒的神经纹理特征和多尺度密集运动流,有效处理几何误差并更好地保留外观。实验结果表明,该模型在处理具有显著自遮挡等挑战性案例时表现尤为出色。
链接: https://arxiv.org/abs/2412.06174
作者: Yuzhu Ji,Chuanxia Zheng,Tat-Jen Cham
关键词-EN: Human motion transfer, motion transfer aims, static source image, one-shot human motion, Human motion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This article has been accepted for publication in IEEE Transactions on Multimedia
点击查看摘要
Abstract:Human motion transfer aims at animating a static source image with a driving video. While recent advances in one-shot human motion transfer have led to significant improvement in results, it remains challenging for methods with 2D body landmarks, skeleton and semantic mask to accurately capture correspondences between source and driving poses due to the large variation in motion and articulation complexity. In addition, the accuracy and precision of DensePose degrade the image quality for neural-rendering-based methods. To address these limitations, and considering the importance of both appearance and geometry for motion transfer, in this work we propose a unified framework that combines multi-scale feature warping and neural texture mapping to recover better 2D appearance and 2.5D geometry, partly by exploiting the information from DensePose, yet adapting to its inherent limited accuracy. Our model takes advantage of multiple modalities by jointly training and fusing them, which allows it to learn robust neural texture features that cope with geometric errors as well as multi-scale dense motion flow that better preserves appearance. Experimental results with full and half-view body video datasets demonstrate that our model can generalize well and achieve competitive results, and that it is particularly effective in handling challenging cases such as those with substantial self-occlusions.
zh
[CV-102] Robust Noisy Correspondence Learning via Self-Drop and Dual-Weight
【速读】: 该论文试图解决跨模态匹配任务中由于众包或网络爬取数据引入的噪声对应问题。解决方案的关键在于提出了一种新颖的“自丢弃与双重权重”(self-drop and dual-weight)方法,通过精细的数据划分(qua-partitioning)将数据分为四类:干净且重要、干净但不重要、模糊和噪声。该方法通过自丢弃策略丢弃噪声样本,以有效减少噪声的影响,同时采用双重权重策略,确保模型更多关注重要样本,并适当利用模糊样本。相比现有方法,该方案在噪声数据集上表现出更高的鲁棒性和更稳定的性能。
链接: https://arxiv.org/abs/2412.06172
作者: Fan Liu,Chenwei Dong,Chuanyi Zhang,Hualiang Zhou,Jun Zhou
关键词-EN: researchers collect data, cross-modal matching, researchers collect, internet through crowd-sourcing, crowd-sourcing or web
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Many researchers collect data from the internet through crowd-sourcing or web crawling to alleviate the data-hungry challenge associated with cross-modal matching. Although such practice does not require expensive annotations, it inevitably introduces mismatched pairs and results in a noisy correspondence problem. Current approaches leverage the memorization effect of deep neural networks to distinguish noise and perform re-weighting. However, briefly lowering the weight of noisy pairs cannot eliminate the negative impact of noisy correspondence in the training process. In this paper, we propose a novel self-drop and dual-weight approach, which achieves elaborate data processing by qua-partitioning the data. Specifically, our approach partitions all data into four types: clean and significant, clean yet insignificant, vague, and noisy. We analyze the effect of noisy and clean data pairs and find that for vision-language pre-training models, a small number of clean samples is more valuable than a majority of noisy ones. Based on this observation, we employ self-drop to discard noisy samples to effectively mitigate the impact of noise. In addition, we adopt a dual-weight strategy to ensure that the model focuses more on significant samples while appropriately leveraging vague samples. Compared to the prior works, our approach is more robust and demonstrates relatively more stable performance on noisy datasets, especially under a high noise ratio. Extensive experiments on three widely used datasets, including Flickr30K, MS-COCO, and Conceptual Captions, validate the effectiveness of our approach. The source code is available at this https URL.
zh
[CV-103] Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity
【速读】: 该论文试图解决视频异常理解 (Video Anomaly Understanding, VAU) 中多尺度时间跨度和复杂上下文的问题,传统方法主要关注帧级异常预测,缺乏对复杂和多样化现实世界异常的可解释性。解决方案的关键在于引入了一个大规模的分层视频异常理解基准 (Hierarchical Video Anomaly Understanding, HIVAU-70k),并通过半自动化的标注引擎结合手动视频分割和递归自由文本标注(使用大语言模型 (LLMs))生成超过70,000个多粒度标注。此外,提出了异常聚焦的时间采样器 (Anomaly-focused Temporal Sampler, ATS),通过集成异常评分器和密度感知采样器,自适应选择异常丰富的帧,从而提高长视频中异常检测的效率和准确性。
链接: https://arxiv.org/abs/2412.06171
作者: Huaxin Zhang,Xiaohao Xu,Xiang Wang,Jialong Zuo,Xiaonan Huang,Changxin Gao,Shanjun Zhang,Li Yu,Nong Sang
关键词-EN: occurring over varying, comprehend video anomalies, video anomalies occurring, Video Anomaly Understanding, varying temporal scales
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages
点击查看摘要
Abstract:How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at this https URL.
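下面示意“按异常分数做密度感知帧采样”的一种实现方式:对归一化分数的累积分布做逆变换采样,使有限的帧预算集中在异常富集区域,同时保留对正常片段的基础覆盖。具体采样策略为本文假设,并非 ATS 的原始算法。

```python
# 示意性代码:异常分数驱动的密度感知帧采样
import numpy as np

def anomaly_focused_sampling(scores: np.ndarray, n_frames: int, floor: float = 0.05):
    """scores: (T,) 每帧异常分数;返回至多 n_frames 个升序帧索引(去重后)。"""
    p = np.maximum(scores, 0) + floor * scores.max()        # 加底,保证正常片段也有基础覆盖
    cdf = np.cumsum(p) / p.sum()
    targets = (np.arange(n_frames) + 0.5) / n_frames         # 均匀分布的目标分位点
    idx = np.searchsorted(cdf, targets)
    return np.unique(np.clip(idx, 0, len(scores) - 1))

# 用法示例:3000 帧的长视频,仅第 1200~1300 帧附近异常分数较高
scores = np.full(3000, 0.05)
scores[1200:1300] = 0.9
picked = anomaly_focused_sampling(scores, n_frames=32)
print(len(picked), ((picked >= 1200) & (picked < 1300)).sum(), "帧落在异常区间内")
```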
zh
[CV-104] ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance
【速读】: 该论文试图解决高分辨率(HR)图像生成中的两个主要问题:一是现有方法在生成过程中容易出现重复图案的问题,二是使用扩散模型进行HR生成时的高计算成本。解决方案的关键在于提出了一种名为ASGDiffusion的新方法,该方法通过异步结构引导(Asynchronous Structure Guidance, ASG)和预训练的扩散模型实现并行HR生成。具体来说,ASGDiffusion利用低分辨率(LR)噪声并结合注意力掩码作为去噪步骤的结构引导,以确保语义一致性,从而有效缓解图案重复问题。此外,该方法通过异步计算补丁噪声和结构引导的并行策略,结合多GPU加速,显著提高了生成速度并减少了每GPU的内存使用。
链接: https://arxiv.org/abs/2412.06163
作者: Yuming Li,Peidong Jia,Daiwei Hong,Yueru Jia,Qi She,Rui Zhao,Ming Lu,Shanghang Zhang
关键词-EN: Training-free high-resolution, training large diffusion, generation, Structure Guidance, training large
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Training-free high-resolution (HR) image generation has garnered significant attention due to the high costs of training large diffusion models. Most existing methods begin by reconstructing the overall structure and then proceed to refine the local details. Despite their advancements, they still face issues with repetitive patterns in HR image generation. Besides, HR generation with diffusion models incurs significant computational costs. Thus, parallel generation is essential for interactive applications. To solve the above limitations, we introduce a novel method named ASGDiffusion for parallel HR generation with Asynchronous Structure Guidance (ASG) using pre-trained diffusion models. To solve the pattern repetition problem of HR image generation, ASGDiffusion leverages the low-resolution (LR) noise weighted by the attention mask as the structure guidance for the denoising step to ensure semantic consistency. The proposed structure guidance can significantly alleviate the pattern repetition problem. To enable parallel generation, we further propose a parallelism strategy, which calculates the patch noises and structure guidance asynchronously. By leveraging multi-GPU parallel acceleration, we significantly accelerate generation speed and reduce memory usage per GPU. Extensive experiments demonstrate that our method effectively and efficiently addresses common issues like pattern repetition and achieves state-of-the-art HR generation.
zh
[CV-105] A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition
【速读】: 该论文试图解决视觉位置识别 (Visual Place Recognition, VPR) 中由于环境外观变化导致的性能下降问题,同时提升计算效率和可扩展性。解决方案的关键在于提出超维度单一场景签名 (Hyperdimensional One Place Signatures, HOPS),通过融合在不同条件下捕获的多个参考集的描述符,利用超维度计算框架 (Hyperdimensional Computing) 实现对任意数量环境条件的扩展。HOPS 不仅提高了性能,还显著提升了计算效率和可扩展性,并通过广泛的评估证明了其在各种 VPR 方法和数据集上的高度通用性和显著的召回率提升。
链接: https://arxiv.org/abs/2412.06153
作者: Connor Malone,Somayeh Hussaini,Tobias Fischer,Michael Milford
关键词-EN: Visual Place Recognition, Visual Place, Place Recognition, comparing query images, comparing query
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
点击查看摘要
Abstract:Visual Place Recognition (VPR) enables coarse localization by comparing query images to a reference database of geo-tagged images. Recent breakthroughs in deep learning architectures and training regimes have led to methods with improved robustness to factors like environment appearance change, but with the downside that the required training and/or matching compute scales with the number of distinct environmental conditions encountered. Here, we propose Hyperdimensional One Place Signatures (HOPS) to simultaneously improve the performance, compute and scalability of these state-of-the-art approaches by fusing the descriptors from multiple reference sets captured under different conditions. HOPS scales to any number of environmental conditions by leveraging the Hyperdimensional Computing framework. Extensive evaluations demonstrate that our approach is highly generalizable and consistently improves recall performance across all evaluated VPR methods and datasets by large margins. Arbitrarily fusing reference images without compute penalty enables numerous other useful possibilities, three of which we demonstrate here: descriptor dimensionality reduction with no performance penalty, stacking synthetic images, and coarse localization to an entire traverse or environmental section.
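As a concrete illustration of descriptor fusion by hyperdimensional bundling, here is a minimal NumPy sketch: descriptors of one place captured under several conditions are mapped to bipolar hypervectors with a fixed random projection and summed into a single signature, which a query is matched against by cosine similarity. The projection scheme and dimensions are generic hyperdimensional-computing choices, not the exact HOPS construction.

```python
# Minimal hyperdimensional "one place signature" sketch (not the authors' exact
# pipeline): descriptors of the same place captured under different conditions
# are projected into a high-dimensional bipolar space and bundled into a single
# signature that can be matched against queries with cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
DESC_DIM, HD_DIM = 512, 10_000          # descriptor size, hyperdimensional size
PROJ = rng.choice([-1.0, 1.0], size=(DESC_DIM, HD_DIM))  # fixed random projection

def to_hd(desc: np.ndarray) -> np.ndarray:
    """Map a real-valued descriptor to a bipolar hypervector."""
    return np.sign(desc @ PROJ)

def bundle(descriptors: list[np.ndarray]) -> np.ndarray:
    """Fuse descriptors of one place (different conditions) into one signature."""
    return np.sign(np.sum([to_hd(d) for d in descriptors], axis=0))

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy reference map: 3 places, each observed under e.g. day/night/rain conditions.
places = [[rng.normal(size=DESC_DIM) for _ in range(3)] for _ in range(3)]
signatures = [bundle(p) for p in places]

# Query taken under yet another condition of place 1 (noisy copy of one view).
query = to_hd(places[1][0] + 0.3 * rng.normal(size=DESC_DIM))
scores = [similarity(query, s) for s in signatures]
print("best match:", int(np.argmax(scores)), "scores:", np.round(scores, 3))
```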
zh
[CV-106] An Effective and Resilient Backdoor Attack Framework against Deep Neural Networks and Vision Transformers
【速读】: 该论文试图解决深度神经网络 (DNN) 模型在面对后门攻击时的脆弱性问题,特别是现有攻击方法中触发器形状和位置的任意设置或随机选择导致的攻击效果和鲁棒性不足的问题。解决方案的关键在于提出了一种基于注意力机制的掩码生成方法,用于搜索最优的触发器形状和位置,并通过引入体验质量 (QoE) 项到损失函数中,调整触发器的透明度以使后门样本更加自然。此外,论文还提出了交替重训练算法,在注入后门的过程中交替使用混合毒化数据集和仅良性样本进行重训练,以提高受害模型的预测准确性。最后,通过在协同优化攻击框架下交替优化触发器和后门模型,进一步提升了攻击性能,并展示了该方法在视觉变换器上的扩展应用及其对现有防御措施的鲁棒性。
链接: https://arxiv.org/abs/2412.06149
作者: Xueluan Gong,Bowei Tian,Meng Xue,Yuan Wu,Yanjiao Chen,Qian Wang
关键词-EN: Deep Neural Network, Neural Network, Deep Neural, Recent studies, vulnerability of Deep
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Recent studies have revealed the vulnerability of Deep Neural Network (DNN) models to backdoor attacks. However, existing backdoor attacks arbitrarily set the trigger mask or use a randomly selected trigger, which restricts the effectiveness and robustness of the generated backdoor triggers. In this paper, we propose a novel attention-based mask generation methodology that searches for the optimal trigger shape and location. We also introduce a Quality-of-Experience (QoE) term into the loss function and carefully adjust the transparency value of the trigger in order to make the backdoored samples to be more natural. To further improve the prediction accuracy of the victim model, we propose an alternating retraining algorithm in the backdoor injection process. The victim model is retrained with mixed poisoned datasets in even iterations and with only benign samples in odd iterations. Besides, we launch the backdoor attack under a co-optimized attack framework that alternately optimizes the backdoor trigger and backdoored model to further improve the attack performance. Apart from DNN models, we also extend our proposed attack method against vision transformers. We evaluate our proposed method with extensive experiments on VGG-Flower, CIFAR-10, GTSRB, CIFAR-100, and ImageNette datasets. It is shown that we can increase the attack success rate by as much as 82% over baselines when the poison ratio is low and achieve a high QoE of the backdoored samples. Our proposed backdoor attack framework also showcases robustness against state-of-the-art backdoor defenses.
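The alternating retraining schedule can be pictured with a short PyTorch loop; the model, datasets, and hyperparameters below are toy placeholders rather than the paper's setup, but the alternation matches the description: even epochs use the mixed poisoned set, odd epochs only benign samples.

```python
# A schematic alternating-retraining loop (model, datasets, and hyperparameters
# are placeholders): even iterations train on a mixed poisoned set, odd
# iterations on benign samples only, mirroring the strategy in the abstract.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

# Toy stand-ins for the benign and mixed (benign + trigger-stamped) datasets.
benign = TensorDataset(torch.rand(256, 3, 32, 32), torch.randint(0, 10, (256,)))
poisoned = TensorDataset(torch.rand(256, 3, 32, 32), torch.randint(0, 10, (256,)))

for epoch in range(10):
    loader = DataLoader(poisoned if epoch % 2 == 0 else benign,
                        batch_size=64, shuffle=True)
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    source = "mixed poisoned" if epoch % 2 == 0 else "benign"
    print(f"epoch {epoch}: trained on {source} data")
```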
zh
[CV-107] Homogeneous Dynamics Space for Heterogeneous Humans
【速读】: 该论文试图解决人类运动动力学(human dynamics)研究中存在的异质性问题,即不同领域(如生物力学和强化学习)在运动学表示和层次动力学表示上的多样性。解决方案的关键在于提出同质动力学空间(Homogeneous Dynamics Space, HDyS),通过聚合异质数据并借鉴逆向-正向动力学过程,训练一个同质的潜在空间,从而实现对人类运动学和动力学之间的有效映射。
链接: https://arxiv.org/abs/2412.06146
作者: Xinpeng Liu,Junxuan Liang,Chenshuo Zhang,Zixuan Cai,Cewu Lu,Yong-Lu Li
关键词-EN: achieved tremendous advances, tremendous advances, achieved tremendous, human, dynamics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Cewu Lu and Yong-Lu Li are the corresponding authors
点击查看摘要
Abstract:Analyses of human motion kinematics have achieved tremendous advances. However, the production mechanism, known as human dynamics, remains under-explored. In this paper, we aim to push data-driven human dynamics understanding forward. We identify a major obstacle to this as the heterogeneity of existing human motion understanding efforts. Specifically, heterogeneity exists not only in the diverse kinematics representations and hierarchical dynamics representations but also in the data from different domains, namely biomechanics and reinforcement learning. With an in-depth analysis of the existing heterogeneity, we propose to emphasize the underlying homogeneity: all of them represent the homogeneous fact of human motion, though from different perspectives. Given this, we propose Homogeneous Dynamics Space (HDyS) as a fundamental space for human dynamics by aggregating heterogeneous data and training a homogeneous latent space with inspiration from the inverse-forward dynamics procedure. Leveraging the heterogeneous representations and datasets, HDyS achieves decent mapping between human kinematics and dynamics. We demonstrate the feasibility of HDyS with extensive experiments and applications. The project page is this https URL.
zh
[CV-108] Precise Fast and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters
【速读】: 该论文旨在解决文本到图像生成模型中,如何精确、及时且低成本地从预训练模型中移除不需要的概念(如版权受保护、冒犯性或不安全的内容)的问题。解决方案的关键在于提出了一种名为自适应值分解器 (Adaptive Value Decomposer, AdaVD) 的无训练概念擦除方法。该方法基于经典的线性代数正交补操作,在扩散模型的 UNet 结构中的每个交叉注意力层的值空间中实现,并通过设计一个有效的偏移因子来自适应调整擦除强度,从而在保持非目标内容生成的同时,增强擦除效果。实验结果表明,AdaVD 在单概念和多概念擦除任务中均表现出色,相较于现有方法,在保持生成内容质量的同时,显著提升了擦除效率。
链接: https://arxiv.org/abs/2412.06143
作者: Yuan Wang,Ouxiang Li,Tingting Mu,Yanbin Hao,Kuien Liu,Xiang Wang,Xiangnan He
关键词-EN: erase unwanted concepts, erasure efficacy, prior preservation, enabled by diffuion, imposed an urgent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The success of text-to-image generation enabled by diffusion models has imposed an urgent need to erase unwanted concepts, e.g., copyrighted, offensive, and unsafe ones, from the pre-trained models in a precise, timely, and low-cost manner. The twofold demand of concept erasure requires a precise removal of the target concept during generation (i.e., erasure efficacy), with a minimal impact on non-target content generation (i.e., prior preservation). Existing methods are either computationally costly or face challenges in maintaining an effective balance between erasure efficacy and prior preservation. To improve, we propose a precise, fast, and low-cost concept erasure method, called Adaptive Value Decomposer (AdaVD), which is training-free. This method is grounded in a classical linear algebraic orthogonal complement operation, implemented in the value space of each cross-attention layer within the UNet of diffusion models. An effective shift factor is designed to adaptively navigate the erasure strength, enhancing prior preservation without sacrificing erasure efficacy. Extensive experimental results show that the proposed AdaVD is effective at both single and multiple concept erasure, showing a 2- to 10-fold improvement in prior preservation as compared to the second best, meanwhile achieving the best or near best erasure efficacy, when compared with both training-based and training-free state-of-the-art methods. AdaVD supports a series of diffusion models and downstream image generation tasks; the code is available on the project page: this https URL
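The core orthogonal-complement operation is simple enough to sketch directly. The snippet below removes the component of value vectors along a concept direction, scaled by a shift factor; tensor sizes are arbitrary, and the real AdaVD computes the concept direction and shift schedule inside the diffusion UNet's cross-attention layers.

```python
# Minimal sketch of concept erasure via an orthogonal-complement projection in a
# value space (illustrative only; AdaVD's shift-factor schedule is more
# elaborate). `values` plays the role of cross-attention value vectors and
# `concept` the embedding of the concept to erase.
import torch

def erase_concept(values: torch.Tensor, concept: torch.Tensor,
                  shift: float = 1.0) -> torch.Tensor:
    """Project value vectors onto the orthogonal complement of `concept`.

    values : (num_tokens, dim)  value vectors of one cross-attention layer
    concept: (dim,)             direction representing the unwanted concept
    shift  : 0 keeps values unchanged, 1 removes the component completely.
    """
    c = concept / concept.norm()
    coeff = values @ c                                # projection lengths
    return values - shift * coeff.unsqueeze(-1) * c   # subtract the component

values = torch.randn(77, 640)
concept = torch.randn(640)
erased = erase_concept(values, concept, shift=0.8)
# With full erasure (shift=1) the values become orthogonal to the concept:
fully = erase_concept(values, concept, 1.0)
print(torch.allclose(fully @ (concept / concept.norm()), torch.zeros(77), atol=1e-4))
```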
zh
[CV-109] AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor Correlations
【速读】: 该论文试图解决多智能体(multi-agent)环境下多模态感知中的传感器对齐问题,特别是在异构智能体(heterogeneous agent)之间由于环境因素导致的传感器测量不一致性。解决方案的关键在于提出了AgentAlign框架,该框架通过引入跨模态特征对齐空间(Cross-Modality Feature Alignment Space, CFAS)和异构智能体特征对齐(Heterogeneous Agent Feature Alignment, HAFA)机制,动态地协调不同智能体之间的多模态特征。此外,论文还提出了V2XSet-Noise数据集,用于模拟真实环境中的传感器噪声,以系统评估该框架的鲁棒性。实验结果表明,AgentAlign在V2X-Real和V2XSet-Noise基准上达到了最先进的性能,展示了其在实际合作自动驾驶应用中的潜力。
链接: https://arxiv.org/abs/2412.06142
作者: Zonglin Meng,Yun Zhang,Zhaoliang Zheng,Zhihao Zhao,Jiaqi Ma
关键词-EN: connected automated vehicles, attracted wide attention, leverage shared information, range limitation issues, address sensing occlusion
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Cooperative perception has attracted wide attention given its capability to leverage shared information across connected automated vehicles (CAVs) and smart infrastructures to address sensing occlusion and range limitation issues. However, existing research overlooks the fragile multi-sensor correlations in multi-agent settings, as the heterogeneous agent sensor measurements are highly susceptible to environmental factors, leading to weakened inter-agent sensor interactions. The varying operational conditions and other real-world factors inevitably introduce multifactorial noise and consequentially lead to multi-sensor misalignment, making the deployment of multi-agent multi-modality perception particularly challenging in the real world. In this paper, we propose AgentAlign, a real-world heterogeneous agent cross-modality feature alignment framework, to effectively address these multi-modality misalignment issues. Our method introduces a cross-modality feature alignment space (CFAS) and heterogeneous agent feature alignment (HAFA) mechanism to harmonize multi-modality features across various agents dynamically. Additionally, we present a novel V2XSet-noise dataset that simulates realistic sensor imperfections under diverse environmental conditions, facilitating a systematic evaluation of our approach’s robustness. Extensive experiments on the V2X-Real and V2XSet-Noise benchmarks demonstrate that our framework achieves state-of-the-art performance, underscoring its potential for real-world applications in cooperative autonomous driving. The controllable V2XSet-Noise dataset and generation pipeline will be released in the future.
zh
[CV-110] SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation
【速读】: 该论文试图解决细粒度视觉分类 (Fine-Grained Visual Classification, FGVC) 中由于子类别高度相似而导致的分类难题,特别是在数据集获取和标注成本高昂且需要专业知识的情况下。解决方案的关键在于提出了一种基于序列潜在扩散模型 (Sequence Latent Diffusion Model, SLDM) 的新方法,称为序列生成图像增强 (Sequence Generative Image Augmentation, SGIA),并结合了桥接迁移学习 (Bridging Transfer Learning, BTL) 过程,以缩小真实数据与合成数据之间的领域差距。该方法不仅在生成更逼真的图像样本方面超越了现有技术,还提供了超越传统刚性变换和风格变化的多样化姿态变换,显著提升了在少样本学习场景中的分类性能。
链接: https://arxiv.org/abs/2412.06138
作者: Qiyu Liao,Xin Yuan,Min Xu,Dadong Wang
关键词-EN: Fine-Grained Visual Classification, distinguishing highly similar, highly similar subcategories, similar subcategories remains, Fine-Grained Visual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures
点击查看摘要
Abstract:In Fine-Grained Visual Classification (FGVC), distinguishing highly similar subcategories remains a formidable challenge, often necessitating datasets with extensive variability. The acquisition and annotation of such FGVC datasets are notably difficult and costly, demanding specialized knowledge to identify subtle distinctions among closely related categories. Our study introduces a novel approach employing the Sequence Latent Diffusion Model (SLDM) for augmenting FGVC datasets, called Sequence Generative Image Augmentation (SGIA). Our method features a unique Bridging Transfer Learning (BTL) process, designed to minimize the domain gap between real and synthetically augmented data. This approach notably surpasses existing methods in generating more realistic image samples, providing a diverse range of pose transformations that extend beyond the traditional rigid transformations and style changes in generative augmentation. We demonstrate the effectiveness of our augmented dataset with substantial improvements in FGVC tasks on various datasets, models, and training strategies, especially in few-shot learning scenarios. Our method outperforms conventional image augmentation techniques in benchmark tests on three FGVC datasets, showcasing superior realism, variability, and representational quality. Our work sets a new benchmark and outperforms the previous state-of-the-art models in classification accuracy by 0.5% for the CUB-200-2011 dataset and advances the application of generative models in FGVC data augmentation.
zh
[CV-111] GCUNet: A GNN-Based Contextual Learning Network for Tertiary Lymphoid Structure Semantic Segmentation in Whole Slide Image
【速读】: 该论文试图解决在全切片图像 (WSI) 中三级淋巴结构 (TLS) 的语义分割问题,特别是如何有效整合上下文信息以识别 TLS 的边界和成熟度。解决方案的关键在于提出了基于图神经网络 (GNN) 的上下文学习网络 GCUNet。GCUNet 通过逐步聚合目标图像块外部的长距离和细粒度上下文信息,并利用细节与上下文融合块 (DCFusion) 将这些上下文信息与目标图像块的细节进行整合,从而实现更精确的 TLS 语义分割。实验结果表明,GCUNet 在多个数据集上相较于现有最先进 (SOTA) 方法,在 mF1 指标上至少提升了 7.41%。
链接: https://arxiv.org/abs/2412.06129
作者: Lei Su,Yang Du
关键词-EN: TLS semantic segmentation, tertiary lymphoid structure, TLS semantic, semantic segmentation, TLS
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We focus on tertiary lymphoid structure (TLS) semantic segmentation in whole slide image (WSI). Unlike TLS binary segmentation, TLS semantic segmentation identifies boundaries and maturity, which requires integrating contextual information to discover discriminative features. Due to the extensive scale of WSI (e.g., 100,000 \times 100,000 pixels), the segmentation of TLS is usually carried out through a patch-based strategy. However, this prevents the model from accessing information outside of the patches, limiting the performance. To address this issue, we propose GCUNet, a GNN-based contextual learning network for TLS semantic segmentation. Given an image patch (target) to be segmented, GCUNet first progressively aggregates long-range and fine-grained context outside the target. Then, a Detail and Context Fusion block (DCFusion) is designed to integrate the context and detail of the target to predict the segmentation mask. We build four TLS semantic segmentation datasets, called TCGA-COAD, TCGA-LUSC, TCGA-BLCA and INHOUSE-PAAD, and make the former three datasets (comprising 826 WSIs and 15,276 TLSs) publicly available to promote the TLS semantic segmentation. Experiments on these datasets demonstrate the superiority of GCUNet, achieving at least 7.41% improvement in mF1 compared with SOTA.
zh
[CV-112] HSDA: High-frequency Shuffle Data Augmentation for Birds-Eye-View Map Segmentation WACV
【速读】: 该论文试图解决在基于相机的鸟瞰图(Bird’s-Eye-View, BEV)地图分割中,如何通过数据增强技术提升网络对高频信息的处理能力,从而提高分割精度和细节感知能力的问题。解决方案的关键在于提出了高频信息打乱数据增强(High-frequency Shuffle Data Augmentation, HSDA)策略,该策略通过增强网络对高频图像内容的解读能力,使其能够更好地区分高频信息与噪声,从而提升对小区域和复杂细节的分割效果,并改善边缘和细节的感知。实验结果表明,该方法在nuScenes数据集上显著提升了相机系统的平均交并比(mIoU),达到了61.3%的新纪录。
链接: https://arxiv.org/abs/2412.06127
作者: Calvin Glisson,Qiuxiao Chen
关键词-EN: BEV map segmentation, BEV map, data augmentation, map segmentation, map segmentation plays
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 8 pages excluding references, 5 figures
点击查看摘要
Abstract:Autonomous driving has garnered significant attention in recent research, and Bird’s-Eye-View (BEV) map segmentation plays a vital role in the field, providing the basis for safe and reliable operation. While data augmentation is a commonly used technique for improving BEV map segmentation networks, existing approaches predominantly focus on manipulating spatial domain representations. In this work, we investigate the potential of frequency domain data augmentation for camera-based BEV map segmentation. We observe that high-frequency information in camera images is particularly crucial for accurate segmentation. Based on this insight, we propose High-frequency Shuffle Data Augmentation (HSDA), a novel data augmentation strategy that enhances a network’s ability to interpret high-frequency image content. This approach encourages the network to distinguish relevant high-frequency information from noise, leading to improved segmentation results for small and intricate image regions, as well as sharper edge and detail perception. Evaluated on the nuScenes dataset, our method demonstrates broad applicability across various BEV map segmentation networks, achieving a new state-of-the-art mean Intersection over Union (mIoU) of 61.3% for camera-only systems. This significant improvement underscores the potential of frequency domain data augmentation for advancing the field of autonomous driving perception. Code has been released: this https URL
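One plausible reading of "high-frequency shuffling" is swapping high-frequency spectrum content between images in a batch while keeping each image's low frequencies; the PyTorch sketch below does exactly that. The circular cutoff radius and batch-swap strategy are assumptions for illustration, not necessarily the paper's exact recipe.

```python
# A minimal sketch of high-frequency shuffling: low frequencies of each image
# are kept, high frequencies are swapped with those of another image in the
# batch. The cutoff radius and per-batch pairing are illustrative assumptions.
import torch

def high_freq_shuffle(images: torch.Tensor, radius: float = 0.1) -> torch.Tensor:
    """images: (B, C, H, W) float tensor. Returns augmented images."""
    B, C, H, W = images.shape
    freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))

    # Circular low-frequency mask centred in the shifted spectrum.
    ys = torch.arange(H).view(-1, 1) - H / 2
    xs = torch.arange(W).view(1, -1) - W / 2
    low_mask = ((ys**2 + xs**2).sqrt() <= radius * min(H, W)).to(images.dtype)

    perm = torch.randperm(B)                       # partner image for each sample
    mixed = freq * low_mask + freq[perm] * (1 - low_mask)
    return torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real

batch = torch.rand(4, 3, 128, 128)
augmented = high_freq_shuffle(batch)
print(augmented.shape)   # torch.Size([4, 3, 128, 128])
```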
zh
[CV-113] Self-supervised cost of transport estimation for multimodal path planning
【速读】: 该论文试图解决自主机器人在真实环境中如何根据高层次目标和周围环境信息,选择能量最优路径的问题。解决方案的关键在于开发了一种自监督学习方法,使机器人能够仅通过视觉输入估计其周围环境的运输成本(cost of transport)。该方法应用于多模态移动变形机器人(M4),展示了其在不同环境(如草地与平滑道路)中准确分配不同运输成本的能力,并强调了该方法在计算资源有限的Nvidia Jetson Orin Nano机器人计算单元上的低计算成本特性。
链接: https://arxiv.org/abs/2412.06101
作者: Vincent Gherold,Ioannis Mandralis,Eric Sihite,Adarsh Salagame,Alireza Ramezani,Morteza Gharib
关键词-EN: Autonomous robots operating, Autonomous robots, faced with decisions, Autonomous, robots operating
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Autonomous robots operating in real environments are often faced with decisions on how best to navigate their surroundings. In this work, we address a particular instance of this problem: how can a robot autonomously decide on the energetically optimal path to follow given a high-level objective and information about the surroundings? To tackle this problem we developed a self-supervised learning method that allows the robot to estimate the cost of transport of its surroundings using only vision inputs. We apply our method to the multi-modal mobility morphobot (M4), a robot that can drive, fly, segway, and crawl through its environment. By deploying our system in the real world, we show that our method accurately assigns different cost of transports to various types of environments e.g. grass vs smooth road. We also highlight the low computational cost of our method, which is deployed on an Nvidia Jetson Orin Nano robotic compute unit. We believe that this work will allow multi-modal robotic platforms to unlock their full potential for navigation and exploration tasks.
zh
[CV-114] Order Theory in the Context of Machine Learning: an application
【速读】: 该论文试图解决的问题是将整数值神经网络(IVNN)与热带几何中的热带有理函数建立等价关系,并探讨这种等价关系如何应用于神经网络的优化和结构化。解决方案的关键在于将IVNN与热带有理函数关联,并通过映射到多面体(polytopes)来实现。具体来说,论文展示了如何将具有特定激活函数(ReLU_t)的IVNN与热带有理函数对应,并进一步将这些函数映射到有序多面体(order polytopes)。通过这种方式,论文不仅揭示了神经网络与热带几何之间的深刻联系,还提出了一种新的卷积滤波器(poset filters),这些滤波器可以在反向传播过程中更新神经网络的权重,提供比传统池化方法(如平均池化、最大池化和混合池化)更高的精度,且无需额外训练参数。此外,论文还证明了从偏序集(poset)到有序多面体的映射是一对一的,并定义了热带多项式上的代数结构。
链接: https://arxiv.org/abs/2412.06097
作者: Eric Dolores-Cuenca,Aldo Guzman-Saenz,Sangil Kim,Susana Lopez-Moreno,Jose Mendoza-Cortes
关键词-EN: Geometry of Deep, Deep Neural Networks, tropical rational functions, Deep Neural, Tropical Geometry
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Category Theory (math.CT)
备注: Poster presentation in NeurIPS WIML 2024
点击查看摘要
Abstract:The paper ``Tropical Geometry of Deep Neural Networks'' by L. Zhang et al. introduces an equivalence between integer-valued neural networks (IVNN) with activation \text{ReLU}_t and tropical rational functions, which come with a map to polytopes. Here, IVNN refers to a network with integer weights but real biases, and \text{ReLU}_t is defined as \text{ReLU}_t(x)=\max(x,t) for t\in\mathbb{R}\cup\{-\infty\}. For every poset with n points, there exists a corresponding order polytope, i.e., a convex polytope in the unit cube [0,1]^n whose coordinates obey the inequalities of the poset. We study neural networks whose associated polytope is an order polytope. We then explain how posets with four points induce neural networks that can be interpreted as 2\times 2 convolutional filters. These poset filters can be added to any neural network, not only IVNN. Similarly to maxout, poset convolutional filters update the weights of the neural network during backpropagation with more precision than average pooling, max pooling, or mixed pooling, without the need to train extra parameters. We report experiments that support our statements. We also prove that the assignment from a poset to an order polytope (and to certain tropical polynomials) is one to one, and we define the structure of algebra over the operad of posets on tropical polynomials.
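Two of the objects named in the abstract are easy to make concrete: the activation ReLU_t(x) = max(x, t) and membership in an order polytope. The sketch below uses a made-up 4-point poset and test points purely for illustration.

```python
# Illustrative sketch of two objects from the abstract: the activation
# ReLU_t(x) = max(x, t) and membership in the order polytope of a 4-point poset
# (the poset and test points are made-up examples).
import numpy as np

def relu_t(x: np.ndarray, t: float) -> np.ndarray:
    """ReLU_t(x) = max(x, t); t = 0 recovers the usual ReLU, t = -inf the identity."""
    return np.maximum(x, t)

def in_order_polytope(x: np.ndarray, relations: list[tuple[int, int]]) -> bool:
    """x lies in the order polytope iff 0 <= x <= 1 and x[i] <= x[j] for each i <= j."""
    in_cube = np.all((x >= 0) & (x <= 1))
    respects_order = all(x[i] <= x[j] for i, j in relations)
    return bool(in_cube and respects_order)

# A 4-point "diamond" poset: 0 <= 1, 0 <= 2, 1 <= 3, 2 <= 3.
diamond = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(relu_t(np.array([-1.5, 0.2, 3.0]), t=0.0))                   # [0.  0.2 3. ]
print(in_order_polytope(np.array([0.1, 0.4, 0.3, 0.9]), diamond))  # True
print(in_order_polytope(np.array([0.5, 0.2, 0.6, 0.9]), diamond))  # False (x0 > x1)
```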
zh
[CV-115] GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis
【速读】: 该论文试图解决文本到图像生成(Text-to-Image, T2I)任务中,现有方法在处理复杂文本提示时难以准确建模对象属性和关系的问题。解决方案的关键在于提出了一种模块化的三步生成流程:首先使用现有的扩散模型生成图像(Generate),然后利用多模态大语言模型(Multi-Modal LLMs, MLLMs)识别生成图像中的错误并生成修正步骤的编辑计划(Plan),最后通过文本引导的图像编辑模型按计划对图像进行逐步修正(Edit)。该方法的优势在于其模块化设计、无需额外训练,并且可以灵活应用于各种图像生成和编辑模型组合,从而显著提升了现有SOTA模型在复杂文本提示下的表现,并缩小了不同模型之间的性能差距。
链接: https://arxiv.org/abs/2412.06089
作者: Ashish Goswami,Satyam Kumar Modi,Santhosh Rishi Deshineni,Harman Singh,Prathosh A. P,Parag Singla
关键词-EN: text prompts, complex text prompts, compositional text prompts, models, significant progress
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-to-image (T2I) generation has seen significant progress with diffusion models, enabling generation of photo-realistic images from text prompts. Despite this progress, existing methods still face challenges in following complex text prompts, especially those requiring compositional and multi-step reasoning. Given such complex instructions, SOTA models often make mistakes in faithfully modeling object attributes, and relationships among them. In this work, we present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps: (a) Generate: we first generate an image using existing diffusion models; (b) Plan: we make use of Multi-Modal LLMs (MLLMs) to identify the mistakes in the generated image expressed in terms of individual objects and their properties, and produce a sequence of corrective steps required in the form of an edit-plan; (c) Edit: we make use of existing text-guided image editing models to sequentially execute our edit-plan over the generated image to get the desired image which is faithful to the original instruction. Our approach derives its strength from the fact that it is modular in nature, is training free, and can be applied over any combination of image generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further helps improve the overall accuracy of our proposed approach. Our method flexibly trades inference time compute with performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models including DALLE-3 and the latest SD-3.5-Large. Our approach not only improves the performance of the SOTA models, by up to 3 points, it also reduces the performance gap between weaker and stronger models. Project page: this https URL
zh
[CV-116] A4-Unet: Deformable Multi-Scale Attention Network for Brain Tumor Segmentation
【速读】: 该论文试图解决脑肿瘤分割模型在面对MRI图像复杂性和变异性时遇到的挑战,包括不规则形状、边界模糊导致的噪声、误分类和分割不完整等问题,从而限制了模型的准确性。解决方案的关键在于提出了一种名为A4-Unet的新型网络架构,其中引入了多个创新模块以提升分割性能。具体来说,编码器中集成了可变形大核注意力机制(Deformable Large Kernel Attention, DLKA)以更好地捕捉多尺度肿瘤特征;瓶颈层采用了带有跨通道注意力的Swin空间金字塔池化(Swin Spatial Pyramid Pooling, SSPP)来研究图像中的长距离依赖和通道关系;解码器中引入了结合离散余弦变换(Discrete Cosine Transform, DCT)正交性的组合注意力模块(Combined Attention Module, CAM),用于通道和空间权重的加权;跳跃连接中加入了注意力门(Attention Gates, AG)以突出前景并抑制无关背景信息。这些创新设计使得A4-Unet在多个权威MRI脑肿瘤数据集上实现了新的最先进性能,尤其是在BraTS 2020数据集上达到了94.4%的Dice分数。
链接: https://arxiv.org/abs/2412.06088
作者: Ruoxin Wang,Tianyi Tang,Haiming Du,Yuxuan Cheng,Yu Wang,Lingjie Yang,Xiaohui Duan,Yunfang Yu,Yu Zhou,Donglong Chen
关键词-EN: tumor segmentation models, recent years, models have aided, aided diagnosis, diagnosis in recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 14 figures, IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2024
点击查看摘要
Abstract:Brain tumor segmentation models have aided diagnosis in recent years. However, they face challenges from MRI complexity and variability, including irregular shapes and unclear boundaries, leading to noise, misclassification, and incomplete segmentation, thereby limiting accuracy. To address these issues, we adhere to an outstanding Convolutional Neural Networks (CNNs) design paradigm and propose a novel network named A4-Unet. In A4-Unet, Deformable Large Kernel Attention (DLKA) is incorporated in the encoder, allowing for improved capture of multi-scale tumors. Swin Spatial Pyramid Pooling (SSPP) with cross-channel attention is employed in the bottleneck to further study long-distance dependencies within images and channel relationships. To enhance accuracy, a Combined Attention Module (CAM) with Discrete Cosine Transform (DCT) orthogonality for channel weighting and convolutional element-wise multiplication is introduced for spatial weighting in the decoder. Attention gates (AG) are added in the skip connections to highlight the foreground while suppressing irrelevant background information. The proposed network is evaluated on three authoritative MRI brain tumor benchmarks and a proprietary dataset, and it achieves a 94.4% Dice score on the BraTS 2020 dataset, thereby establishing multiple new state-of-the-art benchmarks. The code is available here: this https URL.
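For reference, the snippet below implements the standard additive attention gate used on U-Net skip connections (Attention U-Net style); channel sizes are illustrative, and A4-Unet's exact gate configuration may differ.

```python
# A generic additive attention gate on a U-Net skip connection (the standard
# Attention U-Net formulation; channel sizes are illustrative). The gate
# highlights foreground regions in the encoder features before they are
# concatenated with the decoder path.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)  # skip features
        self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)    # gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)          # attention map

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # skip: (B, skip_ch, H, W), gate: (B, gate_ch, H, W) at the same resolution
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(skip) + self.phi(gate))))
        return skip * attn  # suppress irrelevant background, keep foreground

gate = AttentionGate(skip_ch=64, gate_ch=128, inter_ch=32)
skip_feat = torch.randn(2, 64, 40, 40)
gate_feat = torch.randn(2, 128, 40, 40)
print(gate(skip_feat, gate_feat).shape)  # torch.Size([2, 64, 40, 40])
```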
zh
[CV-117] Are foundation models for computer vision good conformal predictors?
【速读】: 该论文试图解决在风险敏感和高风险应用场景中,基础模型(foundation models)的不确定性建模能力问题。解决方案的关键在于利用保序预测(Conformal Prediction, CP)这一统计框架,为模型的预测提供边缘覆盖率的理论保证。研究表明,基础模型,尤其是结合了视觉Transformer(Vision Transformers)的模型,非常适合进行保序预测过程。此外,论文发现,在适应性保序预测方法中,校准模型的置信度预测会导致保序集的效率下降,而少样本适应下游任务通常会提高保序得分,其中Adapters被认为是比提示学习(Prompt Learning)策略更好的可保序替代方案。特别地,APS在视觉基础模型中表现出特别的前景,因为它在多个具有挑战性但现实的场景中不违反边缘覆盖率属性。
链接: https://arxiv.org/abs/2412.06082
作者: Leo Fillioux,Julio Silva-Rodríguez,Ismail Ben Ayed,Paul-Henry Cournède,Maria Vakalopoulou,Stergios Christodoulidis,Jose Dolz
关键词-EN: Recent advances, advances in self-supervision, self-supervision and constrastive, brought the performance, unprecedented levels
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has been barely explored. In this work, we delve into the behavior of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. Furthermore, we show that calibrating the confidence predictions of these models leads to efficiency degradation of the conformal set on adaptive CP methods. In contrast, few-shot adaptation to downstream tasks generally enhances conformal scores, where we identify Adapters as a better conformable alternative compared to Prompt Learning strategies. Our empirical study identifies APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage property across multiple challenging, yet realistic scenarios.
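Since APS is singled out as promising, here is a minimal NumPy sketch of Adaptive Prediction Sets with toy softmax outputs standing in for a foundation model: calibration scores are the cumulative probability mass needed to include the true class, and test-time sets grow until that calibrated mass is covered. The quantile handling follows a common simplified recipe rather than any specific library.

```python
# A minimal Adaptive Prediction Sets (APS) sketch with plain NumPy, using toy
# softmax outputs in place of a model's predictions.
import numpy as np

def aps_calibrate(probs: np.ndarray, labels: np.ndarray, alpha: float = 0.1) -> float:
    """probs: (n, K) softmax outputs on a calibration set, labels: (n,)."""
    n = len(labels)
    order = np.argsort(-probs, axis=1)                    # classes by descending prob
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cum = np.cumsum(sorted_p, axis=1)
    rank = np.argmax(order == labels[:, None], axis=1)    # position of the true label
    scores = cum[np.arange(n), rank]                      # mass needed to reach it
    q_level = np.ceil((n + 1) * (1 - alpha)) / n          # finite-sample correction
    return float(np.quantile(scores, min(q_level, 1.0)))

def aps_predict(probs: np.ndarray, qhat: float) -> list[np.ndarray]:
    order = np.argsort(-probs, axis=1)
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cum = np.cumsum(sorted_p, axis=1)
    return [order[i, : np.searchsorted(cum[i], qhat) + 1] for i in range(len(probs))]

rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 10, size=500)
qhat = aps_calibrate(probs[:400], labels[:400])
sets = aps_predict(probs[400:], qhat)
print("qhat:", round(qhat, 3), "avg set size:", np.mean([len(s) for s in sets]))
```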
zh
[CV-118] GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion
【速读】: 该论文试图解决单目深度估计(metric monocular depth estimation)在多数据集训练和零样本精度(zero-shot accuracy)中的泛化难题,尤其是由于相机参数与深度之间的纠缠(entanglement)导致的复杂性。解决方案的关键在于提出了一种新的规范表示(canonical representation),该表示能够在不同相机设置下保持一致性,从而有效解耦深度与特定参数的关系,增强跨数据集的泛化能力。此外,论文还提出了一种新颖的架构,能够自适应且概率性地融合通过物体尺寸和垂直图像位置线索估计的深度,进一步提升了深度估计的准确性。
链接: https://arxiv.org/abs/2412.06080
作者: Karlo Koledic,Luka Petrovic,Ivan Markovic,Ivan Petrovic
关键词-EN: Generalizing metric monocular, significant challenge due, hindering multi-dataset training, depth amplifies issues, ill-posed nature
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Project website: this https URL
点击查看摘要
Abstract:Generalizing metric monocular depth estimation presents a significant challenge due to its ill-posed nature, while the entanglement between camera parameters and depth amplifies issues further, hindering multi-dataset training and zero-shot accuracy. This challenge is particularly evident in autonomous vehicles and mobile robotics, where data is collected with fixed camera setups, limiting the geometric diversity. Yet, this context also presents an opportunity: the fixed relationship between the camera and the ground plane imposes additional perspective geometry constraints, enabling depth regression via vertical image positions of objects. However, this cue is highly susceptible to overfitting, thus we propose a novel canonical representation that maintains consistency across varied camera setups, effectively disentangling depth from specific parameters and enhancing generalization across datasets. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via object size and vertical image position cues. A comprehensive evaluation demonstrates the effectiveness of the proposed approach on five autonomous driving datasets, achieving accurate metric depth estimation for varying resolutions, aspect ratios and camera setups. Notably, we achieve comparable accuracy to existing zero-shot methods, despite training on a single dataset with a single-camera setup.
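The vertical-image-position cue has a simple closed form under an idealized flat-ground, untilted pinhole model: Z = f * h / (v - v0). The sketch below evaluates it with KITTI-like intrinsics chosen only for illustration; the paper's estimator learns and probabilistically fuses this cue rather than applying the formula directly.

```python
# The vertical-image-position cue in its simplest closed form: for an untilted
# pinhole camera at height h above a flat ground plane, a ground-contact point
# imaged below the principal point lies at depth Z = f * h / (v - v0). This is
# only the geometric cue, not the paper's learned, probabilistically fused model.
def depth_from_vertical_position(v: float, v0: float, focal_px: float,
                                 cam_height_m: float) -> float:
    """Depth (metres) of a ground point from its image row v (pixels)."""
    dv = v - v0
    if dv <= 0:
        raise ValueError("point must lie below the horizon (v > v0)")
    return focal_px * cam_height_m / dv

# Example with KITTI-like intrinsics (focal ~721 px, camera ~1.65 m above ground).
for row in (400, 300, 250):
    z = depth_from_vertical_position(v=row, v0=187.0, focal_px=721.5,
                                     cam_height_m=1.65)
    print(f"image row {row}: depth is roughly {z:.1f} m")
```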
zh
[CV-119] Hyperspectral Image Spectral-Spatial Feature Extraction via Tensor Principal Component Analysis
【速读】: 该论文试图解决高光谱图像分类中的光谱-空间特征提取问题,关键解决方案是引入了一种基于张量(tensor)的新框架。该框架通过将循环卷积(circular convolution)融入张量结构,有效捕捉并整合光谱和空间信息。在此基础上,传统的PCA(Principal Component Analysis)技术被扩展为张量主成分分析(Tensor Principal Component Analysis, TPCA),利用高光谱数据的多维结构实现更有效的特征表示。实验结果表明,使用TPCA特征的分类模型在基准高光谱数据集上持续优于传统PCA和其他先进技术,突显了该张量框架在高光谱图像分析中的潜力。
链接: https://arxiv.org/abs/2412.06075
作者: Yuemei Ren,Liang Liao,Stephen John Maybank,Yanning Zhang,Xin Liu
关键词-EN: Principal Component Analysis, Tensor Principal Component, spectral-spatial feature extraction, Principal Component, Component Analysis
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper addresses the challenge of spectral-spatial feature extraction for hyperspectral image classification by introducing a novel tensor-based framework. The proposed approach incorporates circular convolution into a tensor structure to effectively capture and integrate both spectral and spatial information. Building upon this framework, the traditional Principal Component Analysis (PCA) technique is extended to its tensor-based counterpart, referred to as Tensor Principal Component Analysis (TPCA). The proposed TPCA method leverages the inherent multi-dimensional structure of hyperspectral data, thereby enabling more effective feature representation. Experimental results on benchmark hyperspectral datasets demonstrate that classification models using TPCA features consistently outperform those using traditional PCA and other state-of-the-art techniques. These findings highlight the potential of the tensor-based framework in advancing hyperspectral image analysis.
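The circular-convolution tensor algebra behind TPCA is the t-product, which the sketch below computes via an FFT along the spectral mode and verifies against the tube-wise definition; the full TPCA pipeline (covariance tensor and t-eigendecomposition) is omitted, and the array sizes are arbitrary.

```python
# A sketch of the tensor t-product that underlies TPCA: multiplication of
# n1 x n2 x n3 tensors where scalars are replaced by length-n3 "tubes" combined
# by circular convolution, computed efficiently by an FFT along the third mode.
import numpy as np

def t_product(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """t-product of A (n1 x k x n3) and B (k x n2 x n3) via FFT along mode 3."""
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    Cf = np.einsum("ikn,kjn->ijn", Af, Bf)       # slice-wise matrix products
    return np.real(np.fft.ifft(Cf, axis=2))

# Sanity check against the definition: each output tube is the circular
# convolution of the corresponding input tubes, summed over the inner index.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3, 8))   # e.g. 4 spatial positions, 3 features, 8 bands
B = rng.normal(size=(3, 2, 8))
C = t_product(A, B)

def circ_conv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

manual = sum(circ_conv(A[0, k], B[k, 0]) for k in range(3))
print(np.allclose(C[0, 0], manual))   # True
print(C.shape)                        # (4, 2, 8)
```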
zh
[CV-120] Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training
【速读】: 该论文试图解决在视频生成式扩散模型中精确控制相机姿态的问题。现有方法需要使用包含视频和相机姿态标注的额外数据集进行微调,这不仅数据密集且计算成本高,还可能破坏预训练模型的分布。论文提出的解决方案是 Latent-Reframe,其关键在于在预训练视频扩散模型的采样阶段进行相机控制,而无需微调。具体来说,Latent-Reframe 通过时间感知的点云将视频帧的潜在代码重新调整以匹配输入的相机轨迹,并通过潜在代码修复和协调来优化模型潜在空间,从而在保持模型原始分布的同时实现高效的相机控制和高质量视频生成。
链接: https://arxiv.org/abs/2412.06029
作者: Zhenghong Zhou,Jie An,Jiebo Luo
关键词-EN: Precise camera pose, Precise camera, Precise, camera pose, camera
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Precise camera pose control is crucial for video generation with diffusion models. Existing methods require fine-tuning with additional datasets containing paired videos and camera pose annotations, which are both data-intensive and computationally costly, and can disrupt the pre-trained model distribution. We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning. Unlike existing methods, Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the original model distribution. Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds. Latent code inpainting and harmonization then refine the model latent space, ensuring high-quality video generation. Experimental results demonstrate that Latent-Reframe achieves comparable or superior camera control precision and video quality to training-based methods, without the need for fine-tuning on additional datasets.
zh
[CV-121] FlexDiT: Dynamic Token Density Control for Diffusion Transformer
【速读】: 该论文试图解决扩散变换器 (Diffusion Transformers, DiT) 在生成任务中由于基于标记的自注意力机制的二次复杂度和大量采样步骤导致的计算需求过高的问题。解决方案的关键在于提出了一种名为 FlexDiT 的框架,该框架通过在空间和时间维度上动态调整标记密度来实现计算效率的提升,同时不牺牲生成质量。在空间维度上,FlexDiT 采用三段式架构,根据每一层的特征需求分配标记密度:底层使用 Poolingformer 进行高效的全局特征提取,中层使用稀疏-密集标记模块 (Sparse-Dense Token Modules, SDTM) 平衡全局上下文与局部细节,顶层使用密集标记来细化高频细节。在时间维度上,FlexDiT 在去噪阶段动态调节标记密度,随着生成过程的推进逐步增加标记数量。这种空间和时间上的协同优化使得 FlexDiT 在保持生成质量的同时显著提升了计算效率和推理速度。
链接: https://arxiv.org/abs/2412.06028
作者: Shuning Chang,Pichao Wang,Jiasheng Tang,Yi Yang
关键词-EN: Diffusion Transformers, deliver impressive generative, impressive generative performance, extensive sampling steps, face prohibitive computational
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion Transformers (DiT) deliver impressive generative performance but face prohibitive computational demands due to both the quadratic complexity of token-based self-attention and the need for extensive sampling steps. While recent research has focused on accelerating sampling, the structural inefficiencies of DiT remain underexplored. We propose FlexDiT, a framework that dynamically adapts token density across both spatial and temporal dimensions to achieve computational efficiency without compromising generation quality. Spatially, FlexDiT employs a three-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local detail, and dense tokens in the top layers to refine high-frequency details. Temporally, FlexDiT dynamically modulates token density across denoising stages, progressively increasing token count as finer details emerge in later timesteps. This synergy between FlexDiT’s spatially adaptive architecture and its temporal pruning strategy enables a unified framework that balances efficiency and fidelity throughout the generation process. Our experiments demonstrate FlexDiT’s effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with only a 0.09 increase in FID score on 512 \times 512 ImageNet images, a 56% reduction in FLOPs across video generation datasets including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, and a 69% improvement in inference speed on PixArt- \alpha on text-to-image generation task with a 0.24 FID score decrease. FlexDiT provides a scalable solution for high-quality diffusion-based generation compatible with further sampling optimization techniques.
zh
[CV-122] Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
【速读】: 该论文试图解决视频生成过程中出现的外观漂移(appearance drift)问题,即物体在跨帧过程中逐渐退化或不一致变化,导致视觉连贯性被破坏。解决方案的关键在于提出了Track4Gen,这是一种空间感知视频生成器,通过结合视频扩散损失(video diffusion loss)与跨帧点跟踪(point tracking across frames),在扩散特征层面上提供增强的空间监督。Track4Gen通过最小化修改现有视频生成架构,将视频生成和点跟踪任务统一到一个网络中,从而有效减少了外观漂移,实现了时间上稳定且视觉连贯的视频生成。
链接: https://arxiv.org/abs/2412.06016
作者: Hyeonho Jeong,Chun-Hao Paul Huang,Jong Chul Ye,Niloy Mitra,Duygu Ceylan
关键词-EN: breaking visual coherence, objects gradually degrade, recent foundational video, visually rich output, produce visually rich
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this http URL
点击查看摘要
Abstract:While recent foundational video generators produce visually rich output, they still struggle with appearance drift, where objects gradually degrade or change inconsistently across frames, breaking visual coherence. We hypothesize that this is because there is no explicit supervision in terms of spatial tracking at the feature level. We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. Track4Gen merges the video generation and point tracking tasks into a single network by making minimal changes to existing video generation architectures. Using Stable Video Diffusion as a backbone, Track4Gen demonstrates that it is possible to unify video generation and point tracking, which are typically handled as separate tasks. Our extensive evaluations show that Track4Gen effectively reduces appearance drift, resulting in temporally stable and visually coherent video generation. Project page: this http URL
zh
[CV-123] Post-hoc Probabilistic Vision-Language Models ALT
【速读】: 该论文试图解决视觉-语言模型(Vision-language models, VLMs)在面对领域偏移(domain shifts)时无法捕捉概念不确定性(uncertainties over concepts)的问题。解决方案的关键在于提出了一种无需额外训练的后验不确定性估计方法,通过在VLMs的最后一层引入贝叶斯后验近似(Bayesian posterior approximation),并解析量化余弦相似度(cosine similarities)的不确定性。该方法在不确定性量化、支持集选择和主动学习中表现出更好的校准预测不确定性和样本效率。
链接: https://arxiv.org/abs/2412.06014
作者: Anton Baumann,Rui Li,Marcus Klasson,Santeri Mentu,Shyamgopal Karthik,Zeynep Akata,Arno Solin,Martin Trapp
关键词-EN: found remarkable success, CLIP and SigLIP, Vision-language models, success in classification, found remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL
点击查看摘要
Abstract:Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
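A toy version of the idea: place a Gaussian posterior over an image embedding (as a last-layer Laplace approximation might provide) and propagate it to the cosine similarity with a text embedding. The paper derives this analytically; the sketch below uses Monte-Carlo sampling only to keep the example short, and all dimensions are arbitrary.

```python
# Toy post-hoc uncertainty sketch: a Gaussian posterior over an image embedding
# is propagated to the cosine similarity with a text embedding by sampling.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
img_mean = rng.normal(size=dim)        # posterior mean of the image embedding
img_cov = 0.05 * np.eye(dim)           # posterior covariance (isotropic toy choice)
text_emb = rng.normal(size=dim)        # deterministic text embedding

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

samples = rng.multivariate_normal(img_mean, img_cov, size=2000)
sims = np.array([cosine(s, text_emb) for s in samples])
print(f"cosine similarity: {sims.mean():.3f} +/- {sims.std():.3f}")
```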
zh
[CV-124] Enhancing Content Representation for AR Image Quality Assessment Using Knowledge Distillation
【速读】: 该论文试图解决增强现实(Augmented Reality, AR)场景中图像质量评估的问题,特别是由于数据稀缺和AR技术的独特性导致的有效质量评估指标开发困难。解决方案的关键在于提出了一种基于深度学习的客观评估指标,通过四个主要步骤实现:(1)微调自监督预训练的视觉变换器(Vision Transformer)以提取参考图像的显著特征,并将这些知识迁移到失真图像的表示中;(2)通过计算位移表示来量化失真;(3)使用基于交叉注意力的解码器捕捉感知质量特征;(4)结合正则化技术和标签平滑来解决过拟合问题。该方法在ARIQA数据集上的实验结果表明,其性能优于现有的最先进方法。
链接: https://arxiv.org/abs/2412.06003
作者: Aymen Sekhri,Seyed Ali Amirshahi,Mohamed-Chaker Larabi
关键词-EN: overlaying digital content, Augmented Reality, major immersive media, immersive media technology, digital content
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Submitted to the IEEE Transactions on Circuits and Systems for Video Technology
点击查看摘要
Abstract:Augmented Reality (AR) is a major immersive media technology that enriches our perception of reality by overlaying digital content (the foreground) onto physical environments (the background). It has far-reaching applications, from entertainment and gaming to education, healthcare, and industrial training. Nevertheless, challenges such as visual confusion and classical distortions can result in user discomfort when using the technology. Evaluating AR quality of experience becomes essential to measure user satisfaction and engagement, facilitating the refinement necessary for creating immersive and robust experiences. However, the scarcity of data and the distinctive characteristics of AR technology render the development of effective quality assessment metrics challenging. This paper presents a deep learning-based objective metric designed specifically for assessing image quality for AR scenarios. The approach entails four key steps: (1) fine-tuning a self-supervised pre-trained vision transformer to extract prominent features from reference images and distilling this knowledge to improve representations of distorted images, (2) quantifying distortions by computing shift representations, (3) employing cross-attention-based decoders to capture perceptual quality features, and (4) integrating regularization techniques and label smoothing to address the overfitting problem. To validate the proposed approach, we conduct extensive experiments on the ARIQA dataset. The results showcase the superior performance of our proposed approach across all model variants, namely TransformAR, TransformAR-KD, and TransformAR-KD+, in comparison to existing state-of-the-art methods.
zh
[CV-125] Paddy Disease Detection and Classification Using Computer Vision Techniques: A Mobile Application to Detect Paddy Disease
【速读】: 该论文试图解决植物病害对粮食供应的重大影响,特别是准确和及时的植物病害诊断问题。解决方案的关键在于利用深度学习技术,特别是计算机视觉模型,来实现无需植物病理学家参与的精确病害检测。论文评估了多种计算机视觉模型在检测水稻病害中的效果,并提出了基于YOLOv8模型和Vision Transformer的最佳深度学习病害检测系统。通过使用包含超过20,000张标注图像的Paddy Doctor数据集,研究实现了69%的平均mAP50检测准确率和99.38%的分类准确率。此外,开发了一款移动应用程序,使农民能够即时识别水稻病害并获得治疗指导,从而在实际应用中验证了模型的有效性。
链接: https://arxiv.org/abs/2412.05996
作者: Bimarsha Khanal,Paras Poudel,Anish Chapagai,Bijan Regmi,Sitaram Pokhrel,Salik Ram Khanal
关键词-EN: global food security, diseases significantly impact, Plant diseases significantly, food supply, food security
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages,12 figures and 2 tables
点击查看摘要
Abstract:Plant diseases significantly impact our food supply, causing problems for farmers, economies reliant on agriculture, and global food security. Accurate and timely plant disease diagnosis is crucial for effective treatment and minimizing yield losses. Despite advancements in agricultural technology, a precise and early diagnosis remains a challenge, especially in underdeveloped regions where agriculture is crucial and agricultural experts are scarce. However, adopting Deep Learning applications can assist in accurately identifying diseases without needing plant pathologists. In this study, the effectiveness of various computer vision models for detecting paddy diseases is evaluated and the best deep learning-based disease detection system is proposed. Both classification and detection are tested and evaluated using the Paddy Doctor dataset, which contains over 20,000 annotated images of paddy leaves for disease diagnosis. For detection, a YOLOv8-based model was used, while CNN models and a Vision Transformer were used for disease classification. An average mAP50 of 69% was achieved for detection tasks, and the Vision Transformer classification accuracy was 99.38%. It was found that detection models are effective at identifying multiple diseases simultaneously with less computing power, whereas classification models, though computationally expensive, exhibit better performance for classifying single diseases. Additionally, a mobile application was developed to enable farmers to identify paddy diseases instantly. Experiments with the app showed encouraging results in utilizing the trained models for both disease classification and treatment guidance.
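A detection run of this kind can be set up in a few lines with the Ultralytics YOLOv8 API; the dataset YAML, image path, and hyperparameters below are placeholders rather than the paper's actual configuration.

```python
# A minimal detection fine-tuning sketch with the Ultralytics YOLOv8 API (the
# dataset YAML path, image path, and hyperparameters are placeholders, not the
# paper's actual training configuration).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                       # small pretrained checkpoint
model.train(data="paddy_doctor.yaml",            # dataset config: paths + class names
            epochs=50, imgsz=640)
metrics = model.val()                            # reports mAP50 among other metrics
results = model("paddy_leaf.jpg")                # run inference on a single image
for r in results:
    print(r.boxes.cls, r.boxes.conf)             # predicted disease classes + scores
```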
zh
[CV-126] Nested Diffusion Models Using Hierarchical Latent Priors
【速读】: 该论文试图解决复杂场景图像生成中扩散模型生成质量不足的问题。解决方案的关键在于引入嵌套扩散模型(nested diffusion models),这是一种分层生成框架,通过一系列扩散模型逐步生成不同语义层次的潜在变量(latent variables)。每个模型都以前一个更高层次模型的输出为条件,最终实现图像生成。通过预训练的视觉编码器(visual encoder)学习强语义视觉表示,并通过降维和噪声注入调节其能力,该方法能够捕捉复杂的结构细节并显著提升图像质量。该框架在多个数据集上展示了无条件和条件生成任务中图像质量的显著提升,且计算开销最小化。
链接: https://arxiv.org/abs/2412.05984
作者: Xiao Zhang,Ruoxi Jiang,Rebecca Willett,Michael Maire
关键词-EN: introduce nested diffusion, nested diffusion models, powerful hierarchical generative, hierarchical generative framework, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce nested diffusion models, an efficient and powerful hierarchical generative framework that substantially enhances the generation quality of diffusion models, particularly for images of complex scenes. Our approach employs a series of diffusion models to progressively generate latent variables at different semantic levels. Each model in this series is conditioned on the output of the preceding higher-level models, culminating in image generation. Hierarchical latent variables guide the generation process along predefined semantic pathways, allowing our approach to capture intricate structural details while significantly improving image quality. To construct these latent variables, we leverage a pre-trained visual encoder, which learns strong semantic visual representations, and modulate its capacity via dimensionality reduction and noise injection. Across multiple datasets, our system demonstrates significant enhancements in image quality for both unconditional and class/text conditional generation. Moreover, our unconditional generation system substantially outperforms the baseline conditional system. These advancements incur minimal computational overhead as the more abstract levels of our hierarchy work with lower-dimensional representations.
zh
[CV-127] Chimera: Improving Generalist Model with Domain-Specific Experts
【速读】: 该论文试图解决现有大型多模态模型(Large Multi-modal Models, LMMs)在处理特定领域任务时表现不足的问题,这些任务通常需要大量的领域先验知识。解决方案的关键在于引入了一个名为Chimera的可扩展且低成本的多模态管道,通过渐进式训练策略将领域专家模型的特征整合到通用LMM的输入中。此外,为了解决通用视觉编码器与专家模型之间优化不平衡的问题,论文提出了一种新的通用专家协作掩码机制(Generalist-Specialist Collaboration Masking, GSCM)。这一方法使得模型在图表、表格、数学和文档等特定领域中表现出色,并在多模态推理和视觉内容提取任务中达到了最先进的性能。
链接: https://arxiv.org/abs/2412.05983
作者: Tianshuo Peng,Mingsheng Li,Hongbin Zhou,Renqiu Xia,Renrui Zhang,Lei Bai,Song Mao,Bin Wang,Conghui He,Aojun Zhou,Botian Shi,Tao Chen,Bo Zhang,Xiangyu Yue
关键词-EN: image-text paired data, increasing image-text paired, Large Multi-modal Models, Recent advancements, advancements in Large
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Chimera Homepage: this https URL
点击查看摘要
Abstract:Recent advancements in Large Multi-modal Models (LMMs) underscore the importance of scaling by increasing image-text paired data, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, generalist models are primarily trained on web-scale datasets dominated by natural images, resulting in the sacrifice of specialized capabilities for domain-specific tasks that require extensive domain prior knowledge. Moreover, directly integrating expert models tailored for specific domains is challenging due to the representational gap and imbalanced optimization between the generalist model and experts. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. This results in a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which are challenging tasks for assessing existing LMMs.
zh
[CV-128] Anti-Reference: Universal and Immediate Defense Against Reference-Based Generation
【速读】: 该论文试图解决生成式模型(如扩散模型)被滥用于制作虚假新闻或针对个人的有害内容的问题。解决方案的关键在于提出了一种名为Anti-Reference的新方法,通过向图像中添加不可察觉的对抗性噪声(adversarial noise)来保护图像免受基于参考的生成技术的威胁。论文设计了一个统一的损失函数,能够同时对抗基于微调的定制方法、非微调定制方法以及以人为中心的驱动方法。基于此损失函数,研究者训练了一个对抗性噪声编码器(Adversarial Noise Encoder),并使用PGD方法直接优化噪声。该方法展示了一定的迁移攻击能力,能够有效挑战灰盒模型和部分商业API,实验结果验证了Anti-Reference在图像安全领域的性能,并设立了新的基准。
链接: https://arxiv.org/abs/2412.05980
作者: Yiren Song,Shengtao Lou,Xiaokang Liu,Hai Ci,Pei Yang,Jiaming Liu,Mike Zheng Shou
关键词-EN: revolutionized generative modeling, produce high-fidelity images, Diffusion models, revolutionized generative, generative modeling
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have revolutionized generative modeling with their exceptional ability to produce high-fidelity images. However, misuse of such potent tools can lead to the creation of fake news or disturbing content targeting individuals, resulting in significant social harm. In this paper, we introduce Anti-Reference, a novel method that protects images from the threats posed by reference-based generation techniques by adding imperceptible adversarial noise to the images. We propose a unified loss function that enables joint attacks on fine-tuning-based customization methods, non-fine-tuning customization methods, and human-centric driving methods. Based on this loss, we train an Adversarial Noise Encoder to predict the noise or directly optimize the noise using the PGD method. Our method shows certain transfer attack capabilities, effectively challenging both gray-box models and some commercial APIs. Extensive experiments validate the performance of Anti-Reference, establishing a new benchmark in image security.
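The PGD branch of the method follows the familiar projected-gradient recipe. Below is a generic L_inf PGD loop in PyTorch; `protection_loss` is a placeholder for the paper's unified loss over the targeted generation pipelines, and the budget and step sizes are conventional defaults rather than the paper's values.

```python
# A generic L_inf PGD loop of the kind used to optimize protective noise
# (illustrative; `protection_loss` is a placeholder surrogate).
import torch

def protection_loss(image: torch.Tensor) -> torch.Tensor:
    # Placeholder: in practice this would measure how badly the reference-based
    # generators perform when conditioned on `image`.
    return image.pow(2).mean()

def pgd_protect(image: torch.Tensor, eps: float = 8 / 255, alpha: float = 1 / 255,
                steps: int = 20) -> torch.Tensor:
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = protection_loss(image + delta)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()        # ascend the surrogate loss
            delta.clamp_(-eps, eps)                   # stay inside the L_inf ball
            delta.add_((image + delta).clamp(0, 1) - (image + delta))  # valid pixels
        delta.grad.zero_()
    return (image + delta).detach()

clean = torch.rand(1, 3, 256, 256)
protected = pgd_protect(clean)
print((protected - clean).abs().max())   # perturbation stays within eps
```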
zh
[CV-129] Lightweight Spatial Embedding for Vision-based 3D Occupancy Prediction
【速读】: 该论文试图解决3D占用预测(Occupancy Prediction)在实时自动驾驶系统中因繁重的体素特征(voxel features)和3D卷积操作导致的内存和计算开销问题。解决方案的关键在于提出了LightOcc框架,通过轻量级空间嵌入(Lightweight Spatial Embedding)来补充基于鸟瞰图(Bird’s-Eye-View, BEV)特征的高度信息,同时保持其可部署性。具体来说,LightOcc首先利用全局空间采样(Global Spatial Sampling)从多视角深度分布中获取单通道占用(Single-Channel Occupancy),然后通过空间到通道机制(Spatial-to-Channel mechanism)和2D卷积提取三视角嵌入(Tri-Perspective Views, TPV Embeddings),最后通过轻量级TPV交互模块(Lightweight TPV Interaction module)使TPV嵌入相互作用,生成最优的空间嵌入以补充BEV特征。实验结果表明,LightOcc显著提升了预测精度,并在Occ3D-nuScenes基准上达到了最先进的性能。
链接: https://arxiv.org/abs/2412.05976
作者: Jinqing Zhang,Yanan Zhang,Qingjie Liu,Yunhong Wang
关键词-EN: garnered increasing attention, comprehensive fine-grained environmental, fine-grained environmental representation, Occupancy prediction, Lightweight Spatial Embedding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Occupancy prediction has garnered increasing attention in recent years for its comprehensive fine-grained environmental representation and strong generalization to open-set objects. However, cumbersome voxel features and 3D convolution operations inevitably introduce large overheads in both memory and computation, obstructing the deployment of occupancy prediction approaches in real-time autonomous driving systems. Although some methods attempt to efficiently predict 3D occupancy from 2D Bird’s-Eye-View (BEV) features through the Channel-to-Height mechanism, BEV features are insufficient to store all the height information of the scene, which limits performance. This paper proposes LightOcc, an innovative 3D occupancy prediction framework that leverages Lightweight Spatial Embedding to effectively supplement the height clues for the BEV-based representation while maintaining its deployability. Firstly, Global Spatial Sampling is used to obtain the Single-Channel Occupancy from the multi-view depth distribution. The Spatial-to-Channel mechanism then takes an arbitrary spatial dimension of the Single-Channel Occupancy as the feature dimension and extracts Tri-Perspective Views (TPV) Embeddings by 2D convolution. Finally, the TPV Embeddings interact with each other through the Lightweight TPV Interaction module to obtain a Spatial Embedding that optimally complements the BEV features. Extensive experimental results show that LightOcc significantly increases the prediction accuracy of the baseline and achieves state-of-the-art performance on the Occ3D-nuScenes benchmark.
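At the tensor level, the two reshaping mechanisms mentioned here reduce to view and permute operations; the sketch below illustrates them with made-up dimensions (Channel-to-Height unfolds BEV channels into a height axis, Spatial-to-Channel treats one spatial axis of the single-channel occupancy as channels for 2D convolution).

```python
# Shape-level sketch of the two reshaping mechanisms (dimensions are illustrative).
import torch

B, C, Z, H, W = 2, 16, 8, 100, 100

# Channel-to-Height: (B, C*Z, H, W) BEV features -> (B, C, Z, H, W) voxel features.
bev = torch.randn(B, C * Z, H, W)
voxel = bev.view(B, C, Z, H, W)
print(voxel.shape)                       # torch.Size([2, 16, 8, 100, 100])

# Spatial-to-Channel on single-channel occupancy (B, 1, Z, H, W): treat one
# spatial axis as channels and extract a tri-perspective-view embedding with 2D convs.
occ = torch.randn(B, 1, Z, H, W)
hw_view = occ.squeeze(1)                           # (B, Z, H, W): Z as channels
zw_view = occ.squeeze(1).permute(0, 2, 1, 3)       # (B, H, Z, W): H as channels
conv = torch.nn.Conv2d(Z, 32, kernel_size=3, padding=1)
tpv_hw = conv(hw_view)                             # analogous convs give other planes
print(tpv_hw.shape)                      # torch.Size([2, 32, 100, 100])
print(zw_view.shape)                     # torch.Size([2, 100, 8, 100])
```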
zh
[CV-130] Efficient Semantic Splatting for Remote Sensing Multi-view Segmentation
【速读】: 该论文试图解决点云数据在图像平面上的高效渲染和语义分割问题,关键在于提出了一种基于高斯溅射(Gaussian Splatting)的新型语义溅射方法。该方法通过将点云的RGB属性和语义特征同时投影到图像平面,实现RGB图像和语义分割结果的同步渲染。其核心创新包括利用点云的显式结构和一次性渲染策略来提升优化和渲染效率,并通过SAM2生成边界区域的伪标签以增强监督,同时引入二维特征图和三维空间层面的两级聚合损失,以提高视图一致性和空间连续性。
链接: https://arxiv.org/abs/2412.05969
作者: Zipeng Qi,Hao Chen,Haotian Zhang,Zhengxia Zou,Zhenwei Shi
关键词-EN: splatting approach based, Gaussian Splatting, semantic splatting approach, based on Gaussian, semantic splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we propose a novel semantic splatting approach based on Gaussian Splatting to achieve efficient and low-latency multi-view semantic segmentation. Our method projects the RGB attributes and semantic features of point clouds onto the image plane, simultaneously rendering RGB images and semantic segmentation results. Leveraging the explicit structure of point clouds and a one-time rendering strategy, our approach significantly enhances efficiency during optimization and rendering. Additionally, we employ SAM2 to generate pseudo-labels for boundary regions, which often lack sufficient supervision, and introduce two-level aggregation losses at the 2D feature map and 3D spatial levels to improve view consistency and spatial continuity.
zh
[CV-131] FOF-X: Towards Real-time Detailed Human Reconstruction from a Single Image
【速读】: 该论文试图解决从单张图像实时重建高质量人体几何(human geometry)的问题,其关键在于提出了一种高效的3D表示方法——傅里叶占用场(Fourier Occupancy Field, FOF)。FOF通过将3D占用场分解为2D向量场,保留了3D空间中的拓扑和空间关系,同时兼容2D卷积神经网络(CNN),从而在3D和2D领域之间架起桥梁。此外,论文设计了基于FOF的新重建框架FOF-X,通过引入拉普拉斯约束和基于自动机的间断匹配器,增强了FOF与网格表示之间的相互转换算法,提升了重建的鲁棒性和质量。该方法有效解决了现有3D表示方法计算需求高的问题,并在多个数据集和真实捕捉数据上实现了最先进的性能。
链接: https://arxiv.org/abs/2412.05961
作者: Qiao Feng,Yebin Liu,Yu-Kun Lai,Jingyu Yang,Kun Li
关键词-EN: detailed human geometry, Fourier Occupancy Field, Occupancy Field, propose Fourier Occupancy, Balancing real-time speed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2206.02194
点击查看摘要
Abstract:We introduce FOF-X for real-time reconstruction of detailed human geometry from a single image. Balancing real-time speed against high-quality results is a persistent challenge, mainly due to the high computational demands of existing 3D representations. To address this, we propose Fourier Occupancy Field (FOF), an efficient 3D representation by learning the Fourier series. The core of FOF is to factorize a 3D occupancy field into a 2D vector field, retaining topology and spatial relationships within the 3D domain while facilitating compatibility with 2D convolutional neural networks. Such a representation bridges the gap between 3D and 2D domains, enabling the integration of human parametric models as priors and enhancing the reconstruction robustness. Based on FOF, we design a new reconstruction framework, FOF-X, to avoid the performance degradation caused by texture and lighting. This enables our real-time reconstruction system to better handle the domain gap between training images and real images. Additionally, in FOF-X, we enhance the inter-conversion algorithms between FOF and mesh representations with a Laplacian constraint and an automaton-based discontinuity matcher, improving both quality and robustness. We validate the strengths of our approach on different datasets and real-captured data, where FOF-X achieves new state-of-the-art results. The code will be released for research purposes.
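FOF 的核心是用截断傅里叶级数沿视线方向表示占据函数,使 2D 网络只需回归每个像素的系数图。下面是一个由系数图重建占据体的示意函数(非官方实现,通道排布 a0, a1..aN, b1..bN 为假设约定):

```python
import torch

def fof_to_occupancy(coeffs, num_z=128):
    """由傅里叶系数图重建占据体:occ(z) = a0/2 + Σ_n [a_n cos(2πnz) + b_n sin(2πnz)],z∈[0,1]。
    coeffs: (B, 2N+1, H, W);返回 (B, num_z, H, W)。"""
    B, C, H, W = coeffs.shape
    N = (C - 1) // 2
    a0, a, b = coeffs[:, :1], coeffs[:, 1:N + 1], coeffs[:, N + 1:]
    z = torch.linspace(0, 1, num_z, device=coeffs.device)        # 沿视线方向采样
    n = torch.arange(1, N + 1, device=coeffs.device, dtype=coeffs.dtype)
    ang = 2 * torch.pi * z[:, None] * n[None, :]                 # (Z, N)
    occ = 0.5 * a0 \
        + torch.einsum('bnhw,zn->bzhw', a, ang.cos()) \
        + torch.einsum('bnhw,zn->bzhw', b, ang.sin())
    return occ

coeffs = torch.randn(1, 2 * 15 + 1, 256, 256)   # 假设 N=15 阶、256x256 分辨率
occ = fof_to_occupancy(coeffs)                   # (1, 128, 256, 256)
```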
zh
[CV-132] When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining
【速读】: 该论文试图解决在音频下游任务中利用预训练视觉模型时,通常需要大规模音频数据和复杂目标函数进行额外预训练的问题。解决方案的关键在于提出了一种名为Look Aside Adapter (LoAA)的适配器,通过直接微调视觉模型来实现高效的音频理解,而无需额外的预训练阶段。LoAA通过优化适配器以促进时间维度和频率维度之间的交互,使得视觉模型在各种音频和语音任务中的表现能够达到或超越预训练音频模型的性能,从而提供了一种资源高效且有效的解决方案。
链接: https://arxiv.org/abs/2412.05951
作者: Juan Yeo,Jinkwan Jang,Kyubyung Chae,Seongkyu Mun,Taesup Kim
关键词-EN: Recent studies show, Recent studies, vision models, studies show, audio
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 5 pages, 3 figures
点击查看摘要
Abstract:Recent studies show that pretrained vision models can boost performance in audio downstream tasks. To enhance the performance further, an additional pretraining stage with large scale audio data is typically required to infuse audio specific knowledge into the vision model. However, such approaches require extensive audio data and a carefully designed objective function. In this work, we propose bypassing the pretraining stage by directly fine-tuning the vision model with our Look Aside Adapter (LoAA) designed for efficient audio understanding. Audio spectrum data is represented across two heterogeneous dimensions time and frequency and we refine adapters to facilitate interactions between tokens across these dimensions. Our experiments demonstrate that our adapters allow vision models to reach or surpass the performance of pretrained audio models in various audio and speech tasks, offering a resource efficient and effective solution for leveraging vision models in audio applications.
zh
[CV-133] Adversarial Transferability in Deep Denoising Models: Theoretical Insights and Robustness Enhancement via Out-of-Distribution Typical Set Sampling
【速读】: 该论文试图解决基于深度学习的图像去噪模型在对抗攻击下的鲁棒性问题。论文指出,这些模型容易受到对抗攻击,即输入数据中的微小扰动可能导致模型失效,并且这种对抗样本具有高度的跨模型迁移性,这在分类模型中并不常见。解决方案的关键在于通过分析高斯噪声和对抗扰动的典型集(typical set)及其渐近等分性质(asymptotic equipartition property),证明对抗样本偏离了原始输入分布的典型集,从而导致模型失效。基于这一发现,论文提出了一种新的对抗防御方法——分布外典型集采样训练策略(Out-of-Distribution Typical Set Sampling Training strategy, TS),该策略不仅显著提升了模型的鲁棒性,还在一定程度上改善了去噪性能。
链接: https://arxiv.org/abs/2412.05943
作者: Jie Ning,Jiebao Sun,Shengzhu Shi,Zhichang Guo,Yao Li,Hongwei Li,Boying Wu
关键词-EN: Deep learning-based image, Deep learning-based, models demonstrate remarkable, learning-based image denoising, demonstrate remarkable performance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep learning-based image denoising models demonstrate remarkable performance, but their lack of robustness analysis remains a significant concern. A major issue is that these models are susceptible to adversarial attacks, where small, carefully crafted perturbations to input data can cause them to fail. Surprisingly, perturbations specifically crafted for one model can easily transfer across various models, including CNNs, Transformers, unfolding models, and plug-and-play models, leading to failures in those models as well. Such high adversarial transferability is not observed in classification models. We analyze the possible underlying reasons behind the high adversarial transferability through a series of hypotheses and validation experiments. By characterizing the manifolds of Gaussian noise and adversarial perturbations using the concept of typical set and the asymptotic equipartition property, we prove that adversarial samples deviate slightly from the typical set of the original input distribution, causing the models to fail. Based on these insights, we propose a novel adversarial defense method: the Out-of-Distribution Typical Set Sampling Training strategy (TS). TS not only significantly enhances the model’s robustness but also marginally improves denoising performance compared to the original model.
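论文的关键论证之一是:依据渐近等分性质(AEP),高斯噪声会集中在其典型集内,而对抗扰动会轻微偏离典型集。下面用 numpy 给出一个典型性检验的小示例(仅为概念演示,σ、阈值 ε 与“对抗”扰动的构造均为假设):

```python
import numpy as np

def in_gaussian_typical_set(noise, sigma=0.1, eps=0.05):
    """检验扰动是否落在 N(0, sigma^2 I) 的(弱)典型集内:
    典型样本的 -log p(x)/d 应接近微分熵 h = 0.5*log(2*pi*e*sigma^2)。"""
    x = noise.ravel()
    neg_logp_per_dim = 0.5 * np.log(2 * np.pi * sigma**2) + np.mean(x**2) / (2 * sigma**2)
    h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    return abs(neg_logp_per_dim - h) < eps

# 高斯噪声通常落在典型集内;能量或结构偏离的扰动则会被判为“非典型”
gauss = np.random.randn(3, 256, 256) * 0.1
adv = gauss * 1.5                               # 假设的“对抗”扰动:能量偏离
print(in_gaussian_typical_set(gauss), in_gaussian_typical_set(adv))   # True False
```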
zh
[CV-134] Enhanced 3D Generation by 2D Editing
【速读】: 该论文试图解决从预训练的2D扩散模型中提取3D表示时,基于扩散模型蒸馏(SDS)方法存在的信息提取效率低、导致3D内容生成不真实的问题。解决方案的关键在于提出了GE3D(3D Generation by Editing)方法,通过结合噪声轨迹和文本引导的去噪轨迹,优化潜在空间的对齐,从而在多步去噪过程中提取多粒度信息,生成高质量的3D内容。这一方法不仅提升了3D生成的真实感,还建立了3D生成与2D编辑之间的联系,为该领域提供了新的研究方向。
链接: https://arxiv.org/abs/2412.05929
作者: Haoran Li,Yuli Tian,Yong Liao,Lin Wang,Yuyang Wang,Peng Yuan Zhou
关键词-EN: diffusion models, creative applications, applications across gaming, interior design, pretrained diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Distilling 3D representations from pretrained 2D diffusion models is essential for 3D creative applications across gaming, film, and interior design. Current SDS-based methods are hindered by inefficient information distillation from diffusion models, which prevents the creation of photorealistic 3D contents. Our research reevaluates the SDS approach by analyzing its fundamental nature as a basic image editing process that commonly results in over-saturation, over-smoothing and lack of rich content due to the poor-quality single-step denoising. To address these limitations, we propose GE3D (3D Generation by Editing). Each iteration of GE3D utilizes a 2D editing framework that combines a noising trajectory to preserve the information of the input image, alongside a text-guided denoising trajectory. We optimize the process by aligning the latents across both trajectories. This approach fully exploits pretrained diffusion models to distill multi-granularity information through multiple denoising steps, resulting in photorealistic 3D outputs. Both theoretical and experimental results confirm the effectiveness of our approach, which not only advances 3D generation technology but also establishes a novel connection between 3D generation and 2D editing. This could potentially inspire further research in the field. Code and demos are released at this https URL.
zh
[CV-135] BiDM: Pushing the Limit of Quantization for Diffusion Models NEURIPS2024
【速读】: 该论文试图解决扩散模型(Diffusion Models, DMs)在资源受限场景下的实际应用问题,特别是由于计算成本高和参数规模大导致的限制。解决方案的关键在于提出了一种名为BiDM的新方法,通过完全二值化权重和激活(W1A1)来实现极致的量化压缩。具体来说,从时间维度上引入了时间步友好二值结构(Timestep-friendly Binary Structure, TBS),通过可学习的激活二值化器和跨时间步的特征连接来处理DMs中高度时间步相关的激活特征;从空间维度上提出了空间补丁蒸馏(Space Patched Distillation, SPD),通过关注图像生成任务和噪声估计网络的空间局部性来解决二值特征匹配的难题。这些创新使得BiDM在LSUN-Bedrooms 256×256数据集上实现了显著的FID(22.74),远超当前最先进的通用二值化方法,并实现了高达28.0倍的存储节省和52.7倍的计算操作(OPs)节省。
链接: https://arxiv.org/abs/2412.05926
作者: Xingyu Zheng,Xianglong Liu,Yichen Bian,Xudong Ma,Yulun Zhang,Jiakai Wang,Jinyang Guo,Haotong Qin
关键词-EN: Diffusion models, developed and widely, applications due, DMs, excellent generative qualities
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2024
点击查看摘要
Abstract:Diffusion models (DMs) have been significantly developed and widely used in various applications due to their excellent generative qualities. However, the expensive computation and massive parameters of DMs hinder their practical use in resource-constrained scenarios. As one of the effective compression approaches, quantization allows DMs to achieve storage saving and inference acceleration by reducing bit-width while maintaining generation performance. However, as the most extreme quantization form, 1-bit binarization causes the generation performance of DMs to face severe degradation or even collapse. This paper proposes a novel method, namely BiDM, for fully binarizing weights and activations of DMs, pushing quantization to the 1-bit limit. From a temporal perspective, we introduce the Timestep-friendly Binary Structure (TBS), which uses learnable activation binarizers and cross-timestep feature connections to address the highly timestep-correlated activation features of DMs. From a spatial perspective, we propose Space Patched Distillation (SPD) to address the difficulty of matching binary features during distillation, focusing on the spatial locality of image generation tasks and noise estimation networks. As the first work to fully binarize DMs, the W1A1 BiDM on the LDM-4 model for LSUN-Bedrooms 256×256 achieves a remarkable FID of 22.74, significantly outperforming the current state-of-the-art general binarization methods with an FID of 59.44 and invalid generative samples, and achieves up to 28.0× storage and 52.7× OPs savings. The code is available at this https URL.
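作为参考,1-bit 量化中常用的“可学习激活二值化 + 直通估计器(STE)”可以写成如下极简形式;这只是对 TBS 中可学习二值化器思想的示意,并非论文官方实现:

```python
import torch
import torch.nn as nn

class LearnableBinarizer(nn.Module):
    """带可学习偏移/缩放的 1-bit 激活二值化,反向传播使用直通估计器(STE)。"""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))   # 可学习缩放
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # 可学习偏移

    def forward(self, x):
        x = x - self.beta
        x_bin = x + (torch.sign(x) - x).detach()   # 前向取 sign,反向梯度恒等
        return self.alpha * x_bin

def binarize_weight(w):
    """权重 1-bit 量化:每个输出通道以 |w| 均值作缩放因子(XNOR-Net 风格),同样使用 STE。"""
    scale = w.abs().mean(dim=(1, 2, 3), keepdim=True)
    return scale * (w + (torch.sign(w) - w).detach())
```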
zh
[CV-136] GBR: Generative Bundle Refinement for High-fidelity Gaussian Splatting and Meshing
【速读】: 该论文试图解决在稀疏视角输入下,高斯溅射(Gaussian splatting)在表示和渲染3D场景时由于几何和光度信息不足而导致的深度、形状和纹理模糊问题。解决方案的关键在于提出了生成式束调整(Generative Bundle Refinement, GBR)方法,该方法通过结合神经束调整模块和生成式深度优化模块来提升几何精度和细节。具体来说,神经束调整模块利用基础网络生成初始3D点图和点匹配,并通过束调整优化提高多视角一致性和点云精度;生成式深度优化模块采用基于扩散的策略增强几何细节和保真度。此外,GBR还引入了一种多模态损失函数,结合深度和法线一致性、几何正则化以及伪视图监督,以在稀疏视角条件下提供稳健的优化指导。实验结果表明,GBR在稀疏视角输入下显著优于现有方法,并能重建和渲染具有显著细节的大规模真实场景。
链接: https://arxiv.org/abs/2412.05908
作者: Jianing Zhang,Yuchao Zheng,Ziwei Li,Qionghai Dai,Xiaoyun Yuan
关键词-EN: continuous Gaussian primitives, Gaussian splatting, generative depth refinement, bundle adjustment module, depth refinement module
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Gaussian splatting has gained attention for its efficient representation and rendering of 3D scenes using continuous Gaussian primitives. However, it struggles with sparse-view inputs due to limited geometric and photometric information, causing ambiguities in depth, shape, and texture. We propose GBR: Generative Bundle Refinement, a method for high-fidelity Gaussian splatting and meshing using only 4-6 input views. GBR integrates a neural bundle adjustment module to enhance geometry accuracy and a generative depth refinement module to improve geometry fidelity. More specifically, the neural bundle adjustment module integrates a foundation network to produce initial 3D point maps and point matches from unposed images, followed by bundle adjustment optimization to improve multiview consistency and point cloud accuracy. The generative depth refinement module employs a diffusion-based strategy to enhance geometric details and fidelity while preserving the scale. Finally, for Gaussian splatting optimization, we propose a multimodal loss function incorporating depth and normal consistency, geometric regularization, and pseudo-view supervision, providing robust guidance under sparse-view conditions. Experiments on widely used datasets show that GBR significantly outperforms existing methods under sparse-view inputs. Additionally, GBR demonstrates the ability to reconstruct and render large-scale real-world scenes, such as the Pavilion of Prince Teng and the Great Wall, with remarkable details using only 6 views.
zh
[CV-137] hermal Image-based Fault Diagnosis in Induction Machines via Self-Organized Operational Neural Networks
【速读】: 该论文试图解决感应电机的机械故障(如不对中和转子故障)的监测与诊断问题。解决方案的关键在于使用二维自组织运算神经网络(2-dimensional Self-Organized Operational Neural Networks, Self-ONNs)从热成像图像中诊断这些故障。Self-ONNs通过其非线性神经元和自组织能力,能够在较浅的架构下实现与复杂卷积神经网络(CNNs)相当的诊断性能,从而确保高效性能并适合在边缘设备上部署,适用于多设备和多功能的复杂监测系统。
链接: https://arxiv.org/abs/2412.05901
作者: Sertac Kilickaya,Cansu Celebioglu,Levent Eren,Murat Askar
关键词-EN: prevent costly interruptions, equipment failure, Condition monitoring, crucial to prevent, prevent costly
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: To be published in 2025 IEEE Symposium Series on Computational Intelligence
点击查看摘要
Abstract:Condition monitoring of induction machines is crucial to prevent costly interruptions and equipment failure. Mechanical faults such as misalignment and rotor issues are among the most common problems encountered in industrial environments. To effectively monitor and detect these faults, a variety of sensors, including accelerometers, current sensors, temperature sensors, and microphones, are employed in the field. As a non-contact alternative, thermal imaging offers a powerful monitoring solution by capturing temperature variations in machines with thermal cameras. In this study, we propose using 2-dimensional Self-Organized Operational Neural Networks (Self-ONNs) to diagnose misalignment and broken rotor faults from thermal images of squirrel-cage induction motors. We evaluate our approach by benchmarking its performance against widely used Convolutional Neural Networks (CNNs), including ResNet, EfficientNet, PP-LCNet, SEMNASNet, and MixNet, using a Workswell InfraRed Camera (WIC). Our results demonstrate that Self-ONNs, with their non-linear neurons and self-organizing capability, achieve diagnostic performance comparable to more complex CNN models while utilizing a shallower architecture with just three operational layers. Its streamlined architecture ensures high performance and is well-suited for deployment on edge devices, enabling its use also in more complex multi-function and/or multi-device monitoring systems.
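Self-ONN 的非线性神经元通常可理解为用输入的 Q 阶幂级数(Maclaurin 展开)替代线性卷积,每一阶都有独立的可学习核。下面是一种常见等价实现的极简示意(非论文官方实现,阶数 Q=3 为假设值):

```python
import torch
import torch.nn as nn

class SelfONN2d(nn.Module):
    """Self-ONN 算子层示意:对 [x, x^2, ..., x^Q] 拼接后做一次卷积,
    等价于每一阶幂各有一组可学习卷积核再求和。"""
    def __init__(self, in_ch, out_ch, kernel_size=3, q=3):
        super().__init__()
        self.q = q
        self.conv = nn.Conv2d(in_ch * q, out_ch, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        x = torch.tanh(x)                                   # 约束输入范围,保证幂级数数值稳定
        powers = torch.cat([x.pow(k) for k in range(1, self.q + 1)], dim=1)
        return self.conv(powers)

feat = SelfONN2d(1, 16)(torch.rand(4, 1, 120, 160))          # 假设的热成像输入尺寸
```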
zh
[CV-138] Accelerating Video Diffusion Models via Distribution Matching
【速读】: 该论文试图解决当前扩散模型在视频生成中计算密集、采样步骤多、实用性受限的问题。解决方案的关键在于提出了一种新颖的扩散蒸馏与分布匹配框架,通过将预训练的扩散模型蒸馏为更高效的少步生成器,显著减少了推理步骤,同时保持或提升了生成质量。具体方法包括利用视频GAN损失和一种新的2D分数分布匹配损失,结合去噪GAN判别器和预训练的图像扩散模型,以增强帧质量和提示跟随能力。实验结果表明,使用AnimateDiff作为教师模型时,该方法在仅四步采样的情况下,性能优于现有技术。
链接: https://arxiv.org/abs/2412.05899
作者: Yuanzhi Zhu,Hanshu Yan,Huan Yang,Kai Zhang,Junnan Li
关键词-EN: made significant success, Generative models, made significant, significant success, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generative models, particularly diffusion models, have made significant success in data synthesis across various modalities, including images, videos, and 3D assets. However, current diffusion models are computationally intensive, often requiring numerous sampling steps that limit their practical application, especially in video generation. This work introduces a novel framework for diffusion distillation and distribution matching that dramatically reduces the number of inference steps while maintaining, and potentially improving, generation quality. Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator, specifically targeting video generation. By leveraging a combination of video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames with substantially fewer sampling steps. To be specific, the proposed method incorporates a denoising GAN discriminator to distil from the real data and a pre-trained image diffusion model to enhance the frame quality and the prompt-following capabilities. Experimental results using AnimateDiff as the teacher model showcase the method’s effectiveness, achieving superior performance in just four sampling steps compared to existing techniques.
zh
[CV-139] Detecting Discrepancies Between AI-Generated and Natural Images Using Uncertainty
【速读】: 该论文试图解决通过利用预测不确定性来检测生成式 AI (AI-generated) 图像的问题,以减轻其误用和相关风险。解决方案的关键在于利用预测不确定性捕捉自然图像与生成式 AI 图像之间的分布差异。具体来说,随着训练数据与测试数据之间的分布差异增加,模型性能通常会下降,伴随预测不确定性的增加。因此,通过计算大规模预训练模型在图像上的预测不确定性,将高不确定性图像识别为生成式 AI 图像,从而实现简单而有效的检测方法。
链接: https://arxiv.org/abs/2412.05897
作者: Jun Nie,Yonggang Zhang,Tongliang Liu,Yiu-ming Cheung,Bo Han,Xinmei Tian
关键词-EN: detecting AI-generated images, AI-generated images, images, detecting AI-generated, AI-generated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this work, we propose a novel approach for detecting AI-generated images by leveraging predictive uncertainty to mitigate misuse and associated risks. The motivation arises from the fundamental assumption regarding the distributional discrepancy between natural and AI-generated images. The feasibility of distinguishing natural images from AI-generated ones is grounded in the distribution discrepancy between them. Predictive uncertainty offers an effective approach for capturing distribution shifts, thereby providing insights into detecting AI-generated images. Namely, as the distribution shift between training and testing data increases, model performance typically degrades, often accompanied by increased predictive uncertainty. Therefore, we propose to employ predictive uncertainty to reflect the discrepancies between AI-generated and natural images. In this context, the challenge lies in ensuring that the model has been trained over sufficient natural images to avoid the risk of determining the distribution of natural images as that of generated images. We propose to leverage large-scale pre-trained models to calculate the uncertainty as the score for detecting AI-generated images. This leads to a simple yet effective method for detecting AI-generated images using large-scale vision models: images that induce high uncertainty are identified as AI-generated. Comprehensive experiments across multiple benchmarks demonstrate the effectiveness of our method.
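按论文思路,“高不确定性 → 判为 AI 生成”这一打分过程可以用大规模预训练模型的预测熵来近似。下面给出一个示意(用 torchvision 的 ResNet-50 与随机增广近似不确定性;模型选择、增广方式与阈值 2.0 均为假设,并非论文的具体配置):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T

@torch.no_grad()
def uncertainty_score(model, image, n_aug=8):
    """用预测熵近似不确定性:对若干随机增广取平均 softmax,再计算熵。"""
    aug = T.Compose([T.RandomResizedCrop(224, scale=(0.8, 1.0)),
                     T.RandomHorizontalFlip(),
                     T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
    probs = torch.stack([F.softmax(model(aug(image).unsqueeze(0)), dim=-1)
                         for _ in range(n_aug)]).mean(0)
    return -(probs * probs.clamp_min(1e-12).log()).sum().item()

model = models.resnet50(weights="IMAGENET1K_V2").eval()
img = torch.rand(3, 256, 256)                 # 实际中替换为待检测图像张量([0,1] 范围)
score = uncertainty_score(model, img)
is_ai_generated = score > 2.0                 # 阈值为假设值,需在自然图像上校准
```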
zh
[CV-140] doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation
【速读】: 该论文试图解决人机交互系统,特别是自动驾驶车辆(AVs)中如何有效整合人类指令以进行运动规划的问题。解决方案的关键在于引入了一个名为doScenes的新型数据集,该数据集通过注释多模态传感器数据与自然语言指令及指代标签,实现了指令与驾驶响应之间的桥梁作用。doScenes强调与静态和动态场景对象相关的可操作指令,克服了现有数据集在依赖模拟数据或预定义动作集方面的局限性,支持在真实世界场景中进行细致且灵活的响应。这一框架为开发能够无缝整合人类指令到自主系统中的学习策略奠定了基础,从而推动了视觉语言导航中安全有效的人车协作。
链接: https://arxiv.org/abs/2412.05893
作者: Parthib Roy,Srinivasa Perisetla,Shashank Shriram,Harsha Krishnaswamy,Aryan Keskar,Ross Greer
关键词-EN: Human-interactive robotic systems, Human-interactive robotic, effectively integrate human, integrate human instructions, influence vehicle motion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Human-interactive robotic systems, particularly autonomous vehicles (AVs), must effectively integrate human instructions into their motion planning. This paper introduces doScenes, a novel dataset designed to facilitate research on human-vehicle instruction interactions, focusing on short-term directives that directly influence vehicle motion. By annotating multimodal sensor data with natural language instructions and referentiality tags, doScenes bridges the gap between instruction and driving response, enabling context-aware and adaptive planning. Unlike existing datasets that focus on ranking or scene-level reasoning, doScenes emphasizes actionable directives tied to static and dynamic scene objects. This framework addresses limitations in prior research, such as reliance on simulated data or predefined action sets, by supporting nuanced and flexible responses in real-world scenarios. This work lays the foundation for developing learning strategies that seamlessly integrate human instructions into autonomous systems, advancing safe and effective human-vehicle collaboration for vision-language navigation. We make our data publicly available at this https URL
zh
[CV-141] MCP-MedSAM: A Powerful Lightweight Medical Segment Anything Model Trained with a Single GPU in Just One Day
【速读】: 该论文试图解决医学图像分割领域中,Segmentation Anything Model (SAM) 模型由于其大规模参数和高GPU需求导致的可扩展性和开发难度问题。解决方案的关键在于提出了MCP-MedSAM,一个轻量化的医学SAM模型,能够在单个GPU上一天内完成训练,同时保持竞争力的分割性能。通过减少参数数量和优化训练过程,MCP-MedSAM在大型挑战数据集上与顶级方法相比,实现了更优的性能,且显著降低了训练资源需求。
链接: https://arxiv.org/abs/2412.05888
作者: Donghang Lyu,Ruochen Gao,Marius Staring
关键词-EN: identifying anatomical structures, partitioning medical images, image segmentation involves, involves partitioning medical, segmentation involves partitioning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical image segmentation involves partitioning medical images into meaningful regions, with a focus on identifying anatomical structures or abnormalities. It has broad applications in healthcare, and deep learning methods have enabled significant advancements in automating this process. Recently, the introduction of the Segmentation Anything Model (SAM), the first foundation model for segmentation task, has prompted researchers to adapt it for the medical domain to improve performance across various tasks. However, SAM’s large model size and high GPU requirements hinder its scalability and development in the medical domain. To address these challenges, research has increasingly focused on lightweight adaptations of SAM to reduce its parameter count, enabling training with limited GPU resources while maintaining competitive segmentation performance. In this work, we propose MCP-MedSAM, a powerful and lightweight medical SAM model designed to be trainable on a single GPU within one day while delivering superior segmentation performance. Our method was trained and evaluated using a large-scale challenge dataset. Compared to top-ranking methods on the challenge leaderboard, MCP-MedSAM achieved superior performance while requiring only one day of training on a single GPU. The code is publicly available at this https URL.
zh
[CV-142] 3D-Consistent Image Inpainting with Diffusion Models
【速读】: 该论文试图解决基于扩散模型的图像修复中存在的3D不一致性问题。解决方案的关键在于通过使用同一场景的图像对,在去噪过程中引入场景的替代视角,从而在无需显式3D监督的情况下,通过训练生成式扩散模型来恢复3D先验信息。这种方法通过在上下文指导中加入额外的图像,实现了掩码区域与非掩码区域之间的协调,确保了修复结果的语义一致性和3D一致性,并在多个数据集上展示了其优于现有最先进方法的效果。
链接: https://arxiv.org/abs/2412.05881
作者: Leonid Antsfeld,Boris Chidlovskii
关键词-EN: image inpainting based, address the problem, generative diffusion model, diffusion models, inpainting based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 9 figures, 4 tables
点击查看摘要
Abstract:We address the problem of 3D inconsistency of image inpainting based on diffusion models. We propose a generative model using image pairs that belong to the same scene. To achieve the 3D-consistent and semantically coherent inpainting, we modify the generative diffusion model by incorporating an alternative point of view of the scene into the denoising process. This creates an inductive bias that allows the model to recover 3D priors while training to denoise in 2D, without explicit 3D supervision. Training unconditional diffusion models with additional images as in-context guidance allows the model to harmonize the masked and non-masked regions while repainting and ensures the 3D consistency. We evaluate our method on one synthetic and three real-world datasets and show that it generates semantically coherent and 3D-consistent inpaintings and outperforms the state-of-the-art methods.
zh
[CV-143] MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training
【速读】: 该论文试图解决3D医学图像分析中标注数据稀缺和模型泛化能力有限的问题。解决方案的关键在于提出了一种多任务视觉-语言预训练方法(MG-3D),通过以下两个方面来解决这些问题:1)通过跨模态全局对齐和互补模态引导的局部重建,建立患者体素语义与多粒度医学知识之间的对应关系,确保不同模态的特征在患者内部一致地表示相同的语义内容;2)基于患者间细粒度报告相关性,通过对比学习来关联患者间的视觉语义,同时保持对全局个体差异的敏感性,从而增强特征的判别性表示。该方法在大规模数据(47.1K)上进行预训练,并通过全面的临床任务评估展示了其优越的迁移性、可扩展性和泛化能力。
链接: https://arxiv.org/abs/2412.05876
作者: Xuefeng Ni,Linshan Wu,Jiaxin Zhuang,Qiong Wang,Mingxiang Wu,Varut Vardhanabhuti,Lihai Zhang,Hanyu Gao,Hao Chen
关键词-EN: numerous clinical applications, medical image analysis, pivotal in numerous, image analysis, medical image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 Pages
点击查看摘要
Abstract:3D medical image analysis is pivotal in numerous clinical applications. However, the scarcity of labeled data and limited generalization capabilities hinder the advancement of AI-empowered models. Radiology reports are easily accessible and can serve as weakly-supervised signals. However, large-scale vision-language pre-training (VLP) remains underexplored in 3D medical image analysis. Specifically, the insufficient investigation into multi-grained radiology semantics and their correlations across patients leads to underutilization of large-scale volume-report data. Considering intra-patient cross-modal semantic consistency and inter-patient semantic correlations, we propose a multi-task VLP method, MG-3D, pre-trained on large-scale data (47.1K), addressing the challenges by the following two aspects: 1) Establishing the correspondence between volume semantics and multi-grained medical knowledge of each patient with cross-modal global alignment and complementary modality-guided local reconstruction, ensuring intra-patient features of different modalities cohesively represent the same semantic content; 2) Correlating inter-patient visual semantics based on fine-grained report correlations across patients, and keeping sensitivity to global individual differences via contrastive learning, enhancing the discriminative feature representation. Furthermore, we delve into the scaling law to explore potential performance improvements. Comprehensive evaluations across nine uni- and cross-modal clinical tasks are carried out to assess model efficacy. Extensive experiments on both internal and external datasets demonstrate the superior transferability, scalability, and generalization of MG-3D, showcasing its potential in advancing feature representation for 3D medical image analysis. Code will be available: this https URL.
zh
[CV-144] MID: A Comprehensive Shore-Based Dataset for Multi-Scale Dense Ship Occlusion and Interaction Scenarios
【速读】: 该论文试图解决复杂海上环境中船舶检测的挑战,特别是在处理多样的海上场景(如不同天气条件、靠泊操作、小目标聚集和部分遮挡)时。解决方案的关键在于引入了一个名为Maritime Ship Navigation Behavior Dataset (MID) 的数据集,该数据集包含5,673张图像和135,884个精细标注的目标实例,支持监督和半监督学习。MID通过使用Oriented Bounding Boxes (OBB) 来提高检测精度,并填补了现有数据集(如HRSID、SSDD和NWPU-10)的空白。数据集的多样性和高质量标注使其能够更好地应对实际海上交通监控和自主导航系统的需求,推动智能海上交通监控和自主导航系统的创新。
链接: https://arxiv.org/abs/2412.05871
作者: Yugang Chang,Hongyu Chen,Fei Wang,Chengcheng Chen,Weiming Zeng
关键词-EN: Oriented Bounding Boxes, Bounding Boxes, Oriented Bounding, Ship Navigation Behavior, Navigation Behavior Dataset
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper introduces the Maritime Ship Navigation Behavior Dataset (MID), designed to address challenges in ship detection within complex maritime environments using Oriented Bounding Boxes (OBB). MID contains 5,673 images with 135,884 finely annotated target instances, supporting both supervised and semi-supervised learning. It features diverse maritime scenarios such as ship encounters under varying weather, docking maneuvers, small target clustering, and partial occlusions, filling critical gaps in datasets like HRSID, SSDD, and NWPU-10. MID’s images are sourced from high-definition video clips of real-world navigation across 43 water areas, with varied weather and lighting conditions (e.g., rain, fog). Manually curated annotations enhance the dataset’s variety, ensuring its applicability to real-world demands in busy ports and dense maritime regions. This diversity equips models trained on MID to better handle complex, dynamic environments, supporting advancements in maritime situational awareness. To validate MID’s utility, we evaluated 10 detection algorithms, providing an in-depth analysis of the dataset, detection results from various models, and a comparative study of baseline algorithms, with a focus on handling occlusions and dense target clusters. The results highlight MID’s potential to drive innovation in intelligent maritime traffic monitoring and autonomous navigation systems. The dataset will be made publicly available at this https URL_DataSet.
zh
[CV-145] MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation
【速读】: 该论文试图解决在图像到视频生成(I2V)任务中,缺乏可靠的运动强度估计器的问题。传统方法如结构相似性(SSIM)或光流(optical flow)难以泛化到任意视频,而人工标注运动强度也极为困难。论文提出了一种新的运动估计器,能够分别测量视频中物体和相机的运动强度(decoupled motion intensities),并通过对比学习(contrastive learning)在随机配对的视频中区分运动强度较大的视频。这一解决方案的关键在于其易于标注和扩展,能够在大规模视频数据集上实现稳定的运动估计。基于此,论文还提出了一种新的I2V模型——MotionStone,实验结果表明该模型在I2V生成任务中达到了最先进的性能。
链接: https://arxiv.org/abs/2412.05848
作者: Shuwei Shi,Biao Gong,Xi Chen,Dandan Zheng,Shuai Tan,Zizheng Yang,Yuyuan Li,Jingwen He,Kecheng Zheng,Jingdong Chen,Ming Yang,Yinqiang Zheng
关键词-EN: additional control signal, motion, motion intensity, motion estimator, static image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The image-to-video (I2V) generation is conditioned on the static image, which has been enhanced recently by the motion intensity as an additional control signal. These motion-aware models are appealing to generate diverse motion patterns, yet there lacks a reliable motion estimator for training such models on large-scale video set in the wild. Traditional metrics, e.g., SSIM or optical flow, are hard to generalize to arbitrary videos, while, it is very tough for human annotators to label the abstract motion intensity neither. Furthermore, the motion intensity shall reveal both local object motion and global camera movement, which has not been studied before. This paper addresses the challenge with a new motion estimator, capable of measuring the decoupled motion intensities of objects and cameras in video. We leverage the contrastive learning on randomly paired videos and distinguish the video with greater motion intensity. Such a paradigm is friendly for annotation and easy to scale up to achieve stable performance on motion estimation. We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator. Experimental results demonstrate the stability of the proposed motion estimator and the state-of-the-art performance of MotionStone on I2V generation. These advantages warrant the decoupled motion estimator to serve as a general plug-in enhancer for both data processing and video generation training.
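其中“对随机配对的视频判别谁的运动强度更大”这一训练信号,可以用一个成对排序损失来示意(非论文官方实现,margin 等超参为假设):

```python
import torch
import torch.nn.functional as F

def motion_ranking_loss(score_a, score_b, target, margin=0.2):
    """成对排序损失:target=+1 表示视频 a 的运动强度应大于 b,反之为 -1。"""
    return F.margin_ranking_loss(score_a, score_b, target, margin=margin)

# 用法示意:score_* 来自运动强度估计器(物体运动与相机运动可分别打分、各自计算该损失)
score_a = torch.randn(8, requires_grad=True)
score_b = torch.randn(8, requires_grad=True)
target = torch.ones(8)            # 假设每对样本中 a 侧运动更强
motion_ranking_loss(score_a, score_b, target).backward()
```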
zh
[CV-146] LVP-CLIP:Revisiting CLIP for Continual Learning with Label Vector Pool CVPR2025
【速读】: 该论文试图解决传统基于CLIP的持续学习方法在处理缺乏有意义文本标签的类别时效果不佳的问题。解决方案的关键在于引入标签向量池(Label Vector Pool, LVP),通过使用训练图像作为相似性参考,替代文本标签,从而消除了对理想文本描述的依赖。LVP利用CLIP的高维特征空间,实现了任务顺序不变性,确保新知识的学习不会修改旧知识,从而最小化遗忘。实验结果表明,LVP方法在类和域增量学习任务中显著优于现有最先进基线,提升了40.7%。
链接: https://arxiv.org/abs/2412.05840
作者: Yue Ma,Huantao Ren,Boyu Wang,Jingang Jin,Senem Velipasalar,Qinru Qiu
关键词-EN: Continual learning aims, previously acquired knowledge, forgetting previously acquired, Continual learning, aims to update
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to CVPR2025
点击查看摘要
Abstract:Continual learning aims to update a model so that it can sequentially learn new tasks without forgetting previously acquired knowledge. Recent continual learning approaches often leverage the vision-language model CLIP for its high-dimensional feature space and cross-modality feature matching. Traditional CLIP-based classification methods identify the most similar text label for a test image by comparing their embeddings. However, these methods are sensitive to the quality of text phrases and less effective for classes lacking meaningful text labels. In this work, we rethink CLIP-based continual learning and introduce the concept of Label Vector Pool (LVP). LVP replaces text labels with training images as similarity references, eliminating the need for ideal text descriptions. We present three variations of LVP and evaluate their performance on class and domain incremental learning tasks. Leveraging CLIP’s high dimensional feature space, LVP learning algorithms are task-order invariant. The new knowledge does not modify the old knowledge, hence, there is minimum forgetting. Different tasks can be learned independently and in parallel with low computational and memory demands. Experimental results show that proposed LVP-based methods outperform the current state-of-the-art baseline by a significant margin of 40.7%.
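标签向量池 (LVP) 的核心做法——用训练图像的 CLIP 图像嵌入代替文本标签、按相似度做最近邻分类——可以示意如下(假设使用 OpenAI 的 clip 包与 ViT-B/32 权重,池化方式取归一化后求均值,均为示意性选择):

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def build_label_vector_pool(images_by_class):
    """images_by_class: {类别名: [PIL 图像, ...]};每类用若干训练图像的嵌入均值作为标签向量。"""
    pool = {}
    for cls, imgs in images_by_class.items():
        feats = model.encode_image(torch.stack([preprocess(im) for im in imgs]).to(device))
        pool[cls] = F.normalize(feats.float(), dim=-1).mean(0)
    return pool

@torch.no_grad()
def classify(image, pool):
    """测试图像取 CLIP 嵌入,与池中各类标签向量比较余弦相似度。"""
    q = F.normalize(model.encode_image(preprocess(image).unsqueeze(0).to(device)).float(), dim=-1)[0]
    return max(pool, key=lambda c: (q @ pool[c]).item())
```

新任务只需向池中追加新的标签向量,不会改写旧任务的向量,这也是其“最小遗忘”的来源。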
zh
[CV-147] ny Object Detection with Single Point Supervision
【速读】: 该论文试图解决小目标检测中的点监督问题,特别是在航空图像中,由于小目标的空间分辨率有限且特征不明显,点标注容易受到噪声影响,导致模型鲁棒性不足。解决方案的关键在于提出了Point Teacher,这是一种端到端的点监督方法,通过教师-学生架构将学习过程分解为两阶段的去噪过程。首先,教师网络通过随机掩码策略将噪声点标注转换为粗略的伪框;其次,利用动态多实例学习对这些粗略伪框进行细化,从而逐步提高伪框的可靠性,最终指导学生网络的学习。该方法在多个小目标数据集上验证了其有效性和对点标注位置偏移的鲁棒性。
链接: https://arxiv.org/abs/2412.05837
作者: Haoran Zhu,Chang Xu,Ruixiang Zhang,Fang Xu,Wen Yang,Haijian Zhang,Gui-Song Xia
关键词-EN: limited spatial resolution, resemble point-like distributions, Point Teacher, point, point annotations
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Tiny objects, with their limited spatial resolution, often resemble point-like distributions. As a result, bounding box prediction using point-level supervision emerges as a natural and cost-effective alternative to traditional box-level supervision. However, the small scale and lack of distinctive features of tiny objects make point annotations prone to noise, posing significant hurdles for model robustness. To tackle these challenges, we propose Point Teacher–the first end-to-end point-supervised method for robust tiny object detection in aerial images. To handle label noise from scale ambiguity and location shifts in point annotations, Point Teacher employs the teacher-student architecture and decouples the learning into a two-phase denoising process. In this framework, the teacher network progressively denoises the pseudo boxes derived from noisy point annotations, guiding the student network’s learning. Specifically, in the first phase, random masking of image regions facilitates regression learning, enabling the teacher to transform noisy point annotations into coarse pseudo boxes. In the second phase, these coarse pseudo boxes are refined using dynamic multiple instance learning, which adaptively selects the most reliable instance from dynamically constructed proposal bags around the coarse pseudo boxes. Extensive experiments on three tiny object datasets (i.e., AI-TOD-v2, SODA-A, and TinyPerson) validate the proposed method’s effectiveness and robustness against point location shifts. Notably, relying solely on point supervision, our Point Teacher already shows comparable performance with box-supervised learning methods. Codes and models will be made publicly available.
zh
[CV-148] CSG: A Context-Semantic Guided Diffusion Approach in De Novo Musculoskeletal Ultrasound Image Generation
【速读】: 该论文试图解决医学影像AI解决方案中,由于数据多样性、代表性和无偏性不足而导致的合成图像生成效果受限的问题。解决方案的关键在于引入了一种可扩展的语义和上下文条件生成模型,称为CSG (Context-Semantic Guidance)。该模型通过双条件控制结构和外观,实现了对超声图像的语义变异性和上下文细节的全面控制,从而生成更加真实和多样化的医学图像。此外,CSG还能够生成肌肉骨骼超声图像中的病理异常,并通过三重验证协议证明了其生成的合成图像在语义分割模型性能、与真实图像的相似度以及通过图灵测试等方面均优于基线方法。
链接: https://arxiv.org/abs/2412.05833
作者: Elay Dahan,Hedda Cohen Indelman,Angeles M. Perez-Agosto,Carmit Shiran,Gopal Avinash,Doron Shaked,Nati Daniel
关键词-EN: imaging Artificial Intelligence, medical imaging Artificial, Artificial Intelligence, imaging Artificial, representative medical image
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:The use of synthetic images in medical imaging Artificial Intelligence (AI) solutions has been shown to be beneficial in addressing the limited availability of diverse, unbiased, and representative data. Despite the extensive use of synthetic image generation methods, controlling the semantics variability and context details remains challenging, limiting their effectiveness in producing diverse and representative medical image datasets. In this work, we introduce a scalable semantic and context-conditioned generative model, coined CSG (Context-Semantic Guidance). This dual conditioning approach allows for comprehensive control over both structure and appearance, advancing the synthesis of realistic and diverse ultrasound images. We demonstrate the ability of CSG to generate findings (pathological anomalies) in musculoskeletal (MSK) ultrasound images. Moreover, we test the quality of the synthetic images using a three-fold validation protocol. The results show that the synthetic images generated by CSG improve the performance of semantic segmentation models, exhibit enhanced similarity to real images compared to the baseline methods, and are undistinguishable from real images according to a Turing test. Furthermore, we demonstrate an extension of the CSG that allows enhancing the variability space of images by synthetically generating augmentations of anatomical geometries and textures.
zh
[CV-149] Self-Guidance: Boosting Flow and Diffusion Generation on Their Own
【速读】: 该论文试图解决现有扩散和流式文本到图像模型在生成过程中需要特定训练或强归纳偏置(inductive biases)的问题。解决方案的关键是提出了自引导(Self-Guidance, SG)方法,该方法通过测量两个连续扩散时间步的速度差异来计算引导向量,无需特定训练或特定神经网络架构,能够灵活应用于条件和非条件模型,并显著提升生成性能,尤其是在高质量人体部位(如手、脸、手臂)的生成上表现出显著优势。
链接: https://arxiv.org/abs/2412.05827
作者: Tiancheng Li,Weijian Luo,Zhiyang Chen,Liyuan Ma,Guo-Jun Qi
关键词-EN: Proper guidance strategies, Proper guidance, optimal generation results, neural network architectures, strategies are essential
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures
点击查看摘要
Abstract:Proper guidance strategies are essential to get optimal generation results without re-training diffusion and flow-based text-to-image models. However, existing guidances either require specific training or strong inductive biases of neural network architectures, potentially limiting their applications. To address these issues, in this paper, we introduce Self-Guidance (SG), a strong diffusion guidance that neither needs specific training nor requires certain forms of neural network architectures. Different from previous approaches, the Self-Guidance calculates the guidance vectors by measuring the difference between the velocities of two successive diffusion timesteps. Therefore, SG can be readily applied for both conditional and unconditional models with flexible network architectures. We conduct intensive experiments on both text-to-image generation and text-to-video generations across flexible architectures including UNet-based models and diffusion transformer-based models. On current state-of-the-art diffusion models such as Stable Diffusion 3.5 and FLUX, SG significantly boosts the image generation performance in terms of FID, and Human Preference Scores. Moreover, we find that SG has a surprisingly positive effect on the generation of high-quality human bodies such as hands, faces, and arms, showing strong potential to overcome traditional challenges on human body generations with minimal effort. We will release our implementation of SG on SD 3.5 and FLUX models along with this paper.
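按摘要描述,SG 的引导向量来自相邻两个扩散时间步速度场之差。下面是一个示意性的接口草图(model、cond 的具体形式以及引导权重 w 均为假设,实际实现以论文发布的代码为准):

```python
import torch

def self_guided_velocity(model, x_t, t, t_prev, cond, w=1.0):
    """Self-Guidance 示意:v_guided = v_t + w * (v_t - v_prev),
    其中 v_t、v_prev 为同一潜变量在相邻时间步上的速度预测(接口为假设)。"""
    v_t = model(x_t, t, cond)
    v_prev = model(x_t, t_prev, cond)
    return v_t + w * (v_t - v_prev)

# 采样循环中用 v_guided 替换原始速度即可,例如 x_next = x_t + dt * v_guided
```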
zh
[CV-150] Doppelgangers: Improved Visual Disambiguation with Geometric 3D Features
【速读】: 该论文试图解决三维重建中由于视觉混淆(visual aliasing)导致的错误匹配问题,即视觉上相似但实际不同的表面(doppelgangers)被错误匹配,从而影响结构光运动(SfM)过程的准确性。解决方案的关键在于提出了一种名为Doppelgangers++的方法,该方法通过以下创新提升了doppelganger检测和三维重建的精度:1)引入了一个多样化的训练数据集,包含日常场景的地理标记图像,增强了模型在不同场景中的泛化能力;2)采用基于Transformer的分类器,利用MASt3R模型的三维感知特征,提高了分类的精确度和召回率;3)提供了一种自动化、基于地理标记的验证方法,用于评估重建模型的准确性,减少了人工检查的需求。这些改进使得Doppelgangers++能够无缝集成到标准的SfM和MASt3R-SfM流程中,显著提升了复杂和多样化场景下的三维重建质量。
链接: https://arxiv.org/abs/2412.05826
作者: Yuanbo Xiangli,Ruojin Cai,Hanyu Chen,Jeffrey Byrne,Noah Snavely
关键词-EN: distinct surfaces, incorrectly matched, frequently hindered, visually similar, similar but distinct
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page can be found in this https URL
点击查看摘要
Abstract:Accurate 3D reconstruction is frequently hindered by visual aliasing, where visually similar but distinct surfaces (aka, doppelgangers), are incorrectly matched. These spurious matches distort the structure-from-motion (SfM) process, leading to misplaced model elements and reduced accuracy. Prior efforts addressed this with CNN classifiers trained on curated datasets, but these approaches struggle to generalize across diverse real-world scenes and can require extensive parameter tuning. In this work, we present Doppelgangers++, a method to enhance doppelganger detection and improve 3D reconstruction accuracy. Our contributions include a diversified training dataset that incorporates geo-tagged images from everyday scenes to expand robustness beyond landmark-based datasets. We further propose a Transformer-based classifier that leverages 3D-aware features from the MASt3R model, achieving superior precision and recall across both in-domain and out-of-domain tests. Doppelgangers++ integrates seamlessly into standard SfM and MASt3R-SfM pipelines, offering efficiency and adaptability across varied scenes. To evaluate SfM accuracy, we introduce an automated, geotag-based method for validating reconstructed models, eliminating the need for manual inspection. Through extensive experiments, we demonstrate that Doppelgangers++ significantly enhances pairwise visual disambiguation and improves 3D reconstruction quality in complex and diverse scenarios.
zh
[CV-151] Self-Supervised Learning with Probabilistic Density Labeling for Rainfall Probability Estimation WACV2025
【速读】: 该论文试图解决数值天气预报 (Numerical Weather Prediction, NWP) 模型在极端天气现象(如强降雨)预测中的非线性和不可预测性问题,尤其是提高降水预报的准确性和延长预报提前时间。解决方案的关键在于提出了一种自监督学习与概率密度标注 (Self-Supervised Learning with Probabilistic Density Labeling, SSLPDL) 的后处理方法。该方法通过自监督学习 (Self-Supervised Learning, SSL) 和掩码建模来重构大气物理变量,从而捕捉变量间的依赖关系,并利用预训练的编码器进行迁移学习以处理降水分割任务。此外,论文引入了一种基于概率密度的标注方法,以解决极端天气事件中的类别不平衡问题。实验结果表明,SSLPDL 在区域降水后处理和延长预报提前时间方面表现优异。
链接: https://arxiv.org/abs/2412.05825
作者: Junha Lee,Sojung An,Sujeong You,Namik Cho
关键词-EN: Numerical weather prediction, Numerical weather, textbf, fundamental in meteorology, meteorology for simulating
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WACV 2025
点击查看摘要
Abstract:Numerical weather prediction (NWP) models are fundamental in meteorology for simulating and forecasting the behavior of various atmospheric variables. The accuracy of precipitation forecasts and the acquisition of sufficient lead time are crucial for preventing hazardous weather events. However, the performance of NWP models is limited by the nonlinear and unpredictable patterns of extreme weather phenomena driven by temporal dynamics. In this regard, we propose Self-Supervised Learning with Probabilistic Density Labeling (SSLPDL) for estimating rainfall probability by post-processing NWP forecasts. Our post-processing method uses self-supervised learning (SSL) with masked modeling for reconstructing atmospheric physics variables, enabling the model to learn the dependency between variables. The pre-trained encoder is then utilized in transfer learning to a precipitation segmentation task. Furthermore, we introduce a straightforward labeling approach based on probability density to address the class imbalance in extreme weather phenomena like heavy rain events. Experimental results show that SSLPDL surpasses other precipitation forecasting models in regional precipitation post-processing and demonstrates competitive performance in extending forecast lead times. Our code is available at this https URL
zh
[CV-152] [CLS] Token Tells Everything Needed for Training-free Efficient MLLM s
【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在视觉-语言任务中高效部署的问题,主要挑战在于高计算成本和内存需求。解决方案的关键在于提出了一种无需训练的视觉标记压缩方法,称为 VTC-CLS。该方法利用视觉编码器中的 [CLS] 标记对视觉标记的注意力分数作为重要性指标,进行视觉标记的剪枝,并通过集成不同层级的 [CLS] 标记的重要性分数,更全面地捕捉关键视觉信息。实验结果表明,VTC-CLS 在各种任务中达到了最先进的性能,同时显著降低了计算成本。
链接: https://arxiv.org/abs/2412.05819
作者: Ao Wang,Fengyuan Sun,Hui Chen,Zijia Lin,Jungong Han,Guiguang Ding
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, recently demonstrated strong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages,4 figures
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision-language tasks, garnering significant attention in the computer vision. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. Recognizing the redundancy of information within the vision modality, recent studies have explored methods for compressing visual tokens in MLLMs to enhance efficiency in a training-free manner. Despite their effectiveness, existing methods like Fast rely on the attention between visual tokens and prompt text tokens as the importance indicator, overlooking the relevance to response text and thus introducing perception bias. In this paper, we demonstrate that in MLLMs, the [CLS] token in the visual encoder inherently knows which visual tokens are important for MLLMs. Building on this prior, we introduce a simple yet effective method for train-free visual token compression, called VTC-CLS. Firstly, it leverages the attention score of the [CLS] token on visual tokens as an importance indicator for pruning visual tokens. Besides, we also explore ensembling the importance scores derived by the [CLS] token from different layers to capture the key visual information more comprehensively. Extensive experiments demonstrate that our VTC-CLS achieves the state-of-the-art performance across various tasks compared with baseline methods. It also brings notably less computational costs in a training-free manner, highlighting its effectiveness and superiority. Code and models are available at \urlthis https URL.
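VTC-CLS 的核心操作——以 [CLS] token 对各视觉 token 的注意力分数作为重要性、跨层集成后做免训练裁剪——可以示意如下(token 数、层数等均为假设值,非论文官方实现):

```python
import torch

def prune_visual_tokens(visual_tokens, cls_attn_layers, keep_ratio=0.5):
    """依据 [CLS] 对视觉 token 的注意力分数裁剪视觉 token。
    visual_tokens:   (B, N, D) 视觉编码器输出(不含 [CLS])
    cls_attn_layers: list of (B, N),各层 [CLS] 对 N 个 patch token 的注意力"""
    importance = torch.stack(cls_attn_layers, dim=0).mean(0)      # 跨层集成 (B, N)
    k = max(1, int(visual_tokens.shape[1] * keep_ratio))
    idx = importance.topk(k, dim=1).indices.sort(dim=1).values    # 取 top-k 并保持原顺序
    batch = torch.arange(visual_tokens.shape[0]).unsqueeze(1)
    return visual_tokens[batch, idx]                              # (B, k, D)

# 用法示意:假设 24 层 ViT、576 个 patch token
tokens = torch.randn(2, 576, 1024)
attn = [torch.rand(2, 576) for _ in range(24)]
pruned = prune_visual_tokens(tokens, attn, keep_ratio=0.25)       # (2, 144, 1024)
```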
zh
[CV-153] SizeGS: Size-aware Compression of 3D Gaussians with Hierarchical Mixed Precision Quantization
【速读】: 该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)压缩中,在满足特定文件大小限制的同时保持最佳视觉质量的问题。解决方案的关键在于引入SizeGS框架,通过结合大小估算器和混合精度量化(MPQ)技术来优化压缩过程。具体来说,大小估算器建立了文件大小与超参数之间的关系,而MPQ则在属性间和属性内两个层次上进行量化,以在满足大小约束的前提下最大化视觉质量。属性间层次通过0-1整数线性规划分配比特宽度,属性内层次则通过动态规划确定块长度并进行量化。最终,该方法在10分钟内即可确定最优超参数,实现了1.69倍的效率提升,且视觉质量与现有最先进方法相当。
链接: https://arxiv.org/abs/2412.05808
作者: Shuzhao Xie,Jiahang Liu,Weixiang Zhang,Shijia Ge,Sicheng Pan,Chen Tang,Yunpeng Bai,Zhi Wang
关键词-EN: Effective compression technology, Effective compression, transmission conditions, compression technology, technology is crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Automatically compressing 3DGS into the desired file size while maximizing the visual quality
点击查看摘要
Abstract:Effective compression technology is crucial for 3DGS to adapt to varying storage and transmission conditions. However, existing methods fail to address size constraints while maintaining optimal quality. In this paper, we introduce SizeGS, a framework that compresses 3DGS within a specified size budget while optimizing visual quality. We start with a size estimator to establish a clear relationship between file size and hyperparameters. Leveraging this estimator, we incorporate mixed precision quantization (MPQ) into 3DGS attributes, structuring MPQ in two hierarchical levels – inter-attribute and intra-attribute – to optimize visual quality under the size constraint. At the inter-attribute level, we assign bit-widths to each attribute channel by formulating the combinatorial optimization as a 0-1 integer linear program, which can be efficiently solved. At the intra-attribute level, we divide each attribute channel into blocks of vectors, quantizing each vector based on the optimal bit-width derived at the inter-attribute level. Dynamic programming determines block lengths. Using the size estimator and MPQ, we develop a calibrated algorithm to identify optimal hyperparameters in just 10 minutes, achieving a 1.69× efficiency increase with quality comparable to state-of-the-art methods.
zh
[CV-154] Language-Guided Image Tokenization for Generation
【速读】: 该论文试图解决高分辨率图像生成中计算成本高的问题,特别是现有图像标记化方法压缩率有限的问题。解决方案的关键在于提出了基于文本的图像标记化方法,称为文本条件图像标记化 (Text-Conditioned Image Tokenization, TexTok)。TexTok 通过利用语言提供的高层次语义信息,将标记化过程与描述性文本标题相结合,从而在编码细粒度视觉细节的同时实现更高的压缩率和增强的重建质量。实验结果表明,TexTok 在 ImageNet-256 和 ImageNet-512 基准测试中分别实现了 29.2% 和 48.1% 的平均重建 FID 改进,并在生成 FID 上分别提升了 16.3% 和 34.3%。此外,TexTok 在仅使用 32 个标记的情况下,能够实现 93.5 倍的推理加速,并在 ImageNet 数据集上达到了最新的 FID 分数。
链接: https://arxiv.org/abs/2412.05796
作者: Kaiwen Zha,Lijun Yu,Alireza Fathi,David A. Ross,Cordelia Schmid,Dina Katabi,Xiuye Gu
关键词-EN: transforming raw image, raw image pixels, Image tokenization, low-dimensional latent representation, Image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
点击查看摘要
Abstract:Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide high-level semantics. By conditioning the tokenization process on descriptive text captions, TexTok allows the tokenization process to focus on encoding fine-grained visual details into latent tokens, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok’s superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization.
zh
[CV-155] Open-Source Acceleration of Stable-Diffusion.cpp
【速读】: 该论文旨在解决Stable Diffusion模型在图像生成过程中高计算延迟和内存消耗的问题。解决方案的关键在于优化Sdcpp框架中的ggml_conv_2d算子,通过引入Winograd算法来加速2D卷积操作,这是整个流程中的主要瓶颈。通过分析依赖和独立的计算图,利用设备的局部性和并行性,实现了显著的性能提升。最终,该优化框架在多个Stable Diffusion模型上实现了高达2.76倍的单层卷积加速和4.79倍的总体图像生成加速。
链接: https://arxiv.org/abs/2412.05781
作者: Jingxu Ng,Cheng Lv,Pu Zhao,Wei Niu,Juyi Lin,Yanzhi Wang
关键词-EN: generating high-quality images, Stable diffusion plays, plays a crucial, crucial role, role in generating
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Stable diffusion plays a crucial role in generating high-quality images. However, image generation is time-consuming and memory-intensive. To address this, stable-diffusion.cpp (Sdcpp) emerges as an efficient inference framework to accelerate the diffusion models. Although it is lightweight, the current implementation of ggml_conv_2d operator in Sdcpp is suboptimal, exhibiting both high inference latency and massive memory usage. To address this, in this work, we present an optimized version of Sdcpp leveraging the Winograd algorithm to accelerate 2D convolution operations, which is the primary bottleneck in the pipeline. By analyzing both dependent and independent computation graphs, we exploit the device’s locality and parallelism to achieve substantial performance improvements. Our framework delivers correct end-to-end results across various stable diffusion models, including SDv1.4, v1.5, v2.1, SDXL, and SDXL-Turbo. Our evaluation results demonstrate a speedup up to 2.76x for individual convolutional layers and an inference speedup up to 4.79x for the overall image generation process, compared with the original Sdcpp. Homepage: this https URL
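Winograd 算法通过对输入块与卷积核做线性变换,把 3x3 卷积的逐元素乘法次数从 36 降到 16(以 F(2x2, 3x3) 为例)。下面用 numpy 给出单个 4x4 输入块的示意,并与直接计算(互相关)对比验证;这只是算法本身的演示,并非 Sdcpp 中 GGML 算子的实际实现:

```python
import numpy as np

# Winograd F(2x2, 3x3) 的标准变换矩阵
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """对一个 4x4 输入块 d 和 3x3 卷积核 g 做 Winograd 卷积,返回 2x2 输出块。"""
    U = G @ g @ G.T            # 卷积核变换 (4x4)
    V = B_T @ d @ B_T.T        # 输入块变换 (4x4)
    M = U * V                  # 16 次逐元素乘(直接法需 36 次)
    return A_T @ M @ A_T.T     # 逆变换得到 2x2 输出

# 与直接互相关结果对比验证
d = np.random.randn(4, 4)
g = np.random.randn(3, 3)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)] for i in range(2)])
print(np.allclose(winograd_f2x2_3x3(d, g), direct))   # True
```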
zh
[CV-156] BudgetFusion: Perceptually-Guided Adaptive Diffusion Models
【速读】: 该论文试图解决生成式模型在文本到图像生成任务中高计算需求和能源消耗的问题。解决方案的关键在于提出了一种名为BudgetFusion的新模型,该模型通过预测多层次感知指标来确定扩散模型在生成图像前所需的最优感知效率的扩散步数。通过这种方式,BudgetFusion能够在不牺牲感知相似性的前提下,显著减少每个文本提示的生成时间,从而提高推理效率并降低能源消耗。
链接: https://arxiv.org/abs/2412.05780
作者: Qinchan (Wing) Li,Kenneth Chen,Changyue (Tina) Su,Qi Sun
关键词-EN: shown unprecedented success, shown unprecedented, unprecedented success, Diffusion, efforts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Diffusion models have shown unprecedented success in the task of text-to-image generation. While these models are capable of generating high-quality and realistic images, the complexity of sequential denoising has raised societal concerns regarding high computational demands and energy consumption. In response, various efforts have been made to improve inference efficiency. However, most of the existing efforts have taken a fixed approach with neural network simplification or text prompt optimization. Are the quality improvements from all denoising computations equally perceivable to humans? We observed that images from different text prompts may require different computational efforts given the desired content. The observation motivates us to present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. This is achieved by predicting multi-level perceptual metrics relative to diffusion steps. With the popular Stable Diffusion as an example, we conduct both numerical analyses and user studies. Our experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?
zh
[CV-157] Prism: Semi-Supervised Multi-View Stereo with Monocular Structure Priors
【速读】: 该论文试图解决无监督多视角立体视觉(MVS)在处理复杂数据(如手持智能手机拍摄的室内场景视频)时表现不佳的问题,以及在高质量合成数据集上训练的MVS网络难以泛化到真实世界数据的问题。解决方案的关键在于提出了一种半监督学习框架,称为Prism,该框架能够联合训练真实图像和渲染图像,从合成数据中捕捉结构先验信息,同时确保与真实世界数据的域一致性。核心创新在于引入了一组新的损失函数,利用在合成数据集上训练的单目相对深度估计器,将丰富的结构信息传递到无标签数据的MVS预测中,并通过深度特征损失和多尺度统计损失来比较MVS预测与单目预测,从而显著提升了无监督和合成监督MVS网络的性能。
链接: https://arxiv.org/abs/2412.05771
作者: Alex Rich,Noah Stier,Pradeep Sen,Tobias Höllerer
关键词-EN: current methods underperform, indoor scenes, MVS, methods underperform, MVS networks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures, 3 tables
点击查看摘要
Abstract:The promise of unsupervised multi-view-stereo (MVS) is to leverage large unlabeled datasets, yet current methods underperform when training on difficult data, such as handheld smartphone videos of indoor scenes. Meanwhile, high-quality synthetic datasets are available but MVS networks trained on these datasets fail to generalize to real-world examples. To bridge this gap, we propose a semi-supervised learning framework that allows us to train on real and rendered images jointly, capturing structural priors from synthetic data while ensuring parity with the real-world domain. Central to our framework is a novel set of losses that leverages powerful existing monocular relative-depth estimators trained on the synthetic dataset, transferring the rich structure of this relative depth to the MVS predictions on unlabeled data. Inspired by perceptual image metrics, we compare the MVS and monocular predictions via a deep feature loss and a multi-scale statistical loss. Our full framework, which we call Prism, achieves large quantitative and qualitative improvements over current unsupervised and synthetic-supervised MVS networks. This is a best-case-scenario result, opening the door to using both unlabeled smartphone videos and photorealistic synthetic datasets for training MVS networks.
zh
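代码示意(按论文思路自拟,非官方实现):论文用"多尺度统计损失"把单目相对深度的结构迁移到 MVS 预测上。由于相对深度与绝对深度尺度不一致,下例先做归一化、再在多个下采样尺度上比较,具体归一化方式与尺度设置均为假设。

```python
import torch
import torch.nn.functional as F

def normalize_depth(d, eps=1e-6):
    # 相对深度与 MVS 深度尺度不同,先做零均值、单位方差归一化再比较
    return (d - d.mean(dim=(-2, -1), keepdim=True)) / (d.std(dim=(-2, -1), keepdim=True) + eps)

def multiscale_statistical_loss(mvs_depth, mono_depth, scales=(1, 2, 4, 8)):
    """在多个下采样尺度上比较两张归一化深度图(此处简化为逐像素 L1)。"""
    loss = 0.0
    for s in scales:
        a = F.avg_pool2d(normalize_depth(mvs_depth), kernel_size=s)
        b = F.avg_pool2d(normalize_depth(mono_depth), kernel_size=s)
        loss = loss + (a - b).abs().mean()
    return loss / len(scales)

# 用法示意
mvs = torch.rand(2, 1, 128, 160)   # MVS 网络预测的深度
mono = torch.rand(2, 1, 128, 160)  # 单目相对深度估计器的输出
print(multiscale_statistical_loss(mvs, mono))
```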
[CV-158] Compositional Image Retrieval via Instruction-Aware Contrastive Learning
【速读】: 该论文试图解决在零样本组合图像检索 (Zero-Shot Composed Image Retrieval, ZS-CIR) 中,现有基于CLIP的模型在解释和执行修改指令方面的能力有限的问题。解决方案的关键在于提出了一种新的嵌入方法,利用经过指令调优的多模态大语言模型 (Multimodal Large Language Model, MLLM) 生成组合表示,从而显著增强模型对指令的遵循能力。然而,直接应用MLLM存在挑战,因为MLLM主要设计用于文本生成而非嵌入提取。为此,论文引入了一种两阶段训练策略,首先高效学习联合多模态嵌入空间,然后通过在类似CIR格式的三元组数据集上微调模型,进一步增强其遵循修改指令的能力。
链接: https://arxiv.org/abs/2412.05756
作者: Wenliang Zhong,Weizhi An,Feng Jiang,Hehuan Ma,Yuzhi Guo,Junzhou Huang
关键词-EN: Composed Image Retrieval, Image Retrieval, target image based, involves retrieving, visual reference
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 8 figures
点击查看摘要
Abstract:Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference. CIR is inherently an instruction-following task, as the model needs to interpret and apply modifications to the image. In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable. While existing ZS-CIR models based on CLIP have shown promising results, their capability in interpreting and following modification instructions remains limited. Some research attempts to address this by incorporating Large Language Models (LLMs). However, these approaches still face challenges in effectively integrating multimodal information and instruction understanding. To tackle the above challenges, we propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate composed representation, which significantly enhances the instruction following capability for a comprehensive integration between images and instructions. Nevertheless, directly applying MLLMs introduces a new challenge since MLLMs are primarily designed for text generation rather than embedding extraction as required in CIR. To address this, we introduce a two-stage training strategy to efficiently learn a joint multimodal embedding space and further refine the ability to follow modification instructions by tuning the model in a triplet dataset similar to the CIR format. Extensive experiments on four public datasets: FashionIQ, CIRR, GeneCIS, and CIRCO demonstrate the superior performance of our model, outperforming state-of-the-art baselines by a significant margin. Codes are available at the GitHub repository.
zh
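代码示意(非官方实现):第二阶段在 CIR 格式的三元组数据上微调,可用标准三元组间隔损失表达:组合查询嵌入为锚点、目标图像嵌入为正样本、批内其余图像为负样本;嵌入来源与间隔取值均为示例假设。

```python
import torch
import torch.nn.functional as F

def cir_triplet_loss(query_emb, target_emb, margin=0.2):
    """批内负采样的三元组损失:anchor=组合查询,positive=目标图像,negative=批内其余图像。"""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    sim = q @ t.T                       # [B, B] 余弦相似度矩阵
    pos = sim.diag().unsqueeze(1)       # 对角线为正样本相似度
    neg = sim.masked_fill(torch.eye(len(q), dtype=torch.bool), float("-inf"))
    hardest_neg = neg.max(dim=1).values.unsqueeze(1)   # 最难负样本
    return F.relu(hardest_neg - pos + margin).mean()

# 用法示意:嵌入可由指令微调后的 MLLM 产生(此处用随机张量代替)
loss = cir_triplet_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss)
```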
[CV-159] Integrating YOLO11 and Convolution Block Attention Module for Multi-Season Segmentation of Tree Trunks and Branches in Commercial Apple Orchards
【速读】: 该论文旨在解决在不同季节条件下对苹果园中树干和树枝进行精确分割的问题。解决方案的关键在于将卷积块注意力模块 (Convolutional Block Attention Module, CBAM) 集成到YOLO11架构中,形成YOLO11-CBAM模型。通过在混合的休眠期和生长期苹果园图像数据集上进行训练,该模型能够在全年不同季节(包括休眠期、开花期、疏果期和收获期)中有效检测和分割树干和树枝。实验结果表明,YOLO11-CBAM模型在树干和树枝类别的分割精度上显著优于未集成CBAM的YOLO11模型,尤其是在树干类别的精度上,YOLO11m-seg-CBAM达到了0.83,而未集成CBAM的模型仅为0.80。
链接: https://arxiv.org/abs/2412.05728
作者: Ranjan Sapkota,Manoj Karkee
关键词-EN: Block Attention Module, Convolutional Block Attention, Attention Module, Convolutional Block, Block Attention
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 Pages, YOLOv11
点击查看摘要
Abstract:In this study, we developed a customized instance segmentation model by integrating the Convolutional Block Attention Module (CBAM) with the YOLO11 architecture. This model, trained on a mixed dataset of dormant and canopy season apple orchard images, aimed to enhance the segmentation of tree trunks and branches under varying seasonal conditions throughout the year. The model was individually validated across dormant and canopy season images after training the YOLO11-CBAM on the mixed dataset collected over the two seasons. Additional testing of the model during pre-bloom, flower bloom, fruit thinning, and harvest season was performed. The highest recall and precision metrics were observed in the YOLO11x-seg-CBAM and YOLO11m-seg-CBAM respectively. Particularly, YOLO11m-seg with CBAM showed the highest precision of 0.83 as performed for the Trunk class in training, while without the CBAM, YOLO11m-seg achieved 0.80 precision score for the Trunk class. Likewise, for branch class, YOLO11m-seg with CBAM achieved the highest precision score value of 0.75 while without the CBAM, the YOLO11m-seg achieved a precision of 0.73. For dormant season validation, YOLO11x-seg exhibited the highest precision at 0.91. Canopy season validation highlighted YOLO11s-seg with superior precision across all classes, achieving 0.516 for Branch, and 0.64 for Trunk. The modeling approach, trained on two season datasets as dormant and canopy season images, demonstrated the potential of the YOLO11-CBAM integration to effectively detect and segment tree trunks and branches year-round across all seasonal variations. Keywords: YOLOv11, YOLOv11 Tree Detection, YOLOv11 Branch Detection and Segmentation, Machine Vision, Deep Learning, Machine Learning
zh
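CBAM 本身是公开的通用注意力模块,下面给出其常见的 PyTorch 实现示意(先通道注意力、后空间注意力),可插入到卷积特征之后;与论文中 YOLO11 的具体集成位置无关,仅供理解。

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))   # 全局平均池化分支
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))    # 全局最大池化分支
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """先通道注意力、后空间注意力,对输入特征逐元素加权。"""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

# 用法示意:插入到任意卷积特征之后
feat = torch.randn(1, 64, 80, 80)
print(CBAM(64)(feat).shape)
```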
[CV-160] Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
【速读】: 该论文试图解决视觉-语言模型(VLMs)在常识推理,特别是溯因推理(abductive reasoning)和可废止推理(defeasible reasoning)方面的能力评估问题。现有基准测试主要集中在典型视觉场景,难以区分模型表现是源于敏锐的感知和推理能力,还是依赖于纯粹的统计召回。论文提出通过关注视频中的非常规事件来更清晰地评估VLMs的核心能力,因为解释和理解这些异常事件需要模型超越基本的模式识别和先验知识的复述。解决方案的关键在于引入BlackSwanSuite基准,通过限制视觉信息并提出关于隐藏意外事件的问题,或提供新视觉信息以改变现有假设,来评估模型在溯因和可废止任务中的表现。该基准包含超过15,400个任务,涵盖1,655个视频,揭示了当前VLMs在处理非常规事件时的显著性能差距,强调了改进模型架构和训练策略的必要性。
链接: https://arxiv.org/abs/2412.05725
作者: Aditya Chinchure,Sahithya Ravi,Raymond Ng,Vered Shwartz,Boyang Li,Leonid Sigal
关键词-EN: remain poorly understood, commonsense reasoning capabilities, remain poorly, poorly understood, commonsense reasoning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: For data, visit this https URL
点击查看摘要
Abstract:The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs’ ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no tasks, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations in current VLMs, emphasizing the need for enhanced model architectures and training strategies.
zh
[CV-161] A Tiered GAN Approach for Monet-Style Image Generation
【速读】: 该论文试图解决生成对抗网络 (GANs) 在生成艺术图像时面临的常见问题,如训练不稳定、模式崩溃 (mode collapse) 和输出质量不足。解决方案的关键在于引入分层 GAN 模型,通过多阶段逐步优化图像质量。该模型结合了下采样和卷积技术,能够在生成高质量莫奈风格艺术作品的同时,提高计算效率。尽管实验结果显示该架构能够生成基础的艺术结构,但仍需进一步改进以提升真实感和对莫奈风格的忠实度。
链接: https://arxiv.org/abs/2412.05724
作者: FNU Neha,Deepshikha Bhati,Deepak Kumar Shukla,Md Amiruzzaman
关键词-EN: Generative Adversarial Networks, Generative Adversarial, Adversarial Networks, Claude Monet, capable of mimicking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Generative Adversarial Networks (GANs) have proven to be a powerful tool in generating artistic images, capable of mimicking the styles of renowned painters, such as Claude Monet. This paper introduces a tiered GAN model to progressively refine image quality through a multi-stage process, enhancing the generated images at each step. The model transforms random noise into detailed artistic representations, addressing common challenges such as instability in training, mode collapse, and output quality. This approach combines downsampling and convolutional techniques, enabling the generation of high-quality Monet-style artwork while optimizing computational efficiency. Experimental results demonstrate the architecture’s ability to produce foundational artistic structures, though further refinements are necessary for achieving higher levels of realism and fidelity to Monet’s style. Future work focuses on improving training methodologies and model complexity to bridge the gap between generated and true artistic images. Additionally, the limitations of traditional GANs in artistic generation are analyzed, and strategies to overcome these shortcomings are proposed.
zh
[CV-162] Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent
【速读】: 该论文试图解决文本生成图像 (Text-to-Image, T2I) 模型在评估生成图像与文本提示一致性时依赖于定性人类评估的问题,特别是缺乏可重复性和自动化的问题。解决方案的关键在于提出一种基于大型语言模型 (LLMs) 的自动评估方法,通过结合场景图提取和知识增强的问答任务,量化并识别生成图像中的“幻觉问题”(即图像与文本提示不一致的情况),并提供接近人类标准的综合评分。该方法通过生成12,000张合成图像并进行人工评分验证,展示了其与人类评分模式的高度一致性,从而为T2I模型的评估提供了更可靠的量化工具。
链接: https://arxiv.org/abs/2412.05722
作者: Ziyuan Qin,Dongjie Cheng,Haoyu Wang,Huahui Yi,Yuting Shao,Zhiyuan Fan,Kang Li,Qicheng Lao
关键词-EN: models frequently depend, qualitative human evaluations, frequently depend, depend on qualitative, assess the consistency
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Contemporary Text-to-Image (T2I) models frequently depend on qualitative human evaluations to assess the consistency between synthesized images and the text prompts. There is a demand for quantitative and automatic evaluation tools, given that human evaluation lacks reproducibility. We believe that an effective T2I evaluation metric should accomplish the following: detect instances where the generated images do not align with the textual prompts, a discrepancy we define as the 'hallucination problem' in T2I tasks; record the types and frequency of hallucination issues, aiding users in understanding the causes of errors; and provide a comprehensive and intuitive scoring that is close to the human standard. To achieve these objectives, we propose a method based on large language models (LLMs) for conducting question-answering with an extracted scene-graph and created a dataset with human-rated scores for generated images. From the methodology perspective, we combine knowledge-enhanced question-answering tasks with image evaluation tasks, making the evaluation metrics more controllable and easier to interpret. For the contribution on the dataset side, we generated 12,000 synthesized images based on 1,000 composited prompts using three advanced T2I models. Subsequently, we conduct human scoring on all synthesized images and prompt pairs to validate the accuracy and effectiveness of our method as an evaluation metric. All generated images and the human-labeled scores will be made publicly available in the future to facilitate ongoing research on this crucial issue. Extensive experiments show that our method aligns more closely with human scoring patterns than other evaluation metrics.
zh
[CV-163] Impact of Sunglasses on One-to-Many Facial Identification Accuracy
【速读】: 该论文旨在解决在非“mugshot quality”图像(如监控视频截图)中进行一对多人脸识别时,由于佩戴深色太阳镜导致的识别准确性下降问题。研究的关键在于系统性地分析太阳镜对识别精度的影响,并提出相应的解决方案。具体来说,论文展示了太阳镜对识别精度的影响与强模糊或低分辨率类似,且太阳镜与模糊或低分辨率的组合会进一步加剧精度损失。为缓解这一问题,论文提出通过合成方式将太阳镜添加到所有库图像中,无需重新训练模型即可恢复约38%的精度损失。此外,增加训练集中佩戴太阳镜图像的比例也能显著降低错误率。
链接: https://arxiv.org/abs/2412.05721
作者: Sicong Tian,Haiyu Wu,Michael C. King,Kevin W. Bowyer
关键词-EN: facial identification, achieve high accuracy, mugshot quality, achieve high, facial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:One-to-many facial identification is documented to achieve high accuracy in the case where both the probe and the gallery are 'mugshot quality' images. However, an increasing number of documented instances of wrongful arrest following one-to-many facial identification have raised questions about its accuracy. Probe images used in one-to-many facial identification are often cropped from frames of surveillance video and deviate from 'mugshot quality' in various ways. This paper systematically explores how the accuracy of one-to-many facial identification is degraded by the person in the probe image choosing to wear dark sunglasses. We show that sunglasses degrade accuracy for mugshot-quality images by an amount similar to strong blur or noticeably lower resolution. Further, we demonstrate that the combination of sunglasses with blur or lower resolution results in even more pronounced loss in accuracy. These results have important implications for developing objective criteria to qualify a probe image for the level of accuracy to be expected if it is used for one-to-many identification. To ameliorate the accuracy degradation caused by dark sunglasses, we show that it is possible to recover about 38% of the lost accuracy by synthetically adding sunglasses to all the gallery images, without model re-training. We also show that increasing the representation of wearing-sunglasses images in the training set can largely reduce the error rate. The image set assembled for this research will be made available to support replication and further research into this problem.
zh
[CV-164] Segment-Level Road Obstacle Detection Using Visual Foundation Model Priors and Likelihood Ratios
【速读】: 该论文试图解决自动驾驶车辆在复杂交通环境中检测道路障碍物时面临的两个主要问题:一是传统基于像素分类的方法容易产生碎片化的预测和大量误报;二是选择合适的阈值(threshold)进行预测存在挑战。解决方案的关键在于提出了一种新的方法,通过利用视觉基础模型(visual foundation models)的段级特征(segment-level features)和似然比(likelihood ratios)来直接预测道路障碍物,而不是依赖于单个像素的分类。这种方法通过关注段而非像素,提高了检测精度,减少了误报,并增强了场景变化的鲁棒性,且无需预定义阈值。
链接: https://arxiv.org/abs/2412.05707
作者: Youssef Shoeb,Nazir Nayal,Azarm Nowzard,Fatma Güney,Hanno Gottschalk
关键词-EN: traffic environments safely, Detecting road obstacles, complex traffic environments, Detecting road, environments safely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, and 1 table, to be published in VISAPP 2025
点击查看摘要
Abstract:Detecting road obstacles is essential for autonomous vehicles to navigate dynamic and complex traffic environments safely. Current road obstacle detection methods typically assign a score to each pixel and apply a threshold to generate final predictions. However, selecting an appropriate threshold is challenging, and the per-pixel classification approach often leads to fragmented predictions with numerous false positives. In this work, we propose a novel method that leverages segment-level features from visual foundation models and likelihood ratios to predict road obstacles directly. By focusing on segments rather than individual pixels, our approach enhances detection accuracy, reduces false positives, and offers increased robustness to scene variability. We benchmark our approach against existing methods on the RoadObstacle and LostAndFound datasets, achieving state-of-the-art performance without needing a predefined threshold.
zh
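代码示意(按论文思路自拟):假设已用视觉基础模型为每个分割段提取特征,并分别用两组高斯分布近似"道路"与"障碍物"特征的分布,则可按对数似然比为每个段打分;分布的拟合方式与特征维度均为示例假设,并非论文的具体做法。

```python
import numpy as np
from scipy.stats import multivariate_normal

def segment_obstacle_scores(segment_feats, road_mean, road_cov, obst_mean, obst_cov):
    """对每个段特征计算 log p(障碍物) - log p(道路),分数越高越可能是障碍物。"""
    p_road = multivariate_normal(road_mean, road_cov, allow_singular=True)
    p_obst = multivariate_normal(obst_mean, obst_cov, allow_singular=True)
    return p_obst.logpdf(segment_feats) - p_road.logpdf(segment_feats)

# 用法示意:8 个段、16 维段级特征,两组分布参数可从训练集特征估计(此处随机代替)
d = 16
feats = np.random.randn(8, d)
scores = segment_obstacle_scores(feats, np.zeros(d), np.eye(d), np.ones(d) * 0.5, np.eye(d))
print(scores)
```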
[CV-165] Temporally Compressed 3D Gaussian Splatting for Dynamic Scenes
【速读】: 该论文试图解决高保真动态场景重建中的内存占用和渲染效率问题,特别是在需要实时应用(如AR/VR、游戏和低功耗设备渲染)的场景中。解决方案的关键是提出了时间压缩的三维高斯光栅化 (Temporally Compressed 3D Gaussian Splatting, TC3DGS) 技术,该技术通过选择性修剪时间相关的高斯分布并采用梯度感知的混合精度量化来动态压缩高斯参数,从而实现高效的内存压缩。此外,论文还利用Ramer-Douglas-Peucker算法的变体在后期处理中插值高斯轨迹,进一步减少存储需求。实验结果表明,TC3DGS能够在保持视觉质量的前提下实现高达67倍的压缩率。
链接: https://arxiv.org/abs/2412.05700
作者: Saqib Javed,Ahmad Jarrar Khan,Corentin Dumery,Chen Zhao,Mathieu Salzmann
关键词-EN: Recent advancements, realistic scene representation, Gaussian Splatting, advancements in high-fidelity, reconstruction have leveraged
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Code will be released soon
点击查看摘要
Abstract:Recent advancements in high-fidelity dynamic scene reconstruction have leveraged dynamic 3D Gaussians and 4D Gaussian Splatting for realistic scene representation. However, to make these methods viable for real-time applications such as AR/VR, gaming, and rendering on low-power devices, substantial reductions in memory usage and improvements in rendering efficiency are required. While many state-of-the-art methods prioritize lightweight implementations, they struggle in handling scenes with complex motions or long sequences. In this work, we introduce Temporally Compressed 3D Gaussian Splatting (TC3DGS), a novel technique designed specifically to effectively compress dynamic 3D Gaussian representations. TC3DGS selectively prunes Gaussians based on their temporal relevance and employs gradient-aware mixed-precision quantization to dynamically compress Gaussian parameters. It additionally relies on a variation of the Ramer-Douglas-Peucker algorithm in a post-processing step to further reduce storage by interpolating Gaussian trajectories across frames. Our experiments across multiple datasets demonstrate that TC3DGS achieves up to 67× compression with minimal or no degradation in visual quality.
zh
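论文在后处理中用 Ramer-Douglas-Peucker(RDP)算法的变体对高斯体轨迹抽取关键帧以减少存储。下面是标准 RDP 在三维轨迹上的通用实现示意(与论文变体的细节无关),非关键帧可由插值恢复。

```python
import numpy as np

def point_line_distance(points, start, end):
    """点到线段 start-end 所在直线的距离(三维)。"""
    line = end - start
    norm = np.linalg.norm(line)
    if norm < 1e-12:
        return np.linalg.norm(points - start, axis=1)
    cross = np.cross(points - start, line)
    return np.linalg.norm(cross, axis=1) / norm

def rdp(trajectory, epsilon):
    """对形如 [T, 3] 的轨迹做 Ramer-Douglas-Peucker 简化,返回保留帧的下标。"""
    keep = {0, len(trajectory) - 1}
    stack = [(0, len(trajectory) - 1)]
    while stack:
        s, e = stack.pop()
        if e - s < 2:
            continue
        inner = trajectory[s + 1:e]
        dists = point_line_distance(inner, trajectory[s], trajectory[e])
        i = int(np.argmax(dists))
        if dists[i] > epsilon:           # 偏差超过阈值的帧作为关键帧保留
            idx = s + 1 + i
            keep.add(idx)
            stack.extend([(s, idx), (idx, e)])
    return sorted(keep)

# 用法示意:一条 60 帧的高斯中心轨迹,仅保留关键帧
traj = np.cumsum(np.random.randn(60, 3) * 0.01, axis=0)
print("保留帧下标:", rdp(traj, epsilon=0.02))
```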
[CV-166] Jointly RS Image Deblurring and Super-Resolution with Adjustable-Kernel and Multi-Domain Attention
【速读】: 该论文试图解决遥感图像(Remote Sensing, RS)中同时存在的全局低分辨率(Low-Resolution, LR)退化和局部模糊退化问题,并提出了一种联合遥感图像去模糊和超分辨率(Joint RS Image Deblurring and Super-Resolution, JRSIDSR)的统一模型。解决方案的关键在于设计了一个双分支并行网络AKMD-Net,该网络包含去模糊和超分辨率两个分支。去模糊分支通过像素可调核块(Pixel-Adjustable Kernel Block, PAKB)估计局部和空间变化的模糊核,而超分辨率分支则利用多域注意力块(Multi-Domain Attention Block, MDAB)捕捉全局上下文信息并增强高频细节。此外,论文还提出了自适应特征融合(Adaptive Feature Fusion, AFF)模块来建模去模糊和超分辨率分支之间的上下文关系,并通过自适应维纳损失(Adaptive Wiener Loss, AW Loss)抑制重建图像中的先验噪声。实验结果表明,AKMD-Net在常用的遥感图像数据集上达到了最先进的定量和定性性能。
链接: https://arxiv.org/abs/2412.05696
作者: Yan Zhang,Pengcheng Zheng,Chengxiao Zeng,Bin Xiao,Zhenghao Li,Xinbo Gao
关键词-EN: Remote Sensing, computer vision, vision that aim, aim at restoring, deblurring
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Remote Sensing (RS) image deblurring and Super-Resolution (SR) are common tasks in computer vision that aim at restoring RS image detail and spatial scale, respectively. However, real-world RS images often suffer from a complex combination of global low-resolution (LR) degeneration and local blurring degeneration. Although carefully designed deblurring and SR models perform well on these two tasks individually, a unified model that performs jointly RS image deblurring and super-resolution (JRSIDSR) task is still challenging due to the vital dilemma of reconstructing the global and local degeneration simultaneously. Additionally, existing methods struggle to capture the interrelationship between deblurring and SR processes, leading to suboptimal results. To tackle these issues, we give a unified theoretical analysis of RS images’ spatial and blur degeneration processes and propose a dual-branch parallel network named AKMD-Net for the JRSIDSR task. AKMD-Net consists of two main branches: deblurring and super-resolution branches. In the deblurring branch, we design a pixel-adjustable kernel block (PAKB) to estimate the local and spatial-varying blur kernels. In the SR branch, a multi-domain attention block (MDAB) is proposed to capture the global contextual information enhanced with high-frequency details. Furthermore, we develop an adaptive feature fusion (AFF) module to model the contextual relationships between the deblurring and SR branches. Finally, we design an adaptive Wiener loss (AW Loss) to depress the prior noise in the reconstructed images. Extensive experiments demonstrate that the proposed AKMD-Net achieves state-of-the-art (SOTA) quantitative and qualitative performance on commonly used RS image datasets. The source code is publicly available at this https URL.
zh
[CV-167] Neural network interpretability with layer-wise relevance propagation: novel techniques for neuron selection and visualization
【速读】: 该论文试图解决复杂神经网络解释性问题,特别是在需要透明度和问责性的应用中,现有的层级相关性传播 (Layer-wise Relevance Propagation, LRP) 方法在评估单个神经元贡献时精度不足的问题。解决方案的关键在于提出一种改进的 LRP 反向传播方法,通过优化神经元选择和使用神经网络图与热图来突出关键路径,同时结合去卷积可视化技术重建特征图,从而提高神经网络的解释性。该方法以 VGG16 架构为例,通过均方误差 (Mean Squared Error, MSE) 和对称平均绝对百分比误差 (Symmetric Mean Absolute Percentage Error, SMAPE) 等精度指标进行优化,显著提升了 AI 系统在计算机视觉应用中的透明度和可靠性。
链接: https://arxiv.org/abs/2412.05686
作者: Deepshikha Bhati,Fnu Neha,Md Amiruzzaman,Angela Guercio,Deepak Kumar Shukla,Ben Ward
关键词-EN: Interpreting complex neural, Interpreting complex, decision-making processes, accountability are essential, complex neural networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Interpreting complex neural networks is crucial for understanding their decision-making processes, particularly in applications where transparency and accountability are essential. This proposed method addresses this need by focusing on layer-wise Relevance Propagation (LRP), a technique used in explainable artificial intelligence (XAI) to attribute neural network outputs to input features through backpropagated relevance scores. Existing LRP methods often struggle with precision in evaluating individual neuron contributions. To overcome this limitation, we present a novel approach that improves the parsing of selected neurons during LRP backward propagation, using the Visual Geometry Group 16 (VGG16) architecture as a case study. Our method creates neural network graphs to highlight critical paths and visualizes these paths with heatmaps, optimizing neuron selection through accuracy metrics like Mean Squared Error (MSE) and Symmetric Mean Absolute Percentage Error (SMAPE). Additionally, we utilize a deconvolutional visualization technique to reconstruct feature maps, offering a comprehensive view of the network’s inner workings. Extensive experiments demonstrate that our approach enhances interpretability and supports the development of more transparent artificial intelligence (AI) systems for computer vision applications. This advancement has the potential to improve the trustworthiness of AI models in real-world machine vision applications, thereby increasing their reliability and effectiveness.
zh
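层级相关性传播(LRP)的基本思想是把输出相关性按贡献比例逐层回传。下面给出针对全连接层的经典 ε 规则示意实现,仅展示回传机制本身;论文提出的神经元选择与可视化策略不在其中。

```python
import torch

def lrp_linear_epsilon(a, weight, bias, relevance_out, eps=1e-6):
    """
    对单个全连接层应用 LRP 的 ε 规则。
    a:             该层输入激活 [in_dim]
    weight, bias:  该层参数,weight 形状为 [out_dim, in_dim]
    relevance_out: 回传到该层输出的相关性 [out_dim]
    返回:回传到该层输入的相关性 [in_dim]
    """
    z = weight @ a + bias                           # 前向预激活
    z = z + eps * ((z >= 0).float() * 2 - 1)        # ε 稳定项,避免除零
    s = relevance_out / z
    return a * (weight.t() @ s)

# 用法示意:两层小网络,把预测类的输出相关性回传到输入
torch.manual_seed(0)
w1, b1 = torch.randn(8, 4), torch.zeros(8)
w2, b2 = torch.randn(3, 8), torch.zeros(3)
x = torch.randn(4)
h = torch.relu(w1 @ x + b1)
y = w2 @ h + b2
r_out = torch.zeros(3)
r_out[y.argmax()] = y.max()                 # 只对预测类分配相关性
r_h = lrp_linear_epsilon(h, w2, b2, r_out)
r_x = lrp_linear_epsilon(x, w1, b1, r_h)
print("输入各维相关性:", r_x)
```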
[CV-168] HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing
【速读】: 该论文试图解决视觉-文本不一致性(Visual-textual inconsistency, VTI)评估问题,特别是在图像描述数据集中由于内容多样性导致的各种不一致性,如场景、实体、实体属性、实体数量和实体交互等方面的不一致性。解决方案的关键在于设计了一个名为分层多粒度不一致性评估(Hierarchical and Multi-Grained Inconsistency Evaluation, HMGIE)的自适应评估框架。该框架通过三个连续模块实现:首先,语义图生成模块将图像描述转换为语义图,构建结构化表示;其次,分层不一致性评估模块通过动态问答生成和评估策略,生成层次不一致性评估图(HIEG);最后,定量评估模块基于HIEG计算准确性和完整性得分,并提供自然语言解释。该框架在处理不同图像描述数据集时表现出色,并通过实验验证了其优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2412.05685
作者: Zihao Zhu,Hongbao Zhang,Guanzong Wu,Siwei Lyu,Baoyuan Wu
关键词-EN: cleansing vision-language data, Visual-textual inconsistency, hierarchical inconsistency evaluation, vision-language data, plays a crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Visual-textual inconsistency (VTI) evaluation plays a crucial role in cleansing vision-language data. Its main challenges stem from the high variety of image captioning datasets, where differences in content can create a range of inconsistencies (e.g., inconsistencies in scene, entities, entity attributes, entity numbers, entity interactions). Moreover, variations in caption length can introduce inconsistencies at different levels of granularity as well. To tackle these challenges, we design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE), which can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. Specifically, the HMGIE framework is implemented by three consecutive modules. Firstly, the semantic graph generation module converts the image caption to a semantic graph for building a structural representation of all involved semantic items. Then, the hierarchical inconsistency evaluation module provides a progressive evaluation procedure with a dynamic question-answer generation and evaluation strategy guided by the semantic graph, producing a hierarchical inconsistency evaluation graph (HIEG). Finally, the quantitative evaluation module calculates the accuracy and completeness scores based on the HIEG, followed by a natural language explanation about the detection results. Moreover, to verify the efficacy and flexibility of the proposed framework on handling different image captioning datasets, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies. Extensive experiments on MVTID and other benchmark datasets demonstrate the superior performance of the proposed HMGIE to current state-of-the-art methods.
zh
[CV-169] RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts
【速读】: 该论文试图解决现有遥感视觉语言模型(RS VLMs)在像素级理解和多图像输入处理方面的不足。解决方案的关键在于提出了RSUniVLM,这是一个统一的端到端遥感视觉语言模型,能够处理图像级、区域级和像素级任务,并在多图像分析中表现出色,如变化检测和变化描述。为在不增加模型规模的情况下提升多层次视觉信息的捕捉能力,论文设计了一种新颖的架构——面向粒度的专家混合模型(Granularity-oriented Mixture of Experts),并构建了一个大规模的遥感指令跟随数据集,涵盖多种任务。实验结果表明,RSUniVLM在多种遥感任务中达到了最先进的性能。
链接: https://arxiv.org/abs/2412.05679
作者: Xu Liu,Zhouhui Lian
关键词-EN: Remote Sensing Vision-Language, Remote Sensing, Sensing Vision-Language Models, Sensing Vision-Language, image comprehension
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Remote Sensing Vision-Language Models (RS VLMs) have made much progress in the tasks of remote sensing (RS) image comprehension. While performing well in multi-modal reasoning and multi-turn conversations, the existing models lack pixel-level understanding and struggle with multi-image inputs. In this work, we propose RSUniVLM, a unified, end-to-end RS VLM designed for comprehensive vision understanding across multiple granularities, including image-level, region-level, and pixel-level tasks. RSUniVLM also performs effectively in multi-image analysis, with instances of change detection and change captioning. To enhance the model’s ability to capture visual information at different levels without increasing model size, we design a novel architecture called Granularity-oriented Mixture of Experts to constrain the model to about 1 billion parameters. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both RS and general domain, encompassing various tasks such as object localization, visual question answering, and semantic segmentation. Substantial experiments have been conducted to validate the superiority of the proposed RSUniVLM up to state-of-the-art across various RS tasks. Code and model will be available at this https URL.
zh
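代码示意(非官方实现):"面向粒度的专家混合"可以理解为按输入把 token 路由到不同专家的前馈层。下面给出常见的 top-1 门控稀疏 MoE 示意,维度与专家数均为假设,与论文按粒度路由的具体设计无关。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityMoE(nn.Module):
    """示意:按 token 做 top-1 门控路由的稀疏专家混合前馈层。"""
    def __init__(self, dim=512, num_experts=3, hidden=1024):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: [N, dim]
        probs = F.softmax(self.gate(x), dim=-1)    # 路由概率 [N, num_experts]
        top_idx = probs.argmax(dim=-1)             # 每个 token 只走一个专家
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = probs[mask][:, e:e + 1] * expert(x[mask])
        return out

# 用法示意
tokens = torch.randn(16, 512)
print(GranularityMoE()(tokens).shape)   # torch.Size([16, 512])
```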
[CV-170] Nearly Solved? Robust Deepfake Detection Requires More than Visual Forensics
【速读】: 该论文试图解决深度伪造(deepfakes)检测中的对抗攻击问题,特别是在高度现实的黑盒环境下,现有的最先进检测器容易受到经典对抗攻击的影响。论文的关键解决方案在于识别并利用深度伪造中的“鲁棒特征”(robust features),这些特征位于高层次语义层面。通过基于语义嵌入模型的检测器,论文展示了其在黑盒扰动攻击下的更强抵抗力。此外,论文提出使用大型视觉-语言模型(如GPT-4o)进行零样本深度伪造检测,并引入基于高层次语义操作的新型攻击方法。最终,论文主张通过结合低层次和高层次检测器的混合方法,利用它们的互补优势来提高对抗攻击的鲁棒性。
链接: https://arxiv.org/abs/2412.05676
作者: Guy Levy,Nathan Liebmann
关键词-EN: high-profile social engineering, social engineering attacks, increased sophistication, sophistication and prevalence, prevalence allowing
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Deepfakes are on the rise, with increased sophistication and prevalence allowing for high-profile social engineering attacks. Detecting them in the wild is therefore important as ever, giving rise to new approaches breaking benchmark records in this task. In line with previous work, we show that recently developed state-of-the-art detectors are susceptible to classical adversarial attacks, even in a highly-realistic black-box setting, putting their usability in question. We argue that crucial ‘robust features’ of deepfakes are in their higher semantics, and follow that with evidence that a detector based on a semantic embedding model is less susceptible to black-box perturbation attacks. We show that large visuo-lingual models like GPT-4o can perform zero-shot deepfake detection better than current state-of-the-art methods, and introduce a novel attack based on high-level semantic manipulation. Finally, we argue that hybridising low- and high-level detectors can improve adversarial robustness, based on their complementary strengths and weaknesses.
zh
[CV-171] Multimodal Biometric Authentication Using Camera-Based PPG and Fingerprint Fusion
【速读】: 该论文试图解决通过智能手机摄像头获取的光电容积描记法 (PPG) 信号与指纹数据相结合,以提高用户验证准确性的问题。解决方案的关键在于采用了一种多模态生物识别系统,该系统通过神经网络结合两个结构化状态空间模型 (SSM) 编码器来处理指纹图像和PPG波形。指纹图像被转换为像素序列,并与分段的PPG波形一起输入到编码器中。随后,跨模态注意力机制提取精炼的特征表示,并通过分布导向的对比损失函数将这些特征对齐到一个统一的潜在空间中。这种方法在单会话和双会话认证场景中均表现出优越的性能。
链接: https://arxiv.org/abs/2412.05660
作者: Xue Xian Zheng,M. M. Ur Rahma,Bilal Taha,Mudassir Masood,Dimitrios Hatzinakos,Tareq Al-Naffouri
关键词-EN: shown great promise, Camera-based photoplethysmography, obtained from smartphones, smartphones has shown, shown great
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Camera-based photoplethysmography (PPG) obtained from smartphones has shown great promise for personalized healthcare and secure authentication. This paper presents a multimodal biometric system that integrates PPG signals extracted from videos with fingerprint data to enhance the accuracy of user verification. The system requires users to place their fingertip on the camera lens for a few seconds, allowing the capture and processing of unique biometric characteristics. Our approach employs a neural network with two structured state-space model (SSM) encoders to manage the distinct modalities. Fingerprint images are transformed into pixel sequences, and along with segmented PPG waveforms, they are input into the encoders. A cross-modal attention mechanism then extracts refined feature representations, and a distribution-oriented contrastive loss function aligns these features within a unified latent space. Experimental results demonstrate the system’s superior performance across various evaluation metrics in both single-session and dual-session authentication scenarios.
zh
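代码示意(非官方实现):论文用跨模态注意力融合两路 SSM 编码器的输出。下面用 PyTorch 自带的多头注意力给出一个方向性假设的示意(以 PPG 特征为 query、指纹像素序列特征为 key/value),特征维度与 token 数均为假设。

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """以 PPG 特征为 query、指纹特征为 key/value 的跨模态注意力(示意)。"""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ppg_tokens, fp_tokens):
        fused, _ = self.attn(query=ppg_tokens, key=fp_tokens, value=fp_tokens)
        fused = self.norm(fused + ppg_tokens)       # 残差 + LayerNorm
        return fused.mean(dim=1)                    # 池化为单个身份嵌入

# 用法示意:PPG 序列 64 个 token,指纹像素序列 196 个 token
emb = CrossModalFusion()(torch.randn(2, 64, 256), torch.randn(2, 196, 256))
print(emb.shape)   # torch.Size([2, 256])
```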
[CV-172] Efficient Continuous Video Flow Model for Video Prediction
【速读】: 该论文试图解决多步预测模型(如扩散模型和校正流模型)在视频预测任务中由于高采样延迟而导致的瓶颈问题。解决方案的关键在于提出一种新颖的多步过程建模方法,该方法通过减少预测下一帧所需的采样步骤数量,并将模型大小缩减至原始大小的三分之一,从而显著降低计算需求和延迟。实验结果表明,该方法在多个标准视频预测数据集(如KTH、BAIR动作机器人、Human3.6M和UCF101)上实现了最先进的性能。
链接: https://arxiv.org/abs/2412.05633
作者: Gaurav Shrivastava,Abhinav Shrivastava
关键词-EN: rectified flow models, solutions for generation, video prediction tasks, diffusion and rectified, rectified flow
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-step prediction models, such as diffusion and rectified flow models, have emerged as state-of-the-art solutions for generation tasks. However, these models exhibit higher latency in sampling new frames compared to single-step methods. This latency issue becomes a significant bottleneck when adapting such methods for video prediction tasks, given that a typical 60-second video comprises approximately 1.5K frames. In this paper, we propose a novel approach to modeling the multi-step process, aimed at alleviating latency constraints and facilitating the adaptation of such processes for video prediction tasks. Our approach not only reduces the number of sample steps required to predict the next frame but also minimizes computational demands by reducing the model size to one-third of the original size. We evaluate our method on standard video prediction datasets, including KTH, BAIR action robot, Human3.6M and UCF101, demonstrating its efficacy in achieving state-of-the-art performance on these benchmarks.
zh
[CV-173] Biological Brain Age Estimation using Sex-Aware Adversarial Variational Autoencoder with Multimodal Neuroimages
【速读】: 该论文试图解决多模态脑年龄估计中的噪声问题,特别是功能磁共振成像 (fMRI) 数据噪声较大导致传统融合方法引入过多噪声从而降低估计准确性的问题。解决方案的关键在于提出了一种新颖的多模态框架,利用性别感知对抗变分自编码器 (SA-AVAE) 来有效分离来自不同模态的潜在特征。具体而言,该框架通过对抗学习和变分学习将潜在空间分解为模态特定代码和共享代码,分别表示模态间的互补信息和共同信息,并通过交叉重构和共享-差异距离比损失作为正则化项来增强解耦效果。此外,引入性别信息到潜在代码中,使模型能够捕捉性别特定的老化模式,并通过集成回归器模块实现脑年龄估计。实验结果表明,该框架在公开的OpenBHB数据集上优于现有方法,显示出在不同年龄段的显著鲁棒性,具有实时临床应用的潜力,特别是在早期检测神经退行性疾病方面。
链接: https://arxiv.org/abs/2412.05632
作者: Abd Ur Rehman,Azka Rehman,Muhammad Usman,Abdullah Shahid,Sung-Min Gho,Aleum Lee,Tariq M. Khan,Imran Razzak
关键词-EN: magnetic resonance imaging, brain age estimation, brain age, functional magnetic resonance, key biomarker
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Brain aging involves structural and functional changes and therefore serves as a key biomarker for brain health. Combining structural magnetic resonance imaging (sMRI) and functional magnetic resonance imaging (fMRI) has the potential to improve brain age estimation by leveraging complementary data. However, fMRI data, being noisier than sMRI, complicates multimodal fusion. Traditional fusion methods often introduce more noise than useful information, which can reduce accuracy compared to using sMRI alone. In this paper, we propose a novel multimodal framework for biological brain age estimation, utilizing a sex-aware adversarial variational autoencoder (SA-AVAE). Our framework integrates adversarial and variational learning to effectively disentangle the latent features from both modalities. Specifically, we decompose the latent space into modality-specific codes and shared codes to represent complementary and common information across modalities, respectively. To enhance the disentanglement, we introduce cross-reconstruction and shared-distinct distance ratio loss as regularization terms. Importantly, we incorporate sex information into the learned latent code, enabling the model to capture sex-specific aging patterns for brain age estimation via an integrated regressor module. We evaluate our model using the publicly available OpenBHB dataset, a comprehensive multi-site dataset for brain age estimation. The results from ablation studies and comparisons with state-of-the-art methods demonstrate that our framework outperforms existing approaches and shows significant robustness across various age groups, highlighting its potential for real-time clinical applications in the early detection of neurodegenerative diseases.
zh
[CV-174] Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising
【速读】: 该论文试图解决基于Transformer的扩散模型在生成高质量输出时需要大型模型,从而导致显著的训练和推理开销的问题。解决方案的关键在于引入Remix-DiT方法,通过使用多个去噪专家(diffusion experts)来提高输出质量,同时避免训练N个独立模型的昂贵成本。具体来说,Remix-DiT采用K个基础模型(K < N),并通过可学习的混合系数(learnable mixing coefficients)自适应地组合这些模型,从而在保持模型架构与标准扩散Transformer相同的前提下,动态分配模型容量以提升生成质量。
链接: https://arxiv.org/abs/2412.05628
作者: Gongfan Fang,Xinyin Ma,Xinchao Wang
关键词-EN: Transformer-based diffusion models, achieved significant advancements, Transformer-based diffusion, generative tasks, variety of generative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Transformer-based diffusion models have achieved significant advancements across a variety of generative tasks. However, producing high-quality outputs typically necessitates large transformer models, which result in substantial training and inference overhead. In this work, we investigate an alternative approach involving multiple experts for denoising, and introduce Remix-DiT, a novel method designed to enhance output quality at a low cost. The goal of Remix-DiT is to craft N diffusion experts for different denoising timesteps, yet without the need for expensive training of N independent models. To achieve this, Remix-DiT employs K basis models (where K < N) and utilizes learnable mixing coefficients to adaptively craft expert models. This design offers two significant advantages: first, although the total model size is increased, the model produced by the mixing operation shares the same architecture as a plain model, making the overall model as efficient as a standard diffusion transformer. Second, the learnable mixing adaptively allocates model capacity across timesteps, thereby effectively improving generation quality. Experiments conducted on the ImageNet dataset demonstrate that Remix-DiT achieves promising results compared to standard diffusion transformers and other multiple-expert methods. The code is available at this https URL.
zh
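代码示意(非官方实现):Remix-DiT 的关键是用 K 组基参数与可学习混合系数,为不同时间步分组"合成"专家参数。下面仅以单个线性层为例说明这一思路,基的数量、专家数与时间步分组方式均为示例假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RemixLinear(nn.Module):
    """示意:K 组基参数 + 按时间步分组的可学习混合系数,前向时合成该组专家的权重。"""
    def __init__(self, in_dim, out_dim, num_basis=4, num_experts=8):
        super().__init__()
        self.basis_w = nn.Parameter(torch.randn(num_basis, out_dim, in_dim) * 0.02)
        self.basis_b = nn.Parameter(torch.zeros(num_basis, out_dim))
        self.mix_logits = nn.Parameter(torch.zeros(num_experts, num_basis))  # 每组一套混合系数

    def forward(self, x, expert_id):
        alpha = F.softmax(self.mix_logits[expert_id], dim=-1)   # [num_basis]
        w = torch.einsum("k,koi->oi", alpha, self.basis_w)      # 合成权重 [out_dim, in_dim]
        b = torch.einsum("k,ko->o", alpha, self.basis_b)        # 合成偏置 [out_dim]
        return F.linear(x, w, b)

# 用法示意:把 0~999 的去噪时间步均分为 8 组,每组对应一个"专家"
layer = RemixLinear(64, 64)
t = 999
expert_id = min(t * 8 // 1000, 7)
print(layer(torch.randn(2, 64), expert_id).shape)   # torch.Size([2, 64])
```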
[CV-175] Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC
【速读】: 该论文试图解决扩散模型在适应不同下游任务时需要额外分支、特定训练策略和损失函数的问题,这些问题导致了预训练知识传递的障碍和用户友好性的降低。解决方案的关键在于提出了一种名为ONE-PIC的方法,通过In-Visual-Context Tuning和Masking Strategy来简化下游任务的微调过程。In-Visual-Context Tuning通过将源图像和目标图像排列成单个图像来构建任务特定训练数据,使微调过程更接近预训练,从而加速模型适应。Masking Strategy则统一了不同的生成任务,将其转化为对掩码部分的预测,从而简化了任务的实现和学习门槛,同时提高了效率和性能。
链接: https://arxiv.org/abs/2412.05619
作者: Ming Tao,Bing-Kun Bao,Yaowei Wang,Changsheng Xu
关键词-EN: demonstrated impressive generation, impressive generation capabilities, unlike Large Language, Large Language Models, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 11 figures
点击查看摘要
Abstract:Large pretrained diffusion models have demonstrated impressive generation capabilities and have been adapted to various downstream tasks. However, unlike Large Language Models (LLMs) that can learn multiple tasks in a single model based on instructed data, diffusion models always require additional branches, task-specific training strategies, and losses for effective adaptation to different downstream tasks. This task-specific fine-tuning approach brings two drawbacks. 1) The task-specific additional networks create gaps between pretraining and fine-tuning which hinders the transfer of pretrained knowledge. 2) It necessitates careful additional network design, raising the barrier to learning and implementation, and making it less user-friendly. Thus, a question arises: Can we achieve a simple, efficient, and general approach to fine-tune diffusion models? To this end, we propose ONE-PIC. It enhances the inherited generative ability in the pretrained diffusion models without introducing additional modules. Specifically, we propose In-Visual-Context Tuning, which constructs task-specific training data by arranging source images and target images into a single image. This approach makes downstream fine-tuning closer to the pretraining, allowing our model to adapt more quickly to various downstream tasks. Moreover, we propose a Masking Strategy to unify different generative tasks. This strategy transforms various downstream fine-tuning tasks into predictions of the masked portions. The extensive experimental results demonstrate that our method is simple and efficient, which streamlines the adaptation process and achieves excellent performance with lower costs. Code is available at this https URL.
zh
[CV-176] Rethinking Annotation for Object Detection: Is Annotating Small-size Instances Worth Its Cost?
【速读】: 该论文试图解决的问题是:是否值得为标注图像中的小尺寸目标实例付出高昂成本。解决方案的关键在于验证是否可以通过仅使用未标注小尺寸目标的训练数据来训练检测器,从而检测小尺寸目标。研究评估了两种方法:在测试时对输入图像进行上采样,以及在训练时对图像进行下采样。实验结果表明,通过在测试时上采样并缩小训练和测试输入之间的领域差距,该方法能够达到与使用完整训练数据的基线检测器相当的性能。此外,通过蒸馏技术,可以将该方法转化为单路径检测器,其性能与基线检测器相当。这些结果表明,有必要重新思考目标检测训练数据的标注策略。
链接: https://arxiv.org/abs/2412.05611
作者: Yusuke Hosoya,Masanori Suganuma,Takayuki Okatani
关键词-EN: Detecting objects occupying, Detecting objects, occupying only small, small areas, small-size instances
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures
点击查看摘要
Abstract:Detecting objects occupying only small areas in an image is difficult, even for humans. Therefore, annotating small-size object instances is hard and thus costly. This study questions common sense by asking the following: is annotating small-size instances worth its cost? We restate it as the following verifiable question: can we detect small-size instances with a detector trained using training data free of small-size instances? We evaluate a method that upscales input images at test time and a method that downscales images at training time. The experiments conducted using the COCO dataset show the following. The first method, together with a remedy to narrow the domain gap between training and test inputs, achieves at least comparable performance to the baseline detector trained using complete training data. Although the method needs to apply the same detector twice to an input image with different scaling, we show that its distillation yields a single-path detector that performs equally well to the same baseline detector. These results point to the necessity of rethinking the annotation of training data for object detection.
zh
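代码示意:论文评估的第一种做法是测试时放大输入图像、检测后再把框坐标缩回原尺度。下面给出一个与具体检测器无关的包装函数,其中 detector 为任意返回框与分数的可调用对象,属于示例假设。

```python
import torch
import torch.nn.functional as F

def detect_with_upscaling(detector, image, scale=2.0):
    """
    image:    [3, H, W] 的图像张量
    detector: 任意可调用对象,输入图像张量,返回 (boxes [N, 4], scores [N])
    返回:映射回原始分辨率的检测框与分数
    """
    upscaled = F.interpolate(image.unsqueeze(0), scale_factor=scale,
                             mode="bilinear", align_corners=False).squeeze(0)
    boxes, scores = detector(upscaled)
    return boxes / scale, scores          # 框坐标按比例缩回原图

# 用法示意:用一个虚构的检测器演示坐标换算
def dummy_detector(img):
    return torch.tensor([[20., 30., 60., 80.]]), torch.tensor([0.9])

boxes, scores = detect_with_upscaling(dummy_detector, torch.rand(3, 256, 256))
print(boxes, scores)
```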
[CV-177] RefSAM3D: Adapting SAM with Cross-modal Reference for 3D Medical Image Segmentation
【速读】: 该论文试图解决生成式 AI (Generative AI) 模型 Segment Anything Model (SAM) 在处理3D医学影像(如CT和MRI)时的局限性,这些影像需要捕捉体积空间中的空间信息以进行器官分割和肿瘤量化。解决方案的关键在于引入 RefSAM3D,通过结合3D图像适配器和跨模态参考提示生成,修改视觉编码器以处理3D输入,并增强掩码解码器以直接生成3D掩码。此外,通过集成文本提示和分层注意力机制,模型能够更准确和一致地分割复杂解剖结构,从而在多个医学影像数据集上展现出优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2412.05605
作者: Xiang Gao,Kai Lu
关键词-EN: Vision Transformer, capturing global patterns, originally built, global patterns, Vision
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The Segment Anything Model (SAM), originally built on a 2D Vision Transformer (ViT), excels at capturing global patterns in 2D natural images but struggles with 3D medical imaging modalities like CT and MRI. These modalities require capturing spatial information in volumetric space for tasks such as organ segmentation and tumor quantification. To address this challenge, we introduce RefSAM3D, which adapts SAM for 3D medical imaging by incorporating a 3D image adapter and cross-modal reference prompt generation. Our approach modifies the visual encoder to handle 3D inputs and enhances the mask decoder for direct 3D mask generation. We also integrate textual prompts to improve segmentation accuracy and consistency in complex anatomical scenarios. By employing a hierarchical attention mechanism, our model effectively captures and integrates information across different scales. Extensive evaluations on multiple medical imaging datasets demonstrate the superior performance of RefSAM3D over state-of-the-art methods. Our contributions advance the application of SAM in accurately segmenting complex anatomical structures in medical imaging.
zh
[CV-178] Multispecies Animal Re-ID Using a Large Community-Curated Dataset
【速读】: 该论文试图解决传统动物个体识别算法中存在的三个主要问题:(1) 每个物种单独训练模型的成本高昂,涉及数据收集、整理、模型训练、部署和维护;(2) 许多物种的训练数据稀缺;(3) 跨物种的外观相似性未被充分利用。解决方案的关键在于提出了一种多物种个体识别 (multi-species individual identification) 模型,通过构建包含49个物种、37,000个个体和225,000张图像的数据集,训练一个单一的嵌入网络来处理所有物种。该模型采用EfficientNetV2作为骨干网络,并结合子中心ArcFace损失函数和动态边际,显著提高了跨物种识别的准确性,平均提升了12.5%的top-1准确率,并展示了强大的零样本性能和针对新物种的微调能力。
链接: https://arxiv.org/abs/2412.05602
作者: Lasha Otarashvili,Tamilselvan Subramanian,Jason Holmberg,J.J. Levenson,Charles V. Stewart
关键词-EN: identifying animals individually, species, work has established, established the ecological, ecological importance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent work has established the ecological importance of developing algorithms for identifying animals individually from images. Typically, a separate algorithm is trained for each species, a natural step but one that creates significant barriers to wide-spread use: (1) each effort is expensive, requiring data collection, data curation, and model training, deployment, and maintenance, (2) there is little training data for many species, and (3) commonalities in appearance across species are not exploited. We propose an alternative approach focused on training multi-species individual identification (re-id) models. We construct a dataset that includes 49 species, 37K individual animals, and 225K images, using this data to train a single embedding network for all species. Our model employs an EfficientNetV2 backbone and a sub-center ArcFace loss function with dynamic margins. We evaluate the performance of this multispecies model in several ways. Most notably, we demonstrate that it consistently outperforms models trained separately on each species, achieving an average gain of 12.5% in top-1 accuracy. Furthermore, the model demonstrates strong zero-shot performance and fine-tuning capabilities for new species with limited training data, enabling effective curation of new species through both incremental addition of data to the training set and fine-tuning without the original data. Additionally, our model surpasses the recent MegaDescriptor on unseen species, averaging an 19.2% top-1 improvement per species and showing gains across all 33 species tested. The fully-featured code repository is publicly available on GitHub, and the feature extractor model can be accessed on HuggingFace for seamless integration with wildlife re-identification pipelines. The model is already in production use for 60+ species in a large-scale wildlife monitoring system.
zh
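论文采用带动态间隔的子中心 ArcFace 损失。下面给出固定间隔的标准子中心 ArcFace 的常见实现示意(每类 K 个中心、取最大余弦后加角度间隔);论文中的动态间隔此处未体现,超参数均为示例假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterArcFace(nn.Module):
    """子中心 ArcFace:每个类别 K 个中心,取最大余弦相似度后对目标类加角度间隔。"""
    def __init__(self, dim, num_classes, k=3, scale=30.0, margin=0.3):
        super().__init__()
        self.k, self.scale, self.margin = k, scale, margin
        self.weight = nn.Parameter(torch.randn(num_classes * k, dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))   # [B, C*K]
        cos = cos.view(len(embeddings), -1, self.k).amax(dim=-1)            # 取各类子中心最大值 [B, C]
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.margin), cos)  # 仅目标类加间隔
        return F.cross_entropy(self.scale * logits, labels)

# 用法示意:特征由 EfficientNetV2 等骨干产生(此处用随机张量代替)
loss = SubCenterArcFace(dim=512, num_classes=49)(torch.randn(8, 512), torch.randint(0, 49, (8,)))
print(loss)
```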
[CV-179] Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space
【速读】: 该论文试图解决地球观测数据在大规模项目(如Copernicus)中日益增长的处理需求,特别是如何高效地表示这些原始数据的向量表示问题。解决方案的关键在于利用预训练深度神经网络提取特征表示,从而为输入数据提供语义抽象。论文提出对现有社区项目Major TOM的扩展,旨在提供和标准化开放且免费的AI就绪地球观测数据集。此外,论文还发布了四个全球密集嵌入数据集,这些数据集在覆盖地球表面方面构成了最全面的全球开放地理空间视觉嵌入数据集。
链接: https://arxiv.org/abs/2412.05600
作者: Mikolaj Czerkawski,Marcin Kluczek,Jędrzej S. Bojanowski
关键词-EN: efficient vector representations, underlying raw data, observation data present, ever-increasing volumes, large programmes
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:
点击查看摘要
Abstract:With the ever-increasing volumes of the Earth observation data present in the archives of large programmes such as Copernicus, there is a growing need for efficient vector representations of the underlying raw data. The approach of extracting feature representations from pretrained deep neural networks is a powerful approach that can provide semantic abstractions of the input data. However, the way this is done for imagery archives containing geospatial data has not yet been defined. In this work, an extension is proposed to an existing community project, Major TOM, focused on the provision and standardization of open and free AI-ready datasets for Earth observation. Furthermore, four global and dense embedding datasets are released openly and for free along with the publication of this manuscript, resulting in the most comprehensive global open dataset of geospatial visual embeddings in terms of covered Earth’s surface.
zh
[CV-180] TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances AAAI2025
【速读】: 该论文试图解决3D场景理解中的功能性和可操作性(affordance)问题,特别是如何在一个3D层次场景图(3DHSG)中结构化和变化这些功能性可操作性,以支持任务导向的目标。解决方案的关键在于开发了一种算法,能够从分割的对象点云和对象语义标签出发,构建一个3DHSG,该图包含房间标签的顶节点、定义房间内局部空间区域的子节点(具有区域特定的可操作性),以及指示对象位置和对象特定可操作性的孙节点。通过使用基于transformer的多任务学习框架,模型同时学习房间分类和定义房间内空间区域及其特定可操作性,从而提升了现有基线模型的性能。
链接: https://arxiv.org/abs/2412.05596
作者: Wenting Xu,Viorela Ila,Luping Zhou,Craig T. Jin
关键词-EN: supports task-oriented objectives, hierarchical scene graph, spatial, spatial organization, task-oriented objectives
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to AAAI2025
点击查看摘要
Abstract:The concept of function and affordance is a critical aspect of 3D scene understanding and supports task-oriented objectives. In this work, we develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph representing the spatial organization of a scene. The varying functional affordance is designed to integrate with the varying spatial context of the graph. More specifically, we develop an algorithm that learns to construct a 3D hierarchical scene graph (3DHSG) that captures the spatial organization of the scene. Starting from segmented object point clouds and object semantic labels, we develop a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grand-child nodes indicating object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground truth data for local spatial regions with region-specific affordances and also object-specific affordances for each object. We employ a transformer-based model to learn the 3DHSG. We use a multi-task learning framework that learns both room classification and learns to define spatial regions within the room with region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and shows one approach for applying transformer models to 3D scene understanding and the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.
zh
[CV-181] Real-Time 3D Object Detection Using InnovizOne LiDAR and Low-Power Hailo-8 AI Accelerator
【速读】: 该论文试图解决在自动驾驶领域中,利用低功耗硬件实现实时3D目标检测(3D object detection)的问题。解决方案的关键在于使用Hailo-8 AI加速器处理来自InnovizOne LiDAR传感器的3D点云数据,并采用PointPillars算法进行高效的目标检测。通过这种方法,论文成功在低功耗硬件上实现了约5Hz的实时推理,F1得分达到0.91%,且仅比在NVIDIA GeForce RTX 2080 Ti上运行时下降了0.2%。这一成果表明,低成本、低功耗硬件也能有效支持高精度的实时3D目标检测,推动了自动驾驶技术的普及。
链接: https://arxiv.org/abs/2412.05594
作者: Itay Krispin-Avraham,Roy Orfaig,Ben-Zion Bobrovsky
关键词-EN: Object detection, detection, LiDAR, Object, LiDAR sensors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Object detection is a significant field in autonomous driving. Popular sensors for this task include cameras and LiDAR sensors. LiDAR sensors offer several advantages, such as insensitivity to light changes, like in a dark setting and the ability to provide 3D information in the form of point clouds, which include the ranges of objects. However, 3D detection methods, such as PointPillars, typically require high-power hardware. Additionally, most common spinning LiDARs are sparse and may not achieve the desired quality of object detection in front of the car. In this paper, we present the feasibility of performing real-time 3D object detection of cars using 3D point clouds from a LiDAR sensor, processed and deployed on a low-power Hailo-8 AI accelerator. The LiDAR sensor used in this study is the InnovizOne sensor, which captures objects in higher quality compared to spinning LiDAR techniques, especially for distant objects. We successfully achieved real-time inference at a rate of approximately 5Hz with a high accuracy of 0.91% F1 score, with only -0.2% degradation compared to running the same model on an NVIDIA GeForce RTX 2080 Ti. This work demonstrates that effective real-time 3D object detection can be achieved on low-cost, low-power hardware, representing a significant step towards more accessible autonomous driving technologies. The source code and the pre-trained models are available at this https URL PointPillarsHailoInnoviz/tree/main
zh
[CV-182] UMSPU: Universal Multi-Size Phase Unwrapping via Mutual Self-Distillation and Adaptive Boosting Ensemble Segmenters
【速读】: 该论文试图解决传统空间相位解包裹方法在高精度、大图像尺寸和高速度要求下的噪声抵抗和处理速度问题。解决方案的关键在于提出了互自蒸馏(Mutual Self-Distillation, MSD)机制和自适应提升集成分割器,构建了一个通用的多尺寸相位解包裹网络(Universal Multi-Size Phase Unwrapping Network, UMSPU)。MSD通过分层注意力精炼和跨层协作学习实现双向蒸馏,确保了不同尺寸图像的细粒度语义表示;自适应提升集成分割器则结合了不同感受野的弱分割器,形成强分割器,保证了空间频率的稳定分割。实验结果表明,UMSPU在图像尺寸从256×256到2048×2048的范围内实现了高精度处理,并在速度、鲁棒性和泛化能力上优于现有方法。
链接: https://arxiv.org/abs/2412.05584
作者: Lintong Du,Huazhen Liu,Yijia Zhang,ShuXin Liu,Yuan Qu,Zenghui Zhang,Jiamiao Yang
关键词-EN: key technique, technique for extracting, extracting phase information, image sizes, image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Spatial phase unwrapping is a key technique for extracting phase information to obtain 3D morphology and other features. Modern industrial measurement scenarios demand high precision, large image sizes, and high speed. However, conventional methods struggle with noise resistance and processing speed. Current deep learning methods are limited by the receptive field size and sparse semantic information, making them ineffective for large size images. To address this issue, we propose a mutual self-distillation (MSD) mechanism and adaptive boosting ensemble segmenters to construct a universal multi-size phase unwrapping network (UMSPU). MSD performs hierarchical attention refinement and achieves cross-layer collaborative learning through bidirectional distillation, ensuring fine-grained semantic representation across image sizes. The adaptive boosting ensemble segmenters combine weak segmenters with different receptive fields into a strong one, ensuring stable segmentation across spatial frequencies. Experimental results show that UMSPU overcomes image size limitations, achieving high precision across image sizes ranging from 256×256 to 2048×2048 (an 8 times increase). It also outperforms existing methods in speed, robustness, and generalization. Its practicality is further validated in structured light imaging and InSAR. We believe that UMSPU offers a universal solution for phase unwrapping, with broad potential for industrial applications.
zh
[CV-183] Rate-Distortion Optimized Skip Coding of Region Adaptive Hierarchical Transform Coefficients for MPEG G-PCC
【速读】: 该论文试图解决在基于几何的点云压缩 (Geometry-based Point Cloud Compression, G-PCC) 标准中,区域自适应分层变换 (Region-Adaptive Hierarchical Transform, RAHT) 在最后几层中零残差比例过高导致不必要比特率消耗的问题。解决方案的关键在于提出了一种自适应跳过编码方法,该方法能够动态决定是否对最后几层的残差进行编码,从而提高编码效率。此外,论文还提出了一种与自适应拉格朗日乘数相关的率失真代价计算方法,进一步优化了压缩性能。实验结果表明,该方法在动态点云数据上相较于现有的G-PCC参考软件,在Luma、Cb和Cr分量上分别实现了平均Bjøntegaard率提升-3.50%、-5.56%和-4.18%。
链接: https://arxiv.org/abs/2412.05574
作者: Zehan Wang,Yuxuan Wei,Hui Yuan,Wei Zhang,Peng Li
关键词-EN: Geometry-based Point Cloud, Point Cloud Compression, point clouds, Picture Experts Group, objects and scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Three-dimensional (3D) point clouds are becoming more and more popular for representing 3D objects and scenes. Due to limited network bandwidth, efficient compression of 3D point clouds is crucial. To tackle this challenge, the Moving Picture Experts Group (MPEG) is actively developing the Geometry-based Point Cloud Compression (G-PCC) standard, incorporating innovative methods to optimize compression, such as the Region-Adaptive Hierarchical Transform (RAHT) nestled within a layer-by-layer octree-tree structure. Nevertheless, a notable problem still exists in RAHT, i.e., the proportion of zero residuals in the last few RAHT layers leads to unnecessary bitrate consumption. To address this problem, we propose an adaptive skip coding method for RAHT, which adaptively determines whether to encode the residuals of the last several layers or not, thereby improving the coding efficiency. In addition, we propose a rate-distortion cost calculation method associated with an adaptive Lagrange multiplier. Experimental results demonstrate that the proposed method achieves average Bjøntegaard rate improvements of -3.50%, -5.56%, and -4.18% for the Luma, Cb, and Cr components, respectively, on dynamic point clouds, when compared with the state-of-the-art G-PCC reference software under the common test conditions recommended by MPEG.
zh
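自适应跳过编码的核心是比较"编码残差"与"跳过残差"两种模式的率失真代价 J = D + λR。下面给出该判决的简化示意:失真用残差能量近似、码率用非零系数个数近似,λ 与每系数比特数均为示例假设,并非 G-PCC 的实际估计方式。

```python
import numpy as np

def rd_cost(distortion, rate, lam):
    """率失真代价 J = D + λR。"""
    return distortion + lam * rate

def should_skip_layer(residuals, lam=1.0, bits_per_coeff=16.0):
    """
    示意:判断是否跳过某一 RAHT 层残差的编码。
    编码模式:失真近似为 0,码率按非零系数个数估计;
    跳过模式:码率近似为 0,失真取残差能量(误差平方和)。
    """
    residuals = np.asarray(residuals, dtype=np.float64)
    cost_code = rd_cost(0.0, np.count_nonzero(residuals) * bits_per_coeff, lam)
    cost_skip = rd_cost(np.sum(residuals ** 2), 0.0, lam)
    return cost_skip <= cost_code

# 用法示意:最后几层残差大多为零时,跳过的代价更低
print(should_skip_layer([0, 0, 0, 1, 0, 0]))    # True:跳过更划算
print(should_skip_layer([5, -7, 3, 8, -6, 4]))  # False:编码残差更划算
```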
[CV-184] Neighborhood Commonality-aware Evolution Network for Continuous Generalized Category Discovery
【速读】: 该论文试图解决连续广义类别发现 (Continuous Generalized Category Discovery, C-GCD) 问题,即在不依赖标签的情况下,持续从无标签图像集中发现新类别,同时保持对旧类别的性能。解决方案的关键在于提出了一种新的学习框架——邻域共性感知进化网络 (Neighborhood Commonality-aware Evolution Network, NCENet)。具体来说,NCENet 通过设计邻域共性感知表示学习 (Neighborhood Commonality-aware Representation Learning, NCRL) 模块,利用局部共性来指导不同类别实例之间的表示差异学习,从而提升对新类别的判别能力。同时,为了保持对旧类别的表示能力,NCENet 引入了双层对比知识蒸馏 (Bi-level Contrastive Knowledge Distillation, BCKD) 模块,通过对比学习感知学习和已学知识,并进行知识蒸馏。实验结果表明,NCENet 在 CIFAR10、CIFAR100 和 Tiny-ImageNet 数据集上均优于现有最先进方法。
链接: https://arxiv.org/abs/2412.05573
作者: Ye Wang,Yaxiong Wang,Guoshuai Zhao,Xueming Qian
关键词-EN: Continuous Generalized Category, Generalized Category Discovery, Continuous Generalized, Category Discovery, Generalized Category
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 Figures
点击查看摘要
Abstract:Continuous Generalized Category Discovery (C-GCD) aims to continually discover novel classes from unlabelled image sets while maintaining performance on old classes. In this paper, we propose a novel learning framework, dubbed Neighborhood Commonality-aware Evolution Network (NCENet) that conquers this task from the perspective of representation learning. Concretely, to learn discriminative representations for novel classes, a Neighborhood Commonality-aware Representation Learning (NCRL) is designed, which exploits local commonalities derived from neighborhoods to guide the learning of representational differences between instances of different classes. To maintain the representation ability for old classes, a Bi-level Contrastive Knowledge Distillation (BCKD) module is designed, which leverages contrastive learning to perceive the learning and learned knowledge and conducts knowledge distillation. Extensive experiments conducted on CIFAR10, CIFAR100, and Tiny-ImageNet demonstrate the superior performance of NCENet compared to the previous state-of-the-art method. Particularly, in the last incremental learning session on CIFAR100, the clustering accuracy of NCENet outperforms the second-best method by a margin of 3.09% on old classes and by a margin of 6.32% on new classes. Our code will be publicly available at this https URL.
zh
[CV-185] From Deterministic to Probabilistic: A Novel Perspective on Domain Generalization for Medical Image Segmentation
【速读】: 该论文试图解决传统领域泛化方法在面对领域偏移(domain shifts)时,依赖领域对齐(domain alignment)导致模型泛化能力受限的问题。解决方案的关键在于通过概率建模(probabilistic modeling)和对比学习(contrastive learning)提升数据表示质量,减少对领域对齐的依赖,并增强模型在领域变化下的鲁棒性。具体来说,论文结合确定性特征与不确定性建模,捕捉全面的特征分布,并通过对比学习对特征分布的均值和协方差进行对齐,从而实现动态适应领域变化并缓解分布偏移。此外,论文还设计了一种基于离散小波变换(discrete wavelet transforms)的频率域结构增强策略,以保留关键结构细节并减少由风格变化引起的视觉失真。
链接: https://arxiv.org/abs/2412.05572
作者: Yuheng Xu,Taiping Zhang
关键词-EN: Traditional domain generalization, learn domain-invariant representations, Traditional domain, inter-domain distribution differences, domain generalization methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures
点击查看摘要
Abstract:Traditional domain generalization methods often rely on domain alignment to reduce inter-domain distribution differences and learn domain-invariant representations. However, domain shifts are inherently difficult to eliminate, which limits model generalization. To address this, we propose an innovative framework that enhances data representation quality through probabilistic modeling and contrastive learning, reducing dependence on domain alignment and improving robustness under domain variations. Specifically, we combine deterministic features with uncertainty modeling to capture comprehensive feature distributions. Contrastive learning enforces distribution-level alignment by aligning the mean and covariance of feature distributions, enabling the model to dynamically adapt to domain variations and mitigate distribution shifts. Additionally, we design a frequency-domain-based structural enhancement strategy using discrete wavelet transforms to preserve critical structural details and reduce visual distortions caused by style variations. Experimental results demonstrate that the proposed framework significantly improves segmentation performance, providing a robust solution to domain generalization challenges in medical image segmentation.
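为说明"对齐特征分布的均值与协方差"这一思路,下面给出一个示意性的 PyTorch 草图(损失的具体形式为假设,并非论文的原始定义):对两个域的特征分别计算一阶与二阶统计量并最小化其差异。

```python
import torch

def distribution_alignment_loss(feat_a, feat_b):
    """Sketch of distribution-level alignment between two domains (assumed form).

    Aligns first- and second-order statistics (mean and covariance) of the
    feature distributions, in the spirit of the contrastive distribution
    alignment described above.
    """
    mean_a, mean_b = feat_a.mean(dim=0), feat_b.mean(dim=0)
    cov_a = torch.cov(feat_a.t())           # (D, D) covariance of domain A features
    cov_b = torch.cov(feat_b.t())
    mean_term = torch.norm(mean_a - mean_b, p=2) ** 2
    cov_term = torch.norm(cov_a - cov_b, p="fro") ** 2 / cov_a.numel()
    return mean_term + cov_term

# Toy usage: two batches of 64 features from two "domains".
loss = distribution_alignment_loss(torch.randn(64, 32), torch.randn(64, 32) + 0.5)
print(loss.item())
```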
zh
[CV-186] Template-free Articulated Gaussian Splatting for Real-time Reposable Dynamic View Synthesis NEURIPS2024
【速读】: 该论文试图解决动态场景中的新视角合成问题,特别是如何从视频中自动发现动态对象的骨架模型并实现重新定位。解决方案的关键在于利用3D高斯喷射(3D Gaussian Splatting)和超点(superpoints)来重建动态对象,并将超点视为刚性部分,通过直观线索和运动学模型优化骨架模型。此外,采用自适应控制策略来避免冗余超点的出现,从而实现高效的3D对象重定位和实时高分辨率图像渲染。
链接: https://arxiv.org/abs/2412.05570
作者: Diwen Wan,Yuxiang Wang,Ruijie Lu,Gang Zeng
关键词-EN: made significant progress, capturing skeleton models, significant progress, challenging task, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:While novel view synthesis for dynamic scenes has made significant progress, capturing skeleton models of objects and re-posing them remains a challenging task. To tackle this problem, in this paper, we propose a novel approach to automatically discover the associated skeleton model for dynamic objects from videos without the need for object-specific templates. Our approach utilizes 3D Gaussian Splatting and superpoints to reconstruct dynamic objects. Treating superpoints as rigid parts, we can discover the underlying skeleton model through intuitive cues and optimize it using the kinematic model. Besides, an adaptive control strategy is applied to avoid the emergence of redundant superpoints. Extensive experiments demonstrate the effectiveness and efficiency of our method in obtaining re-posable 3D objects. Not only can our approach achieve excellent visual fidelity, but it also allows for the real-time rendering of high-resolution images.
zh
[CV-187] Dif4FF: Leveraging Multimodal Diffusion Models and Graph Neural Networks for Accurate New Fashion Product Performance Forecasting ICPR2024
【速读】: 该论文试图解决快时尚行业中由于过度生产和未售库存导致的环境问题,特别是对全新款式产品的销售预测难题。解决方案的关键在于利用扩散模型(diffusion models)来应对传统确定性模型在处理训练数据之外的领域转移(domain shift)问题。论文提出的Dif4FF系统是一个两阶段的新时尚产品性能预测(NFPPF)管道,首先使用基于多模态数据的评分扩散模型预测多种服装的销售轨迹,然后通过图卷积网络(GCN)架构对预测结果进行优化,以捕捉时间和空间数据中的长程依赖关系,从而实现更精确和高效的销售预测。
链接: https://arxiv.org/abs/2412.05566
作者: Andrea Avogaro,Luigi Capogrosso,Franco Fummi,Marco Cristani
关键词-EN: significant environmental problems, unsold inventory create, inventory create significant, create significant environmental, fast-fashion industry
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the 27th International Conference on Pattern Recognition (ICPR 2024)
点击查看摘要
Abstract:In the fast-fashion industry, overproduction and unsold inventory create significant environmental problems. Precise sales forecasts for unreleased items could drastically improve the efficiency and profits of industries. However, predicting the success of entirely new styles is difficult due to the absence of past data and ever-changing trends. Specifically, currently used deterministic models struggle with domain shifts when encountering items outside their training data. The recently proposed diffusion models address this issue using a continuous-time diffusion process. Specifically, these models enable us to predict the sales of new items, mitigating the domain shift challenges encountered by deterministic models. As a result, this paper proposes Dif4FF, a novel two-stage pipeline for New Fashion Product Performance Forecasting (NFPPF) that leverages the power of diffusion models conditioned on multimodal data related to specific clothes. Dif4FF first utilizes a multimodal score-based diffusion model to forecast multiple sales trajectories for various garments over time. The forecasts are refined using a powerful Graph Convolutional Network (GCN) architecture. By leveraging the GCN’s capability to capture long-range dependencies within both the temporal and spatial data and seeking the optimal solution between these two dimensions, Dif4FF offers the most accurate and efficient forecasting system available in the literature for predicting the sales of new items. We tested Dif4FF on VISUELLE, the de facto standard for NFPPF, achieving new state-of-the-art results.
zh
[CV-188] Text-to-3D Gaussian Splatting with Physics-Grounded Motion Generation
【速读】: 该论文试图解决文本到3D生成技术中存在的两个主要问题:一是生成高保真3D物体时提示词效率低下,二是难以准确模拟3D物体基于物理的运动。解决方案的关键在于创新性地结合了大型语言模型(LLM)优化提示词和扩散先验引导的高斯溅射(Gaussian Splatting, GS)技术,以生成具有准确外观和几何结构的3D模型。此外,论文还引入了基于连续介质力学的变形映射和颜色正则化,以合成符合质量守恒和动量守恒的物理运动,从而实现对不同材料和受力条件下物体行为的准确模拟。通过将文本到3D生成与物理运动合成相结合,该框架能够生成具有物理感知运动的高保真、照片级真实感的3D物体。
链接: https://arxiv.org/abs/2412.05560
作者: Wenqing Wang,Yun Fu
关键词-EN: digital content creation, content creation, valuable technology, technology in virtual, virtual reality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Text-to-3D generation is a valuable technology in virtual reality and digital content creation. While recent works have pushed the boundaries of text-to-3D generation, producing high-fidelity 3D objects with inefficient prompts and simulating their physics-grounded motion accurately still remain unsolved challenges. To address these challenges, we present an innovative framework that utilizes the Large Language Model (LLM)-refined prompts and diffusion priors-guided Gaussian Splatting (GS) for generating 3D models with accurate appearances and geometric structures. We also incorporate a continuum mechanics-based deformation map and color regularization to synthesize vivid physics-grounded motion for the generated 3D Gaussians, adhering to the conservation of mass and momentum. By integrating text-to-3D generation with physics-grounded motion synthesis, our framework renders photo-realistic 3D objects that exhibit physics-aware motion, accurately reflecting the behaviors of the objects under various forces and constraints across different materials. Extensive experiments demonstrate that our approach achieves high-quality 3D generations with realistic physics-grounded motion.
zh
[CV-189] WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition
【速读】: 该论文试图解决语音情感识别 (Speech Emotion Recognition, SER) 中多模态融合技术的不足,特别是现有方法未能充分捕捉跨模态交互的复杂性和模态间的异质性,导致特征表示不理想的问题。解决方案的关键在于提出了WavFusion框架,通过引入门控跨模态注意力机制 (gated cross-modal attention mechanism) 和多模态同质特征差异学习 (multimodal homogeneous feature discrepancy learning),有效提升了跨模态交互的捕捉能力和特征的判别性,从而在基准数据集上实现了优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2412.05558
作者: Feng Li,Jiusong Luo,Wanjun Xia
关键词-EN: crucial task due, remains a challenging, challenging yet crucial, crucial task, task due
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Accepted by 31st International Conference on MultiMedia Modeling (MMM2025)
点击查看摘要
Abstract:Speech emotion recognition (SER) remains a challenging yet crucial task due to the inherent complexity and diversity of human emotions. To address this problem, researchers attempt to fuse information from other modalities via multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of cross-modal interactions, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal speech emotion recognition framework that addresses critical research problems in effective multimodal fusion, heterogeneity among modalities, and discriminative representation learning. By leveraging a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion demonstrates improved performance over existing state-of-the-art methods on benchmark datasets. Our work highlights the importance of capturing nuanced cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results on two benchmark datasets (IEMOCAP and MELD) demonstrate that WavFusion succeeds over the state-of-the-art strategies on emotion recognition.
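下面给出门控跨模态注意力的一个最小 PyTorch 草图(仅为示意,非 WavFusion 官方代码):语音特征对另一模态做交叉注意力,并由可学习的门控决定注入多少跨模态信息;维度与头数均为假设取值。

```python
import torch
import torch.nn as nn

class GatedCrossModalAttention(nn.Module):
    """Minimal sketch of a gated cross-modal attention block (illustrative only)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech, other):
        # Speech stream attends to the other modality (e.g., text) ...
        attended, _ = self.attn(query=speech, key=other, value=other)
        # ... and a per-position gate in [0, 1] controls how much is injected back.
        g = self.gate(torch.cat([speech, attended], dim=-1))
        return self.norm(speech + g * attended)

# Toy usage: fuse wav2vec-style speech features with text features.
block = GatedCrossModalAttention(dim=256)
fused = block(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
print(fused.shape)  # torch.Size([2, 50, 256])
```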
zh
[CV-190] CoE: Deep Coupled Embedding for Non-Rigid Point Cloud Correspondences
【速读】: 该论文试图解决非刚性变形形状(non-rigidly deformed shapes)在原始点云(raw point clouds)表示下的匹配问题,这一问题由于低成本3D传感器的普及而变得日益重要。点云的不规则性以及缺乏内在形状信息使得这一任务具有挑战性。论文提出的解决方案关键在于学习一种新的形状表示——每个点的多维嵌入(per-point high dimensional embedding),在嵌入空间中,语义相似的点具有相似的嵌入。这种嵌入具有多个有益特性:它能够感知底层形状几何,对形状变形和噪声、部分性等形状伪影具有鲁棒性。因此,通过在嵌入空间中进行简单的最近邻搜索,可以直接利用这种嵌入来检索高质量的密集对应关系。实验结果表明,该方法在多个非刚性形状匹配基准测试中达到了新的最先进水平,并展示了其在形状分析任务(如分割)中的巨大潜力。
链接: https://arxiv.org/abs/2412.05557
作者: Huajian Zeng,Maolin Gao,Daniel Cremers
关键词-EN: raw point clouds, non-rigidly deformed shapes, deformed shapes represented, matching non-rigidly deformed, proliferation of low-cost
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 17 figures
点击查看摘要
Abstract:The interest in matching non-rigidly deformed shapes represented as raw point clouds is rising due to the proliferation of low-cost 3D sensors. Yet, the task is challenging since point clouds are irregular and there is a lack of intrinsic shape information. We propose to tackle these challenges by learning a new shape representation – a per-point high dimensional embedding, in an embedding space where semantically similar points share similar embeddings. The learned embedding has multiple beneficial properties: it is aware of the underlying shape geometry and is robust to shape deformations and various shape artefacts, such as noise and partiality. Consequently, this embedding can be directly employed to retrieve high-quality dense correspondences through a simple nearest neighbor search in the embedding space. Extensive experiments demonstrate new state-of-the-art results and robustness in numerous challenging non-rigid shape matching benchmarks and show its great potential in other shape analysis tasks, such as segmentation.
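论文强调学到的逐点嵌入可直接通过最近邻搜索获得稠密对应关系,下面给出该匹配步骤的示意性草图(嵌入本身用随机张量代替,仅演示检索流程):

```python
import torch
import torch.nn.functional as F

def dense_correspondences(emb_src, emb_tgt):
    """Retrieve dense correspondences by nearest-neighbour search in the embedding space.

    Illustrative sketch of the matching step: each source point is matched to
    the target point with the most similar per-point embedding.
    """
    src = F.normalize(emb_src, dim=1)        # (N_src, D)
    tgt = F.normalize(emb_tgt, dim=1)        # (N_tgt, D)
    similarity = src @ tgt.t()               # cosine similarities
    return similarity.argmax(dim=1)          # index of the matched target point

# Toy usage: 1000 source points matched into 1200 target points (random embeddings).
matches = dense_correspondences(torch.randn(1000, 64), torch.randn(1200, 64))
print(matches.shape)  # torch.Size([1000])
```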
zh
[CV-191] Psych-Occlusion: Using Visual Psychophysics for Aerial Detection of Occluded Persons during Search and Rescue
【速读】: 该论文试图解决在紧急响应场景(Emergency Response, ER)中,小型无人机系统(small Unmanned Aerial Systems, sUAS)在复杂环境下(如遮挡和低目标分辨率)进行人员检测时性能下降的问题。解决方案的关键在于利用人类行为数据集(Psych-ER)来调整检测模型的损失函数(loss function),从而提高模型在远距离和高遮挡情况下的检测准确性,同时不影响近距离的检测性能。通过在RetinaNet模型上进行实验,证明了这种基于人类行为数据的损失函数调整方法在不同距离和遮挡水平下均能有效提升检测效果。
链接: https://arxiv.org/abs/2412.05553
作者: Arturo Miguel Russell Bernal,Jane Cleland-Huang,Walter Scheirer
关键词-EN: Emergency Response, Unmanned Aerial Systems, success of Emergency, small Unmanned Aerial, lost or injured
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The success of Emergency Response (ER) scenarios, such as search and rescue, is often dependent upon the prompt location of a lost or injured person. With the increasing use of small Unmanned Aerial Systems (sUAS) as “eyes in the sky” during ER scenarios, efficient detection of persons from aerial views plays a crucial role in achieving a successful mission outcome. Fatigue of human operators during prolonged ER missions, coupled with limited human resources, highlights the need for sUAS equipped with Computer Vision (CV) capabilities to aid in finding the person from aerial views. However, the performance of CV models onboard sUAS substantially degrades under real-life rigorous conditions of a typical ER scenario, where person search is hampered by occlusion and low target resolution. To address these challenges, we extracted images from the NOMAD dataset and performed a crowdsource experiment to collect behavioural measurements when humans were asked to “find the person in the picture”. We exemplify the use of our behavioral dataset, Psych-ER, by using its human accuracy data to adapt the loss function of a detection model. We tested our loss adaptation on a RetinaNet model evaluated on NOMAD against increasing distance and occlusion, with our psychophysical loss adaptation showing improvements over the baseline at higher distances across different levels of occlusion, without degrading performance at closer distances. To the best of our knowledge, our work is the first human-guided approach to address the location task of a detection model, while addressing real-world challenges of aerial search and rescue. All datasets and code can be found at: this https URL.
zh
[CV-192] GAQAT: gradient-adaptive quantization-aware training for domain generalization
【速读】: 该论文试图解决在资源受限的边缘设备上应用低精度量化训练时,现有的基于平坦最小值的领域泛化(Domain Generalization, DG)技术性能显著下降的问题。解决方案的关键在于提出了一种新的梯度自适应量化感知训练(Gradient-Adaptive Quantization-Aware Training, GAQAT)框架。该框架通过识别低精度量化中的尺度梯度冲突问题,即任务损失和光滑损失对量化器缩放因子的梯度产生冲突,导致某些层的梯度方向相反,从而使量化权重的优化变得不稳定。为解决这一问题,GAQAT引入了一种机制来量化梯度不一致性,并选择性地冻结缩放因子的梯度,从而稳定训练过程并增强跨领域泛化能力。
链接: https://arxiv.org/abs/2412.05551
作者: Jiacheng Jiang,Yuan Meng,Chen Tang,Han Yu,Qun Li,Zhi Wang,Wenwu Zhu
关键词-EN: flatter minima improve, minima improve generalization, Sharpness-Aware Minimization, loss surface geometry, flatter minima
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Research on loss surface geometry, such as Sharpness-Aware Minimization (SAM), shows that flatter minima improve generalization. Recent studies further reveal that flatter minima can also reduce the domain generalization (DG) gap. However, existing flatness-based DG techniques predominantly operate within a full-precision training process, which is impractical for deployment on resource-constrained edge devices that typically rely on lower bit-width representations (e.g., 4 bits, 3 bits). Consequently, low-precision quantization-aware training is critical for optimizing these techniques in real-world applications. In this paper, we observe a significant degradation in performance when applying state-of-the-art DG-SAM methods to quantized models, suggesting that current approaches fail to preserve generalizability during the low-precision training process. To address this limitation, we propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG. Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization, where the task loss and smoothness loss induce conflicting gradients for the scaling factors of quantizers, with certain layers exhibiting opposing gradient directions. This conflict renders the optimization of quantized weights highly unstable. To mitigate this, we further introduce a mechanism to quantify gradient inconsistencies and selectively freeze the gradients of scaling factors, thereby stabilizing the training process and enhancing out-of-domain generalization. Extensive experiments validate the effectiveness of the proposed GAQAT framework. On PACS, our 3-bit and 4-bit models outperform direct DG-QAT integration by up to 4.5%. On DomainNet, the 4-bit model achieves near-lossless performance compared to full precision, with improvements of 1.39% (4-bit) and 1.06% (3-bit) over the SOTA QAT baseline.
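下面用一个玩具例子示意"检测任务损失与平滑损失在量化缩放因子上的梯度冲突并选择性冻结"的思路(为假设性写法,非 GAQAT 官方实现;冲突判据取两梯度的内积符号,阈值 tau 为假设参数):

```python
import torch

def freeze_conflicting_scale_grads(scale_params, task_loss, smooth_loss, tau=0.0):
    """Sketch of gradient-adaptive freezing for quantizer scaling factors (assumed form).

    For each scaling factor, compare the gradient directions induced by the task
    loss and the smoothness loss; if they conflict (negative inner product),
    the scaling factor is frozen for this step by zeroing its gradient.
    """
    g_task = torch.autograd.grad(task_loss, scale_params, retain_graph=True, allow_unused=True)
    g_smooth = torch.autograd.grad(smooth_loss, scale_params, retain_graph=True, allow_unused=True)
    for p, gt, gs in zip(scale_params, g_task, g_smooth):
        if gt is None or gs is None:
            continue
        agreement = torch.sum(gt * gs)     # inner product of the two gradients
        p.grad = gt + gs                   # combined gradient by default
        if agreement < tau:                # conflicting directions: freeze this factor
            p.grad = torch.zeros_like(p)

# Toy usage with two scalar "scaling factors".
s1, s2 = torch.tensor(1.0, requires_grad=True), torch.tensor(0.5, requires_grad=True)
task, smooth = (s1 - 2.0) ** 2 + s2 ** 2, (s1 + 1.0) ** 2 + s2 ** 2
freeze_conflicting_scale_grads([s1, s2], task, smooth)
print(s1.grad, s2.grad)   # s1 conflicts and is frozen; s2 keeps its combined gradient
```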
zh
[CV-193] Street Gaussians without 3D Object Tracker
【速读】: 该论文试图解决自动驾驶场景中快速移动物体的真实场景重建问题,现有方法依赖于手动标注物体姿态或使用泛化能力有限的3D追踪器,导致重建效果不佳。论文的关键解决方案是利用2D深度追踪器的关联信息,结合3D物体融合策略,提出了一种稳定的物体追踪模块,并通过在隐式特征空间中引入运动学习策略,自主纠正轨迹误差和恢复漏检,从而在多样化的环境中提升重建的鲁棒性。实验结果表明,该方法在Waymo-NOTR数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2412.05548
作者: Ruida Zhang,Chengxi Li,Chenyangguang Zhang,Xingyu Liu,Haili Yuan,Yanyan Li,Xiangyang Ji,Gim Hee Lee
关键词-EN: Realistic scene reconstruction, significant challenges due, Realistic scene, driving scenarios poses, scenarios poses significant
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Realistic scene reconstruction in driving scenarios poses significant challenges due to fast-moving objects. Most existing methods rely on labor-intensive manual labeling of object poses to reconstruct dynamic objects in canonical space and move them based on these poses during rendering. While some approaches attempt to use 3D object trackers to replace manual annotations, the limited generalization of 3D trackers – caused by the scarcity of large-scale 3D datasets – results in inferior reconstructions in real-world settings. In contrast, 2D foundation models demonstrate strong generalization capabilities. To eliminate the reliance on 3D trackers and enhance robustness across diverse environments, we propose a stable object tracking module by leveraging associations from 2D deep trackers within a 3D object fusion strategy. We address inevitable tracking errors by further introducing a motion learning strategy in an implicit feature space that autonomously corrects trajectory errors and recovers missed detections. Experimental results on Waymo-NOTR datasets show we achieve state-of-the-art performance. Our code will be made publicly available.
zh
[CV-194] Radiant: Large-scale 3D Gaussian Rendering based on Hierarchical Framework
【速读】: 该论文试图解决在大规模场景重建中,传统分布式3D高斯喷射 (3D Gaussian Splatting, 3DGS) 框架在实际环境中面临的计算和通信挑战,以及隐私风险问题。解决方案的关键在于提出了一种分层3DGS算法Radiant,通过考虑系统异构性,合理划分区域并分配不同的相机位置给各个边缘设备进行图像采集和训练,从而提升模型性能和训练效率。Radiant的核心在于基于异构环境信息进行区域划分,并相应地分配工作负载给每个设备,同时提供了一种3DGS模型聚合算法,以增强模型质量并确保模型边界的连续性。实验结果表明,Radiant显著提高了重建质量(最高提升25.7%)并大幅减少了端到端延迟(最高减少79.6%)。
链接: https://arxiv.org/abs/2412.05546
作者: Haosong Peng,Tianyu Qi,Yufeng Zhan,Hao Li,Yalun Dai,Yuanqing Xia
关键词-EN: Gaussian Splatting, computer vision, recently emerged, popular scene reconstruction, advancement of computer
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
点击查看摘要
Abstract:With the advancement of computer vision, the recently emerged 3D Gaussian Splatting (3DGS) has increasingly become a popular scene reconstruction algorithm due to its outstanding performance. Distributed 3DGS can efficiently utilize edge devices to directly train on the collected images, thereby offloading computational demands and enhancing efficiency. However, traditional distributed frameworks often overlook computational and communication challenges in real-world environments, hindering large-scale deployment and potentially posing privacy risks. In this paper, we propose Radiant, a hierarchical 3DGS algorithm designed for large-scale scene reconstruction that considers system heterogeneity, enhancing the model performance and training efficiency. Via extensive empirical study, we find that it is crucial to partition the regions for each edge appropriately and allocate varying camera positions to each device for image collection and training. The core of Radiant is partitioning regions based on heterogeneous environment information and allocating workloads to each device accordingly. Furthermore, we provide a 3DGS model aggregation algorithm that enhances the quality and ensures the continuity of models’ boundaries. Finally, we develop a testbed, and experiments demonstrate that Radiant improved reconstruction quality by up to 25.7% and reduced up to 79.6% end-to-end latency.
zh
[CV-195] Uncovering Vision Modality Threats in Image-to-Image Tasks
【速读】: 该论文试图解决图像生成模型在视觉模态(vision modality)中面临的安全威胁问题,特别是在涉及真实世界图像编辑的任务中,这些威胁可能侵犯图像所有者的权利。论文通过提出一种名为“typographic attack”的方法,揭示了现有图像生成模型在视觉模态中普遍存在的脆弱性,并评估了现有防御方法在应对视觉模态威胁时的无效性。解决方案的关键在于提出了一个新的数据集——视觉模态威胁在图像生成模型中的数据集(Vision Modal Threats in Image Generation Models, VMT-IGMs),该数据集将作为评估图像生成模型在视觉模态中脆弱性的基准。
链接: https://arxiv.org/abs/2412.05538
作者: Hao Cheng,Erjia Xiao,Jiayan Yang,Jiahang Cao,Qiang Zhang,Jize Zhang,Kaidi Xu,Jindong Gu,Renjing Xu
关键词-EN: Current image generation, effortlessly produce high-quality, image generation models, highly realistic images, Current image
类目: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
备注:
点击查看摘要
Abstract:Current image generation models can effortlessly produce high-quality, highly realistic images, but this also increases the risk of misuse. In various Text-to-Image or Image-to-Image tasks, attackers can generate a series of images containing inappropriate content by simply editing the language modality input. Currently, to prevent this security threat, the various guard or defense methods that are proposed also focus on defending the language modality. However, in practical applications, threats in the visual modality, particularly in tasks involving the editing of real-world images, pose greater security risks as they can easily infringe upon the rights of the image owner. Therefore, this paper uses a method named typographic attack to reveal that various image generation models also commonly face threats in the vision modality. Furthermore, we also evaluate the defense performance of various existing methods when facing threats in the vision modality and uncover their ineffectiveness. Finally, we propose the Vision Modal Threats in Image Generation Models (VMT-IGMs) dataset, which would serve as a baseline for evaluating the vision modality vulnerability of various image generation models.
zh
[CV-196] Action Recognition based Industrial Safety Violation Detection
【速读】: 该论文试图解决个人防护装备(PPE)应用中普遍存在的误报问题,特别是在大型制造业中,现有的PPE检测系统往往因过度泛化不同行业和任务的PPE要求而产生大量误报。解决方案的关键在于理解工人正在执行的具体动作,并根据该动作的特定PPE需求进行定制化推理。论文提出了一种系统,首先通过活动识别模型理解工人的动作,然后使用目标检测技术检查是否存在PPE违规,从而在测试数据集上将F1分数提高了23%。
链接: https://arxiv.org/abs/2412.05531
作者: Surya N Reddy,Vaibhav Kurrey,Mayank Nagar,Gagan Raj Gupta
关键词-EN: personal protective equipment, large manufacturing industries, protective equipment, manufacturing industries, personal protective
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Proper use of personal protective equipment (PPE) can save the lives of industry workers and it is a widely used application of computer vision in the large manufacturing industries. However, most of the applications deployed generate a lot of false alarms (violations) because they tend to generalize the requirements of PPE across the industry and tasks. The key to resolving this issue is to understand the action being performed by the worker and customize the inference for the specific PPE requirements of that action. In this paper, we propose a system that employs activity recognition models to first understand the action being performed and then use object detection techniques to check for violations. This leads to a 23% improvement in the F1-score compared to the PPE-based approach on our test dataset of 109 videos.
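论文的核心流程是"先识别动作、再按该动作的 PPE 需求检查检测结果",下面是一个极简示意(动作类别与 PPE 需求表均为虚构示例,并非论文的具体定义):

```python
# Hypothetical mapping from recognized action to its required PPE items.
PPE_REQUIREMENTS = {
    "welding": {"helmet", "gloves", "face_shield"},
    "material_handling": {"helmet", "safety_shoes"},
    "inspection": {"helmet"},
}

def find_violations(recognized_action, detected_ppe):
    """Return PPE items required for the recognized action but missing from detections."""
    required = PPE_REQUIREMENTS.get(recognized_action, set())
    return required - set(detected_ppe)

# Toy usage: an activity recognition model says "welding", an object detector
# only found a helmet and gloves on the worker.
print(find_violations("welding", ["helmet", "gloves"]))  # {'face_shield'}
```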
zh
[CV-197] CLIP-TNseg: A Multi-Modal Hybrid Framework for Thyroid Nodule Segmentation in Ultrasound Images
【速读】: 该论文试图解决甲状腺结节在超声图像中的分割问题,现有方法在分割精度、可解释性和泛化能力方面存在挑战。解决方案的关键在于提出了一种名为CLIP-TNseg的新框架,通过整合多模态大模型与神经网络架构来应对这些问题。CLIP-TNseg包含两个主要分支:粗粒度分支(Coarse-grained Branch)利用冻结的CLIP模型提取高层语义特征,细粒度分支(Fine-grained Branch)则通过U-Net风格的残差块捕捉细粒度特征。这两个分支的特征融合后由预测头生成精确的分割图。该框架通过粗粒度分支增强语义理解,细粒度分支细化空间细节,从而实现精确且鲁棒的分割。
链接: https://arxiv.org/abs/2412.05530
作者: Xinjie Sun,Boxiong Wei,Yalong Jiang,Liquan Mao,Qi Zhao
关键词-EN: Thyroid nodule segmentation, Thyroid nodule, treatment planning, ultrasound images, images is crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 2 figures, submitted to IEEE Signal Processing Letters
点击查看摘要
Abstract:Thyroid nodule segmentation in ultrasound images is crucial for accurate diagnosis and treatment planning. However, existing methods face challenges in segmentation accuracy, interpretability, and generalization, which hinder their performance. This letter proposes a novel framework, CLIP-TNseg, to address these issues by integrating a multimodal large model with a neural network architecture. CLIP-TNseg consists of two main branches: the Coarse-grained Branch, which extracts high-level semantic features from a frozen CLIP model, and the Fine-grained Branch, which captures fine-grained features using U-Net style residual blocks. These features are fused and processed by the prediction head to generate precise segmentation maps. CLIP-TNseg leverages the Coarse-grained Branch to enhance semantic understanding through textual and high-level visual features, while the Fine-grained Branch refines spatial details, enabling precise and robust segmentation. Extensive experiments on public and our newly collected datasets demonstrate its competitive performance. Our code and the original dataset are available at this https URL.
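下面给出"冻结 CLIP 粗粒度特征与 U-Net 风格细粒度特征融合后送入预测头"的示意性 PyTorch 草图(通道数、分辨率等均为假设,非官方实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseFineFusionHead(nn.Module):
    """Minimal sketch of fusing coarse semantic features with fine-grained features
    before a segmentation head (assumed shapes, not the official CLIP-TNseg code)."""
    def __init__(self, coarse_dim=512, fine_dim=64, num_classes=1):
        super().__init__()
        self.reduce = nn.Conv2d(coarse_dim, fine_dim, kernel_size=1)
        self.head = nn.Sequential(
            nn.Conv2d(2 * fine_dim, fine_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fine_dim, num_classes, kernel_size=1),
        )

    def forward(self, coarse_feat, fine_feat):
        # Upsample the low-resolution semantic map to the fine branch resolution.
        coarse = F.interpolate(self.reduce(coarse_feat), size=fine_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.head(torch.cat([coarse, fine_feat], dim=1))   # segmentation logits

# Toy usage: a 14x14 CLIP-like feature map fused with 112x112 fine features.
head = CoarseFineFusionHead()
logits = head(torch.randn(1, 512, 14, 14), torch.randn(1, 64, 112, 112))
print(logits.shape)  # torch.Size([1, 1, 112, 112])
```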
zh
[CV-198] Video2Reward: Generating Reward Function from Videos for Legged Robot Behavior Learning ECAI2024
【速读】: 该论文试图解决传统基于文本描述生成奖励函数(reward function)的方法在行为学习中缺乏可控性和精确性的问题。解决方案的关键在于引入了一种新的视频2奖励(video2reward)方法,通过直接从展示目标行为的视频中生成奖励函数。具体来说,该方法首先将视频中的运动信息转换为关键点轨迹(keypoint trajectories),然后利用大型语言模型(LLM)生成奖励函数,并通过视频辅助的迭代奖励优化方案不断改进奖励函数,从而实现更高效的行为学习。实验结果表明,该方法在双足和四足机器人运动控制任务中显著优于现有的基于LLM的奖励生成方法。
链接: https://arxiv.org/abs/2412.05515
作者: Runhao Zeng,Dingjie Zhou,Qiwei Liang,Junlin Liu,Hui Li,Changxin Huang,Jianqiang Li,Xiping Hu,Fuchun Sun
关键词-EN: significant challenge due, legged robots presents, complex constraints, presents a significant, significant challenge
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 6 figures, ECAI2024
点击查看摘要
Abstract:Learning behavior in legged robots presents a significant challenge due to its inherent instability and complex constraints. Recent research has proposed the use of a large language model (LLM) to generate reward functions in reinforcement learning, thereby replacing the need for manually designed rewards by experts. However, this approach, which relies on textual descriptions to define learning objectives, fails to achieve controllable and precise behavior learning with clear directionality. In this paper, we introduce a new video2reward method, which directly generates reward functions from videos depicting the behaviors to be mimicked and learned. Specifically, we first process videos containing the target behaviors, converting the motion information of individuals in the videos into keypoint trajectories represented as coordinates through a video2text transforming module. These trajectories are then fed into an LLM to generate the reward function, which in turn is used to train the policy. To enhance the quality of the reward function, we develop a video-assisted iterative reward refinement scheme that visually assesses the learned behaviors and provides textual feedback to the LLM. This feedback guides the LLM to continually refine the reward function, ultimately facilitating more efficient behavior learning. Experimental results on tasks involving bipedal and quadrupedal robot motion control demonstrate that our method surpasses the performance of state-of-the-art LLM-based reward generation methods by over 37.6% in terms of human normalized score. More importantly, by switching video inputs, we find our method can rapidly learn diverse motion behaviors such as walking and running.
zh
[CV-199] AutoURDF: Unsupervised Robot Modeling from Point Cloud Frames Using Cluster Registration
【速读】: 该论文试图解决机器人描述模型创建过程中手动工作量大的问题,提出了AutoURDF,一种无监督的方法,用于从未见过的机器人点云帧中自动构建描述文件。解决方案的关键在于利用基于聚类的点云配准模型,跟踪点簇的6自由度变换,并通过分析簇的运动来分层次解决以下挑战:(1) 运动部件分割 (moving part segmentation),(2) 身体拓扑推断 (body topology inference),以及 (3) 关节参数估计 (joint parameter estimation)。该方法生成的机器人描述文件与现有仿真器完全兼容,并在注册和身体拓扑估计精度方面优于先前的方法,提供了一种可扩展的自动化机器人建模解决方案。
链接: https://arxiv.org/abs/2412.05507
作者: Jiong Lin,Lechen Zhang,Kwansoo Lee,Jialong Ning,Judah Goldfeder,Hod Lipson
关键词-EN: significant manual effort, requires significant manual, simulation and control, manual effort, essential for simulation
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 20 figures
点击查看摘要
Abstract:Robot description models are essential for simulation and control, yet their creation often requires significant manual effort. To streamline this modeling process, we introduce AutoURDF, an unsupervised approach for constructing description files for unseen robots from point cloud frames. Our method leverages a cluster-based point cloud registration model that tracks the 6-DoF transformations of point clusters. Through analyzing cluster movements, we hierarchically address the following challenges: (1) moving part segmentation, (2) body topology inference, and (3) joint parameter estimation. The complete pipeline produces robot description files that are fully compatible with existing simulators. We validate our method across a variety of robots, using both synthetic and real-world scan data. Results indicate that our approach outperforms previous methods in registration and body topology estimation accuracy, offering a scalable solution for automated robot modeling.
zh
[CV-200] Enhancing Sample Generation of Diffusion Models using Noise Level Correction
【速读】: 该论文试图解决扩散模型去噪过程中噪声水平估计不准确的问题,关键在于提出了一种噪声水平校正网络 (noise level correction network),通过利用预训练的去噪网络来优化噪声水平估计,从而将噪声样本更准确地投影到数据流形上。该方法不仅提升了样本生成的质量,还通过引入任务特定约束扩展到多种图像恢复任务(如修复、去模糊、超分辨率、着色和压缩感知)。实验结果表明,该方法在无约束和有约束的生成场景中均显著提高了样本质量,并且与现有的去噪调度器(如DDIM)兼容,进一步提升了性能。
链接: https://arxiv.org/abs/2412.05488
作者: Abulikemu Abuduweili,Chenyang Yuan,Changliu Liu,Frank Permenter
关键词-EN: noise level, diffusion models, noise level correction, data manifold, projection of noisy
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:The denoising process of diffusion models can be interpreted as a projection of noisy samples onto the data manifold. Moreover, the noise level in these samples approximates their distance to the underlying manifold. Building on this insight, we propose a novel method to enhance sample generation by aligning the estimated noise level with the true distance of noisy samples to the manifold. Specifically, we introduce a noise level correction network, leveraging a pre-trained denoising network, to refine noise level estimates during the denoising process. Additionally, we extend this approach to various image restoration tasks by integrating task-specific constraints, including inpainting, deblurring, super-resolution, colorization, and compressed sensing. Experimental results demonstrate that our method significantly improves sample quality in both unconstrained and constrained generation scenarios. Notably, the proposed noise level correction framework is compatible with existing denoising schedulers (e.g., DDIM), offering additional performance improvements.
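下面以一个确定性(DDIM 风格)采样循环示意"在每步去噪前先校正噪声水平估计"的做法(denoiser、corrector 为假设的可调用对象,并非某一具体库的 API):

```python
import torch

def denoise_with_noise_level_correction(x, sigmas, denoiser, corrector):
    """Sketch of a deterministic sampling loop with noise level correction (illustrative).

    At each step the scheduled noise level is refined by a correction network so it
    better matches the sample's true distance to the data manifold, and the refined
    level is used both for denoising and for stepping to the next scheduled level.
    """
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):   # decreasing noise schedule
        sigma_hat = corrector(x, sigma)                      # corrected noise level estimate
        x0_pred = denoiser(x, sigma_hat)                     # predicted clean sample
        eps = (x - x0_pred) / sigma_hat                      # implied noise direction
        x = x0_pred + sigma_next * eps                       # deterministic step to the next level
    return x

# Toy usage with dummy networks: the corrector returns the scheduled level unchanged,
# the denoiser simply shrinks the sample toward zero.
sigmas = torch.linspace(1.0, 0.0, steps=11)
sample = denoise_with_noise_level_correction(
    torch.randn(1, 3, 8, 8), sigmas,
    denoiser=lambda x, s: x * (1.0 - s), corrector=lambda x, s: s)
print(sample.shape)
```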
zh
[CV-201] Securing Social Media Against Deepfakes using Identity Behavioral and Geometric Signatures
【速读】: 该论文试图解决深度伪造(deepfake)检测技术在泛化能力上的局限性问题,即现有方法在面对未见过的或多样化的深度伪造内容时表现不佳。解决方案的关键在于提出了一种新的深度伪造检测框架,该框架通过集成深度身份(Deep identity)、行为(Behavioral)和几何(Geometric)特征的DBaG签名,并结合DBaGNet分类器,利用三元组损失(triplet loss)目标函数来增强泛化表示学习,从而提升分类性能。通过在六个基准深度伪造数据集上的广泛实验和跨数据集评估,验证了该方法在不同深度伪造内容上的有效性和泛化能力。
链接: https://arxiv.org/abs/2412.05487
作者: Muhammad Umar Farooq,Awais Khan,Ijaz Ul Haq,Khalid Mahmood Malik
关键词-EN: growing concern due, influence significant societal, growing concern, concern due, ability to influence
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Trust in social media is a growing concern due to its ability to influence significant societal changes. However, this space is increasingly compromised by various types of deepfake multimedia, which undermine the authenticity of shared content. Although substantial efforts have been made to address the challenge of deepfake content, existing detection techniques face a major limitation in generalization: they tend to perform well only on specific types of deepfakes they were trained on. This dependency on recognizing specific deepfake artifacts makes current methods vulnerable when applied to unseen or varied deepfakes, thereby compromising their performance in real-world applications such as social media platforms. To address the generalizability of deepfake detection, there is a need for a holistic approach that can capture a broader range of facial attributes and manipulations beyond isolated artifacts. To address this, we propose a novel deepfake detection framework featuring an effective feature descriptor that integrates Deep identity, Behavioral, and Geometric (DBaG) signatures, along with a classifier named DBaGNet. Specifically, the DBaGNet classifier utilizes the extracted DBaG signatures, leveraging a triplet loss objective to enhance generalized representation learning for improved classification. To test the effectiveness and generalizability of our proposed approach, we conduct extensive experiments using six benchmark deepfake datasets: WLDR, CelebDF, DFDC, FaceForensics++, DFD, and NVFAIR. Specifically, to ensure the effectiveness of our approach, we perform cross-dataset evaluations, and the results demonstrate significant performance gains over several state-of-the-art methods.
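论文利用三元组损失学习泛化的 DBaG 表示,下面给出该目标函数用法的最小示意(特征提取部分以随机嵌入代替,仅演示损失的构造;正负样本的含义为假设性注释):

```python
import torch
import torch.nn as nn

# Minimal sketch of the triplet objective used to learn generalized representations
# (illustrative only; the DBaG signature extractor is abstracted into random embeddings).
triplet = nn.TripletMarginLoss(margin=1.0)
anchor = torch.randn(16, 256)      # signature of a real video (assumed role)
positive = torch.randn(16, 256)    # another real video of the same subject (assumed role)
negative = torch.randn(16, 256)    # a deepfake of the same subject (assumed role)
loss = triplet(anchor, positive, negative)
print(loss.item())
```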
zh
[CV-202] TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
【速读】: 该论文试图解决开源多模态语言模型在处理复杂、多步骤、多模态任务时表现不佳的问题,尤其是那些需要细粒度识别、视觉定位和推理能力的任务。解决方案的关键在于提出了TACO模型,该模型通过生成思维与行动链(Chains-of-Thought-and-Action, CoTA),并在推理过程中调用外部工具(如OCR、深度估计和计算器)来执行中间步骤,最终整合思维和行动输出以生成连贯的响应。为了训练TACO,研究者创建了一个包含超过100万条合成CoTA轨迹的大型数据集,并通过数据过滤和混合技术筛选出29.3万条高质量的CoTA样本。这种结构化的多步骤指令调优方法显著提升了模型在复杂多模态推理任务中的表现,超越了仅依赖直接答案进行指令调优的现有模型。
链接: https://arxiv.org/abs/2412.05479
作者: Zixian Ma,Jianguo Zhang,Zhiwei Liu,Jieyu Zhang,Juntao Tan,Manli Shu,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Caiming Xiong,Ranjay Krishna,Silvio Savarese
关键词-EN: simple question answering, question answering tasks, demand multi-step solutions, require multiple capabilities, language models perform
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation and calculator, then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction tuning data with only direct answers. Our model TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% in MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning. Training on high-quality CoTA traces sets a new standard for complex multi-modal reasoning, highlighting the need for structured, multi-step instruction tuning in advancing open-source multi-modal models’ capabilities.
zh
[CV-203] Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data
【速读】: 该论文试图解决合成数据在监督机器学习中的可用性评估问题,关键在于提出了一种基于上置信界 (UCB) 的训练过程和动态可用性度量。该度量整合了合成图像的低级和高级信息,并结合了真实数据集和合成数据集的信息,超越了传统的评估指标。通过采用动态的UCB方法,确保模型学习过程的持续改进,并能有效适应机器学习模型状态的变化,同时考虑训练样本在训练过程中的动态效用。此外,论文还提出了一种属性感知的生成式数据管道,结合大语言模型 (Large Language Model) 和稳定扩散 (Stable Diffusion) 生成合成数据,显著提升了监督分类器的性能,分类准确率提高了多达10%。
链接: https://arxiv.org/abs/2412.05466
作者: Abdulrahman Kerim,Leandro Soriano Marcolino,Erickson R. Nascimento,Richard Jiang
关键词-EN: require large-scale training, methods require large-scale, require large-scale, large-scale training datasets, Synthetic
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Supervised machine learning methods require large-scale training datasets to perform well in practice. Synthetic data has been showing great progress recently and has been used as a complement to real data. However, there is yet a great urge to assess the usability of synthetically generated data. To this end, we propose a novel UCB-based training procedure combined with a dynamic usability metric. Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets, surpassing existing traditional metrics. By utilizing a UCB-based dynamic approach ensures continual enhancement of model learning. Unlike other approaches, our method effectively adapts to changes in the machine learning model’s state and considers the evolving utility of training samples during the training process. We show that our metric is an effective way to rank synthetic images based on their usability. Furthermore, we propose a new attribute-aware bandit pipeline for generating synthetic data by integrating a Large Language Model with Stable Diffusion. Quantitative results show that our approach can boost the performance of a wide range of supervised classifiers. Notably, we observed an improvement of up to 10% in classification accuracy compared to traditional approaches, demonstrating the effectiveness of our approach. Our source code, datasets, and additional materials are publically available at this https URL.
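下面给出 UCB1 选择"合成数据臂"的最小示意(可用性度量被抽象为每臂的奖励,各数据池的"真实可用性"等均为虚构取值,仅用于演示选择逻辑):

```python
import math
import random

def ucb_select(counts, rewards, c=2.0):
    """Pick the synthetic-data arm with the highest upper confidence bound (UCB1 sketch)."""
    total = sum(counts)
    best, best_score = 0, float("-inf")
    for i, (n, r) in enumerate(zip(counts, rewards)):
        if n == 0:
            return i                                   # play every arm at least once
        score = r / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best

# Toy simulation: three synthetic-data pools with different (hidden) usability.
true_usability = [0.2, 0.5, 0.8]
counts, rewards = [0, 0, 0], [0.0, 0.0, 0.0]
for _ in range(200):
    arm = ucb_select(counts, rewards)
    counts[arm] += 1
    rewards[arm] += random.random() < true_usability[arm]   # 1 if the batch helped, else 0
print(counts)   # most pulls should go to the most usable pool
```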
zh
[CV-204] COOOL: Challenge Of Out-Of-Label A Novel Benchmark for Autonomous Driving
【速读】: 该论文试图解决自动驾驶系统在处理未见过的新情况(novelty problem)时面临的挑战,这是实现完全自动驾驶的核心难题之一。解决方案的关键在于引入了一个名为COOOL的基准测试(Challenge of Out-Of-Label),该基准提供了一个新颖的数据集,包含超过200个面向行车记录仪的视频集合,由人工标注者标注了感兴趣的对象和潜在的驾驶危险。COOOL数据集涵盖了多种危险和干扰对象,适用于多种任务的评估,包括异常检测(Anomaly Detection)、开放集识别(Open-Set Recognition)、开放词汇(Open Vocabulary)和领域适应(Domain Adaptation)。由于数据集的规模和复杂性,COOOL仅作为评估基准使用。
链接: https://arxiv.org/abs/2412.05462
作者: Ali K. AlShami,Ananya Kalita,Ryan Rabinowitz,Khang Lam,Rishabh Bezbarua,Terrance Boult,Jugal Kalita
关键词-EN: Computer Vision community, Vision community rapidly, efficient autonomous transportation, Computer Vision, community rapidly develops
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As the Computer Vision community rapidly develops and advances algorithms for autonomous driving systems, the goal of safer and more efficient autonomous transportation is becoming increasingly achievable. However, it is 2024, and we still do not have fully self-driving cars. One of the remaining core challenges lies in addressing the novelty problem, where self-driving systems still struggle to handle previously unseen situations on the open road. With our Challenge of Out-Of-Label (COOOL) benchmark, we introduce a novel dataset for hazard detection, offering versatile evaluation metrics applicable across various tasks, including novelty-adjacent domains such as Anomaly Detection, Open-Set Recognition, Open Vocabulary, and Domain Adaptation. COOOL comprises over 200 collections of dashcam-oriented videos, annotated by human labelers to identify objects of interest and potential driving hazards. It includes a diverse range of hazards and nuisance objects. Due to the dataset’s size and data complexity, COOOL serves exclusively as an evaluation benchmark.
zh
[CV-205] CigTime: Corrective Instruction Generation Through Inverse Motion Editing NEURIPS2024
【速读】: 该论文试图解决从用户当前动作(源动作)到目标动作的转换过程中,生成纠正性指导文本的问题。解决方案的关键在于利用大规模语言模型生成纠正性文本,并通过现有的动作生成与编辑框架构建包含源动作、目标动作和纠正文本的三元组数据集。基于此数据集,论文提出了一种新的动作-语言模型,用于生成纠正性指导指令,从而在多种应用场景中显著提升用户表现的纠正与增强效果。
链接: https://arxiv.org/abs/2412.05460
作者: Qihang Fang,Chengcheng Tang,Bugra Tekin,Yanchao Yang
关键词-EN: shown significant promise, Recent advancements, models linking natural, linking natural language, linking natural
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 8 figures, NeurIPS 2024
点击查看摘要
Abstract:Recent advancements in models linking natural language with human motions have shown significant promise in motion generation and editing based on instructional text. Motivated by applications in sports coaching and motor skill learning, we investigate the inverse problem: generating corrective instructional text, leveraging motion editing and generation models. We introduce a novel approach that, given a user’s current motion (source) and the desired motion (target), generates text instructions to guide the user towards achieving the target motion. We leverage large language models to generate corrective texts and utilize existing motion generation and editing frameworks to compile datasets of triplets (source motion, target motion, and corrective text). Using this data, we propose a new motion-language model for generating corrective instructions. We present both qualitative and quantitative results across a diverse range of applications that largely improve upon baselines. Our approach demonstrates its effectiveness in instructional scenarios, offering text-based guidance to correct and enhance user performance.
zh
[CV-206] UniScene: Unified Occupancy-centric Driving Scene Generation
【速读】: 该论文试图解决自动驾驶领域中生成高保真、可控且带注释的训练数据问题。现有方法通常直接从粗略场景布局生成单一数据形式,无法满足多样下游任务所需的丰富数据形式,且难以建模布局到数据的直接分布。解决方案的关键在于提出了UniScene框架,这是首个统一生成驾驶场景中三种关键数据形式(语义占用、视频和LiDAR)的框架。UniScene采用渐进生成过程,首先从自定义场景布局生成语义占用作为元场景表示,然后基于占用生成视频和LiDAR数据,分别采用基于高斯联合渲染和先验引导稀疏建模的两种新颖转移策略。这种以占用为中心的方法不仅减轻了复杂场景的生成负担,还提供了详细的中间表示,为后续生成阶段提供了支持。
链接: https://arxiv.org/abs/2412.05435
作者: Bohan Li,Jiazhe Guo,Hongsi Liu,Yingshuang Zou,Yikang Ding,Xiwu Chen,Hu Zhu,Feiyang Tan,Chi Zhang,Tiancai Wang,Shuchang Zhou,Li Zhang,Xiaojuan Qi,Hao Zhao,Mu Yang,Wenjun Zeng,Xin Jin
关键词-EN: annotated training data, Prior-guided Sparse Modeling, annotated training, critical for autonomous, Gaussian-based Joint Rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to output rich data forms required for diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms - semantic occupancy, video, and LiDAR - in driving scenes. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, with two novel transfer strategies of Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in the occupancy, video, and LiDAR generation, which also indeed benefits downstream driving tasks.
zh
[CV-207] Swap Path Network for Robust Person Search Pre-training WACV2025
【速读】: 该论文试图解决行人搜索任务中缺乏端到端预训练框架的问题。解决方案的关键在于提出了一个全新的端到端行人搜索预训练框架,该框架将行人搜索任务分解为以对象为中心(object-centric)和以查询为中心(query-centric)两种方法。其中,以查询为中心的方法对标签噪声具有鲁棒性,并且可以使用弱标签的行人边界框进行训练。论文进一步提出了名为Swap Path Net (SPNet)的新模型,该模型能够同时实现以查询为中心和以对象为中心的训练目标,并在两者之间切换时使用相同的权重。通过SPNet,论文展示了以查询为中心的预训练和以对象为中心的微调相结合的方法在PRW和CUHK-SYSU标准行人搜索基准上达到了最先进的性能,分别为61.2%和96.4%的mAP。此外,该方法在行人搜索预训练中比仅使用骨干网络的预训练方法更为有效、高效和鲁棒。
链接: https://arxiv.org/abs/2412.05433
作者: Lucas Jaffe,Avideh Zakhor
关键词-EN: person search, query person image, gallery scenes, person, search
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2025; Code: this https URL
点击查看摘要
Abstract:In person search, we detect and rank matches to a query person image within a set of gallery scenes. Most person search models make use of a feature extraction backbone, followed by separate heads for detection and re-identification. While pre-training methods for vision backbones are well-established, pre-training additional modules for the person search task has not been previously examined. In this work, we present the first framework for end-to-end person search pre-training. Our framework splits person search into object-centric and query-centric methodologies, and we show that the query-centric framing is robust to label noise, and trainable using only weakly-labeled person bounding boxes. Further, we provide a novel model dubbed Swap Path Net (SPNet) which implements both query-centric and object-centric training objectives, and can swap between the two while using the same weights. Using SPNet, we show that query-centric pre-training, followed by object-centric fine-tuning, achieves state-of-the-art results on the standard PRW and CUHK-SYSU person search benchmarks, with 96.4% mAP on CUHK-SYSU and 61.2% mAP on PRW. In addition, we show that our method is more effective, efficient, and robust for person search pre-training than recent backbone-only pre-training alternatives.
zh
[CV-208] What's the Move? Hybrid Imitation Learning via Salient Points
【速读】: 该论文试图解决模仿学习(Imitation Learning, IL)在复杂任务中难以有效泛化的问题,特别是在视觉和空间变化下的表现。解决方案的关键在于SPHINX:基于显著点(Salient Points)的混合模仿与执行策略。SPHINX通过利用多模态观测(点云和腕部图像)以及混合动作空间(低频稀疏路径点和密集末端执行器运动),实现了对复杂任务的高效处理。具体来说,SPHINX从点云中学习识别任务相关的显著点,这些点作为锚点用于预测长距离运动的路径点,并在接近显著点后切换到基于腕部图像的密集末端执行器运动,以实现任务的精确执行。这种方法通过结合不同输入模态和动作表示的优势,显著提高了任务的成功率和泛化能力。
链接: https://arxiv.org/abs/2412.05426
作者: Priya Sundaresan,Hengyuan Hu,Quan Vuong,Jeannette Bohg,Dorsa Sadigh
关键词-EN: tasks remains challenging, offers a promising, robots various behaviors, remains challenging, promising framework
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce SPHINX: Salient Point-based Hybrid ImitatioN and eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end effector movements. Given 3D point cloud observations, SPHINX learns to infer task-relevant points within a point cloud, or salient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchor points to predict waypoints for long-range movement, such as reaching target poses in free-space. Once near a salient point, SPHINX learns to switch to predicting dense end-effector movements given close-up wrist images for precise phases of a task. By exploiting the strengths of different input modalities and action representations for different manipulation phases, SPHINX tackles complex tasks in a sample-efficient, generalizable manner. Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next best state-of-the-art IL baseline by 41.1% on average across 440 real world trials. SPHINX additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds with a 1.7x speedup over the most competitive baseline. Our website (this http URL) provides open-sourced code for data collection, training, and evaluation, along with supplementary videos.
zh
[CV-209] YOLOv5-Based Object Detection for Emergency Response in Aerial Imagery
【速读】: 该论文试图解决在航空影像中检测关键对象(如救护车、车祸、警车、拖车、消防车、翻车和着火车辆)的问题,特别是在复杂背景和小目标检测方面的挑战。解决方案的关键在于使用YOLOv5模型,并通过自定义数据集的完整流程(包括数据收集、标注、模型训练和评估)来实现高效且准确的目标检测。YOLOv5在速度和精度之间的平衡使其特别适合实时应急响应应用。
链接: https://arxiv.org/abs/2412.05394
作者: Sindhu Boddu,Arindam Mukherjee,Arindrajit Seal
关键词-EN: paper presents, presents a robust, robust approach, object detection, aerial imagery
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 8 figures, submitted for open-access publication on arXiv
点击查看摘要
Abstract:This paper presents a robust approach for object detection in aerial imagery using the YOLOv5 model. We focus on identifying critical objects such as ambulances, car crashes, police vehicles, tow trucks, fire engines, overturned cars, and vehicles on fire. By leveraging a custom dataset, we outline the complete pipeline from data collection and annotation to model training and evaluation. Our results demonstrate that YOLOv5 effectively balances speed and accuracy, making it suitable for real-time emergency response applications. This work addresses key challenges in aerial imagery, including small object detection and complex backgrounds, and provides insights for future research in automated emergency response systems.
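下面给出通过 Ultralytics 公共 torch.hub 接口做 YOLOv5 推理的最小示例(首次加载需联网;图片路径与自定义权重文件名均为占位假设,用于对应上文所述的应急目标类别):

```python
import torch

# Load the public YOLOv5 small model via torch.hub (downloads weights on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
# For a model fine-tuned on the emergency-response classes, one would instead load
# custom weights; the file name below is a hypothetical placeholder.
# model = torch.hub.load("ultralytics/yolov5", "custom", path="emergency_yolov5.pt")

results = model("aerial_scene.jpg")        # placeholder path; also accepts URL, PIL image, numpy array
detections = results.pandas().xyxy[0]      # one row per detection: box, confidence, class name
print(detections[["name", "confidence"]])
```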
zh
[CV-210] DIFEM: Key-points Interaction based Feature Extraction Module for Violence Recognition in Videos
【速读】: 该论文试图解决监控视频中暴力行为的自动检测问题,旨在提供一种高效且轻量级的系统。解决方案的关键在于利用人体骨骼关键点(human skeleton key-points)捕捉暴力行为的固有属性,如特定关节的快速运动及其近距离接触。核心创新是提出的动态交互特征提取模块(Dynamic Interaction Feature Extraction Module, DIFEM),该模块能够有效捕捉速度、关节交叉等特征,从而准确反映暴力行为的动态特性。通过DIFEM提取的特征,结合随机森林、决策树、AdaBoost和k近邻等多种分类算法,实现了在参数开销显著低于现有基于深度学习的最先进方法(SOTA)的情况下,依然在多个标准暴力识别数据集上表现出优越的性能。
链接: https://arxiv.org/abs/2412.05386
作者: Himanshu Mittal,Suvramalya Basak,Anjali Gautam
关键词-EN: ensuring public safety, public safety, surveillance videos, critical task, task for ensuring
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Violence detection in surveillance videos is a critical task for ensuring public safety. As a result, there is increasing need for efficient and lightweight systems for automatic detection of violent behaviours. In this work, we propose an effective method which leverages human skeleton key-points to capture inherent properties of violence, such as rapid movement of specific joints and their close proximity. At the heart of our method is our novel Dynamic Interaction Feature Extraction Module (DIFEM) which captures features such as velocity, and joint intersections, effectively capturing the dynamics of violent behavior. With the features extracted by our DIFEM, we use various classification algorithms such as Random Forest, Decision tree, AdaBoost and k-Nearest Neighbor. Our approach has substantially lesser amount of parameter expense than the existing state-of-the-art (SOTA) methods employing deep learning techniques. We perform extensive experiments on three standard violence recognition datasets, showing promising performance in all three datasets. Our proposed method surpasses several SOTA violence recognition methods.
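下面示意"从骨骼关键点提取速度与关节距离等动态特征,再用随机森林分类"的流程(特征定义为示意性写法,并非 DIFEM 的精确公式;数据为随机生成的玩具样本):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def difem_style_features(keypoints):
    """Hand-crafted dynamics features from skeleton key-points (an illustrative stand-in).

    `keypoints` has shape (T, J, 2): T frames, J joints, (x, y) coordinates.
    We summarise per-joint velocities and how closely joints approach each other.
    """
    velocity = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1)               # (T-1, J) joint speeds
    dists = np.linalg.norm(keypoints[:, :, None] - keypoints[:, None], axis=-1)  # (T, J, J) pairwise distances
    return np.array([velocity.mean(), velocity.max(), dists.mean(), dists.min()])

# Toy usage: random "videos" of 30 frames and 17 joints, two classes.
rng = np.random.default_rng(0)
X = np.stack([difem_style_features(rng.normal(size=(30, 17, 2))) for _ in range(40)])
y = rng.integers(0, 2, size=40)            # 0 = non-violent, 1 = violent (toy labels)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```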
zh
[CV-211] MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance
【速读】: 该论文试图解决扩散模型中的运动迁移问题,提出了基于混合得分引导 (Mixture of Score Guidance, MSG) 的理论框架。解决方案的关键在于重新定义条件得分,将其分解为运动得分和内容得分,从而实现运动和内容的分离。通过将运动迁移建模为潜在能量的混合,MSG不仅保留了场景的组成,还支持创造性的场景变换,同时保持了迁移运动模式的完整性。该方法直接在预训练的视频扩散模型上操作,无需额外的训练或微调,并通过实验验证了其在单物体、多物体、跨物体运动迁移以及复杂相机运动迁移中的有效性。
链接: https://arxiv.org/abs/2412.05355
作者: Hidir Yesiltepe,Tuna Han Salih Meral,Connor Dunlop,Pinar Yanardag
关键词-EN: Score Guidance, motion transfer, diffusion models, motion transfer approach, motion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
点击查看摘要
Abstract:In this work, we propose the first motion transfer approach in diffusion transformer through Mixture of Score Guidance (MSG), a theoretically-grounded framework for motion transfer in diffusion models. Our key theoretical contribution lies in reformulating conditional score to decompose motion score and content score in diffusion models. By formulating motion transfer as a mixture of potential energies, MSG naturally preserves scene composition and enables creative scene transformations while maintaining the integrity of transferred motion patterns. This novel sampling operates directly on pre-trained video diffusion models without additional training or fine-tuning. Through extensive experiments, MSG demonstrates successful handling of diverse scenarios including single object, multiple objects, and cross-object motion transfer as well as complex camera motion transfer. Additionally, we introduce MotionBench, the first motion transfer dataset consisting of 200 source videos and 1000 transferred motions, covering single/multi-object transfers, and complex camera motions.
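下面用一个极简草图示意"将条件得分分解为内容得分与运动得分并按权重混合"的采样引导项(公式形式与权重均为假设,并非作者的实现):

```python
import torch

def msg_guided_score(x_t, t, motion_score_fn, content_score_fn, w_motion=1.0, w_content=1.0):
    """Sketch of mixing a motion score and a content score at one sampling step (assumed form).

    The conditional score is decomposed into a content term from the target scene and
    a motion term extracted from the source video, then recombined as a weighted
    mixture that guides a pre-trained video diffusion sampler.
    """
    return w_content * content_score_fn(x_t, t) + w_motion * motion_score_fn(x_t, t)

# Toy usage with dummy score functions on a (frames, channels, H, W) latent.
latent = torch.randn(8, 4, 16, 16)
score = msg_guided_score(latent, t=500,
                         motion_score_fn=lambda x, t: -x,          # placeholder scores
                         content_score_fn=lambda x, t: -0.5 * x)
print(score.shape)
```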
zh
[CV-212] Towards Predicting the Success of Transfer-based Attacks by Quantifying Shared Feature Representations
【速读】: 该论文试图解决黑盒计算机视觉模型中基于迁移的攻击(TBA)成功率的预测问题。解决方案的关键在于通过识别目标模型中存在的脆弱特征,首次尝试进行攻击成功的事前预测。具体方法包括将来自不同模型的特征向量投影到相同的低维流形空间,量化流形上的结构相似性,并将这些相似性与TBA的成功率相关联。研究发现,共享的特征表示与TBA成功率适度相关(相关系数ρ=0.56),这表明可以通过这种方法在不了解模型权重、训练过程、架构或攻击细节的情况下预测攻击是否会成功。
链接: https://arxiv.org/abs/2412.05351
作者: Ashley S. Dale,Mei Qiu,Foo Bin Che,Thomas Bsaibes,Lauren Christopher,Paul Salama
关键词-EN: computer vision models, black-box computer vision, made to explain, explain and improve, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Much effort has been made to explain and improve the success of transfer-based attacks (TBA) on black-box computer vision models. This work provides the first attempt at a priori prediction of attack success by identifying the presence of vulnerable features within target models. Recent work by Chen and Liu (2024) proposed the manifold attack model, a unifying framework proposing that successful TBA exist in a common manifold space. Our work experimentally tests the common manifold space hypothesis by a new methodology: first, projecting feature vectors from surrogate and target feature extractors trained on ImageNet onto the same low-dimensional manifold; second, quantifying any observed structure similarities on the manifold; and finally, by relating these observed similarities to the success of the TBA. We find that shared feature representation moderately correlates with increased success of TBA (ρ = 0.56). This method may be used to predict whether an attack will transfer without information of the model weights, training, architecture or details of the attack. The results confirm the presence of shared feature representations between two feature extractors of different sizes and complexities, and demonstrate the utility of datasets from different target domains as test signals for interpreting black-box feature representations.
zh
[CV-213] Automated Dynamic Image Analysis for Particle Size and Shape Classification in Three Dimensions
【速读】: 该论文试图解决现有动态图像分析技术在处理细小颗粒时主要局限于二维成像的问题,这可能导致颗粒特性表征的不准确性。解决方案的关键在于引入OCULAR,这是一种创新的硬件和软件解决方案,通过同步的光学相机阵列对连续颗粒流进行三维动态成像。其核心在于通过三维表面重建实现颗粒形状的精确表征,从而克服了传统三维成像技术(如计算机断层扫描、激光扫描和正射摄影)在处理动态对象时的局限性,同时降低了成本和复杂性。
链接: https://arxiv.org/abs/2412.05347
作者: Sadegh Nadimi,Vasileios Angelidakis,Sadaf Maramizonouz,Chao Zhang
关键词-EN: dynamic image analysis, dynamic image, innovative hardware, hardware and software, image analysis
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Statistical Mechanics (cond-mat.stat-mech); Image and Video Processing (eess.IV)
备注: 11 pages, 5 figures
点击查看摘要
Abstract:We introduce OCULAR, an innovative hardware and software solution for three-dimensional dynamic image analysis of fine particles. Current state-of-the art instruments for dynamic image analysis are largely limited to two-dimensional imaging. However, extensive literature has demonstrated that relying on a single two-dimensional projection for particle characterisation can lead to inaccuracies in many applications. Existing three-dimensional imaging technologies, such as computed tomography, laser scanning, and orthophotography, are limited to static objects. These methods are often not statistically representative and come with significant post-processing requirements, as well as the need for specialised imaging and computing resources. OCULAR addresses these challenges by providing a cost-effective solution for imaging continuous particle streams using a synchronised array of optical cameras. Particle shape characterisation is achieved through the reconstruction of their three-dimensional surfaces. This paper details the OCULAR methodology, evaluates its reproducibility, and compares its results against X-ray micro computed tomography, highlighting its potential for efficient and reliable particle analysis.
zh
[CV-214] Generative Model-Based Fusion for Improved Few-Shot Semantic Segmentation of Infrared Images WACV
【速读】: 该论文试图解决红外图像(IR images)语义分割中的数据稀缺、对比度差异以及类别未在数据库中出现的问题。解决方案的关键在于利用生成式建模(generative modeling)和融合技术(fusion techniques)来增强少样本分割(Few-shot Segmentation, FSS)模型的性能。具体来说,论文提出了通过合成辅助数据来补充红外图像的有限对比度,并通过红外数据合成进行数据增强,以解决数据稀缺问题。此外,论文还引入了一个新的融合集成模块(fusion ensemble module),用于整合不同模态的信息,从而进一步提高模型在支持集和查询集之间关系的捕捉能力。这些方法在多个红外数据集上进行了评估,并显著提升了现有最先进(SOTA)FSS模型的性能。
链接: https://arxiv.org/abs/2412.05341
作者: Junno Yun,Mehmet Akçakaya
关键词-EN: including autonomous driving, imaging is commonly, autonomous driving, fire safety, FSS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Winter Conference on Applications of Computer Vision (WACV), 2025
点击查看摘要
Abstract:Infrared (IR) imaging is commonly used in various scenarios, including autonomous driving, fire safety and defense applications. Thus, semantic segmentation of such images is of great interest. However, this task faces several challenges, including data scarcity, differing contrast and input channel number compared to natural images, and emergence of classes not represented in databases in certain scenarios, such as defense applications. Few-shot segmentation (FSS) provides a framework to overcome these issues by segmenting query images using a few labeled support samples. However, existing FSS models for IR images require paired visible RGB images, which is a major limitation since acquiring such paired data is difficult or impossible in some applications. In this work, we develop new strategies for FSS of IR images by using generative modeling and fusion techniques. To this end, we propose to synthesize auxiliary data to provide additional channel information to complement the limited contrast in the IR images, as well as IR data synthesis for data augmentation. Here, the former helps the FSS model to better capture the relationship between the support and query sets, while the latter addresses the issue of data scarcity. Finally, to further improve the former aspect, we propose a novel fusion ensemble module for integrating the two different modalities. Our methods are evaluated on different IR datasets, and improve upon the state-of-the-art (SOTA) FSS models.
zh
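代码示例(仅为示意):摘要并未给出融合集成模块的具体结构;作为理解参考,下面给出一种常见的双模态特征融合写法,即把红外特征与合成辅助模态特征在通道维拼接后用 1×1 卷积融合。该写法只是假设性草图,并非论文原模块。

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """示意:拼接 + 1x1 卷积的双模态特征融合(假设性结构)。"""
    def __init__(self, c_ir, c_aux, c_out):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_ir + c_aux, c_out, kernel_size=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_ir, f_aux):
        # f_ir: 红外分支特征;f_aux: 由生成模型合成的辅助模态分支特征
        return self.fuse(torch.cat([f_ir, f_aux], dim=1))

fusion = SimpleFusion(c_ir=256, c_aux=256, c_out=256)
out = fusion(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))  # -> (2, 256, 32, 32)
```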
[CV-215] ACT-Bench: Towards Action Controllable World Models for Autonomous Driving
【速读】: 该论文试图解决当前世界模型在自动驾驶领域中对特定动作指令的忠实度(action fidelity)评估不足的问题。现有研究主要关注视觉逼真度或下游任务性能,而忽略了模型对给定指令的执行精度。为此,论文提出了一个开放的评估框架ACT-Bench,并开发了一个基准世界模型Terra。解决方案的关键在于:1) 构建了一个大规模数据集,将nuScenes的短上下文视频与未来轨迹数据配对,用于条件生成未来视频帧并评估动作忠实度;2) Terra模型通过在多个大规模轨迹标注数据集上训练,提升了动作忠实度;3) 通过ACT-Bench框架,验证了现有最先进模型在执行指令时的不足,同时展示了Terra在动作忠实度上的改进。该框架的所有组件将公开发布,以支持未来的研究。
链接: https://arxiv.org/abs/2412.05337
作者: Hidehisa Arai,Keishi Ishihara,Tsubasa Takahashi,Yu Yamaguchi
关键词-EN: promising neural simulators, supplement scarce real-world, scarce real-world data, autonomous driving, emerged as promising
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:World models have emerged as promising neural simulators for autonomous driving, with the potential to supplement scarce real-world data and enable closed-loop evaluations. However, current research primarily evaluates these models based on visual realism or downstream task performance, with limited focus on fidelity to specific action instructions - a crucial property for generating targeted simulation scenes. Although some studies address action fidelity, their evaluations rely on closed-source mechanisms, limiting reproducibility. To address this gap, we develop an open-access evaluation framework, ACT-Bench, for quantifying action fidelity, along with a baseline world model, Terra. Our benchmarking framework includes a large-scale dataset pairing short context videos from nuScenes with corresponding future trajectory data, which provides conditional input for generating future video frames and enables evaluation of action fidelity for executed motions. Furthermore, Terra is trained on multiple large-scale trajectory-annotated datasets to enhance action fidelity. Leveraging this framework, we demonstrate that the state-of-the-art model does not fully adhere to given instructions, while Terra achieves improved action fidelity. All components of our benchmark framework will be made publicly available to support future research.
zh
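代码示例(仅为示意):摘要没有公布动作忠实度的具体计算公式;一种直观的度量是把指令轨迹与从生成视频中估计出的自车轨迹对齐后计算平均位移误差(ADE),误差越小说明模型越“听话”。以下代码与数据均为假设性示意。

```python
import numpy as np

def average_displacement_error(traj_pred, traj_ref):
    """ADE:两条轨迹逐时间步欧氏距离的平均值,输入形状均为 (T, 2)。"""
    traj_pred, traj_ref = np.asarray(traj_pred, float), np.asarray(traj_ref, float)
    return float(np.mean(np.linalg.norm(traj_pred - traj_ref, axis=-1)))

cmd_traj = np.array([[0, 0], [1, 0], [2, 0], [3, 0]])               # 指令(条件)轨迹
est_traj = np.array([[0, 0], [0.9, 0.1], [1.8, 0.2], [2.7, 0.2]])   # 从生成帧估计出的轨迹(估计方法为假设)
print(average_displacement_error(est_traj, cmd_traj))                # 约 0.20
```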
[CV-216] Flexible Mesh Segmentation through Integration of Geometric and Topological Features of Reeb Graphs
【速读】: 该论文旨在解决网格分割(mesh segmentation)这一计算机图形学和几何分析中的关键问题,提出了一种基于Reeb图的创新方法。解决方案的关键在于通过三个主要阶段实现灵活且鲁棒的分割:首先,通过增强的拓扑骨架构建(topological skeleton construction)高效捕捉Reeb图结构并保留退化临界点;其次,利用临界点消除(critical point cancellation)进行拓扑简化,降低图复杂度同时保持关键形状特征和对应关系;最后,结合Reeb图邻接关系和网格顶点连通性,采用区域生长算法(region growing algorithm)生成连续且语义上有意义的分割结果。该方法具有O(n log n)的计算复杂度,适用于基于局部几何和形状直径函数的分割任务,展示了其在网格处理和理解中的广泛应用潜力。
链接: https://arxiv.org/abs/2412.05335
作者: Beguet Florian,Lanquetin Sandrine,Raffin Romain
关键词-EN: spanning texture mapping, diverse applications spanning, applications spanning texture, Mesh segmentation represents, texture mapping
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Mesh segmentation represents a crucial task in computer graphics and geometric analysis, with diverse applications spanning texture mapping, animation, and beyond. This paper introduces an innovative Reeb graph-based mesh segmentation method that seamlessly integrates geometric and topological features to achieve flexible and robust segmentation results. The proposed approach encompasses three primary phases. First, an enhanced topological skeleton construction efficiently captures the Reeb graph structure while preserving degenerate critical points. Second, a topological simplification process employing critical point cancellation reduces graph complexity while maintaining essential shape features and correspondences. Finally, a region growing algorithm leverages both Reeb graph adjacency and mesh vertex connectivity to generate contiguous, semantically meaningful segments. The presented method exhibits computational efficiency, achieving a complexity of O(n log n) for a mesh containing n vertices. Its versatility and effectiveness are validated through application to both local geometry-based segmentation using the Shape Index and part-based decomposition utilizing the Shape Diameter Function. This flexible framework establishes a solid foundation for advanced analysis and applications across various domains, offering new possibilities for mesh processing and understanding.
zh
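代码示例(仅为示意):第三阶段的区域生长同时利用 Reeb 图邻接与网格顶点连通性。下面给出一个只基于顶点邻接和种子标签的多源 BFS 版本,省略了 Reeb 图约束,仅用于说明“从种子向外扩张直到覆盖全部顶点”的基本机制;示例邻接表为假设数据。

```python
from collections import deque

def region_growing(adjacency, seeds):
    """adjacency: {顶点: [相邻顶点]};seeds: {种子顶点: 区域标签}。
    多源 BFS:每个未标注顶点被并入最先到达它的种子区域(示意)。"""
    labels = dict(seeds)
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        for u in adjacency[v]:
            if u not in labels:
                labels[u] = labels[v]
                queue.append(u)
    return labels

# 用法示意:一条 5 个顶点的链,两端各放一个种子
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(region_growing(adj, seeds={0: "A", 4: "B"}))  # {0: 'A', 4: 'B', 1: 'A', 3: 'B', 2: 'A'}
```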
[CV-217] Deep Learning and Hybrid Approaches for Dynamic Scene Analysis Object Detection and Motion Tracking
【速读】: 该论文旨在解决视频监控系统中存储效率和检索便捷性的问题,通过将视频分割为基于活动检测的小片段来优化存储并简化数字搜索。解决方案的关键在于结合多种先进的计算机视觉技术,包括卷积神经网络 (CNNs) 如 YOLO、SSD 和 Faster R-CNN,以及循环神经网络 (RNNs) 和长短期记忆网络 (LSTMs) 来实现高精度的目标检测和时间依赖性捕捉。此外,采用高斯混合模型 (GMM) 和光流法如 Lucas-Kanade 进行自适应背景建模和运动检测,并通过多尺度与上下文分析提升不同物体大小和环境下的检测性能。混合运动分割策略结合统计和深度学习模型处理复杂运动,同时通过实时处理优化确保计算效率。跟踪方法如卡尔曼滤波器和孪生网络用于在遮挡情况下保持平滑跟踪。这些技术的综合应用显著提高了检测和跟踪的精度与速度,从而有效减少了存储需求并增强了监控系统的安全性。
链接: https://arxiv.org/abs/2412.05331
作者: Shahran Rahman Alve
关键词-EN: smaller clips based, Convolutional Neural Networks, Recurrent Neural Networks, Neural Networks, project aims
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 Pages, 7 Figures
点击查看摘要
Abstract:This project aims to develop a robust video surveillance system, which can segment videos into smaller clips based on the detection of activities. It uses CCTV footage, for example, to record only major events-like the appearance of a person or a thief-so that storage is optimized and digital searches are easier. It utilizes the latest techniques in object detection and tracking, including Convolutional Neural Networks (CNNs) like YOLO, SSD, and Faster R-CNN, as well as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), to achieve high accuracy in detection and capture temporal dependencies. The approach incorporates adaptive background modeling through Gaussian Mixture Models (GMM) and optical flow methods like Lucas-Kanade to detect motions. Multi-scale and contextual analysis are used to improve detection across different object sizes and environments. A hybrid motion segmentation strategy combines statistical and deep learning models to manage complex movements, while optimizations for real-time processing ensure efficient computation. Tracking methods, such as Kalman Filters and Siamese networks, are employed to maintain smooth tracking even in cases of occlusion. Detection is improved on various-sized objects for multiple scenarios by multi-scale and contextual analysis. Results demonstrate high precision and recall in detecting and tracking objects, with significant improvements in processing times and accuracy due to real-time optimizations and illumination-invariant features. The impact of this research lies in its potential to transform video surveillance, reducing storage requirements and enhancing security through reliable and efficient object detection and tracking.
zh
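代码示例(仅为示意):自适应背景建模与“按活动切片”可以用 OpenCV 自带的 MOG2 背景减除器快速演示:前景像素占比超过阈值的帧被视为“有活动”,后续即可据此把长视频切成小片段。视频路径与阈值均为假设值,并非论文的完整系统。

```python
import cv2

cap = cv2.VideoCapture("cctv.mp4")  # 假设的监控视频路径
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

active_frames = []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg.apply(frame)                 # 前景掩码(0 背景、127 阴影、255 前景)
    motion_ratio = (mask == 255).mean()    # 前景像素占比,作为“活动量”
    if motion_ratio > 0.01:                # 阈值为示意值,实际需按场景调参
        active_frames.append(frame_idx)
    frame_idx += 1
cap.release()
print(f"检测到活动的帧数: {len(active_frames)}")
```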
[CV-218] Mapping The Layers of The Ocean Floor With a Convolutional Neural Network
【速读】: 该论文试图解决海洋底层层序的映射问题,这是石油行业面临的一个挑战。现有的解决方案主要依赖于地震方法和波场反演,这些方法复杂且计算成本高。论文的关键解决方案是引入人工神经网络,特别是UNet架构,通过基于从海底反射的地震波数据来预测速度模型。研究验证了两种神经网络架构在速度模型反演中的应用,并通过损失函数、相似系数等稳定性指标以及预测模型与实际模型之间的差异进行了比较。结果表明,神经网络在这一领域具有潜力,达到了超过70%的Sørensen-Dice系数值。
链接: https://arxiv.org/abs/2412.05329
作者: Guilherme G. D. Fernandes,Vitor S. P. P. Oliveira,João P. I. Astolfo
关键词-EN: oil industry, ocean floor layers, ocean floor, floor layers, methods involve mapping
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph); Geophysics (physics.geo-ph)
备注: 10 pages, 5 figures. Developed during the 6th Edition of the Advanced School of Experimental Physics (EAFExp), Brazilian Centre for Physics Research
点击查看摘要
Abstract:The mapping of ocean floor layers is a current challenge for the oil industry. Existing solution methods involve mapping through seismic methods and wave inversion, which are complex and computationally expensive. The introduction of artificial neural networks, specifically UNet, to predict velocity models based on seismic shots reflected from the ocean floor shows promise for optimising this process. In this study, two neural network architectures are validated for velocity model inversion and compared in terms of stability metrics such as loss function and similarity coefficient, as well as the differences between predicted and actual models. Indeed, neural networks prove promising as a solution to this challenge, achieving Sørensen-Dice coefficient values above 70%.
zh
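代码示例(仅为示意):Sørensen-Dice 系数常用来衡量预测层位与真实层位的重合程度。对连续的速度模型通常先做分层或二值化,这里以二值掩码为例给出计算方式;示例数组为假设数据。

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Sørensen-Dice = 2|A∩B| / (|A| + |B|),输入为同形状的二值数组。"""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# 示例:两个 4x4 的层位掩码
a = np.array([[1, 1, 0, 0]] * 4)
b = np.array([[1, 0, 0, 0]] * 4)
print(dice_coefficient(a, b))  # 约 0.667
```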
[CV-219] The Role of Text-to-Image Models in Advanced Style Transfer Applications: A Case Study with DALL-E 3
【速读】: 该论文试图解决在风格迁移领域中,如何利用生成式 AI (Generative AI) 如 DALL-E 3 来提升风格图像的多样性和艺术质量的问题。解决方案的关键在于将 DALL-E 3 生成的风格图像与传统的神经风格迁移技术(如 Magenta 的 Arbitrary Image Stylization 模型)相结合。通过生成基于文本描述的风格图像,DALL-E 3 不仅增强了最终输出图像的艺术多样性,还在保持处理效率的同时,显著提升了图像质量,使其在视觉上优于传统方法。
链接: https://arxiv.org/abs/2412.05325
作者: Ebubechukwu Ike
关键词-EN: remains slightly underexplored, transfer remains slightly, style transfer remains, style transfer, Magenta Arbitrary Image
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 7 pages
点击查看摘要
Abstract:While DALL-E 3 has gained popularity for its ability to generate creative and complex images from textual descriptions, its application in the domain of style transfer remains slightly underexplored. This project investigates the integration of DALL-E 3 with traditional neural style transfer techniques to assess the impact of generated style images on the quality of the final output. DALL-E 3 was employed to generate style images based on the descriptions provided and combine these with the Magenta Arbitrary Image Stylization model. This integration is evaluated through metrics such as the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR), as well as processing time assessments. The findings reveal that DALL-E 3 significantly enhances the diversity and artistic quality of stylized images. Although this improvement comes with a slight increase in style transfer time, the data shows that this trade-off is worthwhile because the overall processing time with DALL-E 3 is about 2.5 seconds faster than traditional methods, making it both an efficient and visually superior option.
zh
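代码示例(仅为示意):文中用 SSIM 与 PSNR 评价风格化结果,这两项指标可以直接用 scikit-image 计算;图像路径为假设,实际比较对象(内容图与风格化图)依论文的评测设置而定。

```python
from skimage import io, img_as_float
from skimage.metrics import structural_similarity, peak_signal_noise_ratio
from skimage.transform import resize

content = img_as_float(io.imread("content.jpg"))     # 内容图(假设路径)
stylized = img_as_float(io.imread("stylized.jpg"))   # 风格化结果(假设路径)
stylized = resize(stylized, content.shape, anti_aliasing=True)  # 对齐尺寸

ssim = structural_similarity(content, stylized, channel_axis=-1, data_range=1.0)
psnr = peak_signal_noise_ratio(content, stylized, data_range=1.0)
print(f"SSIM = {ssim:.3f}, PSNR = {psnr:.2f} dB")
```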
[CV-220] FodFoM: Fake Outlier Data by Foundation Models Creates Stronger Visual Out-of-Distribution Detector
【速读】: 该论文试图解决在开放世界应用中部署机器学习模型时面临的分布外检测(Out-of-Distribution, OOD)问题,特别是模型对OOD数据的过度自信问题。解决方案的关键在于提出了一种新颖的OOD检测框架FodFoM,该框架通过结合多个基础模型(如BLIP-2、CLIP、Stable Diffusion和GroundingDINO)生成两种类型的挑战性假异常图像,用于分类器训练。第一种类型利用BLIP-2的图像描述能力、CLIP的视觉-语言知识和Stable Diffusion的图像生成能力,生成与分布内(In-Distribution, ID)图像语义相似但不同的假异常图像。第二种类型则利用GroundingDINO的对象检测能力,通过模糊ID图像中的前景对象来构建纯背景图像。该框架能够灵活地与多种现有的OOD检测方法结合,并通过实验验证了其在多个基准测试中实现了最先进的OOD检测性能。
链接: https://arxiv.org/abs/2412.05293
作者: Jiankang Chen,Ling Deng,Zhiyong Gan,Wei-Shi Zheng,Ruixuan Wang
关键词-EN: OOD detection, OOD, OOD detection performance, open-world applications, detection
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 7 figures
点击查看摘要
Abstract:Out-of-Distribution (OOD) detection is crucial when deploying machine learning models in open-world applications. The core challenge in OOD detection is mitigating the model’s overconfidence on OOD data. While recent methods using auxiliary outlier datasets or synthesizing outlier features have shown promising OOD detection performance, they are limited due to costly data collection or simplified assumptions. In this paper, we propose a novel OOD detection framework FodFoM that innovatively combines multiple foundation models to generate two types of challenging fake outlier images for classifier training. The first type is based on BLIP-2’s image captioning capability, CLIP’s vision-language knowledge, and Stable Diffusion’s image generation ability. Jointly utilizing these foundation models constructs fake outlier images which are semantically similar to but different from in-distribution (ID) images. For the second type, GroundingDINO’s object detection ability is utilized to help construct pure background images by blurring foreground ID objects in ID images. The proposed framework can be flexibly combined with multiple existing OOD detection methods. Extensive empirical evaluations show that image classifiers with the help of constructed fake images can more accurately differentiate real OOD images from ID ones. New state-of-the-art OOD detection performance is achieved on multiple benchmarks. The code is available at \urlthis https URL.
zh
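代码示例(仅为示意):第二类假异常图像是把 ID 图像中的前景目标模糊掉、只保留背景。假设前景框已由 GroundingDINO 等检测器给出,下面用 OpenCV 的高斯模糊做一个简化版本;图像路径与框坐标均为假设。

```python
import cv2

def blur_foreground(image, boxes, ksize=(51, 51)):
    """把各个前景框内的区域做强高斯模糊,得到近似“纯背景”图像(示意)。
    boxes: [(x1, y1, x2, y2), ...],假设来自目标检测器。"""
    out = image.copy()
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = cv2.GaussianBlur(out[y1:y2, x1:x2], ksize, 0)
    return out

img = cv2.imread("id_image.jpg")                              # 假设的 ID 图像路径
background_like = blur_foreground(img, [(50, 40, 200, 220)])  # 框坐标为假设值
```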
[CV-221] TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection
【速读】: 该论文试图解决智能模型在处理分布外数据(Out-of-distribution, OOD)时出现的过度自信问题。解决方案的关键在于提出了一种新的学习框架,该框架利用基于拼图的假OOD数据(Jigsaw-based fake OOD data)和从ChatGPT生成的丰富语义嵌入(rich semantic embeddings, anchors)来指导图像编码器的训练。通过结合现有的OOD检测后处理方法,该框架在多个OOD检测基准上实现了新的最先进性能。
链接: https://arxiv.org/abs/2412.05292
作者: Jiankang Chen,Tong Zhang,Wei-Shi Zheng,Ruixuan Wang
关键词-EN: OOD detection, OOD, Jigsaw-based fake OOD, real-world applications, OOD data
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 4 figures
点击查看摘要
Abstract:Out-of-distribution (OOD) detection is crucial in many real-world applications. However, intelligent models are often trained solely on in-distribution (ID) data, leading to overconfidence when misclassifying OOD data as ID classes. In this study, we propose a new learning framework which leverage simple Jigsaw-based fake OOD data and rich semantic embeddings (`anchors’) from the ChatGPT description of ID knowledge to help guide the training of the image encoder. The learning framework can be flexibly combined with existing post-hoc approaches to OOD detection, and extensive empirical evaluations on multiple OOD detection benchmarks demonstrate that rich textual representation of ID knowledge and fake OOD knowledge can well help train a visual encoder for OOD detection. With the learning framework, new state-of-the-art performance was achieved on all the benchmarks. The code is available at \urlthis https URL.
zh
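代码示例(仅为示意):“基于拼图的假 OOD 数据”可以理解为把 ID 图像切块后随机打乱,破坏全局语义但保留局部纹理。下面给出一个 numpy 版草图;块数与打乱方式均为假设,并非论文的确切设置。

```python
import numpy as np

def jigsaw_shuffle(image, grid=3, rng=None):
    """把 HxWxC 图像切成 grid x grid 个块并随机打乱,生成拼图式假 OOD 样本(示意)。"""
    rng = rng or np.random.default_rng()
    h, w = image.shape[0] // grid * grid, image.shape[1] // grid * grid
    img = image[:h, :w]
    bh, bw = h // grid, w // grid
    blocks = [img[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
              for i in range(grid) for j in range(grid)]
    order = rng.permutation(len(blocks))
    rows = [np.concatenate([blocks[order[r * grid + c]] for c in range(grid)], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

fake_ood = jigsaw_shuffle(np.random.rand(96, 96, 3), grid=3)
```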
[CV-222] FedSynthCT-Brain: A Federated Learning Framework for Multi-Institutional Brain MRI-to-CT Synthesis
链接: https://arxiv.org/abs/2412.06690
作者: Ciro Benito Raggio,Mathias Krohmer Zabaleta,Nils Skupien,Oliver Blanck,Francesco Cicone,Giuseppe Lucio Cascini,Paolo Zaffino,Lucia Migliorelli,Maria Francesca Spadea
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-223] Diff5T: Benchmarking Human Brain Diffusion MRI with an Extensive 5.0 Tesla K-Space and Spatial Dataset
【速读】: 该论文试图解决高场强(5.0 Tesla)扩散磁共振成像(dMRI)数据集的稀缺问题,特别是缺乏包含原始k空间数据的开源数据集,这些数据对于高级研究至关重要。解决方案的关键在于引入了Diff5T数据集,这是一个全面的人脑5.0 Tesla dMRI数据集,包含了原始k空间数据和重建的扩散图像,以及多种成像协议。该数据集设计用于支持创新方法的开发和基准测试,涵盖伪影校正、图像重建、图像预处理、扩散建模和纤维束追踪等领域。通过提供广泛的扩散参数(如多b值和梯度方向),Diff5T为研究人脑微结构和连接性提供了广泛的应用前景,同时强调了开放访问和详细基准测试,促进了神经科学和医学影像领域的可重复性和协作。
链接: https://arxiv.org/abs/2412.06666
作者: Shanshan Wang,Shoujun Yu,Jian Cheng,Sen Jia,Changjun Tie,Jiayu Zhu,Haohao Peng,Yijing Dong,Jianzhong He,Fan Zhang,Yaowen Xing,Xiuqin Jia,Qi Yang,Qiyuan Tian,Hua Guo,Guobin Li,Hairong Zheng
关键词-EN: Diffusion magnetic resonance, magnetic resonance imaging, magnetic resonance, critical insights, microstructural and connectional
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 19 pages, 4 figures, 1 table
点击查看摘要
Abstract:Diffusion magnetic resonance imaging (dMRI) provides critical insights into the microstructural and connectional organization of the human brain. However, the availability of high-field, open-access datasets that include raw k-space data for advanced research remains limited. To address this gap, we introduce Diff5T, a first comprehensive 5.0 Tesla diffusion MRI dataset focusing on the human brain. This dataset includes raw k-space data and reconstructed diffusion images, acquired using a variety of imaging protocols. Diff5T is designed to support the development and benchmarking of innovative methods in artifact correction, image reconstruction, image preprocessing, diffusion modelling and tractography. The dataset features a wide range of diffusion parameters, including multiple b-values and gradient directions, allowing extensive research applications in studying human brain microstructure and connectivity. With its emphasis on open accessibility and detailed benchmarks, Diff5T serves as a valuable resource for advancing human brain mapping research using diffusion MRI, fostering reproducibility, and enabling collaboration across the neuroscience and medical imaging communities.
zh
[CV-224] Fundus Image-based Visual Acuity Assessment with PAC-Guarantees ML4H2024
链接: https://arxiv.org/abs/2412.06624
作者: Sooyong Jang,Kuk Jin Jang,Hyonyoung Choi,Yong-Seop Han,Seongjin Lee,Jin-hyun Kim,Insup Lee
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in ML4H 2024
[CV-225] A No-Reference Medical Image Quality Assessment Method Based on Automated Distortion Recognition Technology: Application to Preprocessing in MRI-guided Radiotherapy
链接: https://arxiv.org/abs/2412.06599
作者: Zilin Wang,Shengqi Chen,Jianrong Dai,Shirui Qin,Ying Cao,Ruiao Zhao,Guohua Wu,Yuan Tang,Jiayun Chen
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
[CV-226] HES-UNet: A U-Net for Hepatic Echinococcosis Lesion Segmentation
链接: https://arxiv.org/abs/2412.06530
作者: Jiayan Chen,Kai Li,Zhanjin Wang,Zhan Wang,Jianqiang Huang
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures
[CV-227] BATseg: Boundary-aware Multiclass Spinal Cord Tumor Segmentation on 3D MRI Scans ECCV2024
链接: https://arxiv.org/abs/2412.06507
作者: Hongkang Song,Zihui Zhang,Yanpeng Zhou,Jie Hu,Zishuo Wang,Hou Him Chan,Chon Lok Lei,Chen Xu,Yu Xin,Bo Yang
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ECCV 2024 Workshop on BioImage Computing. Code and data are available at: this https URL
[CV-228] Improving text-conditioned latent diffusion for cancer pathology
链接: https://arxiv.org/abs/2412.06487
作者: Aakash Madhav Rao,Debayan Gupta
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-229] Echocardiography to Cardiac MRI View Transformation for Real-Time Blind Restoration
【速读】: 该论文试图解决超声心动图(echocardiography)在临床诊断中因噪声、对比度不足、严重饱和及心肌节段缺失等问题而限制其使用的问题。解决方案的关键在于提出一种将超声心动图转换为心脏磁共振成像(cardiac MRI)视图的新方法。为此,研究构建了包含超声心动图与真实心脏MRI图像对的Echo2MRI数据集,并训练了一个专门的Cycle-consistent Generative Adversarial Network (Cycle-GAN),以学习从超声心动图帧到心脏MRI视图的转换。通过定性和医学专家评估,该方法能够合成高质量、无伪影的合成心脏MRI视图,且在78.9%的病例中,合成MRI视图被认为与原始MRI视图难以区分,并优于原始超声心动图序列用于诊断。
链接: https://arxiv.org/abs/2412.06445
作者: Ilke Adalioglu,Serkan Kiranyaz,Mete Ahishali,Aysen Degerli,Tahir Hamid,Rahmat Ghaffar,Ridha Hamila,Moncef Gabbouj
关键词-EN: monitor cardiac functions, cardiac MRI, MRI, ischemia and infarction, MRI views
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages, 42 figures
点击查看摘要
Abstract:Echocardiography is the most widely used imaging to monitor cardiac functions, serving as the first line in early detection of myocardial ischemia and infarction. However, echocardiography often suffers from several artifacts including sensor noise, lack of contrast, severe saturation, and missing myocardial segments which severely limit its usage in clinical diagnosis. In recent years, several machine learning methods have been proposed to improve echocardiography views. Yet, these methods usually address only a specific problem (e.g. denoising) and thus cannot provide a robust and reliable restoration in general. On the other hand, cardiac MRI provides a clean view of the heart without suffering such severe issues. However, due to its significantly higher cost, it is often only afforded by a few major hospitals, hence hindering its use and accessibility. In this pilot study, we propose a novel approach to transform echocardiography into the cardiac MRI view. For this purpose, Echo2MRI dataset, consisting of echocardiography and real cardiac MRI image pairs, is composed and will be shared publicly. A dedicated Cycle-consistent Generative Adversarial Network (Cycle-GAN) is trained to learn the transformation from echocardiography frames to cardiac MRI views. An extensive set of qualitative evaluations shows that the proposed transformer can synthesize high-quality artifact-free synthetic cardiac MRI views from a given sequence of echocardiography frames. Medical evaluations performed by a group of cardiologists further demonstrate that synthetic MRI views are indistinguishable from their original counterparts and are preferred over their initial sequence of echocardiography frames for diagnosis in 78.9% of the cases.
zh
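代码示例(仅为示意):Cycle-GAN 的核心约束是循环一致性:超声→MRI→超声、MRI→超声→MRI 都应还原输入。下面给出该损失项的 PyTorch 草图;生成器用占位模块代替,权重 lam 取常见默认值而非论文设定。

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_echo2mri, G_mri2echo, echo, mri, lam=10.0):
    """L_cyc = lam * ( ||G_mri2echo(G_echo2mri(echo)) - echo||_1 + ||G_echo2mri(G_mri2echo(mri)) - mri||_1 )"""
    rec_echo = G_mri2echo(G_echo2mri(echo))
    rec_mri = G_echo2mri(G_mri2echo(mri))
    return lam * (F.l1_loss(rec_echo, echo) + F.l1_loss(rec_mri, mri))

# 用法示意:以恒等映射作为生成器占位
identity = torch.nn.Identity()
loss = cycle_consistency_loss(identity, identity,
                              torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64))
```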
[CV-230] CAD-Unet: A Capsule Network-Enhanced Unet Architecture for Accurate Segmentation of COVID-19 Lung Infections from CT Images
链接: https://arxiv.org/abs/2412.06314
作者: Yijie Dang,Weijun Ma,Xiaohu Luo
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-231] A CT Image Denoising Method Based on Projection Domain Feature
链接: https://arxiv.org/abs/2412.06135
作者: Mengyu Sun,Dimeng Xia,Shusen Zhao,Weibin Zhang,Yaobin He
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 5 figures, uses this http URL
[CV-232] Dilated Balanced Cross Entropy Loss for Medical Image Segmentation
链接: https://arxiv.org/abs/2412.06045
作者: Seyed Mohsen Hosseini,Mahdieh Soleymani Baghshah
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-233] Vision Transformer-based Semantic Communications With Importance-Aware Quantization
链接: https://arxiv.org/abs/2412.06038
作者: Joohyuk Park,Yongjeong Oh,Yongjune Kim,Yo-Seb Jeon
关键词-EN:
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注:
[CV-234] TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model
链接: https://arxiv.org/abs/2412.06011
作者: Meilong Xu,Saumya Gupta,Xiaoling Hu,Chen Li,Shahira Abousamra,Dimitris Samaras,Prateek Prasanna,Chao Chen
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures
[CV-235] LVS-Net: A Lightweight Vessels Segmentation Network for Retinal Image Analysis
链接: https://arxiv.org/abs/2412.05968
作者: Mehwish Mehmood,Shahzaib Iqbal,Tariq Mahmood Khan,Ivor Spence,Muhammad Fahim
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-236] Unsupervised Multi-Parameter Inverse Solving for Reducing Ring Artifacts in 3D X-Ray CBCT
链接: https://arxiv.org/abs/2412.05853
作者: Qing Wu,Hongjiang Wei,Jingyi Yu,Yuyao Zhang
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages
[CV-237] Emulating Clinical Quality Muscle B-mode Ultrasound Images from Plane Wave Images Using a Two-Stage Machine Learning Model
链接: https://arxiv.org/abs/2412.05758
作者: Reed Chen,Courtney Trutna Paley,Wren Wightman,Lisa Hobson-Webb,Yohei Harada,Felix Jin,Ouwen Huang,Mark Palmeri,Kathryn Nightingale
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 10 figures
[CV-238] Early Diagnosis of Alzheimers Diseases and Dementia from MRI Images Using an Ensemble Deep Learning
链接: https://arxiv.org/abs/2412.05666
作者: Mozhgan Naderi,Maryam Rastgarpour,Amir Reza Takhsha
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-239] Self-Supervised Masked Mesh Learning for Unsupervised Anomaly Detection on 3D Cortical Surfaces
链接: https://arxiv.org/abs/2412.05580
作者: Hao-Chun Yang,Sicheng Dai,Saige Rutherford,Christian Gaser Andre F Marquand,Christian F Beckmann,Thomas Wolfers
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-240] Test-time Cost-and-Quality Controllable Arbitrary-Scale Super-Resolution with Variable Fourier Components
链接: https://arxiv.org/abs/2412.05517
作者: Kazutoshi Akita,Norimichi Ukita
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 10 figures
[CV-241] A Comparative Study of Image Denoising Algorithms
链接: https://arxiv.org/abs/2412.05490
作者: Muhammad Umair Danish
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-242] Accurate early detection of Parkinsons disease from SPECT imaging through Convolutional Neural Networks
链接: https://arxiv.org/abs/2412.05348
作者: R. Prashanth
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
备注:
[CV-243] Osteoporosis Prediction from Hand X-ray Images Using Segmentation-for-Classification and Self-Supervised Learning
链接: https://arxiv.org/abs/2412.05345
作者: Ung Hwang,Chang-Hun Lee,Kijung Yoon
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-244] Equivariant Denoisers for Image Restoration
链接: https://arxiv.org/abs/2412.05343
作者: Marien Renaud,Arthur Leclaire,Nicolas Papadakis
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
[CV-245] rho-NeRF: Leveraging Attenuation Priors in Neural Radiance Field for 3D Computed Tomography Reconstruction CVPR2025
链接: https://arxiv.org/abs/2412.05322
作者: Li Zhou,Changsheng Fang,Bahareh Morovati,Yongtong Liu,Shuo Han,Yongshun Xu,Hengyong Yu
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: The paper was submitted to CVPR 2025
人工智能
[AI-0] AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation
链接: https://arxiv.org/abs/2412.06779
作者: Guanxing Lu,Tengbo Yu,Haoyuan Deng,Season Si Chen,Yansong Tang,Ziwei Wang
关键词-EN: Performing general language-conditioned, bimanual manipulation, bimanual manipulation tasks, Performing general, general bimanual manipulation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL
点击查看摘要
Abstract:Performing general language-conditioned bimanual manipulation tasks is of great importance for many applications ranging from household service to industrial assembly. However, collecting bimanual manipulation data is expensive due to the high-dimensional action space, which poses challenges for conventional methods to handle general bimanual manipulation tasks. In contrast, unimanual policy has recently demonstrated impressive generalizability across a wide range of tasks because of scaled model parameters and training data, which can provide sharable manipulation knowledge for bimanual systems. To this end, we propose a plug-and-play method named AnyBimanual, which transfers pre-trained unimanual policy to general bimanual manipulation policy with few bimanual demonstrations. Specifically, we first introduce a skill manager to dynamically schedule the skill representations discovered from pre-trained unimanual policy for bimanual manipulation tasks, which linearly combines skill primitives with task-oriented compensation to represent the bimanual manipulation instruction. To mitigate the observation discrepancy between unimanual and bimanual systems, we present a visual aligner to generate soft masks for visual embedding of the workspace, which aims to align visual input of unimanual policy model for each arm with those during pretraining stage. AnyBimanual shows superiority on 12 simulated tasks from RLBench2 with a sizable 12.67% improvement in success rate over previous methods. Experiments on 9 real-world tasks further verify its practicality with an average success rate of 84.62%.
[AI-1] XRZoo: A Large-Scale and Versatile Dataset of Extended Reality (XR) Applications
链接: https://arxiv.org/abs/2412.06759
作者: Shuqing Li,Chenran Zhang,Cuiyun Gao,Michael R. Lyu
关键词-EN: Extended Reality, spatial computing technologies, computing technologies forms, enabling innovative applications, emerging Metaverse
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:The rapid advancement of Extended Reality (XR, encompassing AR, MR, and VR) and spatial computing technologies forms a foundational layer for the emerging Metaverse, enabling innovative applications across healthcare, education, manufacturing, and entertainment. However, research in this area is often limited by the lack of large, representative, and highquality application datasets that can support empirical studies and the development of new approaches benefiting XR software processes. In this paper, we introduce XRZoo, a comprehensive and curated dataset of XR applications designed to bridge this gap. XRZoo contains 12,528 free XR applications, spanning nine app stores, across all XR techniques (i.e., AR, MR, and VR) and use cases, with detailed metadata on key aspects such as application descriptions, application categories, release dates, user review numbers, and hardware specifications, etc. By making XRZoo publicly available, we aim to foster reproducible XR software engineering and security research, enable cross-disciplinary investigations, and also support the development of advanced XR systems by providing examples to developers. Our dataset serves as a valuable resource for researchers and practitioners interested in improving the scalability, usability, and effectiveness of XR applications. XRZoo will be released and actively maintained.
[AI-2] Source Separation & Automatic Transcription for Music
链接: https://arxiv.org/abs/2412.06703
作者: Bradford Derby,Lucas Dunker,Samarth Galchar,Shashank Jarmale,Akash Setti
关键词-EN: isolating individual sounds, Source separation, Automatic Music Transcription, digital audio production, lyric transcription
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:Source separation is the process of isolating individual sounds in an auditory mixture of multiple sounds [1], and has a variety of applications ranging from speech enhancement and lyric transcription [2] to digital audio production for music. Furthermore, Automatic Music Transcription (AMT) is the process of converting raw music audio into sheet music that musicians can read [3]. Historically, these tasks have faced challenges such as significant audio noise, long training times, and lack of free-use data due to copyright restrictions. However, recent developments in deep learning have brought new promising approaches to building low-distortion stems and generating sheet music from audio signals [4]. Using spectrogram masking, deep neural networks, and the MuseScore API, we attempt to create an end-to-end pipeline that allows for an initial music audio mixture (e.g. .wav file) to be separated into instrument stems, converted into MIDI files, and transcribed into sheet music for each component instrument.
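代码示例(仅为示意):谱图掩码式分离的基本流程是 STFT → 预测掩码 → 掩码乘回混合谱 → 逆 STFT 还原波形。下面用 librosa 演示这一流程;音频路径为假设,掩码在此用随机张量占位,实际中应由分离网络输出。

```python
import numpy as np
import librosa

mix, sr = librosa.load("mixture.wav", sr=None, mono=True)   # 假设的混合音频路径
spec = librosa.stft(mix, n_fft=2048, hop_length=512)         # 复数谱
mag, phase = np.abs(spec), np.angle(spec)

mask = np.clip(np.random.rand(*mag.shape), 0.0, 1.0)  # 占位:实际应由分离网络预测
stem_mag = mask * mag                                   # 掩码作用在幅度谱上
stem = librosa.istft(stem_mag * np.exp(1j * phase), hop_length=512, length=len(mix))
```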
[AI-3] Digital Transformation in the Water Distribution System based on the Digital Twins Concept
链接: https://arxiv.org/abs/2412.06694
作者: MohammadHossein Homaei,Agustín Javier Di Bartolo,Mar Ávila,Óscar Mogollón-Gutiérrez,Andrés Caro
关键词-EN: offering real-time monitoring, Digital Twins, Twins have emerged, Machine Learning models, great potential
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 78 pages, 18 figures
点击查看摘要
Abstract:Digital Twins have emerged as a disruptive technology with great potential; they can enhance WDS by offering real-time monitoring, predictive maintenance, and optimization capabilities. This paper describes the development of a state-of-the-art DT platform for WDS, introducing advanced technologies such as the Internet of Things, Artificial Intelligence, and Machine Learning models. This paper provides insight into the architecture of the proposed platform-CAUCCES-that, informed by both historical and meteorological data, effectively deploys AI/ML models like LSTM networks, Prophet, LightGBM, and XGBoost in trying to predict water consumption patterns. Furthermore, we delve into how optimization in the maintenance of WDS can be achieved by formulating a Constraint Programming problem for scheduling, hence minimizing the operational cost efficiently with reduced environmental impacts. It also focuses on cybersecurity and protection to ensure the integrity and reliability of the DT platform. In this view, the system will contribute to improvements in decision-making capabilities, operational efficiency, and system reliability, with reassurance being drawn from the important role it can play toward sustainable management of water resources.
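代码示例(仅为示意):文中把维护排程写成约束规划问题以降低运营成本。下面用 OR-Tools 的 CP-SAT 给出一个极简示意:若干维护任务在同一班组上不重叠地排入计划期,并最小化一个简化的成本目标;任务数据与目标函数均为假设,并非 CAUCCES 平台的真实模型。

```python
from ortools.sat.python import cp_model

durations = [2, 3, 1]        # 各维护任务工期(天),假设数据
daily_cost = [100, 80, 120]  # 各任务的延迟成本权重,假设数据
horizon = 10                 # 计划期(天)

model = cp_model.CpModel()
starts, intervals = [], []
for i, d in enumerate(durations):
    s = model.NewIntVar(0, horizon - d, f"start_{i}")
    intervals.append(model.NewIntervalVar(s, d, s + d, f"task_{i}"))
    starts.append(s)
model.AddNoOverlap(intervals)  # 同一班组同一时间只做一个任务
model.Minimize(sum(s * c for s, c in zip(starts, daily_cost)))  # 简化目标:加权尽早开工

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("各任务开始时间:", [solver.Value(s) for s in starts])
```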
[AI-4] Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone
链接: https://arxiv.org/abs/2412.06685
作者: Max Sobol Mark,Tian Gao,Georgia Gabriela Sampaio,Mohan Kumar Srirama,Archit Sharma,Chelsea Finn,Aviral Kumar
关键词-EN: Recent advances, expressive policy models, policy, imitation learning, learning decision-making policies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, largely via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes, with varying architectures and sizes. We build off the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied on “optimized” actions. To obtain these optimized actions, we first sample multiple actions from a base policy, and run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves the performance and sample-efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
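代码示例(仅为示意):PA-RL 用“优化后的动作”作为监督目标来替代传统的策略改进步骤,其中全局优化是按 Q 值对多条采样动作重排序,局部优化是对选出的动作做若干步梯度上升。下面的 PyTorch 草图演示这一动作优化过程;策略与 Q 网络均为占位,超参数为假设值,并非论文原始实现。

```python
import torch

def optimize_action(policy, q_fn, state, n_samples=8, n_grad_steps=5, lr=0.05):
    """全局优化:按 Q 值挑选最优候选动作;局部优化:对其做梯度上升(示意)。"""
    with torch.no_grad():
        candidates = torch.stack([policy(state) for _ in range(n_samples)])
        q_vals = torch.stack([q_fn(state, a) for a in candidates])
        best = candidates[q_vals.argmax()]
    action = best.clone().requires_grad_(True)
    opt = torch.optim.Adam([action], lr=lr)
    for _ in range(n_grad_steps):
        opt.zero_grad()
        (-q_fn(state, action)).backward()  # 最大化 Q 等价于最小化 -Q
        opt.step()
    return action.detach()  # 之后可用任意监督损失把策略拟合到该动作上

# 占位示例:随机策略与一个简单的 Q 函数
policy = lambda s: torch.randn(4)
q_fn = lambda s, a: -(a ** 2).sum()
print(optimize_action(policy, q_fn, torch.randn(8)))
```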
[AI-5] Toward LLM-Agent-Based Modeling of Transportation Systems: A Conceptual Framework
链接: https://arxiv.org/abs/2412.06681
作者: Tianming Liu,Jirong Yang,Yafeng Yin
关键词-EN: existing agent-based models, agent-based models, microsimulations are current, system demand modeling, existing agent-based
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:In transportation system demand modeling and simulation, agent-based models and microsimulations are current state-of-the-art approaches. However, existing agent-based models still have some limitations on behavioral realism and resource demand that limit their applicability. In this study, leveraging the emerging technology of large language models (LLMs) and LLM-based agents, we propose a general LLM-agent-based modeling framework for transportation systems. We argue that LLM agents not only possess the essential capabilities to function as agents but also offer promising solutions to overcome some limitations of existing agent-based models. Our conceptual framework design closely replicates the decision-making and interaction processes and traits of human travelers within transportation networks, and we demonstrate that the proposed systems can meet critical behavioral criteria for decision-making and learning behaviors using related studies and a demonstrative example of LLM agents’ learning and adjustment in the bottleneck setting. Although further refinement of the LLM-agent-based modeling framework is necessary, we believe that this approach has the potential to improve transportation system modeling and simulation.
[AI-6] Semantic Search and Recommendation Algorithm
链接: https://arxiv.org/abs/2412.06649
作者: Aryan Duhan,Aryan Singhal,Shourya Sharma,Neeraj,Arti MK
关键词-EN: Annoy Index, semantic search algorithm, Index to improve, paper introduces, improve the efficiency
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
*备注: 6 pages, 5 Figures
点击查看摘要
Abstract:This paper introduces a new semantic search algorithm that uses Word2Vec and Annoy Index to improve the efficiency of information retrieval from large datasets. The proposed approach addresses the limitations of traditional search methods by offering enhanced speed, accuracy, and scalability. Testing on datasets up to 100GB demonstrates the method’s effectiveness in processing vast amounts of data while maintaining high precision and performance.
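代码示例(仅为示意):Word2Vec 负责把文本映射为向量,Annoy 负责近似最近邻检索。下面给出一条最小流水线,文档向量用词向量平均代替(论文的具体做法未在摘要中给出),语料与参数均为假设。

```python
import numpy as np
from gensim.models import Word2Vec
from annoy import AnnoyIndex

docs = [["water", "pipe", "leak"], ["model", "training", "data"], ["river", "flood", "water"]]
w2v = Word2Vec(sentences=docs, vector_size=64, min_count=1, epochs=50)

def doc_vector(tokens):
    """示意:以词向量平均值作为文档向量。"""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(64, dtype=np.float32)

index = AnnoyIndex(64, "angular")       # 角距离(余弦)度量
for i, d in enumerate(docs):
    index.add_item(i, doc_vector(d))
index.build(n_trees=10)

query = doc_vector(["water", "leak"])
print(index.get_nns_by_vector(query, 2, include_distances=True))
```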
[AI-7] Advancing Music Therapy: Integrating Eastern Five-Element Music Theory and Western Techniques with AI in the Novel Five-Element Harmony System
链接: https://arxiv.org/abs/2412.06600
作者: Yubo Zhou,Weizhen Bian,Kaitai Zhang,Xiaohan Gu
关键词-EN: music therapy, traditional Chinese medicine, Technology and Artificial, therapy, traditional medical practices
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 5 pages, 1 figure. Accepted for Publication in the International Symposium on Chinese Spoken Language Processing
点击查看摘要
Abstract:In traditional medical practices, music therapy has proven effective in treating various psychological and physiological ailments. Particularly in Eastern traditions, the Five Elements Music Therapy (FEMT), rooted in traditional Chinese medicine, possesses profound cultural significance and unique therapeutic philosophies. With the rapid advancement of Information Technology and Artificial Intelligence, applying these modern technologies to FEMT could enhance the personalization and cultural relevance of the therapy and potentially improve therapeutic outcomes. In this article, we developed a music therapy system for the first time by applying the theory of the five elements in music therapy to practice. This innovative approach integrates advanced Information Technology and Artificial Intelligence with Five-Element Music Therapy (FEMT) to enhance personalized music therapy practices. As traditional music therapy predominantly follows Western methodologies, the unique aspects of Eastern practices, specifically the Five-Element theory from traditional Chinese medicine, should be considered. This system aims to bridge this gap by utilizing computational technologies to provide a more personalized, culturally relevant, and therapeutically effective music therapy experience.
[AI-8] EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations
链接: https://arxiv.org/abs/2412.06581
作者: Weizhen Bian,Yubo Zhou,Kaitai Zhang,Xiaohan Gu
关键词-EN: closely matching, target speaker, improved the quality, quality of generated, matching the timbre
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 4 pages, 1 figure. To appear in the Proceedings of the International Symposium on Chinese Spoken Language Processing, 7-10 November 2024, Beijing, China
点击查看摘要
Abstract:Advances in text-to-speech (TTS) technology have significantly improved the quality of generated speech, closely matching the timbre and intonation of the target speaker. However, due to the inherent complexity of human emotional expression, the development of TTS systems capable of controlling subtle emotional differences remains a formidable challenge. Existing emotional speech databases often suffer from overly simplistic labelling schemes that fail to capture a wide range of emotional states, thus limiting the effectiveness of emotion synthesis in TTS applications. To this end, recent efforts have focussed on building databases that use natural language annotations to describe speech emotions. However, these approaches are costly and require more emotional depth to train robust systems. In this paper, we propose a novel process aimed at building databases by systematically extracting emotion-rich speech segments and annotating them with detailed natural language descriptions through a generative model. This approach enhances the emotional granularity of the database and significantly reduces the reliance on costly manual annotations by automatically augmenting the data with high-level language models. The resulting rich database provides a scalable and economically viable solution for developing a more nuanced and dynamic basis for developing emotionally controlled TTS systems.
[AI-9] Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families
链接: https://arxiv.org/abs/2412.06540
作者: Felipe Maia Polo,Seamus Somerstep,Leshem Choshen,Yuekai Sun,Mikhail Yurochkin
关键词-EN: large language models, Scaling laws, Skills Scaling Laws, Scaling, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Scaling laws for large language models (LLMs) predict model performance based on parameters like size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, training family-specific scaling laws requires training models of varying sizes for every family. In this work, we propose Skills Scaling Laws (SSLaws, pronounced as Sloth), a novel scaling law that leverages publicly available benchmark data and assumes LLM performance is driven by low-dimensional latent skills, such as reasoning and instruction following. These latent skills are influenced by computational resources like model size and training tokens but with varying efficiencies across model families. Sloth exploits correlations across benchmarks to provide more accurate and interpretable predictions while alleviating the need to train multiple LLMs per family. We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks, from Open LLM Leaderboard v1/v2, demonstrating that Sloth predicts LLM performance efficiently and offers insights into scaling behaviors for downstream tasks such as coding and emotional intelligence applications.
[AI-10] Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
链接: https://arxiv.org/abs/2412.06531
作者: Egor Cherepanov,Nikita Kachaev,Artem Zholus,Alexey K. Kovalev,Aleksandr I. Panov
关键词-EN: Reinforcement Learning, domain of Reinforcement, memory, agent memory, essential for numerous
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 6 figures
点击查看摘要
Abstract:The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the utilization of past information, adaptation to novel environments, and improved sample efficiency. However, the term ``memory’’ encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent’s memory, leads to erroneous judgments about agents’ memory capabilities and prevents objective comparison with other memory-enhanced agents. This paper aims to streamline the concept of memory in RL by providing practical precise definitions of agent memory types, such as long-term versus short-term memory and declarative versus procedural memory, inspired by cognitive science. Using these definitions, we categorize different classes of agent memory, propose a robust experimental methodology for evaluating the memory capabilities of RL agents, and standardize evaluations. Furthermore, we empirically demonstrate the importance of adhering to the proposed methodology when evaluating different types of agent memory by conducting experiments with different RL agents and what its violation leads to.
[AI-11] SimuDICE: Offline Policy Optimization Through World Model Updates and DICE Estimation
链接: https://arxiv.org/abs/2412.06486
作者: Catalin E. Brita,Stephan Bongers,Frans A. Oliehoek
关键词-EN: limited sample size, offline reinforcement learning, Model-based reinforcement learning, reinforcement learning, reinforcement learning improves
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at BNAIC/BeNeLearn 2024
点击查看摘要
Abstract:In offline reinforcement learning, deriving an effective policy from a pre-collected set of experiences is challenging due to the distribution mismatch between the target policy and the behavioral policy used to collect the data, as well as the limited sample size. Model-based reinforcement learning improves sample efficiency by generating simulated experiences using a learned dynamic model of the environment. However, these synthetic experiences often suffer from the same distribution mismatch. To address these challenges, we introduce SimuDICE, a framework that iteratively refines the initial policy derived from offline data using synthetically generated experiences from the world model. SimuDICE enhances the quality of these simulated experiences by adjusting the sampling probabilities of state-action pairs based on stationary DIstribution Correction Estimation (DICE) and the estimated confidence in the model’s predictions. This approach guides policy improvement by balancing experiences similar to those frequently encountered with ones that have a distribution mismatch. Our experiments show that SimuDICE achieves performance comparable to existing algorithms while requiring fewer pre-collected experiences and planning steps, and it remains robust across varying data collection policies.
[AI-12] How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning
链接: https://arxiv.org/abs/2412.06451
作者: Yuanyuan Wang,Qian Song,Dawood Wasif,Muhammad Shahzad,Christoph Koller,Jonathan Bamber,Xiao Xiang Zhu
关键词-EN: machine learning models, Earth observation, reliability of Earth, machine learning, learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: Submitted to IEEE Geoscience and Remote Sensing Magazine
点击查看摘要
Abstract:Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods do exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of the ground truth for uncertainty, i.e. how certain the uncertainty estimates are, apart from the labels for the image/signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting a more accurate and reliable quality measure of the outputs of machine learning models. The dataset and code are accessible via this https URL.
[AI-13] Simulating Human-like Daily Activities with Desire-driven Autonomy
链接: https://arxiv.org/abs/2412.06435
作者: Yiding Wang,Yuxuan Chen,Fangwei Zhong,Long Ma,Yizhou Wang
关键词-EN: Large Language Model-based, Existing task-oriented, Desire-driven Autonomous Agent, external rewards, limiting their ability
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Existing task-oriented AI agents often depend on explicit instructions or external rewards, limiting their ability to be driven by intrinsic motivations like humans. In this paper, we present a desire-driven autonomy framework to guide a Large Language Model-based (LLM-based) agent to simulate human-like daily activities. In contrast to previous agents, our Desire-driven Autonomous Agent (D2A) operates on the principle of intrinsic desire, allowing it to propose and select tasks that fulfill its motivational framework autonomously. Inspired by the Theory of Needs, the motivational framework incorporates an understanding of human-like desires, such as the need for social interaction, personal fulfillment, and self-care. Utilizing a desire-driven task generation mechanism, the agent evaluates its current state and takes a sequence of activities aligned with its intrinsic motivations. Through simulations, we demonstrate that our Desire-driven Autonomous Agent (D2A) generates coherent, contextually relevant daily activities while exhibiting variability and adaptability similar to human behavior. A comparative analysis with other LLM-based frameworks demonstrates that our approach significantly enhances the rationality of the simulated activities.
[AI-14] BatchTopK Sparse Autoencoders
链接: https://arxiv.org/abs/2412.06410
作者: Bart Bussmann,Patrick Leask,Neel Nanda
关键词-EN: interpreting language model, Sparse autoencoders, language model activations, interpretable features, powerful tool
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting language model activations by decomposing them into sparse, interpretable features. A popular approach is the TopK SAE, that uses a fixed number of the most active latents per sample to reconstruct the model activations. We introduce BatchTopK SAEs, a training method that improves upon TopK SAEs by relaxing the top-k constraint to the batch-level, allowing for a variable number of latents to be active per sample. As a result, BatchTopK adaptively allocates more or fewer latents depending on the sample, improving reconstruction without sacrificing average sparsity. We show that BatchTopK SAEs consistently outperform TopK SAEs in reconstructing activations from GPT-2 Small and Gemma 2 2B, and achieve comparable performance to state-of-the-art JumpReLU SAEs. However, an advantage of BatchTopK is that the average number of latents can be directly specified, rather than approximately tuned through a costly hyperparameter sweep. We provide code for training and evaluating BatchTopK SAEs at this https URL
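代码示例(仅为示意):BatchTopK 把逐样本的 top-k 约束放宽到整个 batch:在 batch 内统一保留 k×batch_size 个最大的潜变量预激活、其余置零,因此不同样本可以激活不同数量的潜变量,而平均稀疏度仍约为 k。下面是该激活规则的 PyTorch 草图,并非官方实现(官方代码见摘要中的链接)。

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """pre_acts: (batch, n_latents) 的 SAE 预激活;在整个 batch 上保留 k*batch 个最大值(示意)。"""
    n_keep = k * pre_acts.shape[0]
    threshold = torch.topk(pre_acts.flatten(), n_keep).values.min()  # batch 级阈值
    return torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))

acts = batch_topk(torch.randn(8, 1024), k=32)  # 平均每个样本约激活 32 个潜变量
print((acts != 0).sum(dim=1))                   # 各样本实际激活数可不相同
```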
[AI-15] Edge Delayed Deep Deterministic Policy Gradient: efficient continuous control for edge scenarios
链接: https://arxiv.org/abs/2412.06390
作者: Alberto Sinigaglia,Niccolò Turcato,Ruggero Carli,Gian Antonio Susto
关键词-EN: gaining increasing attention, Deterministic Policy Gradient, Deep Deterministic Policy, Deep Reinforcement Learning, gaining increasing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Deep Reinforcement Learning is gaining increasing attention thanks to its capability to learn complex policies in high-dimensional settings. Recent advancements utilize a dual-network architecture to learn optimal policies through the Q-learning algorithm. However, this approach has notable drawbacks, such as an overestimation bias that can disrupt the learning process and degrade the performance of the resulting policy. To address this, novel algorithms have been developed that mitigate overestimation bias by employing multiple Q-functions. Edge scenarios, which prioritize privacy, have recently gained prominence. In these settings, limited computational resources pose a significant challenge for complex Machine Learning approaches, making the efficiency of algorithms crucial for their performance. In this work, we introduce a novel Reinforcement Learning algorithm tailored for edge scenarios, called Edge Delayed Deep Deterministic Policy Gradient (EdgeD3). EdgeD3 enhances the Deep Deterministic Policy Gradient (DDPG) algorithm, achieving significantly improved performance with 25% less Graphics Process Unit (GPU) time while maintaining the same memory usage. Additionally, EdgeD3 consistently matches or surpasses the performance of state-of-the-art methods across various benchmarks, all while using 30% fewer computational resources and requiring 30% less memory.
[AI-16] Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit
链接: https://arxiv.org/abs/2412.06370
作者: Joshua Freeman,Chloe Rippe,Edoardo Debenedetti,Maksym Andriushchenko
关键词-EN: York Times, York Times copyright, York Times claims, attention recently due, recently due
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Copyright infringement in frontier LLMs has received much attention recently due to the New York Times v. OpenAI lawsuit, filed in December 2023. The New York Times claims that GPT-4 has infringed its copyrights by reproducing articles for use in LLM training and by memorizing the inputs, thereby publicly displaying them in LLM outputs. Our work aims to measure the propensity of OpenAI’s LLMs to exhibit verbatim memorization in its outputs relative to other LLMs, specifically focusing on news articles. We discover that both GPT and Claude models use refusal training and output filters to prevent verbatim output of the memorized articles. We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, especially beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization. Our findings have practical implications for training: more attention must be placed on preventing verbatim memorization in very large models. Our findings also have legal significance: in assessing the relative memorization capacity of OpenAI’s LLMs, we probe the strength of The New York Times’s copyright infringement claims and OpenAI’s legal defenses, while underscoring issues at the intersection of generative AI, law, and policy.
[AI-17] Measuring Pre-training Data Quality without Labels for Time Series Foundation Models
链接: https://arxiv.org/abs/2412.06368
作者: Songkang Wen,Vasilii Feofanov,Jianfeng Zhang
关键词-EN: time series foundation, time series, series foundation models, foundation model, time series classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Recently, there has been a growing interest in time series foundation models that generalize across different downstream tasks. A key to strong foundation models is a diverse pre-training dataset, which is particularly challenging to collect for time series classification. In this work, we explore the performance of a contrastive-learning-based foundation model as a function of the data used for pre-training. We introduce contrastive accuracy, a new measure to evaluate the quality of the representation space learned by the foundation model. Our experiments reveal the positive correlation between the proposed measure and the accuracy of the model on a collection of downstream tasks. This suggests that the contrastive accuracy can serve as a criterion to search for time series datasets that can enhance the pre-training and improve thereby the foundation model’s generalization.
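The abstract names "contrastive accuracy" but does not spell out its formula, so the following is only one plausible instantiation (an assumption, not the paper's exact definition): the fraction of samples whose augmented view is the nearest embedding among all candidates in the batch.

```python
import numpy as np

def contrastive_accuracy(z_anchor: np.ndarray, z_positive: np.ndarray) -> float:
    # Plausible instantiation (assumption): fraction of anchors whose positive
    # view is the most cosine-similar embedding among all batch candidates.
    a = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    p = z_positive / np.linalg.norm(z_positive, axis=1, keepdims=True)
    best = (a @ p.T).argmax(axis=1)
    return float((best == np.arange(len(z_anchor))).mean())

# toy usage: embeddings of two augmented views of the same time series batch
rng = np.random.default_rng(0)
z = rng.normal(size=(64, 128))
z_aug = z + 0.05 * rng.normal(size=z.shape)    # mild augmentation noise
print(contrastive_accuracy(z, z_aug))           # close to 1.0 for a good space
```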
[AI-18] Augmenting the action space with conventions to improve multi-agent cooperation in Hanabi AAMAS
链接: https://arxiv.org/abs/2412.06333
作者: F. Bredell,H.A. Engelbrecht,J.C. Schoeman
关键词-EN: multi-agent reinforcement learning, card game Hanabi, hidden information, reinforcement learning, remarkable complexity
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper is under review at the journal of autonomous agents and multi-agent systems (JAAMAS)
点击查看摘要
Abstract:The card game Hanabi is considered a strong medium for the testing and development of multi-agent reinforcement learning (MARL) algorithms, due to its cooperative nature, hidden information, limited communication and remarkable complexity. Previous research efforts have explored the capabilities of MARL algorithms within Hanabi, focusing largely on advanced architecture design and algorithmic manipulations to achieve state-of-the-art performance for various numbers of cooperators. However, this often leads to complex solution strategies with high computational cost that require large amounts of training data. For humans to solve the Hanabi game effectively, they require the use of conventions, which often allows for a means to implicitly convey ideas or knowledge based on a predefined, and mutually agreed upon, set of "rules". Multi-agent problems containing partial observability, especially when limited communication is present, can benefit greatly from the use of implicit knowledge sharing. In this paper, we propose a novel approach to augmenting the action space using conventions, which act as special cooperative actions that span over multiple time steps and multiple agents, requiring agents to actively opt in for them to reach fruition. These conventions are based on existing human conventions, and result in a significant improvement on the performance of existing techniques for self-play and cross-play across various numbers of cooperators within Hanabi.
[AI-19] Towards High-Level Modelling in Automated Planning
链接: https://arxiv.org/abs/2412.06312
作者: Carla Davesa Sureda,Joan Espasa Arxer,Ian Miguel,Mateu Villaret Auselle
关键词-EN: Domain Definition Language, Planning Domain Definition, fundamental activity, arising frequently, industrial processes
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Planning is a fundamental activity, arising frequently in many contexts, from daily tasks to industrial processes. The planning task consists of selecting a sequence of actions to achieve a specified goal from specified initial conditions. The Planning Domain Definition Language (PDDL) is the leading language used in the field of automated planning to model planning problems. Previous work has highlighted the limitations of PDDL, particularly in terms of its expressivity. Our interest lies in facilitating the handling of complex problems and enhancing the overall capability of automated planning systems. Unified-Planning is a Python library offering high-level API to specify planning problems and to invoke automated planners. In this paper, we present an extension of the UP library aimed at enhancing its expressivity for high-level problem modelling. In particular, we have added an array type, an expression to count booleans, and the allowance for integer parameters in actions. We show how these facilities enable natural high-level models of three classical planning problems.
[AI-20] PRECISE: Pre-training Sequential Recommenders with Collaborative and Semantic Information
链接: https://arxiv.org/abs/2412.06308
作者: Chonggang Song,Chunxu Shen,Hao Gu,Yaoming Wu,Lingling Yi,Jie Wen,Chuan Chen
关键词-EN: commonly offer diverse, offer diverse content, diverse content scenarios, Real-world recommendation systems, systems commonly offer
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Real-world recommendation systems commonly offer diverse content scenarios for users to interact with. Considering the enormous number of users in industrial platforms, it is infeasible to utilize a single unified recommendation model to meet the requirements of all scenarios. Usually, separate recommendation pipelines are established for each distinct scenario. This practice leads to challenges in comprehensively grasping users’ interests. Recent research endeavors have been made to tackle this problem by pre-training models to encapsulate the overall interests of users. Traditional pre-trained recommendation models mainly capture user interests by leveraging collaborative signals. Nevertheless, a prevalent drawback of these systems is their incapacity to handle long-tail items and cold-start scenarios. With the recent advent of large language models, there has been a significant increase in research efforts focused on exploiting LLMs to extract semantic information for users and items. However, text-based recommendations highly rely on elaborate feature engineering and frequently fail to capture collaborative similarities. To overcome these limitations, we propose a novel pre-training framework for sequential recommendation, termed PRECISE. This framework combines collaborative signals with semantic information. Moreover, PRECISE employs a learning framework that initially models users’ comprehensive interests across all recommendation scenarios and subsequently concentrates on the specific interests of target-scene behaviors. We demonstrate that PRECISE precisely captures the entire range of user interests and effectively transfers them to the target interests. Empirical findings reveal that the PRECISE framework attains outstanding performance on both public and industrial datasets.
[AI-21] DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI
链接: https://arxiv.org/abs/2412.06303
作者: Hyowon Cho,Soonwon Ka,Daechul Park,Jaewook Kang,Minjoon Seo,Bokyung Son
关键词-EN: Large language models, objectively identify latent, identify latent characteristics, large datasets due, Large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets due to their reliance on pre-trained knowledge rather than actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications on real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. The title of our paper is chosen from multiple candidates based on DSAI-generated criteria.
[AI-22] S2FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity
链接: https://arxiv.org/abs/2412.06289
作者: Xinyu Yang,Jixuan Leng,Geyang Guo,Jiawei Zhao,Ryumei Nakada,Linjun Zhang,Huaxiu Yao,Beidi Chen
关键词-EN: Current PEFT methods, Current PEFT, Structured Sparse Fine-Tuning, PEFT methods, high quality
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Current PEFT methods for LLMs can achieve either high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Utilizing this key insight, we propose a family of Structured Sparse Fine-Tuning (S^2FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S^2FT accomplishes this by "selecting sparsely and computing densely". It selects a few heads and channels in the MHA and FFN modules for each Transformer block, respectively. Next, it co-permutes weight matrices on both sides of the coupled structures in LLMs to connect the selected components in each layer into a dense submatrix. Finally, S^2FT performs in-place gradient updates on all submatrices. Through theoretical analysis and empirical results, our method prevents overfitting and forgetting, delivers SOTA performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements compared to LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial backpropagation algorithm, S^2FT saves training memory up to 3× and improves latency by 1.5-2.7× compared to full FT, while delivering an average 10% improvement over LoRA on both metrics. We further demonstrate that the weight updates in S^2FT can be decoupled into adapters, enabling effective fusion, fast switch, and efficient parallelism for serving multiple fine-tuned models.
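A rough sketch of the "select sparsely, compute densely" idea is shown below: only a chosen subset of channels receives gradient updates, gathered into a dense block. This is illustrative only; the actual S^2FT implementation co-permutes coupled weight matrices and uses a partial backpropagation algorithm.

```python
import torch

def sparse_select_dense_update(W: torch.Tensor, grad: torch.Tensor,
                               channel_idx: torch.Tensor, lr: float = 1e-4):
    # Only the selected channels are updated; they are gathered into a
    # contiguous dense submatrix, updated, and scattered back.
    sub = W[channel_idx]                 # dense submatrix of selected channels
    sub -= lr * grad[channel_idx]        # update on the dense block
    W[channel_idx] = sub
    return W

# toy usage on a fake FFN weight of shape (d_ff, d_model)
W = torch.randn(4096, 1024)
grad = torch.randn_like(W)
selected = torch.randperm(4096)[:64]     # a few channels per block, as in the abstract
sparse_select_dense_update(W, grad, selected)
```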
[AI-23] Unseen Attack Detection in Software-Defined Networking Using a BERT-Based Large Language Model
链接: https://arxiv.org/abs/2412.06239
作者: Mohammed N. Swileh(1),Shengli Zhang(1) ((1) College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China)
关键词-EN: Software defined networking, Software defined, SDN centralized control, SDN, defined networking
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Mohammed N. Swileh is first author. Shengli Zhang is corresponding author
点击查看摘要
Abstract:Software defined networking (SDN) represents a transformative shift in network architecture by decoupling the control plane from the data plane, enabling centralized and flexible management of network resources. However, this architectural shift introduces significant security challenges, as SDN's centralized control becomes an attractive target for various types of attacks. While current research has yielded valuable insights into attack detection in SDN, critical gaps remain. Addressing challenges in feature selection, broadening the scope beyond DDoS attacks, strengthening attack decisions based on multi-flow analysis, and building models capable of detecting unseen attacks that they have not been explicitly trained on are essential steps toward advancing security in SDN. In this paper, we introduce a novel approach that leverages Natural Language Processing (NLP) and the pre-trained BERT base model to enhance attack detection in SDN. Our approach transforms network flow data into a format interpretable by language models, allowing BERT to capture intricate patterns and relationships within network traffic. By using Random Forest for feature selection, we optimize model performance and reduce computational overhead, ensuring accurate detection. Attack decisions are made based on several flows, providing stronger and more reliable detection of malicious traffic. Furthermore, our approach is specifically designed to detect previously unseen attacks, offering a solution for identifying threats that the model was not explicitly trained on. To rigorously evaluate our approach, we conducted experiments in two scenarios: one focused on detecting known attacks, achieving 99.96% accuracy, and another on detecting unseen attacks, where our model achieved 99.96% accuracy, demonstrating the robustness of our approach in detecting evolving threats to improve the security of SDN networks.
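The flow-to-text idea can be sketched with the Hugging Face transformers library. The flow field names and serialisation format below are assumptions; the paper additionally applies Random Forest feature selection and fine-tunes the model on labelled SDN traffic before making multi-flow decisions.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

# Illustrative flow-to-text encoding (field names are assumptions).
def flow_to_text(flow: dict) -> str:
    return " ".join(f"{k}={v}" for k, v in flow.items())

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

flows = [
    {"proto": "tcp", "duration": 0.2, "pkts": 12, "bytes": 1800, "dst_port": 80},
    {"proto": "udp", "duration": 9.7, "pkts": 90000, "bytes": 5400000, "dst_port": 53},
]
batch = tokenizer([flow_to_text(f) for f in flows], padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits       # fine-tune on labelled flows in practice
print(logits.softmax(-1))
```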
[AI-24] A Self-guided Multimodal Approach to Enhancing Graph Representation Learning for Alzheimer's Diseases
链接: https://arxiv.org/abs/2412.06212
作者: Zhepeng Wang,Runxue Bao,Yawen Wu,Guodong Liu,Lei Yang,Liang Zhan,Feng Zheng,Weiwen Jiang,Yanfu Zhang
关键词-EN: Graph neural networks, irregularly structured data, handle irregularly structured, Graph neural, powerful machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Graph neural networks (GNNs) are powerful machine learning models designed to handle irregularly structured data. However, their generic design often proves inadequate for analyzing brain connectomes in Alzheimer’s Disease (AD), highlighting the need to incorporate domain knowledge for optimal performance. Infusing AD-related knowledge into GNNs is a complicated task. Existing methods typically rely on collaboration between computer scientists and domain experts, which can be both time-intensive and resource-demanding. To address these limitations, this paper presents a novel self-guided, knowledge-infused multimodal GNN that autonomously incorporates domain knowledge into the model development process. Our approach conceptualizes domain knowledge as natural language and introduces a specialized multimodal GNN capable of leveraging this uncurated knowledge to guide the learning process of the GNN, such that it can improve the model performance and strengthen the interpretability of the predictions. To evaluate our framework, we curated a comprehensive dataset of recent peer-reviewed papers on AD and integrated it with multiple real-world AD datasets. Experimental results demonstrate the ability of our method to extract relevant domain knowledge, provide graph-based explanations for AD diagnosis, and improve the overall performance of the GNN. This approach provides a more scalable and efficient alternative to inject domain knowledge for AD compared with the manual design from the domain expert, advancing both prediction accuracy and interpretability in AD diagnosis.
[AI-25] Skill-Enhanced Reinforcement Learning Acceleration from Demonstrations ICML2024
链接: https://arxiv.org/abs/2412.06207
作者: Hanping Zhang,Yuhong Guo
关键词-EN: facilitate rapid Reinforcement, leveraging expert demonstrations, rapid Reinforcement Learning, Reinforcement Learning Acceleration, Skill-enhanced Reinforcement Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICML 2024 AutoRL Workshop; 9 pages
点击查看摘要
Abstract:Learning from Demonstration (LfD) aims to facilitate rapid Reinforcement Learning (RL) by leveraging expert demonstrations to pre-train the RL agent. However, the limited availability of expert demonstration data often hinders its ability to effectively aid downstream RL learning. To address this problem, we propose a novel two-stage method dubbed as Skill-enhanced Reinforcement Learning Acceleration (SeRLA). SeRLA introduces a skill-level adversarial Positive-Unlabeled (PU) learning model to extract useful skill prior knowledge by enabling learning from both limited expert data and general low-cost demonstration data in the offline prior learning stage. Subsequently, it deploys a skill-based soft actor-critic algorithm to leverage this acquired prior knowledge in the downstream online RL stage for efficient training of a skill policy network. Moreover, we develop a simple skill-level data enhancement technique to further alleviate data sparsity and improve both skill prior learning and downstream skill policy training. Our experimental results on multiple standard RL environments show the proposed SeRLA method achieves state-of-the-art performance on accelerating reinforcement learning on downstream tasks, especially in the early learning phase.
[AI-26] Enhancing Adversarial Resistance in LLMs with Recursion
链接: https://arxiv.org/abs/2412.06181
作者: Bryan Li,Sounak Bagchi,Zizhan Wang
关键词-EN: Large Language Models, Language Models, society necessitates robust, necessitates robust defenses, Large Language
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The increasing integration of Large Language Models (LLMs) into society necessitates robust defenses against vulnerabilities from jailbreaking and adversarial prompts. This project proposes a recursive framework for enhancing the resistance of LLMs to manipulation through the use of prompt simplification techniques. By increasing the transparency of complex and confusing adversarial prompts, the proposed method enables more reliable detection and prevention of malicious inputs. Our findings attempt to address a critical problem in AI safety and security, providing a foundation for the development of systems able to distinguish harmless inputs from prompts containing malicious intent. As LLMs continue to be used in diverse applications, the importance of such safeguards will only grow.
[AI-27] AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement
链接: https://arxiv.org/abs/2412.06176
作者: Pranjal Aggarwal,Bryan Parno,Sean Welleck
关键词-EN: gained significant traction, Automated code generation, Automated code, significant traction, gained significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Automated code generation with large language models has gained significant traction, but there remains no guarantee on the correctness of generated code. We aim to use formal verification to provide mathematical guarantees that the generated code is correct. However, generating formally verified code with LLMs is hindered by the scarcity of training data and the complexity of formal proofs. To tackle this challenge, we introduce AlphaVerus, a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language and leveraging feedback from a verifier. AlphaVerus operates in three phases: exploration of candidate translations, Treefinement – a novel tree search algorithm for program refinement using verifier feedback, and filtering misaligned specifications and programs to prevent reward hacking. Through this iterative process, AlphaVerus enables a LLaMA-3.1-70B model to generate verified code without human intervention or model finetuning. AlphaVerus shows an ability to generate formally verified solutions for HumanEval and MBPP, laying the groundwork for truly trustworthy code-generation agents.
[AI-28] ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising
链接: https://arxiv.org/abs/2412.06167
作者: Ruizhi Wang,Kai Liu,Bingjie Li,Yu Rong,Qingpeng Cai,Fei Pan,Peng Jiang
关键词-EN: creatives, Automated Creatives Quota, prediction module, module, ACQ
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In online advertising, the demand-side platform (a.k.a. DSP) enables advertisers to create different ad creatives for real-time bidding. Intuitively, advertisers tend to create more ad creatives for a single photo to increase the probability of participating in bidding, further enhancing their ad cost. From the perspective of DSP, the following are two overlooked issues. On the one hand, the number of ad creatives cannot grow indefinitely. On the other hand, the marginal effects of ad cost diminish as the number of ad creatives increases. To this end, this paper proposes a two-stage framework named Automated Creatives Quota (ACQ) to achieve the automatic creation and deactivation of ad creatives. ACQ dynamically allocates the creative quota across multiple advertisers to maximize the revenue of the ad platform. ACQ comprises two components: a prediction module to estimate the cost of a photo under different numbers of ad creatives, and an allocation module to decide the quota for photos considering their estimated costs in the prediction module. Specifically, in the prediction module, we develop a multi-task learning model based on an unbalanced binary tree to effectively mitigate the target variable imbalance problem. In the allocation module, we formulate the quota allocation problem as a multiple-choice knapsack problem (MCKP) and develop an efficient solver to solve such large-scale problems involving tens of millions of ads. We performed extensive offline and online experiments to validate the superiority of our proposed framework, which increased cost by 9.34%.
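The allocation step is formulated as a multiple-choice knapsack problem (MCKP). As a simple illustration of the structure (not the paper's solver, which handles tens of millions of ads), a greedy heuristic that exploits the diminishing marginal cost mentioned in the abstract looks like this:

```python
import heapq

def greedy_quota_allocation(pred_cost, total_quota):
    # pred_cost[p][k] = predicted ad cost for photo p with k creatives (k = 0..K).
    # Greedily give one more creative slot to the photo with the largest
    # predicted marginal cost gain (illustrative heuristic, not the paper's solver).
    alloc = [0] * len(pred_cost)
    heap = [(-(c[1] - c[0]), p) for p, c in enumerate(pred_cost)]  # max-heap on gain
    heapq.heapify(heap)
    for _ in range(total_quota):
        if not heap:
            break
        _, p = heapq.heappop(heap)
        alloc[p] += 1
        nxt = alloc[p] + 1
        if nxt < len(pred_cost[p]):
            heapq.heappush(heap, (-(pred_cost[p][nxt] - pred_cost[p][alloc[p]]), p))
    return alloc

# toy usage: 3 photos, predicted cost under 0..3 creatives (diminishing returns)
costs = [[0, 5, 8, 9], [0, 3, 5, 6], [0, 7, 9, 10]]
print(greedy_quota_allocation(costs, total_quota=4))   # -> [2, 1, 1]
```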
[AI-29] Conservative Contextual Bandits: Beyond Linear Representations
链接: https://arxiv.org/abs/2412.06165
作者: Rohan Deb,Mohammad Ghavamzadeh,Arindam Banerjee
关键词-EN: Conservative Contextual Bandits, sequential decision making, Conservative Contextual, Contextual Bandits, sequential decision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, along with minimizing regret, also satisfies a safety constraint: the performance is not worse than a baseline policy (e.g., the policy that the company has in production) by more than a (1+\alpha) factor. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. However, in practice the cost of the arms is often a non-linear function, and therefore existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms, C-SquareCB and C-FastCB, using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and that the regret of C-SquareCB is sub-linear in the horizon T, while the regret of C-FastCB is first-order and is sub-linear in L^*, the cumulative loss of the optimal policy. Subsequently, we use a neural network for function approximation and online gradient descent as the regression oracle to provide \tilde{O}(\sqrt{KT} + K/\alpha) and \tilde{O}(\sqrt{K L^*} + K(1 + 1/\alpha)) regret bounds, respectively. Finally, we demonstrate the efficacy of our algorithms on real-world data and show that they significantly outperform the existing baseline while maintaining the performance guarantee.
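Inverse Gap Weighting, which both algorithms use for exploration, turns the regression oracle's per-arm predictions into a sampling distribution. Below is a minimal sketch of standard IGW on its own, without the conservative safety check that the paper adds.

```python
import numpy as np

def igw_probabilities(pred_rewards: np.ndarray, gamma: float) -> np.ndarray:
    # Standard Inverse Gap Weighting (as in SquareCB-style algorithms): arms far
    # from the greedy arm get probability inversely proportional to their gap.
    K = len(pred_rewards)
    best = int(np.argmax(pred_rewards))
    gaps = pred_rewards[best] - pred_rewards
    probs = 1.0 / (K + gamma * gaps)
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()          # remaining mass goes to the greedy arm
    return probs

# toy usage: regression oracle predicts per-arm rewards for the current context
print(igw_probabilities(np.array([0.9, 0.5, 0.4, 0.1]), gamma=20.0))
```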
[AI-30] MoSH: Modeling Multi-Objective Tradeoffs with Soft and Hard Bounds
链接: https://arxiv.org/abs/2412.06154
作者: Edward Chen,Natalie Dullerud,Thomas Niedermayr,Elizabeth Kidd,Ransalu Senanayake,Pang Wei Koh,Sanmi Koyejo,Carlos Guestrin
关键词-EN: Countless science, Pareto frontier, necessitate that decision-makers, applications in multi-objective, full Pareto frontier
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Countless science and engineering applications in multi-objective optimization (MOO) necessitate that decision-makers (DMs) select a Pareto-optimal solution which aligns with their preferences. Evaluating individual solutions is often expensive, necessitating cost-sensitive optimization techniques. Due to competing objectives, the space of trade-offs is also expansive – thus, examining the full Pareto frontier may prove overwhelming to a DM. Such real-world settings generally have loosely-defined and context-specific desirable regions for each objective function that can aid in constraining the search over the Pareto frontier. We introduce a novel conceptual framework that operationalizes these priors using soft-hard functions, SHFs, which allow for the DM to intuitively impose soft and hard bounds on each objective – which has been lacking in previous MOO frameworks. Leveraging a novel minimax formulation for Pareto frontier sampling, we propose a two-step process for obtaining a compact set of Pareto-optimal points which respect the user-defined soft and hard bounds: (1) densely sample the Pareto frontier using Bayesian optimization, and (2) sparsify the selected set to surface to the user, using robust submodular function optimization. We prove that (2) obtains the optimal compact Pareto-optimal set of points from (1). We further show that many practical problems fit within the SHF framework and provide extensive empirical validation on diverse domains, including brachytherapy, engineering design, and large language model personalization. Specifically, for brachytherapy, our approach returns a compact set of points with over 3% greater SHF-defined utility than the next best approach. Among the other diverse experiments, our approach consistently leads in utility, allowing the DM to reach 99% of their maximum possible desired utility within validation of 5 points.
[AI-31] Privacy-Preserving Large Language Models: Mechanisms, Applications, and Future Directions
链接: https://arxiv.org/abs/2412.06113
作者: Guoshenghui Zhao,Eric Song
关键词-EN: natural language processing, revolutionized natural language, finance and education, rapid advancement, revolutionized natural
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) has revolutionized natural language processing, enabling applications in diverse domains such as healthcare, finance and education. However, the growing reliance on extensive data for training and inference has raised significant privacy concerns, ranging from data leakage to adversarial attacks. This survey comprehensively explores the landscape of privacy-preserving mechanisms tailored for LLMs, including differential privacy, federated learning, cryptographic protocols, and trusted execution environments. We examine their efficacy in addressing key privacy challenges, such as membership inference and model inversion attacks, while balancing trade-offs between privacy and model utility. Furthermore, we analyze privacy-preserving applications of LLMs in privacy-sensitive domains, highlighting successful implementations and inherent limitations. Finally, this survey identifies emerging research directions, emphasizing the need for novel frameworks that integrate privacy by design into the lifecycle of LLMs. By synthesizing state-of-the-art approaches and future trends, this paper provides a foundation for developing robust, privacy-preserving large language models that safeguard sensitive information without compromising performance.
[AI-32] DECO: Life-Cycle Management of Enterprise-Grade Chatbots
链接: https://arxiv.org/abs/2412.06099
作者: Yiwen Zhu,Mathieu Demarne,Kai Deng,Wenjing Wang,Nutan Sahoo,Divya Vermareddy,Hannah Lerner,Yunlei Lu,Swati Bararia,Anjali Bhavan,William Zhang,Xia Li,Katherine Lin,Miso Cilimdzic,Subru Krishnan
关键词-EN: including Troubleshooting Guides, Software engineers frequently, engineers frequently grapple, including Troubleshooting, internal tools developed
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Software engineers frequently grapple with the challenge of accessing disparate documentation and telemetry data, including Troubleshooting Guides (TSGs), incident reports, code repositories, and various internal tools developed by multiple stakeholders. While on-call duties are inevitable, incident resolution becomes even more daunting due to the obscurity of legacy sources and the pressures of strict time constraints. To enhance the efficiency of on-call engineers (OCEs) and streamline their daily workflows, we introduced DECO – a comprehensive framework for developing, deploying, and managing enterprise-grade chatbots tailored to improve productivity in engineering routines. This paper details the design and implementation of the DECO framework, emphasizing its innovative NL2SearchQuery functionality and a hierarchical planner. These features support efficient and customized retrieval-augmented-generation (RAG) algorithms that not only extract relevant information from diverse sources but also select the most pertinent toolkits in response to user queries. This enables the addressing of complex technical questions and provides seamless, automated access to internal resources. Additionally, DECO incorporates a robust mechanism for converting unstructured incident logs into user-friendly, structured guides, effectively bridging the documentation gap. Feedback from users underscores DECO’s pivotal role in simplifying complex engineering tasks, accelerating incident resolution, and bolstering organizational productivity. Since its launch in September 2023, DECO has demonstrated its effectiveness through extensive engagement, with tens of thousands of interactions from hundreds of active users across multiple organizations within the company.
[AI-33] Trust No AI: Prompt Injection Along The CIA Security Triad MICRO
链接: https://arxiv.org/abs/2412.06090
作者: Johann Rehberger(Independent Researcher, Embrace The Red)
关键词-EN: CIA security triad, cornerstone of data, CIA security, Confidentiality, Integrity
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Based on research presented at Black Hat Europe 2024, Microsoft Bluehat 2024 and publications from this http URL
点击查看摘要
Abstract:The CIA security triad - Confidentiality, Integrity, and Availability - is a cornerstone of data and cybersecurity. With the emergence of large language model (LLM) applications, a new class of threat, known as prompt injection, was first identified in 2022. Since then, numerous real-world vulnerabilities and exploits have been documented in production LLM systems, including those from leading vendors like OpenAI, Microsoft, Anthropic and Google. This paper compiles real-world exploits and proof-of-concept examples, based on the research conducted and publicly documented by the author, demonstrating how prompt injection undermines the CIA triad and poses ongoing risks to cybersecurity and AI systems at large.
[AI-34] Ethnography and Machine Learning: Synergies and New Directions
链接: https://arxiv.org/abs/2412.06087
作者: Zhuofan Li,Corey M. Abramson
关键词-EN: contemporary social science, perform quantifiable tasks, real world contexts, social scientific methods, statistical learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Methodology (stat.ME)
*备注: 20 pages, 5 figures, 3 tables
点击查看摘要
Abstract:Ethnography (social scientific methods that illuminate how people understand, navigate and shape the real world contexts in which they live their lives) and machine learning (computational techniques that use big data and statistical learning models to perform quantifiable tasks) are each core to contemporary social science. Yet these tools have remained largely separate in practice. This chapter draws on a growing body of scholarship that argues that ethnography and machine learning can be usefully combined, particularly for large comparative studies. Specifically, this paper (a) explains the value (and challenges) of using machine learning alongside qualitative field research for certain types of projects, (b) discusses recent methodological trends to this effect, (c) provides examples that illustrate workflow drawn from several large projects, and (d) concludes with a roadmap for enabling productive coevolution of field methods and machine learning.
[AI-35] Fuzzy Norm-Explicit Product Quantization for Recommender Systems
链接: https://arxiv.org/abs/2412.06069
作者: Mohammadreza Jamalifard,Javier Andreu-Perez,Hani Hagras,Luis Martínez López
关键词-EN: data resources grow, information overload problem, resources grow, data resources, meet the demands
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As the data resources grow, providing recommendations that best meet the demands has become a vital requirement in business and life to overcome the information overload problem. However, building a system suggesting relevant recommendations has always been a point of debate. One of the most cost-efficient techniques in terms of producing relevant recommendations at a low complexity is Product Quantization (PQ). PQ approaches have continued developing in recent years. This system's crucial challenge is improving product quantization performance in terms of recall measures without compromising its complexity. This makes the algorithm suitable for problems that require a greater number of potentially relevant items without disregarding others, at high speed and low cost to keep up with traffic. This is the case of online shops where the recommendations for the purpose are important, although customers can be susceptible to scoping other products. This research proposes a fuzzy approach to perform norm-based product quantization. Type-2 Fuzzy sets (T2FSs) define the codebook allowing sub-vectors (T2FSs) to be associated with more than one element of the codebook, and next, its norm calculus is resolved by means of integration. Our method finesses the recall measure up, making the algorithm suitable for problems that require querying at most possible potential relevant items without disregarding others. The proposed method outperforms all PQ approaches such as NEQ, PQ, and RQ up to +6%, +5%, and +8% by achieving a recall of 94%, 69%, 59% in Netflix, Audio, Cifar60k datasets, respectively. Moreover, computing time and complexity nearly equals the most computationally efficient existing PQ method in the state-of-the-art.
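For context, the plain Product Quantization baseline that the fuzzy approach builds on can be sketched as follows. The sub-vector split and k-means codebooks below are the standard recipe; the paper replaces the hard codebook assignment with Type-2 fuzzy memberships and an explicit norm calculation, which is not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(X: np.ndarray, n_subvectors: int = 4, n_codes: int = 256):
    # Standard PQ baseline: split each vector into sub-vectors and cluster each
    # sub-space independently, giving one small codebook per sub-space.
    d = X.shape[1] // n_subvectors
    codebooks, codes = [], []
    for i in range(n_subvectors):
        sub = X[:, i * d:(i + 1) * d]
        km = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(sub)
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_)
    return codebooks, np.stack(codes, axis=1)    # codes: (n_items, n_subvectors)

# toy usage on random item embeddings
X = np.random.default_rng(2).normal(size=(2000, 64)).astype(np.float32)
codebooks, codes = train_pq(X, n_subvectors=4, n_codes=16)
print(codes.shape)      # (2000, 4) compact codes per item
```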
[AI-36] Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond
链接: https://arxiv.org/abs/2412.06061
作者: Yekun Ke,Yingyu Liang,Zhenmei Shi,Zhao Song,Chiwun Yang
关键词-EN: time series forecasting, popular to study, long been popular, linear residual model, TSF tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The application of transformer-based models on time series forecasting (TSF) tasks has long been popular to study. However, many of these works fail to beat the simple linear residual model, and the theoretical understanding of this issue is still limited. In this work, we propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the mechanism behind it to Asymmetric Learning in training attention networks. When the sign of the previous step is inconsistent with the sign of the current step in the next-step-prediction time series, attention fails to learn the residual features. This makes it difficult to generalize on out-of-distribution (OOD) data, especially on the sign-inconsistent next-step-prediction data, with the same representation pattern, whereas a linear residual network could easily accomplish it. We hope our theoretical insights provide important necessary conditions for designing the expressive and efficient transformer-based architecture for practitioners.
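The "simple linear residual model" that transformers reportedly fail to beat can be sketched as a last-value residual plus a learned linear map of the history. The exact baseline architecture in the paper may differ; this is only an illustrative sketch.

```python
import torch
import torch.nn as nn

class LinearResidual(nn.Module):
    # Sketch of a linear residual forecaster: forecast = last observed value +
    # a learned linear map of the (de-trended) history, which trivially carries
    # the sign of the previous step forward.
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        self.proj = nn.Linear(lookback, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, lookback)
        last = x[:, -1:]                                    # residual connection
        return last + self.proj(x - last)

model = LinearResidual(lookback=96, horizon=24)
print(model(torch.randn(8, 96)).shape)    # (8, 24)
```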
[AI-37] Cloud Platforms for Developing Generative AI Solutions: A Scoping Review of Tools and Services
链接: https://arxiv.org/abs/2412.06044
作者: Dhavalkumar Patel,Ganesh Raut,Satya Narayan Cheetirala,Girish N Nadkarni,Robert Freeman,Benjamin S. Glicksberg,Eyal Klang,Prem Timsina
关键词-EN: transforming enterprise application, enterprise application development, create content, application development, development by enabling
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 65 pages, 10 figures, and supplementary methods detailing extended technical descriptions, service matrices, SWOT analyses, and detailed provider comparisons
点击查看摘要
Abstract:Generative AI is transforming enterprise application development by enabling machines to create content, code, and designs. These models, however, demand substantial computational power and data management. Cloud computing addresses these needs by offering infrastructure to train, deploy, and scale generative AI models. This review examines cloud services for generative AI, focusing on key providers like Amazon Web Services (AWS), Microsoft Azure, Google Cloud, IBM Cloud, Oracle Cloud, and Alibaba Cloud. It compares their strengths, weaknesses, and impact on enterprise growth. We explore the role of high-performance computing (HPC), serverless architectures, edge computing, and storage in supporting generative AI. We also highlight the significance of data management, networking, and AI-specific tools in building and deploying these models. Additionally, the review addresses security concerns, including data privacy, compliance, and AI model protection. It assesses the performance and cost efficiency of various cloud providers and presents case studies from healthcare, finance, and entertainment. We conclude by discussing challenges and future directions, such as technical hurdles, vendor lock-in, sustainability, and regulatory issues. Taken together, this work serves as a guide for practitioners and researchers looking to adopt cloud-based generative AI solutions and to navigate the intricacies of this evolving field.
[AI-38] The AI Double Standard: Humans Judge All AIs for the Actions of One
链接: https://arxiv.org/abs/2412.06040
作者: Aikaterina Manoli,Janet V. T. Pauketat,Jacy Reese Anthis
关键词-EN: moral agents responsible, moral, artificial intelligence, Robots, human
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:Robots and other artificial intelligence (AI) systems are widely perceived as moral agents responsible for their actions. As AI proliferates, these perceptions may become entangled via the moral spillover of attitudes towards one AI to attitudes towards other AIs. We tested how the seemingly harmful and immoral actions of an AI or human agent spill over to attitudes towards other AIs or humans in two preregistered experiments. In Study 1 (N = 720), we established the moral spillover effect in human-AI interaction by showing that immoral actions increased attributions of negative moral agency (i.e., acting immorally) and decreased attributions of positive moral agency (i.e., acting morally) and moral patiency (i.e., deserving moral concern) to both the agent (a chatbot or human assistant) and the group to which they belong (all chatbot or human assistants). There was no significant difference in the spillover effects between the AI and human contexts. In Study 2 (N = 684), we tested whether spillover persisted when the agent was individuated with a name and described as an AI or human, rather than specifically as a chatbot or personal assistant. We found that spillover persisted in the AI context but not in the human context, possibly because AIs were perceived as more homogeneous due to their outgroup status relative to humans. This asymmetry suggests a double standard whereby AIs are judged more harshly than humans when one agent morally transgresses. With the proliferation of diverse, autonomous AI systems, HCI research and design should account for the fact that experiences with one AI could easily generalize to perceptions of all AIs and negative HCI outcomes, such as reduced trust.
[AI-39] PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations
链接: https://arxiv.org/abs/2412.05994
作者: Namgyu Kang,Jaemin Oh,Youngjoon Hong,Eunbyung Park
关键词-EN: Partial Differential Equations, Differential Equations, Partial Differential, approximation of Partial, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL
点击查看摘要
Abstract:The approximation of Partial Differential Equations (PDEs) using neural networks has seen significant advancements through Physics-Informed Neural Networks (PINNs). Despite their straightforward optimization framework and flexibility in implementing various PDEs, PINNs often suffer from limited accuracy due to the spectral bias of Multi-Layer Perceptrons (MLPs), which struggle to effectively learn high-frequency and non-linear components. Recently, parametric mesh representations in combination with neural networks have been investigated as a promising approach to eliminate the inductive biases of neural networks. However, they usually require very high-resolution grids and a large number of collocation points to achieve high accuracy while avoiding overfitting issues. In addition, the fixed positions of the mesh parameters restrict their flexibility, making it challenging to accurately approximate complex PDEs. To overcome these limitations, we propose Physics-Informed Gaussians (PIGs), which combine feature embeddings using Gaussian functions with a lightweight neural network. Our approach uses trainable parameters for the mean and variance of each Gaussian, allowing for dynamic adjustment of their positions and shapes during training. This adaptability enables our model to optimally approximate PDE solutions, unlike models with fixed parameter positions. Furthermore, the proposed approach maintains the same optimization framework used in PINNs, allowing us to benefit from their excellent properties. Experimental results show the competitive performance of our model across various PDEs, demonstrating its potential as a robust tool for solving complex PDEs. Our project page is available at this https URL
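Below is a minimal sketch of the Gaussian feature embedding described in the abstract: each Gaussian has a trainable mean and scale, and a lightweight network maps the resulting features to the PDE solution. Layer sizes, initialisation, and the downstream head are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GaussianFeatures(nn.Module):
    # Each feature is a Gaussian bump with trainable mean and (log-)scale, so
    # positions and shapes can move during training (illustrative sketch).
    def __init__(self, n_gaussians: int = 128, dim: int = 2):
        super().__init__()
        self.mu = nn.Parameter(torch.rand(n_gaussians, dim) * 2 - 1)
        self.log_sigma = nn.Parameter(torch.zeros(n_gaussians, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (N, dim)
        diff = x[:, None, :] - self.mu[None, :, :]              # (N, G, dim)
        sigma = self.log_sigma.exp()
        return torch.exp(-0.5 * ((diff / sigma) ** 2).sum(-1))  # (N, G)

# a lightweight network on top of the embedding, as the abstract describes
model = nn.Sequential(GaussianFeatures(128, 2),
                      nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 1))
u = model(torch.rand(32, 2))    # e.g. collocation points of a 2D PDE
print(u.shape)
```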
[AI-40] Accelerating Manufacturing Scale-Up from Material Discovery Using Agentic Web Navigation and Retrieval-Augmented AI for Process Engineering Schematics Design
链接: https://arxiv.org/abs/2412.05937
作者: Sakhinana Sagar Srinivas,Akash Das,Shivam Gupta,Venkataramana Runkana
关键词-EN: Process Flow Diagrams, industrial process design, Process Flow, process design, Process and Instrumentation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (PIDs) are critical tools for industrial process design, control, and safety. However, the generation of precise and regulation-compliant diagrams remains a significant challenge, particularly in scaling breakthroughs from material discovery to industrial production in an era of automation and digitalization. This paper introduces an autonomous agentic framework to address these challenges through a two-stage approach involving knowledge acquisition and generation. The framework integrates specialized sub-agents for retrieving and synthesizing multimodal data from publicly available online sources and constructs ontological knowledge graphs using a Graph Retrieval-Augmented Generation (Graph RAG) paradigm. These capabilities enable the automation of diagram generation and open-domain question answering (ODQA) tasks with high contextual accuracy. Extensive empirical experiments demonstrate the framework's ability to deliver regulation-compliant diagrams with minimal expert intervention, highlighting its practical utility for industrial applications.
[AI-41] Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
链接: https://arxiv.org/abs/2412.05934
作者: Ma Teng,Jia Xiaojun,Duan Ranjie,Li Xinfeng,Huang Yihao,Chu Zhixuan,Liu Yang,Ren Wenqi
关键词-EN: large language models, multimodal large language, multimodal risk distribution, academia and industry, rapid advancement
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the rapid advancement of multimodal large language models (MLLMs), concerns regarding their security have increasingly captured the attention of both academia and industry. Although MLLMs are vulnerable to jailbreak attacks, designing effective multimodal jailbreak attacks poses unique challenges, especially given the distinct protective measures implemented across various modalities in commercial models. Previous works concentrate risks into a single modality, resulting in limited jailbreak performance. In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, which consists of two elements: multimodal risk distribution strategy and heuristic-induced search strategy. The multimodal risk distribution strategy is used to segment harmful instructions across multiple modalities to effectively circumvent MLLMs' security protection. The heuristic-induced search strategy identifies two types of prompts: the understanding-enhancing prompt, which helps the MLLM reconstruct the malicious prompt, and the inducing prompt, which increases the likelihood of affirmative outputs over refusals, enabling a successful jailbreak attack. Extensive experiments demonstrate that this approach effectively uncovers vulnerabilities in MLLMs, achieving an average attack success rate of 90% across seven popular open-source MLLMs and an average attack success rate of around 68% in three popular closed-source MLLMs. Our code will be released soon. Warning: This paper contains offensive and harmful examples; reader discretion is advised.
[AI-42] BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs
链接: https://arxiv.org/abs/2412.05892
作者: Ruoxi Cheng,Yizhong Ding,Shuirong Cao,Shaowei Yuan,Zhiqiang Wang,Xiaojun Jia
关键词-EN: vulnerable to illegal, illegal or unethical, unethical responses, single-round attack limitation, Abstract
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs
点击查看摘要
Abstract:LVLMs are widely used but vulnerable to illegal or unethical responses under jailbreak attacks. To ensure their responsible deployment in real-world applications, it is essential to understand their vulnerabilities. There are four main issues in current work: single-round attack limitation, insufficient dual-modal synergy, poor transferability to black-box models, and reliance on prompt engineering. To address these limitations, we propose BAMBA, a bimodal adversarial multi-round black-box jailbreak attacker for LVLMs. We first use an image optimizer to learn malicious features from a harmful corpus, then deepen these features through a bimodal optimizer through text-image interaction, generating adversarial text and image for jailbreak. Experiments on various LVLMs and datasets demonstrate that BAMBA outperforms other baselines.
[AI-43] Towards Modeling Data Quality and Machine Learning Model Performance
链接: https://arxiv.org/abs/2412.05882
作者: Usman Anjum,Chris Trentman,Elrod Caden,Justin Zhan
关键词-EN: Understanding the effect, machine learning models, effect of uncertainty, machine learning, crucial in developing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Understanding the effect of uncertainty and noise in data on machine learning models (MLM) is crucial in developing trust and measuring performance. In this paper, a new model is proposed to quantify uncertainties and noise in data on MLMs. Using the concept of signal-to-noise ratio (SNR), a new metric called deterministic-non-deterministic ratio (DDR) is proposed to formulate performance of a model. Using synthetic data in experiments, we show how accuracy can change with DDR and how we can use DDR-accuracy curves to determine performance of a model.
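The abstract does not spell out the DDR formula, so the following is only a plausible SNR-style instantiation on synthetic data, in the spirit of the experiments described; the variance ratio in decibels is an assumption, not the paper's definition.

```python
import numpy as np

def ddr(y_clean: np.ndarray, y_noisy: np.ndarray) -> float:
    # Illustrative SNR-style deterministic-to-non-deterministic ratio (assumption):
    # variance of the deterministic part over variance of the noise, in decibels.
    noise = y_noisy - y_clean
    return 10.0 * np.log10(np.var(y_clean) / np.var(noise))

# toy usage: a synthetic regression target corrupted by Gaussian noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=1000)
y = np.sin(3 * x)                           # deterministic component
y_noisy = y + 0.2 * rng.normal(size=x.shape)
print(round(ddr(y, y_noisy), 2))
```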
[AI-44] CardOOD: Robust Query-driven Cardinality Estimation under Out-of-Distribution
链接: https://arxiv.org/abs/2412.05864
作者: Rui Li,Kangfei Zhao,Jeffrey Xu Yu,Guoren Wang
关键词-EN: Query-driven learned estimators, OOD problem, lightweight alternatives, alternatives to traditional, learning
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Query-driven learned estimators are accurate, flexible, and lightweight alternatives to traditional estimators in query optimization. However, existing query-driven approaches struggle with the Out-of-distribution (OOD) problem, where the test workload distribution differs from the training workload, leading to performance degradation. In this paper, we present CardOOD, a general learning framework designed to construct robust query-driven cardinality estimators that are resilient against the OOD problem. Our framework focuses on offline training algorithms that develop one-off models from a static workload, suitable for model initialization and periodic retraining. In CardOOD, we extend classical transfer/robust learning techniques to train query-driven cardinality estimators, and the algorithms fall into three categories: representation learning, data manipulation, and new learning strategies. As these learning techniques are originally evaluated in computer vision tasks, we also propose a new learning algorithm that exploits the property of cardinality estimation. This algorithm, lying in the category of new learning strategy, models the partial order constraint of cardinalities by a self-supervised learning task. Comprehensive experimental studies demonstrate the efficacy of the algorithms of CardOOD in mitigating the OOD problem to varying extents. We further integrate CardOOD into PostgreSQL, showcasing its practical utility in query optimization.
[AI-45] Evolving Algebraic Multigrid Methods Using Grammar-Guided Genetic Programming
链接: https://arxiv.org/abs/2412.05852
作者: Dinesh Parthasarathy,Wayne Bradford Mitchell,Harald Köstler
关键词-EN: asymptotically optimal algorithms, optimal algorithms, asymptotically optimal, careful selection, individual components
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:Multigrid methods, despite being known to be asymptotically optimal algorithms, depend on the careful selection of their individual components for efficiency. Also, they are mostly restricted to standard cycle types like V-, F-, and W-cycles. We use grammar rules to generate arbitrary-shaped cycles, wherein the smoothers and their relaxation weights are chosen independently at each step within the cycle. We call this a flexible multigrid cycle. These flexible cycles are used in Algebraic Multigrid (AMG) methods with the help of grammar rules and optimized using genetic programming. The flexible AMG methods are implemented in the software library of hypre, and the programs are optimized separately for two cases: a standalone AMG solver for a 3D anisotropic problem and an AMG preconditioner with conjugate gradient for a multiphysics code. We observe that the optimized flexible cycles provide higher efficiency and better performance than the standard cycle types.
[AI-46] Kernel Stochastic Configuration Networks for Nonlinear Regression
链接: https://arxiv.org/abs/2412.05846
作者: Yongxuan Chen,Dianhui Wang
关键词-EN: Stochastic configuration networks, Stochastic configuration, universal approximation property, randomized learner models, configuration networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages, 14 figures
点击查看摘要
Abstract:Stochastic configuration networks (SCNs), as a class of randomized learner models, are featured by its way of random parameters assignment in the light of a supervisory mechanism, resulting in the universal approximation property at algorithmic level. This paper presents a kernel version of SCNs, termed KSCNs, aiming to enhance model’s representation learning capability and performance stability. The random bases of a built SCN model can be used to span a reproducing kernel Hilbert space (RKHS), followed by our proposed algorithm for constructing KSCNs. It is shown that the data distribution in the reconstructive space is favorable for regression solving and the proposed KSCN learner models hold the universal approximation property. Three benchmark datasets including two industrial datasets are used in this study for performance evaluation. Experimental results with comparisons against existing solutions clearly demonstrate that the proposed KSCN remarkably outperforms the original SCNs and some typical kernel methods for resolving nonlinear regression problems in terms of the learning performance, the model’s stability and robustness with respect to the kernel parameter settings.
[AI-47] DREAM: Domain-agnostic Reverse Engineering Attributes of Black-box Model
链接: https://arxiv.org/abs/2412.05842
作者: Rongqing Li,Jiaqi Yu,Changsheng Li,Wenhan Luo,Ye Yuan,Guoren Wang
关键词-EN: Deep learning models, machine learning platforms, target black-box model, Deep learning, target black-box
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2307.10997
点击查看摘要
Abstract:Deep learning models are usually black boxes when deployed on machine learning platforms. Prior works have shown that the attributes (e.g., the number of convolutional layers) of a target black-box model can be exposed through a sequence of queries. There is a crucial limitation: these works assume the training dataset of the target model is known beforehand and leverage this dataset for model attribute attack. However, it is difficult to access the training dataset of the target black-box model in reality. Therefore, whether the attributes of a target black-box model could be still revealed in this case is doubtful. In this paper, we investigate a new problem of black-box reverse engineering, without requiring the availability of the target model’s training dataset. We put forward a general and principled framework DREAM, by casting this problem as out-of-distribution (OOD) generalization. In this way, we can learn a domain-agnostic meta-model to infer the attributes of the target black-box model with unknown training data. This makes our method one of the kinds that can gracefully apply to an arbitrary domain for model attribute reverse engineering with strong generalization ability. Extensive experimental results demonstrate the superiority of our proposed method over the baselines.
[AI-48] A Collaborative Multi-Agent Approach to Retrieval-Augmented Generation Across Diverse Data
链接: https://arxiv.org/abs/2412.05838
作者: Aniruddha Salve,Saba Attar,Mahesh Deshmukh,Sayali Shivpuje,Arnab Mitra Utsab
关键词-EN: Large Language Models, enhances Large Language, Language Models, Large Language, incorporating external
类目: Artificial Intelligence (cs.AI)
*备注: 16 pages, 3 figures. This preprint introduces a multi-agent framework for Retrieval-Augmented Generation (RAG), enhancing Large Language Models (LLMs) for efficient integration of diverse data sources. Relevant for researchers in AI, ML, generative AI, and database systems
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external, domain-specific data into the generative process. While LLMs are highly capable, they often rely on static, pre-trained datasets, limiting their ability to integrate dynamic or private data. Traditional RAG systems typically use a single-agent architecture to handle query generation, data retrieval, and response synthesis. However, this approach becomes inefficient when dealing with diverse data sources, such as relational databases, document stores, and graph databases, often leading to performance bottlenecks and reduced accuracy. This paper proposes a multi-agent RAG system to address these limitations. Specialized agents, each optimized for a specific data source, handle query generation for relational, NoSQL, and document-based systems. These agents collaborate within a modular framework, with query execution delegated to an environment designed for compatibility across various database types. This distributed approach enhances query efficiency, reduces token overhead, and improves response accuracy by ensuring that each agent focuses on its specialized task. The proposed system is scalable and adaptable, making it ideal for generative AI workflows that require integration with diverse, dynamic, or private data sources. By leveraging specialized agents and a modular execution environment, the system provides an efficient and robust solution for handling complex, heterogeneous data environments in generative AI applications.
[AI-49] Large Language Models Merging for Enhancing the Link Stealing Attack on Graph Neural Networks
链接: https://arxiv.org/abs/2412.05830
作者: Faqian Guan,Tianqing Zhu,Wenhan Chang,Wei Ren,Wanlei Zhou
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, achieved remarkable success, Large Language Models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Link Stealing Attacks, Large Language Models, Graph Neural Networks, Privacy Attacks, Model Merging
点击查看摘要
Abstract:Graph Neural Networks (GNNs), specifically designed to process the graph data, have achieved remarkable success in various applications. Link stealing attacks on graph data pose a significant privacy threat, as attackers aim to extract sensitive relationships between nodes (entities), potentially leading to academic misconduct, fraudulent transactions, or other malicious activities. Previous studies have primarily focused on single datasets and did not explore cross-dataset attacks, let alone attacks that leverage the combined knowledge of multiple attackers. However, we find that an attacker can combine the data knowledge of multiple attackers to create a more effective attack model, which can be referred to cross-dataset attacks. Moreover, if knowledge can be extracted with the help of Large Language Models (LLMs), the attack capability will be more significant. In this paper, we propose a novel link stealing attack method that takes advantage of cross-dataset and Large Language Models (LLMs). The LLM is applied to process datasets with different data structures in cross-dataset attacks. Each attacker fine-tunes the LLM on their specific dataset to generate a tailored attack model. We then introduce a novel model merging method to integrate the parameters of these attacker-specific models effectively. The result is a merged attack model with superior generalization capabilities, enabling effective attacks not only on the attackers’ datasets but also on previously unseen (out-of-domain) datasets. We conducted extensive experiments in four datasets to demonstrate the effectiveness of our method. Additional experiments with three different GNN and LLM architectures further illustrate the generality of our approach.
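摘要未给出模型合并的具体算法,下面是一个通用的参数加权平均示意(假设各攻击者微调得到的模型结构完全一致),仅说明"把多个 state_dict 融合成一个模型"的一般做法,并非论文提出的合并方法:

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Weighted parameter averaging over models sharing the same architecture.

    Generic merging baseline (illustrative sketch), not necessarily the merging
    algorithm proposed in the paper.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage with two attacker-specific models of identical architecture:
model_a = torch.nn.Linear(8, 2)
model_b = torch.nn.Linear(8, 2)
merged = torch.nn.Linear(8, 2)
merged.load_state_dict(merge_state_dicts([model_a.state_dict(), model_b.state_dict()]))
```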
[AI-50] DapperFL: Domain Adaptive Federated Learning with Model Fusion Pruning for Edge Devices NEURIPS2024
链接: https://arxiv.org/abs/2412.05823
作者: Yongzhe Jia,Xuyun Zhang,Hongsheng Hu,Kim-Kwang Raymond Choo,Lianyong Qi,Xiaolong Xu,Amin Beheshti,Wanchun Dou
关键词-EN: machine learning paradigm, prominent machine learning, Federated learning, enabling edge devices, edge computing environments
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Oral accepted by NeurIPS 2024
点击查看摘要
Abstract:Federated learning (FL) has emerged as a prominent machine learning paradigm in edge computing environments, enabling edge devices to collaboratively optimize a global model without sharing their private data. However, existing FL frameworks suffer from efficacy deterioration due to the system heterogeneity inherent in edge computing, especially in the presence of domain shifts across local data. In this paper, we propose a heterogeneous FL framework DapperFL, to enhance model performance across multiple domains. In DapperFL, we introduce a dedicated Model Fusion Pruning (MFP) module to produce personalized compact local models for clients to address the system heterogeneity challenges. The MFP module prunes local models with fused knowledge obtained from both local and remaining domains, ensuring robustness to domain shifts. Additionally, we design a Domain Adaptive Regularization (DAR) module to further improve the overall performance of DapperFL. The DAR module employs regularization generated by the pruned model, aiming to learn robust representations across domains. Furthermore, we introduce a specific aggregation algorithm for aggregating heterogeneous local models with tailored architectures and weights. We implement DapperFL on a real-world FL platform with heterogeneous clients. Experimental results on benchmark datasets with multiple domains demonstrate that DapperFL outperforms several state-of-the-art FL frameworks by up to 2.28%, while significantly achieving model volume reductions ranging from 20% to 80%. Our code is available at: this https URL.
[AI-51] Strategizing Equitable Transit Evacuations: A Data-Driven Reinforcement Learning Approach
链接: https://arxiv.org/abs/2412.05777
作者: Fang Tang,Han Wang,Maria Laura Delle Monache
关键词-EN: increasingly frequent, Markov Decision Process, natural disasters, disasters become increasingly, General Transit Feed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Systems and Control (eess.SY)
*备注: 17 pages, 9 figures
点击查看摘要
Abstract:As natural disasters become increasingly frequent, the need for efficient and equitable evacuation planning has become more critical. This paper proposes a data-driven, reinforcement learning-based framework to optimize bus-based evacuations with an emphasis on improving both efficiency and equity. We model the evacuation problem as a Markov Decision Process solved by reinforcement learning, using real-time transit data from General Transit Feed Specification and transportation networks extracted from OpenStreetMap. The reinforcement learning agent dynamically reroutes buses from their scheduled location to minimize total passengers’ evacuation time while prioritizing equity-priority communities. Simulations on the San Francisco Bay Area transportation network indicate that the proposed framework achieves significant improvements in both evacuation efficiency and equitable service distribution compared to traditional rule-based and random strategies. These results highlight the potential of reinforcement learning to enhance system performance and urban resilience during emergency evacuations, offering a scalable solution for real-world applications in intelligent transportation systems.
[AI-52] Policy-shaped prediction: avoiding distractions in model-based reinforcement learning NEURIPS2024
链接: https://arxiv.org/abs/2412.05766
作者: Miles Hutson,Isaac Kauvar,Nick Haber
关键词-EN: sample-efficient policy optimization, promising route, route to sample-efficient, Model-based reinforcement learning, policy optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS 2024
点击查看摘要
Abstract:Model-based reinforcement learning (MBRL) is a promising route to sample-efficient policy optimization. However, a known vulnerability of reconstruction-based MBRL consists of scenarios in which detailed aspects of the world are highly predictable, but irrelevant to learning a good policy. Such scenarios can lead the model to exhaust its capacity on meaningless content, at the cost of neglecting important environment dynamics. While existing approaches attempt to solve this problem, we highlight its continuing impact on leading MBRL methods – including DreamerV3 and DreamerPro – with a novel environment where background distractions are intricate, predictable, and useless for planning future actions. To address this challenge we develop a method for focusing the capacity of the world model through synergy of a pretrained segmentation model, a task-aware reconstruction loss, and adversarial learning. Our method outperforms a variety of other approaches designed to reduce the impact of distractors, and is an advance towards robust model-based reinforcement learning.
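"任务感知重建损失"的一种最小示意如下(假设已有预训练分割模型给出的前景掩码,权重取值为假设,并非作者的具体损失设计):用掩码对像素重建误差加权,从而降低"可预测但对规划无用"的背景干扰项占用的世界模型容量:

```python
import torch

def task_aware_recon_loss(pred, target, task_mask, bg_weight=0.1):
    """Pixel reconstruction loss that down-weights distractor (background) regions.

    pred, target: (B, C, H, W) images; task_mask: (B, 1, H, W) in {0, 1}, e.g.
    produced by a pretrained segmentation model (assumed available).
    Illustrative sketch, not the authors' exact loss.
    """
    per_pixel = (pred - target) ** 2
    weights = task_mask + bg_weight * (1.0 - task_mask)
    return (weights * per_pixel).mean()

pred = torch.rand(4, 3, 64, 64, requires_grad=True)
target = torch.rand(4, 3, 64, 64)
mask = (torch.rand(4, 1, 64, 64) > 0.7).float()
loss = task_aware_recon_loss(pred, target, mask)
loss.backward()
```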
[AI-53] Can OpenAI o1 outperform humans in higher-order cognitive thinking?
链接: https://arxiv.org/abs/2412.05753
作者: Ehsan Latif,Yifan Zhou,Shuchen Guo,Lehong Shi,Yizhu Gao,Matthew Nyaaba,Arne Bewerdorff,Xiantong Yang,Xiaoming Zhai
关键词-EN: higher-order cognitive domains, including critical thinking, Critical Thinking Essay, Thinking Essay Test, cognitive domains
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This study evaluates the performance of OpenAI’s o1-preview model in higher-order cognitive domains, including critical thinking, systematic thinking, computational thinking, data literacy, creative thinking, logical reasoning, and scientific reasoning. Using established benchmarks, we compared the o1-preview model’s performance to human participants from diverse educational levels. o1-preview achieved a mean score of 24.33 on the Ennis-Weir Critical Thinking Essay Test (EWCTET), surpassing undergraduate (13.8) and postgraduate (18.39) participants (z = 1.60 and 0.90, respectively). In systematic thinking, it scored 46.1, SD = 4.12 on the Lake Urmia Vignette, significantly outperforming the human mean (20.08, SD = 8.13, z = 3.20). For data literacy, o1-preview scored 8.60, SD = 0.70 on Merk et al.'s “Use Data” dimension, compared to the human post-test mean of 4.17, SD = 2.02 (z = 2.19). On creative thinking tasks, the model achieved originality scores of 2.98, SD = 0.73, higher than the human mean of 1.74 (z = 0.71). In logical reasoning (LogiQA), it outperformed humans with average 90%, SD = 10% accuracy versus 86%, SD = 6.5% (z = 0.62). For scientific reasoning, it achieved near-perfect performance (mean = 0.99, SD = 0.12) on the TOSLS, exceeding the highest human scores of 0.85, SD = 0.13 (z = 1.78). While o1-preview excelled in structured tasks, it showed limitations in problem-solving and adaptive reasoning. These results demonstrate the potential of AI to complement education in structured assessments but highlight the need for ethical oversight and refinement for broader applications.
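摘要中的多数 z 值可按 z = (模型得分 − 人类均值) / 人类标准差 复现(这只是推断,原文的计算细节可能不同),例如:

```python
# Sketch: reproducing several reported z-scores under the assumption
# z = (model score - human mean) / human SD.
def z_score(model_score, human_mean, human_sd):
    return (model_score - human_mean) / human_sd

print(round(z_score(46.1, 20.08, 8.13), 2))   # Lake Urmia Vignette -> 3.2
print(round(z_score(8.60, 4.17, 2.02), 2))    # "Use Data" dimension -> 2.19
print(round(z_score(0.90, 0.86, 0.065), 2))   # LogiQA accuracy -> 0.62
```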
[AI-54] Constrained Control for Autonomous Spacecraft Rendezvous: Learning-Based Time Shift Governor
链接: https://arxiv.org/abs/2412.05748
作者: Taehyeun Kim,Robin Inho Kee,Ilya Kolmanovsky,Anouck Girard
关键词-EN: Time Shift Governor, Time Shift, time shift parameter, Shift Governor, based control scheme
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Taehyeun Kim and Robin Inho Kee contributed equally to this work. 18 pages, 12 figures
点击查看摘要
Abstract:This paper develops a Time Shift Governor (TSG)-based control scheme to enforce constraints during rendezvous and docking (RD) missions in the setting of the Two-Body problem. As an add-on scheme to the nominal closed-loop system, the TSG generates a time-shifted Chief spacecraft trajectory as a target reference for the Deputy spacecraft. This modification of the commanded reference trajectory ensures that constraints are enforced while the time shift is reduced to zero to effect the rendezvous. Our approach to TSG implementation integrates an LSTM neural network which approximates the time shift parameter as a function of a sequence of past Deputy and Chief spacecraft states. This LSTM neural network is trained offline from simulation data. We report simulation results for RD missions in the Low Earth Orbit (LEO) and on the Molniya orbit to demonstrate the effectiveness of the proposed control scheme. The proposed scheme reduces the time to compute the time shift parameter in most of the scenarios and successfully completes rendezvous missions.
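用 LSTM 以历史状态序列近似时移参数,可以写成一个很小的序列回归网络,示意如下(状态维度、序列长度均为假设,并非论文的网络设置):

```python
import torch
import torch.nn as nn

class TimeShiftRegressor(nn.Module):
    """LSTM mapping a sequence of past (Deputy, Chief) spacecraft states to a
    scalar time-shift parameter. Illustrative sketch; dimensions are assumptions."""
    def __init__(self, state_dim=12, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, state_seq):                # (B, T, state_dim)
        _, (h_n, _) = self.lstm(state_seq)
        return self.head(h_n[-1]).squeeze(-1)    # (B,) predicted time shift

model = TimeShiftRegressor()
seq = torch.randn(8, 50, 12)                     # batch of 50-step state histories
time_shift = model(seq)
```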
[AI-55] Charting the Shapes of Stories with Game Theory NEURIPS2024
链接: https://arxiv.org/abs/2412.05747
作者: Constantinos Daskalakis,Ian Gemp,Yanchen Jiang,Renato Paes Leme,Christos Papadimitriou,Georgios Piliouras
关键词-EN: analysis reveals insights, reveals insights, Stories, analysis reveals, leveraging mathematical tools
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024 Creative AI Track
点击查看摘要
Abstract:Stories are records of our experiences and their analysis reveals insights into the nature of being human. Successful analyses are often interdisciplinary, leveraging mathematical tools to extract structure from stories and insights from structure. Historically, these tools have been restricted to one dimensional charts and dynamic social networks; however, modern AI offers the possibility of identifying more fully the plot structure, character incentives, and, importantly, counterfactual plot lines that the story could have taken but did not take. In this work, we use AI to model the structure of stories as game-theoretic objects, amenable to quantitative analysis. This allows us to not only interrogate each character’s decision making, but also possibly peer into the original author’s conception of the characters’ world. We demonstrate our proposed technique on Shakespeare’s famous Romeo and Juliet. We conclude with a discussion of how our analysis could be replicated in broader contexts, including real-life scenarios.
[AI-56] PrivAgent : Agent ic-based Red-teaming for LLM Privacy Leakage
链接: https://arxiv.org/abs/2412.05734
作者: Yuzhou Nie,Zhun Wang,Ye Yu,Xian Wu,Xuandong Zhao,Wenbo Guo,Dawn Song
关键词-EN: Recent studies, outputting private information, carefully crafted adversarial, system prompt, crafted adversarial prompts
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent studies have discovered that LLMs have serious privacy leakage concerns, where an LLM may be fooled into outputting private information under carefully crafted adversarial prompts. These risks include leaking system prompts, personally identifiable information, training data, and model parameters. Most existing red-teaming approaches for privacy leakage rely on humans to craft the adversarial prompts. A few automated methods are proposed for system prompt extraction, but they cannot be applied to more severe risks (e.g., training data extraction) and have limited effectiveness even for system prompt extraction. In this paper, we propose PrivAgent, a novel black-box red-teaming framework for LLM privacy leakage. We formulate different risks as a search problem with a unified attack goal. Our framework trains an open-source LLM through reinforcement learning as the attack agent to generate adversarial prompts for different target models under different risks. We propose a novel reward function to provide effective and fine-grained rewards for the attack agent. Finally, we introduce customizations to better fit our general framework to system prompt extraction and training data extraction. Through extensive evaluations, we first show that PrivAgent outperforms existing automated methods in system prompt leakage against six popular LLMs. Notably, our approach achieves a 100% success rate in extracting system prompts from real-world applications in OpenAI’s GPT Store. We also show PrivAgent’s effectiveness in extracting training data from an open-source LLM with a success rate of 5.9%. We further demonstrate PrivAgent’s effectiveness in evading the existing guardrail defense and its helpfulness in enabling better safety alignment. Finally, we validate our customized designs through a detailed ablation study. We release our code here this https URL.
[AI-57] RL Zero: Zero-Shot Language to Behaviors without any Supervision
链接: https://arxiv.org/abs/2412.05718
作者: Harshit Sikchi,Siddhant Agarwal,Pranaya Jajoo,Samyak Parajuli,Caleb Chuck,Max Rudolph,Peter Stone,Amy Zhang,Scott Niekum
关键词-EN: poor reward design, reward design, bypass reward design, Reinforcement Learning, reward function
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 27 pages
点击查看摘要
Abstract:Rewards remain an uninterpretable way to specify tasks for Reinforcement Learning, as humans are often unable to predict the optimal behavior of any given reward function, leading to poor reward design and reward hacking. Language presents an appealing way to communicate intent to agents and bypass reward design, but prior efforts to do so have been limited by costly and unscalable labeling efforts. In this work, we propose a method for a completely unsupervised alternative to grounding language instructions in a zero-shot manner to obtain policies. We present a solution that takes the form of imagine, project, and imitate: The agent imagines the observation sequence corresponding to the language description of a task, projects the imagined sequence to our target domain, and grounds it to a policy. Video-language models allow us to imagine task descriptions that leverage knowledge of tasks learned from internet-scale video-text mappings. The challenge remains to ground these generations to a policy. In this work, we show that we can achieve a zero-shot language-to-behavior policy by first grounding the imagined sequences in real observations of an unsupervised RL agent and using a closed-form solution to imitation learning that allows the RL agent to mimic the grounded observations. Our method, RLZero, is the first to our knowledge to show zero-shot language to behavior generation abilities without any supervision on a variety of tasks on simulated domains. We further show that RLZero can also generate policies zero-shot from cross-embodied videos such as those scraped from YouTube.
[AI-58] Learning Soft Driving Constraints from Vectorized Scene Embeddings while Imitating Expert Trajectories
链接: https://arxiv.org/abs/2412.05717
作者: Niloufar Saeidi Mobarakeh,Behzad Khamidehi,Chunlin Li,Hamidreza Mirkhani,Fazel Arasteh,Mohammed Elmahgiubi,Weize Zhang,Kasra Rezaee,Pascal Poupart
关键词-EN: primary goal, generate safe, safe and efficient, motion planning, motion planning models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The primary goal of motion planning is to generate safe and efficient trajectories for vehicles. Traditionally, motion planning models are trained using imitation learning to mimic the behavior of human experts. However, these models often lack interpretability and fail to provide clear justifications for their decisions. We propose a method that integrates constraint learning into imitation learning by extracting driving constraints from expert trajectories. Our approach utilizes vectorized scene embeddings that capture critical spatial and temporal features, enabling the model to identify and generalize constraints across various driving scenarios. We formulate the constraint learning problem using a maximum entropy model, which scores the motion planner’s trajectories based on their similarity to the expert trajectory. By separating the scoring process into distinct reward and constraint streams, we improve both the interpretability of the planner’s behavior and its attention to relevant scene components. Unlike existing constraint learning methods that rely on simulators and are typically embedded in reinforcement learning (RL) or inverse reinforcement learning (IRL) frameworks, our method operates without simulators, making it applicable to a wider range of datasets and real-world scenarios. Experimental results on the InD and TrafficJams datasets demonstrate that incorporating driving constraints enhances model interpretability and improves closed-loop performance.
[AI-59] Flow-based Detection of Botnets through Bio-inspired Optimisation of Machine Learning
链接: https://arxiv.org/abs/2412.05688
作者: Biju Issac,Kyle Fryer,Seibu Mary Jacob
关键词-EN: autonomously infect, communicate and coordinate, enabling cybercriminals, cybercriminals to exploit, exploit the cumulative
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 24 pages
点击查看摘要
Abstract:Botnets could autonomously infect, propagate, communicate and coordinate with other members in the botnet, enabling cybercriminals to exploit the cumulative computing and bandwidth of its bots to facilitate cybercrime. Traditional detection methods are becoming increasingly unsuitable against various network-based detection evasion methods. These techniques ultimately render signature-based fingerprinting detection infeasible and thus this research explores the application of network flow-based behavioural modelling to facilitate the binary classification of bot network activity, whereby the detection is independent of underlying communications architectures, ports, protocols and payload-based detection evasion mechanisms. A comparative evaluation of various machine learning classification methods is conducted, to precisely determine the average accuracy of each classifier on bot datasets like CTU-13, ISOT 2010 and ISCX 2014. Additionally, hyperparameter tuning using Genetic Algorithm (GA), aiming to efficiently converge to the fittest hyperparameter set for each dataset was done. The bioinspired optimisation of Random Forest (RF) with GA achieved an average accuracy of 99.85% when it was tested against the three datasets. The model was then developed into a software product. The YouTube link of the project and demo of the software developed: this https URL
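下面是"遗传算法调优随机森林超参数"思路的一个简化示意(选择/变异策略与论文未必一致,且用合成数据代替 CTU-13 等僵尸网络数据集):

```python
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

def fitness(ind):
    # Individual = (n_estimators, max_depth); fitness = 3-fold CV accuracy.
    n_estimators, max_depth = ind
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

def mutate(ind):
    return (max(10, ind[0] + random.randint(-20, 20)),
            max(2, ind[1] + random.randint(-2, 2)))

# Simple GA: keep the best half each generation and refill with mutated copies.
population = [(random.randint(10, 200), random.randint(2, 15)) for _ in range(8)]
for gen in range(5):
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[: len(scored) // 2]
    population = parents + [mutate(random.choice(parents)) for _ in range(len(scored) - len(parents))]

best = max(population, key=fitness)
print("best (n_estimators, max_depth):", best, "accuracy:", round(fitness(best), 4))
```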
[AI-60] raining neural networks without backpropagation using particles
链接: https://arxiv.org/abs/2412.05667
作者: Deepak Kumar
关键词-EN: Neural networks, human brain, constructed neural network, multiple layers, layers to mimic
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 14 pages, 8 figures
点击查看摘要
Abstract:Neural networks are a group of neurons stacked together in multiple layers to mimic the biological neurons in a human brain. Neural networks have been trained using the backpropagation algorithm based on gradient descent strategy for several decades. Several variants have been developed to improve the backpropagation algorithm. The loss function for the neural network is optimized through backpropagation, but several local minima exist in the manifold of the constructed neural network. We obtain several solutions matching the minima. The gradient descent strategy cannot avoid the problem of local minima and gets stuck in the minima due to the initialization. Particle swarm optimization (PSO) was proposed to select the best local minima among the search space of the loss function. The search space is limited to the instantiated particles in the PSO algorithm, and sometimes it cannot select the best solution. In the proposed approach, we overcome the problem of gradient descent and the limitation of the PSO algorithm by training individual neurons separately, capable of collectively solving the problem as a group of neurons forming a network.
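标准 PSO 直接搜索网络权重(完全不使用反向传播)的最小示意如下;论文进一步将每个神经元单独训练以缓解搜索空间受限的问题,此处仅演示基本思路,结构与参数均为假设:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)        # toy XOR-like target

def forward(w, X):
    # Tiny 2-4-1 network, weights packed into a flat 17-dimensional vector.
    W1, b1 = w[:8].reshape(2, 4), w[8:12]
    W2, b2 = w[12:16], w[16]
    h = np.tanh(X @ W1 + b1)
    return 1 / (1 + np.exp(-(h @ W2 + b2)))

def loss(w):
    return np.mean((forward(w, X) - y) ** 2)

# Standard PSO over the weight vector (no gradients used anywhere).
n_particles, dim = 30, 17
pos = rng.normal(size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([loss(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(200):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([loss(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("final loss:", round(pbest_val.min(), 4))
```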
[AI-61] Hyperedge Anomaly Detection with Hypergraph Neural Network
链接: https://arxiv.org/abs/2412.05641
作者: Md. Tanvir Alam,Chowdhury Farhan Ahmed,Carson K. Leung
关键词-EN: data structure, Hypergraph, data, data entities, entities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Hypergraph is a data structure that enables us to model higher-order associations among data entities. Conventional graph-structured data can represent pairwise relationships only, whereas hypergraph enables us to associate any number of entities, which is essential in many real-life applications. Hypergraph learning algorithms have been well-studied for numerous problem settings, such as node classification, link prediction, etc. However, much less research has been conducted on anomaly detection from hypergraphs. Anomaly detection identifies events that deviate from the usual pattern and can be applied to hypergraphs to detect unusual higher-order associations. In this work, we propose an end-to-end hypergraph neural network-based model for identifying anomalous associations in a hypergraph. Our proposed algorithm operates in an unsupervised manner without requiring any labeled data. Extensive experimentation on several real-life datasets demonstrates the effectiveness of our model in detecting anomalous hyperedges.
[AI-62] From Flexibility to Manipulation: The Slippery Slope of XAI Evaluation ECCV2024 NEURIPS2024
链接: https://arxiv.org/abs/2412.05592
作者: Kristoffer Wickstrøm,Marina Marie-Claire Höhne,Anna Hedström
关键词-EN: explainable artificial intelligence, truth explanation labels, artificial intelligence, ground truth, explainable artificial
类目: Artificial Intelligence (cs.AI)
*备注: Published in ECCV 2024 Workshop on Explainable Computer Vision: Where are We and Where are We Going? Shorter non-archival version also appeared in the NeurIPS 2024 Interpretable AI workshop. Code is available at \url{ this https URL }
点击查看摘要
Abstract:The lack of ground truth explanation labels is a fundamental challenge for quantitative evaluation in explainable artificial intelligence (XAI). This challenge becomes especially problematic when evaluation methods have numerous hyperparameters that must be specified by the user, as there is no ground truth to determine an optimal hyperparameter selection. It is typically not feasible to do an exhaustive search of hyperparameters so researchers typically make a normative choice based on similar studies in the literature, which provides great flexibility for the user. In this work, we illustrate how this flexibility can be exploited to manipulate the evaluation outcome. We frame this manipulation as an adversarial attack on the evaluation where seemingly innocent changes in hyperparameter setting significantly influence the evaluation outcome. We demonstrate the effectiveness of our manipulation across several datasets with large changes in evaluation outcomes across several explanation methods and models. Lastly, we propose a mitigation strategy based on ranking across hyperparameters that aims to provide robustness towards such manipulation. This work highlights the difficulty of conducting reliable XAI evaluation and emphasizes the importance of a holistic and transparent approach to evaluation in XAI.
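基于"跨超参数排名"的缓解策略可以用一个极小的例子说明:对每个超参数设置分别给各解释方法排名,再对排名取平均,而不是只依赖某个单一设置下的分数(数据为虚构):

```python
import numpy as np

# Rows: explanation methods; columns: hyperparameter settings of the evaluation
# metric; entries are raw evaluation scores (fabricated for illustration only).
scores = np.array([
    [0.82, 0.40, 0.75],   # method A
    [0.78, 0.90, 0.55],   # method B
    [0.60, 0.85, 0.80],   # method C
])

# Rank methods within each hyperparameter setting (1 = best), then average the
# ranks across settings instead of trusting any single setting.
ranks = scores.shape[0] - scores.argsort(axis=0).argsort(axis=0)
mean_rank = ranks.mean(axis=1)
for name, r in zip("ABC", mean_rank):
    print(f"method {name}: mean rank {r:.2f}")
```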
[AI-63] GEE-OPs: An Operator Knowledge Base for Geospatial Code Generation on the Google Earth Engine Platform Powered by Large Language Models
链接: https://arxiv.org/abs/2412.05587
作者: Shuyang Hou,Jianyuan Liang,Anqi Zhao,Huayi Wu
关键词-EN: Google Earth Engine, Earth Engine, Google Earth, platform presents dual, spatiotemporal data continue
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:
点击查看摘要
Abstract:As the scale and complexity of spatiotemporal data continue to grow rapidly, the use of geospatial modeling on the Google Earth Engine (GEE) platform presents dual challenges: improving the coding efficiency of domain experts and enhancing the coding capabilities of interdisciplinary users. To address these challenges and improve the performance of large language models (LLMs) in geospatial code generation tasks, we propose a framework for building a geospatial operator knowledge base tailored to the GEE JavaScript API. This framework consists of an operator syntax knowledge table, an operator relationship frequency table, an operator frequent pattern knowledge table, and an operator relationship chain knowledge table. By leveraging Abstract Syntax Tree (AST) techniques and frequent itemset mining, we systematically extract operator knowledge from 185,236 real GEE scripts and syntax documentation, forming a structured knowledge base. Experimental results demonstrate that the framework achieves over 90% accuracy, recall, and F1 score in operator knowledge extraction. When integrated with the Retrieval-Augmented Generation (RAG) strategy for LLM-based geospatial code generation tasks, the knowledge base improves performance by 20-30%. Ablation studies further quantify the necessity of each knowledge table in the knowledge base construction. This work provides robust support for the advancement and application of geospatial code modeling techniques, offering an innovative approach to constructing domain-specific knowledge bases that enhance the code generation capabilities of LLMs, and fostering the deeper integration of generative AI technologies within the field of geoinformatics.
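"算子关系频率表"的构建思想可用一个极简的共现计数示意(真实流程基于 AST 解析与频繁项集挖掘 18 万余个 GEE 脚本,此处仅用正则粗略抽取算子调用,脚本为虚构):

```python
import re
from collections import Counter
from itertools import combinations

# Toy GEE JavaScript snippets (fabricated); the real pipeline parses scripts via ASTs.
scripts = [
    "var img = ee.ImageCollection('LANDSAT/LC08').filterDate('2020-01-01','2020-12-31').median();",
    "var col = ee.ImageCollection('COPERNICUS/S2').filterBounds(roi).median();",
    "var ndvi = img.normalizedDifference(['B5','B4']); Map.addLayer(ndvi);",
]

op_pattern = re.compile(r"\.(\w+)\(")          # crude operator-call extraction
pair_counts = Counter()
for s in scripts:
    ops = sorted(set(op_pattern.findall(s)))
    pair_counts.update(combinations(ops, 2))   # operator co-occurrence within one script

for pair, count in pair_counts.most_common(5):
    print(pair, count)
```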
[AI-64] owards Learning to Reason: Comparing LLM s with Neuro-Symbolic on Arithmetic Relations in Abstract Reasoning
链接: https://arxiv.org/abs/2412.05586
作者: Michael Hersche,Giacomo Camposampiero,Roger Wattenhofer,Abu Sebastian,Abbas Rahimi
关键词-EN: Raven progressive matrices, work compares large, compares large language, solving Raven progressive, large language models
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注:
点击查看摘要
Abstract:This work compares large language models (LLMs) and neuro-symbolic approaches in solving Raven’s progressive matrices (RPM), a visual abstract reasoning test that involves the understanding of mathematical rules such as progression or arithmetic addition. Providing the visual attributes directly as textual prompts, which assumes an oracle visual perception module, allows us to measure the model’s abstract reasoning capability in isolation. Despite providing such compositionally structured representations from the oracle visual perception and advanced prompting techniques, both GPT-4 and Llama-3 70B cannot achieve perfect accuracy on the center constellation of the I-RAVEN dataset. Our analysis reveals that the root cause lies in the LLM’s weakness in understanding and executing arithmetic rules. As a potential remedy, we analyze the Abductive Rule Learner with Context-awareness (ARLC), a neuro-symbolic approach that learns to reason with vector-symbolic architectures (VSAs). Here, concepts are represented with distributed vectors s.t. dot products between encoded vectors define a similarity kernel, and simple element-wise operations on the vectors perform addition/subtraction on the encoded values. We find that ARLC achieves almost perfect accuracy on the center constellation of I-RAVEN, demonstrating a high fidelity in arithmetic rules. To stress the length generalization capabilities of the models, we extend the RPM tests to larger matrices (3x10 instead of typical 3x3) and larger dynamic ranges of the attribute values (from 10 up to 1000). We find that the LLM’s accuracy of solving arithmetic rules drops to sub-10%, especially as the dynamic range expands, while ARLC can maintain a high accuracy due to emulating symbolic computations on top of properly distributed representations. Our code is available at this https URL.
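"点积定义相似度核、逐元素运算实现编码值加减"的思想,可用 VSA 中常见的分数幂编码(fractional power encoding)来示意(不保证与 ARLC 的具体实现一致):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2048
base_phases = rng.uniform(-np.pi, np.pi, size=d)   # random base hypervector (phasor form)

def encode(x):
    # Fractional power encoding: raising the base phasor to the power x.
    return np.exp(1j * base_phases * x)

def sim(a, b):
    # Normalized dot product acts as a similarity kernel over encoded values.
    return np.real(np.vdot(a, b)) / d

# Element-wise multiplication of encodings adds the underlying values: enc(3)*enc(4) == enc(7).
added = encode(3) * encode(4)
print(round(sim(added, encode(7)), 3))   # close to 1.0
print(round(sim(added, encode(8)), 3))   # near 0
```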
[AI-65] Electrocardiogram (ECG) Based Cardiac Arrhythmia Detection and Classification using Machine Learning Algorithms
链接: https://arxiv.org/abs/2412.05583
作者: Atit Pokharel,Shashank Dahal,Pratik Sapkota,Bhupendra Bimal Chhetri
关键词-EN: specifically Machine Learning, Machine Learning, Deep Learning, Artificial Intelligence, specifically Machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:The rapid advancements in Artificial Intelligence, specifically Machine Learning (ML) and Deep Learning (DL), have opened new prospects in medical sciences for improved diagnosis, prognosis, and treatment of severe health conditions. This paper focuses on the development of an ML model with high predictive accuracy to classify arrhythmic electrocardiogram (ECG) signals. The ECG signals datasets utilized in this study were sourced from the PhysioNet and MIT-BIH databases. The research commenced with binary classification, where an optimized Bidirectional Long Short-Term Memory (Bi-LSTM) model yielded excellent results in differentiating normal and atrial fibrillation signals. A pivotal aspect of this research was a survey among medical professionals, which not only validated the practicality of AI-based ECG classifiers but also identified areas for improvement, including accuracy and the inclusion of more arrhythmia types. These insights drove the development of an advanced Convolutional Neural Network (CNN) system capable of classifying five different types of ECG signals with better accuracy and precision. The CNN model’s robust performance was ensured through rigorous stratified 5-fold cross validation. A web portal was also developed to demonstrate real-world utility, offering access to the trained model for real-time classification. This study highlights the potential applications of such models in remote health monitoring, predictive healthcare, assistive diagnostic tools, and simulated environments for educational training and interdisciplinary collaboration between data scientists and medical personnel.
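一个五分类 ECG 一维 CNN 的骨架示意如下(通道数、卷积核大小、采样长度均为假设,并非论文所用网络):

```python
import torch
import torch.nn as nn

class ECGClassifier(nn.Module):
    """1D CNN for 5-class ECG beat classification. Illustrative sketch only."""
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                          # x: (B, 1, seq_len)
        z = self.features(x).squeeze(-1)
        return self.classifier(z)

model = ECGClassifier()
logits = model(torch.randn(8, 1, 360))             # 8 single-lead ECG segments
print(logits.shape)                                 # torch.Size([8, 5])
```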
[AI-66] Fragmented Layer Grouping in GUI Designs Through Graph Learning Based on Multimodal Information
链接: https://arxiv.org/abs/2412.05555
作者: Yunnong Chen,Shuhong Xiao,Jiazhi Li,Tingting Zhou,Yanfang Chang,Yankun Zhen,Lingyun Sun,Liuqing Chen
关键词-EN: Automatically constructing GUI, Automatically constructing, automating GUI design, critical intelligent step, constructing GUI groups
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 28 pages,6 figures
点击查看摘要
Abstract:Automatically constructing GUI groups of different granularities constitutes a critical intelligent step towards automating GUI design and implementation tasks. Specifically, in the industrial GUI-to-code process, fragmented layers may decrease the readability and maintainability of generated code, which can be alleviated by grouping semantically consistent fragmented layers in the design prototypes. This study aims to propose a graph-learning-based approach to tackle the fragmented layer grouping problem according to multi-modal information in design prototypes. Our graph learning module consists of self-attention and graph neural network modules. By taking the multimodal fused representation of GUI layers as input, we innovatively group fragmented layers by classifying GUI layers and regressing the bounding boxes of the corresponding GUI components simultaneously. Experiments on two real-world datasets demonstrate that our model achieves state-of-the-art performance. A further user study is also conducted to validate that our approach can assist an intelligent downstream tool in generating more maintainable and readable front-end code.
[AI-67] KG-Retriever: Efficient Knowledge Indexing for Retrieval-Augmented Large Language Models
链接: https://arxiv.org/abs/2412.05547
作者: Weijie Chen,Ting Bai,Jinbo Su,Jian Luan,Wei Liu,Chuan Shi
关键词-EN: Large language models, multi-hop question answering, retrieval-augmented generation encounter, generate comprehensive responses, comprehensive responses based
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models with retrieval-augmented generation encounter a pivotal challenge in intricate retrieval tasks, e.g., multi-hop question answering, which requires the model to navigate across multiple documents and generate comprehensive responses based on fragmented information. To tackle this challenge, we introduce a novel Knowledge Graph-based RAG framework with a hierarchical knowledge retriever, termed KG-Retriever. The retrieval indexing in KG-Retriever is constructed on a hierarchical index graph that consists of a knowledge graph layer and a collaborative document layer. The associative nature of graph structures is fully utilized to strengthen intra-document and inter-document connectivity, thereby fundamentally alleviating the information fragmentation problem and meanwhile improving the retrieval efficiency in cross-document retrieval of LLMs. With the coarse-grained collaborative information from neighboring documents and concise information from the knowledge graph, KG-Retriever achieves marked improvements on five public QA datasets, showing the effectiveness and efficiency of our proposed RAG framework.
[AI-68] owards 3D Acceleration for low-power Mixture-of-Experts and Multi-Head Attention Spiking Transformers
链接: https://arxiv.org/abs/2412.05540
作者: Boxun Xu,Junyoung Hwang,Pruek Vanna-iampikul,Yuxuan Yin,Sung Kyu Lim,Peng Li
关键词-EN: Spiking Neural Networks, Neural Networks, energy-efficient deep learning, unlock energy-efficient deep, deep learning
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:
点击查看摘要
Abstract:Spiking Neural Networks(SNNs) provide a brain-inspired and event-driven mechanism that is believed to be critical to unlock energy-efficient deep learning. The mixture-of-experts approach mirrors the parallel distributed processing of nervous systems, introducing conditional computation policies and expanding model capacity without scaling up the number of computational operations. Additionally, spiking mixture-of-experts self-attention mechanisms enhance representation capacity, effectively capturing diverse patterns of entities and dependencies between visual or linguistic tokens. However, there is currently a lack of hardware support for highly parallel distributed processing needed by spiking transformers, which embody a brain-inspired computation. This paper introduces the first 3D hardware architecture and design methodology for Mixture-of-Experts and Multi-Head Attention spiking transformers. By leveraging 3D integration with memory-on-logic and logic-on-logic stacking, we explore such brain-inspired accelerators with spatially stackable circuitry, demonstrating significant optimization of energy efficiency and latency compared to conventional 2D CMOS integration.
[AI-69] Memory-enhanced Invariant Prompt Learning for Urban Flow Prediction under Distribution Shifts
链接: https://arxiv.org/abs/2412.05534
作者: Haiyang Jiang,Tong Chen,Wentao Zhang,Nguyen Quoc Viet Hung,Yuan Yuan,Yong Li,Lizhen Cui
关键词-EN: future traffic flow, Graph Neural Networks, classic spatial-temporal forecasting, spatial-temporal forecasting task, Urban flow
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Urban flow prediction is a classic spatial-temporal forecasting task that estimates the amount of future traffic flow for a given location. Though models represented by Spatial-Temporal Graph Neural Networks (STGNNs) have established themselves as capable predictors, they tend to suffer from distribution shifts that are common with the urban flow data due to the dynamics and unpredictability of spatial-temporal events. Unfortunately, in spatial-temporal applications, the dynamic environments can hardly be quantified via a fixed number of parameters, whereas learning time- and location-specific environments can quickly become computationally prohibitive. In this paper, we propose a novel framework named Memory-enhanced Invariant Prompt learning (MIP) for urban flow prediction under constant distribution shifts. Specifically, MIP is equipped with a learnable memory bank that is trained to memorize the causal features within the spatial-temporal graph. By querying a trainable memory bank that stores the causal features, we adaptively extract invariant and variant prompts (i.e., patterns) for a given location at every time step. Then, instead of intervening the raw data based on simulated environments, we directly perform intervention on variant prompts across space and time. With the intervened variant prompts in place, we use invariant learning to minimize the variance of predictions, so as to ensure that the predictions are only made with invariant features. With extensive comparative experiments on two public urban flow datasets, we thoroughly demonstrate the robustness of MIP against OOD data.
[AI-70] AI Planning: A Primer and Survey (Preliminary Report)
链接: https://arxiv.org/abs/2412.05528
作者: Dillon Z. Chen,Pulkit Verma,Siddharth Srivastava,Michael Katz,Sylvie Thiébaux
关键词-EN: spans multiple sub-disciplines, Automated decision-making, foundation models, operations research, fundamental topic
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Automated decision-making is a fundamental topic that spans multiple sub-disciplines in AI: reinforcement learning (RL), AI planning (AP), foundation models, and operations research, among others. Despite recent efforts to “bridge the gaps” between these communities, there remain many insights that have not yet transcended the boundaries. Our goal in this paper is to provide a brief and non-exhaustive primer on ideas well-known in AP, but less so in other sub-disciplines. We do so by introducing the classical AP problem and representation, and extensions that handle uncertainty and time through the Markov Decision Process formalism. Next, we survey state-of-the-art techniques and ideas for solving AP problems, focusing on their ability to exploit problem structure. Lastly, we cover subfields within AP for learning structure from unstructured inputs and learning to generalise to unseen scenarios and situations.
[AI-71] More than Marketing? On the Information Value of AI Benchmarks for Practitioners
链接: https://arxiv.org/abs/2412.05520
作者: Amelia Hardy,Anka Reuel,Kiana Jafari Meimandi,Lisa Soder,Allie Griffith,Dylan M. Asmar,Sanmi Koyejo,Michael S. Bernstein,Mykel J. Kochenderfer
关键词-EN: competitive market, widely broadcast, developers as indicators, growing and competitive, benchmarks
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks – even those developed internally for specific tasks – were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.
[AI-72] rimming Down Large Spiking Vision Transformers via Heterogeneous Quantization Search
链接: https://arxiv.org/abs/2412.05505
作者: Boxun Xu,Yufei Song,Peng Li
关键词-EN: Artificial Neural Networks, Spiking Neural Networks, Neural Networks, neuromorphic hardware due, Artificial Neural
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Spiking Neural Networks (SNNs) are amenable to deployment on edge devices and neuromorphic hardware due to their lower dissipation. Recently, SNN-based transformers have garnered significant interest, incorporating attention mechanisms akin to their counterparts in Artificial Neural Networks (ANNs) while demonstrating excellent performance. However, deploying large spiking transformer models on resource-constrained edge devices such as mobile phones, still poses significant challenges resulted from the high computational demands of large uncompressed high-precision models. In this work, we introduce a novel heterogeneous quantization method for compressing spiking transformers through layer-wise quantization. Our approach optimizes the quantization of each layer using one of two distinct quantization schemes, i.e., uniform or power-of-two quantification, with mixed bit resolutions. Our heterogeneous quantization demonstrates the feasibility of maintaining high performance for spiking transformers while utilizing an average effective resolution of 3.14-3.67 bits with less than a 1% accuracy drop on DVS Gesture and CIFAR10-DVS datasets. It attains a model compression rate of 8.71x-10.19x for standard floating-point spiking transformers. Moreover, the proposed approach achieves a significant energy reduction of 5.69x, 8.72x, and 10.2x while maintaining high accuracy levels of 85.3%, 97.57%, and 80.4% on N-Caltech101, DVS-Gesture, and CIFAR10-DVS datasets, respectively.
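"均匀量化"与"2 的幂次量化"两种逐层可选的量化方案可用如下两个函数示意(位宽与缩放方式为假设,并非论文的具体量化器):

```python
import torch

def uniform_quantize(w, bits):
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def power_of_two_quantize(w, bits):
    """Quantize magnitudes to the nearest power of two (sign preserved)."""
    sign = torch.sign(w)
    exp = torch.round(torch.log2(w.abs().clamp(min=1e-8)))
    max_exp = exp.max().item()
    exp = exp.clamp(min=max_exp - (2 ** (bits - 1) - 1))   # keep a limited exponent range
    return sign * torch.pow(2.0, exp)

w = torch.randn(4, 4)
print(uniform_quantize(w, 4))
print(power_of_two_quantize(w, 4))
```

层级异构量化即是在逐层之间按精度/能耗的折衷选择上面两种方案之一及其位宽。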
[AI-73] A New Perspective on Time Series Anomaly Detection: Faster Patch-based Broad Learning System
链接: https://arxiv.org/abs/2412.05498
作者: Pengyu Li,Zhijie Zhong,Tong Zhang,Zhiwen Yu,C.L. Philip Chen,Kaixiang Yang
关键词-EN: Broad Learning System, Time series anomaly, learning, Deep learning, Time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages, 7 figures, 3 tables, Under review
点击查看摘要
Abstract:Time series anomaly detection (TSAD) has been a research hotspot in both academia and industry in recent years. Deep learning methods have become the mainstream research direction due to their excellent performance. However, new viewpoints have emerged in recent TSAD research. Deep learning is not required for TSAD due to limitations such as slow deep learning speed. The Broad Learning System (BLS) is a shallow network framework that benefits from its ease of optimization and speed. It has been shown to outperform machine learning approaches while remaining competitive with deep learning. Based on the current situation of TSAD, we propose the Contrastive Patch-based Broad Learning System (CPatchBLS). This is a new exploration of patching technique and BLS, providing a new perspective for TSAD. We construct Dual-PatchBLS as a base through patching and Simple Kernel Perturbation (SKP) and utilize contrastive learning to capture the differences between normal and abnormal data under different representations. To compensate for the temporal semantic loss caused by various patching, we propose CPatchBLS with model level integration, which takes advantage of BLS’s fast feature to build model-level integration and improve model detection. Using five real-world series anomaly detection datasets, we confirmed the method’s efficacy, outperforming previous deep learning and machine learning methods while retaining a high level of computing efficiency.
[AI-74] A Compositional Atlas for Algebraic Circuits NEURIPS2024
链接: https://arxiv.org/abs/2412.05481
作者: Benjie Wang,Denis Deratani Mauá,Guy Van den Broeck,YooJung Choi
关键词-EN: compactly encode knowledge, Boolean functions, encode knowledge, probability distributions, based on sum-product
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Circuits based on sum-product structure have become a ubiquitous representation to compactly encode knowledge, from Boolean functions to probability distributions. By imposing constraints on the structure of such circuits, certain inference queries become tractable, such as model counting and most probable configuration. Recent works have explored analyzing probabilistic and causal inference queries as compositions of basic operators to derive tractability conditions. In this paper, we take an algebraic perspective for compositional inference, and show that a large class of queries - including marginal MAP, probabilistic answer set programming inference, and causal backdoor adjustment - correspond to a combination of basic operators over semirings: aggregation, product, and elementwise mapping. Using this framework, we uncover simple and general sufficient conditions for tractable composition of these operators, in terms of circuit properties (e.g., marginal determinism, compatibility) and conditions on the elementwise mappings. Applying our analysis, we derive novel tractability conditions for many such compositional queries. Our results unify tractability conditions for existing problems on circuits, while providing a blueprint for analysing novel compositional inference queries.
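"聚合、乘积、逐元素映射"在不同半环上的组合,可以用同一个因子表的玩具例子说明:sum-product 半环给出归一化常数,max-product 半环给出最可能配置的得分(仅为示意,与论文的电路形式化无直接对应):

```python
import itertools

# Toy factorized model over two binary variables: p(a, b) proportional to f1(a) * f2(a, b).
f1 = {0: 0.6, 1: 0.4}
f2 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}

def compose(aggregate, product):
    # Aggregation over all assignments of the product of factors,
    # instantiated with different semirings.
    return aggregate(product(f1[a], f2[(a, b)]) for a, b in itertools.product([0, 1], [0, 1]))

partition = compose(sum, lambda x, y: x * y)   # sum-product semiring: normalizing constant
best = compose(max, lambda x, y: x * y)        # max-product semiring: most probable configuration score
print(partition, best)                          # 1.0 0.54
```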
[AI-75] he BrowserGym Ecosystem for Web Agent Research
链接: https://arxiv.org/abs/2412.05467
作者: Thibault Le Sellier De Chezelles,Maxime Gasse,Alexandre Lacoste,Alexandre Drouin,Massimo Caccia,Léo Boisvert,Megh Thakkar,Tom Marty,Rim Assouel,Sahar Omidi Shayegan,Lawrence Keunho Jang,Xing Han Lù,Ori Yoran,Dehan Kong,Frank F. Xu,Siva Reddy,Quentin Cappart,Graham Neubig,Ruslan Salakhutdinov,Nicolas Chapados
关键词-EN: Large Language Models, BrowserGym ecosystem addresses, Large Language, Language Models, ecosystem addresses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic’s latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
[AI-76] A Graph-Based Approach for Conversational AI-Driven Personal Memory Capture and Retrieval in a Real-world Application
链接: https://arxiv.org/abs/2412.05447
作者: Savini Kashmira,Jayanaka L. Dantanarayana,Joshua Brodsky,Ashish Mahendra,Yiping Kang,Krisztian Flautner,Lingjia Tang,Jason Mars
关键词-EN: user-engaging AI-guided conversational, AI-guided conversational approach, personal memories’, captures and retrieves, mobile application
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:TOBU is a novel mobile application that captures and retrieves “personal memories” (pictures/videos together with stories and context around those moments) in a user-engaging AI-guided conversational approach. Our initial prototype showed that existing retrieval techniques such as retrieval-augmented generation (RAG) systems fall short due to their limitations in understanding memory relationships, causing low recall, hallucination, and unsatisfactory user experience. We design TOBUGraph, a novel graph-based retrieval approach. During capturing, TOBUGraph leverages large language models (LLMs) to automatically create a dynamic knowledge graph of memories, establishing context and relationships of those memories. During retrieval, TOBUGraph combines LLMs with the memory graph to achieve comprehensive recall through graph traversal. Our evaluation using real user data demonstrates that TOBUGraph outperforms multiple RAG implementations in both precision and recall, significantly improving user experience through improved retrieval accuracy and reduced hallucination.
[AI-77] From Voice to Value: Leveraging AI to Enhance Spoken Online Reviews on the Go
链接: https://arxiv.org/abs/2412.05445
作者: Kavindu Ravishan,Dániel Szabó,Niels van Berkel,Aku Visuri,Chi-Lan Yang,Koji Yatani,Simo Hosio
关键词-EN: make better decisions, people make, reviews, users, Abstract
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Online reviews help people make better decisions. Review platforms usually depend on typed input, where leaving a good review requires significant effort because users must carefully organize and articulate their thoughts. This may discourage users from leaving comprehensive and high-quality reviews, especially when they are on the go. To address this challenge, we developed Vocalizer, a mobile application that enables users to provide reviews through voice input, with enhancements from a large language model (LLM). In a longitudinal study, we analysed user interactions with the app, focusing on AI-driven features that help refine and improve reviews. Our findings show that users frequently utilized the AI agent to add more detailed information to their reviews. We also show how interactive AI features can improve users' self-efficacy and willingness to share reviews online. Finally, we discuss the opportunities and challenges of integrating AI assistance into review-writing systems.
[AI-78] DRL4AOI: A DRL Framework for Semantic-aware AOI Segmentation in Location-Based Services
链接: https://arxiv.org/abs/2412.05437
作者: Youfang Lin,Jinji Fu,Haomin Wen,Jiyuan Wang,Zhenjie Wei,Yuting Qiang,Xiaowei Mao,Lixia Wu,Haoyuan Hu,Yuxuan Liang,Huaiyu Wan
关键词-EN: AOI segmentation, urban geographical spaces, AOI, partition urban areas, Areas of Interest
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages
点击查看摘要
Abstract:In Location-Based Services (LBS), such as food delivery, a fundamental task is segmenting Areas of Interest (AOIs), aiming at partitioning the urban geographical spaces into non-overlapping regions. Traditional AOI segmentation algorithms primarily rely on road networks to partition urban areas. While promising in modeling the geo-semantics, road network-based models overlooked the service-semantic goals (e.g., workload equality) in LBS service. In this paper, we point out that the AOI segmentation problem can be naturally formulated as a Markov Decision Process (MDP), which gradually chooses a nearby AOI for each grid in the current AOI’s border. Based on the MDP, we present the first attempt to generalize Deep Reinforcement Learning (DRL) for AOI segmentation, leading to a novel DRL-based framework called DRL4AOI. The DRL4AOI framework introduces different service-semantic goals in a flexible way by treating them as rewards that guide the AOI generation. To evaluate the effectiveness of DRL4AOI, we develop and release an AOI segmentation system. We also present a representative implementation of DRL4AOI - TrajRL4AOI - for AOI segmentation in the logistics service. It introduces a Double Deep Q-learning Network (DDQN) to gradually optimize the AOI generation for two specific semantic goals: i) trajectory modularity, i.e., maximize tightness of the trajectory connections within an AOI and the sparsity of connections between AOIs, ii) matchness with the road network, i.e., maximizing the matchness between AOIs and the road network. Quantitative and qualitative experiments conducted on synthetic and real-world data demonstrate the effectiveness and superiority of our method. The code and system is publicly available at this https URL.
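The Double Deep Q-learning Network mentioned in the abstract is a standard building block. Below is a minimal, generic sketch of the DDQN target computation; the AOI-specific state encoding and the modularity/matchness rewards from the paper are not reproduced, and all names are illustrative rather than the authors' code.

```python
import torch

def double_dqn_targets(online_q, target_q, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: pick the next action with the online network,
    evaluate it with the target network (reduces over-estimation bias)."""
    with torch.no_grad():
        next_actions = online_q(next_states).argmax(dim=1, keepdim=True)
        next_values = target_q(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_values
```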
[AI-79] KEDformer: Knowledge Extraction Seasonal Trend Decomposition for Long-term Sequence Prediction
链接: https://arxiv.org/abs/2412.05421
作者: Zhenkai Qin,Baozhong Wei,Caifeng Gao,Jianyuan Ni
关键词-EN: accurate long-term predictions, predictions are essential, critical task, Time series, Time series forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Time series forecasting is a critical task in domains such as energy, finance, and meteorology, where accurate long-term predictions are essential. While Transformer-based models have shown promise in capturing temporal dependencies, their application to extended sequences is limited by computational inefficiencies and limited generalization. In this study, we propose KEDformer, a knowledge extraction-driven framework that integrates seasonal-trend decomposition to address these challenges. KEDformer leverages knowledge extraction methods that focus on the most informative weights within the self-attention mechanism to reduce computational overhead. Additionally, the proposed KEDformer framework decouples time series into seasonal and trend components. This decomposition enhances the model’s ability to capture both short-term fluctuations and long-term patterns. Extensive experiments on five public datasets from energy, transportation, and weather domains demonstrate the effectiveness and competitiveness of KEDformer, providing an efficient solution for long-term time series forecasting.
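Seasonal-trend decomposition of the kind described above is commonly implemented with a moving-average filter. The sketch below only illustrates that decoupling step under this assumption; KEDformer's actual decomposition block and its knowledge-extraction attention may differ.

```python
import numpy as np

def seasonal_trend_decompose(x, window=25):
    """Split a 1-D series into a smooth trend (moving average) and the
    seasonal/short-term residual, as in decomposition-based forecasters."""
    pad = window // 2
    padded = np.concatenate([np.repeat(x[0], pad), x, np.repeat(x[-1], pad)])
    trend = np.convolve(padded, np.ones(window) / window, mode="valid")
    seasonal = x - trend
    return seasonal, trend

t = np.arange(500)
series = 0.01 * t + np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(500)
seasonal, trend = seasonal_trend_decompose(series)
```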
[AI-80] FogROS2-FT: Fault Tolerant Cloud Robotics
链接: https://arxiv.org/abs/2412.05408
作者: Kaiyuan Chen,Kush Hari,Trinity Chung,Michael Wang,Nan Tian,Christian Juette,Jeffrey Ichnowski,Liu Ren,John Kubiatowicz,Ion Stoica,Ken Goldberg
关键词-EN: offload complex computational, complex computational tasks, ease of management, offload complex, complex computational
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注: IEEE/RSJ International Conference on Intelligent Robots and Systems 2024 Best Paper Finalist
点击查看摘要
Abstract:Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance and ease of management. However, cloud compute can be costly, cloud services can suffer occasional downtime, and connectivity between the robot and cloud can be prone to variations in network Quality-of-Service (QoS). We present FogROS2-FT (Fault Tolerant) to mitigate these issues by introducing a multi-cloud extension that automatically replicates independent stateless robotic services, routes requests to these replicas, and directs the first response back. With replication, robots can still benefit from cloud computations even when a cloud service provider is down or there is low QoS. Additionally, many cloud computing providers offer low-cost spot computing instances that may shutdown unpredictably. Normally, these low-cost instances would be inappropriate for cloud robotics, but the fault tolerance nature of FogROS2-FT allows them to be used reliably. We demonstrate FogROS2-FT fault tolerance capabilities in 3 cloud-robotics scenarios in simulation (visual object detection, semantic segmentation, motion planning) and 1 physical robot experiment (scan-pick-and-place). Running on the same hardware specification, FogROS2-FT achieves motion planning with up to 2.2x cost reduction and up to a 5.53x reduction on 99 Percentile (P99) long-tail latency. FogROS2-FT reduces the P99 long-tail latency of object detection and semantic segmentation by 2.0x and 2.1x, respectively, under network slowdown and resource contention.
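The core fault-tolerance pattern described here (replicate a stateless service, send the request to every replica, keep the first response) can be sketched in plain Python. This is only an illustration of the pattern; FogROS2-FT itself operates on ROS 2 services across multiple clouds, and `replica_calls` below is a hypothetical list of client callables.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def first_response(replica_calls, request):
    """Fan the same request out to independent stateless replicas and return
    whichever result arrives first; slow replicas are left behind.
    A real implementation would also skip replicas that returned an error."""
    pool = ThreadPoolExecutor(max_workers=len(replica_calls))
    futures = [pool.submit(call, request) for call in replica_calls]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)          # do not block on stragglers
    return next(iter(done)).result()
```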
[AI-81] HiVeGen – Hierarchical LLM-based Verilog Generation for Scalable Chip Design
链接: https://arxiv.org/abs/2412.05393
作者: Jinwei Tang,Jiayin Qin,Kiran Thorat,Chen Zhu-Tian,Yu Cao,Yang(Katie)Zhao,Caiwen Ding
关键词-EN: Large Language Models, Hardware Description Language, Language Models, recently demonstrating impressive, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:
点击查看摘要
Abstract:With Large Language Models (LLMs) recently demonstrating impressive proficiency in code generation, it is promising to extend their abilities to Hardware Description Language (HDL). However, LLMs tend to generate single HDL code blocks rather than hierarchical structures for hardware designs, leading to hallucinations, particularly in complex designs like Domain-Specific Accelerators (DSAs). To address this, we propose HiVeGen, a hierarchical LLM-based Verilog generation framework that decomposes generation tasks into LLM-manageable hierarchical submodules. HiVeGen further harnesses the advantages of such hierarchical structures by integrating automatic Design Space Exploration (DSE) into hierarchy-aware prompt generation, introducing weight-based retrieval to enhance code reuse, and enabling real-time human-computer interaction to lower error-correction cost, significantly improving the quality of generated designs.
[AI-82] IMPACT: InMemory ComPuting Architecture Based on Y-FlAsh Technology for Coalesced Tsetlin Machine Inference
链接: https://arxiv.org/abs/2412.05327
作者: Omar Ghazal,Wei Wang,Shahar Kvatinsky,Farhad Merchant,Alex Yakovlev,Rishad Shafik
关键词-EN: traditional von Neumann, von Neumann architecture, pushed data bandwidth, data bandwidth requirements, processing large volumes
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 27 Pages, 14 Figures, 6 Tables
点击查看摘要
Abstract:The increasing demand for processing large volumes of data for machine learning models has pushed data bandwidth requirements beyond the capability of traditional von Neumann architecture. In-memory computing (IMC) has recently emerged as a promising solution to address this gap by enabling distributed data storage and processing at the micro-architectural level, significantly reducing both latency and energy. In this paper, we present the IMPACT: InMemory ComPuting Architecture Based on Y-FlAsh Technology for Coalesced Tsetlin Machine Inference, underpinned on a cutting-edge memory device, Y-Flash, fabricated on a 180 nm CMOS process. Y-Flash devices have recently been demonstrated for digital and analog memory applications, offering high yield, non-volatility, and low power consumption. The IMPACT leverages the Y-Flash array to implement the inference of a novel machine learning algorithm: coalesced Tsetlin machine (CoTM) based on propositional logic. CoTM utilizes Tsetlin automata (TA) to create Boolean feature selections stochastically across parallel clauses. The IMPACT is organized into two computational crossbars for storing the TA and weights. Through validation on the MNIST dataset, IMPACT achieved 96.3% accuracy. The IMPACT demonstrated improvements in energy efficiency, e.g., 2.23X over CNN-based ReRAM, 2.46X over Neuromorphic using NOR-Flash, and 2.06X over DNN-based PCM, suited for modern ML inference applications.
[AI-83] LaNMP: A Language-Conditioned Mobile Manipulation Benchmark for Autonomous Robots
链接: https://arxiv.org/abs/2412.05313
作者: Ahmed Jaafar,Shreyas Sundara Raman,Yichen Wei,Sofia Juliani,Anneke Wernerfelt,Benedict Quartey,Ifrah Idrees,Jason Xinyu Liu,Stefanie Tellex
关键词-EN: follow natural language, capable and prevalent, holistically develop, develop and evaluate, evaluate their ability
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As robots that follow natural language become more capable and prevalent, we need a benchmark to holistically develop and evaluate their ability to solve long-horizon mobile manipulation tasks in large, diverse environments. To tackle this challenge, robots must use visual and language understanding, navigation, and manipulation capabilities. Existing datasets do not integrate all these aspects, restricting their efficacy as benchmarks. To address this gap, we present the Language, Navigation, Manipulation, Perception (LaNMP, pronounced Lamp) dataset and demonstrate the benefits of integrating these four capabilities and various modalities. LaNMP comprises 574 trajectories across eight simulated and real-world environments for long-horizon room-to-room pick-and-place tasks specified by natural language. Every trajectory consists of over 20 attributes, including RGB-D images, segmentations, and the poses of the robot body, end-effector, and grasped objects. We fine-tuned and tested two models in simulation, and evaluated a third on a physical robot, to demonstrate the benchmark’s applicability in development and evaluation, as well as making models more sample efficient. The models performed suboptimally compared to humans; however, showed promise in increasing model sample efficiency, indicating significant room for developing more sample efficient multimodal mobile manipulation models using our benchmark.
[AI-84] DRC-Coder: Automated DRC Checker Code Generation Using LLM Autonomous Agent
链接: https://arxiv.org/abs/2412.05311
作者: Chen-Chia Chang,Chia-Tung Ho,Yaguang Li,Yiran Chen,Haoxing Ren
关键词-EN: fast optimization loops, DRC, integrated DRC checkers, utilized in place, place and route
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: Proceedings of the 2025 International Symposium on Physical Design (ISPD '25), March 16–19, 2025, Austin, TX, USA
点击查看摘要
Abstract:In the advanced technology nodes, the integrated design rule checker (DRC) is often utilized in place and route tools for fast optimization loops for power-performance-area. Implementing integrated DRC checkers to meet the standard of commercial DRC tools demands extensive human expertise to interpret foundry specifications, analyze layouts, and debug code iteratively. However, this labor-intensive process, requiring to be repeated by every update of technology nodes, prolongs the turnaround time of designing circuits. In this paper, we present DRC-Coder, a multi-agent framework with vision capabilities for automated DRC code generation. By incorporating vision language models and large language models (LLM), DRC-Coder can effectively process textual, visual, and layout information to perform rule interpretation and coding by two specialized LLMs. We also design an auto-evaluation function for LLMs to enable DRC code debugging. Experimental results show that targeting on a sub-3nm technology node for a state-of-the-art standard cell layout tool, DRC-Coder achieves perfect F1 score 1.000 in generating DRC codes for meeting the standard of a commercial DRC tool, highly outperforming standard prompting techniques (F1=0.631). DRC-Coder can generate code for each design rule within four minutes on average, which significantly accelerates technology advancement and reduces engineering costs.
[AI-85] Revisiting Your Memory: Reconstruction of Affect-Contextualized Memory via EEG-guided Audiovisual Generation
链接: https://arxiv.org/abs/2412.05296
作者: Joonwoo Kwon,Heehwan Wang,Jinwoo Lee,Sooyoung Kim,Shinjae Yoo,Yuewei Lin,Jiook Cha
关键词-EN: audio-visual generation guided, introduce RecallAffectiveMemory, extracted from electroencephalogram, generation guided, EEG recordings collected
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Codes and the dataset will be released upon acceptance
点击查看摘要
Abstract:In this paper, we introduce RecallAffectiveMemory, a novel task designed to reconstruct autobiographical memories through audio-visual generation guided by affect extracted from electroencephalogram (EEG) signals. To support this pioneering task, we present the EEG-AffectiveMemory dataset, which encompasses textual descriptions, visuals, music, and EEG recordings collected during memory recall from nine participants. Furthermore, we propose RYM (Recall Your Memory), a three-stage framework for generating synchronized audio-visual contents while maintaining dynamic personal memory affect trajectories. Experimental results indicate that our method can faithfully reconstruct affect-contextualized audio-visual memory across all subjects, both qualitatively and quantitatively, with participants reporting strong affective concordance between their recalled memories and the generated content. Our approaches advance affect decoding research and its practical applications in personalized media creation via neural-based affect comprehension.
[AI-86] International Scientific Report on the Safety of Advanced AI (Interim Report) WWW
链接: https://arxiv.org/abs/2412.05282
作者: Yoshua Bengio,Sören Mindermann,Daniel Privitera,Tamay Besiroglu,Rishi Bommasani,Stephen Casper,Yejin Choi,Danielle Goldfarb,Hoda Heidari,Leila Khalatbari,Shayne Longpre,Vasilios Mavroudis,Mantas Mazeika,Kwan Yee Ng,Chinasa T. Okolo,Deborah Raji,Theodora Skeadas,Florian Tramèr,Bayo Adekanmbi,Paul Christiano,David Dalrymple,Thomas G. Dietterich,Edward Felten,Pascale Fung,Pierre-Olivier Gourinchas,Nick Jennings,Andreas Krause,Percy Liang,Teresa Ludermir,Vidushi Marda,Helen Margetts,John A. McDermid,Arvind Narayanan,Alondra Nelson,Alice Oh,Gopal Ramchurn,Stuart Russell,Marietje Schaake,Dawn Song,Alvaro Soto,Lee Tiedrich,Gaël Varoquaux,Andrew Yao,Ya-Qin Zhang
关键词-EN: Safety of Advanced, International Scientific Report, interim publication, international Expert Advisory, Expert Advisory Panel
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Available under the open government license at this https URL
点击查看摘要
Abstract:This is the interim publication of the first International Scientific Report on the Safety of Advanced AI. The report synthesises the scientific understanding of general-purpose AI – AI that can perform a wide variety of tasks – with a focus on understanding and managing its risks. A diverse group of 75 AI experts contributed to this report, including an international Expert Advisory Panel nominated by 30 countries, the EU, and the UN. Led by the Chair, these independent experts collectively had full discretion over the report’s content.
[AI-87] Security Threats in Agentic AI System
链接: https://arxiv.org/abs/2410.14728
作者: Raihan Khan,Sayak Sarkar,Sainik Kumar Mahata,Edwin Jose
关键词-EN: research paper explores, security threats posed, research paper, paper explores, explores the privacy
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 8 pages, 3 figures
点击查看摘要
Abstract:This research paper explores the privacy and security threats posed to an Agentic AI system with direct access to database systems. Such access introduces significant risks, including unauthorized retrieval of sensitive information, potential exploitation of system vulnerabilities, and misuse of personal or confidential data. The complexity of AI systems combined with their ability to process and analyze large volumes of data increases the chances of data leaks or breaches, which could occur unintentionally or through adversarial manipulation. Furthermore, as AI agents evolve with greater autonomy, their capacity to bypass or exploit security measures becomes a growing concern, heightening the need to address these critical vulnerabilities in agentic systems.
[AI-88] Sublinear Regret for a Class of Continuous-Time Linear–Quadratic Reinforcement Learning Problems
链接: https://arxiv.org/abs/2407.17226
作者: Yilie Huang,Yanwei Jia,Xun Yu Zhou
关键词-EN: running control rewards, state processes depend, control variables, control problems, running control
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: 44 pages, 4 figures
点击查看摘要
Abstract:We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions, where states are scalar-valued and running control rewards are absent but volatilities of the state processes depend on both state and control variables. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of an exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of O(N^{3/4}) up to a logarithmic factor, where N is the number of learning episodes. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparisons between our method and those of the recent model-based stochastic LQ RL studies adapted to the state- and control-dependent volatility setting, demonstrating a better performance of the former in terms of regret bounds.
[AI-89] On the Replicability and Reproducibility of Deep Learning in Software Engineering
链接: https://arxiv.org/abs/2006.14244
作者: Chao Liu,Cuiyun Gao,Xin Xia,David Lo,John Grundy,Xiaohu Yang
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-90] Imputation Matters: A Deeper Look into an Overlooked Step in Longitudinal Health and Behavior Sensing Research
链接: https://arxiv.org/abs/2412.06018
作者: Akshat Choube,Rahul Majethia,Sohini Bhattacharya,Vedant Das Swain,Jiachen Li,Varun Mishra
关键词-EN: Handling missing data, passive sensing, incomplete data, missing data effectively, health and behavior
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Longitudinal passive sensing studies for health and behavior outcomes often have missing and incomplete data. Handling missing data effectively is thus a critical data processing and modeling step. Our formative interviews with researchers working in longitudinal health and behavior passive sensing revealed a recurring theme: most researchers consider imputation a low-priority step in their analysis and inference pipeline, opting to use simple and off-the-shelf imputation strategies without comprehensively evaluating its impact on study outcomes. Through this paper, we call attention to the importance of imputation. Using publicly available passive sensing datasets for depression, we show that prioritizing imputation can significantly impact the study outcomes – with our proposed imputation strategies resulting in up to 31% improvement in AUROC to predict depression over the original imputation strategy. We conclude by discussing the challenges and opportunities with effective imputation in longitudinal sensing studies.
[AI-91] Materials-Discovery Workflows Guided by Symbolic Regression: Identifying Acid-Stable Oxides for Electrocatalysis
链接: https://arxiv.org/abs/2412.05947
作者: Akhil S. Nair,Lucas Foppa,Matthias Scheffler
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures
[AI-92] A Scoping Review of ChatGPT Research in Accounting and Finance
链接: https://arxiv.org/abs/2412.05731
作者: Mengming Michael Dong,Theophanis C. Stratopoulos,Victor Xiaoqi Wang
关键词-EN:
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); General Economics (econ.GN)
*备注: 56 pages, 3 figures, 16 tables
[AI-93] Leveraging Time-Series Foundation Model for Subsurface Well Logs Prediction and Anomaly Detection
链接: https://arxiv.org/abs/2412.05681
作者: Ardiansyah Koeshidayatullah,Abdulrahman Al-Fakih,SanLinn Ismael Kaka
关键词-EN:
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
[AI-94] No-Free-Lunch Theories for Tensor-Network Machine Learning Models
链接: https://arxiv.org/abs/2412.05674
作者: Jing-Chuan Wu,Qi Ye,Dong-Ling Deng,Li-Wei Yu
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 7+23 pages, comments welcome
[AI-95] Leveraging Large Language Models to Democratize Access to Costly Financial Datasets for Academic Research
链接: https://arxiv.org/abs/2412.02065
作者: Julian Junyan Wang,Victor Xiaoqi Wang
关键词-EN:
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); General Economics (econ.GN)
*备注:
机器学习
[LG-0] MISFEAT: Feature Selection for Subgroups with Systematic Missing Data
链接: https://arxiv.org/abs/2412.06711
作者: Bar Genossar,Thinh On,Md. Mouinul Islam,Ben Eliav,Senjuti Basu Roy,Avigdor Gal
关键词-EN: mutual Information, groups and age, investigate the problem, problem of selecting, naturally partitioned
类目: Machine Learning (cs.LG); Databases (cs.DB); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We investigate the problem of selecting features for datasets that can be naturally partitioned into subgroups (e.g., according to socio-demographic groups and age), each with its own dominant set of features. Within this subgroup-oriented framework, we address the challenge of systematic missing data, a scenario in which some feature values are missing for all tuples of a subgroup, due to flawed data integration, regulatory constraints, or privacy concerns. Feature selection is governed by finding mutual Information, a popular quantification of correlation, between features and a target variable. Our goal is to identify top-K feature subsets of some fixed size with the highest joint mutual information with a target variable. In the presence of systematic missing data, the closed form of mutual information could not simply be applied. We argue that in such a setting, leveraging relationships between available feature mutual information within a subgroup or across subgroups can assist inferring missing mutual information values. We propose a generalizable model based on heterogeneous graph neural network to identify interdependencies between feature-subgroup-target variable connections by modeling it as a multiplex graph, and employing information propagation between its nodes. We address two distinct scalability challenges related to training and propose principled solutions to tackle them. Through an extensive empirical evaluation, we demonstrate the efficacy of the proposed solutions both qualitatively and running time wise.
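To make the underlying criterion concrete, the toy sketch below scores features by their mutual information with the target, separately per subgroup. The paper's actual contribution, inferring MI values that are missing for entire subgroups with a heterogeneous graph neural network, is not reproduced here; the data and subgroup labels are synthetic placeholders.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # target depends on features 0 and 3
groups = rng.integers(0, 3, size=600)           # e.g. socio-demographic subgroups

for g in range(3):
    mi = mutual_info_classif(X[groups == g], y[groups == g], random_state=0)
    top = np.argsort(mi)[::-1][:3]
    print(f"subgroup {g}: top features by mutual information -> {top}")
```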
[LG-1] Impact of Privacy Parameters on Deep Learning Models for Image Classification
链接: https://arxiv.org/abs/2412.06689
作者: Basanta Chaulagain
关键词-EN: develop differentially private, differentially private deep, Support Vector Machine, Naive Bayes Classifier, private deep learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 10 pages
点击查看摘要
Abstract:The project aims to develop differentially private deep learning models for image classification on the CIFAR-10 dataset and analyze the impact of various privacy parameters on model accuracy. We have implemented five different deep learning models, namely ConvNet, ResNet18, EfficientNet, ViT, and DenseNet121, as well as three supervised classifiers, namely K-Nearest Neighbors, Naive Bayes Classifier, and Support Vector Machine. We evaluated the performance of these models under varying settings. Our best performing model to date is EfficientNet with test accuracy of 59.63% with the following parameters (Adam optimizer, batch size 256, epoch size 100, epsilon value 5.0, learning rate 1e-3, clipping threshold 1.0, and noise multiplier 0.912).
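The privacy parameters listed above (clipping threshold, noise multiplier, epsilon) come from the DP-SGD recipe: clip each per-sample gradient and add calibrated Gaussian noise before the update. Below is a minimal PyTorch sketch of one such step; libraries such as Opacus automate this, and the code is an illustration of the mechanism, not the project's actual training loop.

```python
import torch

def dp_sgd_step(model, loss_fn, xb, yb, lr=1e-3, clip=1.0, noise_multiplier=0.912):
    """One DP-SGD step: clip each per-sample gradient to L2 norm <= clip,
    sum, add Gaussian noise scaled by noise_multiplier * clip, then update."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xb, yb):                              # per-sample gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, (clip / (norm + 1e-12)).item())  # clip this sample's gradient
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noisy = s + noise_multiplier * clip * torch.randn_like(s)
            p.add_(noisy, alpha=-lr / len(xb))            # average noisy gradient, SGD update
```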
[LG-2] Some Best Practices in Operator Learning
链接: https://arxiv.org/abs/2412.06686
作者: Dustin Enyeart,Guang Lin
关键词-EN: computationally expensive, searches are computationally, Hyperparameters searches, Abstract, Hyperparameters
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: arXiv admin note: substantial text overlap with arXiv:2412.04578
点击查看摘要
Abstract:Hyperparameter searches are computationally expensive. This paper studies some general choices of hyperparameters and training methods specifically for operator learning. It considers the architectures DeepONets, Fourier neural operators and Koopman autoencoders for several differential equations to find robust trends. Some options considered are activation functions, dropout and stochastic weight averaging.
[LG-3] Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach
链接: https://arxiv.org/abs/2412.06684
作者: Weichao Xu,Huaxin Pei,Jingxuan Yang,Yuchen Shi,Yi Zhang
关键词-EN: witnessed surprising achievements, Recent years, driving and robotics, decision-making policies, years have witnessed
类目: Machine Learning (cs.LG)
*备注: 16 pages, 13 figures
点击查看摘要
Abstract:Recent years have witnessed surprising achievements of decision-making policies across various fields, such as autonomous driving and robotics. Testing for decision-making policies is crucial with the existence of critical scenarios that may threaten their reliability. Numerous research efforts have been dedicated to testing these policies. However, there are still significant challenges, such as low testing efficiency and diversity due to the complexity of the policies and environments under test. Inspired by the remarkable capabilities of large language models (LLMs), in this paper, we propose an LLM-driven online testing framework for efficiently testing decision-making policies. The main idea is to employ an LLM-based test scenario generator to intelligently generate challenging test cases through contemplation and reasoning. Specifically, we first design a “generate-test-feedback” pipeline and apply templated prompt engineering to fully leverage the knowledge and reasoning abilities of LLMs. Then, we introduce a multi-scale scenario generation strategy to address the inherent challenges LLMs face in making fine adjustments, further enhancing testing efficiency. Finally, we evaluate the LLM-driven approach on five widely used benchmarks. The experimental results demonstrate that our method significantly outperforms baseline approaches in uncovering both critical and diverse scenarios.
[LG-4] Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures
链接: https://arxiv.org/abs/2412.06655
作者: Adrien Bolland,Gaspard Lambrechts,Damien Ernst
关键词-EN: states and actions, learning framework based, intrinsic reward, framework based, intrinsic reward function
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process that shall be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features from these states and actions) visited during the next time steps. We first prove that an optimal exploration policy, which maximizes the expected discounted sum of intrinsic rewards, is also a policy that maximizes a lower bound on the state-action value function of the decision process under some assumptions. We also prove that the visitation distribution used in the intrinsic reward definition is the fixed point of a contraction operator. Following, we describe how to adapt existing algorithms to learn this fixed point and compute the intrinsic rewards to enhance exploration. A new practical off-policy maximum entropy reinforcement learning algorithm is finally introduced. Empirically, exploration policies have good state-action space coverage, and high-performing control policies are computed efficiently.
[LG-5] AI TrackMate: Finally Someone Who Will Give Your Music More Than Just “Sounds Great!” NEURIPS2024
链接: https://arxiv.org/abs/2412.06617
作者: Yi-Lin Jiang,Chia-Ho Hsiung,Yen-Tung Yeh,Lu-Rong Chen,Bo-Yu Chen
关键词-EN: democratized music creation, evaluate their work, objectively evaluate, music, Music Analysis Module
类目: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Accepted for the NeurIPS 2024 Creative AI Track
点击查看摘要
Abstract:The rise of “bedroom producers” has democratized music creation, while challenging producers to objectively evaluate their work. To address this, we present AI TrackMate, an LLM-based music chatbot designed to provide constructive feedback on music productions. By combining LLMs’ inherent musical knowledge with direct audio track analysis, AI TrackMate offers production-specific insights, distinguishing it from text-only approaches. Our framework integrates a Music Analysis Module, an LLM-Readable Music Report, and Music Production-Oriented Feedback Instruction, creating a plug-and-play, training-free system compatible with various LLMs and adaptable to future advancements. We demonstrate AI TrackMate’s capabilities through an interactive web interface and present findings from a pilot study with a music producer. By bridging AI capabilities with the needs of independent producers, AI TrackMate offers on-demand analytical feedback, potentially supporting the creative process and skill development in music production. This system addresses the growing demand for objective self-assessment tools in the evolving landscape of independent music production.
[LG-6] Vulnerability of Text-Matching in ML/AI Conference Reviewer Assignments to Collusions
链接: https://arxiv.org/abs/2412.06606
作者: Jhih-Yi(Janet)Hsieh,Aditi Raghunathan,Nihar B. Shah
关键词-EN: peer review process, artificial intelligence, automated methods, top-tier machine learning, peer review
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Digital Libraries (cs.DL)
*备注:
点击查看摘要
Abstract:In the peer review process of top-tier machine learning (ML) and artificial intelligence (AI) conferences, reviewers are assigned to papers through automated methods. These assignment algorithms consider two main factors: (1) reviewers’ expressed interests indicated by their bids for papers, and (2) reviewers’ domain expertise inferred from the similarity between the text of their previously published papers and the submitted manuscripts. A significant challenge these conferences face is the existence of collusion rings, where groups of researchers manipulate the assignment process to review each other’s papers, providing positive evaluations regardless of their actual quality. Most efforts to combat collusion rings have focused on preventing bid manipulation, under the assumption that the text similarity component is secure. In this paper, we demonstrate that even in the absence of bidding, colluding reviewers and authors can exploit the machine learning based text-matching component of reviewer assignment used at top ML/AI venues to get assigned their target paper. We also highlight specific vulnerabilities within this system and offer suggestions to enhance its robustness.
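As a rough stand-in for the text-matching component the abstract refers to, reviewer-paper affinity can be scored by TF-IDF cosine similarity between a reviewer's past papers and a submission. Production assignment systems typically use stronger text-embedding models, but the attack surface (steering similarity toward a target reviewer's texts) is analogous; the strings below are invented examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviewer_profiles = [
    "graph neural networks for molecular property prediction",
    "exploration and intrinsic rewards in deep reinforcement learning",
]
submission = ["count-based exploration bonuses for deep reinforcement learning agents"]

vec = TfidfVectorizer().fit(reviewer_profiles + submission)
affinity = cosine_similarity(vec.transform(submission), vec.transform(reviewer_profiles))
print(affinity)   # higher score -> that reviewer is more likely to be assigned
```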
[LG-7] VOPy: A Framework for Black-box Vector Optimization
链接: https://arxiv.org/abs/2412.06604
作者: Yaşar Cahit Yıldırım,Efe Mert Karagözlü,İlter Onat Korkmaz,Çağın Ararat,Cem Tekin
关键词-EN: open-source Python library, Python library designed, partial order induced, open-source Python, Python library
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We introduce VOPy, an open-source Python library designed to address black-box vector optimization, where multiple objectives must be optimized simultaneously with respect to a partial order induced by a convex cone. VOPy extends beyond traditional multi-objective optimization (MOO) tools by enabling flexible, cone-based ordering of solutions; with an application scope that includes environments with observation noise, discrete or continuous design spaces, limited budgets, and batch observations. VOPy provides a modular architecture, facilitating the integration of existing methods and the development of novel algorithms. We detail VOPy’s architecture, usage, and potential to advance research and application in the field of vector optimization. The source code for VOPy is available at this https URL.
[LG-8] Self-Interested Agents in Collaborative Learning: An Incentivized Adaptive Data-Centric Framework
链接: https://arxiv.org/abs/2412.06597
作者: Nithia Vijayan,Bryan Kian Hsiang Low
关键词-EN: adaptive data-centric collaborative, data-centric collaborative learning, adaptive data-centric, data-centric collaborative, collaborative learning
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a framework for adaptive data-centric collaborative learning among self-interested agents, coordinated by an arbiter. Designed to handle the incremental nature of real-world data, the framework operates in an online manner: at each step, the arbiter collects a batch of data from agents, trains a machine learning model, and provides each agent with a distinct model reflecting its data contributions. This setup establishes a feedback loop where shared data influence model updates, and the resulting models guide future data-sharing strategies. Agents evaluate and partition their data, selecting a partition to share using a stochastic parameterized policy optimized via policy gradient methods to optimize the utility of the received model as defined by agent-specific evaluation functions. On the arbiter side, the expected loss function over the true data distribution is optimized, incorporating agent-specific weights to account for distributional differences arising from diverse sources and selective sharing. A bilevel optimization algorithm jointly learns the model parameters and agent-specific weights. Mean-zero noise, computed using a distortion function that adjusts these agent-specific weights, is introduced to generate distinct agent-specific models, promoting valuable data sharing without requiring separate training. Our framework is underpinned by non-asymptotic analyses, ensuring convergence of the agent-side policy optimization to an approximate stationary point of the evaluation functions and convergence of the arbiter-side optimization to an approximate stationary point of the expected loss function.
[LG-9] CONDEN-FI: Consistency and Diversity Learning-based Multi-View Unsupervised Feature and Instance Co-Selection
链接: https://arxiv.org/abs/2412.06568
作者: Yanyong Huang,Yuxin Cai,Dongjie Wang,Xiuwen Yi,Tianrui Li
关键词-EN: reducing instance size, multi-view unsupervised feature, downstream tasks, multi-view unlabeled data, simultaneously identify
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The objective of multi-view unsupervised feature and instance co-selection is to simultaneously identify the most representative features and samples from multi-view unlabeled data, which aids in mitigating the curse of dimensionality and reducing instance size to improve the performance of downstream tasks. However, existing methods treat feature selection and instance selection as two separate processes, failing to leverage the potential interactions between the feature and instance spaces. Additionally, previous co-selection methods for multi-view data require concatenating different views, which overlooks the consistent information among them. In this paper, we propose a CONsistency and DivErsity learNing-based multi-view unsupervised Feature and Instance co-selection (CONDEN-FI) to address the above-mentioned issues. Specifically, CONDEN-FI reconstructs multi-view data from both the sample and feature spaces to learn representations that are consistent across views and specific to each view, enabling the simultaneous selection of the most important features and instances. Moreover, CONDEN-FI adaptively learns a view-consensus similarity graph to help select both dissimilar and similar samples in the reconstructed data space, leading to a more diverse selection of instances. An efficient algorithm is developed to solve the resultant optimization problem, and the comprehensive experimental results on real-world datasets demonstrate that CONDEN-FI is effective compared to state-of-the-art methods.
[LG-10] DEX: Data Channel Extension for Efficient CNN Inference on Tiny AI Accelerators NEURIPS2024
链接: https://arxiv.org/abs/2412.06566
作者: Taesik Gong,Fahim Kawsar,Chulhong Min
关键词-EN: Tiny machine learning, machine learning, aims to run, enhanced privacy, low cost
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024
点击查看摘要
Abstract:Tiny machine learning (TinyML) aims to run ML models on small devices and is increasingly favored for its enhanced privacy, reduced latency, and low cost. Recently, the advent of tiny AI accelerators has revolutionized the TinyML field by significantly enhancing hardware processing power. These accelerators, equipped with multiple parallel processors and dedicated per-processor memory instances, offer substantial performance improvements over traditional microcontroller units (MCUs). However, their limited data memory often necessitates downsampling input images, resulting in accuracy degradation. To address this challenge, we propose Data channel EXtension (DEX), a novel approach for efficient CNN execution on tiny AI accelerators. DEX incorporates additional spatial information from original images into input images through patch-wise even sampling and channel-wise stacking, effectively extending data across input channels. By leveraging underutilized processors and data memory for channel extension, DEX facilitates parallel execution without increasing inference latency. Our evaluation with four models and four datasets on tiny AI accelerators demonstrates that this simple idea improves accuracy on average by 3.5%p while keeping the inference latency the same on the AI accelerator. The source code is available at this https URL.
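One way to read "patch-wise even sampling and channel-wise stacking" is a space-to-depth style rearrangement: sample the full-resolution image at every offset of a stride grid and stack the resulting planes as extra input channels, so no pixels are discarded by downsampling. The sketch below follows that reading and is not the authors' implementation.

```python
import numpy as np

def data_channel_extension(img, factor=2):
    """Rearrange an (H, W, C) image into (H/factor, W/factor, C*factor^2) by evenly
    sampling each offset of a factor x factor grid and stacking along channels."""
    H, W, C = img.shape
    assert H % factor == 0 and W % factor == 0
    planes = [img[dy::factor, dx::factor, :] for dy in range(factor) for dx in range(factor)]
    return np.concatenate(planes, axis=-1)

img = np.random.rand(64, 64, 3)
ext = data_channel_extension(img, factor=2)   # (32, 32, 12): same pixels, more channels
```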
[LG-11] When Dimensionality Reduction Meets Graph (Drawing) Theory: Introducing a Common Framework Challenges and Opportunities
链接: https://arxiv.org/abs/2412.06555
作者: Fernando Paulovich,Alessio Arleo,Stef van den Elzen
关键词-EN: Dimensionality Reduction, data analytics setups, analytics setups, visual data analytics, vast landscape
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In the vast landscape of visualization research, Dimensionality Reduction (DR) and graph analysis are two popular subfields, often essential to most visual data analytics setups. DR aims to create representations to support neighborhood and similarity analysis on complex, large datasets. Graph analysis focuses on identifying the salient topological properties and key actors within networked data, with specialized research on investigating how such features could be presented to the user to ease the comprehension of the underlying structure. Although these two disciplines are typically regarded as disjoint subfields, we argue that both fields share strong similarities and synergies that can potentially benefit both. Therefore, this paper discusses and introduces a unifying framework to help bridge the gap between DR and graph (drawing) theory. Our goal is to use the strongly math-grounded graph theory to improve the overall process of creating DR visual representations. We propose how to break the DR process into well-defined stages, discussing how to match some of the DR state-of-the-art techniques to this framework and presenting ideas on how graph drawing, topology features, and some popular algorithms and strategies used in graph analysis can be employed to improve DR topology extraction, embedding generation, and result validation. We also discuss the challenges and identify opportunities for implementing and using our framework, opening directions for future visualization research.
[LG-12] On How Iterative Magnitude Pruning Discovers Local Receptive Fields in Fully Connected Neural Networks
链接: https://arxiv.org/abs/2412.06545
作者: William T. Redman,Zhangyang Wang,Alessandro Ingrosso,Sebastian Goldt
关键词-EN: Lottery Ticket Hypothesis, Lottery Ticket, iterative magnitude pruning, extracting sparse subnetworks, Ticket Hypothesis
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, comments welcome!
点击查看摘要
Abstract:Since its use in the Lottery Ticket Hypothesis, iterative magnitude pruning (IMP) has become a popular method for extracting sparse subnetworks that can be trained to high performance. Despite this, the underlying nature of IMP’s general success remains unclear. One possibility is that IMP is especially capable of extracting and maintaining strong inductive biases. In support of this, recent work has shown that applying IMP to fully connected neural networks (FCNs) leads to the emergence of local receptive fields (RFs), an architectural feature present in mammalian visual cortex and convolutional neural networks. The question of how IMP is able to do this remains unanswered. Inspired by results showing that training FCNs on synthetic images with highly non-Gaussian statistics (e.g., sharp edges) is sufficient to drive the formation of local RFs, we hypothesize that IMP iteratively maximizes the non-Gaussian statistics present in the representations of FCNs, creating a feedback loop that enhances localization. We develop a new method for measuring the effect of individual weights on the statistics of the FCN representations (“cavity method”), which allows us to find evidence in support of this hypothesis. Our work, which is the first to study the effect IMP has on the representations of neural networks, sheds parsimonious light on one way in which IMP can drive the formation of strong inductive biases.
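For readers unfamiliar with the procedure, iterative magnitude pruning alternates training, pruning the smallest-magnitude surviving weights, and rewinding the remainder to their initial values. A compact PyTorch sketch is given below; it is illustrative only, and the paper's experiments and its "cavity method" analysis are separate from this generic loop.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    """Train, prune the smallest prune_frac of surviving weights per tensor,
    rewind surviving weights to initialization, repeat (Lottery Ticket style)."""
    init_state = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train_fn(model, masks)                    # training should keep masked weights at zero
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            surviving = p.detach().abs()[masks[name].bool()]
            thresh = torch.quantile(surviving, prune_frac)
            masks[name] = (masks[name].bool() & (p.detach().abs() > thresh)).float()
        model.load_state_dict(init_state)         # weight rewinding
    return masks
```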
[LG-13] A cautionary tale on the cost-effectiveness of collaborative AI in real-world medical applications
链接: https://arxiv.org/abs/2412.06494
作者: Francesco Cremonesi,Lucia Innocenti,Sebastien Ourselin,Vicky Goh,Michela Antonelli,Marco Lorenzi
关键词-EN: Background, CBL, collaborative, learning, CBL methods
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Background. Federated learning (FL) has gained wide popularity as a collaborative learning paradigm enabling collaborative AI in sensitive healthcare applications. Nevertheless, the practical implementation of FL presents technical and organizational challenges, as it generally requires complex communication infrastructures. In this context, consensus-based learning (CBL) may represent a promising collaborative learning alternative, thanks to the ability of combining local knowledge into a federated decision system, while potentially reducing deployment overhead. Methods. In this work we propose an extensive benchmark of the accuracy and cost-effectiveness of a panel of FL and CBL methods in a wide range of collaborative medical data analysis scenarios. The benchmark includes 7 different medical datasets, encompassing 3 machine learning tasks, 8 different data modalities, and multi-centric settings involving 3 to 23 clients. Findings. Our results reveal that CBL is a cost-effective alternative to FL. When compared across the panel of medical datasets in the considered benchmark, CBL methods provide equivalent accuracy to the one achieved by FL. Moreover, CBL significantly reduces training time and communication cost (resp. 15-fold and 60-fold decrease) (p < 0.05). Interpretation. This study opens a novel perspective on the deployment of collaborative AI in real-world applications, whereas the adoption of cost-effective methods is instrumental to achieve sustainability and democratisation of AI by alleviating the need for extensive computational resources.
[LG-14] Food for thought: How can machine learning help better predict and understand changes in food prices?
链接: https://arxiv.org/abs/2412.06472
作者: Kristina L. Kupferschmidt,James Requiema,Mya Simpson,Zohrah Varsallay,Ethan Jackson,Cody Kupferschmidt,Sara El-Shawa,Graham W. Taylor
关键词-EN: University of Guelph, Vector Institute, Vector Institute forecasting, Canada Food Price, University
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work, we address a lack of systematic understanding of fluctuations in food affordability in Canada. Canada’s Food Price Report (CPFR) is an annual publication that predicts food inflation over the next calendar year. The published predictions are a collaborative effort between forecasting teams that each employ their own approach at Canadian Universities: Dalhousie University, the University of British Columbia, the University of Saskatchewan, and the University of Guelph/Vector Institute. While the University of Guelph/Vector Institute forecasting team has leveraged machine learning (ML) in previous reports, the most recent editions (2024–2025) have also included a human-in-the-loop approach. For the 2025 report, this focus was expanded to evaluate several different data-centric approaches to improve forecast accuracy. In this study, we evaluate how different types of forecasting models perform when estimating food price fluctuations. We also examine the sensitivity of models that curate time series data representing key factors in food pricing.
[LG-15] Can foundation models actively gather information in interactive environments to test hypotheses?
链接: https://arxiv.org/abs/2412.06438
作者: Nan Rosemary Ke,Danny P. Sawyer,Hubert Soyer,Martin Engelcke,David P Reichert,Drew A. Hudson,John Reid,Alexander Lerchner,Danilo Jimenez Rezende,Timothy P Lillicrap,Michael Mozer,Jane X Wang
关键词-EN: problem solving, standard evaluation task, actively and strategically, test hypotheses, closely investigated
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:While problem solving is a standard evaluation task for foundation models, a crucial component of problem solving – actively and strategically gathering information to test hypotheses – has not been closely investigated. To assess the information gathering abilities of foundation models in interactive environments, we introduce a framework in which a model must determine the factors influencing a hidden reward function by iteratively reasoning about its previously gathered information and proposing its next exploratory action to maximize information gain at each step. We implement this framework in both a text-based environment, which offers a tightly controlled setting and enables high-throughput parameter sweeps, and in an embodied 3D environment, which requires addressing complexities of multi-modal interaction more relevant to real-world applications. We further investigate whether approaches such as self-correction and increased inference time improve information gathering efficiency. In a relatively simple task that requires identifying a single rewarding feature, we find that LLM’s information gathering capability is close to optimal. However, when the model must identify a conjunction of rewarding features, performance is suboptimal. The hit in performance is due partly to the model translating task description to a policy and partly to the model’s effectiveness in using its in-context memory. Performance is comparable in both text and 3D embodied environments, although imperfect visual object recognition reduces its accuracy in drawing conclusions from gathered information in the 3D embodied case. For single-feature-based rewards, we find that smaller models curiously perform better; for conjunction-based rewards, incorporating self correction into the model improves performance.
[LG-16] Federated Split Learning with Model Pruning and Gradient Quantization in Wireless Networks
链接: https://arxiv.org/abs/2412.06414
作者: Junhe Zhang,Wanli Ni,Dongyu Wang
关键词-EN: complete model locally, distributed machine learning, learning typically requires, federated learning typically, edge devices
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:As a paradigm of distributed machine learning, federated learning typically requires all edge devices to train a complete model locally. However, with the increasing scale of artificial intelligence models, the limited resources on edge devices often become a bottleneck for efficient fine-tuning. To address this challenge, federated split learning (FedSL) implements collaborative training across the edge devices and the server through model splitting. In this paper, we propose a lightweight FedSL scheme that further alleviates the training burden on resource-constrained edge devices by pruning the client-side model dynamically and using quantized gradient updates to reduce computation overhead. Additionally, we apply random dropout to the activation values at the split layer to reduce communication overhead. We conduct theoretical analysis to quantify the convergence performance of the proposed scheme. Finally, simulation results verify the effectiveness and advantages of the proposed lightweight FedSL in wireless network environments.
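The quantized gradient updates mentioned here can be illustrated with a generic QSGD-style stochastic quantizer. This is a sketch of the idea only; the paper's exact quantization scheme, pruning schedule, and split-layer dropout are not reproduced.

```python
import torch

def quantize_stochastic(grad, num_levels=16):
    """Uniform stochastic quantization: represent each entry with one of
    num_levels magnitudes, rounding up or down at random so the result is unbiased."""
    g_max = grad.abs().max()
    if g_max == 0:
        return grad.clone()
    scaled = grad.abs() / g_max * (num_levels - 1)
    lower = scaled.floor()
    quantized = lower + torch.bernoulli(scaled - lower)   # round up with prob = fractional part
    return torch.sign(grad) * quantized / (num_levels - 1) * g_max
```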
[LG-17] Exploring the Impact of Synthetic Data on Human Gesture Recognition Tasks Using GANs
链接: https://arxiv.org/abs/2412.06389
作者: George Kontogiannis,Pantelis Tzamalis,Sotiris Nikoletseas
关键词-EN: employing Deep Generative, Deep Generative Models, Human Activity Recognition, Generative Adversarial Networks, Deep Generative
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 figures, 20th International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), 2024
点击查看摘要
Abstract:In the evolving domain of Human Activity Recognition (HAR) using Internet of Things (IoT) devices, there is an emerging interest in employing Deep Generative Models (DGMs) to address data scarcity, enhance data quality, and improve classification metrics scores. Among these types of models, Generative Adversarial Networks (GANs) have arisen as a powerful tool for generating synthetic data that mimic real-world scenarios with high fidelity. However, Human Gesture Recognition (HGR), a subset of HAR, particularly in healthcare applications, using time series data such as allergic gestures, remains highly unexplored. In this paper, we examine and evaluate the performance of two GANs in the generation of synthetic gesture motion data that compose a part of an open-source benchmark dataset. The data is related to the disease identification domain and healthcare, specifically to allergic rhinitis. We also focus on these AI models’ performance in terms of fidelity, diversity, and privacy. Furthermore, we examine the scenario if the synthetic data can substitute real data, in training scenarios and how well models trained on synthetic data can be generalized for the allergic rhinitis gestures. In our work, these gestures are related to 6-axes accelerometer and gyroscope data, serving as multi-variate time series instances, and retrieved from smart wearable devices. To the best of our knowledge, this study is the first to explore the feasibility of synthesizing motion gestures for allergic rhinitis from wearable IoT device data using Generative Adversarial Networks (GANs) and testing their impact on the generalization of gesture recognition systems. It is worth noting that, even if our method has been applied to a specific category of gestures, it is designed to be generalized and can be deployed also to other motion data in the HGR domain.
[LG-18] PyPulse: A Python Library for Biosignal Imputation
链接: https://arxiv.org/abs/2412.06382
作者: Kevin Gao,Maxwell A. Xu,James M. Rehg,Alexander Moreno
关键词-EN: Python package, wearable sensor settings, clinical and wearable, wearable sensor, Python
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 7 pages, 3 figures. Implementation and documentation are available at this https URL
点击查看摘要
Abstract:We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings. Missingness is commonplace in these settings and can arise from multiple causes, such as insecure sensor attachment or data transmission loss. PyPulse’s framework provides a modular and extendable framework with high ease-of-use for a broad userbase, including non-machine-learning bioresearchers. Specifically, its new capabilities include using pre-trained imputation methods out-of-the-box on custom datasets, running the full workflow of training or testing a baseline method with a single line of code, and comparing baseline methods in an interactive visualization tool. We released PyPulse under the MIT License on Github and PyPI. The source code can be found at: this https URL.
[LG-19] Gentle robustness implies Generalization
链接: https://arxiv.org/abs/2412.06381
作者: Khoat Than,Dat Phan,Giang Vu
关键词-EN: application domains, utmost importance, machine learning models, Robustness and generalization, Robustness
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Robustness and generalization ability of machine learning models are of utmost importance in various application domains. There is a wide interest in efficient ways to analyze those properties. One important direction is to analyze the connection between these two properties. Prior theories suggest that a robust learning algorithm can produce trained models with a high generalization ability. However, we show in this work that the existing error bounds are vacuous for the Bayes optimal classifier, which is the best among all measurable classifiers for a classification problem with overlapping classes. Those bounds cannot converge to the true error of this ideal classifier. This is undesirable, surprising, and was not known before. We then present a class of novel bounds, which are model-dependent and provably tighter than the existing robustness-based ones. Unlike prior ones, our bounds are guaranteed to converge to the true error of the best classifier, as the number of samples increases. We further provide extensive experiments and find that two of our bounds are often non-vacuous for a large class of deep neural networks pretrained on ImageNet.
[LG-20] Low-Rank Matrix Factorizations with Volume-based Constraints and Regularizations
链接: https://arxiv.org/abs/2412.06380
作者: Olivier Vu Thanh
关键词-EN: Low-rank matrix factorizations, linear models widely, class of linear, matrix factorizations, matrix
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Low-rank matrix factorizations are a class of linear models widely used in various fields such as machine learning, signal processing, and data analysis. These models approximate a matrix as the product of two smaller matrices, where the left matrix captures latent features while the right matrix linearly decomposes the data based on these features. There are many ways to define what makes a component “important.” Standard LRMFs, such as the truncated singular value decomposition, focus on minimizing the distance between the original matrix and its low-rank approximation. In this thesis, the notion of “importance” is closely linked to interpretability and uniqueness, which are key to obtaining reliable and meaningful results. This thesis thus focuses on volume-based constraints and regularizations designed to enhance interpretability and uniqueness. We first introduce two new volume-constrained LRMFs designed to enhance these properties. The first assumes that data points are naturally bounded (e.g., movie ratings between 1 and 5 stars) and can be explained by convex combinations of features within the same bounds, allowing them to be interpreted in the same way as the data. The second model is more general, constraining the factors to belong to convex polytopes. Then, two variants of volume-regularized LRMFs are proposed. The first minimizes the volume of the latent features, encouraging them to cluster closely together, while the second maximizes the volume of the decompositions, promoting sparse representations. Across all these models, uniqueness is achieved under the core principle that the factors must be “sufficiently scattered” within their respective feasible sets. Motivated by applications such as blind source separation and missing data imputation, this thesis also proposes efficient algorithms that make these models practical for real-world applications. Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP) Cite as: arXiv:2412.06380 [cs.LG] (or arXiv:2412.06380v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2412.06380
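The volume-regularization idea described above can be made concrete with a toy example. The sketch below is a hedged illustration in the spirit of minimum-volume factorization, not the thesis's actual models or algorithms: it fits a nonnegative factorization X ≈ WH while penalizing log det(W^T W + delta*I) so the latent features in W span a small volume. The function name, step size, and plain gradient-descent solver are all illustrative assumptions.

```python
# Minimal sketch of a volume-regularized low-rank factorization X ~ W @ H:
# penalize log det(W^T W + delta*I) so the latent features W occupy a small
# volume. Illustrative only, not the thesis's exact models or solvers.
import numpy as np

def volume_regularized_factorization(X, rank, lam=0.1, delta=1e-3,
                                     lr=1e-3, n_iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iters):
        R = W @ H - X                              # residual
        # gradients of ||X - WH||_F^2 + lam * log det(W^T W + delta*I)
        grad_W = 2 * R @ H.T + 2 * lam * W @ np.linalg.inv(W.T @ W + delta * np.eye(rank))
        grad_H = 2 * W.T @ R
        W = np.clip(W - lr * grad_W, 0.0, None)    # keep factors nonnegative
        H = np.clip(H - lr * grad_H, 0.0, None)
    return W, H

X = np.abs(np.random.default_rng(1).normal(size=(50, 30)))
W, H = volume_regularized_factorization(X, rank=5)
print("relative reconstruction error:", np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```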
[LG-21] GraphNeuralNetworks.jl: Deep Learning on Graphs with Julia
链接: https://arxiv.org/abs/2412.06354
作者: Carlo Lucibello,Aurora Rossi
关键词-EN: Julia programming language, Julia programming, programming language, multiple GPU backends, supports multiple GPU
类目: Machine Learning (cs.LG)
*备注: Submitted to JMLR OSS
点击查看摘要
Abstract:GraphNeuralNetworks.jl is an open-source framework for deep learning on graphs, written in the Julia programming language. It supports multiple GPU backends, generic sparse or dense graph representations, and offers convenient interfaces for manipulating standard, heterogeneous, and temporal graphs with attributes at the node, edge, and graph levels. The framework allows users to define custom graph convolutional layers using gather/scatter message-passing primitives or optimized fused operations. It also includes several popular layers, enabling efficient experimentation with complex deep architectures. The package is available on GitHub: this https URL.
[LG-22] Tracking control of latent dynamic systems with application to spacecraft attitude control
链接: https://arxiv.org/abs/2412.06342
作者: Congxi Zhang,Yongchun Xie
关键词-EN: space robots perform, robots perform tasks, latent dynamic systems, dynamic systems, latent dynamic
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:When intelligent spacecraft or space robots perform tasks in a complex environment, the controllable variables are usually not directly available and have to be inferred from high-dimensional observable variables, such as outputs of neural networks or images. While the dynamics of these observations are highly complex, the mechanisms behind them may be simple, which makes it possible to regard them as latent dynamic systems. For control of latent dynamic systems, methods based on reinforcement learning suffer from sample inefficiency and generalization problems. In this work, we propose an asymptotic tracking controller for latent dynamic systems. The latent variables are related to the high-dimensional observations through an unknown nonlinear function. The dynamics are unknown but assumed to be affine nonlinear. To realize asymptotic tracking, an identifiable latent dynamic model is learned to recover the latents and estimate the dynamics. This training process does not depend on the goals or reference trajectories. Based on the learned model, we use a manually designed feedback linearization controller to ensure the asymptotic tracking property of the closed-loop system. After considering fully controllable systems, the results are extended to the case that uncontrollable environmental latents exist. As an application, simulation experiments on a latent spacecraft attitude dynamic model are conducted to verify the proposed methods, and the observation noise and control deviation are taken into consideration.
[LG-23] Table2Image: Interpretable Tabular data Classification with Realistic Image Transformations
链接: https://arxiv.org/abs/2412.06265
作者: Seungeun Lee,Seungsang Oh
关键词-EN: models remain limited, Recent advancements, demonstrated promising performance, interpretable models remain, remain limited
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advancements in deep learning for tabular data have demonstrated promising performance, yet interpretable models remain limited, with many relying on complex and large-scale architectures. This paper introduces Table2Image, an interpretable framework that transforms tabular data into realistic image representations for classification, achieving competitive performance with relatively lightweight models. Additionally, we propose variance inflation factor (VIF) initialization, which reflects the statistical properties of the data, and a novel interpretability framework that integrates insights from both the original tabular data and its image transformations. By leveraging Shapley additive explanations (SHAP) with methods to minimize distributional discrepancies, our approach combines tabular and image-based representations. Experiments on benchmark datasets showcase competitive classification accuracy, area under the curve (AUC), and improved interpretability, offering a scalable and reliable solution. Our code is available at this https URL.
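For readers unfamiliar with the variance inflation factor mentioned above, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The snippet below is a minimal numpy sketch of that statistic only; how Table2Image turns VIF values into a network initialization is specific to the paper and is not reproduced here.

```python
# Variance inflation factor per feature: VIF_j = 1 / (1 - R_j^2), where R_j^2
# comes from regressing feature j on the other features (with intercept).
# Shown only to illustrate the statistic the abstract refers to.
import numpy as np

def variance_inflation_factors(X):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    vifs = np.empty(d)
    for j in range(d):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # design matrix with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / max(1.0 - r2, 1e-12)        # guard against R^2 ~ 1
    return vifs

X = np.random.default_rng(0).normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.1 * np.random.default_rng(1).normal(size=200)  # collinear column
print(variance_inflation_factors(X).round(2))
```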
[LG-24] Flow Matching Guide and Code
链接: https://arxiv.org/abs/2412.06264
作者: Yaron Lipman,Marton Havasi,Peter Holderrieth,Neta Shaul,Matt Le,Brian Karrer,Ricky T. Q. Chen,David Lopez-Paz,Heli Ben-Hamu,Itai Gat
关键词-EN: Flow Matching, biological structures, recent framework, framework for generative, generative modeling
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Flow Matching (FM) is a recent framework for generative modeling that has achieved state-of-the-art performance across various domains, including image, video, audio, speech, and biological structures. This guide offers a comprehensive and self-contained review of FM, covering its mathematical foundations, design choices, and extensions. By also providing a PyTorch package featuring relevant examples (e.g., image and text generation), this work aims to serve as a resource for both novice and experienced researchers interested in understanding, applying and further developing FM.
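The core FM objective the guide covers can be sketched in a few lines. The following PyTorch snippet is a hedged illustration on toy data, not the paper's released package: it implements the linear-path conditional flow matching loss by sampling t uniformly, interpolating x_t = (1 - t) * x0 + t * x1, and regressing a velocity network onto x1 - x0.

```python
# Minimal conditional flow matching training loop (linear probability path):
# sample t ~ U(0,1), form x_t = (1 - t) * x0 + t * x1, and regress the network
# v_theta(x_t, t) onto the target velocity x1 - x0. Toy data; illustrative only.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

dim = 2
model = VelocityField(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(256, dim) * 0.5 + 2.0   # samples from a toy "data" distribution
    x0 = torch.randn(256, dim)               # samples from the noise prior
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1               # point on the straight path
    target = x1 - x0                         # conditional velocity along that path
    loss = ((model(xt, t) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```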
[LG-25] A Scalable Decentralized Reinforcement Learning Framework for UAV Target Localization Using Recurrent PPO
链接: https://arxiv.org/abs/2412.06231
作者: Leon Fernando,Billy Pik Lik Lau,Chau Yuen,U-Xuan Tan
关键词-EN: unmanned aerial vehicles, unlocked numerous applications, including environmental monitoring, disaster response, aerial vehicles
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to TENCON 2024
点击查看摘要
Abstract:The rapid advancements in unmanned aerial vehicles (UAVs) have unlocked numerous applications, including environmental monitoring, disaster response, and agricultural surveying. Enhancing the collective behavior of multiple decentralized UAVs can significantly improve these applications through more efficient and coordinated operations. In this study, we explore a Recurrent PPO model for target localization in perceptually degraded environments like places without GNSS/GPS signals. We first developed a single-drone approach for target identification, followed by a decentralized two-drone model. Our approach can utilize two types of sensors on the UAVs, a detection sensor and a target signal sensor. The single-drone model achieved an accuracy of 93%, while the two-drone model achieved an accuracy of 86%, with the latter requiring fewer average steps to locate the target. This demonstrates the potential of our method in UAV swarms, offering efficient and effective localization of radiant targets in complex environmental conditions.
[LG-26] H-FedSN: Personalized Sparse Networks for Efficient and Accurate Hierarchical Federated Learning for IoT Applications
链接: https://arxiv.org/abs/2412.06210
作者: Jiechao Gao,Yuangang Li,Yue Zhao,Brad Campbell
关键词-EN: Internet of Things, proliferation of Internet, distributed data utilization, Hierarchical Federated Learning, privacy-preserving distributed data
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The proliferation of Internet of Things (IoT) has increased interest in federated learning (FL) for privacy-preserving distributed data utilization. However, traditional two-tier FL architectures inadequately adapt to multi-tier IoT environments. While Hierarchical Federated Learning (HFL) improves practicality in multi-tier IoT environments by multi-layer aggregation, it still faces challenges in communication efficiency and accuracy due to high data transfer volumes, data heterogeneity, and imbalanced device distribution, struggling to meet the low-latency and high-accuracy model training requirements of practical IoT scenarios. To overcome these limitations, we propose H-FedSN, an innovative approach for practical IoT environments. H-FedSN introduces a binary mask mechanism with shared and personalized layers to reduce communication overhead by creating a sparse network while keeping original weights frozen. To address data heterogeneity and imbalanced device distribution, we integrate personalized layers for local data adaptation and apply Bayesian aggregation with cumulative Beta distribution updates at edge and cloud levels, effectively balancing contributions from diverse client groups. Evaluations on three real-world IoT datasets and MNIST under non-IID settings demonstrate that H-FedSN significantly reduces communication costs by 58 to 238 times compared to HierFAVG while achieving high accuracy, making it highly effective for practical IoT applications in hierarchical federated learning scenarios.
[LG-27] Applying Machine Learning Tools for Urban Resilience Against Floods
链接: https://arxiv.org/abs/2412.06205
作者: Mahla Ardebili Pour,Mohammad B. Ghiasi,Ali Karkehabadi
关键词-EN: destructive natural disasters, resilience, population density, prevalent and destructive, leading to severe
类目: Machine Learning (cs.LG)
*备注: IEEE Fifth International Conference on Advances in Electrical, Computing, Communications and Sustainable Technologies
点击查看摘要
Abstract:Floods are among the most prevalent and destructive natural disasters, often leading to severe social and economic impacts in urban areas due to the high concentration of assets and population density. In Iran, particularly in Tehran, recurring flood events underscore the urgent need for robust urban resilience strategies. This paper explores flood resilience models to identify the most effective approach for District 6 in Tehran. Through an extensive literature review, various resilience models were analyzed, with the Climate Disaster Resilience Index (CDRI) emerging as the most suitable model for this district due to its comprehensive resilience dimensions: Physical, Social, Economic, Organizational, and Natural Health resilience. Although the CDRI model provides a structured approach to resilience measurement, it remains a static model focused on spatial characteristics and lacks temporal adaptability. This study enhances the CDRI model by integrating data from 2013 to 2022 in three-year intervals and applying machine learning techniques to predict resilience dimensions for 2025. This integration enables a dynamic resilience model that can accommodate temporal changes, providing a more adaptable and data-driven foundation for urban flood resilience planning. By employing artificial intelligence to reflect evolving urban conditions, this model offers valuable insights for policymakers and urban planners to enhance flood resilience in Tehran's critical District 6.
[LG-28] Applications of Positive Unlabeled (PU) and Negative Unlabeled (NU) Learning in Cybersecurity
链接: https://arxiv.org/abs/2412.06203
作者: Robert Dilworth,Charan Gudla
关键词-EN: Positive Unlabeled, Negative Unlabeled, application of Positive, Unlabeled, underexplored application
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper explores the relatively underexplored application of Positive Unlabeled (PU) Learning and Negative Unlabeled (NU) Learning in the cybersecurity domain. While these semi-supervised learning methods have been applied successfully in fields like medicine and marketing, their potential in cybersecurity remains largely untapped. The paper identifies key areas of cybersecurity–such as intrusion detection, vulnerability management, malware detection, and threat intelligence–where PU/NU learning can offer significant improvements, particularly in scenarios with imbalanced or limited labeled data. We provide a detailed problem formulation for each subfield, supported by mathematical reasoning, and highlight the specific challenges and research gaps in scaling these methods to real-time systems, addressing class imbalance, and adapting to evolving threats. Finally, we propose future directions to advance the integration of PU/NU learning in cybersecurity, offering solutions that can better detect, manage, and mitigate emerging cyber threats.
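As a concrete reference point for the PU setting discussed above, the sketch below implements the classic Elkan-Noto recipe with scikit-learn on synthetic data: train a classifier to separate labeled positives from unlabeled examples, estimate the labeling propensity c on held-out positives, and rescale scores into estimates of P(y=1|x). It illustrates PU learning generically; it is not the paper's cybersecurity-specific formulations, and all data and names are illustrative.

```python
# Classic Elkan-Noto style PU learning sketch: s(x) estimates P(labeled | x);
# dividing by c = E[s(x) | x positive], estimated on held-out labeled positives,
# recovers an estimate of P(y = 1 | x). Synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=2.0, size=(300, 5))           # true positives
X_neg = rng.normal(loc=-2.0, size=(700, 5))          # true negatives (never labeled)
labeled_pos, unlabeled_pos = train_test_split(X_pos, test_size=0.5, random_state=0)
X_unlabeled = np.vstack([unlabeled_pos, X_neg])      # unlabeled mix of pos and neg

# Step 1: classify "labeled" vs "unlabeled", holding out some positives for step 2.
train_pos, holdout_pos = train_test_split(labeled_pos, test_size=0.3, random_state=1)
X = np.vstack([train_pos, X_unlabeled])
s = np.concatenate([np.ones(len(train_pos)), np.zeros(len(X_unlabeled))])
clf = LogisticRegression(max_iter=1000).fit(X, s)

# Step 2: estimate c on held-out positives, then rescale scores for unlabeled data.
c = clf.predict_proba(holdout_pos)[:, 1].mean()
p_y1 = np.clip(clf.predict_proba(X_unlabeled)[:, 1] / c, 0.0, 1.0)
print("estimated positive fraction among unlabeled:", round(float((p_y1 > 0.5).mean()), 3))
```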
[LG-29] Revisiting the Necessity of Graph Learning and Common Graph Benchmarks
链接: https://arxiv.org/abs/2412.06173
作者: Isay Katsman,Ethan Lou,Anna Gilbert
关键词-EN: Graph, graph learning, Graph machine learning, node features, enjoyed a meteoric
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint
点击查看摘要
Abstract:Graph machine learning has enjoyed a meteoric rise in popularity since the introduction of deep learning in graph contexts. This is no surprise due to the ubiquity of graph data in large scale industrial settings. Tacitly assumed in all graph learning tasks is the separation of the graph structure and node features: node features strictly encode individual data while the graph structure consists only of pairwise interactions. The driving belief is that node features are (by themselves) insufficient for these tasks, so benchmark performance accurately reflects improvements in graph learning. In our paper, we challenge this orthodoxy by showing that, surprisingly, node features are oftentimes more-than-sufficient for many common graph benchmarks, breaking this critical assumption. When comparing against a well-tuned feature-only MLP baseline on seven of the most commonly used graph learning datasets, one gains little benefit from using graph structure on five datasets. We posit that these datasets do not benefit considerably from graph learning because the features themselves already contain enough graph information to obviate or substantially reduce the need for the graph. To illustrate this point, we perform a feature study on these datasets and show how the features are responsible for closing the gap between MLP and graph-method performance. Further, in service of introducing better empirical measures of progress for graph neural networks, we present a challenging parametric family of principled synthetic datasets that necessitate graph information for nontrivial performance. Lastly, we section out a subset of real-world datasets that are not trivially solved by an MLP and hence serve as reasonable benchmarks for graph neural networks.
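The feature-only baseline referred to above amounts to discarding the graph entirely and classifying nodes from their feature vectors alone. The snippet below is a minimal scikit-learn sketch of such a baseline; the random feature matrix and labels are stand-ins for a real node-classification benchmark's features, labels, and split.

```python
# Feature-only baseline: ignore the graph and classify nodes from their feature
# vectors with an MLP. The synthetic features/labels below are stand-ins for a
# real node-classification benchmark.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 64))                   # node feature matrix (stand-in)
labels = (features[:, :8].sum(axis=1) > 0).astype(int)   # labels recoverable from features alone

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("feature-only MLP accuracy:", round(mlp.score(X_test, y_test), 3))
```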
[LG-30] Out-of-Distribution Detection with Overlap Index
链接: https://arxiv.org/abs/2412.06168
作者: Hao Fu,Prashanth Krishnamurthy,Siddharth Garg,Farshad Khorrami
关键词-EN: OOD detectors, OOD, machine learning models, open world, confidence score function
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Out-of-distribution (OOD) detection is crucial for the deployment of machine learning models in the open world. While existing OOD detectors are effective in identifying OOD samples that deviate significantly from in-distribution (ID) data, they often come with trade-offs. For instance, deep OOD detectors usually suffer from high computational costs, require tuning hyperparameters, and have limited interpretability, whereas traditional OOD detectors may have a low accuracy on large high-dimensional datasets. To address these limitations, we propose a novel effective OOD detection approach that employs an overlap index (OI)-based confidence score function to evaluate the likelihood of a given input belonging to the same distribution as the available ID samples. The proposed OI-based confidence score function is non-parametric, lightweight, and easy to interpret, hence providing strong flexibility and generality. Extensive empirical evaluations indicate that our OI-based OOD detector is competitive with state-of-the-art OOD detectors in terms of detection accuracy on a wide range of datasets while requiring less computation and memory costs. Lastly, we show that the proposed OI-based confidence score function inherits nice properties from OI (e.g., insensitivity to small distributional variations and robustness against Huber \epsilon -contamination) and is a versatile tool for estimating OI and model accuracy in specific contexts.
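The overlap index between two densities p and q is the integral of min(p(x), q(x)). The snippet below gives a simple histogram estimate of that quantity on one-dimensional samples, purely to illustrate the quantity the OI-based score builds on; the paper's actual confidence score function and its treatment of high-dimensional data are not reproduced here.

```python
# Histogram estimate of the overlap index OI(p, q) = integral of min(p(x), q(x)) dx
# between two one-dimensional samples, using shared bins. Generic illustration of
# the quantity behind an OI-based confidence score, not the paper's scoring rule.
import numpy as np

def overlap_index(sample_p, sample_q, n_bins=50):
    lo = min(sample_p.min(), sample_q.min())
    hi = max(sample_p.max(), sample_q.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p_hist, _ = np.histogram(sample_p, bins=bins, density=True)
    q_hist, _ = np.histogram(sample_q, bins=bins, density=True)
    widths = np.diff(bins)
    return float(np.sum(np.minimum(p_hist, q_hist) * widths))

rng = np.random.default_rng(0)
in_dist = rng.normal(0.0, 1.0, size=5000)
near_ood = rng.normal(1.0, 1.0, size=5000)   # partially overlapping with ID data
far_ood = rng.normal(6.0, 1.0, size=5000)    # barely overlapping with ID data
print("OI vs near-OOD:", overlap_index(in_dist, near_ood))
print("OI vs far-OOD:", overlap_index(in_dist, far_ood))
```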
[LG-31] MVD: A Multi-Lingual Software Vulnerability Detection Framework
链接: https://arxiv.org/abs/2412.06166
作者: Boyu Zhang,Triet H. M. Le,M. Ali Babar
关键词-EN: threaten business operations, increasingly threaten business, business operations, catastrophic cyberattacks, cyberattacks that increasingly
类目: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Software vulnerabilities can result in catastrophic cyberattacks that increasingly threaten business operations. Consequently, ensuring the safety of software systems has become a paramount concern for both private and public sectors. Recent literature has witnessed increasing exploration of learning-based approaches for software vulnerability detection. However, a key limitation of these techniques is their primary focus on a single programming language, such as C/C++, which poses constraints considering the polyglot nature of modern software projects. Further, there appears to be an oversight in harnessing the synergies of vulnerability knowledge across varied languages, potentially underutilizing the full capabilities of these methods. To address the aforementioned issues, we introduce MVD - an innovative multi-lingual vulnerability detection framework. This framework acquires the ability to detect vulnerabilities across multiple languages by concurrently learning from vulnerability data of various languages, which are curated by our specialized pipeline. We also incorporate incremental learning to enable the detection capability of MVD to be extended to new languages, thus augmenting its practical utility. Extensive experiments on our curated dataset of more than 11K real-world multi-lingual vulnerabilities substantiate that our framework significantly surpasses state-of-the-art methods in multi-lingual vulnerability detection by 83.7% to 193.6% in PR-AUC. The results also demonstrate that MVD detects vulnerabilities well for new languages without compromising the detection performance of previously trained languages, even when training data for the older languages is unavailable. Overall, our findings motivate and pave the way for the prediction of multi-lingual vulnerabilities in modern software systems.
[LG-32] Obstacle-aware Gaussian Process Regression
链接: https://arxiv.org/abs/2412.06160
作者: Gaurav Shrivastava
关键词-EN: Obstacle-aware trajectory navigation, Obstacle-aware trajectory, Gaussian Process, Gaussian Process regression, data
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Obstacle-aware trajectory navigation is crucial for many systems. For example, in real-world navigation tasks, an agent must avoid obstacles, such as furniture in a room, while planning a trajectory. Gaussian Process (GP) regression, in its current form, fits a curve to a set of data pairs, with each pair consisting of an input point ‘x’ and its corresponding target regression value ‘y(x)’ (a positive data pair). However, to account for obstacles, we need to constrain the GP to avoid a target regression value ‘y(x-)’ for an input point ‘x-’ (a negative data pair). Our proposed approach, ‘GP-ND’ (Gaussian Process with Negative Datapairs), fits the model to the positive data pairs while avoiding the negative ones. Specifically, we model the negative data pairs using small blobs of Gaussian distribution and maximize their KL divergence from the GP. Our framework jointly optimizes for both positive and negative data pairs. Our experiments show that GP-ND outperforms traditional GP learning. Additionally, our framework does not affect the scalability of Gaussian Process regression and helps the model converge faster as the data size increases.
[LG-33] Advancements in Machine Learning and Deep Learning for Early Detection and Management of Mental Health Disorder
链接: https://arxiv.org/abs/2412.06147
作者: Kamala Devi Kannan,Senthil Kumar Jagatheesaperumal,Rajesh N. V. P. S. Kandala,Mojtaba Lotfaliany,Roohallah Alizadehsanid,Mohammadreza Mohebbi
关键词-EN: deep learning, machine learning, mental health illnesses, started playing, playing a significant
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: 20 pages, 3 figures, 3 tables
点击查看摘要
Abstract:For the early identification, diagnosis, and treatment of mental health illnesses, the integration of deep learning (DL) and machine learning (ML) has started playing a significant role. By evaluating complex data from imaging, genetics, and behavioral assessments, these technologies have the potential to significantly improve clinical outcomes. However, they also present unique challenges related to data integration and ethical issues. This survey reviews the development of ML and DL methods for the early diagnosis and treatment of mental health issues. It examines a range of applications, with a particular emphasis on behavioral assessments, genetic and biomarker analysis, and medical imaging for diagnosing diseases like depression, bipolar disorder, and schizophrenia. Predictive modeling for illness progression is further discussed, focusing on the role of risk prediction models and longitudinal studies. Key findings highlight how ML and DL can improve diagnostic accuracy and treatment outcomes while addressing methodological inconsistencies, data integration challenges, and ethical concerns. The study emphasizes the importance of building real-time monitoring systems for individualized treatment, enhancing data fusion techniques, and fostering interdisciplinary collaboration. Future research should focus on overcoming these obstacles to ensure the valuable and ethical application of ML and DL in mental health services.
[LG-34] Bounded Exploration with World Model Uncertainty in Soft Actor-Critic Reinforcement Learning Algorithm
链接: https://arxiv.org/abs/2412.06139
作者: Ting Qiao,Henry Williams,David Valencia,Bruce MacDonald
关键词-EN: Deep Reinforcement Learning, preventing Deep Reinforcement, Reinforcement Learning algorithms, bottlenecks preventing Deep, Deep Reinforcement
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 8 pages, 7 figures. Accepted as a poster presentation in the Australian Robotics and Automation Association (2023)
点击查看摘要
Abstract:One of the bottlenecks preventing Deep Reinforcement Learning algorithms (DRL) from real-world applications is how to explore the environment and collect informative transitions efficiently. The present paper describes bounded exploration, a novel exploration method that integrates both ‘soft’ and intrinsic motivation exploration. Bounded exploration notably improved the Soft Actor-Critic algorithm’s performance and its model-based extension’s converging speed. It achieved the highest score in 6 out of 8 experiments. Bounded exploration presents an alternative method to introduce intrinsic motivations to exploration when the original reward function has strict meanings.
[LG-35] PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems
链接: https://arxiv.org/abs/2412.06112
作者: Ali Menati,Fatemeh Doudi,Dileep Kalathil,Le Xie
关键词-EN: undergoing substantial transformations, substantial transformations due, renewable energy resources, electrification of demand, enhanced integration
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: This paper has been submitted to the Journal of IEEE Transactions on Power Systems
点击查看摘要
Abstract:The electricity sector is undergoing substantial transformations due to the rising electrification of demand, enhanced integration of renewable energy resources, and the emergence of new technologies. These changes are rendering the electric grid more volatile and unpredictable, making it difficult to maintain reliable operations. In order to address these issues, advanced time series prediction models are needed for closing the gap between the forecasted and actual grid outcomes. In this paper, we introduce a multivariate time series prediction model that combines traditional state space models with deep learning methods to simultaneously capture and predict the underlying dynamics of multiple time series. Additionally, we design a time series processing module that incorporates high-resolution external forecasts into sequence-to-sequence prediction models, achieving this with negligible increases in size and no loss of accuracy. We also release an extended dataset spanning five years of load, electricity price, ancillary service price, and renewable generation. To complement this dataset, we provide an open-access toolbox that includes our proposed model, the dataset itself, and several state-of-the-art prediction models, thereby creating a unified framework for benchmarking advanced machine learning approaches. Our findings indicate that the proposed model outperforms existing models across various prediction tasks, improving state-of-the-art prediction error by an average of 7% and decreasing model parameters by 43%.
[LG-36] Fully Distributed Online Training of Graph Neural Networks in Networked Systems
链接: https://arxiv.org/abs/2412.06105
作者: Rostyslav Olshevskyi,Zhongyuan Zhao,Kevin Chan,Gunjan Verma,Ananthram Swami,Santiago Segarra
关键词-EN: Graph neural networks, decentralized artificial intelligence, Graph neural, large-scale networked systems, decentralized artificial
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Graph neural networks (GNNs) are powerful tools for developing scalable, decentralized artificial intelligence in large-scale networked systems, such as wireless networks, power grids, and transportation networks. Currently, GNNs in networked systems mostly follow a paradigm of 'centralized training, distributed execution', which limits their adaptability and slows down their development cycles. In this work, we fill this gap for the first time by developing a communication-efficient, fully distributed online training approach for GNNs applied to large networked systems. For a mini-batch with B samples, our approach of training an L-layer GNN only adds L rounds of message passing to the LB rounds required by GNN inference, with doubled message sizes. Through numerical experiments in graph-based node regression, power allocation, and link scheduling in wireless networks, we demonstrate the effectiveness of our approach in training GNNs under supervised, unsupervised, and reinforcement learning paradigms.
[LG-37] Learning from Snapshots of Discrete and Continuous Data Streams
链接: https://arxiv.org/abs/2412.06079
作者: Pramith Devulapalli,Steve Hanneke
关键词-EN: selectively clicking pictures, understand animal movement, trap selectively clicking, camera trap selectively, animal movements unfolding
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Imagine a smart camera trap selectively clicking pictures to understand animal movement patterns within a particular habitat. These “snapshots”, or pieces of data captured from a data stream at adaptively chosen times, provide a glimpse of different animal movements unfolding through time. Learning a continuous-time process through snapshots, such as smart camera traps, is a central theme governing a wide array of online learning situations. In this paper, we adopt a learning-theoretic perspective in understanding the fundamental nature of learning different classes of functions from both discrete data streams and continuous data streams. In our first framework, the update-and-deploy setting, a learning algorithm discretely queries from a process to update a predictor designed to make predictions given as input the data stream. We construct a uniform sampling algorithm that can learn with bounded error any concept class with finite Littlestone dimension. Our second framework, known as the blind-prediction setting, consists of a learning algorithm generating predictions independently of observing the process, only engaging with the process when it chooses to make queries. Interestingly, we show a stark contrast in learnability where non-trivial concept classes are unlearnable. However, we show that adaptive learning algorithms are necessary to learn sets of time-dependent and data-dependent functions, called pattern classes, in either framework. Finally, we develop a theory of pattern classes under discrete data streams for the blind-prediction setting.
[LG-38] Mixture-of-PageRanks: Replacing Long-Context with Real-Time Sparse GraphRAG
链接: https://arxiv.org/abs/2412.06078
作者: Nicholas Alonso,Beren Millidge
关键词-EN: enabling entire books, frontier LLMs dramatically, enabling entire, advances have extended, window of frontier
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advances have extended the context window of frontier LLMs dramatically, from a few thousand tokens up to millions, enabling entire books and codebases to fit into context. However, the compute costs of inferencing long-context LLMs are massive and often prohibitive in practice. RAG offers an efficient and effective alternative: retrieve and process only the subset of the context most important for the current task. Although promising, recent work applying RAG to long-context tasks has two core limitations: 1) there has been little focus on making the RAG pipeline compute efficient, and 2) such works only test on simple QA tasks, and their performance on more challenging tasks is unclear. To address this, we develop an algorithm based on PageRank, a graph-based retrieval algorithm, which we call mixture-of-PageRanks (MixPR). MixPR uses a mixture of PageRank-based graph-retrieval algorithms implemented using sparse matrices for efficient, cheap retrieval that can deal with a variety of complex tasks. Our MixPR retriever achieves state-of-the-art results across a wide range of long-context benchmark tasks, outperforming both existing RAG methods, specialized retrieval architectures, and long-context LLMs despite being far more compute efficient. Due to using sparse embeddings, our retriever is extremely compute efficient, capable of embedding and retrieving millions of tokens within a few seconds and runs entirely on CPU.
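The graph-retrieval primitive underlying MixPR is personalized PageRank over a sparse matrix. The sketch below runs power iteration with scipy sparse operations on a toy chunk graph; the graph construction, the query-dependent seeding, and the paper's mixture of PageRank variants are illustrative assumptions rather than the released method.

```python
# Personalized PageRank by power iteration over a sparse adjacency matrix, the
# kind of graph-retrieval primitive MixPR builds on. Toy graph and seeding only;
# the paper's mixture of PageRank variants is not reproduced.
import numpy as np
import scipy.sparse as sp

def personalized_pagerank(adj, personalization, alpha=0.85, n_iters=100, tol=1e-8):
    n = adj.shape[0]
    out_deg = np.asarray(adj.sum(axis=1)).ravel()
    out_deg[out_deg == 0] = 1.0
    P = sp.diags(1.0 / out_deg) @ adj            # row-stochastic transition matrix
    p = personalization / personalization.sum()
    r = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        r_new = alpha * (P.T @ r) + (1 - alpha) * p
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

# Toy graph of 6 "chunks"; the query seeds probability mass on chunks 0 and 1.
rows = [0, 1, 1, 2, 3, 4, 5, 0]
cols = [1, 0, 2, 3, 4, 5, 0, 3]
adj = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(6, 6))
seed = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
scores = personalized_pagerank(adj, seed)
print("top retrieved chunks:", np.argsort(-scores)[:3])
```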
[LG-39] On Socially Fair Low-Rank Approximation and Column Subset Selection NEURIPS2024
链接: https://arxiv.org/abs/2412.06063
作者: Zhao Song,Ali Vakilian,David P. Woodruff,Samson Zhou
关键词-EN: column subset selection, machine learning applications, fair low-rank approximation, Low-rank approximation, fair column subset
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: NeurIPS 2024
点击查看摘要
Abstract:Low-rank approximation and column subset selection are two fundamental and related problems that are applied across a wealth of machine learning applications. In this paper, we study the question of socially fair low-rank approximation and socially fair column subset selection, where the goal is to minimize the loss over all sub-populations of the data. We show that surprisingly, even constant-factor approximation to fair low-rank approximation requires exponential time under certain standard complexity hypotheses. On the positive side, we give an algorithm for fair low-rank approximation that, for a constant number of groups and constant-factor accuracy, runs in 2^{poly(k)} time rather than the naïve n^{poly(k)}, which is a substantial improvement when the dataset has a large number n of observations. We then show that there exist bicriteria approximation algorithms for fair low-rank approximation and fair column subset selection that run in polynomial time.
[LG-40] siForest: Detecting Network Anomalies with Set-Structured Isolation Forest
链接: https://arxiv.org/abs/2412.06015
作者: Christie Djidjev
关键词-EN: cyber threats continue, maintaining robust cybersecurity, robust cybersecurity defenses, anomalous network behavior, Isolation Forest
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 16 pages, 6 figures
点击查看摘要
Abstract:As cyber threats continue to evolve in sophistication and scale, the ability to detect anomalous network behavior has become critical for maintaining robust cybersecurity defenses. Modern cybersecurity systems face the overwhelming challenge of analyzing billions of daily network interactions to identify potential threats, making efficient and accurate anomaly detection algorithms crucial for network defense. This paper investigates the use of variations of the Isolation Forest (iForest) machine learning algorithm for detecting anomalies in internet scan data. In particular, it presents the Set-Partitioned Isolation Forest (siForest), a novel extension of the iForest method designed to detect anomalies in set-structured data. By treating instances such as sets of multiple network scans with the same IP address as cohesive units, siForest effectively addresses some challenges of analyzing complex, multidimensional datasets. Extensive experiments on synthetic datasets simulating diverse anomaly scenarios in network traffic demonstrate that siForest has the potential to outperform traditional approaches on some types of internet scan data.
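The set-structured idea can be approximated with off-the-shelf tools: group scan records by IP, summarize each group into one feature vector, and score the groups with a standard Isolation Forest. The snippet below sketches that workflow on synthetic scan data; the actual siForest set-partitioning scheme is the paper's own and is not reproduced here, and the features and data are illustrative assumptions.

```python
# Set-level anomaly detection sketch: group scan records by IP, summarize each
# group into one feature vector, and score the groups with a standard Isolation
# Forest. This mimics the "treat a set of scans as one unit" idea only.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
scans = pd.DataFrame({
    "ip": rng.integers(0, 200, size=5000),
    "port": rng.integers(1, 65535, size=5000),
    "payload_len": rng.exponential(200, size=5000),
})
# Inject one anomalous IP probing an unusually wide range of ports.
anomalous = pd.DataFrame({"ip": 999, "port": rng.integers(1, 65535, size=400),
                          "payload_len": rng.exponential(50, size=400)})
scans = pd.concat([scans, anomalous], ignore_index=True)

# Aggregate each IP's scans into one set-level feature vector.
per_ip = scans.groupby("ip").agg(
    n_scans=("port", "size"),
    n_unique_ports=("port", "nunique"),
    mean_payload=("payload_len", "mean"),
)

iso = IsolationForest(random_state=0).fit(per_ip.values)
per_ip["score"] = iso.score_samples(per_ip[["n_scans", "n_unique_ports", "mean_payload"]].values)
print(per_ip.sort_values("score").head(3))   # lowest scores = most anomalous IPs
```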
[LG-41] Accurate Multi-Category Student Performance Forecasting at Early Stages of Online Education Using Neural Networks
链接: https://arxiv.org/abs/2412.05938
作者: Naveed Ur Rehman Junejo,Muhammad Wasim Nawaz,Qingsheng Huang,Xiaoqing Dong,Chang Wang,Gengzhong Zheng
关键词-EN: analyze student performance, predicting student performance, University Learning Analytics, Open University Learning, accurately predicting student
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:The ability to accurately predict and analyze student performance in online education, both at the outset and throughout the semester, is vital. Most of the published studies focus on binary classification (Fail or Pass) but there is still a significant research gap in predicting students’ performance across multiple categories. This study introduces a novel neural network-based approach capable of accurately predicting student performance and identifying vulnerable students at early stages of the online courses. The Open University Learning Analytics (OULA) dataset is employed to develop and test the proposed model, which predicts outcomes in Distinction, Fail, Pass, and Withdrawn categories. The OULA dataset is preprocessed to extract features from demographic data, assessment data, and clickstream interactions within a Virtual Learning Environment (VLE). Comparative simulations indicate that the proposed model significantly outperforms existing baseline models including Artificial Neural Network Long Short Term Memory (ANN-LSTM), Random Forest (RF) ‘gini’, RF ‘entropy’ and Deep Feed Forward Neural Network (DFFNN) in terms of accuracy, precision, recall, and F1-score. The results indicate that the prediction accuracy of the proposed method is about 25% more than the existing state-of-the-art. Furthermore, compared to existing methodologies, the model demonstrates superior predictive capability across temporal course progression, achieving superior accuracy even at the initial 20% phase of course completion.
[LG-42] Understanding the Impact of Graph Reduction on Adversarial Robustness in Graph Neural Networks
链接: https://arxiv.org/abs/2412.05883
作者: Kerui Wu,Ka-Ho Chow,Wenqi Wei,Lei Yu
关键词-EN: Graph Neural Networks, Neural Networks, scalability remains underexplored, large-scale graph data, graph reduction techniques
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:As Graph Neural Networks (GNNs) become increasingly popular for learning from large-scale graph data across various domains, their susceptibility to adversarial attacks when using graph reduction techniques for scalability remains underexplored. In this paper, we present an extensive empirical study to investigate the impact of graph reduction techniques, specifically graph coarsening and sparsification, on the robustness of GNNs against adversarial attacks. Through extensive experiments involving multiple datasets and GNN architectures, we examine the effects of four sparsification and six coarsening methods on the poisoning attacks. Our results indicate that, while graph sparsification can mitigate the effectiveness of certain poisoning attacks, such as Mettack, it has limited impact on others, like PGD. Conversely, graph coarsening tends to amplify the adversarial impact, significantly reducing classification accuracy as the reduction ratio decreases. Additionally, we provide a novel analysis of the causes driving these effects and examine how defensive GNN models perform under graph reduction, offering practical insights for designing robust GNNs within graph acceleration systems.
[LG-43] Risk factor identification and classification of malnutrition among under-five children in Bangladesh: Machine learning and statistical approach
链接: https://arxiv.org/abs/2412.05813
作者: Tasfin Mahmud,Tayab Uddin Wara,Chironjeet Das Joy
关键词-EN: Multiple Indicator Cluster, Support Vector Machine, machine learning algorithms, well-established machine learning, Decision Tree
类目: Machine Learning (cs.LG)
*备注: In review to Heliyon; 18 pages; 10 figures; 6 tables
点击查看摘要
Abstract:This study aims to understand the factors that resulted in under-five children's malnutrition from the Multiple Indicator Cluster (MICS-2019) nationwide surveys and classify different malnutrition stages based on four well-established machine learning algorithms, namely Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Multi-layer Perceptron (MLP) neural network. Accuracy, precision, recall, and F1 scores are obtained to evaluate the performance of each model. The statistical Pearson correlation coefficient analysis is also done to understand the significant factors related to a child's malnutrition. The eligible data sample for analysis was 21,858 among 24,686 samples from the dataset. Satisfactory and insightful results were obtained in each case, and the RF and MLP performed extraordinarily well. For RF, the accuracy was 98.55%, average precision 98.3%, recall 95.68%, and F1 score 97.13%. For MLP, the accuracy was 98.69%, average precision 97.62%, recall 90.96%, and F1 score 97.39%. From the Pearson correlation coefficients, all negative correlation results are listed, and the most significant impacts are found for WAZ2 (Weight for age Z score WHO) (-0.828), WHZ2 (Weight for height Z score WHO) (-0.706), ZBMI (BMI Z score WHO) (-0.656), BD3 (whether child is still being breastfed) (-0.59), HAZ2 (Height for age Z score WHO) (-0.452), CA1 (whether child had diarrhea in last 2 weeks) (-0.34), Windex5 (Wealth index quantile) (-0.161), melevel (Mother's education) (-0.132), and CA14/CA16/CA17 (whether child had illness with fever, cough, and breathing) (-0.04), in decreasing order of magnitude.
[LG-44] Two-way Deconfounder for Off-policy Evaluation in Causal Reinforcement Learning
链接: https://arxiv.org/abs/2412.05783
作者: Shuguang Yu,Shuxing Fang,Ruixin Peng,Zhengling Qi,Fan Zhou,Chengchun Shi
关键词-EN: studies off-policy evaluation, paper studies off-policy, off-policy evaluation, unmeasured confounders, paper studies
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:This paper studies off-policy evaluation (OPE) in the presence of unmeasured confounders. Inspired by the two-way fixed effects regression model widely used in the panel data literature, we propose a two-way unmeasured confounding assumption to model the system dynamics in causal reinforcement learning and develop a two-way deconfounder algorithm that devises a neural tensor network to simultaneously learn both the unmeasured confounders and the system dynamics, based on which a model-based estimator can be constructed for consistent policy value estimation. We illustrate the effectiveness of the proposed estimator through theoretical results and numerical experiments.
[LG-45] ProtGO: A Transformer based Fusion Model for accurately predicting Gene Ontology (GO) Terms from full scale Protein Sequences
链接: https://arxiv.org/abs/2412.05776
作者: Azwad Tamir,Jiann-Shiun Yuan
关键词-EN: generation sequencing technology, open-source protein databases, Recent developments, protein databases consisting, creation of extensive
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 11 pages, 3 figures, 4 tables, This work would be submitted to Scientific journals for possible publication
点击查看摘要
Abstract:Recent developments in next generation sequencing technology have led to the creation of extensive, open-source protein databases consisting of hundreds of millions of sequences. To render these sequences applicable in biomedical applications, they must be meticulously annotated by wet lab testing or extracting them from existing literature. Over the last few years, researchers have developed numerous automatic annotation systems, particularly deep learning models based on machine learning and artificial intelligence, to address this issue. In this work, we propose a transformer-based fusion model capable of predicting Gene Ontology (GO) terms from full-scale protein sequences, achieving state-of-the-art accuracy compared to other contemporary machine learning annotation systems. The approach performs particularly well on clustered split datasets, which comprise training and testing samples originating from distinct distributions that are structurally diverse. This demonstrates that the model is able to understand both short and long term dependencies within the enzyme's structure and can precisely identify the motifs associated with the various GO terms. Furthermore, the technique is lightweight and less computationally expensive compared to the benchmark methods, while at the same time not being affected by sequence length, rendering it appropriate for diverse applications with varying sequence lengths.
[LG-46] KITE-DDI: A Knowledge graph Integrated Transformer Model for accurately predicting Drug-Drug Interaction Events from Drug SMILES and Biomedical Knowledge Graph
链接: https://arxiv.org/abs/2412.05770
作者: Azwad Tamir,Jiann-Shiun Yuan
关键词-EN: predicting DDI events, prescribe multiple medications, multiple medications simultaneously, DDI events, treat diseases
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 18 pages, 8 figures, 5 Tables, This work has been submitted to IEEE for possible publication
点击查看摘要
Abstract:It is a common practice in modern medicine to prescribe multiple medications simultaneously to treat diseases. However, these medications could have adverse reactions between them, known as Drug-Drug Interactions (DDI), which have the potential to cause significant bodily injury and could even be fatal. Hence, it is essential to identify all the DDI events before prescribing multiple drugs to a patient. Most contemporary research for predicting DDI events relies on either information from Biomedical Knowledge graphs (KG) or drug SMILES, with very few managing to merge data from both to make predictions. Others use heuristic algorithms to extract features from SMILES and KGs, which are then fed into a Deep Learning framework to generate the output. In this study, we propose a KG-integrated Transformer architecture to generate an end-to-end fully automated Machine Learning pipeline for predicting DDI events with high accuracy. The algorithm takes full-scale molecular SMILES sequences of a pair of drugs and a biomedical KG as input and predicts the interaction between the two drugs with high precision. The results show superior performance in two different benchmark datasets compared to existing state-of-the-art models, especially when the test and training sets contain distinct sets of drug molecules. This demonstrates the strong generalization of the proposed model, indicating its potential for DDI event prediction for newly developed drugs. The model does not depend on heuristic models for generating embeddings and has a minimal number of hyperparameters, making it easy to use while demonstrating outstanding performance in low-data scenarios.
[LG-47] DeMem: Privacy-Enhanced Robust Adversarial Learning via De-Memorization
链接: https://arxiv.org/abs/2412.05767
作者: Xiaoyu Luo,Qiongxiu Li
关键词-EN: withstand manipulated inputs, machine learning models, real-world applications, withstand manipulated, manipulated inputs
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 8 pages
点击查看摘要
Abstract:Adversarial robustness, the ability of a model to withstand manipulated inputs that cause errors, is essential for ensuring the trustworthiness of machine learning models in real-world applications. However, previous studies have shown that enhancing adversarial robustness through adversarial training increases vulnerability to privacy attacks. While differential privacy can mitigate these attacks, it often compromises robustness against both natural and adversarial samples. Our analysis reveals that differential privacy disproportionately impacts low-risk samples, causing an unintended performance drop. To address this, we propose DeMem, which selectively targets high-risk samples, achieving a better balance between privacy protection and model robustness. DeMem is versatile and can be seamlessly integrated into various adversarial training techniques. Extensive evaluations across multiple training methods and datasets demonstrate that DeMem significantly reduces privacy leakage while maintaining robustness against both natural and adversarial samples. These results confirm DeMem’s effectiveness and broad applicability in enhancing privacy without compromising robustness.
[LG-48] REGE: A Method for Incorporating Uncertainty in Graph Embeddings
链接: https://arxiv.org/abs/2412.05735
作者: Zohair Shafi,Germans Savcisens,Tina Eliassi-Rad
关键词-EN: Machine learning models, arise from incomplete, Machine learning, real-world applications, applications are prone
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Machine learning models for graphs in real-world applications are prone to two primary types of uncertainty: (1) those that arise from incomplete and noisy data and (2) those that arise from uncertainty of the model in its output. These sources of uncertainty are not mutually exclusive. Additionally, models are susceptible to targeted adversarial attacks, which exacerbate both of these uncertainties. In this work, we introduce Radius Enhanced Graph Embeddings (REGE), an approach that measures and incorporates uncertainty in data to produce graph embeddings with radius values that represent the uncertainty of the model’s output. REGE employs curriculum learning to incorporate data uncertainty and conformal learning to address the uncertainty in the model’s output. In our experiments, we show that REGE’s graph embeddings perform better under adversarial attacks by an average of 1.5% (accuracy) against state-of-the-art methods.
[LG-49] Finite Element Neural Network Interpolation. Part I: Interpretable and Adaptive Discretization for Solving PDEs
链接: https://arxiv.org/abs/2412.05719
作者: Kateřina Škardová,Alexandre Daby-Seesaram,Martin Genet
关键词-EN: Hierarchical Deep-learning Neural, Finite Element Neural, Embedded Finite Element, Deep-learning Neural Networks, sparse neural network
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 25 pages, 15 figures
点击查看摘要
Abstract:We present the Finite Element Neural Network Interpolation (FENNI) framework, a sparse neural network architecture extending previous work on Embedded Finite Element Neural Networks (EFENN) introduced with the Hierarchical Deep-learning Neural Networks (HiDeNN). Due to their mesh-based structure, EFENN requires significantly fewer trainable parameters than fully connected neural networks, with individual weights and biases having a clear interpretation. Our FENNI framework, within the EFENN framework, brings improvements to the HiDeNN approach. First, we propose a reference element-based architecture where shape functions are defined on a reference element, enabling variability in interpolation functions and straightforward use of Gaussian quadrature rules for evaluating the loss function. Second, we propose a pragmatic multigrid training strategy based on the framework's interpretability. Third, HiDeNN's combined rh-adaptivity is extended from 1D to 2D, with a new Jacobian-based criterion for adding nodes combining h- and r-adaptivity. From a deep learning perspective, adaptive mesh behavior through rh-adaptivity and the multigrid approach correspond to transfer learning, enabling FENNI to optimize the network's architecture dynamically during training. The framework's capabilities are demonstrated on 1D and 2D test cases, where its accuracy and computational cost are compared against an analytical solution and a classical FEM solver. On these cases, the multigrid training strategy drastically improves the training stage's efficiency and robustness. Finally, we introduce a variational loss within the EFENN framework, showing that it performs as well as energy-based losses and outperforms residual-based losses. This framework is extended to surrogate modeling over the parametric space in Part II. Comments: 25 pages, 15 figures Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph) Cite as: arXiv:2412.05719 [math.NA] (or arXiv:2412.05719v1 [math.NA] for this version) https://doi.org/10.48550/arXiv.2412.05719
[LG-50] M3PC: Test-time Model Predictive Control for Pretrained Masked Trajectory Model
链接: https://arxiv.org/abs/2412.05675
作者: Kehan Wen,Yutong Hu,Yao Mu,Lei Ke
关键词-EN: Offline Reinforcement Learning, unified Transformer trained, Reinforcement Learning, Recent work, masked auto-encoding objective
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Recent work in Offline Reinforcement Learning (RL) has shown that a unified Transformer trained under a masked auto-encoding objective can effectively capture the relationships between different modalities (e.g., states, actions, rewards) within given trajectory datasets. However, this information has not been fully exploited during the inference phase, where the agent needs to generate an optimal policy instead of just reconstructing masked components from unmasked ones. Given that a pretrained trajectory model can act as both a Policy Model and a World Model with appropriate mask patterns, we propose using Model Predictive Control (MPC) at test time to leverage the model’s own predictive capability to guide its action selection. Empirical results on D4RL and RoboMimic show that our inference-phase MPC significantly improves the decision-making performance of a pretrained trajectory model without any additional parameter training. Furthermore, our framework can be adapted to Offline to Online (O2O) RL and Goal Reaching RL, resulting in more substantial performance gains when an additional online interaction budget is provided, and better generalization capabilities when different task targets are specified. Code is available: this https URL.
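The test-time control loop described above can be sketched generically as random-shooting MPC with a learned world model: sample candidate action sequences, roll each out in the model, and execute only the first action of the best-scoring sequence. The toy world model and scoring below are illustrative assumptions; M3PC's masked trajectory model and mask patterns are not reproduced here.

```python
# Generic random-shooting MPC at test time with a learned world model: sample
# candidate action sequences, roll each out with the model, score by predicted
# return, and execute only the first action of the best sequence. Sketch of the
# control loop only; M3PC's masked trajectory model is not reproduced.
import numpy as np

def toy_world_model(state, action):
    """Stand-in for a learned model: returns (next_state, predicted_reward)."""
    next_state = state + 0.1 * action
    reward = -np.sum(next_state ** 2)            # reward drives the state toward zero
    return next_state, reward

def mpc_action(state, horizon=10, n_candidates=256, action_dim=2, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        s = state.copy()
        for a in seq:                            # roll out the full sequence in the model
            s, r = toy_world_model(s, a)
            returns[i] += r
    return candidates[np.argmax(returns), 0]     # execute only the first action

state = np.array([1.0, -0.5])
for t in range(20):                              # receding-horizon control loop
    action = mpc_action(state, rng=np.random.default_rng(t))
    state, _ = toy_world_model(state, action)
print("final state:", state.round(3))
```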
[LG-51] Detecting outliers by clustering algorithms
链接: https://arxiv.org/abs/2412.05669
作者: Qi Li,Shuliang Wang
关键词-EN: clustering algorithms, Clustering, algorithms, Outliers, data mining
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Clustering and outlier detection are two important tasks in data mining. Outliers frequently interfere with clustering algorithms' ability to determine the similarity between objects, resulting in unreliable clustering results. Currently, only a few clustering algorithms (e.g., DBSCAN) have the ability to detect outliers to eliminate interference. For other clustering algorithms, it is tedious to introduce another outlier detection task to eliminate outliers before each clustering process. Obviously, how to equip more clustering algorithms with outlier detection ability is very meaningful. Although a common strategy allows clustering algorithms to detect outliers based on the distance between objects and clusters, it conflicts with improving the performance of clustering algorithms on datasets with outliers. In this paper, we propose a novel outlier detection approach, called ODAR, for clustering. ODAR maps outliers and normal objects into two separate clusters by feature transformation. As a result, any clustering algorithm can detect outliers by identifying clusters. Experiments show that ODAR is robust to diverse datasets. Compared with baseline methods, the clustering algorithms achieve the best results on 7 out of 10 datasets with the help of ODAR, with at least 5% improvement in accuracy.
[LG-52] Towards Robust Spatio-Temporal Auto-Regressive Prediction: Adams-Bashforth Time Integration with Adaptive Multi-Step Rollout
链接: https://arxiv.org/abs/2412.05657
作者: Sunwoong Yang,Ricardo Vinuesa,Namwoo Kang
关键词-EN: scientific machine learning, introducing innovative temporal, machine learning models, innovative temporal integration, study addresses
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
点击查看摘要
Abstract:This study addresses the critical challenge of error accumulation in spatio-temporal auto-regressive predictions within scientific machine learning models by introducing innovative temporal integration schemes and adaptive multi-step rollout strategies. We present a comprehensive analysis of time integration methods, highlighting the adaptation of the two-step Adams-Bashforth scheme to enhance long-term prediction robustness in auto-regressive models. Additionally, we improve temporal prediction accuracy through a multi-step rollout strategy that incorporates multiple future time steps during training, supported by three newly proposed approaches that dynamically adjust the importance of each future step. By integrating the Adams-Bashforth scheme with adaptive multi-step strategies, our graph neural network-based auto-regressive model accurately predicts 350 future time steps, even under practical constraints such as limited training data and minimal model capacity – achieving an error of only 1.6% compared to the vanilla auto-regressive approach. Moreover, our framework demonstrates an 83% improvement in rollout performance over the standard noise injection method, a widely used technique for enhancing long-term rollout performance. Its effectiveness is further validated in more challenging scenarios with truncated meshes, showcasing its adaptability and robustness in practical applications. This work introduces a versatile framework for robust long-term spatio-temporal auto-regressive predictions, effectively mitigating error accumulation across various model types and engineering disciplines.
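To make the two-step Adams-Bashforth ingredient concrete, here is a minimal sketch of an AB2 auto-regressive rollout. It assumes the learned surrogate predicts the time derivative of the field; that assumption and all variable names are ours, not the paper's.

```python
import torch

def ab2_rollout(model, u0, u1, dt, steps):
    """Illustrative two-step Adams-Bashforth (AB2) auto-regressive rollout.
    `model(u)` is assumed to predict the time derivative du/dt of the field u
    (e.g. a graph-network surrogate)."""
    f_prev = model(u0)                 # derivative at step n-1
    u_curr = u1
    trajectory = [u0, u1]
    for _ in range(steps):
        f_curr = model(u_curr)         # derivative at step n
        # AB2 update: u_{n+1} = u_n + dt * (3/2 * f_n - 1/2 * f_{n-1})
        u_next = u_curr + dt * (1.5 * f_curr - 0.5 * f_prev)
        trajectory.append(u_next)
        f_prev, u_curr = f_curr, u_next
    return torch.stack(trajectory)
```

Compared with a plain Euler-style rollout, the extra term reuses the previous prediction, which is what the paper credits for damping error accumulation over long horizons.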
[LG-53] Active Sequential Posterior Estimation for Sample-Efficient Simulation-Based Inference
链接: https://arxiv.org/abs/2412.05590
作者: Sam Griesemer,Defu Cao,Zijun Cui,Carolina Osorio,Yan Liu
关键词-EN: complex real-world processes, Computer simulations, long presented, presented the exciting, exciting possibility
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Computer simulations have long presented the exciting possibility of scientific insight into complex real-world processes. Despite the power of modern computing, however, it remains challenging to systematically perform inference under simulation models. This has led to the rise of simulation-based inference (SBI), a class of machine learning-enabled techniques for approaching inverse problems with stochastic simulators. Many such methods, however, require large numbers of simulation samples and face difficulty scaling to high-dimensional settings, often making inference prohibitive under resource-intensive simulators. To mitigate these drawbacks, we introduce active sequential neural posterior estimation (ASNPE). ASNPE brings an active learning scheme into the inference loop to estimate the utility of simulation parameter candidates to the underlying probabilistic model. The proposed acquisition scheme is easily integrated into existing posterior estimation pipelines, allowing for improved sample efficiency with low computational overhead. We further demonstrate the effectiveness of the proposed method in the travel demand calibration setting, a high-dimensional inverse problem commonly requiring computationally expensive traffic simulators. Our method outperforms well-tuned benchmarks and state-of-the-art posterior estimation methods on a large-scale real-world traffic network, as well as demonstrates a performance advantage over non-active counterparts on a suite of SBI benchmark environments.
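A schematic of where the active-learning step sits in one round of sequential posterior estimation is sketched below. The acquisition function itself is the paper's contribution and is not reproduced; `acquire`, `sample`, and `fit` are hypothetical placeholders written in a PyTorch-tensor style.

```python
def asnpe_round(simulator, posterior_net, acquire, observed_x, num_select=500):
    """Schematic single round of active sequential neural posterior estimation.
    `posterior_net` is a conditional density estimator q(theta | x); `acquire`
    scores candidate simulation parameters by their estimated utility to the
    underlying probabilistic model. All names are illustrative."""
    # Propose a large pool of candidate parameters from the current posterior estimate.
    candidates = posterior_net.sample(10 * num_select, condition=observed_x)
    # Keep only the candidates the acquisition function deems most informative.
    utilities = acquire(candidates, posterior_net, observed_x)
    theta = candidates[utilities.argsort(descending=True)[:num_select]]
    # Run the expensive simulator only on the selected parameters, then refit.
    x = simulator(theta)
    posterior_net.fit(theta, x)
    return posterior_net
```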
[LG-54] STONet: A novel neural operator for modeling solute transport in micro-cracked reservoirs
链接: https://arxiv.org/abs/2412.05576
作者: Ehsan Haghighat,Mohammad Hesan Adeli,S Mohammad Mousavi,Ruben Juanes
关键词-EN: efficiently model contaminant, Transport Operator Network, model contaminant transport, Solute Transport Operator, Operator Network
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE); Fluid Dynamics (physics.flu-dyn)
*备注:
点击查看摘要
Abstract:In this work, we develop a novel neural operator, the Solute Transport Operator Network (STONet), to efficiently model contaminant transport in micro-cracked reservoirs. The model combines different networks to encode heterogeneous properties effectively. By predicting the concentration rate, we are able to accurately model the transport process. Numerical experiments demonstrate that our neural operator approach achieves accuracy comparable to that of the finite element method. The previously introduced Enriched DeepONet architecture has been revised, motivated by the multi-head attention architecture popularized by transformers, to improve its performance without increasing the compute cost. The computational efficiency of the proposed model enables rapid and accurate predictions of solute transport, facilitating the optimization of reservoir management strategies and the assessment of environmental impacts. The data and code for the paper will be published at this https URL.
[LG-55] SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision
链接: https://arxiv.org/abs/2412.05569
作者: Kangjie Zheng,Siyue Liang,Junwei Yang,Bin Feng,Zequn Liu,Wei Ju,Zhiping Xiao,Ming Zhang
关键词-EN: garnered significant attention, SMILES, pre-trained SMILES LMs, crucial textual representation, SMILES LMs
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
点击查看摘要
Abstract:SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs focus solely on the single-token level supervision during pre-training, failing to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic, preventing the models from capturing richer molecular semantic information. Moreover, during pre-training, these SMILES LMs only process corrupted SMILES inputs, never encountering any valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals, but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from these incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks, even outperforming several 3D molecular representation models.
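A rough sketch of the fragment-level corruption step the abstract describes is given below. It assumes substructure spans come from some chemistry-aware fragmenter; the function and its names are illustrative, not the authors' implementation.

```python
import random

def corrupt_smiles(tokens, fragment_spans, drop_prob=0.15):
    """Illustrative fragment-level corruption: randomly delete whole substructure
    spans from a tokenized SMILES string. `fragment_spans` is a list of
    (start, end) token indices produced by a molecular fragmenter (assumed)."""
    kept = list(tokens)
    # Delete spans from the back so earlier indices stay valid.
    for start, end in sorted(fragment_spans, reverse=True):
        if random.random() < drop_prob:
            del kept[start:end]
    return kept

# Training (schematic): the edit-based LM predicts the insert/keep operations that
# transform the corrupted sequence back into the original SMILES, so supervision
# acts at the fragment level rather than on single masked tokens.
```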
[LG-56] Exploring the Use of LLMs for SQL Equivalence Checking
链接: https://arxiv.org/abs/2412.05561
作者: Rajat Singh,Srikanta Bedathur
关键词-EN: diverse contexts ranging, intractable problem encountered, SQL, debugging query rewriting, query rewriting rules
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Equivalence checking of two SQL queries is an intractable problem encountered in diverse contexts ranging from grading student submissions in a DBMS course to debugging query rewriting rules in an optimizer, and many more. While a lot of progress has been made in recent years in developing practical solutions for this problem, the existing methods can handle only a small subset of SQL, even for bounded equivalence checking. They cannot support sophisticated SQL expressions one encounters in practice. At the same time, large language models (LLMs) – such as GPT-4 – have emerged as powerful generators of SQL from natural language specifications. This paper explores whether LLMs can also demonstrate the ability to reason with SQL queries and help advance SQL equivalence checking. Towards this, we conducted a detailed evaluation of several LLMs over collections with SQL pairs of varying levels of complexity. We explored the efficacy of different prompting techniques, the utility of synthetic examples and explanations, as well as logical plans generated by query parsers. Our main finding is that with well-designed prompting using an unoptimized SQL Logical Plan, LLMs can perform equivalence checking beyond the capabilities of current techniques, achieving nearly 100% accuracy for equivalent pairs and up to 70% for non-equivalent pairs of SQL queries. While LLMs lack the ability to generate formal proofs, their synthetic examples and human-readable explanations offer valuable insights to students (and instructors) in a classroom setting and to database administrators (DBAs) managing large database installations. Additionally, we show that with careful fine-tuning, we can close the performance gap between smaller (and efficient) models and larger models such as GPT, thus paving the way for potential LLM-integration in standalone data processing systems.
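As a rough illustration of the best-performing setup the abstract reports (prompting with both queries plus their unoptimized logical plans), a prompt builder might look like the sketch below; the wording is ours and purely illustrative.

```python
def build_equivalence_prompt(q1: str, q2: str, plan1: str, plan2: str) -> str:
    """Schematic prompt in the spirit of the paper's logical-plan prompting:
    both SQL queries and their unoptimized logical plans, asking the LLM for a
    step-by-step equivalence judgment."""
    return (
        "You are an expert in relational algebra and SQL semantics.\n"
        "Decide whether the two queries below are equivalent, i.e. return the "
        "same result on every database instance.\n\n"
        f"Query 1:\n{q1}\n\nLogical plan 1:\n{plan1}\n\n"
        f"Query 2:\n{q2}\n\nLogical plan 2:\n{plan2}\n\n"
        "Reason step by step over the logical plans, then answer "
        "'EQUIVALENT' or 'NOT EQUIVALENT', giving a counterexample database if possible."
    )
```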
[LG-57] Convergence analysis of wide shallow neural operators within the framework of Neural Tangent Kernel
链接: https://arxiv.org/abs/2412.05545
作者: Xianliang Xu,Ye Li,Zhongyi Huang
关键词-EN: Partial Differential Equations, approximating operators mapping, Deep Ritz Method, mapping between Banach, Banach spaces
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Neural operators aim to approximate operators mapping between Banach spaces of functions and have achieved much success in the field of scientific computing. Compared to certain deep learning-based solvers, such as Physics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), neural operators can solve a class of Partial Differential Equations (PDEs). Although much work has been done to analyze the approximation and generalization error of neural operators, there is still a lack of analysis of their training error. In this work, we conduct a convergence analysis of gradient descent for wide shallow neural operators within the framework of the Neural Tangent Kernel (NTK). The core idea lies in the fact that over-parameterization and random initialization together ensure that each weight vector remains near its initialization throughout all iterations, yielding the linear convergence of gradient descent. We demonstrate that, under this over-parameterized setting, gradient descent finds the global minimum regardless of whether it operates in continuous or discrete time.
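For readers unfamiliar with NTK-style training-error analyses, the kind of statement being proved typically takes the following shape (our notation, shown only to illustrate what "linear convergence of gradient descent" means here; this is not the paper's exact theorem).

```latex
% Illustrative shape of an NTK linear-convergence guarantee: if the network is wide
% enough that weights stay near their initialization and the limiting NTK Gram matrix
% has smallest eigenvalue \lambda_0 > 0, then for a suitable step size \eta,
\[
  L(\theta_t) \;\le\; \left(1 - \frac{\eta \lambda_0}{2}\right)^{t} L(\theta_0),
\]
% i.e. the training loss under gradient descent decays geometrically to the global minimum.
```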
[LG-58] Can large language models be privacy preserving and fair medical coders?
链接: https://arxiv.org/abs/2412.05533
作者: Ali Dadsetan,Dorsa Soleymani,Xijie Zeng,Frank Rudzicz
关键词-EN: Protecting patient data, deploying machine learning, machine learning algorithms, patient data privacy, algorithms in healthcare
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Protecting patient data privacy is a critical concern when deploying machine learning algorithms in healthcare. Differential privacy (DP) is a common method for preserving privacy in such settings and, in this work, we examine two key trade-offs in applying DP to the NLP task of medical coding (ICD classification). Regarding the privacy-utility trade-off, we observe a significant performance drop in the privacy-preserving models, with more than a 40% reduction in micro F1 scores on the top 50 labels in the MIMIC-III dataset. From the perspective of the privacy-fairness trade-off, we also observe an increase of over 3% in the recall gap between male and female patients in the DP models. Further understanding these trade-offs will help address the challenges of real-world deployment.
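The privacy-utility trade-off being measured here comes from the standard DP-SGD mechanism: per-sample gradient clipping plus Gaussian noise. A minimal, library-free sketch of one such step is below; it assumes every parameter receives a gradient and is not the authors' training code.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, optimizer, clip_norm=1.0, noise_multiplier=1.0):
    """Minimal illustrative DP-SGD step (per-sample clipping + Gaussian noise)."""
    inputs, targets = batch
    summed_grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(inputs, targets):                      # per-sample gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = min(1.0, clip_norm / (norm.item() + 1e-6)) # clip to bound sensitivity
        for g, p in zip(summed_grads, model.parameters()):
            g += p.grad * scale
    for p, g in zip(model.parameters(), summed_grads):
        noise = torch.randn_like(g) * noise_multiplier * clip_norm
        p.grad = (g + noise) / len(inputs)                  # noisy averaged gradient
    optimizer.step()
```

Larger `noise_multiplier` tightens the privacy budget but is exactly what drives the F1 drop the abstract reports.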
[LG-59] Upcycling Noise for Federated Unlearning
链接: https://arxiv.org/abs/2412.05529
作者: Jianan Chen,Qin Hu,Fangtian Zhong,Yan Zhuang,Minghui Xu
关键词-EN: sharing raw data, multiple clients collaboratively, Federated Learning, clients collaboratively train, DPFL
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:In Federated Learning (FL), multiple clients collaboratively train a model without sharing raw data. This paradigm can be further enhanced by Differential Privacy (DP) to protect local data from information inference attacks and is thus termed DPFL. An emerging privacy requirement, the "right to be forgotten" for clients, poses new challenges to DPFL but remains largely unexplored. Despite numerous studies on federated unlearning (FU), they are inapplicable to DPFL because the noise introduced by the DP mechanism compromises their effectiveness and efficiency. In this paper, we propose Federated Unlearning with Indistinguishability (FUI) to unlearn the local data of a target client in DPFL for the first time. FUI consists of two main steps: local model retraction and global noise calibration, resulting in an unlearning model that is statistically indistinguishable from the retrained model. Specifically, we demonstrate that the noise added in DPFL can endow the unlearning model with a certain level of indistinguishability after local model retraction, and then fortify the degree of unlearning through global noise calibration. Additionally, for the efficient and consistent implementation of the proposed FUI, we formulate a two-stage Stackelberg game to derive optimal unlearning strategies for both the server and the target client. Privacy and convergence analyses confirm theoretical guarantees, while experimental results based on four real-world datasets illustrate that our proposed FUI achieves superior model performance and higher efficiency compared to mainstream FU schemes. Simulation results further verify the optimality of the derived unlearning strategies.
[LG-60] Flex Attention: A Programming Model for Generating Optimized Attention Kernels
链接: https://arxiv.org/abs/2412.05496
作者: Juechu Dong,Boyuan Feng,Driss Guessous,Yanbo Liang,Horace He
关键词-EN: attention variants, deep learning, important primitives, primitives in deep, attention
类目: Machine Learning (cs.LG); Performance (cs.PF); Programming Languages (cs.PL)
*备注:
点击查看摘要
Abstract:Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants – a “software lottery”. This problem is exacerbated by the difficulty of writing efficient fused attention kernels, resisting traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and that we achieve competitive performance compared to these handwritten kernels. Finally, we demonstrate how FlexAttention allows for easy composition of attention variants, solving the combinatorial explosion of attention variants.
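The abstract's claim that variants take "a few lines of idiomatic PyTorch" is easy to picture with an ALiBi-style score modification. The sketch below follows the flex_attention interface as publicly documented in recent PyTorch releases; treat the exact import path and signature as assumptions rather than guarantees.

```python
import torch
# Assumed import path for recent PyTorch releases that ship FlexAttention.
from torch.nn.attention.flex_attention import flex_attention

num_heads = 8
# One ALiBi slope per head (a common choice; the exact schedule is illustrative).
alibi_slopes = torch.exp2(-torch.arange(1, num_heads + 1).float())

def alibi_score_mod(score, b, h, q_idx, kv_idx):
    # Add a distance-proportional bias to each pre-softmax attention score.
    return score + alibi_slopes[h] * (kv_idx - q_idx)

q = k = v = torch.randn(2, num_heads, 128, 64)   # (batch, heads, seq, head_dim)
out = flex_attention(q, k, v, score_mod=alibi_score_mod)
```

Swapping in document masking or sliding-window attention would only change the score/mask callable, which is the composability the abstract emphasizes.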
[LG-61] AI-powered Digital Twin of the Ocean: Reliable Uncertainty Quantification for Real-time Wave Height Prediction with Deep Ensemble
链接: https://arxiv.org/abs/2412.05475
作者: Dongeon Lee,Sunwoong Yang,Jae-Won Oh,Su-Gil Cho,Sanghyuk Kim,Namwoo Kang
关键词-EN: eco-friendly power generation, real-time wave height, Environmental pollution, wave height prediction, reliable real-time wave
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 23 pages, 13 figures
点击查看摘要
Abstract:Environmental pollution and the depletion of fossil fuels have prompted the need for eco-friendly power generation methods based on renewable energy. However, renewable energy sources often face challenges in providing stable power due to low energy density and non-stationarity. Wave energy converters (WECs), in particular, need reliable real-time wave height prediction to address these issues caused by irregular wave patterns, which can lead to the inefficient and unstable operation of WECs. In this study, we propose an AI-powered reliable real-time wave height prediction model, aiming at both high predictive accuracy and reliable uncertainty quantification (UQ). The proposed architecture, LSTM-DE, integrates long short-term memory (LSTM) networks for temporal prediction with deep ensembles (DE) for robust UQ, achieving accuracy and reliability in wave height prediction. To further enhance the reliability of the predictive models, uncertainty calibration is applied, which has proven to significantly improve the quality of the quantified uncertainty. Based on real operational data obtained from an oscillating water column-wave energy converter (OWC-WEC) system in Jeju, South Korea, we demonstrate that the proposed LSTM-DE model architecture achieves notable predictive accuracy (R^2 > 0.9) while increasing the uncertainty quality by over 50% through a simple calibration technique. Furthermore, a comprehensive parametric study is conducted to explore the effects of key model hyperparameters, offering valuable guidelines for diverse operational scenarios characterized by differences in wavelength, amplitude, and period. The findings show that the proposed method provides robust and reliable real-time wave height predictions, facilitating a digital twin of the ocean.
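The deep-ensemble half of LSTM-DE boils down to a standard aggregation of per-member Gaussian predictions. A minimal sketch, assuming each member returns a (mean, variance) pair, is shown below; the member API is hypothetical.

```python
import torch

def deep_ensemble_predict(members, x):
    """Illustrative deep-ensemble aggregation: each member outputs a Gaussian
    (mean, variance) for wave height; the ensemble mixes them into one
    predictive mean and variance."""
    means, variances = zip(*[m(x) for m in members])
    means, variances = torch.stack(means), torch.stack(variances)
    mean = means.mean(dim=0)
    # Law of total variance: average member variance (aleatoric)
    # plus disagreement between members (epistemic).
    var = variances.mean(dim=0) + means.var(dim=0, unbiased=False)
    return mean, var
```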
[LG-62] Multi-Objective Alignment of Large Language Models Through Hypervolume Maximization
链接: https://arxiv.org/abs/2412.05469
作者: Subhojyoti Mukherjee,Anusha Lalitha,Sailik Sengupta,Aniket Deshmukh,Branislav Kveton
关键词-EN: large language models, language models, human preferences, preferences are complex, large language
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Multi-objective alignment from human feedback (MOAHF) in large language models (LLMs) is a challenging problem as human preferences are complex, multifaceted, and often conflicting. Recent works on MOAHF considered a-priori multi-objective optimization (MOO), where human preferences are known at training or inference time. In contrast, when human preferences are unknown or difficult to quantify, a natural approach is to cover the Pareto front by multiple diverse solutions. We propose an algorithm HaM for learning diverse LLM policies that maximizes their hypervolume. This is the first application of a-posteriori MOO to MOAHF. HaM is computationally and space efficient, and empirically superior across objectives such as harmlessness, helpfulness, humor, faithfulness, and hallucination, on various datasets.
[LG-63] Granular Ball K-Class Twin Support Vector Classifier
链接: https://arxiv.org/abs/2412.05438
作者: M. A. Ganaie,Vrushank Ahire,Anouck Girard
关键词-EN: Twin Support Vector, K-Class Twin Support, combines Twin Support, Support Vector Classifier, Support Vector Machines
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper introduces the Granular Ball K-Class Twin Support Vector Classifier (GB-TWKSVC), a novel multi-class classification framework that combines Twin Support Vector Machines (TWSVM) with granular ball computing. The proposed method addresses key challenges in multi-class classification by utilizing granular ball representation for improved noise robustness and TWSVM’s non-parallel hyperplane architecture solves two smaller quadratic programming problems, enhancing efficiency. Our approach introduces a novel formulation that effectively handles multi-class scenarios, advancing traditional binary classification methods. Experimental evaluation on diverse benchmark datasets shows that GB-TWKSVC significantly outperforms current state-of-the-art classifiers in both accuracy and computational performance. The method’s effectiveness is validated through comprehensive statistical tests and complexity analysis. Our work advances classification algorithms by providing a mathematically sound framework that addresses the scalability and robustness needs of modern machine learning applications. The results demonstrate GB-TWKSVC’s broad applicability across domains including pattern recognition, fault diagnosis, and large-scale data analytics, establishing it as a valuable addition to the classification algorithm landscape.
[LG-64] DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA NEURIPS
链接: https://arxiv.org/abs/2412.05430
作者: Aman Patel,Arpita Singhal,Austin Wang,Anusri Pampari,Maya Kasowski,Anshul Kundaje
关键词-EN: genomic DNA language, Recent advances, large genomic DNA, DNA language models, natural language
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: NeurIPS Datasets and Benchmarks 2024
点击查看摘要
Abstract:Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various genomic prediction, interpretation and design tasks. Despite their potential, existing benchmarks do not adequately assess the capabilities of DNALMs on key downstream applications involving an important class of non-coding DNA elements critical for regulating gene activity. In this study, we introduce DART-Eval, a suite of representative benchmarks specifically focused on regulatory DNA to evaluate model performance across zero-shot, probed, and fine-tuned scenarios against contemporary ab initio models as baselines. Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants. We find that current DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, while requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our code is available at this https URL.
[LG-65] No Free Lunch From Random Feature Ensembles
链接: https://arxiv.org/abs/2412.05418
作者: Benjamin S. Ruben,William L. Tong,Hamza Tahir Chaudhry,Cengiz Pehlevan
关键词-EN: combine the predictions, regression models, train a single, total model size, smaller networks
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Given a budget on total model size, one must decide whether to train a single, large neural network or to combine the predictions of many smaller networks. We study this trade-off for ensembles of random-feature ridge regression models. We prove that when a fixed number of trainable parameters are partitioned among K independently trained models, K=1 achieves optimal performance, provided the ridge parameter is optimally tuned. We then derive scaling laws which describe how the test risk of an ensemble of regression models decays with its total size. We identify conditions on the kernel and task eigenstructure under which ensembles can achieve near-optimal scaling laws. Training ensembles of deep convolutional neural networks on CIFAR-10 and a transformer architecture on C4, we find that a single large network outperforms any ensemble of networks with the same total number of parameters, provided the weight decay and feature-learning strength are tuned to their optimal values.
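The object of study can be written compactly. Below is a sketch, in our notation, of an ensemble of K random-feature ridge regressors sharing a fixed total feature budget N; the paper's result says that with the ridge parameter tuned optimally, K = 1 is never worse.

```latex
% Ensemble of K random-feature ridge regressors, each with N/K random features \phi_k:
\[
  \hat f_{\mathrm{ens}}(x) \;=\; \frac{1}{K}\sum_{k=1}^{K} \phi_k(x)^{\top} \hat w_k,
  \qquad
  \hat w_k \;=\; \arg\min_{w}\; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \phi_k(x_i)^{\top} w\bigr)^2
  \;+\; \lambda \,\lVert w \rVert_2^2 ,
\]
% with the total parameter budget N held fixed as K varies.
```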
[LG-66] Tabular data generation with tensor contraction layers and transformers
链接: https://arxiv.org/abs/2412.05390
作者: Aníbal Silva,André Restivo,Moisés Santos,Carlos Soares
关键词-EN: Deep Learning domain, recently gained significant, gained significant attention, Generative modeling, Deep Learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 9 figures
点击查看摘要
Abstract:Generative modeling for tabular data has recently gained significant attention in the Deep Learning domain. Its objective is to estimate the underlying distribution of the data. However, estimating the underlying distribution of tabular data has its unique challenges. Specifically, this data modality is composed of mixed types of features, making it a non-trivial task for a model to learn intra-relationships between them. One approach to handling this mixture is to embed each feature into a continuous matrix via tokenization, while intra-relationships between variables can be captured via the transformer architecture. In this work, we empirically investigate the potential of using embedding representations on tabular data generation, utilizing tensor contraction layers and transformers to model the underlying distribution of tabular data within Variational Autoencoders. Specifically, we compare four architectural approaches: a baseline VAE model, two variants that focus on tensor contraction layers and transformers respectively, and a hybrid model that integrates both techniques. Our empirical study, conducted across multiple datasets from the OpenML CC18 suite, compares models over density estimation and Machine Learning efficiency metrics. The main takeaway from our results is that leveraging embedding representations with the help of tensor contraction layers improves density estimation metrics, while maintaining competitive performance in terms of machine learning efficiency.
[LG-67] BadGPT-4o: stripping safety finetuning from GPT models
链接: https://arxiv.org/abs/2412.05346
作者: Ekaterina Krupkina,Dmitrii Volkov
关键词-EN: show a version, Abstract, show, version, simple fine-tuning poisoning
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We show a version of Qi et al. 2023’s simple fine-tuning poisoning technique strips GPT-4o’s safety guardrails without degrading the model. The BadGPT attack matches best white-box jailbreaks on HarmBench and StrongREJECT. It suffers no token overhead or performance hits common to jailbreaks, as evaluated on tinyMMLU and open-ended generations. Despite having been known for a year, this attack remains easy to execute.
[LG-68] Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models
链接: https://arxiv.org/abs/2412.05334
作者: Zhejun Zhang,Peter Karkus,Maximilian Igl,Wenhao Ding,Yuxiao Chen,Boris Ivanovic,Marco Pavone
关键词-EN: Traffic simulation aims, Traffic simulation, faithfully recovers, real world, aims to learn
类目: Machine Learning (cs.LG)
*备注: Project Page: this https URL
点击查看摘要
Abstract:Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed-loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi-agent policies have recently become the state-of-the-art in traffic simulation. However, they are typically trained through open-loop behavior cloning, and thus suffer from covariate shift when executed in closed-loop during simulation. In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. CAT-K fine-tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT-K fine-tuning enables a small 7M-parameter tokenized traffic simulation policy to outperform a 102M-parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. The code is available at this https URL.
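A single Closest-Among-Top-K rollout step, as we read the abstract, can be sketched as follows; `decode_next_state` and the Euclidean distance to the logged ground truth are our stand-ins for details the abstract does not spell out.

```python
import torch

def cat_k_rollout_step(policy, state, ground_truth_next, k=5):
    """Schematic CAT-K step: roll the tokenized policy forward in closed loop,
    but among the top-K candidate motion tokens pick the one whose resulting
    state is closest to the logged ground truth."""
    logits = policy(state)                               # scores over motion tokens
    topk_tokens = torch.topk(logits, k).indices
    candidates = [policy.decode_next_state(state, t) for t in topk_tokens]
    dists = torch.stack([torch.norm(c - ground_truth_next) for c in candidates])
    best = dists.argmin()
    chosen = topk_tokens[best]                           # closest-to-GT among top-K
    # The chosen token both advances the rollout (closed loop) and serves as the
    # supervised fine-tuning target, keeping visited states on-distribution.
    return chosen, candidates[best]
```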
[LG-69] Self-Supervised Learning for Graph-Structured Data in Healthcare Applications: A Comprehensive Review
链接: https://arxiv.org/abs/2412.05312
作者: Safa Ben Atitallah,Chaima Ben Rabah,Maha Driss,Wadii Boulila,Anis Koubaa
关键词-EN: data offers numerous, offers numerous opportunities, data, offers numerous, interconnected healthcare data
类目: Machine Learning (cs.LG)
*备注: 46 pages, 8 figures
点击查看摘要
Abstract:The abundance of complex and interconnected healthcare data offers numerous opportunities to improve prediction, diagnosis, and treatment. Graph-structured data, which includes entities and their relationships, is well-suited for capturing complex connections. Effectively utilizing this data often requires strong and efficient learning algorithms, especially when dealing with limited labeled data. It is increasingly important for downstream tasks in various domains to utilize self-supervised learning (SSL) as a paradigm for learning and optimizing effective representations from unlabeled data. In this paper, we thoroughly review SSL approaches specifically designed for graph-structured data in healthcare applications. We explore the challenges and opportunities associated with healthcare data and assess the effectiveness of SSL techniques in real-world healthcare applications. Our discussion encompasses various healthcare settings, such as disease prediction, medical image analysis, and drug discovery. We critically evaluate the performance of different SSL methods across these tasks, highlighting their strengths, limitations, and potential future research directions. Ultimately, this review aims to be a valuable resource for both researchers and practitioners looking to utilize SSL for graph-structured data in healthcare, paving the way for improved outcomes and insights in this critical field. To the best of our knowledge, this work represents the first comprehensive review of the literature on SSL applied to graph data in healthcare.
[LG-70] A Robust Clustering Framework Combining Minimum Description Length and Genetic Optimization
链接: https://arxiv.org/abs/2412.05305
作者: H. Jahani,F. Zamio
关键词-EN: Minimum Description Length, enabling the organization, meaningful groups, Clustering, Description Length
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Clustering algorithms are fundamental in data analysis, enabling the organization of data into meaningful groups. However, individual clustering methods often face limitations and biases, making it challenging to develop a universal solution for diverse datasets. To address this, we propose a novel clustering framework that combines the Minimum Description Length (MDL) principle with a genetic optimization algorithm. This approach begins with an ensemble clustering solution as a baseline, which is refined using MDL-based evaluation functions and optimized with a genetic algorithm. By leveraging the MDL principle, the method adapts to the intrinsic properties of datasets, minimizing dependence on input clusters and ensuring a data-driven process. The proposed method was evaluated on thirteen benchmark datasets using four validation metrics: accuracy, normalized mutual information (NMI), Fisher score, and adjusted Rand index (ARI). Results show that the method consistently outperforms traditional clustering algorithms, achieving higher accuracy, greater stability, and reduced biases. Its adaptability makes it a reliable tool for clustering complex and varied datasets. This study demonstrates the potential of combining MDL and genetic optimization to create a robust and versatile clustering framework, advancing the field of data analysis and offering a scalable solution for diverse applications.
[LG-71] A High Energy-Efficiency Multi-core Neuromorphic Architecture for Deep SNN Training
链接: https://arxiv.org/abs/2412.05302
作者: Mingjing Li,Huihui Zhou,Xiaofeng Xu,Zhiwei Zhong,Puli Quan,Xueke Zhu,Yanyu Lin,Wenjie Lin,Hongyu Guo,Junchao Zhang,Yunhao Ma,Wei Wang,Zhengyu Ma,Guoqi Li,Xiaoxin Cui,Yonghong Tian
关键词-EN: dynamically changing environment, changing environment, growing necessity, adapt to dynamically, dynamically changing
类目: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:There is a growing necessity for edge training to adapt to dynamically changing environments. Neuromorphic computing represents a significant pathway for high-efficiency intelligent computation in energy-constrained edges, but existing neuromorphic architectures lack the ability to directly train spiking neural networks (SNNs) based on backpropagation. We develop a multi-core neuromorphic architecture with Feedforward-Propagation, Back-Propagation, and Weight-Gradient engines in each core, supporting highly efficient parallel computing at both the engine and core levels. It combines various data flows and sparse computation optimization by fully leveraging the sparsity in SNN training, obtaining a high energy efficiency of 1.05 TFLOPS/W @ FP16 @ 28nm, a 55-85% reduction in DRAM access compared to an A100 GPU in SNN training, and demonstrating 20-core deep SNN training and 5-worker federated learning on FPGAs. Our study develops the first multi-core neuromorphic architecture supporting direct SNN training, facilitating neuromorphic computing in edge-learnable applications.
[LG-72] Fed-LDR: Federated Local Data-infused Graph Creation with Node-centric Model Refinement
链接: https://arxiv.org/abs/2411.04936
作者: Jiechao Gao,Yuangang Li,Syeda Faiza Ahmed
关键词-EN: enhancing urban infrastructure, Node-centric Model Refinement, Local Data-Infused Graph, Data-Infused Graph Creation, infrastructure and services
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:The rapid acceleration of global urbanization has introduced novel challenges in enhancing urban infrastructure and services. Spatio-temporal data, integrating spatial and temporal dimensions, has emerged as a critical tool for understanding urban phenomena and promoting sustainability. In this context, Federated Learning (FL) has gained prominence as a distributed learning paradigm aligned with the privacy requirements of urban IoT environments. However, integrating traditional and deep learning models into the FL framework poses significant challenges, particularly in capturing complex spatio-temporal dependencies and adapting to diverse urban conditions. To address these challenges, we propose the Federated Local Data-Infused Graph Creation with Node-centric Model Refinement (Fed-LDR) algorithm. Fed-LDR leverages FL and Graph Convolutional Networks (GCN) to enhance spatio-temporal data analysis in urban environments. The algorithm comprises two key modules: (1) the Local Data-Infused Graph Creation (LDIGC) module, which dynamically reconfigures adjacency matrices to reflect evolving spatial relationships within urban environments, and (2) the Node-centric Model Refinement (NoMoR) module, which customizes model parameters for individual urban nodes to accommodate heterogeneity. Evaluations on the PeMSD4 and PeMSD8 datasets demonstrate Fed-LDR’s superior performance over six baseline methods. Fed-LDR achieved the lowest Mean Absolute Error (MAE) values of 20.15 and 17.30, and the lowest Root Mean Square Error (RMSE) values of 32.30 and 27.15, respectively, while maintaining a high correlation coefficient of 0.96 across both datasets. Notably, on the PeMSD4 dataset, Fed-LDR reduced MAE and RMSE by up to 81% and 78%, respectively, compared to the best-performing baseline FedMedian.
[LG-73] PolytopeWalk: Sparse MCMC Sampling over Polytopes
链接: https://arxiv.org/abs/2412.06629
作者: Benny Sun,Yuansi Chen
关键词-EN:
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 17 pages
[LG-74] An inferential measure of dependence between two systems using Bayesian model comparison
链接: https://arxiv.org/abs/2412.06478
作者: Guillaume Marrelec,Alain Giron
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: To be published in IEEE Transactions on Systems, Man, and Cybernetics: Systems
[LG-75] An Adaptively Inexact Method for Bilevel Learning Using Primal-Dual Style Differentiation
链接: https://arxiv.org/abs/2412.06436
作者: Lea Bogensperger,Matthias J. Ehrhardt,Thomas Pock,Mohammad Sadegh Salehi,Hok Shing Wong
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
[LG-76] In Silico Pharmacokinetic and Molecular Docking Studies of Natural Plants against Essential Protein KRAS for Treatment of Pancreatic Cancer
链接: https://arxiv.org/abs/2412.06237
作者: Marsha Mariya Kappan,Joby George
关键词-EN:
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
[LG-77] Representational Transfer Learning for Matrix Completion
链接: https://arxiv.org/abs/2412.06233
作者: Yong He,Zeyu Li,Dong Liu,Kangxiang Qin,Jiahui Xie
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-78] Is the neural tangent kernel of PINNs deep learning general partial differential equations always convergent?
链接: https://arxiv.org/abs/2412.06158
作者: Zijian Zhou,Zhenya Yan
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Mathematical Physics (math-ph); Pattern Formation and Solitons (nlin.PS); Computational Physics (physics.comp-ph)
*备注: 18 pages, 5 figures
[LG-79] UCB algorithms for multi-armed bandits: Precise regret and adaptive inference
链接: https://arxiv.org/abs/2412.06126
作者: Qiyang Han,Koulik Khamaru,Cun-Hui Zhang
关键词-EN:
类目: atistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-80] Implicit Delta Learning of High Fidelity Neural Network Potentials
链接: https://arxiv.org/abs/2412.06064
作者: Stephan Thaler,Cristian Gabellini,Nikhil Shenoy,Prudencio Tossou
关键词-EN:
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
[LG-81] Reinforcement Learning for a Discrete-Time Linear-Quadratic Control Problem with an Application
链接: https://arxiv.org/abs/2412.05906
作者: Lucky Li
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
[LG-82] FedRBE – a decentralized privacy-preserving federated batch effect correction tool for omics data based on limma
链接: https://arxiv.org/abs/2412.05894
作者: Yuliya Burankova,Julian Klemm,Jens J. G. Lohmann,Ahmad Taheri,Niklas Probul,Jan Baumbach,Olga Zolotareva
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: The first two authors listed are joint first authors. The last two authors listed are joint last authors. 21 pages, 5 figures, 5 tables
[LG-83] On Diffusion Posterior Sampling via Sequential Monte Carlo for Zero-Shot Scaffolding of Protein Motifs
链接: https://arxiv.org/abs/2412.05788
作者: James Matthew Young,O. Deniz Akyildiz
关键词-EN:
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-84] Leveraging Black-box Models to Assess Feature Importance in Unconditional Distribution
链接: https://arxiv.org/abs/2412.05759
作者: Jing Zhou,Chunlin Li
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:
[LG-85] Proximal Iteration for Nonlinear Adaptive Lasso
链接: https://arxiv.org/abs/2412.05726
作者: Nathan Wycoff,Lisa O. Singh,Ali Arab,Katharine M. Donato
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Some of these results were previously presented in the Technical Report at arXiv:2211.05089
[LG-86] Local Linear Convergence of Infeasible Optimization with Orthogonal Constraints
链接: https://arxiv.org/abs/2412.05689
作者: Youbang Sun,Shixiang Chen,Alfredo Garcia,Shahin Shahrampour
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
[LG-87] DM-SBL: Channel Estimation under Structured Interference
链接: https://arxiv.org/abs/2412.05582
作者: Yifan Wang,Chengjie Yu,Jiang Zhu,Fangyong Wang,Xingbin Tu,Yan Wei,Fengzhong Qu
关键词-EN:
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
[LG-88] A Variational Computational-based Framework for Unsteady Incompressible Flows
链接: https://arxiv.org/abs/2412.05525
作者: H. Sababha,A. Elmaradny,H. Taha,M. Daqaq
关键词-EN:
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
[LG-89] Ranking of Large Language Model with Nonparametric Prompts
链接: https://arxiv.org/abs/2412.05506
作者: Zebin Wang,Yi Han,Ethan X. Fang,Lan Wang,Junwei Lu
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
[LG-90] Knowledge-Based Deep Learning for Time-Efficient Inverse Dynamics
链接: https://arxiv.org/abs/2412.05403
作者: Shuhao Ma,Yu Cao,Ian D. Robertson,Chaoyang Shi,Jindong Liu,Zhi-Qiang Zhang
关键词-EN:
类目: ignal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注: 10 pages, 8 figures, Journal paper
[LG-91] Learning Symmetry-Independent Jet Representations via Jet-Based Joint Embedding Predictive Architecture NEURIPS2024
链接: https://arxiv.org/abs/2412.05333
作者: Subash Katel,Haoyang Li,Zihan Zhao,Raghav Kansal,Farouk Mokhtar,Javier Duarte
关键词-EN:
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 5 pages, 2 figures. Accepted to Machine Learning for Physical Sciences NeurIPS 2024 workshop
[LG-92] Patient-specific prediction of glioblastoma growth via reduced order modeling and neural networks
链接: https://arxiv.org/abs/2412.05330
作者: D. Cerrone,D. Riccobelli,P. Vitullo,F. Ballarin,J. Falco,F. Acerbi,A. Manzoni,P. Zunino,P. Ciarletta
关键词-EN:
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Numerical Analysis (math.NA); Biological Physics (physics.bio-ph); Tissues and Organs (q-bio.TO)
*备注:
信息检索
[IR-0] DEEPER: Dense Electroencephalography Passage Retrieval
链接: https://arxiv.org/abs/2412.06695
作者: Niall McGuire,Yashar Moshfeghi
关键词-EN: explicit query formulation, Information retrieval systems, query formulation, systems have historically, historically relied
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Information retrieval systems have historically relied on explicit query formulation, requiring users to translate their information needs into text. This process is particularly disruptive during reading tasks, where users must interrupt their natural flow to formulate queries. We present DEEPER (Dense Electroencephalography Passage Retrieval), a novel framework that enables direct retrieval of relevant passages from users’ neural signals during naturalistic reading without intermediate text translation. Building on dense retrieval architectures, DEEPER employs a dual-encoder approach with specialised components for processing neural data, mapping EEG signals and text passages into a shared semantic space. Through careful architecture design and cross-modal negative sampling strategies, our model learns to align neural patterns with their corresponding textual content. Experimental results on the ZuCo dataset demonstrate that direct brain-to-passage retrieval significantly outperforms current EEG-to-text baselines, achieving a 571% improvement in Precision@1. Our ablation studies reveal that the model successfully learns aligned representations between EEG and text modalities (0.29 cosine similarity), while our hard negative sampling strategy contributes to overall performance increases.
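The dual-encoder alignment the abstract describes is, at its core, a cross-modal contrastive objective. A minimal sketch with in-batch negatives is given below; the paper's encoder architectures and its cross-modal hard-negative mining are omitted, and this is not the authors' code.

```python
import torch
import torch.nn.functional as F

def info_nce(eeg_emb, text_emb, temperature=0.05):
    """Illustrative dual-encoder objective: EEG-segment embeddings and passage
    embeddings of matching pairs are pulled together, with the other passages
    in the batch acting as in-batch negatives."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = eeg_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(eeg_emb.size(0))               # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)
```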
[IR-1] Learning Cluster Representatives for Approximate Nearest Neighbor Search
链接: https://arxiv.org/abs/2412.05921
作者: Thomas Vecchiato
关键词-EN: Developing increasingly efficient, nearest neighbor search, approximate nearest neighbor, Developing increasingly, modern information retrieval
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Developing increasingly efficient and accurate algorithms for approximate nearest neighbor search is a paramount goal in modern information retrieval. A primary approach to addressing this question is clustering, which involves partitioning the dataset into distinct groups, with each group characterized by a representative data point. By this method, retrieving the top-k data points for a query requires identifying the most relevant clusters based on their representatives – a routing step – and then conducting a nearest neighbor search within these clusters only, drastically reducing the search space. The objective of this thesis is not only to provide a comprehensive explanation of clustering-based approximate nearest neighbor search but also to introduce and delve into every aspect of our novel state-of-the-art method, which originated from a natural observation: The routing function solves a ranking problem, making the function amenable to learning-to-rank. The development of this intuition and applying it to maximum inner product search has led us to demonstrate that learning cluster representatives using a simple linear function significantly boosts the accuracy of clustering-based approximate nearest neighbor search.
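The routing-as-ranking idea is easiest to see next to a plain clustering-based search loop. The sketch below scores clusters with a matrix of learned representatives (one row per cluster, applied as a simple linear function of the query) and then searches only the probed clusters; all names are illustrative.

```python
import numpy as np

def route_and_search(query, representatives, cluster_members, nprobe=8, topk=10):
    """Schematic clustering-based ANN search for maximum inner product.
    `representatives` has one (learned) vector per cluster; routing ranks
    clusters by their inner product with the query, then only the top
    `nprobe` clusters are searched exhaustively."""
    routing_scores = representatives @ query              # (num_clusters,)
    probe = np.argsort(-routing_scores)[:nprobe]
    # Exact inner-product search restricted to the probed clusters.
    candidates = np.vstack([cluster_members[c] for c in probe])
    scores = candidates @ query
    return candidates[np.argsort(-scores)[:topk]]
```

Replacing centroid representatives with representatives learned under a ranking objective changes only the `representatives` matrix, which is why the routing step is amenable to learning-to-rank.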
[IR-2] ULMRec: User-centric Large Language Model for Sequential Recommendation
链接: https://arxiv.org/abs/2412.05543
作者: Minglai Shao,Hua Huang,Qiyao Peng,Hongtao Liu
关键词-EN: demonstrated promising performance, Recent advances, Large Language Models, advances in Large, language understanding capabilities
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Recent advances in Large Language Models (LLMs) have demonstrated promising performance in sequential recommendation tasks, leveraging their superior language understanding capabilities. However, existing LLM-based recommendation approaches predominantly focus on modeling item-level co-occurrence patterns while failing to adequately capture user-level personalized preferences. This is problematic since even users who display similar behavioral patterns (e.g., clicking or purchasing similar items) may have fundamentally different underlying interests. To alleviate this problem, in this paper, we propose ULMRec, a framework that effectively integrates user personalized preferences into LLMs for sequential recommendation. Considering the semantic gap between item IDs and LLMs, we replace item IDs with their corresponding titles in user historical behaviors, enabling the model to capture the item semantics. For integrating the user personalized preference, we design two key components: (1) user indexing: a personalized user indexing mechanism that leverages vector quantization on user reviews and user IDs to generate meaningful and unique user representations, and (2) alignment tuning: an alignment-based tuning stage that employs comprehensive preference alignment tasks to enhance the model’s capability in capturing personalized information. Through this design, ULMRec achieves deep integration of language semantics with user personalized preferences, facilitating effective adaptation to recommendation tasks. Extensive experiments on two public datasets demonstrate that ULMRec significantly outperforms existing methods, validating the effectiveness of our approach.
[IR-3] Visualization of Knowledge Graphs with Embeddings: an Essay on Recent Trends and Methods
链接: https://arxiv.org/abs/2412.05289
作者: Davide Riva,Cristina Rossetti
关键词-EN: Knowledge Graphs, Knowledge Graph Embedding, Graph Embedding techniques, visualizing Knowledge Graphs, Knowledge Graphs include
类目: Information Retrieval (cs.IR); Graphics (cs.GR)
*备注:
点击查看摘要
Abstract:In this essay we discuss the recent trends in visual analysis and exploration of Knowledge Graphs, particularly in conjunction with Knowledge Graph Embedding techniques. We present an overview of the current state of visualization techniques and frameworks for KGs, in relation to four identified challenges. The challenges in visualizing Knowledge Graphs include the need for intuitive and modular interfaces, performance in handling big data, and difficulties for users in understanding and using query languages. We find frameworks that generally satisfy the intuitive UI, performance, and query support requirements, but few that satisfy the modularity requirement. In the context of Knowledge Graph Embeddings, we distinguish the approaches that use embeddings to facilitate exploration of Knowledge Graphs from those that aim to explain the embeddings themselves. We find significant differences between the two perspectives. Finally, we highlight some possible directions for future work, including diffusion of the unmet requirements, implementation of new visual features, and experimentation with relation visualization as a peculiar element of Knowledge Graphs.
附件下载
点击下载今日全部论文列表