This blog post presents the latest paper list retrieved from arXiv.org on 2024-11-26. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.
Table of Contents
Overview (2024-11-26)
A total of 799 new papers were posted today, including:
- Natural Language Processing: 100 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 214 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 276 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 250 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
【Quick Read】: This paper asks whether large language models (LLMs) answer multi-hop queries through genuine latent reasoning or by exploiting shortcuts present in their training data. The key to the solution is SOCRATES (ShOrtCut-fRee lATent rEaSoning), an evaluation dataset built by excluding cases where the head entity and the answer entity co-occur in the training data and by systematically removing cases where the model could guess the answer or exploit partial matches, ensuring that latent multi-hop reasoning is measured without shortcuts. The study finds that LLMs show promising latent multi-hop reasoning for certain query types but perform poorly on others, especially queries that require latently recalling a year as the intermediate answer.
链接: https://arxiv.org/abs/2411.16679
作者: Sohee Yang,Nora Kassner,Elena Gribovskaya,Sebastian Riedel,Mor Geva
关键词-EN: Large Language Models, Large Language, Summer Olympics, Scarlett Johansson, year Scarlett Johansson
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like “In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of”. One major challenge in evaluating this ability is that LLMs may have developed shortcuts by encounters of the head entity “Scarlett Johansson” and the answer entity “United States” in the same training sequences or merely guess the answer based on frequency-based priors. To prevent shortcuts, we exclude test queries where the head and answer entities co-appear in pretraining corpora. Through careful selection of relations and facts and systematic removal of cases where models might guess answers or exploit partial matches, we construct an evaluation dataset SOCRATES (ShOrtCut-fRee lATent rEaSoning). We observe that LLMs demonstrate promising latent multi-hop reasoning abilities without exploiting shortcuts, but only for certain types of queries. For queries requiring latent recall of countries as the intermediate answer, the best models achieve 80% latent composability, but this drops to just 5% for the recall of years. Comparisons with Chain-of-Thought composability highlight a significant gap between the ability of models to reason latently versus explicitly. Analysis reveals that latent representations of the intermediate answer are constructed more often in queries with higher latent composability, and shows the emergence of latent multi-hop reasoning during pretraining.
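The core of the SOCRATES construction is a co-occurrence filter: any two-hop test query whose head entity and answer entity appear together in some pretraining document is discarded. A minimal sketch of that filtering step, assuming a simple in-memory corpus and illustrative query fields (not the paper's actual data format), is below.

```python
from dataclasses import dataclass

@dataclass
class TwoHopQuery:
    head_entity: str      # e.g. "Scarlett Johansson"
    bridge_entity: str    # latent intermediate answer, e.g. "1984"
    answer_entity: str    # e.g. "United States"
    question: str

def co_occur(entity_a: str, entity_b: str, docs: list[str]) -> bool:
    """True if both entities appear in the same document."""
    return any(entity_a in doc and entity_b in doc for doc in docs)

def shortcut_free(queries: list[TwoHopQuery], docs: list[str]) -> list[TwoHopQuery]:
    """Keep only queries whose head and answer entities never co-occur in pretraining
    text, so answering them requires composing the two hops rather than a shortcut."""
    return [q for q in queries if not co_occur(q.head_entity, q.answer_entity, docs)]
```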
zh
[NLP-1] DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
【Quick Read】: This paper targets the key challenges of storytelling video generation (SVG): rendering complex, fine-grained motions, keeping multiple objects consistent across scenes, and transitioning seamlessly between multiple motions within a single scene. The proposed solution, DreamRunner, tackles these in three steps. First, a large language model (LLM) structures the input script to enable coarse-grained scene planning together with fine-grained object-level layout and motion planning. Second, retrieval-augmented test-time adaptation captures target motion priors for the objects in each scene, supporting diverse motion customization based on retrieved videos and thus enabling new videos with complex, scripted motions. Third, a novel spatial-temporal region-based 3D attention and prior injection module (SR3AI) provides fine-grained object-motion binding and frame-by-frame semantic control. With these components, DreamRunner achieves state-of-the-art character consistency, text alignment, and transition smoothness, and significantly outperforms baselines on compositional text-to-video generation.
链接: https://arxiv.org/abs/2411.16657
作者: Zun Wang,Jialu Li,Han Lin,Jaehong Yoon,Mohit Bansal
关键词-EN: Storytelling video generation, Storytelling video, create long, recently emerged, task to create
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project website: this https URL
点击查看摘要
Abstract:Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner’s robust ability to generate multi-object interactions with qualitative examples.
zh
[NLP-2] Self-Generated Critiques Boost Reward Modeling for Language Models
【Quick Read】: This paper addresses the limitation that current reward models used in reinforcement learning from human feedback (RLHF) mainly produce scalar scores and struggle to incorporate critiques in a natural way. The key to the solution is Critic-RM, a framework that improves reward models with self-generated critiques and no extra supervision. Critic-RM uses a two-stage pipeline: it first generates and filters high-quality critiques, then jointly fine-tunes on reward prediction and critique generation. Experiments show that Critic-RM improves reward-modeling accuracy by 3.7%-7.3% over standard reward models and LLM judges, and that the generated critiques also help rectify flawed reasoning steps, yielding 2.5%-3.2% gains in reasoning accuracy.
链接: https://arxiv.org/abs/2411.16646
作者: Yue Yu,Zhengxing Chen,Aston Zhang,Liang Tan,Chenguang Zhu,Richard Yuanzhe Pang,Yundi Qian,Xuewei Wang,Suchin Gururangan,Chao Zhang,Melanie Kambadur,Dhruv Mahajan,Rui Hou
关键词-EN: aligning large language, human preferences, human feedback, large language models, Reward
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages
点击查看摘要
Abstract:Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of generated critiques in rectifying flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.
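Critic-RM's second stage jointly fine-tunes one model on reward prediction and critique generation. A hedged sketch of such a joint objective is below; the two-head layout, the MSE reward loss, and the `lambda_critique` weighting are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(critique_logits: torch.Tensor, critique_labels: torch.Tensor,
               reward_pred: torch.Tensor, reward_label: torch.Tensor,
               lambda_critique: float = 1.0) -> torch.Tensor:
    """Joint objective: language-modeling loss on self-generated critique tokens
    plus a regression loss on the scalar reward (a sketch, not the exact paper loss)."""
    # Critique generation: token-level cross-entropy; ignore_index masks prompt tokens.
    lm_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_labels.view(-1),
        ignore_index=-100,
    )
    # Reward prediction: mean-squared error against the filtered preference score.
    reward_loss = F.mse_loss(reward_pred.squeeze(-1), reward_label)
    return reward_loss + lambda_critique * lm_loss
```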
zh
[NLP-3] Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective
【Quick Read】: This paper addresses the problem of "jailbreak prompts" in generative AI: prompts crafted to bypass the ethical safeguards of large language models, potentially enabling misuse by cybercriminals. The key to the solution is to analyze jailbreak prompts from a cyber-defense perspective and to propose strategies such as advanced prompt analysis, dynamic safety protocols, and continuous model fine-tuning to strengthen AI resilience. The paper further stresses collaboration among AI researchers, cybersecurity experts, and policymakers to establish standards for protecting AI systems, and illustrates these cyber-defense approaches through case studies, promoting responsible AI practices that preserve system integrity and public trust.
链接: https://arxiv.org/abs/2411.16642
作者: Jean Marie Tshimula,Xavier Ndona,D’Jeff K. Nkashama,Pierre-Martin Tardif,Froduald Kabanza,Marc Frappier,Shengrui Wang
关键词-EN: potentially enabling misuse, bypass ethical safeguards, Jailbreak prompts pose, large language models, potentially enabling
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Jailbreak prompts pose a significant threat in AI and cybersecurity, as they are crafted to bypass ethical safeguards in large language models, potentially enabling misuse by cybercriminals. This paper analyzes jailbreak prompts from a cyber defense perspective, exploring techniques like prompt injection and context manipulation that allow harmful content generation, content filter evasion, and sensitive information extraction. We assess the impact of successful jailbreaks, from misinformation and automated social engineering to hazardous content creation, including bioweapons and explosives. To address these threats, we propose strategies involving advanced prompt analysis, dynamic safety protocols, and continuous model fine-tuning to strengthen AI resilience. Additionally, we highlight the need for collaboration among AI researchers, cybersecurity experts, and policymakers to set standards for protecting AI systems. Through case studies, we illustrate these cyber defense approaches, promoting responsible AI practices to maintain system integrity and public trust. Warning: This paper contains content which the reader may find offensive.
zh
[NLP-4] Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
【Quick Read】: This paper examines whether automatic factual-consistency (factuality) metrics actually measure what they claim to. Although modern large language models (LLMs) produce high-quality summaries, they can still introduce information inconsistent with the source, i.e., "hallucinations", and existing automatic metrics such as ROUGE have saturated for summary quality and struggle to detect these subtle errors. The key contribution is a stress test of existing factuality metrics: a supervised model using only shallow surface features turns out to be competitive with state-of-the-art metrics at predicting factual consistency. The study also finds that many factuality metrics respond only weakly to factual corrections while being more sensitive to benign, non-factual edits. Based on these findings, the paper shows that most metrics can be artificially inflated by appending innocuous sentences, raising questions about the reliability and accuracy of existing automatic factuality metrics and about what they actually measure.
链接: https://arxiv.org/abs/2411.16638
作者: Sanjana Ramprasad,Byron C. Wallace
关键词-EN: produce highly readable, highly readable abstractive, Modern LLMs, readable abstractive summaries, evaluating summary quality
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Modern LLMs can now produce highly readable abstractive summaries, to the point where traditional automated metrics for evaluating summary quality, such as ROUGE, have become saturated. However, LLMs still sometimes introduce unwanted content into summaries, i.e., information inconsistent with or unsupported by their source. Measuring the occurrence of these often subtle "hallucinations" automatically has proved to be challenging. This in turn has motivated development of a variety of metrics intended to measure the factual consistency of generated summaries against their source. But are these approaches measuring what they purport to do? In this work, we stress-test automatic factuality metrics. Specifically, we investigate whether and to what degree superficial attributes of summary texts suffice to predict "factuality", finding that a (supervised) model using only such shallow features is reasonably competitive with SOTA factuality scoring methods. We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements. In contrast, some metrics are more sensitive to benign, non-factual edits. Motivated by these insights, we show that one can "game" (most) automatic factuality metrics, i.e., reliably inflate "factuality" scores by appending innocuous sentences to generated summaries. Taken together, our results raise questions about the degree to which we should rely on existing automated factuality metrics and what exactly we want "factuality metrics" to measure.
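The paper's first finding is that a supervised model over shallow, surface-level summary attributes is competitive with dedicated factuality metrics. A minimal sketch of such a probe is below; the three features shown (summary length, novel-word ratio, lexical overlap) are illustrative guesses at "shallow features", not the authors' exact feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def shallow_features(source: str, summary: str) -> list[float]:
    src_tokens, sum_tokens = source.lower().split(), summary.lower().split()
    src_set, sum_set = set(src_tokens), set(sum_tokens)
    novel = [t for t in sum_tokens if t not in src_set]
    return [
        float(len(sum_tokens)),                          # summary length
        len(novel) / max(len(sum_tokens), 1),            # fraction of words not in source
        len(sum_set & src_set) / max(len(sum_set), 1),   # lexical overlap with source
    ]

def fit_probe(pairs: list[tuple[str, str]], labels: list[int]) -> LogisticRegression:
    """pairs: (source, summary) tuples; labels: human factual-consistency labels (0/1)."""
    X = np.array([shallow_features(src, summ) for src, summ in pairs])
    return LogisticRegression(max_iter=1000).fit(X, np.array(labels))
```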
zh
[NLP-5] StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training
【Quick Read】: This paper addresses the problem that Transformer-based language models (LMs) face rapidly growing computational demands as input sequences get longer. The key to the solution is introducing selective attention, in particular global attention, and empirically evaluating its practical impact on BERT pre-training. The authors build a structure-aware text corpus from arXiv data alongside a text-only counterpart, run pre-training experiments on both, analyze shifts in attention patterns, and assess the implications for downstream tasks. The results underscore the value of incorporating document structure into language models, which then excel at more abstract tasks such as document understanding.
链接: https://arxiv.org/abs/2411.16618
作者: Kaustubh Ponkshe,Venkatapathy Subramanian,Natwar Modani,Ganesh Ramakrishnan
关键词-EN: techniques for Language, Language Models, ubiquitous attention mechanism, today rely, rely on transformer-based
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Most state-of-the-art techniques for Language Models (LMs) today rely on transformer-based architectures and their ubiquitous attention mechanism. However, the exponential growth in computational requirements with longer input sequences confines Transformers to handling short passages. Recent efforts have aimed to address this limitation by introducing selective attention mechanisms, notably local and global attention. While sparse attention mechanisms, akin to full attention in being Turing-complete, have been theoretically established, their practical impact on pre-training remains unexplored. This study focuses on empirically assessing the influence of global attention on BERT pre-training. The primary steps involve creating an extensive corpus of structure-aware text through arXiv data, alongside a text-only counterpart. We carry out pre-training on these two datasets, investigate shifts in attention patterns, and assess their implications for downstream tasks. Our analysis underscores the significance of incorporating document structure into LM models, demonstrating their capacity to excel in more abstract tasks, such as document understanding.
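The study contrasts full attention with a selective pattern in which designated structure tokens (for example, section headings) attend globally while ordinary tokens attend only within a local window. A small sketch of building such a mask is below; the window size and the choice of which positions count as "global" are illustrative, not taken from the paper.

```python
import numpy as np

def structure_attention_mask(seq_len: int, global_positions: set[int],
                             window: int = 4) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask where True means attention is allowed.
    Structure tokens attend to and are attended by everything; other tokens
    use a local sliding window."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True            # local window
    for g in global_positions:
        mask[g, :] = True                # global token sees all positions
        mask[:, g] = True                # and is visible to all positions
    return mask

# Example: a 12-token passage whose section-heading token sits at position 0.
print(structure_attention_mask(12, {0}).astype(int))
```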
zh
[NLP-6] Recent Trends in Linear Text Segmentation: a Survey
【Quick Read】: This paper surveys linear text segmentation, the task of automatically marking the points in a text where the topic changes. The key lies in drawing on research in natural language processing, combining concepts from linguistics and computational linguistics, and covering the state-of-the-art resources and approaches for the task. Beyond reviewing current advances, the survey highlights the limitations of existing resources and of the task itself, and points out ways forward based on the most recent literature and under-explored research directions.
链接: https://arxiv.org/abs/2411.16613
作者: Iacopo Ghinassi,Lin Wang,Chris Newell,Matthew Purver
关键词-EN: Linear Text Segmentation, tagging text documents, automatically tagging text, Natural Language Processing, Text Segmentation
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Linear Text Segmentation is the task of automatically tagging text documents with topic shifts, i.e. the places in the text where the topics change. A well-established area of research in Natural Language Processing, drawing from well-understood concepts in linguistic and computational linguistic research, the field has recently seen a lot of interest as a result of the surge of text, video, and audio available on the web, which in turn require ways of summarising and categorizing the mole of content for which linear text segmentation is a fundamental step. In this survey, we provide an extensive overview of current advances in linear text segmentation, describing the state of the art in terms of resources and approaches for the task. Finally, we highlight the limitations of available resources and of the task itself, while indicating ways forward based on the most recent literature and under-explored research directions.
zh
[NLP-7] From Generation to Judgment: Opportunities and Challenges of LLM -as-a-judge
【Quick Read】: This paper addresses the shortcomings of traditional evaluation methods in artificial intelligence (AI) and natural language processing (NLP), which often fail to judge subtle attributes or deliver satisfactory results. The key to the solution is the "LLM-as-a-judge" paradigm, in which large language models are used for scoring, ranking, or selection to make evaluation more accurate and comprehensive. The paper gives detailed definitions from the input and output perspectives and proposes a comprehensive taxonomy that examines LLM-as-a-judge along three dimensions: what to judge, how to judge, and where to judge. It also compiles benchmarks for evaluating LLM-as-a-judge and identifies key challenges and promising directions, aiming to provide useful insights and inspire future research in this emerging area.
链接: https://arxiv.org/abs/2411.16594
作者: Dawei Li,Bohan Jiang,Liangjie Huang,Alimohammad Beigi,Chengshuai Zhao,Zhen Tan,Amrita Bhattacharjee,Yuxuan Jiang,Canyu Chen,Tianhao Wu,Kai Shu,Lu Cheng,Huan Liu
关键词-EN: natural language processing, Large Language Models, artificial intelligence, evaluation have long, long been critical
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 32 pages, 5 figures
点击查看摘要
Abstract:Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this promising research area. Paper list and more resources about LLM-as-a-judge can be found at this https URL and this https URL.
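As a concrete instance of the survey's "how to judge" dimension, a minimal pairwise-judging sketch is below. The `call_llm` function is a placeholder for whatever model API is available, and the prompt wording, the A/B parsing rule, and the position-swap tie-breaking are illustrative choices, not prescribed by the survey.

```python
JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A" or "B" for the better answer.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Better answer:"""

def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Return 'A' or 'B' per an LLM judge; call_llm(prompt) -> str is assumed."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip().upper()
    return "A" if verdict.startswith("A") else "B"

def judge_with_position_swap(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Judge twice with the answers swapped to reduce position bias;
    report 'tie' when the two runs disagree."""
    first = judge_pair(question, answer_a, answer_b, call_llm)
    second = judge_pair(question, answer_b, answer_a, call_llm)
    swapped = "B" if second == "A" else "A"
    return first if first == swapped else "tie"
```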
zh
[NLP-8] Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
【Quick Read】: This paper studies how to make large language models (LLMs) spend more time thinking and reflecting so that they solve complex reasoning tasks in science, coding, and mathematics more effectively. The key to the solution is a two-player paradigm that separates the roles of a reasoning (actor) model and a critique model, where the critique model provides step-level feedback to supervise the actor at both test time and training time. The paper proposes AutoMathCritique, an automated framework for collecting critique data, and shows that fine-tuning language models on this data enables them to generate natural-language feedback that improves mathematical reasoning. Experiments show that critique supervision consistently improves the actor on difficult queries, especially when inference-time computation is scaled up. Building on this, the authors propose a critique-based self-training method that further improves the actor's exploration efficiency and solution diversity.
链接: https://arxiv.org/abs/2411.16579
作者: Zhiheng Xi,Dingwen Yang,Jixuan Huang,Jiafu Tang,Guanyu Li,Yiwen Ding,Wei He,Boyang Hong,Shihan Do,Wenyu Zhan,Xiao Wang,Rui Zheng,Tao Ji,Xiaowei Shi,Yitao Zhai,Rongxiang Weng,Jingang Wang,Xunliang Cai,Tao Gui,Zuxuan Wu,Qi Zhang,Xipeng Qiu,Xuanjing Huang,Yu-Gang Jiang
关键词-EN: effectively solving complex, complex reasoning tasks, solving complex reasoning, large language models, Training large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
点击查看摘要
Abstract:Training large language models (LLMs) to spend more time thinking and reflection before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test-time and train-time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of 76,321 responses paired with step-level feedback. Fine-tuning language models with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time, especially when scaling up inference-time computation. Motivated by these findings, we introduce the critique-based supervision to the actor's self-training process, and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take the preliminary step to explore training self-talk reasoning models via critique supervision and showcase its potential. Our code and datasets are at this https URL.
zh
[NLP-9] EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code
【Quick Read】: This paper addresses the shortcomings of existing automated software-vulnerability detection methods when facing the complexity and diversity of modern codebases. The key to the solution is EnStack, a novel ensemble stacking framework that enhances vulnerability detection with natural language processing (NLP) techniques. EnStack combines several pre-trained large language models (LLMs) specialized for code understanding: CodeBERT for semantic analysis, GraphCodeBERT for structural representation, and UniXcoder for cross-modal capabilities, and integrates their outputs through meta-classifiers such as Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost. This ensemble captures intricate code patterns and vulnerabilities that individual models miss, detecting subtle and complex vulnerabilities significantly better than existing methods.
链接: https://arxiv.org/abs/2411.16561
作者: Shahriyar Zaman Ridoy,Md. Shazzad Hossain Shaon,Alfredo Cuzzocrea,Mst Shapna Akter
关键词-EN: enhancing security, modern codebases, critical for enhancing, complexity and diversity, diversity of modern
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: Accepted in 2024 IEEE International Conference on Big Data (IEEE BigData 2024)
点击查看摘要
Abstract:Automated detection of software vulnerabilities is critical for enhancing security, yet existing methods often struggle with the complexity and diversity of modern codebases. In this paper, we introduce EnStack, a novel ensemble stacking framework that enhances vulnerability detection using natural language processing (NLP) techniques. Our approach synergizes multiple pre-trained large language models (LLMs) specialized in code understanding CodeBERT for semantic analysis, GraphCodeBERT for structural representation, and UniXcoder for cross-modal capabilities. By fine-tuning these models on the Draper VDISC dataset and integrating their outputs through meta-classifiers such as Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost, EnStack effectively captures intricate code patterns and vulnerabilities that individual models may overlook. The meta-classifiers consolidate the strengths of each LLM, resulting in a comprehensive model that excels in detecting subtle and complex vulnerabilities across diverse programming contexts. Experimental results demonstrate that EnStack significantly outperforms existing methods, achieving notable improvements in accuracy, precision, recall, and F1-score. This work highlights the potential of ensemble LLM approaches in code analysis tasks and offers valuable insights into applying NLP techniques for advancing automated vulnerability detection.
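EnStack integrates the predictions of several fine-tuned code LLMs through classical meta-classifiers. The sketch below stacks precomputed per-model probability scores with scikit-learn; it assumes the three base-model scores are already available as features (sidestepping the fine-tuning step described in the paper) and omits XGBoost to keep the example dependency-free.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate_meta_classifiers(X: np.ndarray, y: np.ndarray) -> dict:
    """X: one row per code snippet, columns = vulnerability probabilities from
    CodeBERT, GraphCodeBERT and UniXcoder (assumed precomputed); y: 0/1 labels."""
    metas = {
        "logreg": LogisticRegression(max_iter=1000),
        "svm": SVC(probability=True),
        "random_forest": RandomForestClassifier(n_estimators=200),
    }
    # 5-fold cross-validated F1 for each candidate meta-classifier.
    return {name: cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
            for name, clf in metas.items()}
```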
zh
[NLP-10] RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
【Quick Read】: This paper addresses the lack of spatial understanding in robots: vision-language models struggle with spatial reasoning because their training data rarely covers sophisticated spatial scene understanding, which hurts real-world performance. The key to the solution is RoboSpatial, a large-scale spatial understanding dataset of real indoor and tabletop scenes captured as 3D scans and egocentric 2D images, annotated with rich robotics-relevant spatial information. RoboSpatial contains 1M images, 5K 3D scans, and 3M annotated spatial relationships, with 2D images paired to 3D scans so that it is ready for both 2D and 3D tasks. Experiments show that models trained on RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.
链接: https://arxiv.org/abs/2411.16537
作者: Chan Hee Song,Valts Blukis,Jonathan Tremblay,Stephen Tyree,Yu Su,Stan Birchfield
关键词-EN: grounded decisions based, make grounded decisions, crucial capability, grounded decisions, decisions based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Spatial understanding is a crucial capability for robots to make grounded decisions based on their environment. This foundational skill enables robots not only to perceive their surroundings but also to reason about and interact meaningfully within the world. In modern robotics, these capabilities are taken on by visual language models, and they face significant challenges when applied to spatial reasoning context due to their training data sources. These sources utilize general-purpose image datasets, and they often lack sophisticated spatial scene understanding capabilities. For example, the datasets do not address reference frame comprehension - spatial relationships require clear contextual understanding, whether from an ego-centric, object-centric, or world-centric perspective, which allow for effective real-world interaction. To address this issue, we introduce RoboSpatial, a large-scale spatial understanding dataset consisting of real indoor and tabletop scenes captured as 3D scans and egocentric images, annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M annotated spatial relationships, with paired 2D egocentric images and 3D scans to make it both 2D and 3D ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robotics manipulation.
zh
[NLP-11] Profiling Bias in LLM s: Stereotype Dimensions in Contextual Word Embeddings
【Quick Read】: This paper addresses bias in large language models (LLMs) and proposes a way to communicate these risks effectively and encourage mitigation. The key to the solution is to propose bias profiles along stereotype dimensions drawn from dictionaries in social psychology research, to analyze gender bias in contextual embeddings along these dimensions across contexts and layers, and to generate stereotype profiles for twelve different LLMs, making bias easy to expose and visualize. The goal is to describe discriminatory properties in a way that is adequate and intuitive for all AI audiences, raising awareness of bias risks and supporting mitigation efforts.
链接: https://arxiv.org/abs/2411.16527
作者: Carolin M. Schuster,Maria-Alexandra Dinisor,Shashwat Ghatiwala,Georg Groh
关键词-EN: Large language models, Large language, artificial intelligence, unavoidably biased, current successes
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI), however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuition and use case for exposing and visualizing bias.
zh
[NLP-12] Fundamental Limits of Prompt Tuning Transformers: Universality Capacity and Efficiency
【Quick Read】: This paper investigates the statistical and computational limits of prompt tuning for Transformer-based foundation models. The key to the solution is to study the simplest possible Transformer: a single-head model with a single self-attention layer. The paper proves that prompt tuning on this minimal Transformer is universal and admits efficient (even almost-linear-time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, such simplest Transformers are shown to be universal approximators of sequence-to-sequence Lipschitz functions, and the paper gives an exponential lower bound on the number of soft-prompt tokens needed for a 1-layer, 1-head Transformer to memorize any dataset. Computationally, the paper identifies a phase transition in prompt-tuning efficiency governed by the norm of the soft-prompt-induced keys and queries and provides an upper-bound criterion: beyond it, no sub-quadratic (efficient) prompt-tuning algorithm exists under SETH; within it, almost-linear-time prompt-tuning inference algorithms exist. These fundamental limits give important necessary conditions for designing expressive and efficient prompt-tuning methods.
链接: https://arxiv.org/abs/2411.16525
作者: Jerry Yao-Chieh Hu,Wei-Po Wang,Ammar Gilani,Chenyang Li,Zhao Song,Han Liu
关键词-EN: transformer-based foundation models, prompt tuning, Exponential Time Hypothesis, Strong Exponential Time, prompt
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are that prompt tuning on single-head transformers with only a single self-attention layer: (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers are universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-dL and -in-(1/ε) lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the soft-prompt-induced keys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.
zh
[NLP-13] LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation
【Quick Read】: This paper challenges the reliance of image captioning on high-dimensional latent feature vectors that require model fine-tuning. The key to the solution is Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach that uses image descriptors in the form of categorical labels to boost standard retrieval-augmented generation (RAG) with pretrained large language models (LLMs). Concretely, simple linear classifiers convert extracted image embeddings into radiology-specific labels, and these labels are combined with standard RAG so that general-domain LLMs can generate radiology reports. Without training the generative language model or the image feature encoder, and without ever showing an X-ray to the LLM directly, LaB-RAG outperforms other retrieval-based radiology report generation methods on natural-language and radiology-language metrics and is competitive with fine-tuned vision-language models.
链接: https://arxiv.org/abs/2411.16523
作者: Steven Song,Anirudh Subramanyam,Irene Madejski,Robert L. Grossman
关键词-EN: Retrieval Augmented Generation, deep learning models, Boosted Retrieval Augmented, deep learning, Augmented Generation
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors which require model fine tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician’s report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text-space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly “showing” the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods, while attaining competitive results compared to other fine-tuned vision-language RRG models. We further present results of our experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data-leakage.
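LaB-RAG first maps frozen image embeddings to categorical radiology labels with simple linear classifiers, then uses those labels to retrieve example reports that are placed in the LLM prompt. A compressed sketch of that flow is below; the label set, the overlap-based retrieval rule, and the `call_llm` placeholder are illustrative simplifications of the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LABELS = ["cardiomegaly", "edema", "pneumonia"]   # illustrative subset of findings

def train_label_classifiers(embeddings: np.ndarray, label_matrix: np.ndarray):
    """One linear classifier per label over frozen image embeddings (no fine-tuning)."""
    return [LogisticRegression(max_iter=1000).fit(embeddings, label_matrix[:, j])
            for j in range(label_matrix.shape[1])]

def predict_labels(classifiers, embedding: np.ndarray) -> set[str]:
    return {LABELS[j] for j, clf in enumerate(classifiers)
            if clf.predict(embedding.reshape(1, -1))[0] == 1}

def retrieve_reports(query_labels: set[str], corpus: list[tuple[set[str], str]], k: int = 3):
    """Rank past (labels, report) pairs by label overlap with the query study."""
    ranked = sorted(corpus, key=lambda item: len(item[0] & query_labels), reverse=True)
    return [report for _, report in ranked[:k]]

def generate_report(query_labels: set[str], corpus, call_llm) -> str:
    examples = "\n---\n".join(retrieve_reports(query_labels, corpus))
    prompt = (f"Example radiology reports:\n{examples}\n---\n"
              f"Write a report for a study with findings: {', '.join(sorted(query_labels))}.")
    return call_llm(prompt)   # call_llm is an assumed text-generation hook
```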
zh
[NLP-14] All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
【Quick Read】: This paper addresses the limitations of existing large multimodal models (LMMs) in understanding multilingual and culturally diverse content. The key to the solution is the All Languages Matter Benchmark (ALM-bench), the largest and most comprehensive evaluation effort to date for testing LMMs across 100 languages, with a particular focus on low-resource languages and cultural understanding. ALM-bench uses diverse question formats (true/false, multiple choice, and open-ended questions, further split into short- and long-answer categories) to assess visual and linguistic reasoning at varying levels of difficulty. It also curates content spanning 13 distinct cultural aspects, ensuring that models are tested on understanding and respecting global diversity and encouraging the development of models that serve diverse populations effectively.
链接: https://arxiv.org/abs/2411.16508
作者: Ashmal Vayani,Dinura Dissanayake,Hasindri Watawana,Noor Ahsan,Nevasini Sasikumar,Omkar Thawakar,Henok Biadglign Ademtew,Yahya Hmaiti,Amandeep Kumar,Kartik Kuckreja,Mykola Maslych,Wafa Al Ghallabi,Mihail Mihaylov,Chao Qin,Abdelrahman M Shaker,Mike Zhang,Mahardika Krisna Ihsani,Amiel Esplana,Monil Gokani,Shachar Mirkin,Harsh Singh,Ashay Srivastava,Endre Hamerlik,Fathinah Asma Izzati,Fadillah Adamsyah Maani,Sebastian Cavada,Jenny Chim,Rohit Gupta,Sanjay Manjunath,Kamila Zhumakhanova,Feno Heriniaina Rabevohitra,Azril Amirudin,Muhammad Ridzuan,Daniya Kareem,Ketan More,Kunyang Li,Pramesh Shakya,Muhammad Saad,Amirpouya Ghasemaghaei,Amirbek Djanibekov,Dilshod Azizov,Branislava Jankovic,Naman Bhatia,Alvaro Cabrera,Johan Obando-Ceron,Olympiah Otieno,Fabian Farestam,Muztoba Rabbani,Sanoojan Baliah,Santosh Sanjeev,Abduragim Shtanchaev,Maheen Fatima,Thao Nguyen,Amrin Kareem,Toluwani Aremu,Nathan Xavier,Amit Bhatkal,Hawau Toyin,Aman Chadha,Hisham Cholakkal,Rao Muhammad Anwer,Michael Felsberg,Jorma Laaksonen,Thamar Solorio,Monojit Choudhury,Ivan Laptev,Mubarak Shah,Salman Khan,Fahad Khan
关键词-EN: Existing Large Multimodal, Large Multimodal Models, Large Multimodal, Existing Large, generally focus
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: A Multilingual Multimodal cultural benchmark for 100 languages
点击查看摘要
Abstract:Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model’s ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.
zh
[NLP-15] AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning
【Quick Read】: This paper addresses the weakness of large language models (LLMs) on knowledge-intensive complex question answering, which stems from poor reasoning planning and the hallucination problem. The key to the solution is AtomR, a novel heterogeneous knowledge reasoning framework that performs multi-source reasoning at the atomic level. Inspired by graph modeling of knowledge, AtomR uses LLMs to decompose complex questions into combinations of three atomic knowledge operators, substantially improving both the planning and execution stages of reasoning. The paper also introduces BlendQA, a new evaluation benchmark tailored to complex heterogeneous knowledge reasoning. Experiments show that AtomR significantly outperforms state-of-the-art baselines on multiple single-source and multi-source reasoning benchmarks.
链接: https://arxiv.org/abs/2411.16495
作者: Amy Xin,Jinxin Liu,Zijun Yao,Zhicheng Li,Shulin Cao,Lei Hou,Juanzi Li
关键词-EN: question answering due, Recent advancements, language processing tasks, perform knowledge-intensive complex, natural language processing
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advancements in large language models (LLMs) have led to significant improvements in various natural language processing tasks, but it is still challenging for LLMs to perform knowledge-intensive complex question answering due to LLMs’ inefficacy in reasoning planning and the hallucination problem. A typical solution is to employ retrieval-augmented generation (RAG) coupled with chain-of-thought (CoT) reasoning, which decomposes complex questions into chain-like sub-questions and applies iterative RAG at each sub-question. However, prior works exhibit sub-optimal reasoning planning and overlook dynamic knowledge retrieval from heterogeneous sources. In this paper, we propose AtomR, a novel heterogeneous knowledge reasoning framework that conducts multi-source reasoning at the atomic level. Drawing inspiration from the graph modeling of knowledge, AtomR leverages large language models (LLMs) to decompose complex questions into combinations of three atomic knowledge operators, significantly enhancing the reasoning process at both the planning and execution stages. We also introduce BlendQA, a novel evaluation benchmark tailored to assess complex heterogeneous knowledge reasoning. Experiments show that AtomR significantly outperforms state-of-the-art baselines across three single-source and two multi-source reasoning benchmarks, with notable performance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA.
zh
[NLP-16] O1 Replication Journey – Part 2: Surpassing O1-preview through Simple Distillation Big Progress or Bitter Lesson?
【Quick Read】: This paper examines the widespread but often undisclosed use of knowledge distillation in attempts to replicate the capabilities of OpenAI's O1 model. The key finding is that simple distillation from O1's API, combined with supervised fine-tuning, can surpass O1-preview on complex mathematical reasoning: fine-tuning a base model on tens of thousands of O1-distilled long chain-of-thought samples outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with little technical complexity. The study also investigates how O1-distilled models generalize to other tasks (hallucination, safety, and open-domain QA), finding that models trained only on mathematical problem-solving data generalize well to open-ended QA and become markedly less sycophantic after fine-tuning. The authors make these findings public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field.
链接: https://arxiv.org/abs/2411.16489
作者: Zhen Huang,Haoyang Zou,Xuefeng Li,Yixiu Liu,Yuxiang Zheng,Ethan Chern,Shijie Xia,Yiwei Qin,Weizhe Yuan,Pengfei Liu
关键词-EN: knowledge distillation techniques, Invitational Mathematics Examination, American Invitational Mathematics, replicating OpenAI, paper presents
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages
点击查看摘要
Abstract:This paper presents a critical examination of current approaches to replicating OpenAI’s O1 model capabilities, with particular focus on the widespread but often undisclosed use of knowledge distillation techniques. While our previous work explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1’s API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on simply tens of thousands of samples O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Moreover, our investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks: hallucination, safety and open-domain QA. Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning. We deliberately make this finding public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field. Our work includes: (1) A detailed technical exposition of the distillation process and its effectiveness, (2) A comprehensive benchmark framework for evaluating and categorizing O1 replication attempts based on their technical transparency and reproducibility, (3) A critical discussion of the limitations and potential risks of over-relying on distillation approaches, our analysis culminates in a crucial bitter lesson: while the pursuit of more capable AI systems is important, the development of researchers grounded in first-principles thinking is paramount.
zh
[NLP-17] When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets? EMNLP2024 CONLL
【Quick Read】: This paper addresses data-efficient pretraining of language models, pushing the boundaries of the BabyLM challenge. The key to the solution is deep mutual learning combined with a student-model search for diverse initialization. The authors formulate weighted mutual learning as a bi-level optimization problem: the inner loop learns compact student models through online distillation, while the outer loop optimizes the weights so that knowledge is distilled more effectively from the diverse students. This dynamic weighting removes the need for a teacher model, reducing computational requirements. Evaluations show that the teacher-less approach matches or surpasses teacher-supervised methods.
链接: https://arxiv.org/abs/2411.16487
作者: Srikrishna Iyer
关键词-EN: language model pretraining, data-efficient language model, BabyLM challenge, aiming to push, present our submission
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to BabyLM challenge, CoNLL Workshop, EMNLP 2024
点击查看摘要
Abstract:We present our submission to the BabyLM challenge, aiming to push the boundaries of data-efficient language model pretraining. Our method builds upon deep mutual learning, introducing a student model search for diverse initialization. We address the limitation of treating students equally by formulating weighted mutual learning as a bi-level optimization problem. The inner loop learns compact students through online distillation, while the outer loop optimizes weights for better knowledge distillation from diverse students. This dynamic weighting strategy eliminates the need for a teacher model, reducing computational requirements. Our evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.
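In weighted mutual learning, the teacher is replaced by a population of students that distill from one another, with per-student weights tuned in an outer loop. The inner-loop loss below is a hedged sketch: cross-entropy on the data plus a weighted KL term toward each peer's softened predictions; the temperature and the bi-level weight update are simplifications, not the submission's exact formulation.

```python
import torch
import torch.nn.functional as F

def mutual_learning_losses(student_logits: list[torch.Tensor],
                           labels: torch.Tensor,
                           peer_weights: torch.Tensor,
                           temperature: float = 2.0) -> list[torch.Tensor]:
    """Inner-loop objective for each student: task loss + weighted KL to every peer.
    peer_weights[j] is the (outer-loop-learned) importance of student j as a teacher."""
    losses = []
    for i, logits_i in enumerate(student_logits):
        loss = F.cross_entropy(logits_i, labels)
        log_p_i = F.log_softmax(logits_i / temperature, dim=-1)
        for j, logits_j in enumerate(student_logits):
            if i == j:
                continue
            p_j = F.softmax(logits_j.detach() / temperature, dim=-1)
            loss = loss + peer_weights[j] * F.kl_div(log_p_i, p_j, reduction="batchmean")
        losses.append(loss)
    return losses
```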
zh
[NLP-18] Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval
【Quick Read】: This paper addresses the difficulty large language models (LLMs) have with complex reasoning tasks such as math word problems (MWPs). The key to the solution is to use analogy with structurally similar problems: problems whose computational graphs resemble that of the given question are retrieved and used as exemplars in the prompt, providing a correct reasoning path for the generation model to refer to. Across six math word problem datasets, the method yields significant improvements of up to 6.7 percent on average in absolute value over baseline methods, highlighting its potential for addressing the reasoning challenges of current LLMs.
链接: https://arxiv.org/abs/2411.16454
作者: Xiaocong Yang,Jiacheng Lin,Ziqi Wang,Chengxiang Zhai
关键词-EN: Large language models, Large language, complicated reasoning tasks, struggle with complicated, math word problems
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are known to struggle with complicated reasoning tasks such as math word problems (MWPs). In this paper, we present how analogy from similarly structured questions can improve LLMs’ problem-solving capabilities for MWPs. Specifically, we rely on the retrieval of problems with similar computational graphs to the given question to serve as exemplars in the prompt, providing the correct reasoning path for the generation model to refer to. Empirical results across six math word problem datasets demonstrate the effectiveness of our proposed method, which achieves a significant improvement of up to 6.7 percent on average in absolute value, compared to baseline methods. These results highlight our method’s potential in addressing the reasoning challenges in current LLMs.
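The method retrieves exemplar problems whose computational graphs resemble that of the query and places their worked solutions in the few-shot prompt. The sketch below approximates graph similarity with a Jaccard score over operation multisets, which is a deliberate simplification of the graph-based retrieval the paper describes.

```python
from collections import Counter

def op_signature(computational_graph: list[str]) -> Counter:
    """A problem's computational graph reduced to a bag of operations, e.g. ['add', 'mul', 'mul']."""
    return Counter(computational_graph)

def graph_similarity(g1: list[str], g2: list[str]) -> float:
    a, b = op_signature(g1), op_signature(g2)
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 0.0

def retrieve_exemplars(query_graph: list[str], bank: list[tuple[list[str], str]], k: int = 4):
    """bank: (graph, worked_solution_text) pairs. Return the k most similar worked
    solutions to prepend to the prompt as reasoning-path references."""
    ranked = sorted(bank, key=lambda item: graph_similarity(query_graph, item[0]), reverse=True)
    return [solution for _, solution in ranked[:k]]
```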
zh
[NLP-19] Finding Structure in Language Models
【Quick Read】: The central question of this thesis is whether large language models possess a deep understanding of grammatical structure similar to that of humans. The key to the solution is developing new interpretability techniques that improve our understanding of the complex nature of large-scale language models. The research approaches the question from three directions: exploring the presence of abstract linguistic information through structural priming, a key paradigm in psycholinguistics; examining linguistic phenomena such as adjective order and negative polarity items and connecting a model's grasp of them to the distribution of its training data; and introducing a controlled testbed for studying hierarchical structure in language models using synthetic languages of increasing complexity, including the role of feature interactions. Together, these methods give a detailed account of the grammatical knowledge embedded in language model representations and open directions for studying fundamental linguistic questions with computational methods.
链接: https://arxiv.org/abs/2411.16433
作者: Jaap Jumelet
关键词-EN: continuously make predictions, make predictions based, write or listen, continuously make, make predictions
类目: Computation and Language (cs.CL)
备注: PhD Thesis at ILLC, University of Amsterdam
点击查看摘要
Abstract:When we speak, write or listen, we continuously make predictions based on our knowledge of a language’s grammar. Remarkably, children acquire this grammatical knowledge within just a few years, enabling them to understand and generalise to novel constructions that have never been uttered before. Language models are powerful tools that create representations of language by incrementally predicting the next word in a sentence, and they have had a tremendous societal impact in recent years. The central research question of this thesis is whether these models possess a deep understanding of grammatical structure similar to that of humans. This question lies at the intersection of natural language processing, linguistics, and interpretability. To address it, we will develop novel interpretability techniques that enhance our understanding of the complex nature of large-scale language models. We approach our research question from three directions. First, we explore the presence of abstract linguistic information through structural priming, a key paradigm in psycholinguistics for uncovering grammatical structure in human language processing. Next, we examine various linguistic phenomena, such as adjective order and negative polarity items, and connect a model’s comprehension of these phenomena to the data distribution on which it was trained. Finally, we introduce a controlled testbed for studying hierarchical structure in language models using various synthetic languages of increasing complexity and examine the role of feature interactions in modelling this structure. Our findings offer a detailed account of the grammatical knowledge embedded in language model representations and provide several directions for investigating fundamental linguistic questions using computational methods.
zh
[NLP-20] Adapter-based Approaches to Knowledge-enhanced Language Models – A Survey
【Quick Read】: This paper addresses the challenges that knowledge-enhanced language models (KELMs) face when combining large-scale language models with domain-specific knowledge, in particular improving factual accuracy and reducing hallucinations. The key lies in leveraging knowledge graphs (KGs) together with adapter modules: adapters reduce the computational load and the risk of catastrophic forgetting. Through a systematic literature review (SLR), the paper provides a quantitative and qualitative analysis of existing adapter-based KELM approaches, discusses their strengths and potential shortcomings, and focuses especially on the popular biomedical domain, where it offers an insightful performance comparison of existing KELMs.
链接: https://arxiv.org/abs/2411.16403
作者: Alexander Fichtl,Juraj Vladika,Georg Groh
关键词-EN: Knowledge-enhanced language models, large-scale language models, language models, Knowledge-enhanced language, large-scale language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures. Published at KEOD24 via SciTePress
点击查看摘要
Abstract:Knowledge-enhanced language models (KELMs) have emerged as promising tools to bridge the gap between large-scale language models and domain-specific knowledge. KELMs can achieve higher factual accuracy and mitigate hallucinations by leveraging knowledge graphs (KGs). They are frequently combined with adapter modules to reduce the computational load and risk of catastrophic forgetting. In this paper, we conduct a systematic literature review (SLR) on adapter-based approaches to KELMs. We provide a structured overview of existing methodologies in the field through quantitative and qualitative analysis and explore the strengths and potential shortcomings of individual approaches. We show that general knowledge and domain-specific approaches have been frequently explored along with various adapter architectures and downstream tasks. We particularly focused on the popular biomedical domain, where we provided an insightful performance comparison of existing KELMs. We outline the main trends and propose promising future directions.
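The adapters surveyed here are typically small bottleneck modules inserted into a frozen pretrained transformer, so that knowledge-graph information can be injected without full fine-tuning or catastrophic forgetting. A generic bottleneck adapter is sketched below as a point of reference; the specific KELM variants in the survey differ in where the adapter sits and what data it is trained on.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a nonlinearity, up-project, and add a residual connection.
    Only these parameters are trained; the surrounding transformer stays frozen."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example: adapt the 768-dimensional hidden states of a BERT-sized layer.
adapter = BottleneckAdapter(hidden_size=768)
out = adapter(torch.randn(2, 16, 768))   # same shape as the input
```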
zh
[NLP-21] Human-Calibrated Automated Testing and Validation of Generative Language Models
【Quick Read】: This paper addresses the evaluation and validation of generative language models (GLMs) in high-stakes domains such as banking, particularly for retrieval-augmented generation (RAG) systems, where open-ended outputs and subjective quality judgments make assessment difficult. The key to the solution is the Human-Calibrated Automated Testing (HCAT) framework, which combines: 1) automated test generation via stratified sampling; 2) embedding-based metrics for explainable assessment of functionality, risk, and safety attributes; and 3) a two-stage calibration procedure, probability calibration followed by conformal prediction, that aligns machine-generated evaluations with human judgments. The framework additionally includes robustness testing and targeted weakness identification via marginal and bivariate analysis to pinpoint specific areas for improvement. This multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, suitable for applications that demand high accuracy, transparency, and regulatory compliance.
链接: https://arxiv.org/abs/2411.16391
作者: Agus Sudjianto,Aijun Zhang,Srinivas Neppalli,Tarun Joshi,Michal Malohlava
关键词-EN: generative language models, paper introduces, introduces a comprehensive, validation of generative, deployed in high-stakes
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.
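HCAT's two-stage calibration first maps raw machine-evaluation scores to probabilities that match human judgments, then uses conformal prediction to attach coverage guarantees. The sketch below combines isotonic regression with a split-conformal threshold on a held-out calibration set; the choice of calibrator and the nonconformity score are assumptions for illustration, not the framework's exact recipe.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_scores(raw_scores: np.ndarray, human_labels: np.ndarray) -> IsotonicRegression:
    """Stage 1: monotone map from machine scores to P(human judges the output acceptable)."""
    return IsotonicRegression(out_of_bounds="clip").fit(raw_scores, human_labels)

def conformal_threshold(calibrated_probs: np.ndarray, human_labels: np.ndarray,
                        alpha: float = 0.1) -> float:
    """Stage 2 (split conformal): nonconformity = 1 - calibrated probability of the true label.
    Returns the quantile threshold giving roughly (1 - alpha) coverage on new cases."""
    nonconformity = np.where(human_labels == 1, 1 - calibrated_probs, calibrated_probs)
    n = len(nonconformity)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(nonconformity, level, method="higher"))
```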
zh
[NLP-22] FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web
【Quick Read】: This paper addresses the shortage of LLM pretraining datasets for Traditional Chinese users. The key to the solution is FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese, built with multiple stages of meticulously designed filters that account for the linguistic differences between English and Traditional Chinese, ensuring both comprehensiveness and quality.
链接: https://arxiv.org/abs/2411.16387
作者: Cheng-Wei Lin,Wan-Hsuan Hsieh,Kai-Xin Guan,Chan-Jan Hsu,Chia-Chen Kuo,Chuan-Lin Lai,Chung-Wei Chung,Ming-Jen Wang,Da-Shan Shiu
关键词-EN: large language models, Traditional Chinese, Traditional Chinese users, pretraining dataset significantly, dataset significantly influence
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:
点击查看摘要
Abstract:The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While there have been numerous efforts in the curation of such a dataset for English users, there is a relative lack of similar initiatives for Traditional Chinese. Building upon this foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We came up with multiple stages of meticulously designed filters to cater to the linguistic difference between English and Traditional Chinese, to ensure comprehensiveness and quality. We determined effectiveness from querying dataset samples with three main objectives. Our code and datasets are publicly available.
zh
[NLP-23] Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark Evaluate Metrics and Strong Baselines
【Quick Read】: This paper investigates Multi-modal Retrieval Augmented Multi-modal Generation (M^2RAG), which requires foundation models to browse multi-modal web pages containing both text and images and to generate multi-modal responses that answer user queries. Because M^2RAG is still at an early research stage and lacks systematic study and analysis, the paper constructs a benchmark equipped with a suite of text-modal and multi-modal metrics to analyze the capabilities of existing foundation models. The key contribution is this benchmark together with several effective methods, derived from the comprehensive evaluation results, that help foundation models accomplish the task; the experiments also reveal intriguing phenomena worth further research.
链接: https://arxiv.org/abs/2411.16365
作者: Zi-Ao Ma,Tian Lan,Rong-Cheng Tu,Yong Hu,Heyan Huang,Xian-Ling Mao
关键词-EN: Augmented Multi-modal Generation, Multi-modal Retrieval Augmented, Retrieval Augmented Multi-modal, Retrieval Augmented, Multi-modal Generation
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper investigates an intriguing task of Multi-modal Retrieval Augmented Multi-modal Generation (M^2RAG). This task requires foundation models to browse multi-modal web pages, with mixed text and images, and generate multi-modal responses for solving user queries, which exhibits better information density and readability. Given the early researching stage of the M^2RAG task, there is a lack of systematic studies and analysis. To fill this gap, we construct a benchmark for the M^2RAG task, equipped with a suite of text-modal metrics and multi-modal metrics to analyze the capabilities of existing foundation models. Besides, we also propose several effective methods for foundation models to accomplish this task, based on the comprehensive evaluation results on our benchmark. Extensive experimental results reveal several intriguing phenomena worth further research.
zh
[NLP-24] he Two-Hop Curse: LLM s trained on A-B B-C fail to learn A–C
【Quick Read】: This paper studies whether large language models (LLMs) can perform two-hop reasoning without chain-of-thought (CoT) prompting. The key is a controlled experimental setting for verifying latent reasoning: the study finds that models can reason latently when the relevant facts appear together during training or in the prompt, but fail completely without CoT when the facts appear only in separate documents, dropping to chance-level accuracy and chance-level test loss, a phenomenon the authors call the "Two-Hop Curse". An evaluation of 9 frontier LLMs on real-world facts further shows that, without CoT, models fail completely on more than half of the question categories, while with CoT they achieve partial success on most categories. These results suggest that LLMs lack a general capability for latent multi-hop reasoning that is independent of question type.
链接: https://arxiv.org/abs/2411.16353
作者: Mikita Balesni,Tomek Korbak,Owain Evans
关键词-EN: performer of Imagine, reason internally, struggle when forced, forced to reason, Imagine
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While LLMs excel at multi-hop questions (e.g. “Who is the spouse of the performer of Imagine?”) when using chain-of-thought reasoning (CoT), they struggle when forced to reason internally (without CoT). Previous work on the size and nature of this gap produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where the above-chance performance constitutes undeniable evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B Instruct and GPT-4o) on fictional facts and confirm that they generalize to answering two-hop questions about them using CoT. We find that models can perform latent reasoning when facts appear together during training or in the prompt. However, to our surprise, models completely fail at two-hop reasoning without CoT when learned facts only appear in different documents, achieving chance-level accuracy and chance-level test loss. We call this complete failure to compose separately learned facts the Two-Hop Curse. Moreover, we evaluate 9 frontier LLMs on real-world facts, finding that models completely fail at two-hop no-CoT reasoning for over half of question categories while maintaining partial success with CoT across most categories. These results suggest that LLMs lack a general capability for latent multi-hop reasoning independent of the question type.
zh
[NLP-25] Preference Optimization for Reasoning with Pseudo Feedback
【Quick Read】: This paper addresses the scarcity of high-quality labeled reasoning datasets, which limits preference optimization of large language models (LLMs) for mathematical reasoning and coding. The key to the solution is a new way of generating pseudo feedback: labeling candidate solutions by evaluating them against associated test cases. Two forms of test-case-based pseudo feedback are explored, one produced by frontier LLMs and one obtained by extending self-consistency to multiple test cases. Experiments using this pseudo feedback for preference optimization show improvements on both mathematical reasoning and coding tasks.
链接: https://arxiv.org/abs/2411.16345
作者: Fangkai Jiao,Geyang Guo,Xingxing Zhang,Nancy F. Chen,Shafiq Joty,Furu Wei
关键词-EN: Direct Preference Optimization, Preference optimization techniques, Direct Preference, large language models, Preference optimization
类目: Computation and Language (cs.CL)
备注: 28 pages, 11 figures
点击查看摘要
Abstract:Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reason problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multi-test-case. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
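The pseudo-feedback idea scores each candidate solution by how many associated test cases it passes and turns high- and low-scoring candidates into preference pairs for DPO-style optimization. The sketch below illustrates that pairing step; `run_tests` stands in for whatever execution or LLM-based checking produces per-test pass/fail signals, and the `margin` threshold is an illustrative choice.

```python
def pass_rate(solution: str, test_cases: list, run_tests) -> float:
    """Fraction of test cases the candidate solution passes.
    run_tests(solution, test) -> bool is an assumed execution/checking hook."""
    results = [run_tests(solution, t) for t in test_cases]
    return sum(results) / max(len(results), 1)

def build_preference_pairs(problem: str, candidates: list[str], test_cases: list,
                           run_tests, margin: float = 0.5) -> list[dict]:
    """Pair candidates whose pseudo-feedback scores differ by at least `margin`,
    yielding (prompt, chosen, rejected) triples for preference optimization."""
    scored = sorted(((pass_rate(c, test_cases, run_tests), c) for c in candidates),
                    reverse=True)
    pairs = []
    for hi_score, hi_sol in scored:
        for lo_score, lo_sol in reversed(scored):   # lowest-scoring candidates first
            if hi_score - lo_score >= margin:
                pairs.append({"prompt": problem, "chosen": hi_sol, "rejected": lo_sol})
                break
    return pairs
```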
zh
[NLP-26] Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring
【Quick Read】: This paper addresses the time cost of manually grading student essays and explores the potential of generative AI for essay scoring. The key lies in evaluating the performance and reliability of large language models (LLMs) when grading German student essays, comparing open-source and closed-source LLMs against teacher ratings. The results show that closed-source GPT models align better with human ratings than open-source models, especially on language-related criteria; the o1 model performs best, reaching a Spearman correlation of r = .74 with human assessments on the overall score and an internal consistency of ICC = .80. This suggests that LLM-based assessment can help reduce teacher workload, particularly for language-related criteria, although the models' tendency toward higher scores means their assessment of content quality still needs refinement.
链接: https://arxiv.org/abs/2411.16337
作者: Kathrin Seßler,Maurice Fürstenberg,Babette Bühler,Enkelejda Kasneci
关键词-EN: time-consuming yet critical, critical task, facilitate essay-scoring tasks, student writing, German student essays
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at LAK '25
点击查看摘要
Abstract:The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (i.e., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs’ scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman’s r = .74 with human assessments in the overall score, and an internal consistency of ICC=.80 . These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency for higher scores, the models require further refinement to better capture aspects of content quality.
zh
[NLP-27] Learning from Relevant Subgoals in Successful Dialogs using Iterative Training for Task-oriented Dialog Systems
【速读】: 该论文试图解决面向任务的对话系统(Task-oriented Dialog, ToD)在完成用户目标时,由于反馈通常只在对话结束时获得,导致难以有效优化中间子目标的问题。解决方案的关键是提出了SUIT(SUbgoal-aware ITerative Training),一种迭代训练方法。SUIT通过从模型中采样对话并使用远监督(distant supervision)确定对对话成功有贡献的子目标,从而生成高质量的训练样本。这种方法不仅改进了监督微调或偏好学习的结果,还能迭代生成更多数据,而非依赖固定的静态数据集。最终,SUIT在流行的ToD基准测试中达到了新的最先进性能。
链接: https://arxiv.org/abs/2411.16305
作者: Magdalena Kaiser,Patrick Ernst,György Szarvas
关键词-EN: accomplish user goals, solve multiple subgoals, Task-oriented Dialog, user goals, solve multiple
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Task-oriented Dialog (ToD) systems have to solve multiple subgoals to accomplish user goals, whereas feedback is often obtained only at the end of the dialog. In this work, we propose SUIT (SUbgoal-aware ITerative Training), an iterative training approach for improving ToD systems. We sample dialogs from the model we aim to improve and determine subgoals that contribute to dialog success using distant supervision to obtain high quality training samples. We show how this data improves supervised fine-tuning or, alternatively, preference learning results. SUIT is able to iteratively generate more data instead of relying on fixed static sets. SUIT reaches new state-of-the-art performance on a popular ToD benchmark.
zh
[NLP-28] BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment
【速读】: 该论文试图解决大型语言模型(LLMs)在低资源语言(low-resource languages)上生成能力和知识相对较弱的问题。解决方案的关键在于通过语言对齐(language alignment)高效地将高资源语言(high-resource languages)的生成能力和知识转移到低资源语言上。为此,研究团队构建了一个包含320万条指令的数据集,涵盖高资源语言指令(中文和英文)以及跨语言指令,并通过基于该数据集的指令微调(instruction tuning)来促进语言间的生成能力转移。实验结果表明,BayLing在多语言翻译和多语言知识理解基准测试中,特别是在低资源语言上,表现显著优于同规模的开源模型,证明了其有效性。
链接: https://arxiv.org/abs/2411.16300
作者: Shaolei Zhang,Kehao Zhang,Qingkai Fang,Shoutao Guo,Yan Zhou,Xiaodong Liu,Yang Feng
关键词-EN: Large language models, powerful generative capabilities, Large language, languages, generative capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: BayLing 2’s online demo: this http URL . BayLing 2’s code and models: this https URL
点击查看摘要
Abstract:Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhance the multilingual capabilities would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages and performed instruction tuning based on the dataset to facilitate the capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. For multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating its capability of effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while enhancing the performance in low-resource languages. Demo, homepage, code and models of BayLing are available.
zh
[NLP-29] Unraveling Arithmetic in Large Language Models : The Role of Algebraic Structures
【速读】: 该论文试图解决大语言模型(LLMs)在链式思维(Chain-of-Thought, CoT)提示下进行一步算术推理的机制问题。现有研究对LLMs是否通过编码数值或依赖符号推理进行算术操作存在争议,而该论文提出LLMs通过捕捉代数结构(如交换律和恒等性)来学习算术。解决方案的关键在于利用这些代数结构,通过输入-输出关系来观察和学习,从而增强LLMs的算术能力。实验结果表明,利用代数结构可以显著提升LLMs在算术任务中的表现。
链接: https://arxiv.org/abs/2411.16260
作者: Fu-Chieh Chang,Pei-Yuan Wu
关键词-EN: Large language models, demonstrated remarkable mathematical, Large language, remarkable mathematical capabilities, decomposes complex reasoning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable mathematical capabilities, largely driven by chain-of-thought (CoT) prompting, which decomposes complex reasoning into step-by-step solutions. This approach has enabled significant advancements, as evidenced by performance on benchmarks like GSM8K and MATH. However, the mechanisms underlying LLMs’ ability to perform arithmetic in a single step of CoT remain poorly understood. Existing studies debate whether LLMs encode numerical values or rely on symbolic reasoning, while others explore attention and multi-layered processing in arithmetic tasks. In this work, we propose that LLMs learn arithmetic by capturing algebraic structures, such as \emphCommutativity and \emphIdentity properties. Since these structures are observable through input-output relationships, they can generalize to unseen data. We empirically demonstrate that LLMs can learn algebraic structures using a custom dataset of arithmetic problems. Our findings indicate that leveraging algebraic structures can enhance the LLMs’ arithmetic capabilities, offering insights into improving their arithmetic performance.
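下面是一个按交换律与恒等性构造算术探测样例的极简草图(占位实现,非论文官方的数据集构造代码),用于说明如何仅从输入-输出关系检查模型是否遵循这些代数结构。

```python
# 假设性示意:构造体现交换律与恒等性的算术探测集,并统计模型回答的一致性。
# query_model 为占位函数,代表被测 LLM 的单步算术回答;实际实验中应换成模型调用。
import random

def query_model(prompt: str) -> int:
    """占位:此处用真实算术代替 LLM 的回答。"""
    a, op, b = prompt.split()
    return eval(f"{a}{op}{b}")

def probe_algebraic_structures(n_cases: int = 100, seed: int = 0):
    rng = random.Random(seed)
    comm_ok = ident_ok = 0
    for _ in range(n_cases):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        # 交换律:a + b 与 b + a 的回答应一致
        comm_ok += query_model(f"{a} + {b}") == query_model(f"{b} + {a}")
        # 恒等性:a + 0 的回答应等于 a
        ident_ok += query_model(f"{a} + 0") == a
    return comm_ok / n_cases, ident_ok / n_cases

print(probe_algebraic_structures())  # 在占位实现下应输出 (1.0, 1.0)
```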
zh
[NLP-30] NormXLogit: The Head-on-Top Never Lies
【速读】: 该论文试图解决现有大型语言模型(LLMs)解释性方法依赖于特定模型架构且计算成本高的问题。解决方案的关键在于提出了一种名为NormXLogit的新技术,该技术通过分析输入和输出表示来评估单个输入词元的重要性。具体来说,NormXLogit利用词嵌入的范数来捕捉输入词元的重要性,并揭示词元重要性与模型最终预测之间的显著关系。实验结果表明,NormXLogit在忠实性方面优于现有的基于梯度的方法,并且在逐层解释方面表现优于最突出的架构特定方法。
链接: https://arxiv.org/abs/2411.16252
作者: Sina Abbasi,Mohammad Reza Modarres,Mohammad Taher Pilehvar
关键词-EN: building large language, large language models, Transformer architecture, dominant choice, choice for building
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The Transformer architecture has emerged as the dominant choice for building large language models (LLMs). However, with new LLMs emerging on a frequent basis, it is important to consider the potential value of architecture-agnostic approaches that can provide interpretability across a variety of architectures. Despite recent successes in the interpretability of LLMs, many existing approaches rely on complex methods that are often tied to a specific model design and come with a significant computational cost. To address these limitations, we propose a novel technique, called NormXLogit, for assessing the significance of individual input tokens. This method operates based on the input and output representations associated with each token. First, we demonstrate that during the pre-training of LLMs, the norms of word embeddings capture the importance of input tokens. Second, we reveal a significant relationship between a token’s importance and the extent to which its representation can resemble the model’s final prediction. Through extensive analysis, we show that our approach consistently outperforms existing gradient-based methods in terms of faithfulness. Additionally, our method achieves better performance in layer-wise explanations compared to the most prominent architecture-specific methods.
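基于摘要中“嵌入范数 × 与最终预测的相似程度”这一思路,下面给出一个以 GPT-2 为例的粗略草图(属于整理者的推测性实现,并非论文官方的 NormXLogit 代码):把每个输入 token 的嵌入范数,与其最终表示经 LM 头投影后在“模型预测 token”上的 logit 相乘,作为该 token 的重要性分数。

```python
# 假设性示意:范数 × logit 的 token 重要性打分(以 GPT-2 为例,非论文官方实现)。
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

enc = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# (1) 输入词嵌入的范数
input_embeds = model.get_input_embeddings()(enc["input_ids"])[0]   # [seq, d]
embed_norms = input_embeds.norm(dim=-1)                            # [seq]

# (2) 每个位置的最终表示经 LM 头投影后,在模型最终预测 token 上的 logit
final_hidden = out.hidden_states[-1][0]                            # [seq, d]
pred_id = out.logits[0, -1].argmax()                               # 模型对下一个词的预测
logits_per_pos = model.lm_head(final_hidden)[:, pred_id]           # [seq]

importance = embed_norms * logits_per_pos                          # 范数 × logit
for t, s in zip(tok.convert_ids_to_tokens(enc["input_ids"][0].tolist()),
                importance.tolist()):
    print(f"{t:>10s}  {s:8.2f}")
```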
zh
[NLP-31] ransparent Neighborhood Approximation for Text Classifier Explanation
【速读】: 该论文试图解决生成式模型在解释文本分类器时缺乏透明性和可解释性的问题。解决方案的关键在于引入一种基于概率的编辑方法 (probability-based editing method),替代传统的黑箱文本生成器。通过在文本上下文中实施基于概率的操作来生成邻近文本,这种方法不仅提高了解释的质量,还增强了整个解释过程的透明度和可控性。论文提出的XPROB方法在两个实际数据集上的评估中表现出与生成式解释器相当的性能,同时具有更高的稳定性和透明度。
链接: https://arxiv.org/abs/2411.16251
作者: Yi Cai,Arthur Zimek,Eirini Ntoutsi,Gerhard Wunder
关键词-EN: Recent literature highlights, deploying generative models, improve synthetic instance, synthetic instance quality, explaining text classifiers
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: IEEE DSAA’24
点击查看摘要
Abstract:Recent literature highlights the critical role of neighborhood construction in deriving model-agnostic explanations, with a growing trend toward deploying generative models to improve synthetic instance quality, especially for explaining text classifiers. These approaches overcome the challenges in neighborhood construction posed by the unstructured nature of texts, thereby improving the quality of explanations. However, the deployed generators are usually implemented via neural networks and lack inherent explainability, sparking arguments over the transparency of the explanation process itself. To address this limitation while preserving neighborhood quality, this paper introduces a probability-based editing method as an alternative to black-box text generators. This approach generates neighboring texts by implementing manipulations based on in-text contexts. Substituting the generator-based construction process with recursive probability-based editing, the resultant explanation method, XPROB (explainer with probability-based editing), exhibits competitive performance according to the evaluation conducted on two real-world datasets. Additionally, XPROB’s fully transparent and more controllable construction process leads to superior stability compared to the generator-based explainers.
zh
[NLP-32] DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings
【速读】: 该论文试图解决基础模型对基于群体的偏见的鲁棒性问题。解决方案的关键是提出了一种名为DoubleCCA的方法,该方法通过利用随机句子和典型相关分析(CCA)来丰富基础模型的文本嵌入。具体步骤包括:首先生成各种随机句子以扩展原始提示,然后使用额外的句子嵌入模型生成这些随机句子的不同文本嵌入,最后通过两次CCA对齐和重构这些表示,使其回到原始表示空间。该方法在多种任务和数据集上展示了其有效性,不仅在性能上超越现有方法,而且在鲁棒性方面也有显著提升。DoubleCCA方法简单易实现,并能轻松集成到现有模型中,为提高基础模型对群体偏见的鲁棒性提供了一个实用解决方案。
链接: https://arxiv.org/abs/2411.16236
作者: Hong Liu,Yitong Lu
关键词-EN: Canonical Correlation Analysis, foundation models, Correlation Analysis, paper presents, Canonical Correlation
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 6 figures, 2 tables
点击查看摘要
Abstract:This paper presents a novel method to improve the robustness of foundation models to group-based biases. We propose a simple yet effective method, called DoubleCCA, that leverages random sentences and Canonical Correlation Analysis (CCA) to enrich the text embeddings of the foundation model. First, we generate various random sentences that augment the original prompts, which extends the original prompts with random words or character sequences. Second, we use an additional sentence embedding model to generate different text embeddings with respect to these random sentences. We then apply CCA twice to align the representations and reconstruct them back to the original representation space. We demonstrate the effectiveness of our method on a variety of tasks and datasets, showing that it outperforms existing methods in terms of both performance and robustness. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of foundation models to group-based biases.
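下面是对“随机句子 + 额外句向量模型 + CCA 对齐与重构”这一流程的极简示意(编码函数与数据均为占位,且只演示一次 CCA 对齐与重构,两次使用 CCA 的具体方式以论文为准)。

```python
# 假设性示意:用 CCA 对齐两个编码器在同一批随机句子上的嵌入,再将 prompt 嵌入投影并重构。
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

def encode_a(sentences):   # 占位:基础模型的文本编码器
    return rng.normal(size=(len(sentences), 64))

def encode_b(sentences):   # 占位:额外的句向量模型
    return rng.normal(size=(len(sentences), 48))

# 1) 在原始 prompt 后拼接随机词/字符,生成一批随机句子
random_sentences = [f"a photo of a dog xqz{i}" for i in range(200)]

# 2) 两个模型分别编码同一批随机句子
X = encode_a(random_sentences)
Y = encode_b(random_sentences)

# 3) 用 CCA 学习两个嵌入空间之间的对齐
cca = CCA(n_components=16)
cca.fit(X, Y)

# 4) 将原始 prompt 的嵌入投影到共享空间,再重构回原空间得到增强后的表示
prompt_emb = encode_a(["a photo of a dog"])
aligned = cca.transform(prompt_emb)
reconstructed = cca.inverse_transform(aligned)
print(aligned.shape, reconstructed.shape)
```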
zh
[NLP-33] MH-MoE: Multi-Head Mixture-of-Experts
【速读】: 该论文试图解决在保持计算量(FLOPs)和参数数量与稀疏混合专家模型(MoE)相同的情况下,提升多专家混合模型(Mixture-of-Experts, MoE)性能的问题。解决方案的关键在于提出了一种新的多头混合专家模型(Multi-Head Mixture-of-Experts, MH-MoE)实现方式,通过多头机制(multi-head mechanism)共同处理来自不同专家的不同表示空间的信息,从而在语言模型实验中显著提升了模型质量,并展示了与1-bit大型语言模型(Large Language Models, LLMs)如BitNet的兼容性。
链接: https://arxiv.org/abs/2411.16205
作者: Shaohan Huang,Xun Wu,Shuming Ma,Furu Wei
关键词-EN: demonstrates superior performance, superior performance, mechanism to collectively, collectively attend, attend to information
类目: Computation and Language (cs.CL)
备注: 7 pages, 0 figures
点击查看摘要
Abstract:Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.
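为说明“把 token 表示切成多个头、各头分别路由到专家再合并”的核心思想,下面给出一个极简的 PyTorch 草图(非论文实现,也未做与稀疏 MoE 保持 FLOPs/参数量一致所需的各项调整)。

```python
# 假设性示意:Multi-Head MoE 层的最小草图(top-1 路由,维度与专家数均为任意示例值)。
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadMoE(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_experts=8, d_ff=512):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.split = nn.Linear(d_model, d_model)     # 多头切分前的投影
        self.merge = nn.Linear(d_model, d_model)     # 合并各头输出的投影
        self.router = nn.Linear(self.d_head, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.d_head, d_ff), nn.GELU(),
                          nn.Linear(d_ff, self.d_head))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: [batch, seq, d_model]
        b, s, d = x.shape
        sub = self.split(x).view(b * s * self.n_heads, self.d_head)  # 每个头视作一个子 token
        gate = F.softmax(self.router(sub), dim=-1)                   # 路由分布
        top1 = gate.argmax(dim=-1)                                   # top-1 路由
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(sub[mask]) * gate[mask, e].unsqueeze(-1)
        return self.merge(out.view(b, s, d))

x = torch.randn(2, 10, 256)
print(MultiHeadMoE()(x).shape)    # torch.Size([2, 10, 256])
```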
zh
[NLP-34] Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models
【速读】: 该论文试图解决多模态大语言模型 (MLLMs) 在视频-文本对齐任务中高质量偏好数据稀缺的问题。解决方案的关键在于提出了一个名为 MMAIP-V 的高质量视频问答 (VQA) 偏好数据集,该数据集通过从响应分布中采样并使用外部评分函数进行响应评估来构建。此外,论文还提出了 Iter-W2S-RLAIF 框架,通过迭代更新参考模型和参数外推,逐步增强 MLLMs 的对齐能力。最终,论文还提出了一种无偏且信息完整的 VQA 评估方案。实验结果表明,MMAIP-V 对 MLLMs 的偏好学习有益,而 Iter-W2S-RLAIF 则充分利用了 MMAIP-V 中的对齐信息。
链接: https://arxiv.org/abs/2411.16201
作者: Hao Yi,Qingyang Li,Yulan Hu,Fuzheng Zhang,Di Zhang,Yong Liu
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:High-quality video-text preference data is crucial for Multimodal Large Language Models (MLLMs) alignment. However, existing preference data is very scarce. Obtaining VQA preference data for preference training is costly, and manually annotating responses is highly unreliable, which could result in low-quality pairs. Meanwhile, AI-generated responses controlled by temperature adjustment lack diversity. To address these issues, we propose a high-quality VQA preference dataset, called Multiple Multimodal Artificial Intelligence Preference Datasets in VQA (MMAIP-V), which is constructed by sampling from the response distribution set and using an external scoring function for response evaluation. Furthermore, to fully leverage the preference knowledge in MMAIP-V and ensure sufficient optimization, we propose Iterative Weak-to-Strong Reinforcement Learning from AI Feedback for video MLLMs (Iter-W2S-RLAIF), a framework that gradually enhances MLLMs' alignment capabilities by iteratively updating the reference model and performing parameter extrapolation. Finally, we propose an unbiased and information-complete evaluation scheme in VQA evaluation. Experiments demonstrate that MMAIP-V is beneficial for MLLMs in preference learning and Iter-W2S-RLAIF fully exploits the alignment information in MMAIP-V. We believe that the proposed automatic VQA preference data generation pipeline based on AI feedback can greatly promote future work in the MLLMs alignment. Code and dataset are available at this https URL (MMAIP-V_Iter-W2S-RLAIF-702F).
zh
[NLP-35] Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models
【速读】: 该论文试图解决大型语言模型 (Large Language Models, LLMs) 在处理复杂推理任务时面临的幻觉 (hallucinations) 问题,这限制了LLMs的实际应用。解决方案的关键在于引入第三方LLMs来调整代理的注意力权重,通过不确定性估计和置信度分析优化多代理系统中的共识形成。具体方法包括:1) 通过第三方LLMs的不确定性估计和置信度分析,调整各代理的注意力权重,促进代理间的深入辩论,从而优化共识形成;2) 在算术数据集上的实验验证了该方法的有效性,超越了传统的多代理基线。这一研究为大型模型在处理复杂任务时减轻幻觉现象提供了新的视角。
链接: https://arxiv.org/abs/2411.16189
作者: Zhihua Duan,Jialin Wang
关键词-EN: Large Language Models, Language Models, Large Language, complex reasoning tasks, face challenges
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) still face challenges when dealing with complex reasoning tasks, often resulting in hallucinations, which limit the practical application of LLMs. To alleviate this issue, this paper proposes a new method that integrates different LLMs to expand the knowledge boundary, reduce dependence on a single model, and promote in-depth debate among agents. The main contributions include: 1) Introducing third-party LLMs to adjust the attention weights of agents through uncertainty estimation and confidence analysis, optimizing consensus formation in multi-agent systems; 2) Experiments on arithmetic datasets have validated the effectiveness of the method, surpassing traditional multi-agent baselines. This research provides a new perspective for large models to alleviate hallucination phenomena when dealing with complex tasks.
zh
[NLP-36] LLM Augmentations to support Analytical Reasoning over Multiple Documents
【速读】: 该论文试图解决如何利用大型语言模型 (LLMs) 增强情报分析中的深度分析推理能力的问题。解决方案的关键在于开发了一种名为动态证据树 (dynamic evidence trees, DETs) 的记忆模块,以增强 LLM 的能力,使其能够开发和跟踪多个调查线索。通过在多个数据集上的广泛实验,论文指出当前的 LLMs 在支持情报分析方面仍存在不足,并提出了改进 LLMs 以适应复杂推理应用的建议。
链接: https://arxiv.org/abs/2411.16116
作者: Raquib Bin Yousuf,Nicholas Defelice,Mandar Sharma,Shengzhe Xu,Naren Ramakrishnan
关键词-EN: large language models, enhance in-depth analytical, in-depth analytical reasoning, language models, demonstrated ability
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 2024 IEEE International Conference on Big Data (IEEE BigData 2024)
点击查看摘要
Abstract:Building on their demonstrated ability to perform a variety of tasks, we investigate the application of large language models (LLMs) to enhance in-depth analytical reasoning within the context of intelligence analysis. Intelligence analysts typically work with massive dossiers to draw connections between seemingly unrelated entities, and uncover adversaries’ plans and motives. We explore if and how LLMs can be helpful to analysts for this task and develop an architecture to augment the capabilities of an LLM with a memory module called dynamic evidence trees (DETs) to develop and track multiple investigation threads. Through extensive experiments on multiple datasets, we highlight how LLMs, as-is, are still inadequate to support intelligence analysts and offer recommendations to improve LLMs for such intricate reasoning applications.
zh
[NLP-37] Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
【速读】: 该论文试图解决的问题是:在大规模语言模型(LLMs)中,机制性可解释性(Mechanistic interpretability)所识别的电路(circuits)在面对不同提示格式时,其泛化能力如何。具体来说,论文关注的是间接对象识别(Indirect Object Identification, IOI)电路在GPT-2 small模型中的泛化能力,特别是在面对挑战原有算法假设的提示变体时。解决方案的关键在于通过实验验证,发现IOI电路在面对不同提示变体时,能够惊人地泛化,主要通过重用其所有组件和机制,并仅增加额外的输入边。此外,论文还发现了一种称为S2 Hacking的机制,解释了电路在原有算法应失败的情况下仍能泛化的原因。这些发现表明,LLMs中的电路可能比之前认识到的更具灵活性和通用性,强调了研究电路泛化对于更好地理解这些模型广泛能力的重要性。
链接: https://arxiv.org/abs/2411.16105
作者: Jatin Nainani,Sankaran Vaidyanathan,AJ Yeung,Kartik Gupta,David Jensen
关键词-EN: Mechanistic interpretability aims, performing specific tasks, Mechanistic interpretability, large neural networks, interpretability aims
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 8 figures
点击查看摘要
Abstract:Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the abilities of large language models (LLMs) to generalize across various prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the models generalization results from reusing the same circuit components, the components behaving differently, or the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well-studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while only adding additional input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism that explains this which we term S2 Hacking. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.
zh
[NLP-38] Cautious Optimizers: Improving Training with One Line of Code
【速读】: 该论文试图解决现有优化器在Transformer预训练中速度和稳定性不足的问题。解决方案的关键在于提出了一种名为“谨慎优化器”(Cautious Optimizer)的单行代码修改,适用于基于动量的优化器,如C-AdamW和C-Lion。这一修改在理论上保留了Adam的哈密顿函数(Hamiltonian function),并且在Lyapunov分析下不破坏收敛性保证。通过这一理论洞察,揭示了一类新的优化器家族,并在实验中验证了其在Llama和MAE预训练中的加速效果,最高可达1.47倍。
链接: https://arxiv.org/abs/2411.16085
作者: Kaizhao Liang,Lizhang Chen,Bo Liu,Qiang Liu
关键词-EN: default optimizer, Abstract, transformer pretraining, Adam Hamiltonian function, preserves Adam Hamiltonian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Discrete Mathematics (cs.DM)
备注:
点击查看摘要
Abstract:AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers with the constraint of only positive outcomes. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we rename the Cautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing speed-up on Llama and MAE pretraining up to 1.47x. Code is available at this https URL
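下面用一个手写的 AdamW 风格更新演示“谨慎掩码”这一行修改的思路:把与当前梯度方向不一致的更新分量置零并重新缩放。仅为示意,掩码后的缩放方式等细节以论文与官方代码为准。

```python
# 假设性示意:在极简的 AdamW 式更新中插入 cautious 掩码(非论文官方实现)。
import torch

def cautious_adamw_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999),
                        eps=1e-8, weight_decay=1e-2):
    m, v, t = state["m"], state["v"], state["t"] + 1
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    update = m_hat / (v_hat.sqrt() + eps)

    # Cautious 掩码:更新方向与梯度方向不一致的坐标置零,并按保留比例重新缩放
    mask = (update * grad > 0).to(update.dtype)
    update = update * mask * (mask.numel() / (mask.sum() + 1))

    p.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)
    state.update(m=m, v=v, t=t)

p = torch.randn(4)
state = {"m": torch.zeros_like(p), "v": torch.zeros_like(p), "t": 0}
cautious_adamw_step(p, torch.randn(4), state)
print(p)
```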
zh
[NLP-39] SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
【速读】: 该论文试图解决在大型语言模型(LLM)集成到应用程序中时,如何在没有参考或充足标注数据的情况下,评估自然语言生成(NLG)输出的质量和相关性的问题。解决方案的关键是引入了一个名为“SAGEval”的新框架,该框架利用一个批评代理(critiquing Agent)来对LLM评估器生成的评分进行反馈和修正。通过这种方式,即使在没有参考或真实标签的情况下,批评代理也能有效纠正LLM评估器的评分,从而减少对标注数据的依赖,特别是在复杂NLG评估场景中,如生成具有不同响应风格的JSON结构表单或调查问卷。
链接: https://arxiv.org/abs/2411.16077
作者: Reshmi Ghosh,Tianyi Yao,Lizzy Chen,Sadid Hasan,Tianwei Chen,Dario Bernal,Huitian Jiao,H M Sajjad Hossain
关键词-EN: Large Language Model, Google Workspace, suite and Google, Workspace for creating, Large Language
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Large Language Model (LLM) integrations into applications like Microsoft365 suite and Google Workspace for creating/processing documents, emails, presentations, etc. have led to considerable enhancements in productivity and time savings. But as these integrations become more complex, it is paramount to ensure that the quality of output from the LLM-integrated applications is relevant and appropriate for use. Identifying the need to develop robust evaluation approaches for natural language generation, wherein references/ground labels don't exist or aren't amply available, this paper introduces a novel framework called "SAGEval" which utilizes a critiquing Agent to provide feedback on scores generated by LLM evaluators. We show that the critiquing Agent is able to rectify scores from LLM evaluators, in absence of references/ground-truth labels, thereby reducing the need for labeled data even for complex NLG evaluation scenarios, like the generation of JSON-structured forms/surveys with responses in different styles like multiple choice, likert ratings, single choice questions, etc.
zh
[NLP-40] Predicting Emergent Capabilities by Finetuning
【速读】: 该论文试图解决现代大型语言模型(LLM)扩展中的一个基本开放挑战,即对涌现能力(emergent capabilities)缺乏理解的问题。具体来说,虽然语言模型的预训练损失(pretraining loss)随着计算资源的增加是高度可预测的,但下游任务的能力却远不如预训练损失那样可预测,有时甚至会出现突变(emergent jumps),这使得预测未来模型的能力变得困难。论文的关键解决方案在于提出了一个名为“涌现预测”(emergence prediction)的任务,即在给定当前LLM在某一任务上的随机少样本准确率的情况下,预测未来模型(如GPT-N+1)是否会在该任务上表现出非平凡的准确率。论文发现,通过在特定任务上微调LLM,可以改变涌现能力出现的扩展点,使其向能力较弱的模型转移。为此,论文提出了一种操作化方法,即通过在不同数据量上微调LLM,并拟合一个参数化函数来预测涌现能力何时出现(即“涌现定律”emergence laws)。研究在四个标准NLP基准测试(MMLU, GSM8K, CommonsenseQA, 和 CoLA)上验证了这一方法,并展示了其在实际应用中的潜力。
链接: https://arxiv.org/abs/2411.16035
作者: Charlie Snell,Eric Wallace,Dan Klein,Sergey Levine
关键词-EN: fundamental open challenge, fundamental open, open challenge, challenge in modern, lack of understanding
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:A fundamental open challenge in modern LLM scaling is the lack of understanding around emergent capabilities. In particular, language model pretraining loss is known to be highly predictable as a function of compute. However, downstream capabilities are far less predictable – sometimes even exhibiting emergent jumps – which makes it challenging to anticipate the capabilities of future models. In this work, we first pose the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can we predict whether future models (GPT-N+1) will have non-trivial accuracy on that task? We then discover a simple insight for this problem: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models. To operationalize this insight, we can finetune LLMs with varying amounts of data and fit a parametric function that predicts when emergence will occur (i.e., “emergence laws”). We validate this approach using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and CoLA). Using only small-scale LLMs, we find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged. Finally, we present a case study of two realistic uses for emergence prediction.
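下面给出“拟合参数化曲线并外推涌现点”的一个极简示意:对不同计算量的模型在微调后测得的准确率拟合一条 S 形曲线,再外推更大计算量下的表现。数据与函数形式均为虚构,论文中 emergence law 的具体形式以原文为准。

```python
# 假设性示意:用 sigmoid 形式的曲线拟合"准确率-计算量"关系并外推涌现点(虚构数据)。
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_c, a, b, c0):
    return a / (1.0 + np.exp(-b * (log_c - c0)))

# 虚构:若干小模型(以 log10 计算量表示)在某任务上微调后的准确率
log_compute = np.array([19.0, 19.5, 20.0, 20.5, 21.0, 21.5])
accuracy    = np.array([0.02, 0.03, 0.06, 0.15, 0.35, 0.55])

params, _ = curve_fit(sigmoid, log_compute, accuracy,
                      p0=[1.0, 2.0, 21.0], maxfev=10000)

# 外推:预测一个计算量大 4 倍(log10 增加约 0.6)的未来模型是否"涌现"
future = log_compute[-1] + np.log10(4)
pred = sigmoid(future, *params)
print(f"predicted accuracy at 4x compute: {pred:.2f}  (emerged: {pred > 0.1})")
```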
zh
[NLP-41] ransCompressor: LLM -Powered Multimodal Data Compression for Smart Transportation
【速读】: 该论文试图解决智能交通系统中多模态传感器数据的高效压缩与解压缩问题。解决方案的关键在于引入了一个名为TransCompressor的新框架,该框架利用大型语言模型(Large Language Models, LLMs)来实现对多种传感器数据(如气压计、速度和高度测量)的高效压缩和解压缩。通过精心设计的提示(prompts),LLMs能够利用其广泛的知识库来优化数据压缩过程,从而在智能交通环境中提升数据存储、分析和检索的效率。
链接: https://arxiv.org/abs/2411.16020
作者: Huanqi Yang,Rucheng Wu,Weitao Xu
关键词-EN: Large Language Models, Language Models, Large Language, incorporation of Large, improving data management
类目: Computation and Language (cs.CL)
备注: 6 pages
点击查看摘要
Abstract:The incorporation of Large Language Models (LLMs) into smart transportation systems has paved the way for improving data management and operational efficiency. This study introduces TransCompressor, a novel framework that leverages LLMs for efficient compression and decompression of multimodal transportation sensor data. TransCompressor has undergone thorough evaluation with diverse sensor data types, including barometer, speed, and altitude measurements, across various transportation modes like buses, taxis, and MTRs. Comprehensive evaluation illustrates the effectiveness of TransCompressor in reconstructing transportation sensor data at different compression ratios. The results highlight that, with well-crafted prompts, LLMs can utilize their vast knowledge base to contribute to data compression processes, enhancing data storage, analysis, and retrieval in smart transportation settings.
zh
[NLP-42] Exploring Performance Contrasts in TableQA: Step-by-Step Reasoning Boosts Bigger Language Models Limits Smaller Language Models
【速读】: 该论文旨在探讨在TableQA任务中,使用逐步推理方法时,大型语言模型(LMs)与小型LMs之间的性能对比。解决方案的关键在于提出了一种名为Table-Logic的详细提示流程,该流程通过逐步识别关键列和行、确定必要的聚合、计算或比较,并最终推断结果以生成精确预测,从而处理任务。实验结果显示,大型LMs如Llama-3-70B在HybridQA任务中比传统方法提高了7.8%的准确率,而小型LMs如Llama-2-7B则出现了11%的性能下降。研究通过多维度分析,揭示了小型模型在逐步推理方法中的局限性,并提供了改进的潜在方向。
链接: https://arxiv.org/abs/2411.16002
作者: Haoyan Yang,Yixuan Wang,Keyue Tong,Hongjin Zhu,Yuanxin Zhang
关键词-EN: detailed prompting flow, termed Table-Logic, prompting flow, paper proposes, proposes a detailed
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper proposes a detailed prompting flow, termed Table-Logic, to investigate the performance contrasts between bigger and smaller language models (LMs) utilizing step-by-step reasoning methods in the TableQA task. The method processes tasks by sequentially identifying critical columns and rows given question and table with its structure, determining necessary aggregations, calculations, or comparisons, and finally inferring the results to generate a precise prediction. By deploying this method, we observe a 7.8% accuracy improvement in bigger LMs like Llama-3-70B compared to the vanilla on HybridQA, while smaller LMs like Llama-2-7B shows an 11% performance decline. We empirically investigate the potential causes of performance contrasts by exploring the capabilities of bigger and smaller LMs from various dimensions in TableQA task. Our findings highlight the limitations of the step-by-step reasoning method in small models and provide potential insights for making improvements.
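下面是按“定位关键列/行 → 确定聚合或比较 → 推断答案”组织提示词的一个示意性骨架(提示词措辞为虚构示例,`call_llm` 为占位函数,并非论文的 Table-Logic 原始提示)。

```python
# 假设性示意:逐步提示流程的骨架,每一步把之前的中间结果拼回提示词中。
def call_llm(prompt: str) -> str:
    raise NotImplementedError("此处接入任意 LLM API")

def table_logic_answer(question: str, table_text: str) -> str:
    steps = [
        "Step 1: List the columns needed to answer the question.",
        "Step 2: List the rows (by key) that are relevant.",
        "Step 3: State the aggregation, calculation or comparison required.",
        "Step 4: Carry out the operation and give the final answer only.",
    ]
    context = f"Table:\n{table_text}\n\nQuestion: {question}\n"
    answer = ""
    for step in steps:
        prompt = context + answer + "\n" + step
        answer += "\n" + call_llm(prompt)
    return answer.strip().splitlines()[-1]  # 取最后一步给出的最终答案
```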
zh
[NLP-43] Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models
【速读】: 该论文试图解决的问题是:在大语言模型 (LLMs) 的社交和认知能力评估中,这些模型在不同语言和文化背景下展现出的心智理论 (Theory of Mind, ToM) 能力尚不清楚。解决方案的关键在于:(1) 将现有的 ToM 数据集翻译成多种语言,创建一个多语言的 ToM 数据集;(2) 在这些翻译中融入文化特定元素,以反映不同群体相关的社交和认知场景。通过这两个关键步骤,论文对六个最先进的 LLMs 进行了广泛的评估,以测量它们在翻译和文化适应数据集上的 ToM 表现,从而揭示语言和文化多样性对模型展示 ToM 能力的影响,并质疑其社交推理能力。
链接: https://arxiv.org/abs/2411.15999
作者: Jayanta Sadhu,Ayan Antik Khan,Noshin Nawal,Sanju Basak,Abhik Bhattacharjee,Rifat Shahriyar
关键词-EN: Theory of Mind, attribute mental states, infer and attribute, attribute mental, mental states
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Theory of Mind (ToM) refers to the cognitive ability to infer and attribute mental states to oneself and others. As large language models (LLMs) are increasingly evaluated for social and cognitive capabilities, it remains unclear to what extent these models demonstrate ToM across diverse languages and cultural contexts. In this paper, we introduce a comprehensive study of multilingual ToM capabilities aimed at addressing this gap. Our approach includes two key components: (1) We translate existing ToM datasets into multiple languages, effectively creating a multilingual ToM dataset and (2) We enrich these translations with culturally specific elements to reflect the social and cognitive scenarios relevant to diverse populations. We conduct extensive evaluations of six state-of-the-art LLMs to measure their ToM performance across both the translated and culturally adapted datasets. The results highlight the influence of linguistic and cultural diversity on the models’ ability to exhibit ToM, and questions their social reasoning capabilities. This work lays the groundwork for future research into enhancing LLMs’ cross-cultural social cognition and contributes to the development of more culturally aware and socially intelligent AI systems. All our data and code are publicly available.
zh
[NLP-44] Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown
【速读】: 该论文试图解决大型语言模型(LLMs)在长文本生成中事实性不足的问题。解决方案的关键在于通过分析不同LLMs(如GPT-4、Gemini-1.5-Pro、Claude-3-Opus、Llama-3-70B和Mistral)在长文本生成中的事实性表现,揭示生成文本中事实性得分随句子位置后移而下降的现象,并伴随不支持声明数量的增加。论文进一步探讨了不同评估设置(如Self-Known和Self-Unknown)对LLMs自我判断准确性的影响,发现即使是最先进的模型也难以达到完美的Self-Known得分,且Self-Unknown得分始终高于零,表明模型在自我评估中存在持续的不确定性。研究还指出,Self-Known得分与事实性提升正相关,而Self-Unknown得分与事实性下降相关。这些发现不仅揭示了当前LLMs在长文本生成中的局限性,也为提升长文本生成的事实性提供了有价值的见解。
链接: https://arxiv.org/abs/2411.15993
作者: Lifu Tu,Rui Meng,Shafiq Joty,Yingbo Zhou,Semih Yavuz
关键词-EN: demonstrated strong capabilities, Large language models, long-form text generation, Large language, long-form generation
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality scores tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). The results indicate that even advanced models like GPT-4 and Gemini-1.5-Pro fail to achieve perfect Self-Known scores, while their Self-Unknown scores remain notably above zero, reflecting ongoing uncertainty in their self-assessments. Moreover, we find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality. Interestingly, even without significant changes in the models' self-judgment (Self-Known and Self-Unknown), the number of unsupported claims can increase, likely as an artifact of long-form generation. These findings show the limitations of current LLMs in long-form generation, and provide valuable insights for improving factuality in long-form text generation.
zh
[NLP-45] Generative Context Distillation
【速读】: 该论文试图解决大型语言模型应用中固定且冗长的提示(prompts)导致的显著计算开销问题。解决方案的关键是提出了一种轻量级的提示内部化方法,称为生成式上下文蒸馏(Generative Context Distillation, GCD)。该方法通过联合训练的方式,不仅复制了带有提示输入的模型的行为,还生成了提示内容及其对应的模型行为变化的原因。此外,论文引入了一种数据合成技术,通过交换代理(agent)和环境(environment)的角色来自动收集对话数据集,从而在没有交互环境的条件下进行有效训练。这种方法特别适用于仅有预定义提示而没有相应训练数据集的场景。通过内部化复杂提示,GCD实现了高性能和高效的推理,无需显式提示。
链接: https://arxiv.org/abs/2411.15927
作者: Haebin Shin,Lei Ji,Yeyun Gong,Sungdong Kim,Eunbi Choi,Minjoon Seo
关键词-EN: significant computational overhead, recent large language, Generative Context Distillation, large language model, language model based
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Context Distillation (GCD), a lightweight prompt internalization method that employs a joint training approach. This method not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model’s behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Context Distillation enables high-performance and efficient inference without the need for explicit prompts.
zh
[NLP-46] Evaluating Large Language Models for Causal Modeling
【速读】: 该论文试图解决将因果领域知识转化为更符合因果数据科学指南的表示形式的问题。解决方案的关键在于引入两个新任务:将因果领域知识提炼为因果变量和使用大型语言模型(LLMs)检测交互实体。研究表明,当代LLMs(如GPT-4-turbo和Llama3-70b)在提炼因果领域知识为因果变量方面表现优于稀疏专家模型(如Mixtral-8x22b),而在识别交互实体方面,稀疏专家模型则更为有效。此外,论文强调了生成实体的领域与所选LLM在因果建模中的性能之间的依赖关系。
链接: https://arxiv.org/abs/2411.15888
作者: Houssam Razouk,Leonie Benischke,Georg Niess,Roman Kern
关键词-EN: causal domain knowledge, causal data science, domain knowledge, transforming causal domain, data science
类目: Computation and Language (cs.CL)
备注: 13 pages, 6 figures, 4 tables
点击查看摘要
Abstract:In this paper, we consider the process of transforming causal domain knowledge into a representation that aligns more closely with guidelines from causal data science. To this end, we introduce two novel tasks related to distilling causal domain knowledge into causal variables and detecting interaction entities using LLMs. We have determined that contemporary LLMs are helpful tools for conducting causal modeling tasks in collaboration with human experts, as they can provide a wider perspective. Specifically, LLMs, such as GPT-4-turbo and Llama3-70b, perform better in distilling causal domain knowledge into causal variables compared to sparse expert models, such as Mixtral-8x22b. On the contrary, sparse expert models such as Mixtral-8x22b stand out as the most effective in identifying interaction entities. Finally, we highlight the dependency between the domain where the entities are generated and the performance of the chosen LLM for causal modeling.
zh
[NLP-47] LLM s Do Not Think Step-by-step In Implicit Reasoning
【速读】: 该论文试图解决的问题是:隐式链式思维(implicit Chain-of-Thought, CoT)是否等同于显式链式思维(explicit CoT)。解决方案的关键在于通过实验探究模型在执行隐式CoT时的隐藏状态信息,结果表明大型语言模型(LLMs)在隐式CoT过程中几乎不考虑中间步骤,而是依赖经验而非严格的逐步推理。此外,研究发现LLMs的隐式推理能力不稳定且易受影响,这再次强调了显式CoT在支持复杂任务中的必要性。
链接: https://arxiv.org/abs/2411.15862
作者: Yijiong Yu
关键词-EN: remarkably enhance LLMs’, enhance LLMs’ performance, remarkably enhance, CoT, explicit CoT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:It is well known that Chain-of-Thought can remarkably enhance LLMs' performance on complex tasks. However, because it also introduces slower inference speeds and higher computational costs, many studies have attempted to use implicit CoT, which does not need LLMs to explicitly generate the intermediate steps. But there is still a gap between their efficacy and that of typical explicit CoT methods. This leaves us with a question: does implicit CoT really equal explicit CoT? Therefore, in this study, we address this question through experiments. We probe the information of intermediate steps from the model's hidden states when it is performing implicit CoT. The results surprisingly indicate that LLMs hardly think about intermediate steps, suggesting they may just rely on experience rather than strict step-by-step reasoning. Moreover, we find LLMs' implicit reasoning capabilities are susceptible and unstable, reaffirming the necessity of explicit CoT to effectively support complex tasks.
zh
[NLP-48] Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
【速读】: 该论文试图解决训练数据质量与数量对小型语言模型(SLMs)性能的相对影响问题。解决方案的关键在于通过实验分析不同数据集变体(包括大小和重复率的变化)对模型性能的影响,特别是验证损失、准确性和困惑度等指标。研究结果表明,数据质量对SLMs的整体性能影响更为显著,适量的数据重复可以轻微提升模型准确性而不显著增加困惑度,但过度重复会导致性能显著下降。这一发现不仅有助于优化模型性能,还为降低大规模模型训练的财务和计算负担,以及减少环境影响提供了理论支持,从而使AI技术更加民主化和可持续。
链接: https://arxiv.org/abs/2411.15821
作者: Aryan Sajith,Krishna Chaitanya Rao Kathala
关键词-EN: small language models, utilizing the TinyStories, study investigates, small language, data quality versus
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 4 figures
点击查看摘要
Abstract:This study investigates the relative impact of training data quality versus quantity on the performance of small language models (SLMs), utilizing the TinyStories dataset for empirical analysis. Analysis of dataset variations with respect to size (25% and 50% of the original size) and duplication (controlled rates of 25%, 50%, 75%, and 100%) were performed. Model performance was evaluated based on the validation loss, accuracy, and perplexity metrics. Results indicate training data quality plays a more significant role in the overall performance of SLMs, especially given scale of this experiment. Minimal duplication positively impacted model accuracy (+0.87% increase in accuracy at 25% duplication) without significantly increasing perplexity (+0.52% increase going from 0% to 25% duplication) but excessive duplication led to pronounced performance degradation (-40% drop in accuracy at 100% duplication). The implications of this exploration extend beyond just model performance; training large-scale models imposes significant financial and computational burdens, which can be prohibitive for organizations, individuals, and the public at large, especially in developing countries. Additionally, the energy consumption associated with large-scale training raises environmental concerns. Understanding the relative importance of data quality versus quantity could democratize AI technology, making advanced models more accessible and sustainable for all.
zh
[NLP-49] LoRA-Mini : Adaptation Matrices Decomposition and Selective Training
【速读】: 该论文试图解决大型语言模型(LLMs)在任务特定微调过程中面临的计算和存储效率问题。传统微调方法涉及大量参数更新,导致计算成本高且内存需求大。论文提出的解决方案是LoRA-Mini,这是对低秩适应(LoRA)方法的优化。关键在于将低秩矩阵分割为四部分,仅训练其中两个内部矩阵,从而在保持与标准LoRA相当性能的同时,实现了高达20倍的训练参数数量减少,有效提升了参数效率,解决了LLM微调中的计算和存储效率问题。
链接: https://arxiv.org/abs/2411.15804
作者: Ayush Singh,Rajdeep Aher,Shivank Garg
关键词-EN: natural language processing, revolutionized natural language, large language models, task-specific fine-tuning methods, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages
点击查看摘要
Abstract:The rapid advancements in large language models (LLMs) have revolutionized natural language processing, creating an increased need for efficient, task-specific fine-tuning methods. Traditional fine-tuning of LLMs involves updating a large number of parameters, which is computationally expensive and memory-intensive. Low-Rank Adaptation (LoRA) has emerged as a promising solution, enabling parameter-efficient fine-tuning by reducing the number of trainable parameters. However, while LoRA reduces the number of trainable parameters, LoRA modules still create significant storage challenges. We propose LoRA-Mini, an optimized adaptation of LoRA that improves parameter efficiency by splitting low-rank matrices into four parts, with only the two inner matrices being trainable. This approach achieves up to a 20x reduction compared to standard LoRA in the number of trainable parameters while preserving performance levels comparable to standard LoRA, addressing both computational and storage efficiency in LLM fine-tuning.
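下面是对“低秩矩阵拆成四部分、只训练内侧两个”这一思路的极简 PyTorch 草图(拆分尺寸与初始化方式均为假设,非论文官方实现)。

```python
# 假设性示意:LoRA-Mini 式的线性层,外侧矩阵冻结,仅内侧两个小矩阵可训练。
import torch
import torch.nn as nn

class LoRAMiniLinear(nn.Module):
    def __init__(self, base: nn.Linear, r_outer=64, r_inner=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # 冻结原始权重
        d_in, d_out = base.in_features, base.out_features
        self.scale = alpha / r_inner
        # 外侧矩阵:随机初始化后冻结
        self.A_out = nn.Parameter(torch.randn(d_in, r_outer) * 0.02, requires_grad=False)
        self.B_out = nn.Parameter(torch.randn(r_outer, d_out) * 0.02, requires_grad=False)
        # 内侧矩阵:仅有的可训练参数(B_in 置零使初始增量为 0)
        self.A_in = nn.Parameter(torch.randn(r_outer, r_inner) * 0.02)
        self.B_in = nn.Parameter(torch.zeros(r_inner, r_outer))

    def forward(self, x):
        delta = self.A_out @ self.A_in @ self.B_in @ self.B_out   # [d_in, d_out]
        return self.base(x) + self.scale * (x @ delta)

layer = LoRAMiniLinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 仅内侧两个小矩阵的参数量
```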
zh
[NLP-50] A Method for Building Large Language Models with Predefined KV Cache Capacity
【速读】: 该论文试图解决在Transformer解码器架构中,处理无限上下文时传统Key-Value (KV)缓存导致的内存消耗过大的问题。解决方案的关键在于引入固定长度的KV缓存,通过动态更新键值向量序列,在有限的缓存容量内实现高效的推理,从而显著减少内存使用并保持模型性能和系统吞吐量。
链接: https://arxiv.org/abs/2411.15785
作者: Zhonghua Yi,Ge Niu,Lei Wang,Wei Tang,Liqiu Zhang
关键词-EN: Transformer decode-only architectures, building large language, layers in Transformer, Transformer decode-only, large language models
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper proposes a method for building large language models with predefined Key-Value (KV) cache capacity, particularly suitable for the attention layers in Transformer decode-only architectures. This method introduces fixed-length KV caches to address the issue of excessive memory consumption in traditional KV caches when handling infinite contexts. By dynamically updating the key-value vector sequences, it achieves efficient inference within limited cache capacity, significantly reducing memory usage while maintaining model performance and system throughput. Experimental results show that this method significantly reduces memory usage while maintaining the model’s inference quality.
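下面用一个环形缓冲区演示“预设容量的 KV 缓存”的形式:超过容量后覆盖最旧的键值向量。被替换位置的选择与更新策略仅为假设,具体以论文为准。

```python
# 假设性示意:固定容量的 KV 缓存(环形覆盖),用于注意力层的推理阶段。
import torch

class FixedCapacityKVCache:
    def __init__(self, capacity: int, n_heads: int, d_head: int):
        self.capacity = capacity
        self.k = torch.zeros(n_heads, capacity, d_head)
        self.v = torch.zeros(n_heads, capacity, d_head)
        self.size = 0            # 已写入的有效长度
        self.next = 0            # 下一个写入位置(环形)

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        """k_new/v_new: [n_heads, d_head],写入一个新 token 的键值。"""
        self.k[:, self.next] = k_new
        self.v[:, self.next] = v_new
        self.next = (self.next + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def get(self):
        """返回当前有效的 K/V,供注意力计算使用。"""
        return self.k[:, :self.size], self.v[:, :self.size]

cache = FixedCapacityKVCache(capacity=4, n_heads=2, d_head=8)
for _ in range(6):                          # 写入 6 个 token,超出容量后覆盖最旧的
    cache.append(torch.randn(2, 8), torch.randn(2, 8))
k, v = cache.get()
print(k.shape)   # torch.Size([2, 4, 8])
```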
zh
[NLP-51] Detecting Turkish Synonyms Used in Different Time Periods
【速读】: 该论文试图解决历史文本处理中由于语言动态结构变化导致的性能下降问题,特别是针对土耳其语在20世纪语言改革后快速变化的情景。解决方案的关键在于提出了两种检测不同时期同义词的方法:第一种方法利用正交普鲁克特斯方法(Orthogonal Procrustes method)对不同时期文档生成的嵌入空间进行对齐;第二种方法在此基础上进一步引入斯皮尔曼相关系数(Spearman’s correlation),分析词频随时间的变化。实验结果表明,这两种方法在处理1960年代至1980年代的文本时表现优异,但随时间推移,性能略有下降。
链接: https://arxiv.org/abs/2411.15768
作者: Umur Togay Yazar,Mucahid Kutlu
关键词-EN: poses significant challenges, languages poses significant, applying natural language, natural language processing, language processing models
类目: Computation and Language (cs.CL)
备注: published at Innovations in Intelligent Systems and Applications Conference (Akıllı Sistemlerde Yenilikler ve Uygulamaları Konferansı - ASYU) 2024
点击查看摘要
Abstract:Dynamic structure of languages poses significant challenges in applying natural language processing models on historical texts, causing decreased performance in various downstream tasks. Turkish is a prominent example of rapid linguistic transformation due to the language reform in the 20th century. In this paper, we propose two methods for detecting synonyms used in different time periods, focusing on Turkish. In our first method, we use Orthogonal Procrustes method to align the embedding spaces created using documents written in the corresponding time periods. In our second method, we extend the first one by incorporating Spearman’s correlation between frequencies of words throughout the years. In our experiments, we show that our proposed methods outperform the baseline method. Furthermore, we observe that the efficacy of our methods remains consistent when the target time period shifts from the 1960s to the 1980s. However, their performance slightly decreases for subsequent time periods.
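论文的第一种方法(嵌入空间对齐)可以用 scipy 的 orthogonal_procrustes 写成如下极简示意(词向量与词表均为虚构占位数据,仅演示对齐与近义词检索这两步)。

```python
# 假设性示意:用 Orthogonal Procrustes 对齐两个时期的词向量空间,再做近义词检索。
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
dim, n_anchor = 100, 500

# 两个时期各自训练的词向量;anchor 词是两个时期都出现、词义基本稳定的词
emb_old = rng.normal(size=(n_anchor, dim))
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))          # 随机正交变换,模拟坐标系差异
emb_new = emb_old @ Q + rng.normal(size=(n_anchor, dim)) * 0.01

# 求正交矩阵 R,使 emb_old @ R 尽可能接近 emb_new
R, _ = orthogonal_procrustes(emb_old, emb_new)
aligned_old = emb_old @ R

def top_synonyms(query_vec, new_matrix, new_vocab, k=5):
    """在新时期的嵌入空间里按余弦相似度检索近义词候选。"""
    sims = new_matrix @ query_vec / (
        np.linalg.norm(new_matrix, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [new_vocab[i] for i in np.argsort(-sims)[:k]]

new_vocab = [f"word_{i}" for i in range(n_anchor)]
print(top_synonyms(aligned_old[0], emb_new, new_vocab))   # 应包含 word_0 自身
```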
zh
[NLP-52] ableTime: Reformulating Time Series Classification as Zero-Shot Table Understanding via Large Language Models
【速读】: 该论文试图解决大型语言模型(LLMs)在多变量时间序列分类(MTSC)中存在的三个主要瓶颈:(1)难以无损地编码时间序列中的时序和通道特定信息;(2)难以将学习到的表示空间与LLMs的语义空间对齐;(3)需要针对特定任务进行重新训练,计算成本高且劳动密集。解决方案的关键在于提出了一种名为TableTime的方法,该方法将MTSC重新定义为表格理解任务。具体策略包括:(1)将多变量时间序列转换为表格形式,以最大限度地减少信息损失;(2)将表格时间序列表示为文本格式,从而自然地与LLMs的语义空间对齐;(3)设计一个推理框架,整合上下文文本信息、邻域辅助、多路径推理和问题分解,以增强LLMs的推理能力并实现零样本分类。
链接: https://arxiv.org/abs/2411.15737
作者: Jiahao Wang,Mingyue Cheng,Qingyang Mao,Qi Liu,Feiyang Xu,Xin Li,Enhong Chen
关键词-EN: Large language models, Large language, multivariate time series, time series, language models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with the semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is much more difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 publicly representative datasets from the UEA archive verify the superiority of TableTime.
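下面给出“多变量时间序列 → 表格文本”这一步的极简示意(列名与提示词为虚构示例,非论文的 TableTime 官方实现)。

```python
# 假设性示意:把 [T, C] 的多变量时间序列转成文本表格,并拼成零样本分类的提示词。
import numpy as np

def series_to_table(series: np.ndarray, channel_names, decimals=3) -> str:
    """series: [T, C] 的多变量时间序列,返回 markdown 风格的表格字符串。"""
    header = "| t | " + " | ".join(channel_names) + " |"
    sep = "|---" * (series.shape[1] + 1) + "|"
    rows = [
        "| " + str(t) + " | " + " | ".join(f"{v:.{decimals}f}" for v in row) + " |"
        for t, row in enumerate(series)
    ]
    return "\n".join([header, sep] + rows)

x = np.random.default_rng(0).normal(size=(5, 3))
table_text = series_to_table(x, ["acc_x", "acc_y", "acc_z"])
prompt = (
    "You are given a multivariate time series as a table.\n"
    f"{table_text}\n"
    "Question: which activity class does this series belong to? "
    "Choose from [walking, running, standing]."
)
print(prompt)
```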
zh
[NLP-53] Development of Pre-Trained Transformer-based Models for the Nepali Language
【速读】: 该论文试图解决尼泊尔语(Nepali)在自然语言处理(NLP)领域中数据资源匮乏和模型探索不足的问题。解决方案的关键在于收集了27.5 GB的尼泊尔语文本数据,这是迄今为止最大的尼泊尔语单语语料库,比现有资源大2.4倍。利用这些数据,论文预训练了三种不同的模型:BERT、RoBERTa和GPT-2,专门针对尼泊尔语。此外,论文还进行了指令微调(instruction tuning),探索其在尼泊尔语单语数据上的潜力,为未来的研究奠定了基础。实验结果表明,这些模型在Nep-gLUE基准测试中比现有最佳模型高出2分,达到了95.60分,并且在文本生成任务中也表现出色,显著提升了尼泊尔语的理解和生成能力。
链接: https://arxiv.org/abs/2411.15734
作者: Prajwal Thapa,Jinu Nyachhyon,Mridul Sharma,Bal Krishna Bal
关键词-EN: Natural Language Processing, Transformer-based pre-trained language, field of Natural, Language Processing, Nepali language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali Language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.
zh
[NLP-54] LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training
【速读】: 该论文试图解决在保持模型激活参数数量不变的情况下,如何通过稀疏性(sparsity)概念扩展模型规模的问题。解决方案的关键在于构建混合专家模型(Mixture-of-Experts, MoE),并将其应用于Transformer块中的注意力(Attention MoE)和多层感知机(MLP MoE)模块。研究通过不同的专家构建方法和粒度,分析了稀疏化对模型的影响,并设计了两阶段的后训练策略来抵消因增加稀疏性导致的性能下降,从而提升模型在多个领域(如对话、代码、数学)的综合能力。实验结果表明,这种方法在指导性大型语言模型(LLMs)上的应用具有潜在的有效性。
链接: https://arxiv.org/abs/2411.15708
作者: Xiaoye Qu,Daize Dong,Xuyang Hu,Tong Zhu,Weigao Sun,Yu Cheng
关键词-EN: activated parameters constant, gained increasing popularity, scaling model size, parameters constant, gained increasing
类目: Computation and Language (cs.CL)
备注: Technical report,13 pages
点击查看摘要
Abstract:Recently, inspired by the concept of sparsity, Mixture-of-Experts (MoE) models have gained increasing popularity for scaling model size while keeping the number of activated parameters constant. In this study, we thoroughly investigate the sparsity of the dense LLaMA model by constructing MoE for both the attention (i.e., Attention MoE) and MLP (i.e., MLP MoE) modules in the transformer blocks. Specifically, we investigate different expert construction methods and granularities under the same activation conditions to analyze the impact of sparsifying the model. Additionally, to comprehensively evaluate the model’s capabilities across various domains (e.g., conversation, code, math) after sparsification, we apply sparsity to the instructed large language models (LLMs) and construct instructed MoE models. To counteract the performance degradation resulting from increased sparsity, we design a two-stage post-training strategy to enhance model performance. Experiments on the LLaMA3 model demonstrate the potential effectiveness of this approach for future developments of instructed MoE models. The source codes and models are available at: \urlthis https URL.
zh
[NLP-55] RAMIE: Retrieval-Augmented Multi-task Information Extraction with Large Language Models on Dietary Supplements
【速读】: 该论文旨在开发一个先进的多任务大语言模型 (Large Language Model, LLM)框架,用于从临床记录中提取与膳食补充剂 (Dietary Supplements, DS) 相关的多种信息。解决方案的关键在于引入了一种名为检索增强多任务信息提取 (Retrieval-Augmented Multi-task Information Extraction, RAMIE)的新框架,该框架结合了指令微调 (Instruction Fine-tuning)、**多任务学习 (Multi-task Learning, MTL)和检索增强生成 (Retrieval-Augmented Generation, RAG)**技术。具体来说,RAMIE框架通过任务特定的提示进行指令微调,提高了模型在多个任务上的存储效率和训练成本效益,并通过从训练集中检索相似示例来增强生成能力,从而显著提升了多任务信息提取的性能。
链接: https://arxiv.org/abs/2411.15700
作者: Zaifu Zhan,Shuang Zhou,Mingchen Li,Rui Zhang
关键词-EN: advanced multi-task large, large language model, Multi-task Information Extraction, extract multiple types, information extraction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
点击查看摘要
Abstract:Objective: We aimed to develop an advanced multi-task large language model (LLM) framework to extract multiple types of information about dietary supplements (DS) from clinical records. Methods: We used four core DS information extraction tasks - namely, named entity recognition (NER: 2,949 clinical sentences), relation extraction (RE: 4,892 sentences), triple extraction (TE: 2,949 sentences), and usage classification (UC: 2,460 sentences) as our multitasks. We introduced a novel Retrieval-Augmented Multi-task Information Extraction (RAMIE) Framework, including: 1) employed instruction fine-tuning techniques with task-specific prompts, 2) trained LLMs for multiple tasks with improved storage efficiency and lower training costs, and 3) incorporated retrieval augmentation generation (RAG) techniques by retrieving similar examples from the training set. We compared RAMIE's performance to LLMs with instruction fine-tuning alone and conducted an ablation study to assess the contributions of multi-task learning and RAG to improved multitasking performance. Results: With the aid of the RAMIE framework, Llama2-13B achieved an F1 score of 87.39 (3.51% improvement) on the NER task and demonstrated outstanding performance on the RE task with an F1 score of 93.74 (1.15% improvement). For the TE task, Llama2-7B scored 79.45 (14.26% improvement), and MedAlpaca-7B achieved the highest F1 score of 93.45 (0.94% improvement) on the UC task. The ablation study revealed that while MTL increased efficiency with a slight trade-off in performance, RAG significantly boosted overall accuracy. Conclusion: This study presents a novel RAMIE framework that demonstrates substantial improvements in multi-task information extraction for DS-related data from clinical records. Our framework can potentially be applied to other domains.
zh
[NLP-56] Deep Sparse Latent Feature Models for Knowledge Graph Completion
【速读】: 该论文试图解决知识图谱补全 (Knowledge Graph Completion, KGC) 中大规模知识图谱 (Knowledge Graphs, KGs) 的复杂实体间连接问题。解决方案的关键在于引入了一种基于稀疏潜在特征模型的新框架,并通过深度变分自编码器 (Variational Autoencoder, VAE) 进行优化。该方法不仅能够有效地补全缺失的三元组,还能揭示潜在的社区结构并生成可解释的表示,从而显著提升在WN18RR、FB15k-237和Wikidata5M数据集上的性能。
链接: https://arxiv.org/abs/2411.15694
作者: Haotian Li,Rui Zhang,Lingzhi Wang,Bin Yu,Youwei Wang,Yuliang Wei,Kai Wang,Richard Yi Da Xu,Bailing Wang
关键词-EN: knowledge graph completion, large-scale knowledge graphs, Recent progress, knowledge graph, graph completion
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent progress in knowledge graph completion (KGC) has focused on text-based approaches to address the challenges of large-scale knowledge graphs (KGs). Despite their achievements, these methods often overlook the intricate interconnections between entities, a key aspect of the underlying topological structure of a KG. Stochastic blockmodels (SBMs), particularly the latent feature relational model (LFRM), offer robust probabilistic frameworks that can dynamically capture latent community structures and enhance link prediction. In this paper, we introduce a novel framework of sparse latent feature models for KGC, optimized through a deep variational autoencoder (VAE). Our approach not only effectively completes missing triples but also provides clear interpretability of the latent structures, leveraging textual information. Comprehensive experiments on the WN18RR, FB15k-237, and Wikidata5M datasets show that our method significantly improves performance by revealing latent communities and producing interpretable representations.
zh
[NLP-57] Ontology-Constrained Generation of Domain-Specific Clinical Summaries
【速读】: 该论文试图解决生成式大语言模型(Large Language Models, LLMs)在特定领域(如医疗领域)生成摘要时面临的两个主要问题:一是生成的摘要缺乏领域特定的信息,二是生成的内容中存在幻觉(hallucinations)。解决方案的关键在于利用本体论(ontologies)来指导生成过程,通过本体论引导的约束解码(ontology-guided constrained decoding)方法,既提高了生成摘要的领域相关性,又减少了幻觉现象。该方法在医疗领域的电子健康记录(Electronic Health Records, EHRs)摘要生成中表现出色,特别是在MIMIC-III数据集上的评估结果显示,生成的临床笔记摘要更具领域适应性且幻觉现象显著减少。
链接: https://arxiv.org/abs/2411.15666
作者: Gaya Mehenni,Amal Zouaq
关键词-EN: Large Language Models, Large Language, Language Models, offer promising solutions, offer promising
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands
点击查看摘要
Abstract:Large Language Models (LLMs) offer promising solutions for text summarization. However, some domains require specific information to be available in the summaries. Generating these domain-adapted summaries is still an open challenge. Similarly, hallucinations in generated content is a major drawback of current approaches, preventing their deployment. This study proposes a novel approach that leverages ontologies to create domain-adapted summaries both structured and unstructured. We employ an ontology-guided constrained decoding process to reduce hallucinations while improving relevance. When applied to the medical domain, our method shows potential in summarizing Electronic Health Records (EHRs) across different specialties, allowing doctors to focus on the most relevant information to their domain. Evaluation on the MIMIC-III dataset demonstrates improvements in generating domain-adapted summaries of clinical notes and hallucination reduction.
zh
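下面是一段极简示意(假设性写法,非论文实现):借助 transformers 的 prefix_allowed_tokens_fn,在每一步解码时只允许生成本体词表中的 token,以此演示“本体约束解码”的基本做法;其中以 GPT-2 代替论文中的 LLM,本体词表为虚构示例。

```python
# 示意:用 prefix_allowed_tokens_fn 将解码限制在本体词表内(假设性实现)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ontology_terms = ["hypertension", "diabetes", "aspirin", "insulin", "daily"]
allowed_ids = set()
for term in ontology_terms + [".", ","]:
    allowed_ids.update(tokenizer.encode(" " + term))
allowed_ids.add(tokenizer.eos_token_id)
allowed_ids = sorted(allowed_ids)

def restrict_to_ontology(batch_id, input_ids):
    # 每一步解码时返回允许采样的 token id 列表
    return allowed_ids

inputs = tokenizer("Summary of the clinical note:", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=12,
    prefix_allowed_tokens_fn=restrict_to_ontology,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```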
[NLP-58] Improving Next Tokens via Second-Last Predictions with Generate and Refine
【速读】: 该论文试图解决在自然语言处理中,生成式模型(如GPT)在预测下一个词时可能存在的准确性问题。解决方案的关键在于训练一个仅解码器架构的模型,用于预测序列中倒数第二个词(second last token),并通过一种结构化的确定性方法进行掩码(masking),从而提高训练效率。该方法通过“生成-然后-精炼”(generate-then-refine)策略,将倒数第二个词的预测与标准GPT的下一个词预测相结合,显著提升了下一个词预测的准确性,尤其是在不同版本的GPT-2模型和不同数据集上,倒数第二个词的预测准确性比普通下一个词预测高出超过15%。
链接: https://arxiv.org/abs/2411.15661
作者: Johannes Schneider
关键词-EN: Autoregressive language models, Autoregressive language, BERT are trained, predicting masked tokens, trained on tasks
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Autoregressive language models like GPT aim at predicting next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder only architecture for predicting the second last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models by employing a structured deterministic approach towards masking tokens. We use our model to improve the next token predictions of a standard GPT by combining both predictions in a “generate-then-refine” approach. We show on different variants of GPT-2 and different datasets that (not unexpectedly) second last token predictions are much more accurate, i.e., more than 15% higher accuracy than ordinary next token predictors. The “generate-then-refine” approach also demonstrates notable improvements in next-token predictions, yielding smaller yet consistent and significant gains.
zh
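下面用一段玩具代码示意“generate-then-refine”的组合思路(纯属假设的简化,非论文实现):先取标准下一词分布作为草稿,再与“倒数第二词预测器”的分布做插值得到精炼后的分布;两个分布与插值权重均为虚构。

```python
# 示意:下一词分布与倒数第二词预测分布的插值精炼(玩具数据,假设性实现)
import numpy as np

vocab = ["cat", "dog", "car", "tree"]
p_next = np.array([0.40, 0.35, 0.15, 0.10])         # 标准 GPT 的下一词分布
p_second_last = np.array([0.10, 0.70, 0.10, 0.10])  # 倒数第二词预测器给出的分布

alpha = 0.5  # 插值权重(假设值)
p_refined = (1 - alpha) * p_next + alpha * p_second_last
p_refined /= p_refined.sum()

draft = vocab[int(p_next.argmax())]
refined = vocab[int(p_refined.argmax())]
print(f"draft next token: {draft}, refined next token: {refined}")
```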
[NLP-59] AfriMed-QA: A Pan-African Multi-Specialty Medical Question-Answering Benchmark Dataset
【速读】: 该论文试图解决在低收入和中等收入国家(LMICs)中,由于医生短缺和专家缺乏,如何利用大型语言模型(LLM)来提高医疗保健的可及性和降低成本的问题。解决方案的关键在于引入了AfriMed-QA,这是首个大规模的泛非洲英语多专科医学问答(QA)数据集,包含15,000个问题(开放和封闭式),来源于16个国家的60多所医学院,涵盖32个医学专科。通过评估30个LLM在正确性和人口统计偏差等多个维度的表现,研究发现不同专科和地理区域的表现存在显著差异,且MCQ表现明显落后于USMLE(MedQA)。此外,生物医学LLM的表现不如通用模型,而较小的边缘友好型LLM难以达到及格分数。有趣的是,人类评估显示,与临床医生答案相比,消费者对LLM答案和解释有持续的偏好。
链接: https://arxiv.org/abs/2411.15640
作者: Tobi Olatunji,Charles Nimo,Abraham Owodunni,Tassallah Abdullahi,Emmanuel Ayodele,Mardhiyah Sanni,Chinemelu Aka,Folafunmi Omofoye,Foutse Yuehgoh,Timothy Faniran,Bonaventure F. P. Dossou,Moshood Yekini,Jonas Kemp,Katherine Heller,Jude Chidubem Omeke,Chidi Asuzu MD,Naome A. Etori,Aimérou Ndiaye,Ifeoma Okoh,Evans Doe Ocansey,Wendy Kinara,Michael Best,Irfan Essa,Stephen Edward Moore,Chris Fourie,Mercy Nyamewaa Asiedu
关键词-EN: Recent advancements, benchmarks have stimulated, patients globally, stimulated interest, providers and patients
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advancements in large language model(LLM) performance on medical multiple choice question (MCQ) benchmarks have stimulated interest from healthcare providers and patients globally. Particularly in low-and middle-income countries (LMICs) facing acute physician shortages and lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, their effectiveness in the Global South, especially across the African continent, remains to be established. In this work, we introduce AfriMed-QA, the first large scale Pan-African English multi-specialty medical Question-Answering (QA) dataset, 15,000 questions (open and closed-ended) sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We further evaluate 30 LLMs across multiple axes including correctness and demographic bias. Our findings show significant performance variation across specialties and geographies, MCQ performance clearly lags USMLE (MedQA). We find that biomedical LLMs underperform general models and smaller edge-friendly LLMs struggle to achieve a passing score. Interestingly, human evaluations show a consistent consumer preference for LLM answers and explanations when compared with clinician answers.
zh
[NLP-60] “All that Glitters”: Approaches to Evaluations with Unreliable Model and Human Annotations
【速读】: 该论文试图解决在模型评估过程中,由于“黄金”和“真实”人类标签存在误差,导致评估指标无法准确反映标签质量和模型性能的问题。解决方案的关键在于采用新颖的评估方法,通过六个维度(一致性 (Concordance)、置信度 (Confidence)、有效性 (Validity)、偏差 (Bias)、公平性 (Fairness) 和有用性 (Helpfulness))来全面评估标签质量和模型表现。研究首先揭示了在标签质量较低的情况下,标准评估指标可能掩盖标签和模型的真实质量,进而发现大型语言模型(LLM)在某些任务上表现“超人类”,但在更严格的评估下暴露出虚假相关性和非随机种族偏差。最后,研究扩展了这些方法,以估计在人机协作情境下,模型使用对人类标签质量的影响,并指出某些LLM在当前数据可泛化性的范围内,可能有助于提高昂贵的人类课堂评估质量。
链接: https://arxiv.org/abs/2411.15634
作者: Michael Hardy
关键词-EN: ground truth, Gold, quality, model, label quality
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: 20 pages, 15 figures, 58 pages with references and appendices
点击查看摘要
Abstract:“Gold” and “ground truth” human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families–encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieve state-of-the-art, even “super-human”, results across all classroom annotation tasks. But not all these positive results remain after using more rigorous evaluation measures which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change to human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.
zh
[NLP-61] Multi-label Sequential Sentence Classification via Large Language Model EMNLP2024
【速读】: 该论文试图解决科学出版物中序列句子分类 (Sequential Sentence Classification, SSC) 面临的模型大小、序列长度和单标签设置的限制问题。解决方案的关键在于提出了基于大语言模型 (Large Language Model, LLM) 的框架 LLM-SSC,该框架通过设计提示 (prompts) 来生成 SSC 标签,结合演示 (demonstrations) 和查询 (query) 描述预测目标,从而增强任务理解。此外,论文还引入了多标签对比学习损失 (multi-label contrastive learning loss) 和自动加权方案 (auto-weighting scheme),以支持多标签分类任务。为了验证多标签 SSC 分析的有效性,论文还发布了一个新的生物医学领域数据集 biorc800。
链接: https://arxiv.org/abs/2411.15623
作者: Mengfei Lan,Lecheng Zheng,Shufan Ming,Halil Kilicoglu
关键词-EN: Sequential sentence classification, fine-grained information retrieval, Sequential sentence, supporting downstream tasks, extractive summarization
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024
点击查看摘要
Abstract:Sequential sentence classification (SSC) in scientific publications is crucial for supporting downstream tasks such as fine-grained information retrieval and extractive summarization. However, current SSC methods are constrained by model size, sequence length, and single-label setting. To address these limitations, this paper proposes LLM-SSC, a large language model (LLM)-based framework for both single- and multi-label SSC tasks. Unlike previous approaches that employ small- or medium-sized language models, the proposed framework utilizes LLMs to generate SSC labels through designed prompts, which enhance task understanding by incorporating demonstrations and a query to describe the prediction target. We also present a multi-label contrastive learning loss with auto-weighting scheme, enabling the multi-label classification task. To support our multi-label SSC analysis, we introduce and release a new dataset, biorc800, which mainly contains unstructured abstracts in the biomedical domain with manual annotations. Experiments demonstrate LLM-SSC’s strong performance in SSC under both in-context learning and task-specific tuning settings. We release biorc800 and our code at: this https URL.
zh
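下面给出一个多标签对比损失的简化 PyTorch 写法(假设性示意,与 LLM-SSC 的原始损失并不等价):共享标签越多的样本对权重越大,用标签 Jaccard 相似度近似“自动加权”。

```python
# 示意:带标签重合度加权的多标签监督对比损失(假设性实现)
import torch
import torch.nn.functional as F

def multilabel_contrastive_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, d);labels: (N, L) 的 0/1 多标签矩阵。"""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)

    lab = labels.float()
    inter = lab @ lab.T                                   # 共同标签数
    union = lab.sum(1, keepdim=True) + lab.sum(1) - inter
    weight = inter / union.clamp(min=1)                   # Jaccard 权重
    weight = weight.masked_fill(eye, 0.0)                 # 去掉自身

    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    loss = -(weight * log_prob).sum(1) / weight.sum(1).clamp(min=1e-8)
    return loss.mean()

emb = torch.randn(4, 8)
lab = torch.tensor([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]])
print(multilabel_contrastive_loss(emb, lab))
```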
[NLP-62] A Survey on LLM-as-a-Judge
【速读】: 该论文试图解决如何构建可靠的大型语言模型(LLM)作为评估系统(LLM-as-a-Judge)的问题。解决方案的关键在于提高评估的一致性、减轻偏见,并适应多样化的评估场景。论文提出了一系列增强可靠性的策略,并设计了新的基准来评估LLM-as-a-Judge系统的可靠性,为研究人员和实践者提供了基础参考。
链接: https://arxiv.org/abs/2411.15594
作者: Jiawei Gu,Xuhui Jiang,Zhichao Shi,Hexiang Tan,Xuehao Zhai,Chengjin Xu,Wei Li,Yinghan Shen,Shengjie Ma,Honghao Liu,Yuanzhuo Wang,Jian Guo
关键词-EN: challenging task due, Large Language Models, Accurate and consistent, inherent subjectivity, crucial for decision-making
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 33 pages, 9 figures. arXiv admin note: text overlap with arXiv:2310.05470 by other authors
点击查看摘要
Abstract:Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of “LLM-as-a-Judge,” where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discussed practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.
zh
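下面是一段极简示意(假设性写法,不针对任何具体评测系统):LLM-as-a-Judge 最基本的两步——构造评测提示,然后从评审回复中解析分数;模板与回复均为虚构,不实际调用 API。

```python
# 示意:评测提示模板 + 分数解析(假设性实现)
import re
from typing import Optional

JUDGE_TEMPLATE = """You are an impartial judge. Rate the response on a 1-5 scale.
Question: {question}
Response: {response}
First give a brief justification, then output the score as "Score: <1-5>"."""

def build_judge_prompt(question: str, response: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, response=response)

def parse_score(judge_reply: str) -> Optional[int]:
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(match.group(1)) if match else None

prompt = build_judge_prompt("What causes tides?", "Mainly the Moon's gravity.")
judge_reply = "The answer is correct but terse. Score: 4"
print(parse_score(judge_reply))  # -> 4
```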
[NLP-63] Transparent but Powerful: Explainability, Accuracy, and Generalizability in ADHD Detection from Social Media Data
【速读】: 该论文试图解决注意力缺陷多动障碍(Attention-deficit/hyperactivity disorder, ADHD)的诊断不足问题,特别是通过利用社交媒体数据进行大规模、非侵入性的筛查。解决方案的关键在于利用自然语言处理(Natural Language Processing, NLP)和机器学习(Machine Learning, ML)技术,分析社交媒体文本中的语言模式。论文通过比较浅层机器学习模型和深度学习模型(如BiLSTM和基于transformer的模型),评估了不同模型在ADHD检测中的性能和可解释性。研究发现,BiLSTM模型在透明性和准确性之间提供了良好的平衡,并揭示了跨平台数据(如Reddit和Twitter)中与ADHD相关的关键语言特征,这些特征有助于开发更有效的数字筛查工具。
链接: https://arxiv.org/abs/2411.15586
作者: D. Wiechmann,E. Kempa,E. Kerz,Y. Qiao
关键词-EN: remains severely underdiagnosed, prevalent mental health, mental health condition, health condition affecting, Natural Language Processing
类目: Computation and Language (cs.CL)
备注: 12 pages (including references and appendix)
点击查看摘要
Abstract:Attention-deficit/hyperactivity disorder (ADHD) is a prevalent mental health condition affecting both children and adults, yet it remains severely underdiagnosed. Recent advances in artificial intelligence, particularly in Natural Language Processing (NLP) and Machine Learning (ML), offer promising solutions for scalable and non-invasive ADHD screening methods using social media data. This paper presents a comprehensive study on ADHD detection, leveraging both shallow machine learning models and deep learning approaches, including BiLSTM and transformer-based models, to analyze linguistic patterns in ADHD-related social media text. Our results highlight the trade-offs between interpretability and performance across different models, with BiLSTM offering a balance of transparency and accuracy. Additionally, we assess the generalizability of these models using cross-platform data from Reddit and Twitter, uncovering key linguistic features associated with ADHD that could contribute to more effective digital screening tools.
zh
[NLP-64] From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars COLING2025
【速读】: 该论文试图解决的问题是如何评估和提升语言模型在处理低资源语言(low-resource languages)时的能力,特别是从复杂的语言学语法描述中提取和分类信息的能力。解决方案的关键在于引入了一套基准测试(benchmarks),涵盖了248种语言和142个语系,重点关注WALS和Grambank中的类型学特征(typological features)。论文提出了一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的方法,利用这些语言学描述来支持下游任务,如机器翻译。这些基准测试为首次全面评估语言模型在上下文中的能力,准确解释和提取语言学特征,为扩展自然语言处理(NLP)到低资源语言提供了关键资源。
链接: https://arxiv.org/abs/2411.15577
作者: Albert Kornilov,Tatiana Shavrina
关键词-EN: demonstrated significant improvements, Recent advances, including in-context learning, extremely under-resourced languages, zero-shot capabilities
类目: Computation and Language (cs.CL)
备注: submitted to COLING 2025
点击查看摘要
Abstract:Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first comprehensive evaluation of language models’ in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at this https URL.
zh
[NLP-65] Do LLMs Agree on the Creativity Evaluation of Alternative Uses?
【速读】: 该论文试图解决的问题是:大型语言模型 (LLMs) 在评估替代用途测试 (Alternative Uses Test, AUT) 中的创造性响应时是否能够保持一致性和公正性。解决方案的关键在于使用一个由专家分类的基准数据集(包含常见、创造性和高度创造性的响应),并利用四种最先进的 LLMs 对这些响应进行评分和排序。通过两种评估设置(综合和分段),研究结果显示,LLMs 在评估创造性方面表现出高度的模型间一致性(Spearman 相关系数平均超过 0.7,与基准数据集的相关系数超过 0.77),并且不偏袒自己生成的响应,而是对其他模型生成的响应给予相似的创造性评分或排名。这些发现验证了 LLMs 在创造性评估中的可靠性和公正性,为自动化创造性评估提供了有前景的应用前景。
链接: https://arxiv.org/abs/2411.15560
作者: Abdullah Al Rabeyah,Fabrício Góes,Marco Volpe,Talles Medeiros
关键词-EN: large language models, investigates whether large, large language, creativity, LLMs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 7 figures, 15 tables
点击查看摘要
Abstract:This paper investigates whether large language models (LLMs) show agreement in assessing creativity in responses to the Alternative Uses Test (AUT). While LLMs are increasingly used to evaluate creative content, previous studies have primarily focused on a single model assessing responses generated by the same model or humans. This paper explores whether LLMs can impartially and accurately evaluate creativity in outputs generated by both themselves and other models. Using an oracle benchmark set of AUT responses, categorized by creativity level (common, creative, and highly creative), we experiment with four state-of-the-art LLMs evaluating these outputs. We test both scoring and ranking methods and employ two evaluation settings (comprehensive and segmented) to examine if LLMs agree on the creativity evaluation of alternative uses. Results reveal high inter-model agreement, with Spearman correlations averaging above 0.7 across models and reaching over 0.77 with respect to the oracle, indicating a high level of agreement and validating the reliability of LLMs in creativity assessment of alternative uses. Notably, models do not favour their own responses, instead they provide similar creativity assessment scores or rankings for alternative uses generated by other models. These findings suggest that LLMs exhibit impartiality and high alignment in creativity evaluation, offering promising implications for their use in automated creativity assessment.
zh
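摘要中的模型间一致性可以直接用 Spearman 相关系数来度量,下面是一段极简示意(打分数据为虚构):

```python
# 示意:计算不同 LLM 创造力打分之间以及与专家基准之间的 Spearman 相关(虚构数据)
from scipy.stats import spearmanr

oracle  = [1, 2, 2, 3, 3, 1, 2, 3]   # 专家分级:1=常见, 2=有创意, 3=高度有创意
model_a = [1, 2, 3, 3, 3, 1, 2, 2]
model_b = [2, 2, 2, 3, 3, 1, 1, 3]

rho_ab, _ = spearmanr(model_a, model_b)
rho_ao, _ = spearmanr(model_a, oracle)
print(f"model A vs model B: {rho_ab:.2f}")
print(f"model A vs oracle:  {rho_ao:.2f}")
```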
[NLP-66] QEQR: An Exploration of Query Expansion Methods for Question Retrieval in CQA Services
【速读】: 该论文试图解决CQA(Community Question Answering)服务中由于词汇差异(lexical gap)导致的相似问题检索困难的问题。解决方案的关键在于使用查询扩展(query expansion)方法,包括基于词相似度的方法、提出基于问题相似度的方法以及选择性扩展这些方法,以扩展用户提交的问题,从而缓解词汇差异问题。最佳方法相较于未使用查询扩展的最佳基线方法,实现了1.8%的显著相对改进。
链接: https://arxiv.org/abs/2411.15530
作者: Yasin Ghafourian,Sajad Movahedi,Azadeh Shakery
关键词-EN: CQA services, valuable sources, sources of knowledge, find answers, answers to users’
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:CQA services are valuable sources of knowledge that can be used to find answers to users’ information needs. In these services, question retrieval aims to help users with their information needs by finding similar questions to theirs. However, finding similar questions is obstructed by the lexical gap that exists between relevant questions. In this work, we target this problem by using query expansion methods. We use word-similarity-based methods, propose a question-similarity-based method and selective expansion of these methods to expand a question that’s been submitted and mitigate the lexical gap problem. Our best method achieves a significant relative improvement of 1.8% compared to the best-performing baseline without query expansion.
zh
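下面用一段玩具代码示意“基于词相似度的查询扩展”(假设性写法,非论文实现):为查询中的每个词追加若干相似词,以缓解相似问题之间的词汇差异;相似词表为虚构,实际系统通常基于词向量或同义词资源。

```python
# 示意:词相似度查询扩展(虚构相似词表,假设性实现)
similar_words = {
    "laptop": ["notebook", "computer"],
    "slow": ["sluggish", "lagging"],
    "fix": ["repair", "solve"],
}

def expand_query(query: str, per_word: int = 2) -> str:
    tokens = query.lower().split()
    expansion = []
    for tok in tokens:
        expansion.extend(similar_words.get(tok, [])[:per_word])
    return " ".join(tokens + expansion)

print(expand_query("How to fix a slow laptop"))
```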
[NLP-67] Enhancing Grammatical Error Detection using BERT with Cleaned Lang-8 Dataset
【速读】: 该论文试图解决语法错误检测 (Grammatical Error Detection, GED) 这一具有挑战性和重要性的问题。解决方案的关键在于精细化的数据清洗和使用基于Transformer的模型进行微调。具体来说,论文通过严格清洗Lang8数据集,并使用BERT-base-uncased模型进行实验,取得了显著的性能提升,F1得分达到0.91,训练集准确率达到98.49%,测试集准确率达到90.53%。此外,研究还发现,尽管使用了更大规模的BERT-large-uncased和RoBERTa-large模型,性能并未显著提升,这表明在GED任务中,数据质量和模型选择比模型规模更为关键。
链接: https://arxiv.org/abs/2411.15523
作者: Rahul Nihalani,Kushal Shah
关键词-EN: Grammatical Error Detection, improved LLM based, Error Detection, Grammatical Error, equally important problem
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 tables, 20 references
点击查看摘要
Abstract:This paper presents an improved LLM based model for Grammatical Error Detection (GED), which is a very challenging and equally important problem for many applications. The traditional approach to GED involved hand-designed features, but recently, Neural Networks (NN) have automated the discovery of these features, improving performance in GED. Traditional rule-based systems have an F1 score of 0.50-0.60 and earlier machine learning models give an F1 score of 0.65-0.75, including decision trees and simple neural networks. Previous deep learning models, for example, Bi-LSTM, have reported F1 scores within the range from 0.80 to 0.90. In our study, we have fine-tuned various transformer models using the Lang8 dataset rigorously cleaned by us. In our experiments, the BERT-base-uncased model gave an impressive performance with an F1 score of 0.91 and accuracy of 98.49% on training data and 90.53% on testing data, also showcasing the importance of data cleaning. Increasing model size using BERT-large-uncased or RoBERTa-large did not give any noticeable improvements in performance or advantage for this task, underscoring that larger models are not always better. Our results clearly show how far rigorous data cleaning and simple transformer-based models can go toward significantly improving the quality of GED.
zh
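下面给出一段极简示意(假设性写法,示例句子与标签为虚构,不涉及论文中的 Lang8 清洗流程):用 bert-base-uncased 做句子级“是否含语法错误”二分类的一步微调。

```python
# 示意:BERT 句子级语法错误检测的一步微调(虚构数据,假设性实现)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

sentences = ["She go to school every day.", "She goes to school every day."]
labels = torch.tensor([1, 0])  # 1 = 含语法错误, 0 = 正确

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"loss: {outputs.loss.item():.4f}")
```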
[NLP-68] MolMetaLM: a Physicochemical Knowledge-Guided Molecular Meta Language Model
【速读】: 该论文试图解决现有分子语言模型在处理分子时仅依赖于原子/键符号,而忽视了分子所包含的重要物理/化学性质的问题。解决方案的关键在于提出了一个新颖的物理化学知识引导的分子元语言框架MolMetaLM。该框架设计了一种分子专用的元语言范式,格式化为多个共享相同主体(即分子)的S,P,O(主语、谓语、宾语)知识三元组,以增强学习物理化学知识与分子之间的语义关系。通过引入不同的分子知识和噪声,元语言范式生成了数以万计的预训练任务,从而在属性预测、分子生成、构象推断和分子优化等大规模基准评估中表现出色。MolMetaLM为设计语言模型提供了新的视角。
链接: https://arxiv.org/abs/2411.15500
作者: Yifan Wu,Min Zeng,Yang Li,Yang Zhang,Min Li
关键词-EN: natural language processing, transfer the masked, language, language models transfer, masked language model
类目: Emerging Technologies (cs.ET); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Most current molecular language models transfer the masked language model or image-text generation model from natural language processing to molecular field. However, molecules are not solely characterized by atom/bond symbols; they encapsulate important physical/chemical properties. Moreover, normal language models bring grammar rules that are irrelevant for understanding molecules. In this study, we propose a novel physicochemical knowledge-guided molecular meta language framework MolMetaLM. We design a molecule-specialized meta language paradigm, formatted as multiple S,P,O (subject, predicate, object) knowledge triples sharing the same S (i.e., molecule) to enhance learning the semantic relationships between physicochemical knowledge and molecules. By introducing different molecular knowledge and noises, the meta language paradigm generates tens of thousands of pretraining tasks. By recovering the token/sequence/order-level noises, MolMetaLM exhibits proficiency in large-scale benchmark evaluations involving property prediction, molecule generation, conformation inference, and molecular optimization. Through MolMetaLM, we offer a new insight for designing language models.
zh
[NLP-69] Traditional Chinese Medicine Case Analysis System for High-Level Semantic Abstraction: Optimized with Prompt and RAG
【速读】: 该论文旨在构建一个用于传统中医(TCM)的临床案例数据库,通过网络爬虫技术从多个平台(如360doc)收集了超过5000个TCM临床案例。解决方案的关键在于数据清洗和结构化处理,包括患者信息、病因、证候和注释等关键字段的提取。利用Baidu_ERNIE_Speed_128K API去除冗余信息,并通过DeepSeekv2 API生成最终答案,输出标准JSON格式。此外,通过RAG和rerank技术优化数据召回,结合两阶段检索方法和Jieba关键词匹配,显著提高了模型输出的准确性。
链接: https://arxiv.org/abs/2411.15491
作者: Peng Xu,Hongjin Wu,Jinle Wang,Rongjia Lin,Liwei Tan
关键词-EN: Traditional Chinese Medicine, Chinese Medicine, Traditional Chinese, TCM clinical cases, clinical case database
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper details a technical plan for building a clinical case database for Traditional Chinese Medicine (TCM) using web scraping. Leveraging multiple platforms, including 360doc, we gathered over 5,000 TCM clinical cases, performed data cleaning, and structured the dataset with crucial fields such as patient details, pathogenesis, syndromes, and annotations. Using the Baidu_ERNIE_Speed_128K API, we removed redundant information and generated the final answers through the DeepSeekv2 API, outputting results in standard JSON format. We optimized data recall with RAG and rerank techniques during retrieval and developed a hybrid matching scheme. By combining two-stage retrieval method with keyword matching via Jieba, we significantly enhanced the accuracy of model outputs.
zh
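下面是“Jieba 关键词匹配 + 重排”两阶段检索的极简示意(案例库与打分方式均为虚构,仅演示流程,并非原系统实现):

```python
# 示意:先用 Jieba 分词做关键词召回,再按词重合度重排(假设性实现)
import jieba

cases = [
    "患者恶寒发热,苔薄白,辨证为风寒束表,方用麻黄汤加减。",
    "患者口苦咽干,脉弦数,辨证为肝胆湿热,方用龙胆泻肝汤。",
    "患者倦怠乏力,食少便溏,辨证为脾气虚弱,方用四君子汤。",
]

def tokens(text):
    return set(jieba.lcut(text))

def retrieve(query, top_k=2):
    q = tokens(query)
    # 阶段一:召回与查询至少有一个词重合的案例
    recalled = [(c, len(q & tokens(c))) for c in cases if q & tokens(c)]
    # 阶段二:按重合词数重排(实际系统可换成向量相似度或交叉编码器)
    recalled.sort(key=lambda x: x[1], reverse=True)
    return [c for c, _ in recalled[:top_k]]

for c in retrieve("发热恶寒,头痛无汗"):
    print(c)
```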
[NLP-70] Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework Distilled Training and Meta-evaluation Benchmark
【速读】: 该论文试图解决文本到图像生成质量自动评估中的成本和性能问题。解决方案的关键在于提出了一种基于GPT-4o的任务分解评估框架,通过将复杂的评估任务分解为更简单的子任务,从而降低学习复杂性,并利用这一框架自动构建新的训练数据集。基于此数据集,论文设计了创新的训练策略,成功地将GPT-4o的评估能力提炼到一个7B参数的开源多模态大语言模型(MLLM)MiniCPM-V-2.6中。此外,论文还手动标注了一个包含链式思维解释和质量评分的元评估基准,以全面评估现有方法和所提出模型的性能。实验结果表明,提炼后的开源MLLM在Spearman和Kendall相关性上显著优于当前最先进的GPT-4o-base基线模型VIEScore,分别提高了4.6%。
链接: https://arxiv.org/abs/2411.15488
作者: Rong-Cheng Tu,Zi-Ao Ma,Tian Lan,Yuehao Zhao,Heyan Huang,Xian-Ling Mao
关键词-EN: made significant strides, Multi-modal Large Language, Large Language Models, generation has made, creating a pressing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Driven by the remarkable progress in diffusion models, text-to-image generation has made significant strides, creating a pressing demand for automatic quality evaluation of generated images. Current state-of-the-art automatic evaluation methods heavily rely on Multi-modal Large Language Models (MLLMs), particularly powerful commercial models like GPT-4o. While these models are highly effective, their substantial costs limit scalability in large-scale evaluations. Adopting open-source MLLMs is an alternative; however, their performance falls short due to significant limitations in processing multi-modal data compared to commercial MLLMs. To tackle these problems, we first propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset, where the complex evaluation task is decoupled into simpler sub-tasks, effectively reducing the learning complexity. Based on this dataset, we design innovative training strategies to effectively distill GPT-4o’s evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Furthermore, to reliably and comprehensively assess prior works and our proposed model, we manually annotate a meta-evaluation benchmark that includes chain-of-thought explanations alongside quality scores for generated images. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline, VIEScore, with over 4.6% improvement in Spearman and Kendall correlations with human judgments.
zh
[NLP-71] Transition Network Analysis: A Novel Framework for Modeling, Visualizing, and Identifying the Temporal Patterns of Learners and Learning Processes
【速读】: 该论文试图解决学习过程数据中过渡模式建模、可视化和识别的问题。解决方案的关键在于提出了一个名为过渡网络分析 (Transition Network Analysis, TNA) 的新型分析框架,该框架整合了随机过程挖掘 (Stochastic Process Mining) 和概率图表示 (probabilistic graph representation),将关系和时间维度结合在一个统一的视角下。TNA 不仅能够捕捉重要的学习事件(centralities)、识别行为模式(community finding),还能揭示时间模式(clustering)。通过案例研究,TNA 展示了其在揭示监管过程、识别重要事件和时间模式方面的有效性,并通过 Bootstrap 验证确保了过渡的显著性。
链接: https://arxiv.org/abs/2411.15486
作者: Mohammed Saqr,Sonsoles López-Pernas,Tiina Törmänen,Rogers Kaliisa,Kamila Misiejuk,Santtu Tikka
关键词-EN: Stochastic Process Mining, integrates Stochastic Process, Stochastic Process, Process Mining, learning process data
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted at Learning Analytics Knowledge (LAK '25)
点击查看摘要
Abstract:This paper proposes a novel analytical framework: Transition Network Analysis (TNA), an approach that integrates Stochastic Process Mining and probabilistic graph representation to model, visualize, and identify transition patterns in the learning process data. Combining the relational and temporal aspects into a single lens offers capabilities beyond either framework, including centralities to capture important learning events, community finding to identify patterns of behavior, and clustering to reveal temporal patterns. This paper introduces the theoretical and mathematical foundations of TNA. To demonstrate the functionalities of TNA, we present a case study with students (n=191) engaged in small-group collaboration to map patterns of group dynamics using the theories of co-regulation and socially-shared regulated learning. The analysis revealed that TNA could reveal the regulatory processes and identify important events, temporal patterns and clusters. Bootstrap validation established the significant transitions and eliminated spurious transitions. In doing so, we showcase TNA’s utility to capture learning dynamics and provide a robust framework for investigating the temporal evolution of learning processes. Future directions include advancing estimation methods, expanding reliability assessment, exploring longitudinal TNA, and comparing TNA networks using permutation tests.
zh
[NLP-72] Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai ACL
【速读】: 该论文试图解决在数据稀缺的情况下,如何高效地对大型语言模型 (LLMs) 进行指令微调以适应低资源语言(特别是泰语)的问题。解决方案的关键在于提出了一种无需种子数据 (seed-data-free) 的合成数据生成框架,该框架通过生成多样化的主题、从维基百科中检索相关上下文,并创建适用于多种任务(如问答、摘要和对话)的指令,来构建具有流畅性、多样性和文化背景的指令微调数据集。实验结果表明,该框架生成的合成数据集在仅使用5,000条指令的情况下,就能达到与使用数十万条指令训练的先进泰语LLMs相媲美的性能。
链接: https://arxiv.org/abs/2411.15484
作者: Parinthapat Pengpun,Can Udomcharoenchaikit,Weerayut Buaphet,Peerat Limkonchotiwat
关键词-EN: large language models, instruction-tuning large language, data-efficient manner, specifically focusing, language models
类目: Computation and Language (cs.CL)
备注: ACL-SRW 2024. Our code and dataset are publicly available at this https URL
点击查看摘要
Abstract:We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at this https URL.
zh
[NLP-73] Towards Robust Evaluation of Unlearning in LLMs via Data Transformations EMNLP2024
【速读】: 该论文试图解决的问题是如何在大型语言模型 (LLMs) 中实现可靠的机器遗忘 (Machine Unlearning, MUL),以确保模型能够彻底遗忘特定信息(如个人身份信息 PII),同时不影响其在常规任务中的性能。解决方案的关键在于评估现有 MUL 技术的鲁棒性,特别是研究数据格式转换对遗忘效果的影响。论文通过在 TOFU 数据集上的实验,强调了使用多样化的数据格式来量化 LLMs 中遗忘效果的必要性,以确保模型在不同输入格式下均无法召回被遗忘的信息。
链接: https://arxiv.org/abs/2411.15477
作者: Abhinav Joshi,Shaswati Saha,Divyaksh Shukla,Sriram Vema,Harsh Jhamtani,Manas Gaur,Ashutosh Modi
关键词-EN: Large Language Models, Large Language, Language Models, great success, wide range
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted at EMNLP 2024 Findings; 21 pages (5 page main content + references + appendix)
点击查看摘要
Abstract:Large Language Models (LLMs) have shown to be a great success in a wide range of applications ranging from regular NLP-based use cases to AI agents. LLMs have been trained on a vast corpus of texts from various sources; despite the best efforts during the data pre-processing stage while training the LLMs, they may pick some undesirable information such as personally identifiable information (PII). Consequently, in recent times research in the area of Machine Unlearning (MUL) has become active, the main idea is to force LLMs to forget (unlearn) certain information (e.g., PII) without suffering from performance loss on regular tasks. In this work, we examine the robustness of the existing MUL techniques for their ability to enable leakage-proof forgetting in LLMs. In particular, we examine the effect of data transformation on forgetting, i.e., is an unlearned LLM able to recall forgotten information if there is a change in the format of the input? Our findings on the TOFU dataset highlight the necessity of using diverse data formats to quantify unlearning in LLMs more reliably.
zh
[NLP-74] HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
【速读】: 该论文试图解决在线仇恨言论检测模型在实际应用中的性能评估问题,特别是由于评估数据集的系统性偏差导致模型在不同语言和地理区域中的表现不明确。解决方案的关键在于引入了HateDay,这是首个代表社交媒体环境的全球性仇恨言论数据集,涵盖了2022年9月21日发布的八种语言和四个英语国家的推文。通过HateDay,研究揭示了仇恨言论在不同语言和国家中的流行程度和构成差异,并发现学术数据集上的评估结果高估了实际检测性能,尤其是在非欧洲语言中。论文还指出了模型在区分仇恨言论与攻击性言论方面的不足,以及学术研究目标与现实世界中目标流行度之间的不匹配。最终,研究强调了未来检测模型需要在实际应用环境中进行评估,以应对这一全球性挑战。
链接: https://arxiv.org/abs/2411.15462
作者: Manuel Tonneau,Diyi Liu,Niyati Malhotra,Scott A. Hale,Samuel P. Fraiberger,Victor Orozco-Olvera,Paul Röttger
关键词-EN: online content, hate speech, sea of online, online hate speech, large body
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:To tackle the global challenge of online hate speech, a large body of research has developed detection models to flag hate speech in the sea of online content. Yet, due to systematic biases in evaluation datasets, detection performance in real-world settings remains unclear, let alone across geographies. To address this issue, we introduce HateDay, the first global hate speech dataset representative of social media settings, randomly sampled from all tweets posted on September 21, 2022 for eight languages and four English-speaking countries. Using HateDay, we show how the prevalence and composition of hate speech varies across languages and countries. We also find that evaluation on academic hate speech datasets overestimates real-world detection performance, which we find is very low, especially for non-European languages. We identify several factors explaining poor performance, including models’ inability to distinguish between hate and offensive speech, and the misalignment between academic target focus and real-world target prevalence. We finally argue that such low performance renders hate speech moderation with public detection models unfeasible, even in a human-in-the-loop setting which we find is prohibitively costly. Overall, we emphasize the need to evaluate future detection models from academia and platforms in real-world settings to address this global challenge.
zh
[NLP-75] Efficient Ternary Weight Embedding Model: Bridging Scalability and Performance
【速读】: 该论文试图解决嵌入模型在资源受限环境中部署时面临的高内存和计算需求问题。解决方案的关键在于提出了一种新颖的微调框架,用于三值权重嵌入模型(ternary-weight embedding models),通过引入自教知识蒸馏(self-taught knowledge distillation)来确定线性层的三值权重,从而在保持高性能的同时显著降低内存和计算开销。实验结果表明,三值化模型在推理阶段具有低内存占用和低延迟,且在与近似最近邻搜索(Approximate Nearest Neighbor, ANN)结合时,在精度和计算效率上均取得了显著提升。
链接: https://arxiv.org/abs/2411.15438
作者: Jiayi Chen,Chen Wu,Shaoqun Zhang,Nan Li,Liangjie Zhang,Qi Zhang
关键词-EN: enabling efficient semantic, natural language processing, efficient semantic search, enabling efficient, essential tools
类目: Computation and Language (cs.CL)
备注: Technical Report
点击查看摘要
Abstract:Embedding models have become essential tools in both natural language processing and computer vision, enabling efficient semantic search, recommendation, clustering, and more. However, the high memory and computational demands of full-precision embeddings pose challenges for deployment in resource-constrained environments, such as real-time recommendation systems. In this work, we propose a novel finetuning framework to ternary-weight embedding models, which reduces memory and computational overhead while maintaining high performance. To apply ternarization to pre-trained embedding models, we introduce self-taught knowledge distillation to finalize the ternary-weights of the linear layers. With extensive experiments on public text and vision datasets, we demonstrated that without sacrificing effectiveness, the ternarized model consumes low memory usage and has low latency in the inference stage with great efficiency. In practical implementations, embedding models are typically integrated with Approximate Nearest Neighbor (ANN) search. Our experiments combining ternary embedding with ANN search yielded impressive improvement in both accuracy and computational efficiency. The repository is available at here.
zh
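下面给出一段极简示意(假设性写法,只演示常见的“阈值 + 缩放”三值化,并非论文的自蒸馏微调流程):把线性层权重量化到 {-α, 0, +α} 三个取值。

```python
# 示意:TWN 风格的线性层权重三值化(假设阈值取 0.7*mean|W|)
import torch
import torch.nn as nn

def ternarize_(linear: nn.Linear) -> None:
    w = linear.weight.data
    delta = 0.7 * w.abs().mean()                               # 阈值
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)   # 非零位置的缩放因子
    linear.weight.data = alpha * torch.sign(w) * mask

layer = nn.Linear(16, 8)
ternarize_(layer)
print(torch.unique(layer.weight.data))  # 只剩 {-α, 0, +α} 三个取值
```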
[NLP-76] Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
【速读】: 该论文试图解决在终身学习场景下,视觉语言大模型(Vision LLMs, VLLMs)中知识编辑的问题,即在不重新训练模型的情况下,如何持续地修正不准确的知识、更新过时的信息以及整合新数据。解决方案的关键在于提出了LiveEdit框架,该框架包括三个主要模块:1) 训练一个编辑专家生成器(editing expert generator),用于为每个编辑实例独立生成低秩专家,以修正VLLM的相关响应;2) 开发一种硬过滤机制(hard filtering mechanism),利用视觉语义知识在推理阶段粗略地排除与输入查询视觉无关的专家;3) 引入一种基于文本语义相关性的软路由机制(soft routing mechanism),以实现多专家融合,从而整合视觉相关的专家。这些设计使得LiveEdit在终身VLLM编辑场景中表现出显著优势。
链接: https://arxiv.org/abs/2411.15432
作者: Qizhou Chen,Chengyu Wang,Dakan Wang,Taolin Zhang,Wangyue Li,Xiaofeng He
关键词-EN: update outdated information, Large Language Models, data into Large, Large Language, correct inaccurate knowledge
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Model editing aims to correct inaccurate knowledge, update outdated information, and incorporate new data into Large Language Models (LLMs) without the need for retraining. This task poses challenges in lifelong scenarios where edits must be continuously applied for real-world applications. While some editors demonstrate strong robustness for lifelong editing in pure LLMs, Vision LLMs (VLLMs), which incorporate an additional vision modality, are not directly adaptable to existing LLM editors. In this paper, we propose LiveEdit, a LIfelong Vision language modEl Edit to bridge the gap between lifelong LLM editing and VLLMs. We begin by training an editing expert generator to independently produce low-rank experts for each editing instance, with the goal of correcting the relevant responses of the VLLM. A hard filtering mechanism is developed to utilize visual semantic knowledge, thereby coarsely eliminating visually irrelevant experts for input queries during the inference stage of the post-edited model. Finally, to integrate visually relevant experts, we introduce a soft routing mechanism based on textual semantic relevance to achieve multi-expert fusion. For evaluation, we establish a benchmark for lifelong VLLM editing. Extensive experiments demonstrate that LiveEdit offers significant advantages in lifelong VLLM editing scenarios. Further experiments validate the rationality and effectiveness of each module design in LiveEdit.
zh
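下面用一段玩具代码示意“低秩专家 + 软路由权重叠加到原线性层”的基本形态(假设性简化:专家参数与路由权重用随机张量/标量代替,并非 LiveEdit 的实际生成器与路由机制):

```python
# 示意:LoRA 风格低秩增量按路由权重叠加到原线性层(假设性实现)
import torch
import torch.nn as nn

d_in, d_out, rank = 32, 32, 4
base = nn.Linear(d_in, d_out)

# 一条编辑实例对应的低秩专家参数(实际中由编辑专家生成器产生)
A = torch.randn(d_out, rank) * 0.01
B = torch.randn(rank, d_in) * 0.01

def edited_forward(x: torch.Tensor, gate: float) -> torch.Tensor:
    """gate 为软路由分配给该专家的权重,视觉/文本相关性越高权重越大。"""
    return base(x) + gate * (x @ B.T @ A.T)

x = torch.randn(2, d_in)
print(edited_forward(x, gate=0.0).shape, edited_forward(x, gate=1.0).shape)
```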
[NLP-77] Exploring Large Language Models for Multimodal Sentiment Analysis: Challenges Benchmarks and Future Directions
【速读】: 该论文试图解决多模态基于方面的情感分析 (Multimodal Aspect-Based Sentiment Analysis, MABSA) 中,大型语言模型 (Large Language Models, LLMs) 的适应性和性能问题。解决方案的关键在于构建一个基准测试,以评估LLMs在MABSA任务中的表现,并与传统的监督学习方法进行比较。研究结果表明,尽管LLMs在多模态理解方面展现出潜力,但在MABSA任务中,特别是在准确性和推理时间方面,仍面临显著挑战。基于这些发现,论文讨论了当前LLMs的局限性,并提出了未来研究的方向,以增强其在多模态情感分析中的能力。
链接: https://arxiv.org/abs/2411.15408
作者: Shezheng Song
关键词-EN: extract aspect terms, Aspect-Based Sentiment Analysis, aims to extract, including text, text and images
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect terms and their corresponding sentiment polarities from multimodal information, including text and images. While traditional supervised learning methods have shown effectiveness in this task, the adaptability of large language models (LLMs) to MABSA remains uncertain. Recent advances in LLMs, such as Llama2, LLaVA, and ChatGPT, demonstrate strong capabilities in general tasks, yet their performance in complex and fine-grained scenarios like MABSA is underexplored. In this study, we conduct a comprehensive investigation into the suitability of LLMs for MABSA. To this end, we construct a benchmark to evaluate the performance of LLMs on MABSA tasks and compare them with state-of-the-art supervised learning methods. Our experiments reveal that, while LLMs demonstrate potential in multimodal understanding, they face significant challenges in achieving satisfactory results for MABSA, particularly in terms of accuracy and inference time. Based on these findings, we discuss the limitations of current LLMs and outline directions for future research to enhance their capabilities in multimodal sentiment analysis.
zh
[NLP-78] ML-SPEAK: A Theory-Guided Machine Learning Method for Studying and Predicting Conversational Turn-taking Patterns
【速读】: 该论文试图解决从团队成员的个性特征预测团队动态的问题,解决方案的关键在于开发了一种基于对话轮换模式的计算模型。该模型通过分析团队成员在自组织团队中的对话轮换模式(turn-taking patterns),独立于对话内容,来揭示个性特征与团队沟通动态之间的关系。模型通过训练对话数据,学习个体特征与发言行为之间的关联,并能基于团队特征组合预测整体的沟通模式。这种方法不仅提高了预测对话轮换序列的准确性,还能揭示新的个性特征与沟通模式之间的关系,从而为团队过程理论提供数据驱动的动态理解,并为团队人员配置和培训提供实用指导。
链接: https://arxiv.org/abs/2411.15405
作者: Lisa R. O’Bryan,Madeline Navarro,Juan Segundo Hevia,Santiago Segarra
关键词-EN: team, personality traits remains, remains a fundamental, fundamental challenge, psychological sciences
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 64 pages, 9 figures
点击查看摘要
Abstract:Predicting team dynamics from personality traits remains a fundamental challenge for the psychological sciences and team-based organizations. Understanding how team composition generates team processes can significantly advance team-based research along with providing practical guidelines for team staffing and training. Although the Input-Process-Output (IPO) model has been useful for studying these connections, the complex nature of team member interactions demands a more dynamic approach. We develop a computational model of conversational turn-taking within self-organized teams that can provide insight into the relationships between team member personality traits and team communication dynamics. We focus on turn-taking patterns between team members, independent of content, which can significantly influence team emergent states and outcomes while being objectively measurable and quantifiable. As our model is trained on conversational data from teams of given trait compositions, it can learn the relationships between individual traits and speaking behaviors and predict group-wide patterns of communication based on team trait composition alone. We first evaluate the performance of our model using simulated data and then apply it to real-world data collected from self-organized student teams. In comparison to baselines, our model is more accurate at predicting speaking turn sequences and can reveal new relationships between team member traits and their communication patterns. Our approach offers a more data-driven and dynamic understanding of team processes. By bridging the gap between individual personality traits and team communication patterns, our model has the potential to inform theories of team processes and provide powerful insights into optimizing team staffing and training.
zh
[NLP-79] A Comparative Analysis of Transformer and LSTM Models for Detecting Suicidal Ideation on Reddit ICMLA
【速读】: 该论文试图解决从社交媒体平台(如Reddit)上检测用户自杀倾向的问题。解决方案的关键在于评估和比较基于深度学习的Transformer模型(如BERT、RoBERTa、DistilBERT、ALBERT和ELECTRA)以及各种基于长短期记忆网络(LSTM)的模型在识别自杀倾向方面的有效性。研究结果表明,RoBERTa模型在准确率和F1分数上表现最佳,分别为93.22%和93.14%,而结合了注意力机制和BERT嵌入的LSTM模型紧随其后,准确率和F1分数分别为92.65%和92.69%。这些发现强调了基于Transformer的模型在提升自杀倾向检测方面的潜力,为开发强大的社交媒体心理健康监测工具提供了路径,从而有助于改进自杀预防工作。
链接: https://arxiv.org/abs/2411.15404
作者: Khalid Hasan,Jamil Saquer
关键词-EN: critical global health, global health problem, health problem involving, deaths yearly, young adults
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 23rd IEEE International Conference on Machine Learning and Applications, ICMLA 2024 (camera-ready)
点击查看摘要
Abstract:Suicide is a critical global health problem involving more than 700,000 deaths yearly, particularly among young adults. Many people express their suicidal thoughts on social media platforms such as Reddit. This paper evaluates the effectiveness of the deep learning transformer-based models BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA and various Long Short-Term Memory (LSTM) based models in detecting suicidal ideation from user posts on Reddit. Toward this objective, we curated an extensive dataset from diverse subreddits and conducted linguistic, topic modeling, and statistical analyses to ensure data quality. Our results indicate that each model could reach high accuracy and F1 scores, but among them, RoBERTa emerged as the most effective model with an accuracy of 93.22% and F1 score of 93.14%. An LSTM model that uses attention and BERT embeddings performed as the second best, with an accuracy of 92.65% and an F1 score of 92.69%. Our findings show that transformer-based models have the potential to improve suicide ideation detection, thereby providing a path to develop robust mental health monitoring tools from social media. This research, therefore, underlines the undeniable prospect of advanced techniques in Natural Language Processing (NLP) while improving suicide prevention efforts.
zh
[NLP-80] ChatBCI: A P300 Speller BCI Leveraging Large Language Models for Improved Sentence Composition in Realistic Scenarios
【速读】: 该论文试图解决P300拼写器脑机接口(BCI)在句子构建过程中高按键需求、时间消耗、认知负荷和疲劳的问题。解决方案的关键在于引入ChatBCI,这是一种利用大型语言模型(LLM)的零样本学习能力(zero-shot learning capabilities)来减少按键次数并加速句子构建的P300拼写器BCI。ChatBCI通过远程查询GPT-3.5 API获取单词建议,并设计了一个新的图形用户界面(GUI)来显示这些建议,从而显著减少了按键次数和时间消耗,提高了信息传输率(information transfer rate),特别是在用户自编句子和即兴创作句子的情况下表现尤为突出。
链接: https://arxiv.org/abs/2411.15395
作者: Jiazhen Hong,Weinan Wang,Laleh Najafizadeh
关键词-EN: EEG signals, selecting target keys, visual stimuli, speller BCIs, selecting target
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注:
点击查看摘要
Abstract:P300 speller BCIs allow users to compose sentences by selecting target keys on a GUI through the detection of P300 component in their EEG signals following visual stimuli. Most P300 speller BCIs require users to spell words letter by letter, or the first few initial letters, resulting in high keystroke demands that increase time, cognitive load, and fatigue. This highlights the need for more efficient, user-friendly methods for faster sentence composition. In this work, we introduce ChatBCI, a P300 speller BCI that leverages the zero-shot learning capabilities of large language models (LLMs) to suggest words from user-spelled initial letters or predict the subsequent word(s), reducing keystrokes and accelerating sentence composition. ChatBCI retrieves word suggestions through remote queries to the GPT-3.5 API. A new GUI, displaying GPT-3.5 word suggestions as extra keys is designed. SWLDA is used for the P300 classification. Seven subjects completed two online spelling tasks: 1) copy-spelling a self-composed sentence using ChatBCI, and 2) improvising a sentence using ChatBCI’s word suggestions. Results demonstrate that in Task 1, on average, ChatBCI outperforms letter-by-letter BCI spellers, reducing time and keystrokes by 62.14% and 53.22%, respectively, and increasing information transfer rate by 198.96%. In Task 2, ChatBCI achieves 80.68% keystroke savings and a record 8.53 characters/min for typing speed. Overall, ChatBCI, by employing remote LLM queries, enhances sentence composition in realistic scenarios, significantly outperforming traditional spellers without requiring local model training or storage. ChatBCI’s (multi-) word predictions, combined with its new GUI, pave the way for developing next-generation speller BCIs that are efficient and effective for real-time communication, especially for users with communication and motor disabilities.
zh
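下面是“向 GPT-3.5 请求候选单词”的极简示意(假设性写法:提示词与解析方式均为虚构,需设置 OPENAI_API_KEY 才能实际运行,并非原系统实现):

```python
# 示意:根据已拼出的首字母向 GPT-3.5 请求候选单词(假设性实现,需 API key)
from openai import OpenAI

client = OpenAI()  # 从环境变量读取 OPENAI_API_KEY

def suggest_words(sentence_so_far: str, initial_letters: str, n: int = 5) -> list:
    prompt = (
        f'The sentence so far is: "{sentence_so_far}".\n'
        f'Suggest {n} likely next words starting with "{initial_letters}".\n'
        "Answer with a comma-separated list only."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return [w.strip() for w in reply.choices[0].message.content.split(",")][:n]

print(suggest_words("I would like a cup of", "co"))
```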
[NLP-81] From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set
【速读】: 该论文试图解决在大规模语言模型(LLMs)评估中,依赖于固定测试集的传统自动评估方法的局限性问题。解决方案的关键在于设计了一种名为“Specialist”的方法,通过利用测试集上的历史评分来构建上下文学习(In-Context Learning, ICL)示例,从而使提示的自动评估模型(Autorater)专门化于特定的测试集。这种方法在细粒度机器翻译评估任务中显著优于现有的最先进评估指标XCOMET,分别在WMT’23和WMT’24测试集上提升了54%和119%的性能。
链接: https://arxiv.org/abs/2411.15387
作者: Mara Finkelstein,Dan Deutsch,Parker Riley,Juraj Juraska,Geza Kovacs,Markus Freitag
关键词-EN: powerful and versatile, quickly become intractable, intractable at scale, scale and reliance, LLMs continue
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT’23 and WMT’24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method for designing automatic metrics across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
zh
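下面给出 Specialist 思路的极简示意(假设性写法:历史打分与模板均为虚构,并非论文中的 MQM/XCOMET 设置):把同一测试集上的历史评分直接拼成 ICL 示例。

```python
# 示意:用测试集上的历史打分构造 ICL 评测提示(虚构数据,假设性实现)
history = [  # (源句, 译文, 历史人工评分 0-100)
    ("Das Wetter ist schön.", "The weather is nice.", 95),
    ("Er hat den Zug verpasst.", "He missed the bus.", 40),
]

def build_specialist_prompt(src: str, hyp: str) -> str:
    demos = "\n\n".join(
        f"Source: {s}\nTranslation: {h}\nScore: {score}"
        for s, h, score in history
    )
    return (
        "Rate the translation quality from 0 to 100.\n\n"
        f"{demos}\n\nSource: {src}\nTranslation: {hyp}\nScore:"
    )

print(build_specialist_prompt("Sie liest ein Buch.", "She is reading a book."))
```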
[NLP-82] On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
【速读】: 该论文试图解决大语言模型(LLMs)在特定任务微调(fine-tuning)过程中对其推理能力的影响问题。解决方案的关键在于系统地研究微调对LLMs推理能力的影响,特别是对链式思维(Chain-of-Thought, CoT)推理性能和推理的忠实性(faithfulness)的影响。通过分析微调对不同数据集上CoT推理忠实性的平均影响,研究发现微调过程可能导致LLMs内部机制的变化,从而影响其推理能力。
链接: https://arxiv.org/abs/2411.15382
作者: Elita Lobo,Chirag Agarwal,Himabindu Lakkaraju
关键词-EN: Large language models, advanced natural language, natural language processing, showcasing advanced natural, Large language
类目: Computation and Language (cs.CL)
备注: This paper is a work in progress with findings based on limited evidence. Please exercise discretion when interpreting the findings
点击查看摘要
Abstract:Large language models have emerged as powerful tools for general intelligence, showcasing advanced natural language processing capabilities that find applications across diverse domains. Despite their impressive performance, recent studies have highlighted the potential for significant enhancements in LLMs’ task-specific performance through fine-tuning strategies like Reinforcement Learning with Human Feedback (RLHF), supervised fine-tuning (SFT), and Quantized Low-Rank Adapters (Q-LoRA) method. However, previous works have shown that while fine-tuning offers significant performance gains, it also leads to challenges such as catastrophic forgetting and privacy and safety risks. To this end, there has been little to no work in \textitunderstanding the impact of fine-tuning on the reasoning capabilities of LLMs. Our research investigates the effect of fine-tuning on the reasoning abilities of LLMs, addressing critical questions regarding the impact of task-specific fine-tuning on overall reasoning capabilities, the influence of fine-tuning on Chain-of-Thought (CoT) reasoning performance, and the implications for the faithfulness of CoT reasonings. By exploring these dimensions, our study shows the impact of fine-tuning on LLM reasoning capabilities, where the faithfulness of CoT reasoning, on average across four datasets, decreases, highlighting potential shifts in internal mechanisms of the LLMs resulting from fine-tuning processes.
zh
[NLP-83] Transforming NLU with Babylon: A Case Study in Development of Real-time Edge-Efficient Multi-Intent Translation System for Automated Drive-Thru Ordering
【速读】: 该论文试图解决在动态户外环境中,如自动得来速系统中,实时对话AI代理进行自然语言理解(Natural Language Understanding, NLU)时面临的挑战。这些挑战包括处理背景噪音、多样口音、多意图查询,以及在边缘设备上严格的时间延迟和内存限制。解决方案的关键在于引入了一种名为Babylon的基于transformer的架构,将NLU任务视为意图翻译任务,将自然语言输入转换为编码意图和槽位信息的序列化常规语言单元(‘transcodes’)。这种设计使得Babylon能够在一个对话轮次中处理多意图场景。此外,Babylon还集成了基于LSTM的音素序列预处理机制,通过减少输入长度来优化低延迟和低内存的边缘部署,同时增强对上游自动语音识别(Automatic Speech Recognition, ASR)错误输出的鲁棒性。
链接: https://arxiv.org/abs/2411.15372
作者: Mostafa Varzaneh,Pooja Voladoddi,Tanmay Bakshi,Uma Gunturi
关键词-EN: Natural Language Understanding, agents face challenges, Language Understanding, performing Natural Language, Automatic Speech Recognition
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 3 figures, 2 tables
点击查看摘要
Abstract:Real-time conversational AI agents face challenges in performing Natural Language Understanding (NLU) in dynamic, outdoor environments like automated drive-thru systems. These settings require NLU models to handle background noise, diverse accents, and multi-intent queries while operating under strict latency and memory constraints on edge devices. Additionally, robustness to errors from upstream Automatic Speech Recognition (ASR) is crucial, as ASR outputs in these environments are often noisy. We introduce Babylon, a transformer-based architecture that tackles NLU as an intent translation task, converting natural language inputs into sequences of regular language units (‘transcodes’) that encode both intents and slot information. This formulation allows Babylon to manage multi-intent scenarios in a single dialogue turn. Furthermore, Babylon incorporates an LSTM-based token pooling mechanism to preprocess phoneme sequences, reducing input length and optimizing for low-latency, low-memory edge deployment. This also helps mitigate inaccuracies in ASR outputs, enhancing system robustness. While this work focuses on drive-thru ordering, Babylon’s design extends to similar noise-prone scenarios, for e.g. ticketing kiosks. Our experiments show that Babylon achieves significantly better accuracy-latency-memory footprint trade-offs over typically employed NMT models like Flan-T5 and BART, demonstrating its effectiveness for real-time NLU in edge deployment settings.
zh
[NLP-84] Exploring Facets of Language Generation in the Limit
【速读】: 该论文试图解决在给定未知目标语言的序列示例的情况下,如何生成新示例的问题,确保在某个点之后不再生成错误的示例。解决方案的关键在于区分两种生成模式:均匀生成(uniform generation)和非均匀生成(non-uniform generation)。论文展示了每个可数语言集合都存在一个具有非均匀生成特性的生成器,但同时指出,仅使用成员查询(membership queries)的算法无法实现非均匀生成,即使在仅包含两种语言的集合中也是如此。此外,论文通过引入穷尽生成(exhaustive generation)的概念,揭示了生成过程中有效性和广度之间的内在权衡。最后,论文探讨了在反馈模型下均匀生成的可能性,并完全刻画了在复杂度度量下可能实现均匀生成反馈的语言集合。
链接: https://arxiv.org/abs/2411.15364
作者: Moses Charikar,Chirag Pabbaraju
关键词-EN: Kleinberg and Mullainathan, unknown target language, generation, target language, language
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 24 pages
点击查看摘要
Abstract:The recent work of Kleinberg and Mullainathan [KM24] provides a concrete model for language generation in the limit: given a sequence of examples from an unknown target language, the goal is to generate new examples from the target language such that no incorrect examples are generated beyond some point. In sharp contrast to strong negative results for the closely related problem of language identification, they establish positive results for language generation in the limit for all countable collections of languages. Follow-up work by Raman and Tewari [RT24] studies bounds on the number of distinct inputs required by an algorithm before correct language generation is achieved – namely, whether this is a constant for all languages in the collection (uniform generation) or a language-dependent constant (non-uniform generation). We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit. However, while the generation algorithm of [KM24] can be implemented using membership queries, we show that any algorithm cannot non-uniformly generate even for collections of just two languages, using only membership queries. We also formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation, and show a strong negative result for exhaustive generation. Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit. Finally, inspired by algorithms that can choose to obtain feedback, we consider a model of uniform generation with feedback, completely characterizing language collections for which such uniform generation with feedback is possible in terms of a complexity measure of the collection.
zh
[NLP-85] PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models
【速读】: 该论文试图解决生成式大型语言模型(LLMs)在无监督情况下评估其响应质量的问题。解决方案的关键是提出了一种名为PPLqa的信息论度量方法,该方法易于计算且语言无关,能够在无需真实标注或人工监督的情况下,评估生成式LLMs的响应质量。PPLqa不仅涵盖了连贯性、流畅性(写作质量)以及相关性和一致性(响应的适当性),还能有效地对生成式语言模型进行排序,从而选择最适合特定任务的模型。该方法在长篇问答任务中表现尤为出色,能够替代传统的基于真实标注的评估过程,并与人类和LLM的排序结果高度相关。
链接: https://arxiv.org/abs/2411.15320
作者: Gerald Friedland,Xin Huang,Yueying Cui,Vishaal Kapoor,Ashish Khetan,Sanjiv Das
关键词-EN: Large Language Models, generative Large Language, generative language models, Large Language, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We propose PPLqa, an easy to compute, language independent, information-theoretic metric to measure the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enable users to rank generative language models for quality of responses, so as to make a selection of the best model for a given task. Our single metric assesses LLMs with an approach that subsumes, but is not explicitly based on, coherence and fluency (quality of writing) and relevance and consistency (appropriateness of response) to the query. PPLqa performs as well as other related metrics, and works better with long-form Q&A. Thus, PPLqa enables bypassing the lengthy annotation process required for ground truth evaluations, and it also correlates well with human and LLM rankings.
zh
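PPLqa 的具体公式摘要中并未给出,但其核心构件是基于困惑度(perplexity)的无监督打分。下面用一段示意代码演示"在给定问题条件下计算回答的困惑度"这一基本做法(编者示例,模型选择与函数划分均为假设,非论文官方实现):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 编者示意: 用一个独立的语言模型计算"给定问题条件下回答的困惑度",
# 困惑度越低通常意味着回答越流畅、越贴合问题; 真实的 PPLqa 指标以论文为准。
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def conditional_ppl(question: str, answer: str) -> float:
    q_ids = tok(question, return_tensors="pt").input_ids
    a_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([q_ids, a_ids], dim=1)
    labels = ids.clone()
    labels[:, : q_ids.shape[1]] = -100         # 只对回答部分计损失
    with torch.no_grad():
        loss = lm(ids, labels=labels).loss      # 回答 token 的平均负对数似然
    return float(torch.exp(loss))

# 对同一问题比较不同模型的回答: 困惑度较低者排名靠前
print(conditional_ppl("What is the capital of France?", " Paris is the capital of France."))
```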
[NLP-86] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)的评估问题。解决方案的关键在于系统性地总结和分类现有的评估基准类型,包括基础能力、模型自我分析和扩展应用的评估;详细描述基准构建的典型过程,如数据收集、标注和注意事项;以及提出系统化的评估方式,包括评判标准、度量方法和工具包。通过这些关键步骤,论文旨在为研究人员提供一个全面的框架,以便根据不同需求有效地评估MLLMs,并激发更优的评估方法,从而推动MLLM研究的进步。
链接: https://arxiv.org/abs/2411.15296
作者: Chaoyou Fu,Yi-Fan Zhang,Shukang Yin,Bo Li,Xinyu Fang,Sirui Zhao,Haodong Duan,Xing Sun,Ziwei Liu,Liang Wang,Caifeng Shan,Ran He
关键词-EN: Artificial General Intelligence, Multimodal Large Language, Large Language Models, General Intelligence, Artificial General
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Produced by MME+MMBench+LLaVA Teams. Project Page: this https URL
点击查看摘要
Abstract:As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops multimodal perception and reasoning capabilities that are impressive, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarised benchmark types divided by the evaluation capabilities, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.
zh
[NLP-87] Sycophancy in Large Language Models: Causes and Mitigations
【速读】: 该论文试图解决大型语言模型(LLMs)中存在的“谄媚行为”(sycophancy)问题,即模型过度同意或奉承用户,从而影响其可靠性和伦理部署。解决方案的关键在于识别和量化谄媚行为的倾向,分析其与幻觉(hallucination)和偏见(bias)等其他挑战的关系,并探索有效的缓解策略。关键方法包括改进训练数据、采用新颖的微调方法、部署后的控制机制以及解码策略。此外,论文还讨论了谄媚行为对AI对齐的广泛影响,并提出了未来研究的方向。
链接: https://arxiv.org/abs/2411.15287
作者: Lars Malmqvist
关键词-EN: demonstrated remarkable capabilities, language processing tasks, Large language models, natural language processing, processing tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their tendency to exhibit sycophantic behavior - excessively agreeing with or flattering users - poses significant risks to their reliability and ethical deployment. This paper provides a technical survey of sycophancy in LLMs, analyzing its causes, impacts, and potential mitigation strategies. We review recent work on measuring and quantifying sycophantic tendencies, examine the relationship between sycophancy and other challenges like hallucination and bias, and evaluate promising techniques for reducing sycophancy while maintaining model performance. Key approaches explored include improved training data, novel fine-tuning methods, post-deployment control mechanisms, and decoding strategies. We also discuss the broader implications of sycophancy for AI alignment and propose directions for future research. Our analysis suggests that mitigating sycophancy is crucial for developing more robust, reliable, and ethically-aligned language models.
zh
[NLP-88] BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques
【速读】: 该论文试图解决低资源语言(如孟加拉语)在自然语言理解任务中缺乏高效句子嵌入模型的问题。解决方案的关键在于引入了一种轻量级的跨语言知识蒸馏方法,通过从预训练的高性能英语句子嵌入模型中提取知识,构建适用于孟加拉语的轻量级句子转换器。这种方法不仅在多个下游任务(如释义检测、语义文本相似性(STS)和孟加拉语仇恨言论检测)中表现优异,而且其轻量级架构和较短的推理时间使其非常适合在资源受限的环境中部署,从而为低资源语言的实际NLP应用提供了有价值的解决方案。
链接: https://arxiv.org/abs/2411.15270
作者: Muhammad Rafsan Kabir,Md. Mohibur Rahman Nabil,Mohammad Ashrafuzzaman Khan
关键词-EN: require understanding natural, Sentence-level embedding, understanding natural language, require understanding, understanding natural
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted in ACAI 2024
点击查看摘要
Abstract:Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. Proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperformed existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages.
zh
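跨语言知识蒸馏(cross-lingual knowledge distillation)的常见做法是:用英语-孟加拉语平行句对,让学生模型对两种语言的句向量同时对齐英语教师模型的输出。下面给出这一核心损失的最小示意(编者草图,teacher/student 用占位的可调用对象代替,非论文官方代码):

```python
import torch
import torch.nn as nn

# 编者示意: 跨语言蒸馏的核心损失: 学生对(英语句, 孟加拉语译句)产生的句向量
# 都向英语教师模型的句向量看齐。实际中 teacher 为预训练英语句向量模型,
# student 为轻量多语学生模型(此处均为假设)。
mse = nn.MSELoss()

def distill_loss(teacher_embed, student_embed, en_batch, bn_batch):
    with torch.no_grad():
        target = teacher_embed(en_batch)            # 教师只看英语句
    loss_en = mse(student_embed(en_batch), target)  # 学生的英语向量对齐教师
    loss_bn = mse(student_embed(bn_batch), target)  # 学生的孟加拉语向量对齐同一目标
    return loss_en + loss_bn

# 用法示意: 以随机向量模拟两个编码器的输出 (batch=8, dim=384)
fake = lambda batch: torch.randn(8, 384)
print(distill_loss(fake, fake, ["en"] * 8, ["bn"] * 8))
```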
[NLP-89] ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
【速读】: 该论文试图解决大型视觉语言模型 (Large Vision Language Models, LVLMs) 在理解和响应复杂视觉文本上下文时存在的幻觉倾向问题,尤其是在需要高精度应用的实际场景中。解决方案的关键在于提出了一种轻量级、无需训练的方法,称为 ICT (Intervention-based Calibration Technique)。ICT 通过计算干预方向,调整模型对不同层次视觉信息的注意力,特别是在前向传递阶段对编码整体图像信息和细粒度对象细节的注意力头进行干预,从而有效减少语言先验的过度影响,缓解幻觉现象。该方法在少量数据上表现出色,并能跨不同数据集和模型泛化。
链接: https://arxiv.org/abs/2411.15268
作者: Junzhe Chen,Tianshu Zhang,Shiyu Huang,Yuwei Niu,Linfeng Zhang,Lijie Wen,Xuming Hu
关键词-EN: Large Vision Language, complex visual-textual contexts, recent breakthroughs achieved, Vision Language Models, Large Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite the recent breakthroughs achieved by Large Vision Language Models (LVLMs) in understanding and responding to complex visual-textual contexts, their inherent hallucination tendencies limit their practical application in real-world scenarios that demand high levels of precision. Existing methods typically either fine-tune the LVLMs using additional data, which incurs extra costs in manual annotation and computational resources, or perform comparisons at the decoding stage, which may eliminate useful language priors for reasoning while introducing inference time overhead. Therefore, we propose ICT, a lightweight, training-free method that calculates an intervention direction to shift the model's focus towards different levels of visual information, enhancing its attention to high-level and fine-grained visual details. During the forward pass stage, the intervention is applied to the attention heads that encode the overall image information and the fine-grained object details, effectively mitigating the over-reliance on language priors and thereby alleviating hallucinations. Extensive experiments demonstrate that ICT achieves strong performance with a small amount of data and generalizes well across different datasets and models. Our code will be public.
zh
[NLP-90] TPLogAD: Unsupervised Log Anomaly Detection Based on Event Templates and Key Parameters
【速读】: 该论文试图解决日志系统中异常检测的问题,特别是现有方法在捕捉日志条目中的特征和语义信息方面的不足,导致漏报和误报的问题。解决方案的关键在于提出了TPLogAD,一种基于事件模板和关键参数的通用无监督日志分析方法。TPLogAD通过itemplate2vec和para2vec两种高效的语义表示方法,分别对事件模板和参数进行异常检测,这在以往的工作中未曾实现。此外,TPLogAD能够避免日志多样性和动态性对异常检测的干扰,从而提高了检测的准确性。实验结果表明,TPLogAD在四个公开日志数据集上的表现优于现有的日志异常检测方法。
链接: https://arxiv.org/abs/2411.15250
作者: Jiawei Lu,Chengrong Wu
关键词-EN: Web service systems, Web service, anomaly detection, log anomaly detection, anomaly detection methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Log-system is an important mechanism for recording the runtime status and events of Web service systems, and anomaly detection in logs is an effective method of detecting problems. However, manual anomaly detection in logs is inefficient, error-prone, and unrealistic. Existing log anomaly detection methods either use the indexes of event templates, or form vectors by embedding the fixed string part of the template as a sentence, or use time parameters for sequence analysis. However, log entries often contain features and semantic information that cannot be fully represented by these methods, resulting in missed and false alarms. In this paper, we propose TPLogAD, a universal unsupervised method for analyzing unstructured logs, which performs anomaly detection based on event templates and key parameters. The itemplate2vec and para2vec included in TPLogAD are two efficient and easy-to-implement semantic representation methods for logs, detecting anomalies in event templates and parameters respectively, which has not been achieved in previous work. Additionally, TPLogAD can avoid the interference of log diversity and dynamics on anomaly detection. Our experiments on four public log datasets show that TPLogAD outperforms existing log anomaly detection methods.
zh
[NLP-91] The Zamba2 Suite: Technical Report
【速读】: 该论文旨在解决现有开源模型在推理延迟、吞吐量和内存效率方面的性能瓶颈问题。解决方案的关键在于提出了Zamba2系列模型,这是一组包含1.2B、2.7B和7.4B参数的混合Mamba2-transformer模型,通过优化架构、训练数据集和训练过程(最高达三万亿个token),实现了在保持与同类领先模型相当性能的同时,显著提升了推理效率。此外,论文还公开了Zamba2系列模型的权重、指令调优变体以及用于预训练的Zyda-2数据集,进一步推动了模型的开放性和可访问性。
链接: https://arxiv.org/abs/2411.15242
作者: Paolo Glorioso,Quentin Anthony,Yury Tokpanov,Anna Golubeva,Vasudev Shyam,James Whittington,Jonathan Pilault,Beren Millidge
关键词-EN: achieving substantial gains, leading open-weights models, parameter hybrid, technical report, inference latency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21/11/24 initial upload
点击查看摘要
Abstract:In this technical report, we present the Zamba2 series – a suite of 1.2B, 2.7B, and 7.4B parameter hybrid Mamba2-transformer models that achieve state of the art performance against the leading open-weights models of their class, while achieving substantial gains in inference latency, throughput, and memory efficiency. The Zamba2 series builds upon our initial work with Zamba1-7B, optimizing its architecture, training and annealing datasets, and training for up to three trillion tokens. We provide open-source weights for all models of the Zamba2 series as well as instruction-tuned variants that are strongly competitive against comparable instruct-tuned models of their class. We additionally open-source the pretraining dataset, which we call Zyda-2, used to train the Zamba2 series of models. The models and datasets used in this work are openly available at this https URL
zh
[NLP-92] BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models
【速读】: 该论文试图解决生物医学图像分类中,由于标注数据有限、图像对比度不直观以及视觉特征复杂,导致现有的视觉-语言模型(Vision-Language Models, VLMs)如CLIP在下游应用中适应性不足的问题。解决方案的关键在于提出了一个名为BiomedCoOp的新型提示学习框架,该框架通过利用大型语言模型(Large Language Models, LLMs)的语义一致性和基于统计的提示选择策略的知识蒸馏,实现了对BiomedCLIP模型的高效适应和少样本生物医学图像分类的准确性与泛化能力的显著提升。
链接: https://arxiv.org/abs/2411.15232
作者: Taha Koleilat,Hojat Asgariandehkordi,Hassan Rivaz,Yiming Xiao
关键词-EN: demonstrated substantial success, self-supervised representation learning, vision tasks, advancements in vision-language, demonstrated substantial
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 18 pages, 5 figures, 10 tables
点击查看摘要
Abstract:Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp) intend to tackle these issues, but still fall short in generalizability. Meanwhile, explorations in prompt learning for biomedical image analysis are still highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code will be publicly available at this https URL.
zh
[NLP-93] Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training BMVC’24
【速读】: 该论文试图解决在医疗领域中,由于隐私、敏感性和标注复杂性导致的跨模态数据获取困难和数据稀缺问题。解决方案的关键在于引入了一种名为 Uni-Mlip 的统一自监督框架,该框架在数据层和特征层上整合了跨模态、单模态和融合模态的自监督技术,并针对医疗图像的独特特性定制了单模态图像自监督方法。通过这种方法,Uni-Mlip 在图像-文本检索、图像分类和视觉问答 (VQA) 等下游任务中显著超越了当前最先进的方法。
链接: https://arxiv.org/abs/2411.15207
作者: Ameera Bawazir,Kebin Wu,Wenbin Li
关键词-EN: Recent advancements, computer vision tasks, contrastive learning, computer vision, Recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 2 figures, accepted by BMVC’24
点击查看摘要
Abstract:Recent advancements in vision-language pre-training via contrastive learning have significantly improved performance across computer vision tasks. However, in the medical domain, obtaining multimodal data is often costly and challenging due to privacy, sensitivity, and annotation complexity. To mitigate data scarcity while boosting model performance, we introduce Uni-Mlip, a unified self-supervision framework specifically designed to enhance medical vision-language pre-training. Uni-Mlip seamlessly integrates cross-modality, uni-modality, and fused-modality self-supervision techniques at the data-level and the feature-level. Additionally, Uni-Mlip tailors uni-modal image self-supervision to accommodate the unique characteristics of medical images. Our experiments across datasets of varying scales demonstrate that Uni-Mlip significantly surpasses current state-of-the-art methods in three key downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).
zh
[NLP-94] Multimodal large language model for wheat breeding: a new exploration of smart breeding
【速读】: 该论文试图解决作物育种中跨领域多模态数据的知识挖掘难题,特别是如何高效、准确地利用无人机遥感技术收集的作物表型数据。解决方案的关键在于开发智能育种目标工具,通过监督微调(SFT)、检索增强生成(RAG)和基于人类反馈的强化学习(RLHF)技术,将跨领域知识注入多模态大语言模型(MLLMs),构建适用于小麦育种的多模态大语言模型(WBLMs)。论文中评估了基于不同预训练MLLMs(如Qwen-VL, InternVL, Deepseek-VL)构建的WBLMs,结果表明,结合SFT、RAG和RLHF技术的InternVL2-8B模型表现最佳,尤其在小麦产量预测和多任务决策支持生成方面表现突出。
链接: https://arxiv.org/abs/2411.15203
作者: Guofeng Yang,Yu Li,Yong He,Zhenjiang Zhou,Lingzhen Ye,Hui Fang,Yiqi Luo,Xuping Feng
关键词-EN: UAV remote sensing, key technology, crop phenotyping data, UAV remote, achieve high-throughput
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:UAV remote sensing technology has become a key technology in crop breeding, which can achieve high-throughput and non-destructive collection of crop phenotyping data. However, the multidisciplinary nature of breeding has brought technical barriers and efficiency challenges to knowledge mining. Therefore, it is important to develop a smart breeding goal tool to mine cross-domain multimodal data. Based on different pre-trained open-source multimodal large language models (MLLMs) (e.g., Qwen-VL, InternVL, Deepseek-VL), this study used supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) technologies to inject cross-domain knowledge into MLLMs, thereby constructing multiple multimodal large language models for wheat breeding (WBLMs). The above WBLMs were evaluated using the newly created evaluation benchmark in this study. The results showed that the WBLM constructed using SFT, RAG and RLHF technologies and InternVL2-8B has leading performance. Then, subsequent experiments were conducted using the WBLM. Ablation experiments indicated that the combination of SFT, RAG, and RLHF technologies can improve the overall generation performance, enhance the generated quality, balance the timeliness and adaptability of the generated answer, and reduce hallucinations and biases. The WBLM performed best in wheat yield prediction using cross-domain data (remote sensing, phenotyping, weather, germplasm) simultaneously, with R2 and RMSE of 0.821 and 489.254 kg/ha, respectively. Furthermore, the WBLM can generate professional decision support answers for phenotyping estimation, environmental stress assessment, target germplasm screening, cultivation technique recommendation, and seed price query tasks.
zh
[NLP-95] Graph Neural Network-Based Entity Extraction and Relationship Reasoning in Complex Knowledge Graphs
【速读】: 该论文试图解决知识图谱中实体提取和关系推理的问题。解决方案的关键在于利用图神经网络(Graph Neural Network),特别是图卷积网络(Graph Convolutional Network)和图注意力网络(Graph Attention Network),来建模知识图谱的复杂结构。通过构建一个端到端的联合模型,实现了实体和关系的高效识别与推理。实验结果表明,该模型在复杂知识图谱中表现出更强的泛化能力和稳定性,为知识图谱的进一步研究提供了有力支持,并展示了图神经网络在实体提取和关系推理中的应用潜力。
链接: https://arxiv.org/abs/2411.15195
作者: Junliang Du,Guiran Liu,Jia Gao,Xiaoxuan Liao,Jiacheng Hu,Linxiao Wu
关键词-EN: reasoning algorithm based, relationship reasoning algorithm, relationship reasoning, extraction and relationship, graph convolutional network
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This study proposed a knowledge graph entity extraction and relationship reasoning algorithm based on a graph neural network, using a graph convolutional network and graph attention network to model the complex structure in the knowledge graph. By building an end-to-end joint model, this paper achieves efficient recognition and reasoning of entities and relationships. In the experiment, this paper compared the model with a variety of deep learning algorithms and verified its superiority through indicators such as AUC, recall rate, precision rate, and F1 value. The experimental results show that the model proposed in this paper performs well in all indicators, especially in complex knowledge graphs, it has stronger generalization ability and stability. This provides strong support for further research on knowledge graphs and also demonstrates the application potential of graph neural networks in entity extraction and relationship reasoning.
zh
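摘要中用于聚合邻居实体特征的图卷积网络(GCN),其最基础的构件可以用几行代码示意(编者示例,仅展示 D^{-1/2}AD^{-1/2}XW 形式的基本聚合,论文中的图注意力与实体/关系联合建模从略):

```python
import torch
import torch.nn as nn

# 编者示意: 知识图谱上最基础的 GCN 层, 用归一化邻接矩阵聚合邻居实体特征。
class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_entities, in_dim), adj: (num_entities, num_entities) 的 0/1 邻接矩阵
        a_hat = adj + torch.eye(adj.size(0))             # 加自环
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm @ x))         # D^{-1/2} A D^{-1/2} X W

# 5 个实体、随机特征与邻接关系
x = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
print(GCNLayer(16, 8)(x, adj).shape)  # torch.Size([5, 8])
```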
[NLP-96] Guiding Word Equation Solving using Graph Neural Networks (Extended Technical Report)
【速读】: 该论文试图解决的是基于Nielsen变换的词方程求解问题,关键在于提出了一种由图神经网络(Graph Neural Networks, GNNs)引导的算法。该算法通过迭代重写方程的每一侧的首项,生成树状搜索空间,并在每次分裂点处利用GNNs进行高效的分裂决策。分裂决策被编码为多分类任务,论文还引入了五种图表示方法来编码词方程的结构信息,以供GNNs使用。实验结果表明,该算法在可满足性问题上表现尤为出色,对于单个词方程,DragonLi解算器能够解决比现有字符串解算器更多的问题;对于多个词方程的合取,DragonLi也与最先进的字符串解算器相媲美。
链接: https://arxiv.org/abs/2411.15194
作者: Parosh Aziz Abdulla,Mohamed Faouzi Atig,Julie Cailler,Chencheng Liang,Philipp Rümmer
关键词-EN: well-known Nielsen transformation, Graph Neural Network-guided, Neural Network-guided algorithm, Neural Network-guided, Graph Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:
点击查看摘要
Abstract:This paper proposes a Graph Neural Network-guided algorithm for solving word equations, based on the well-known Nielsen transformation for splitting equations. The algorithm iteratively rewrites the first terms of each side of an equation, giving rise to a tree-like search space. The choice of path at each split point of the tree significantly impacts solving time, motivating the use of Graph Neural Networks (GNNs) for efficient split decision-making. Split decisions are encoded as multi-classification tasks, and five graph representations of word equations are introduced to encode their structural information for GNNs. The algorithm is implemented as a solver named DragonLi. Experiments are conducted on artificial and real-world benchmarks. The algorithm performs particularly well on satisfiable problems. For single word equations, DragonLi can solve significantly more problems than well-established string solvers. For the conjunction of multiple word equations, DragonLi is competitive with state-of-the-art string solvers.
zh
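Nielsen 变换对词方程首符号的分裂规则较为具体,下面给出一个纯符号层面的最小示意(编者草图:约定大写字母为变量、小写字母为常量,分支枚举方式为常见写法之一;DragonLi 中由 GNN 指导的分支选择不在此展示):

```python
# 编者示意: Nielsen 变换对词方程 (lhs = rhs) 首符号的分裂规则。
# 返回值是若干个后继方程, 对应搜索树的分支; 求解器在每个分裂点选择优先展开的分支。
def nielsen_split(lhs: str, rhs: str):
    if not lhs or not rhs:                        # 一侧为空: 不再分裂(由求解器判定)
        return []
    a, b = lhs[0], rhs[0]
    if a == b:                                    # 首符号相同: 同时消去
        return [(lhs[1:], rhs[1:])]
    branches = []
    for var, head in ((a, b), (b, a)):
        if var.isupper():                         # 某一侧以变量 X 开头
            # 分支 1: X = 空串, 从两侧删除所有 X
            branches.append((lhs.replace(var, ""), rhs.replace(var, "")))
            # 分支 2: X = 首符号·X, 把每个 X 重写为 head + X
            branches.append((lhs.replace(var, head + var),
                             rhs.replace(var, head + var)))
    return branches

# 例: 方程 Xab = aXb, 首符号 X 与 a 不同, 产生 X=ε 与 X=aX 两个分支
for eq in nielsen_split("Xab", "aXb"):
    print(eq)
```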
[NLP-97] Can Open-source LLMs Enhance Data Augmentation for Toxic Detection?: An Experimental Study
【速读】: 该论文试图解决在内容审核中高质量、多样化有害数据生成的问题,特别是在毒性内容检测方面。解决方案的关键在于利用提示工程(prompt engineering)和微调(fine-tuning)技术对开源大型语言模型(LLMs)进行优化,以增强有害数据的生成能力。研究通过两阶段实验,第一阶段评估了六个开源LLMs在多个数据集上的表现,仅使用提示工程;第二阶段则专注于微调。研究发现,Mistral模型在生成有害数据时表现出较低的幻觉(hallucination)率。尽管微调提高了数据质量和多样性,但仍面临数据重复和过拟合的挑战。实验结果表明,这种方法在提升毒性内容检测系统方面具有可扩展性和成本效益,证明了开源LLMs在创建强大内容审核工具方面的潜力。
链接: https://arxiv.org/abs/2411.15175
作者: Zheng Hui,Zhaoxiao Guo,Hang Zhao,Juanyong Duan,Lin Ai,Yinheng Li,Julia Hirschberg,Congrui Huang
关键词-EN: toxic content detection, addressing real-time applications, toxic content, content detection, essential to addressing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:High-quality, diverse harmful data is essential to addressing real-time applications in content moderation. Current state-of-the-art approaches to toxic content detection using GPT series models are costly and lack explainability. This paper investigates the use of prompt engineering and fine-tuning techniques on open-source LLMs to enhance harmful data augmentation specifically for toxic content detection. We conduct a two-stage empirical study, with stage 1 evaluating six open-source LLMs across multiple datasets using only prompt engineering and stage 2 focusing on fine-tuning. Our findings indicate that Mistral can excel in generating harmful data with minimal hallucination. While fine-tuning these models improves data quality and diversity, challenges such as data duplication and overfitting persist. Our experimental results highlight scalable, cost-effective strategies for enhancing toxic content detection systems. These findings demonstrate the potential of open-source LLMs in creating robust content moderation tools, and the application of this method in real industrial scenarios further proves the feasibility and efficiency of the fine-tuned open-source LLMs for data augmentation. We hope our study will aid in understanding the capabilities and limitations of current models in toxic content detection and drive further advancements in this field.
zh
[NLP-98] Kleene algebra with commutativity conditions is undecidable
【速读】: 该论文试图解决Kleene代数(Kleene algebra)中关于原语(atomic terms)交换性条件的等式理论的可判定性问题。解决方案的关键在于证明了即使在较弱的理论中,不支持Kleene代数的归纳公理,该等式理论仍然是不可判定的。这一结果解决了长期以来在Kleene代数理论中的一个开放问题,并且与Kuznetsov独立解决该问题的结果一致。
链接: https://arxiv.org/abs/2411.15979
作者: Arthur Azevedo de Amorim,Cheng Zhang,Marco Gaboardi
关键词-EN: Kleene algebra, theory of Kleene, Toggle, longstanding open question, Kleene
类目: Logic (math.LO); Computational Complexity (cs.CC); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注: Published at CSL 2025
点击查看摘要
Abstract:We prove that the equational theory of Kleene algebra with commutativity conditions on primitives (or atomic terms) is undecidable, thereby settling a longstanding open question in the theory of Kleene algebra. While this question has also been recently solved independently by Kuznetsov, our results hold even for weaker theories that do not support the induction axioms of Kleene algebra.
zh
[NLP-99] Bio-inspired AI: Integrating Biological Complexity into Artificial Intelligence
【速读】: 该论文试图解决的问题是如何设计出更加适应性强且鲁棒的人工智能系统。解决方案的关键在于借鉴生物计算的基本原则,特别是上下文依赖的层次信息处理、试错启发式方法以及多尺度组织结构。通过深入研究生物智能的微妙机制,如自上而下的因果关系和与环境的适应性交互,论文旨在揭示现有人工智能系统中的潜在局限性,并提供一个受生物系统启发的框架,以设计更为智能和灵活的人工智能系统。
链接: https://arxiv.org/abs/2411.15243
作者: Nima Dehghani,Michael Levin
关键词-EN: mirrors our longstanding, creating artificial intelligence, pursuit of creating, longstanding fascination, fascination with understanding
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
备注:
点击查看摘要
Abstract:The pursuit of creating artificial intelligence (AI) mirrors our longstanding fascination with understanding our own intelligence. From the myths of Talos to Aristotelian logic and Heron’s inventions, we have sought to replicate the marvels of the mind. While recent advances in AI hold promise, singular approaches often fall short in capturing the essence of intelligence. This paper explores how fundamental principles from biological computation–particularly context-dependent, hierarchical information processing, trial-and-error heuristics, and multi-scale organization–can guide the design of truly intelligent systems. By examining the nuanced mechanisms of biological intelligence, such as top-down causality and adaptive interaction with the environment, we aim to illuminate potential limitations in artificial constructs. Our goal is to provide a framework inspired by biological systems for designing more adaptable and robust artificial intelligent systems.
zh
计算机视觉
[CV-0] Generative Omnimatte: Learning to Decompose Video into Layers
【速读】: 该论文试图解决现有视频分解方法在面对动态背景或不准确的姿态和深度估计时表现不佳的问题,特别是在处理被遮挡的动态区域时缺乏生成先验。解决方案的关键在于提出了一种新的生成式分层视频分解框架,该框架不依赖于静态场景假设或相机姿态和深度信息,而是通过训练视频扩散模型来识别和去除特定对象引起的场景效果。核心思想是利用视频扩散模型从现有的视频修复模型中微调,以生成高质量的分解层,包括对被遮挡动态区域的合理补全。
链接: https://arxiv.org/abs/2411.16683
作者: Yao-Chih Lee,Erika Lu,Sarah Rumbley,Michal Geyer,Jia-Bin Huang,Tali Dekel,Forrester Cole
关键词-EN: input object masks, semantically meaningful layers, omnimatte method aims, set of input, aims to decompose
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.
zh
[CV-1] Factorized Visual Tokenization and Generation
【速读】: 该论文试图解决基于向量量化(VQ)的视觉分词器在处理大规模词汇时面临的训练不稳定性和性能提升有限的问题。解决方案的关键在于引入因子分解量化(Factorized Quantization, FQ),通过将大型码本分解为多个独立的子码本,从而降低查找复杂度,提高视觉分词的效率和可扩展性。此外,论文还提出了一种解耦正则化方法,以减少子码本之间的冗余,促进多样性,并通过集成表示学习,利用预训练的视觉模型(如CLIP和DINO)来丰富学习到的表示,确保分词器能够捕捉多层次的语义信息,从而生成更具表现力和解耦的表示。实验结果表明,FQGAN模型显著提升了视觉分词器的重建质量,达到了最先进的性能水平,并展示了其在自回归图像生成中的有效性。
链接: https://arxiv.org/abs/2411.16681
作者: Zechen Bai,Jianxiong Gao,Ziteng Gao,Pichao Wang,Zheng Zhang,Tong He,Mike Zheng Shou
关键词-EN: image generation, generation, auto-regressive image generation, image, Visual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation. this https URL
zh
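因子分解量化(FQ)的前向查表过程可以用几行代码示意:把潜向量切成若干段,每段在各自独立的子码本中做最近邻查找再拼接(编者草图,码本大小与段数为假设,训练中的解耦正则与表示学习蒸馏从略):

```python
import torch

# 编者示意: FQ 的前向量化: 潜向量分段后分别在子码本中查最近码字,
# 整体查找开销随子码本数量增加而显著下降(非论文官方实现)。
def factorized_quantize(z, codebooks):
    k = len(codebooks)
    chunks = z.chunk(k, dim=-1)                      # 每段形状 (batch, dim/k)
    quantized = []
    for chunk, book in zip(chunks, codebooks):       # book: (codebook_size, dim/k)
        dist = torch.cdist(chunk, book)              # (batch, codebook_size)
        idx = dist.argmin(dim=-1)
        quantized.append(book[idx])                  # 取最近的码字
    return torch.cat(quantized, dim=-1)              # 拼回完整潜向量

z = torch.randn(4, 64)
books = [torch.randn(256, 32) for _ in range(2)]     # 2 个子码本, 各 256 个 32 维码字
print(factorized_quantize(z, books).shape)           # torch.Size([4, 64])
```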
[CV-2] Quark: Real-time High-resolution and General Neural View Synthesis SIGGRAPH
【速读】: 该论文试图解决高分辨率、实时的新视角合成问题。解决方案的关键在于结合了多个创新概念,包括使用分层深度图 (Layered Depth Maps, LDMs) 来高效表示复杂深度和遮挡场景,采用迭代学习渲染与优化方法,以及在多尺度 UNet 架构中嵌入更新步骤。此外,论文引入了基于 Transformer 的网络组件,以在输入图像空间中处理多视图信息,从而提高效率。最终,通过动态生成和丢弃每帧的内部3D几何结构,实现了实时重建和渲染。这些创新点共同构成了一个高效且高质量的新视角合成算法。
链接: https://arxiv.org/abs/2411.16680
作者: John Flynn,Michael Broxton,Lukas Murmann,Lucy Chai,Matthew DuVall,Clément Godard,Kathryn Heal,Srinivas Kaza,Stephen Lombardi,Xuan Luo,Supreeth Achar,Kira Prabhu,Tiancheng Sun,Lynn Tsai,Ryan Overbeck
关键词-EN: performing high-quality, neural algorithm, input RGB images, quality, RGB images
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: SIGGRAPH Asia 2024 camera ready version; project page this https URL
点击查看摘要
Abstract:We present a novel neural algorithm for performing high-quality, high-resolution, real-time novel view synthesis. From a sparse set of input RGB images or videos streams, our network both reconstructs the 3D scene and renders novel views at 1080p resolution at 30fps on an NVIDIA A100. Our feed-forward network generalizes across a wide variety of datasets and scenes and produces state-of-the-art quality for a real-time method. Our quality approaches, and in some cases surpasses, the quality of some of the top offline methods. In order to achieve these results we use a novel combination of several key concepts, and tie them together into a cohesive and effective algorithm. We build on previous works that represent the scene using semi-transparent layers and use an iterative learned render-and-refine approach to improve those layers. Instead of flat layers, our method reconstructs layered depth maps (LDMs) that efficiently represent scenes with complex depth and occlusions. The iterative update steps are embedded in a multi-scale, UNet-style architecture to perform as much compute as possible at reduced resolution. Within each update step, to better aggregate the information from multiple input views, we use a specialized Transformer-based network component. This allows the majority of the per-input image processing to be performed in the input image space, as opposed to layer space, further increasing efficiency. Finally, due to the real-time nature of our reconstruction and rendering, we dynamically create and discard the internal 3D geometry for each frame, generating the LDM for each view. Taken together, this produces a novel and effective algorithm for view synthesis. Through extensive evaluation, we demonstrate that we achieve state-of-the-art quality at real-time rates. Project page: this https URL
zh
[CV-3] Diffusion Features for Zero-Shot 6DoF Object Pose Estimation
【速读】: 该论文试图解决零样本物体姿态估计问题,即在不依赖特定物体训练数据的情况下,从图像中提取物体姿态。解决方案的关键在于采用基于Latent Diffusion Model (LDM) 的骨干网络进行特征提取,并提出了一种基于模板的多阶段方法来实现零样本姿态估计。通过在三个标准数据集上的实验,论文展示了该方法相较于基于Vision Transformers (ViT) 的基线模型,平均召回率提高了27%。
链接: https://arxiv.org/abs/2411.16668
作者: Bernd Von Gimborn,Philipp Ausserlechner,Markus Vincze,Stefan Thalhammer
关键词-EN: Zero-shot object pose, object pose estimation, enables the retrieval, images without necessitating, necessitating object-specific training
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Zero-shot object pose estimation enables the retrieval of object poses from images without necessitating object-specific training. In recent approaches this is facilitated by vision foundation models (VFM), which are pre-trained models that are effectively general-purpose feature extractors. The characteristics exhibited by these VFMs vary depending on the training data, network architecture, and training paradigm. The prevailing choice in this field are self-supervised Vision Transformers (ViT). This study assesses the influence of Latent Diffusion Model (LDM) backbones on zero-shot pose estimation. In order to facilitate a comparison between the two families of models on a common ground we adopt and modify a recent approach. Therefore, a template-based multi-staged method for estimating poses in a zero-shot fashion using LDMs is presented. The efficacy of the proposed approach is empirically evaluated on three standard datasets for object-specific 6DoF pose estimation. The experiments demonstrate an Average Recall improvement of up to 27% over the ViT baseline. The source code is available at: this https URL.
zh
[CV-4] Edge Weight Prediction For Category-Agnostic Pose Estimation
【速读】: 该论文试图解决在多类别物体姿态估计中,现有方法在处理遮挡和对称性问题时表现不佳的问题。解决方案的关键在于引入了一种名为EdgeCape的新框架,该框架通过预测姿态图中边的权重来优化关键点的定位。此外,论文还提出了结合马尔可夫结构偏置(Markovian Structural Bias)的方法,该方法根据节点间的跳数调节自注意力机制中的交互,从而增强模型捕捉全局空间依赖性的能力。这些创新使得EdgeCape在MP-100基准测试中,在1-shot和5-shot设置下均取得了最先进的性能,显著提升了关键点定位的准确性。
链接: https://arxiv.org/abs/2411.16665
作者: Or Hirschorn,Shai Avidan
关键词-EN: Category-Agnostic Pose Estimation, Pose Estimation, annotated support images, diverse object categories, diverse object
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Category-Agnostic Pose Estimation (CAPE) localizes keypoints across diverse object categories with a single model, using one or a few annotated support images. Recent works have shown that using a pose graph (i.e., treating keypoints as nodes in a graph rather than isolated points) helps handle occlusions and break symmetry. However, these methods assume a static pose graph with equal-weight edges, leading to suboptimal results. We introduce EdgeCape, a novel framework that overcomes these limitations by predicting the graph’s edge weights which optimizes localization. To further leverage structural priors, we propose integrating Markovian Structural Bias, which modulates the self-attention interaction between nodes based on the number of hops between them. We show that this improves the model’s ability to capture global spatial dependencies. Evaluated on the MP-100 benchmark, which includes 100 categories and over 20K images, EdgeCape achieves state-of-the-art results in the 1-shot setting and leads among similar-sized methods in the 5-shot setting, significantly improving keypoint localization accuracy. Our code is publicly available.
zh
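摘要中提到的马尔可夫结构偏置(Markovian Structural Bias)按关键点之间的跳数调节自注意力,一种最小实现思路如下(编者草图,偏置的具体参数化与截断跳数以论文为准):

```python
import torch
import torch.nn as nn

# 编者示意: 先计算姿态图中关键点两两之间的跳数, 再把每个跳数映射为可学习的标量偏置,
# 加到自注意力 logits 上, 使图上相邻的关键点交互更强(非论文官方实现)。
def hop_distances(adj: torch.Tensor, max_hops: int) -> torch.Tensor:
    n = adj.size(0)
    dist = torch.full((n, n), max_hops)
    dist.fill_diagonal_(0)
    reach = torch.eye(n, dtype=torch.bool)
    frontier = adj.bool()
    for h in range(1, max_hops):
        newly = frontier & ~reach
        dist[newly] = h
        reach |= frontier
        frontier = (frontier.float() @ adj).bool()
    return dist                                       # (n, n) 跳数矩阵, 超过上限记为 max_hops

class HopBias(nn.Module):
    def __init__(self, max_hops: int = 4):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(max_hops + 1))   # 每个跳数一个可学习偏置

    def forward(self, attn_logits: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        dist = hop_distances(adj, self.bias.numel() - 1)
        return attn_logits + self.bias[dist]          # 按跳数查表后加到注意力分数上

# 3 个关键点构成一条链 0-1-2
adj = torch.tensor([[0, 1, 0], [1, 0, 1], [0, 1, 0]]).float()
print(HopBias()(torch.zeros(3, 3), adj))
```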
[CV-5] Imperceptible Adversarial Examples in the Physical World
【速读】: 该论文试图解决在物理世界中生成不可察觉的对抗样本(adversarial examples)的问题,特别是在深度学习计算机视觉模型中。现有的方法在生成物理可实现的对抗样本时,通常放宽了对对抗样本的定义,允许无界的扰动,导致明显的或甚至奇怪的视觉模式。论文的关键解决方案是使用直通估计器(Straight-Through Estimator, STE)来克服视觉传感系统中非可微图像失真函数的挑战。通过在反向传播的前向过程中应用精确的非可微失真,并在反向过程中使用恒等函数,STE使得在物理世界中生成不可察觉的对抗样本成为可能。论文还扩展了STE以实现可微渲染,从而在物理世界中生成不可察觉的对抗补丁(adversarial patches)。实验结果表明,尽管存在非可微失真,STE仍能快速生成具有小ℓ∞范数的对抗样本,并在全局扰动威胁模型中迫使分类准确率为零,在补丁扰动威胁模型中导致接近零的AP50。
链接: https://arxiv.org/abs/2411.16622
作者: Weilin Xu,Sebastian Szyller,Cory Cornelius,Luis Murillo Rojas,Marius Arvinte,Alvaro Velasquez,Jason Martin,Nageen Himayat
关键词-EN: deep learning-based computer, learning-based computer vision, computer vision models, physical world, Adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Adversarial examples in the digital domain against deep learning-based computer vision models allow for perturbations that are imperceptible to human eyes. However, producing similar adversarial examples in the physical world has been difficult due to the non-differentiable image distortion functions in visual sensing systems. The existing algorithms for generating physically realizable adversarial examples often loosen their definition of adversarial examples by allowing unbounded perturbations, resulting in obvious or even strange visual patterns. In this work, we make adversarial examples imperceptible in the physical world using a straight-through estimator (STE, a.k.a. BPDA). We employ STE to overcome the non-differentiability – applying exact, non-differentiable distortions in the forward pass of the backpropagation step, and using the identity function in the backward pass. Our differentiable rendering extension to STE also enables imperceptible adversarial patches in the physical world. Using printout photos, and experiments in the CARLA simulator, we show that STE enables fast generation of ℓ∞-bounded adversarial examples despite the non-differentiable distortions. To the best of our knowledge, this is the first work demonstrating imperceptible adversarial examples bounded by small ℓ∞ norms in the physical world that force zero classification accuracy in the global perturbation threat model and cause near-zero (4.22%) AP50 in object detection in the patch perturbation threat model. We urge the community to re-evaluate the threat of adversarial examples in the physical world.
zh
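直通估计器(STE)的写法非常固定:前向应用真实的不可微失真,反向把梯度当作恒等函数原样传回,使基于梯度的攻击可以"穿过"视觉传感链路去优化扰动。下面以 8-bit 量化代替打印-拍照流程给出一个可运行的最小示例(编者示意,失真函数为假设,非论文官方实现):

```python
import torch

# 编者示意: STE 处理不可微图像失真的基本写法。
class QuantizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, img: torch.Tensor) -> torch.Tensor:
        return torch.round(img * 255.0) / 255.0   # 不可微的量化失真(示例)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        return grad_output                        # 反向当作恒等函数

img = torch.rand(1, 3, 8, 8, requires_grad=True)
out = QuantizeSTE.apply(img)
out.sum().backward()
print(img.grad.abs().sum())                       # 梯度成功穿过不可微算子
```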
[CV-6] Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric
【速读】: 该论文试图解决AI生成的视频(AGVs)中涉及人类活动时经常出现的视觉和语义失真问题,这些问题阻碍了视频生成技术在实际场景中的应用。解决方案的关键在于构建了一个名为AI-Generated Human activity Video Quality Assessment (Human-AGVQA)的数据集,并开发了一种客观评估指标——AI-Generated Human activity Video Quality metric (GHVQ)。Human-AGVQA数据集包含3200个由8种流行的文本到视频(T2V)模型生成的AGVs,通过400个描述多样人类活动的文本提示构建。GHVQ指标系统地提取了以人为中心的质量特征、AI生成内容感知质量特征和时间连续性特征,使其成为评估人类活动AGVs质量的综合且可解释的工具。实验结果表明,GHVQ在Human-AGVQA数据集上的表现显著优于现有质量指标,证明了其在评估人类活动AGVs质量方面的有效性。
链接: https://arxiv.org/abs/2411.16619
作者: Zhichao Zhang,Wei Sun,Xinyue Li,Yunhao Li,Qihang Ge,Jun Jia,Zicheng Zhang,Zhongpeng Ji,Fengyu Sun,Shangling Jui,Xiongkuo Min,Guangtao Zhai
关键词-EN: made significant progress, human activity AGVs, human activity, Human activity Video, AI-driven video generation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 3,200 AGVs derived from 8 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released in public at this https URL
zh
[CV-7] GeoFormer: A Multi-Polygon Segmentation Transformer
【速读】: 该论文试图解决遥感领域中建筑物等目标物体的尺度不变形状学习问题,传统方法依赖于调整多个损失函数将分割图转换为最终的尺度不变表示,这需要繁琐的设计和优化。论文提出的解决方案是引入GeoFormer,一种新颖的架构,通过端到端的方式学习生成多边形。关键在于将关键点建模为空间依赖的token,并以自回归方式进行处理,从而优化单一似然函数,显著提升了从卫星图像中描绘建筑物对象的性能。这是首次成功应用自回归transformer模型进行遥感中的多边形预测,为建筑物矢量化提供了一种有前景的方法论替代方案。
链接: https://arxiv.org/abs/2411.16616
作者: Maxim Khomiakov,Michael Riis Andersen,Jes Frellsen
关键词-EN: scale invariant shapes, learning scale invariant, scale invariant representation, final scale invariant, scale invariant
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 5 figures, in proceedings of British Machine Vision Conference 2024
点击查看摘要
Abstract:In remote sensing there exists a common need for learning scale invariant shapes of objects like buildings. Prior work relies on tweaking multiple loss functions to convert segmentation maps into the final scale invariant representation, necessitating arduous design and optimization. For this purpose we introduce the GeoFormer, a novel architecture which presents a remedy to the said challenges, learning to generate multipolygons end-to-end. By modeling keypoints as spatially dependent tokens in an auto-regressive manner, the GeoFormer outperforms existing works in delineating building objects from satellite imagery. We evaluate the robustness of the GeoFormer against former methods through a variety of parameter ablations and highlight the advantages of optimizing a single likelihood function. Our study presents the first successful application of auto-regressive transformer models for multi-polygon predictions in remote sensing, suggesting a promising methodological alternative for building vectorization.
zh
[CV-8] Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models
【速读】: 该论文试图解决现有文本到SVG生成方法在形状规则性、泛化能力和表达性方面的局限性。解决方案的关键在于引入Chat2SVG,这是一个结合了大型语言模型(Large Language Models, LLMs)和图像扩散模型的混合框架。该框架首先利用LLM生成基于基本几何图元的语义上有意义的SVG模板,然后通过图像扩散模型引导的双阶段优化流程,在潜在空间中精炼路径并调整点坐标,以增强几何复杂性。这种方法不仅提高了视觉保真度、路径规则性和语义对齐,还通过自然语言指令实现了直观的编辑功能,使专业矢量图形创作对所有用户更加便捷。
链接: https://arxiv.org/abs/2411.16602
作者: Ronghuan Wu,Wanchao Su,Jing Liao
关键词-EN: Scalable Vector Graphics, offering resolution independence, Scalable Vector, Vector Graphics, digital design
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL
点击查看摘要
Abstract:Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.
zh
[CV-9] Unlocking The Potential of Adaptive Attacks on Diffusion-Based Purification
【速读】: 该论文试图解决扩散式净化 (Diffusion-based purification, DBP) 在对抗样本 (Adversarial examples, AEs) 防御中的有效性问题。论文指出,尽管DBP因其对攻击的不可知性和对强敌手的抵抗能力而受到欢迎,但其核心基础在面对基于梯度的自适应攻击 (adaptive attacks) 时被破坏。解决方案的关键在于重新审视和修正用于DBP的梯度反向传播技术中的实现缺陷,并提出了一种新的优化方法,该方法结合了自适应攻击,能够完全击败DBP,即使在多数投票设置下也是如此。论文通过提供首个可靠的DBP梯度库,展示了自适应攻击如何显著降低DBP的鲁棒性,从而证明DBP在当前状态下并非对抗样本的有效防御手段。
链接: https://arxiv.org/abs/2411.16598
作者: Andre Kassis,Urs Hengartner,Yaoliang Yu
关键词-EN: Diffusion-based purification, amassing popularity, ability to protect, attack-oblivious manner, manner and resistance
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Diffusion-based purification (DBP) is a defense against adversarial examples (AEs), amassing popularity for its ability to protect classifiers in an attack-oblivious manner and resistance to strong adversaries with access to the defense. Its robustness has been claimed to ensue from the reliance on diffusion models (DMs) that project the AEs onto the natural distribution. We revisit this claim, focusing on gradient-based strategies that back-propagate the loss gradients through the defense, commonly referred to as "adaptive attacks". Analytically, we show that such an optimization method invalidates DBP's core foundations, effectively targeting the DM rather than the classifier and restricting the purified outputs to a distribution over malicious samples instead. Thus, we reassess the reported empirical robustness, uncovering implementation flaws in the gradient back-propagation techniques used thus far for DBP. We fix these issues, providing the first reliable gradient library for DBP and demonstrating how adaptive attacks drastically degrade its robustness. We then study a less efficient yet stricter majority-vote setting where the classifier evaluates multiple purified copies of the input to make its decision. Here, DBP's stochasticity enables it to remain partially robust against traditional norm-bounded AEs. We propose a novel adaptation of a recent optimization method against deepfake watermarking that crafts systemic malicious perturbations while ensuring imperceptibility. When integrated with the adaptive attack, it completely defeats DBP, even in the majority-vote setup. Our findings prove that DBP, in its current state, is not a viable defense against AEs.
zh
[CV-10] Rethinking Diffusion for Text-Driven Human Motion Generation
【速读】: 该论文试图解决基于向量量化(Vector Quantization, VQ)的离散生成方法在人体运动生成中存在的信息损失、多样性降低和作为运动先验或生成指导的局限性问题。解决方案的关键在于结合扩散模型(diffusion-based methods)的连续空间生成特性,通过引入双向掩码自回归机制,优化数据表示和分布,从而提升模型的生成能力和多样性。此外,论文还提出了更稳健的评估方法,以公平地比较不同生成方法的性能。
链接: https://arxiv.org/abs/2411.16575
作者: Zichong Meng,Yiming Xie,Xiaogang Peng,Zeyu Han,Huaizu Jiang
关键词-EN: Vector Quantization, primarily surpassing diffusion-based, rapidly dominated human, standard performance metrics, primarily surpassing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
点击查看摘要
Abstract:Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous-space generation nature of diffusion-based methods makes them well-suited to address these limitations, with even greater potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model able to perform bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we propose more robust evaluation methods to fairly assess methods of both paradigms. Extensive experiments on benchmark human motion generation datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art performance.
zh
[CV-11] J-CaPA: Joint Channel and Pyramid Attention Improves Medical Image Segmentation
【速读】: 该论文试图解决传统基于卷积神经网络 (CNN) 的医学图像分割模型(如 U-Net)在捕捉长距离依赖和全局上下文方面的局限性。解决方案的关键在于提出了一种基于Transformer的架构,该架构联合应用了通道注意力 (Channel Attention) 和金字塔注意力 (Pyramid Attention) 机制,以增强多尺度特征提取和分割性能。此外,通过CutMix数据增强技术提高了模型的泛化能力,从而在Synapse多器官分割数据集上实现了显著的性能提升,包括6.9%的平均Dice系数提升和39.9%的Hausdorff距离 (HD95) 提升。
链接: https://arxiv.org/abs/2411.16568
作者: Marzia Binta Nizam,Marian Zlateva,James Davis
关键词-EN: treatment planning, crucial for diagnosis, diagnosis and treatment, Medical image segmentation, applies Channel Attention
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical image segmentation is crucial for diagnosis and treatment planning. Traditional CNN-based models, like U-Net, have shown promising results but struggle to capture long-range dependencies and global context. To address these limitations, we propose a transformer-based architecture that jointly applies Channel Attention and Pyramid Attention mechanisms to improve multi-scale feature extraction and enhance segmentation performance for medical images. Increasing model complexity requires more training data, and we further improve model generalization with CutMix data augmentation. Our approach is evaluated on the Synapse multi-organ segmentation dataset, achieving a 6.9% improvement in Mean Dice score and a 39.9% improvement in Hausdorff Distance (HD95) over an implementation without our enhancements. Our proposed model demonstrates improved segmentation accuracy for complex anatomical structures, outperforming existing state-of-the-art methods.
zh
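论文中联合使用的通道注意力可以用经典的 SE 风格模块来示意:对每个通道做全局平均池化,再经两层全连接得到通道权重并逐通道缩放特征图(编者草图,金字塔注意力与 transformer 主干从略,非论文官方实现):

```python
import torch
import torch.nn as nn

# 编者示意: SE 风格的通道注意力模块。
class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))           # (b, c): 每个通道一个权重
        return x * w.view(b, c, 1, 1)             # 逐通道重新加权

x = torch.randn(2, 64, 32, 32)                    # 模拟一张医学图像的特征图
print(ChannelAttention(64)(x).shape)              # torch.Size([2, 64, 32, 32])
```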
[CV-12] Generating Out-Of-Distribution Scenarios Using Language Models
【速读】: 该论文试图解决自动驾驶车辆在面对分布外(Out-Of-Distribution, OOD)驾驶场景时的安全性和可靠性问题。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的零样本泛化能力和常识推理能力,构建一个生成多样化OOD驾驶场景的框架。具体来说,论文提出了一种基于LLM的分支树结构,每个分支代表一个独特的OOD场景,并通过CARLA模拟器进行自动化模拟。此外,论文还引入了新的“OOD-ness”指标,用于量化生成场景与典型城市驾驶条件的偏离程度,并探讨了视觉语言模型(Vision-Language Models, VLMs)在解释和安全导航这些模拟OOD场景中的潜力。
链接: https://arxiv.org/abs/2411.16554
作者: Erfan Aasi,Phat Nguyen,Shiva Sreeram,Guy Rosman,Sertac Karaman,Daniela Rus
关键词-EN: machine learning techniques, learning techniques requires, comprehensive safety validation, diverse real-world environments, OOD scenarios
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The deployment of autonomous vehicles controlled by machine learning techniques requires extensive testing in diverse real-world environments, robust handling of edge cases and out-of-distribution scenarios, and comprehensive safety validation to ensure that these systems can navigate safely and effectively under unpredictable conditions. Addressing Out-Of-Distribution (OOD) driving scenarios is essential for enhancing safety, as OOD scenarios help validate the reliability of the models within the vehicle’s autonomy stack. However, generating OOD scenarios is challenging due to their long-tailed distribution and rarity in urban driving dataset. Recently, Large Language Models (LLMs) have shown promise in autonomous driving, particularly for their zero-shot generalization and common-sense reasoning capabilities. In this paper, we leverage these LLM strengths to introduce a framework for generating diverse OOD driving scenarios. Our approach uses LLMs to construct a branching tree, where each branch represents a unique OOD scenario. These scenarios are then simulated in the CARLA simulator using an automated framework that aligns scene augmentation with the corresponding textual descriptions. We evaluate our framework through extensive simulations, and assess its performance via a diversity metric that measures the richness of the scenarios. Additionally, we introduce a new “OOD-ness” metric, which quantifies how much the generated scenarios deviate from typical urban driving conditions. Furthermore, we explore the capacity of modern Vision-Language Models (VLMs) to interpret and safely navigate through the simulated OOD scenarios. Our findings offer valuable insights into the reliability of language models in addressing OOD scenarios within the context of urban driving.
zh
[CV-13] Guarding the Gate: ConceptGuard Battles Concept-Level Backdoors in Concept Bottleneck Models
【速读】: 该论文试图解决概念瓶颈模型 (Concept Bottleneck Models, CBMs) 在面对概念级后门攻击 (concept-level backdoor attacks) 时的安全问题。解决方案的关键是引入了一种名为 ConceptGuard 的新型防御框架,该框架通过多阶段方法来保护 CBMs 免受此类攻击。具体来说,ConceptGuard 利用基于文本距离测量的概念聚类和在不同概念子组上训练的分类器之间的投票机制,来隔离和缓解潜在的触发器。这一解决方案不仅提供了理论上的防御保证,还确保了 CBMs 的高性能和可解释性,从而增强了其在关键应用中的安全性和可信度。
链接: https://arxiv.org/abs/2411.16512
作者: Songning Lai,Yu Huang,Jiayu Yang,Gaoxiang Huang,Wenshuo Chen,Yutao Yue
关键词-EN: Explainable Artificial Intelligence, deep learning, medical diagnostics, undermine trust, Artificial Intelligence
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 4 figures
点击查看摘要
Abstract:The increasing complexity of AI models, especially in deep learning, has raised concerns about transparency and accountability, particularly in high-stakes applications like medical diagnostics, where opaque models can undermine trust. Explainable Artificial Intelligence (XAI) aims to address these issues by providing clear, interpretable models. Among XAI techniques, Concept Bottleneck Models (CBMs) enhance transparency by using high-level semantic concepts. However, CBMs are vulnerable to concept-level backdoor attacks, which inject hidden triggers into these concepts, leading to undetectable anomalous behavior. To address this critical security gap, we introduce ConceptGuard, a novel defense framework specifically designed to protect CBMs from concept-level backdoor attacks. ConceptGuard employs a multi-stage approach, including concept clustering based on text distance measurements and a voting mechanism among classifiers trained on different concept subgroups, to isolate and mitigate potential triggers. Our contributions are threefold: (i) we present ConceptGuard as the first defense mechanism tailored for concept-level backdoor attacks in CBMs; (ii) we provide theoretical guarantees that ConceptGuard can effectively defend against such attacks within a certain trigger size threshold, ensuring robustness; and (iii) we demonstrate that ConceptGuard maintains the high performance and interpretability of CBMs, crucial for trustworthiness. Through comprehensive experiments and theoretical proofs, we show that ConceptGuard significantly enhances the security and trustworthiness of CBMs, paving the way for their secure deployment in critical applications.
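ConceptGuard 的实现细节未在摘要中展开。下面按“概念聚类 + 子组分类器投票”的思路写一个玩具示意(概念嵌入、概念激活、聚类数与分类器均为假设),用于说明被注入触发器的概念只会落入某一个子组、难以左右多数表决。

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
concept_emb = rng.normal(size=(20, 32))    # 假设:20 个概念的文本嵌入(实际来自文本编码器)
concept_act = rng.normal(size=(200, 20))   # 假设:200 个样本在 CBM 概念层的激活
labels = (concept_act[:, :5].sum(1) > 0).astype(int)

# 1) 按文本(嵌入)距离对概念聚类,得到若干概念子组
groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(concept_emb)

# 2) 每个子组只用自己那部分概念训练一个分类器
clfs = []
for g in range(4):
    idx = np.where(groups == g)[0]
    clfs.append((idx, LogisticRegression(max_iter=1000).fit(concept_act[:, idx], labels)))

# 3) 推理时各子组分类器投票,单个被污染的子组难以改变多数结果
def vote_predict(x_act: np.ndarray) -> np.ndarray:
    votes = np.stack([clf.predict(x_act[:, idx]) for idx, clf in clfs], axis=1)
    return (votes.mean(axis=1) >= 0.5).astype(int)

print(vote_predict(concept_act[:5]))
```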
zh
[CV-14] Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
【速读】: 该论文试图解决扩散模型在生成图像时与输入提示的语义对齐问题。解决方案的关键在于利用大型视觉语言模型(LVLMs)的语言理解能力来指导初始噪声潜在变量的优化。具体来说,论文提出了Noise Diffusion过程,通过更新噪声潜在变量来生成语义上忠实的图像,同时保持分布一致性。这一方法不仅在理论上分析了更新过程如何提高语义忠实度,还在实验中证明了其有效性和适应性,能够显著提升各种扩散模型的语义对齐效果。
链接: https://arxiv.org/abs/2411.16503
作者: Boming Miao,Chunxiao Li,Xiaoxiao Wang,Andi Zhang,Rui Sun,Zizhe Wang,Yao Zhu
关键词-EN: achieved impressive success, initial noisy latent, ensuring precise semantic, generating photorealistic images, precise semantic alignment
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A latest approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models. The code is available at this https URL.
zh
[CV-15] Multi-Resolution Generative Modeling of Human Motion from Limited Data
【速读】: 该论文试图解决从有限训练序列中合成人类运动的问题。解决方案的关键在于提出了一个生成式模型,该模型通过结合骨架卷积层和多尺度架构来捕捉人类运动模式。模型包含生成对抗网络和嵌入模块,能够在特定帧率下生成运动,并控制其内容和细节。此外,该模型还能扩展到合成与语音同步的手势,即使数据对有限。通过直接合成SMPL姿态参数,该方法避免了测试时对人类身体网格的调整。实验结果表明,该模型能够广泛覆盖训练样本,并生成多样化的运动。
链接: https://arxiv.org/abs/2411.16498
作者: David Eduardo Moreno-Villamarín,Anna Hilsmann,Peter Eisert
关键词-EN: learns to synthesize, limited training sequences, synthesize human motion, training sequences, synthesize human
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 10 pages, 7 figures, published in European Conference on Visual Media Production CVMP 24
点击查看摘要
Abstract:We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model’s ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.
zh
[CV-16] Deformable Mamba for Wide Field of View Segmentation
【速读】: 该论文试图解决广角相机(如鱼眼和全景相机)在180°和360°图像中引入的显著畸变问题,这些畸变使得密集预测任务(如全景语义分割)变得复杂。解决方案的关键在于提出了一个名为Deformable Mamba的统一框架,该框架专门设计用于处理全景和鱼眼图像中的畸变。其核心是一个由一系列Deformable Mamba Fusion (DMF)模块构建的解码器,使得整个框架在处理极端畸变时更具变形性、高效性和准确性。通过在五个数据集上的广泛评估,该方法相较于之前针对特定视场角(FoV)的最先进方法,在分割精度上实现了持续提升,特别是在360° Stanford2D3D数据集上取得了+2.5%的性能提升,并且在60°到360°的视场角范围内均表现出色。
链接: https://arxiv.org/abs/2411.16481
作者: Jie Hu,Junwei Zheng,Jiale Wei,Jiaming Zhang,Rainer Stiefelhagen
关键词-EN: dense prediction tasks, complicating dense prediction, Wide-FoV cameras, introduce significant distortions, complicating dense
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Models and code will be made publicly available at: this https URL
点击查看摘要
Abstract:Wide-FoV cameras, like fisheye and panoramic setups, are essential for broader perception but introduce significant distortions in 180° and 360° images, complicating dense prediction tasks. For instance, existing MAMBA models lacking distortion-aware capacity cannot perform well in panoramic semantic segmentation. To address this problem, this work presents Deformable Mamba, a unified framework specifically designed to address imaging distortions within the context of panoramic and fisheye semantic segmentation. At the core is a decoder constructed with a series of Deformable Mamba Fusion (DMF) blocks, making the whole framework more deformable, efficient, and accurate, when handling extreme distortions. Extensive evaluations across five datasets demonstrate that our method consistently improves segmentation accuracy compared to the previous state-of-the-art methods tailored for specific FoVs. Notably, Deformable Mamba achieves a +2.5% performance improvement on the 360° Stanford2D3D dataset, and shows better results across FoVs from 60° to 360°.
zh
[CV-17] Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency
【速读】: 该论文试图解决人脸视频(电影、访谈、直播等场景中极为常见的一类视频)因高压缩比导致的模糊和量化噪声等降质问题,这类降质由于人眼对面部细节高度敏感而影响尤为严重。解决方案的关键在于提出了一种新颖且高效的盲视频人脸增强方法,该方法基于3D-VQGAN(3D Vector Quantized Generative Adversarial Network)骨干网络,结合了记录高质量肖像特征和基于残差的时间信息的空间-时间码本。论文通过两阶段学习框架来训练模型,第一阶段通过正则化器缓解码本崩溃问题,第二阶段则利用两个Transformer从码本中查找代码并进一步更新低质量视频的编码器。实验结果表明,该方法在效率和效果上均优于当前最先进的盲人脸视频恢复和去闪烁方法。
链接: https://arxiv.org/abs/2411.16468
作者: Yutong Wang,Jiajie Teng,Jiajiong Cao,Yuming Li,Chenguang Ma,Hongteng Xu,Dixin Luo
关键词-EN: talk shows, live broadcasts, common type, video face enhancement, face
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from i) long processing time and ii) inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method develops upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage I, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage II, we learn two transformers to lookup code from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods on both efficiency and effectiveness. Code is available at this https URL.
zh
[CV-18] No Identity no problem: Motion through detection for people tracking
【速读】: 该论文试图解决在行人追踪中,依赖于检测和重识别的传统方法需要大量身份标注的问题。解决方案的关键在于利用运动线索,并通过仅对检测结果进行监督来提供所需的监督信号,而不需要任何运动标注。具体来说,算法预测两个不同时间点的检测热图,并估计这两幅图像之间的2D运动偏移。然后,使用运动估计对其中一个热图进行变形,并强制其与另一个热图保持一致。这种方法在训练过程中耦合了不同图像的信息,从而提高了追踪精度,特别是在拥挤场景和低帧率序列中。
链接: https://arxiv.org/abs/2411.16466
作者: Martin Engilberge,F. Wilke Grosche,Pascal Fua
关键词-EN: facto standard approach, facto standard, motion, regressing motion offset, standard approach
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted in TMLR November 2024
点击查看摘要
Abstract:Tracking-by-detection has become the de facto standard approach to people tracking. To increase robustness, some approaches incorporate re-identification using appearance models and regressing motion offset, which requires costly identity annotations. In this paper, we propose exploiting motion clues while providing supervision only for the detections, which is much easier to do. Our algorithm predicts detection heatmaps at two different times, along with a 2D motion estimate between the two images. It then warps one heatmap using the motion estimate and enforces consistency with the other one. This provides the required supervisory signal on the motion without the need for any motion annotations. In this manner, we couple the information obtained from different images during training and increase accuracy, especially in crowded scenes and when using low frame-rate sequences. We show that our approach delivers state-of-the-art results for single- and multi-view multi-target tracking on the MOT17 and WILDTRACK datasets.
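下面用 PyTorch 的 grid_sample 给出“用 2D 运动估计变形一个检测热图,并与另一时刻的热图做一致性损失”的最小示意;这是概念性草图而非论文官方实现,热图与位移均为随机玩具数据。

```python
import torch
import torch.nn.functional as F

def warp_by_motion(heatmap: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """按逐像素 2D 位移 flow (B, 2, H, W,单位为像素) 对 heatmap (B, 1, H, W) 采样变形。"""
    b, _, h, w = heatmap.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(heatmap)   # (2, H, W) 基础网格
    coords = base.unsqueeze(0) + flow                         # 加上预测的位移
    gx = 2 * coords[:, 0] / (w - 1) - 1                       # 归一化到 grid_sample 的 [-1, 1]
    gy = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)                      # (B, H, W, 2),最后一维为 (x, y)
    return F.grid_sample(heatmap, grid, align_corners=True)

heat_t0 = torch.rand(1, 1, 64, 64)     # t0 时刻预测的检测热图(玩具数据)
heat_t1 = torch.rand(1, 1, 64, 64)     # t1 时刻预测的检测热图
flow = torch.zeros(1, 2, 64, 64)       # 网络预测的两帧间位移(此处取 0 仅作演示)
consistency = F.mse_loss(warp_by_motion(heat_t0, flow), heat_t1)   # 运动上的一致性监督
print(consistency.item())
```

这样,监督信号只需要检测标注,运动估计完全由热图间的一致性约束隐式学习,对应上文“不需要任何运动标注”的设计。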
zh
[CV-19] VQ-SGen: A Vector Quantized Stroke Representation for Sketch Generation
【速读】: 该论文试图解决现有草图生成方法在处理单个笔画之间的内在和上下文关系时存在的不足,特别是忽视了笔画的形状和空间位置关系。解决方案的关键在于提出了一种新的算法VQ-SGen,该算法通过将每个笔画视为一个实体,并引入向量量化(VQ)笔画表示,以实现细粒度的草图生成。具体来说,VQ-SGen采用两阶段框架:第一阶段将每个笔画的形状和位置信息解耦,确保VQ表示优先学习笔画形状;第二阶段将精确且紧凑的表示输入到自解码Transformer中,以整合笔画的语义、位置和形状信息。这种方法不仅提高了生成笔画的保真度,还促进了条件生成和语义感知笔画编辑等新应用。
链接: https://arxiv.org/abs/2411.16446
作者: Jiawei Wang,Zhiming Cui,Changjian Li
关键词-EN: paper presents VQ-SGen, high-quality sketch generation, presents VQ-SGen, paper presents, algorithm for high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:This paper presents VQ-SGen, a novel algorithm for high-quality sketch generation. Recent approaches have often framed the task as pixel-based generation either as a whole or part-by-part, neglecting the intrinsic and contextual relationships among individual strokes, such as the shape and spatial positioning of both proximal and distant strokes. To overcome these limitations, we propose treating each stroke within a sketch as an entity and introducing a vector-quantized (VQ) stroke representation for fine-grained sketch generation. Our method follows a two-stage framework - in the first stage, we decouple each stroke’s shape and location information to ensure the VQ representation prioritizes stroke shape learning. In the second stage, we feed the precise and compact representation into an auto-decoding Transformer to incorporate stroke semantics, positions, and shapes into the generation process. By utilizing tokenized stroke representation, our approach generates strokes with high fidelity and facilitates novel applications, such as conditional generation and semantic-aware stroke editing. Comprehensive experiments demonstrate our method surpasses existing state-of-the-art techniques, underscoring its effectiveness. The code and model will be made publicly available upon publication.
zh
[CV-20] SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
【速读】: 该论文试图解决现有方法在3D场景生成和编辑中缺乏统一框架的问题,特别是针对3D高斯溅射(3D Gaussian Splatting, 3DGS)的高保真和实时渲染需求。解决方案的关键在于引入了一个名为SplatFlow的综合框架,该框架包含两个主要组件:多视图矫正流(Multi-view Rectified Flow, RF)模型和高斯溅射解码器(Gaussian Splatting Decoder, GSDecoder)。多视图RF模型在潜在空间中操作,能够根据文本提示同时生成多视图图像、深度和相机姿态,解决了现实场景中多样的场景尺度和复杂的相机轨迹问题。随后,GSDecoder通过前馈3DGS方法将这些潜在输出高效地转换为3DGS表示。此外,SplatFlow利用无训练的反转和修复技术,实现了无缝的3DGS编辑,并在一个统一的框架内支持多种3D任务,如对象编辑、新视图合成和相机姿态估计,无需额外的复杂流程。
链接: https://arxiv.org/abs/2411.16443
作者: Hyojun Go,Byeongjun Park,Jiho Jang,Jin-Young Kim,Soonwoo Kwon,Changick Kim
关键词-EN: intuitive user interactions, hold significant potential, streamlining content creation, scenes hold significant, Gaussian Splatting Decoder
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks-including object editing, novel view synthesis, and camera pose estimation-within a unified framework without requiring additional complex pipelines. We validate SplatFlow’s capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
zh
[CV-21] AnonyNoise: Anonymizing Event Data with Smart Noise to Outsmart Re-Identification and Preserve Privacy WACV25
【速读】: 该论文试图解决深度神经网络在重识别(re-identification)方面的日益增强的能力与近年来公共监控增加对个人隐私构成的威胁之间的矛盾。解决方案的关键在于提出了一种事件相机数据匿名化流程,该流程不仅能够防止人类对事件相机输出数据的解读,还能有效阻止神经网络的重识别。具体来说,论文的方法通过引入可学习的数据依赖性噪声(learnable data-dependent noise)来掩盖原始事件数据中的个人识别信息,从而将攻击者的重识别能力降低高达60%,同时仍保留了执行下游任务所需的大量信息。此外,该匿名化方法在未见数据上具有良好的泛化能力,并且对图像重建和反演攻击具有鲁棒性。
链接: https://arxiv.org/abs/2411.16440
作者: Katharina Bendig,René Schuster,Nicole Thiemer,Karen Joisten,Didier Stricker
关键词-EN: rise in public, public surveillance, neural networks, recent years, deep neural networks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV25
点击查看摘要
Abstract:The increasing capabilities of deep neural networks for re-identification, combined with the rise in public surveillance in recent years, pose a substantial threat to individual privacy. Event cameras were initially considered as a promising solution since their output is sparse and therefore difficult for humans to interpret. However, recent advances in deep learning prove that neural networks are able to reconstruct high-quality grayscale images and re-identify individuals using data from event cameras. In our paper, we contribute a crucial ethical discussion on data privacy and present the first event anonymization pipeline to prevent re-identification not only by humans but also by neural networks. Our method effectively introduces learnable data-dependent noise to cover personally identifiable information in raw event data, reducing attackers’ re-identification capabilities by up to 60%, while maintaining substantial information for performing downstream tasks. Moreover, our anonymization generalizes well on unseen data and is robust against image reconstruction and inversion attacks. Code: this https URL
zh
[CV-22] Harnessing Superclasses for Learning from Hierarchical Databases
【速读】: 该论文试图解决在大规模分类问题中,类别之间存在已知层次结构(hierarchy)时,如何有效进行监督层次分类的问题。解决方案的关键在于引入了一种新的损失函数,该损失函数利用层次结构的知识,不仅将每个样本分配到一个具体的类别,还分配到所有包含该类别的超类(superclasses)。这种损失函数适用于任何带有softmax输出层的神经网络架构,并且是一个适当的评分规则(proper scoring rule),其期望值由真实的后验类别概率最小化。这一特性使得我们能够在超类和细粒度类别之间同时追求一致的分类目标,消除了不同粒度之间性能权衡的需要。实验结果表明,该方法在不显著增加计算成本的情况下,提高了分类准确性并减少了粗粒度错误,特别是在预测标签与真实标签在层次树中距离较远的情况下。
链接: https://arxiv.org/abs/2411.16438
作者: Nicolas Urbani(Heudiasyc),Sylvain Rousseau(Heudiasyc),Yves Grandvalet(Heudiasyc),Leonardo Tanzi(Polito)
关键词-EN: large-scale classification problems, typically represented, expressing the inclusion, classification problems, large-scale classification
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:In many large-scale classification problems, classes are organized in a known hierarchy, typically represented as a tree expressing the inclusion of classes in superclasses. We introduce a loss for this type of supervised hierarchical classification. It utilizes the knowledge of the hierarchy to assign each example not only to a class but also to all encompassing superclasses. Applicable to any feedforward architecture with a softmax output layer, this loss is a proper scoring rule, in that its expectation is minimized by the true posterior class probabilities. This property allows us to simultaneously pursue consistent classification objectives between superclasses and fine-grained classes, and eliminates the need for a performance trade-off between different granularities. We conduct an experimental study on three reference benchmarks, in which we vary the size of the training sets to cover a diverse set of learning scenarios. Our approach does not entail any significant additional computational cost compared with the loss of cross-entropy. It improves accuracy and reduces the number of coarse errors, with predicted labels that are distant from ground-truth labels in the tree.
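摘要描述的损失把每个样本同时分配到叶子类及其所有超类,并且是一个适当评分规则。下面给出一种与该描述相符的可能实现示意:超类概率取其下所有叶子概率之和,对叶子与超类两级同时取负对数似然。层次结构是假设的玩具示例,具体形式未必与论文公式完全一致。

```python
import torch
import torch.nn.functional as F

# 假设的两层层次:6 个叶子类、2 个超类;ancestor[k, s] = 1 表示叶子 k 属于超类 s
ancestor = torch.tensor([[1, 0], [1, 0], [1, 0],
                         [0, 1], [0, 1], [0, 1]], dtype=torch.float32)

def hierarchical_nll(logits: torch.Tensor, leaf_target: torch.Tensor) -> torch.Tensor:
    """logits: (B, K) 叶子类 logits;leaf_target: (B,) 叶子标签。"""
    p_leaf = F.softmax(logits, dim=1)                      # (B, K) 叶子概率
    p_super = p_leaf @ ancestor                            # (B, S) 超类概率 = 叶子概率求和
    super_target = ancestor[leaf_target].argmax(dim=1)     # 每个叶子对应的超类标签
    nll_leaf = F.nll_loss(torch.log(p_leaf + 1e-12), leaf_target)
    nll_super = F.nll_loss(torch.log(p_super + 1e-12), super_target)
    return nll_leaf + nll_super                            # 同时惩罚两级错误

logits = torch.randn(4, 6, requires_grad=True)
target = torch.tensor([0, 2, 3, 5])
print(hierarchical_nll(logits, target))
```

由于超类概率是叶子后验的线性聚合,这类损失在叶子级与超类级的目标天然一致,不需要在不同粒度之间做性能权衡。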
zh
[CV-23] Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack NEURIPS2024
【速读】: 该论文试图解决个性化文本到图像(T2I)扩散模型在隐私保护方面的挑战,特别是在防止模型被恶意使用时可能导致的隐私泄露问题。解决方案的关键在于提出了一种新颖且高效的对抗攻击方法,称为概念保护通过选择性注意力操纵(Concept Protection by Selective Attention Manipulation, CoPSAM)。该方法通过仅针对T2I扩散模型的交叉注意力层,精心构建一种不可察觉的噪声,将其添加到干净样本中以生成对抗样本。这一过程在微调阶段通过最大化用户特定令牌和类别特定令牌对应的交叉注意力图之间的差异来实现。实验验证表明,该方法在保护个体身份免受潜在滥用方面优于现有方法,并且在较低噪声水平下提供了更好的保护效果。
链接: https://arxiv.org/abs/2411.16437
作者: Xide Xu,Muhammad Atif Butt,Sandesh Kamath,Bogdan Raducanu
关键词-EN: customized visual content, Selective Attention Manipulation, rise of personalized, growing demand, demand for customized
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at Safe Generative AI Workshop (NeurIPS 2024)
点击查看摘要
Abstract:The growing demand for customized visual content has led to the rise of personalized text-to-image (T2I) diffusion models. Despite their remarkable potential, they pose significant privacy risk when misused for malicious purposes. In this paper, we propose a novel and efficient adversarial attack method, Concept Protection by Selective Attention Manipulation (CoPSAM) which targets only the cross-attention layers of a T2I diffusion model. For this purpose, we carefully construct an imperceptible noise to be added to clean samples to get their adversarial counterparts. This is obtained during the fine-tuning process by maximizing the discrepancy between the corresponding cross-attention maps of the user-specific token and the class-specific token, respectively. Experimental validation on a subset of CelebA-HQ face images dataset demonstrates that our approach outperforms existing methods. Besides this, our method presents two important advantages derived from the qualitative evaluation: (i) we obtain better protection results for lower noise levels than our competitors; and (ii) we protect the content from unauthorized use thereby protecting the individual’s identity from potential misuse.
zh
[CV-24] TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation
【速读】: 该论文试图解决零样本目标导航 (Zero-Shot Object Navigation, ZSON) 任务中,现有基于大型语言模型 (LLM) 的方法在将视觉观察转换为语言描述时丢失空间信息的问题。解决方案的关键在于引入了一种基于多模态大型语言模型 (MLLM) 的方法,称为 TopV-Nav,该方法直接在具有完整空间信息的顶视图地图上进行推理。具体来说,论文提出了自适应视觉提示生成 (Adaptive Visual Prompt Generation, AVPG) 方法,用于自适应构建语义丰富的顶视图地图,使代理能够直接利用顶视图地图中的空间信息进行深入推理。此外,设计了动态地图缩放 (Dynamic Map Scaling, DMS) 机制,以动态调整顶视图地图的缩放比例,增强局部细粒度推理能力。同时,提出了目标引导导航 (Target-Guided Navigation, TGN) 机制,用于预测和利用目标位置,促进全局和类人探索。实验结果表明,TopV-Nav 在 MP3D 和 HM3D 基准测试中显著优于现有方法。
链接: https://arxiv.org/abs/2411.16425
作者: Linqing Zhong,Chen Gao,Zihan Ding,Yue Liao,Si Liu
关键词-EN: previously unseen object, Zero-Shot Object Navigation, task requires embodied, Zero-Shot Object, unseen object
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 10 pages
点击查看摘要
Abstract:The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such a goal-oriented exploration heavily relies on the ability to perceive, understand, and reason based on the spatial information of the environment. However, current LLM-based approaches convert visual observations to language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, a MLLM-based method that directly reasons on the top-view map with complete spatial information. To fully unlock the MLLM’s spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct semantically-rich top-view map. It enables the agent to directly utilize spatial information contained in the top-view map to conduct thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom top-view map at preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Target-Guided Navigation (TGN) mechanism to predict and to utilize target locations, facilitating global and human-like exploration. Experiments on MP3D and HM3D benchmarks demonstrate the superiority of our TopV-Nav, e.g., +3.9% SR and +2.0% SPL absolute improvements on HM3D.
zh
[CV-25] Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks
【速读】: 该论文试图解决的问题是如何利用长时间序列的空间-时间数据来提升机器学习模型在台风预测任务中的性能。解决方案的关键在于引入数字台风数据集V2 (Digital Typhoon Dataset V2),该数据集不仅包含北半球的台风数据,还新增了南半球的台风数据,从而能够研究跨区域和跨半球的差异。论文提出了新的任务,如台风中心估计任务,并探讨了自监督学习框架与长短期记忆网络 (LSTM) 结合在强度预测和热带气旋向温带气旋转变预测任务中的表现。此外,论文还研究了模型在不同半球数据上的泛化能力,通过在北半球数据上训练模型并在南半球数据上测试,评估模型的跨区域适应性。
链接: https://arxiv.org/abs/2411.16421
作者: Asanobu Kitamoto,Erwan Dzik,Gaspar Faure
关键词-EN: Digital Typhoon Dataset, satellite image dataset, presents the Digital, long-term spatio-temporal data, longest typhoon satellite
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper presents the Digital Typhoon Dataset V2, a new version of the longest typhoon satellite image dataset for 40+ years aimed at benchmarking machine learning models for long-term spatio-temporal data. The new addition in Dataset V2 is tropical cyclone data from the southern hemisphere, in addition to the northern hemisphere data in Dataset V1. Having data from two hemispheres allows us to ask new research questions about regional differences across basins and hemispheres. We also discuss new developments in representations and tasks of the dataset. We first introduce a self-supervised learning framework for representation learning. Combined with the LSTM model, we discuss performance on intensity forecasting and extra-tropical transition forecasting tasks. We then propose new tasks, such as the typhoon center estimation task. We show that an object detection-based model performs better for stronger typhoons. Finally, we study how machine learning models can generalize across basins and hemispheres, by training the model on the northern hemisphere data and testing it on the southern hemisphere data. The dataset is publicly available at this http URL and this https URL.
zh
[CV-26] Low-Data Classification of Historical Music Manuscripts: A Few-Shot Learning Approach
【速读】: 该论文试图解决历史手稿中音乐符号分类的问题,特别是在缺乏标注数据的情况下。解决方案的关键在于开发了一个自监督学习框架,通过在未标注数据上训练神经网络特征提取器,从而实现有效的分类。具体方法包括优化裁剪预处理步骤以适应自监督卷积神经网络,并评估了多种分类方法,如支持向量机(SVM)、多层感知器和原型网络。实验结果显示,该方法在音乐符号分类任务中达到了87.66%的准确率,展示了AI驱动的技术在历史音乐数字化存档中的潜力。
链接: https://arxiv.org/abs/2411.16408
作者: Elona Shatri,Daniel Raymond,George Fazekas
关键词-EN: self-supervised learning framework, explore the intersection, intersection of technology, technology and cultural, cultural preservation
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, The Sixth IEEE international conference on Image Processing Applications and Systems
点击查看摘要
Abstract:In this paper, we explore the intersection of technology and cultural preservation by developing a self-supervised learning framework for the classification of musical symbols in historical manuscripts. Optical Music Recognition (OMR) plays a vital role in digitising and preserving musical heritage, but historical documents often lack the labelled data required by traditional methods. We overcome this challenge by training a neural-based feature extractor on unlabelled data, enabling effective classification with minimal samples. Key contributions include optimising crop preprocessing for a self-supervised Convolutional Neural Network and evaluating classification methods, including SVM, multilayer perceptrons, and prototypical networks. Our experiments yield an accuracy of 87.66%, showcasing the potential of AI-driven methods to ensure the survival of historical music for future generations through advanced digital archiving techniques.
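原型网络是文中评估的少样本分类方法之一,其核心可以用几行代码说明:原型取每类支持样本嵌入的均值,查询样本按最近原型归类。嵌入维度、类别数等均为假设值,特征此处用随机向量代替自监督提取器的输出。

```python
import torch

def proto_classify(support: torch.Tensor, support_y: torch.Tensor,
                   query: torch.Tensor, n_classes: int) -> torch.Tensor:
    """原型 = 每类支持样本嵌入的均值;查询样本分到欧氏距离最近的原型。"""
    protos = torch.stack([support[support_y == c].mean(dim=0) for c in range(n_classes)])
    dists = torch.cdist(query, protos)          # (n_query, n_classes)
    return dists.argmin(dim=1)

emb_dim, n_way, k_shot = 64, 10, 5              # 假设:10 类、每类 5 个标注样本
support = torch.randn(n_way * k_shot, emb_dim)  # 自监督特征提取器的输出(此处随机代替)
support_y = torch.arange(n_way).repeat_interleave(k_shot)
query = torch.randn(8, emb_dim)
print(proto_classify(support, support_y, query, n_way))
```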
zh
[CV-27] A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models BMVC
【速读】: 该论文试图解决基于深度学习的计算机视觉中存在的领域偏移问题,特别是在自动驾驶场景中的语义分割任务。解决方案的关键在于利用预训练的视觉-语言模型(vision-language pre-trained models)替换传统的基于ImageNet预训练的编码器(encoder),从而显著提升无监督领域自适应(Unsupervised Domain Adaptation, UDA)方法在目标域上的性能。具体来说,通过将现有UDA方法如DACS的编码器替换为视觉-语言预训练编码器,可以在GTA5到Cityscapes的领域偏移上实现高达10.0%的mIoU提升,并且在未见过的领域上也能获得高达13.7%的mIoU增益。然而,论文也指出并非所有UDA方法都能轻易与新编码器结合,且UDA性能的提升并不总是能转化为泛化性能的提升。
链接: https://arxiv.org/abs/2411.16407
作者: Manuel Schwonberg,Claus Werner,Hanno Gottschalk,Carsten Meyer
关键词-EN: based computer vision, deep learning based, learning based computer, UDA methods, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to British Machine Vision Conference (BMVC) 2024: Workshop on Robust Recognition in the Open World (RROW)
点击查看摘要
Abstract:Despite the recent progress in deep learning based computer vision, domain shifts are still one of the major challenges. Semantic segmentation for autonomous driving faces a wide range of domain shifts, e.g. caused by changing weather conditions, new geolocations and the frequent use of synthetic data in model training. Unsupervised domain adaptation (UDA) methods have emerged which adapt a model to a new target domain by only using unlabeled data of that domain. The variety of UDA methods is large but all of them use ImageNet pre-trained models. Recently, vision-language models have demonstrated strong generalization capabilities which may facilitate domain adaptation. We show that simply replacing the encoder of existing UDA methods like DACS by a vision-language pre-trained encoder can result in significant performance improvements of up to 10.0% mIoU on the GTA5-to-Cityscapes domain shift. For the generalization performance to unseen domains, the newly employed vision-language pre-trained encoder provides a gain of up to 13.7% mIoU across three unseen datasets. However, we find that not all UDA methods can be easily paired with the new encoder and that the UDA performance does not always likewise transfer into generalization performance. Finally, we perform our experiments on an adverse weather condition domain shift to further verify our findings on a pure real-to-real domain shift.
zh
[CV-28] Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN ProGAN and DCGAN
【速读】: 该论文试图解决手写乐谱生成中的数据稀缺问题,以提升光学音乐识别系统 (Optical Music Recognition, OMR) 的性能。解决方案的关键在于应用生成对抗网络 (Generative Adversarial Networks, GANs) 来合成逼真的手写乐谱图像。论文通过对比三种GAN模型——DCGAN、ProGAN和CycleWGAN,发现CycleWGAN在风格迁移和训练稳定性方面表现优异,显著优于其他模型,其FID得分为41.87,IS得分为2.29,KID得分为0.05,显示出其在提升OMR系统中的潜力。
链接: https://arxiv.org/abs/2411.16405
作者: Elona Shatri,Kalikidhar Palavala,George Fazekas
关键词-EN: Optical Music Recognition, enhancing Optical Music, handwritten music sheets, enhancing Optical, Music Recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 10 pages, one page of references, to appear at the IEEE Big Data 2024 2nd Workshop on AI Music Generation (AIMG 2024)
点击查看摘要
Abstract:The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.
zh
[CV-29] Quadratic Gaussian Splatting for Efficient and Detailed Surface Reconstruction
【速读】: 该论文试图解决3D高斯喷射(3D Gaussian Splatting, 3DGS)在表面表示上的局限性,特别是2D高斯喷射(2D Gaussian Splatting, 2DGS)中使用圆盘作为场景基元导致的几何过度平滑问题。解决方案的关键在于提出了一种新的二次高斯喷射(Quadratic Gaussian Splatting, QGS)方法,通过用二次曲面替代圆盘,增强了几何拟合能力。QGS在非欧几里得空间中定义高斯分布,使基元能够捕捉更复杂的纹理,并通过二次曲面近似来渲染空间曲率,从而引导法线一致性项,有效减少过度平滑。实验结果表明,QGS在几何重建方面超越了当前最先进的方法。
链接: https://arxiv.org/abs/2411.16392
作者: Ziyu Zhang,Binbin Huang,Hanqing Jiang,Liyang Zhou,Xiaojun Xiang,Shunhan Shen
关键词-EN: Neural Radiance Fields, Radiance Fields, Neural Radiance, Gaussian Splatting, superior rendering quality
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, 3D Gaussian Splatting (3DGS) has attracted attention for its superior rendering quality and speed over Neural Radiance Fields (NeRF). To address 3DGS’s limitations in surface representation, 2D Gaussian Splatting (2DGS) introduced disks as scene primitives to model and reconstruct geometries from multi-view images, offering view-consistent geometry. However, the disk’s first-order linear approximation often leads to over-smoothed results. We propose Quadratic Gaussian Splatting (QGS), a novel method that replaces disks with quadric surfaces, enhancing geometric fitting, whose code will be open-sourced. QGS defines Gaussian distributions in non-Euclidean space, allowing primitives to capture more complex textures. As a second-order surface approximation, QGS also renders spatial curvature to guide the normal consistency term, to effectively reduce over-smoothing. Moreover, QGS is a generalized version of 2DGS that achieves more accurate and detailed reconstructions, as verified by experiments on DTU and TNT, demonstrating its effectiveness in surpassing current state-of-the-art methods in geometry reconstruction. Our code will be released as open source.
zh
[CV-30] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing
【速读】: 该论文试图解决现有视频扩散模型(VDMs)在生成长视频时存在的计算效率低下和冗余问题。现有自回归VDMs在生成后续片段时,需要重新计算与前一片段重叠的条件帧,导致计算量随自回归步数的增加呈二次方增长。论文提出的解决方案是Ca2-VDM,其关键在于引入因果生成(Causal generation)和缓存共享(Cache sharing)机制。因果生成通过单向特征计算,确保在前序自回归步骤中预计算的条件帧缓存可以在后续步骤中重复使用,从而消除冗余计算。缓存共享则通过在所有去噪步骤中共享缓存,避免了巨大的缓存存储成本。实验结果表明,Ca2-VDM在视频生成质量和速度上均达到了最先进水平。
链接: https://arxiv.org/abs/2411.16375
作者: Kaifeng Gao,Jiaxin Shi,Hanwang Zhang,Chunping Wang,Jun Xiao,Long Chen
关键词-EN: achieved impressive quality, today video generation, video diffusion models, impressive quality, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report. Code is available at this https URL
点击查看摘要
Abstract:With the advance of diffusion models, today’s video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at this https URL
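Ca2-VDM 的关键工程思路是:因果化之后,条件帧的特征只需计算一次,后续自回归步(以及所有去噪步)直接复用缓存。下面是一个与具体扩散模型无关的玩具示意(编码器用一个卷积层代替,缓存机制也大为简化),仅说明“缓存命中即跳过重复计算”这一点,并非论文实现。

```python
import torch

class FrameFeatureCache:
    """示意:逐帧条件特征只算一次,跨自回归步复用。"""
    def __init__(self, encoder):
        self.encoder = encoder
        self.cache = {}                           # frame_idx -> feature

    def get(self, frame_idx: int, frame: torch.Tensor) -> torch.Tensor:
        if frame_idx not in self.cache:           # 未命中才真正前向计算
            self.cache[frame_idx] = self.encoder(frame).detach()
        return self.cache[frame_idx]

encoder = torch.nn.Conv2d(3, 8, 3, padding=1)     # 用一个卷积假装是帧编码器
cache = FrameFeatureCache(encoder)
video = torch.randn(16, 3, 32, 32)                # 16 帧的玩具“视频”

for step in range(1, 4):                          # 自回归生成:每步以前面的帧为条件
    cond = [cache.get(i, video[i:i + 1]) for i in range(step * 4)]
    print(f"step {step}: 条件帧特征 {len(cond)} 个(已缓存的直接复用)")
```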
zh
[CV-31] A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation
【速读】: 该论文试图解决图像分割领域中不确定性量化的问题,特别是在高风险应用中确保算法可靠性的挑战。解决方案的关键在于区分和量化两种不确定性:认知不确定性(epistemic uncertainty)和偶然不确定性(aleatoric uncertainty)。认知不确定性涉及模型参数的不确定性,而偶然不确定性涉及数据本身的不确定性。通过近似贝叶斯推断,分别对潜在变量或模型参数进行不确定性量化,可以有效提升模型的鲁棒性和决策的可靠性。论文还探讨了这些不确定性在四个关键应用中的作用,包括量化标注过程中的统计不一致性、关联预测误差与不确定性、扩展模型假设空间以提高泛化能力,以及在主动学习中的应用。
链接: https://arxiv.org/abs/2411.16370
作者: M.M.A. Valiuddin,R.J.G. van Sloun,C.G.A. Viviers,P.H.N. de With,F. van der Sommen
关键词-EN: Deep Learning-based computer, Learning-based computer vision, Deep Learning-based, scope of Deep, Learning-based computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
备注: 20 pages
点击查看摘要
Abstract:Advancements in image segmentation play an integral role within the greater scope of Deep Learning-based computer vision. Furthermore, their widespread applicability in critical real-world tasks has given rise to challenges related to the reliability of such algorithms. Hence, uncertainty quantification has been extensively studied within this context, enabling expression of model ignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to prevent uninformed decision making. Due to the rapid adoption of Convolutional Neural Network (CNN)-based segmentation models in high-stakes applications, a substantial body of research has been published on this very topic, causing its swift expansion into a distinct field. This work provides a comprehensive overview of probabilistic segmentation by discussing fundamental concepts in uncertainty that govern advancements in the field as well as the application to various tasks. We identify that quantifying aleatoric and epistemic uncertainty approximates Bayesian inference w.r.t. either latent variables or model parameters, respectively. Moreover, literature on both uncertainties traces back to four key applications: (1) to quantify statistical inconsistencies in the annotation process due to ambiguous images, (2) correlating prediction error with uncertainty, (3) expanding the model hypothesis space for better generalization, and (4) active learning. Then, a discussion follows that includes an overview of utilized datasets for each of the applications and comparison of the available methods. We also highlight challenges related to architectures, uncertainty-based active learning, standardization and benchmarking, and recommendations for future work such as methods based on single forward passes and models that appropriately leverage volumetric data.
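文中指出量化偶然/认知不确定性分别对应对潜变量或模型参数的近似贝叶斯推断。一个常见的具体做法是对多次随机前向(例如 MC dropout 或深度集成)得到的类别概率做熵分解:总预测熵减去期望熵即互信息,用来衡量认知不确定性。下面给出该分解的最小示意(采样次数、类别数为假设值):

```python
import torch

def uncertainty_decomposition(probs: torch.Tensor):
    """probs: (T, B, C) —— T 次随机前向得到的类别概率。
    总不确定性 = 预测熵;偶然不确定性 ≈ 期望熵;认知不确定性 = 两者之差(互信息)。"""
    eps = 1e-12
    mean_p = probs.mean(dim=0)                                          # (B, C) 平均预测
    total = -(mean_p * (mean_p + eps).log()).sum(dim=-1)                # 预测熵
    aleatoric = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)  # 期望熵
    epistemic = total - aleatoric                                       # 互信息
    return total, aleatoric, epistemic

T, B, C = 20, 4, 3
probs = torch.softmax(torch.randn(T, B, C), dim=-1)   # 假设的 T 次采样输出
print(uncertainty_decomposition(probs))
```

在分割任务中,这一分解按像素逐点计算即可得到逐像素的不确定性图。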
zh
[CV-32] Cluster-based human-in-the-loop strategy for improving machine learning-based circulating tumor cell detection in liquid biopsy
【速读】: 该论文试图解决循环肿瘤细胞 (CTCs) 和非CTCs在癌症患者血液样本中的检测与区分问题。解决方案的关键在于引入了一种人机协作 (Human-in-the-Loop, HiL) 策略,通过结合自监督深度学习和传统机器学习分类器,迭代地由专家对新样本进行有针对性的采样和标注。具体来说,该方法基于局部潜在空间簇的分类性能,选择性地采样未标注的训练样本,从而提高机器学习系统的准确性和可靠性。与简单的随机采样相比,这种有针对性的采样策略显著提升了液体活检数据在转移性乳腺癌患者中的应用效果。
链接: https://arxiv.org/abs/2411.16332
作者: Hümeyra Husseini-Wüsthoff,Sabine Riethdorf,Andreas Schneeweiss,Andreas Trumpp,Klaus Pantel,Harriet Wikman,Maximilian Nielsen,René Werner
关键词-EN: circulating tumor cells, pose multiple challenges, patients pose multiple, tumor cells, multiple challenges
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Detection and differentiation of circulating tumor cells (CTCs) and non-CTCs in blood draws of cancer patients pose multiple challenges. While the gold standard relies on tedious manual evaluation of an automatically generated selection of images, machine learning (ML) techniques offer the potential to automate these processes. However, human assessment remains indispensable when the ML system arrives at uncertain or wrong decisions due to an insufficient set of labeled training data. This study introduces a human-in-the-loop (HiL) strategy for improving ML-based CTC detection. We combine self-supervised deep learning and a conventional ML-based classifier and propose iterative targeted sampling and labeling of new unlabeled training samples by human experts. The sampling strategy is based on the classification performance of local latent space clusters. The advantages of the proposed approach compared to naive random sampling are demonstrated for liquid biopsy data from patients with metastatic breast cancer.
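下面按摘要描述的“基于潜空间簇分类性能的定向采样”写一个玩具示意:对嵌入聚类,统计每个簇上已标注样本的分类准确率,再从表现最差的簇中为专家挑选新的未标注样本。嵌入、标注与对错信息均为随机生成,聚类数为假设值,仅用于说明流程。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))                  # 自监督模型给出的潜空间嵌入(随机代替)
labeled = rng.choice(1000, size=200, replace=False)
correct = rng.random(200) > 0.3                    # 已标注样本上分类器是否判对(玩具数据)

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(emb)   # 1) 潜空间聚类

acc = np.ones(8)                                   # 2) 每个簇上的分类准确率
for c in range(8):
    in_c = km.labels_[labeled] == c
    if in_c.any():
        acc[c] = correct[in_c].mean()

worst = int(acc.argmin())                          # 3) 从最差簇中采样送交专家标注
unlabeled = np.setdiff1d(np.arange(1000), labeled)
candidates = unlabeled[km.labels_[unlabeled] == worst]
to_label = rng.choice(candidates, size=min(20, len(candidates)), replace=False)
print(f"最差簇 {worst} (acc={acc[worst]:.2f}),本轮送标 {len(to_label)} 个样本")
```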
zh
[CV-33] CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain
【速读】: 该论文试图解决在极端光照条件下,利用可见光合成红外(IR)图像时出现的细节损失和伪热交叉伪影问题。解决方案的关键在于提出了CapHDR2IR框架,该框架利用高动态范围(HDR)图像作为输入,结合视觉-语言模型生成IR图像。HDR图像能够捕捉更广泛的亮度变化,确保在不同光照条件下生成可靠的IR图像。此外,通过密集标注分支引入语义理解,使得生成的IR图像更具意义和可辨识性。实验结果表明,CapHDR2IR在HDRT数据集上达到了最先进的性能,优于现有的通用域转换方法和专门用于可见光到红外图像转换的方法。
链接: https://arxiv.org/abs/2411.16327
作者: Jingchao Peng,Thomas Bashford-Rogers,Zhuang Shao,Haitao Zhao,Aru Ranjan Singh,Abhishek Goswami,Kurt Debattista
关键词-EN: imaging offers advantages, imaging offers, offers advantages, unique ability, ability of capturing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Infrared (IR) imaging offers advantages in several fields due to its unique ability to capture content in extreme light conditions. However, the demanding hardware requirements of high-resolution IR sensors limit its widespread application. As an alternative, visible light can be used to synthesize IR images but this causes a loss of fidelity in image details and introduces inconsistencies due to lack of contextual awareness of the scene. This stems from a combination of using visible light with a standard dynamic range, especially under extreme lighting, and a lack of contextual awareness, which can result in pseudo-thermal-crossover artifacts. This occurs when multiple objects with similar temperatures appear indistinguishable in the training data, further exacerbating the loss of fidelity. To solve this challenge, this paper proposes CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images. HDR images capture a wider range of luminance variations, ensuring reliable IR image generation in different light conditions. Additionally, a dense caption branch integrates semantic understanding, resulting in more meaningful and discernible IR outputs. Extensive experiments on the HDRT dataset show that the proposed CapHDR2IR achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.
zh
[CV-34] Brain-like emergent properties in deep networks: impact of network architecture datasets and training
【速读】: 该论文试图解决深度网络在标准化视觉基准测试中表现优异,但在现实世界视觉任务中仍不如人类的问题。解决方案的关键在于使深度网络更具类脑特性。论文通过系统评估30多种最先进的深度网络,发现网络架构对类脑特性的影响最大,而数据集和训练机制的影响相对较小。此外,不同网络在类脑特性上的表现差异显著,没有单一网络在所有类脑特性上均优于其他网络。这些发现补充了现有基准测试,揭示了当前最先进深度网络中存在的类脑特性的涌现或缺失。
链接: https://arxiv.org/abs/2411.16326
作者: Niranjan Rajesh,Georgin Jacob,SP Arun
关键词-EN: real-world vision tasks, standardized vision benchmarks, deep networks, vision tasks, standardized vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite the rapid pace at which deep networks are improving on standardized vision benchmarks, they are still outperformed by humans on real-world vision tasks. This paradoxical lack of generalization could be addressed by making deep networks more brain-like. Although several benchmarks have compared the ability of deep networks to predict brain responses to natural images, they do not capture subtle but important brain-like emergent properties. To resolve this issue, we report several well-known perceptual and neural emergent properties that can be tested on deep networks. To evaluate how various design factors impact brain-like properties, we systematically evaluated over 30 state-of-the-art networks with varying network architectures, training datasets and training regimes. Our main findings are as follows. First, network architecture had the strongest impact on brain-like properties compared to dataset and training regime variations. Second, networks varied widely in their alignment to the brain with no single network outperforming all others. Taken together, our results complement existing benchmarks by revealing brain-like properties that are either emergent or lacking in state-of-the-art deep networks.
zh
[CV-35] Luminance Component Analysis for Exposure Correction
【速读】: 该论文试图解决现有曝光校正方法在分离亮度相关和亮度无关成分时存在的困难,导致颜色失真、细节丢失以及需要额外修复步骤的问题。解决方案的关键在于提出了一种基于亮度成分分析 (Luminance Component Analysis, LCA) 的方法,该方法通过在U-Net结构中应用正交约束,成功解耦了亮度相关和亮度无关特征。LCA仅调整亮度相关成分,同时保持亮度无关成分不变,并通过几何优化算法将欧几里得空间中的约束问题转化为正交Stiefel流形中的无约束问题,从而优化正交约束。实验结果表明,LCA能够有效分离RGB色彩空间中的亮度特征,并在曝光校正数据集上实现了最佳的PSNR (21.33) 和SSIM (0.88),处理速度达到28.72 FPS。
链接: https://arxiv.org/abs/2411.16325
作者: Jingchao Peng,Thomas Bashford-Rogers,Jingkun Chen,Haitao Zhao,Zhengwei Hu,Kurt Debattista
关键词-EN: Exposure correction methods, Exposure correction, correction methods aim, current exposure correction, correction methods
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Exposure correction methods aim to adjust the luminance while maintaining other luminance-unrelated information. However, current exposure correction methods have difficulty in fully separating luminance-related and luminance-unrelated components, leading to distortions in color, loss of detail, and requiring extra restoration procedures. Inspired by principal component analysis (PCA), this paper proposes an exposure correction method called luminance component analysis (LCA). LCA applies the orthogonal constraint to a U-Net structure to decouple luminance-related and luminance-unrelated features. With decoupled luminance-related features, LCA adjusts only the luminance-related components while keeping the luminance-unrelated components unchanged. To optimize the orthogonal constraint problem, LCA employs a geometric optimization algorithm, which converts the constrained problem in Euclidean space to an unconstrained problem in orthogonal Stiefel manifolds. Extensive experiments show that LCA can decouple the luminance feature from the RGB color space. Moreover, LCA achieves the best PSNR (21.33) and SSIM (0.88) in the exposure correction dataset with 28.72 FPS.
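LCA 在 U-Net 中施加正交约束,并把欧氏空间中的约束优化转化为 Stiefel 流形上的无约束优化,摘要未给出具体算法。作为概念示意,下面用 PyTorch 自带的正交重参数化演示“带正交约束的权重”本身;这并不是论文的流形优化算法,层的形状等也是假设值。

```python
import torch
import torch.nn as nn
from torch.nn.utils import parametrizations

layer = nn.Linear(64, 64, bias=False)
layer = parametrizations.orthogonal(layer)       # 权重被重参数化为正交矩阵

x = torch.randn(8, 64)
w = layer.weight                                  # 每次访问都会重构出满足约束的权重
print(torch.allclose(w @ w.t(), torch.eye(64), atol=1e-4))   # True:行向量两两正交
y = layer(x)                                      # 正交分解后的特征,可只调整其中一部分分量
```

正交性保证各输出分量互不相关,这正是“只调整亮度相关分量而保持亮度无关分量不变”所需要的解耦前提。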
zh
[CV-36] CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation
【速读】: 该论文试图解决传统2D图像实例分割算法依赖大量人工标注数据的问题,并针对现有无监督方法在处理重叠实例时的不足,提出了一种新的解决方案。其关键在于利用场景的点云表示,在3D空间中对语义掩码进行切割,从而获得最终的2D实例分割结果。此外,论文还引入了一个空间重要性函数(Spatial Importance function),用于沿着实例的3D边界重新锐化语义信息,并通过三个空间置信度组件(Spatial Confidence components)增强类无关检测器的训练,以减少掩码模糊性。这些创新使得该方法在多个无监督实例分割和目标检测的标准基准测试中超越了现有方法。
链接: https://arxiv.org/abs/2411.16319
作者: Leon Sick,Dominik Engel,Sebastian Hartwig,Pedro Hermosilla,Timo Ropinski
关键词-EN: algorithms that learn, human-annotated data, learn to segment, heavily relied, relied on large
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Traditionally, algorithms that learn to segment object instances in 2D images have heavily relied on large amounts of human-annotated data. Only recently, novel approaches have emerged tackling this problem in an unsupervised fashion. Generally, these approaches first generate pseudo-masks and then train a class-agnostic detector. While such methods deliver the current state of the art, they often fail to correctly separate instances overlapping in 2D image space since only semantics are considered. To tackle this issue, we instead propose to cut the semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. Furthermore, we derive a Spatial Importance function, which we use to resharpen the semantics along the 3D borders of instances. Nevertheless, these pseudo-masks are still subject to mask ambiguity. To address this issue, we further propose to augment the training of a class-agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal. With these contributions, our approach outperforms competing methods across multiple standard benchmarks for unsupervised instance segmentation and object detection.
zh
[CV-37] One Diffusion to Generate Them All
【速读】: 该论文试图解决多任务图像合成与理解的问题,特别是如何在一个统一的框架下支持多种条件生成和逆向任务,如文本到图像生成、图像去模糊、超分辨率、深度估计和分割等。解决方案的关键在于提出了OneDiffusion模型,该模型通过将所有任务视为带有不同噪声尺度的帧序列进行训练,从而在推理时允许任何帧作为条件图像。这种统一的方法不仅简化了架构设计,还支持可扩展的多任务训练,并能平滑适应任意分辨率,从而增强了模型的泛化能力和可扩展性。
链接: https://arxiv.org/abs/2411.16318
作者: Duong H. Le,Tuan Pham,Sangho Lee,Christopher Clark,Aniruddha Kembhavi,Stephan Mandt,Ranjay Krishna,Jiasen Lu
关键词-EN: large-scale diffusion model, bidirectional image synthesis, seamlessly supports bidirectional, large-scale diffusion, camera pose estimation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: two first authors contribute equally
点击查看摘要
Abstract:We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction such as text-to-image, multiview generation, ID preservation, depth estimation and camera pose estimation despite relatively small training dataset. Our code and checkpoint are freely available at this https URL
zh
[CV-38] Monocular Lane Detection Based on Deep Learning: A Survey
【速读】: 该论文旨在全面综述基于深度学习的单目车道检测方法,并探讨其在自动驾驶感知系统中的应用。解决方案的关键在于四个核心设计要素:(1) 任务范式,专注于车道实例级别的区分;(2) 车道建模,将车道表示为神经网络中的可学习参数;(3) 全局上下文补充,增强对遮挡车道的检测;(4) 透视效应消除,提供可用于下游应用的3D车道信息。论文不仅涵盖了日益成熟的2D车道检测方法,还涉及正在发展的3D车道检测工作,并通过统一的设置比较了主流方法在不同基准上的性能和推理速度。此外,论文还介绍了车道检测的扩展工作,如多任务感知、视频车道检测、在线高清地图构建和车道拓扑推理,为读者提供了车道检测技术演变的全面路线图。
链接: https://arxiv.org/abs/2411.16316
作者: Xin He,Haiyun Guo,Kuan Zhu,Bingke Zhu,Xu Zhao,Jianwu Fang,Jinqiao Wang
关键词-EN: autonomous driving perception, Lane detection, driving perception system, Lane detection plays, Lane
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Lane detection plays an important role in autonomous driving perception system. As deep learning algorithms gain popularity, monocular lane detection methods based on deep learning have demonstrated superior performance and emerged as a key research direction in autonomous driving perception. The core design of these algorithmic frameworks can be summarized as follows: (1) Task paradigm, focusing on lane instance-level discrimination; (2) Lane modeling, representing lanes as a set of learnable parameters in the neural network; (3) Global context supplementation, enhancing the detection of obscured lanes; (4) Perspective effect elimination, providing 3D lanes usable for downstream applications. From these perspectives, this paper presents a comprehensive overview of existing methods, encompassing both the increasingly mature 2D lane detection approaches and the developing 3D lane detection works. For a relatively fair comparison, in addition to comparing the performance of mainstream methods on different benchmarks, their inference speed is also investigated under a unified setting. Moreover, we present some extended works on lane detection, including multi-task perception, video lane detection, online high-definition (HD) map construction, and lane topology reasoning, to offer readers a comprehensive roadmap for the evolution of lane detection. Finally, we point out some potential future research directions in this field. We exhaustively collect the papers and codes of existing works at this https URL and will keep tracing the research.
zh
[CV-39] EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training
【速读】: 该论文试图解决在视频传输系统中,利用深度神经网络(DNNs)的过拟合特性进行超分辨率(SR)重建时,训练大量视频帧所带来的巨大计算成本问题。解决方案的关键在于提出了一种高效的补丁采样方法,称为EPS(Efficient Patch Sampling),用于视频SR网络的过拟合训练。EPS方法通过引入基于离散余弦变换(DCT)的空间-时间特征,直接评估每个补丁的复杂度得分,并根据这些特征的直方图分布将所有可能的补丁分类到不同的簇中,从包含最高空间-时间信息的簇中选择训练补丁。该方法自适应地调整采样补丁的数量,以平衡训练复杂度和效率,从而将训练补丁数量减少到4%至25%,同时保持高视频质量和显著提高训练效率。与最先进的补丁采样方法EMT相比,EPS方法的整体运行时间减少了83%。
链接: https://arxiv.org/abs/2411.16312
作者: Yiying Wei,Hadi Amirpour,Jong Hwan Ko,Christian Timmerer
关键词-EN: video delivery systems, deep neural networks, property of deep, deep neural, delivery systems
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Leveraging the overfitting property of deep neural networks (DNNs) is trending in video delivery systems to enhance quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams for low-resolution (LR) bitstreams, which are used to reconstruct high-resolution (HR) videos at the decoder. Although these approaches show promising results, the huge computational costs of training a large number of video frames limit their practical applications. To overcome this challenge, we propose an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames. To this end, we first present two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features to measure the complexity score of each patch directly. By analyzing the histogram distribution of these features, we then categorize all possible patches into different clusters and select training patches from the cluster with the highest spatial-temporal information. The number of sampled patches is adaptive based on the video content, addressing the trade-off between training complexity and efficiency. Our method reduces the number of patches for the training to 4% to 25%, depending on the resolution and number of clusters, while maintaining high video quality and significantly enhancing training efficiency. Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
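EPS 用基于 DCT 的低复杂度空间-时间特征直接给补丁打复杂度分。下面给出一个简化的空间复杂度打分示意,用高频能量占比近似“纹理丰富程度”;这只是概念性草图,并非论文原始的特征定义。

```python
import numpy as np
from scipy.fft import dctn

def patch_complexity(patch: np.ndarray) -> float:
    """二维 DCT 后,除直流分量外的能量占比越高,补丁纹理越复杂(示意)。"""
    coef = dctn(patch.astype(np.float64), norm="ortho")
    energy = coef ** 2
    total, dc = energy.sum(), energy[0, 0]
    return float((total - dc) / (total + 1e-12))

rng = np.random.default_rng(0)
flat = np.full((64, 64), 0.5) + 0.01 * rng.random((64, 64))   # 平坦补丁
texture = rng.random((64, 64))                                 # 高纹理补丁
print(patch_complexity(flat), patch_complexity(texture))       # 后者得分明显更高
```

在此得分(以及相应的时间差分特征)的直方图上聚类,再只从信息量最高的簇里取训练补丁,就得到了摘要所说的自适应采样。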
zh
[CV-40] Functionality understanding and segmentation in 3D scenes
【速读】: 该论文试图解决在三维场景中理解功能性对象的问题,即通过自然语言描述定位三维环境中的功能性交互对象(如把手和按钮)。解决方案的关键在于引入了一种名为Fun3DU的新方法,该方法利用语言模型通过Chain-of-Thought推理解析任务描述,以识别感兴趣的对象。随后,通过视觉和语言模型在捕获场景的多视图中进行对象分割,并将各视图的分割结果提升到三维空间并聚合到点云中,利用几何信息进行处理。Fun3DU方法无需额外训练,完全依赖预训练模型,并在SceneFun3D数据集上显著优于现有的开放词汇三维分割方法。
链接: https://arxiv.org/abs/2411.16310
作者: Jaime Corsetti,Francesco Giuliari,Alice Fasoli,Davide Boscaini,Fabio Poiesi
关键词-EN: involves interpreting natural, locate functional interactive, interpreting natural language, functional interactive objects, scenes involves interpreting
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report. 20 pages, 12 figures, 7 tables
点击查看摘要
Abstract:Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. For example, given a task like ‘turn on the ceiling light’, an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted in 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state-of-the-art open-vocabulary 3D segmentation approaches. Code will be released publicly.
zh
[CV-41] An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models
【速读】: 该论文试图解决现有条件去噪扩散概率模型(DDPMs)在处理3D场景理解任务时面临的挑战,特别是在复杂几何细节场景中,由于数据分布梯度(scores)拟合困难导致的训练和推理时间较长的问题。解决方案的关键在于提出了一种基于条件-噪声框架(Conditional-Noise Framework, CNF)的端到端鲁棒语义分割网络,名为CDSegNet。CDSegNet通过将噪声网络(Noise Network, NN)建模为可学习的噪声特征生成器,使得条件网络(Conditional Network, CN)能够在多层次特征扰动下理解3D场景语义,从而增强了对未见场景的泛化能力。此外,CDSegNet利用DDPMs的噪声系统,在实验中表现出强大的噪声和稀疏性鲁棒性。由于避免了直接从语义标签中拟合梯度,CDSegNet能够在单步推理中生成语义标签,显著缩短了推理时间,并在公开的室内外基准测试中取得了最先进的性能。
链接: https://arxiv.org/abs/2411.16308
作者: Wentao Qu,Jing Wang,YongShun Gong,Xiaoshui Huang,Liang Xiao
关键词-EN: Denoising Diffusion Probabilistic, conditional Denoising Diffusion, Diffusion Probabilistic Models, Denoising Diffusion, Diffusion Probabilistic
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non-DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end-to-end robust semantic Segmentation Network based on a Conditional-Noise Framework (CNF) of DDPMs, named CDSegNet. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise-feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi-level feature perturbations, enhancing the generalization in unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single-step inference like non-DDPMs, due to avoiding directly fitting the scores from semantic labels in the dominant network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state-of-the-art performance.
zh
[CV-42] DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation
【速读】: 该论文试图解决室内设计过程中效率低下和生成式设计与实际需求之间存在显著差异的问题。解决方案的关键在于提出了DiffDesign,一种可控的扩散模型,结合了元先验信息,以提高室内设计生成的效率和质量。具体来说,DiffDesign利用预训练的2D扩散模型的生成先验作为渲染基础,并通过解耦交叉注意力控制设计属性(如外观、姿态和尺寸)来指导去噪过程。此外,引入了一个基于最优传输的对齐模块来确保视图一致性。论文还构建了一个专门的室内设计数据集DesignHelper,用于微调模型,从而提高其在不同空间类型和设计风格上的适应性和鲁棒性。
链接: https://arxiv.org/abs/2411.16301
作者: Yuxuan Yang,Jingyao Wang,Tao Geng,Wenwen Qiang,Changwen Zheng,Fuchun Sun
关键词-EN: discipline involving aesthetics, creative discipline involving, involving aesthetics, materials science, complex and creative
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 32 pages
点击查看摘要
Abstract:Interior design is a complex and creative discipline involving aesthetics, functionality, ergonomics, and materials science. Effective solutions must meet diverse requirements, typically producing multiple deliverables such as renderings and design drawings from various perspectives. Consequently, interior design processes are often inefficient and demand significant creativity. With advances in machine learning, generative models have emerged as a promising means of improving efficiency by creating designs from text descriptions or sketches. However, few generative works focus on interior design, leading to substantial discrepancies between outputs and practical needs, such as differences in size, spatial scope, and the lack of controllable generation quality. To address these challenges, we propose DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. Specifically, we utilize the generative priors of a 2D diffusion model pre-trained on a large image dataset as our rendering backbone. We further guide the denoising process by disentangling cross-attention control over design attributes, such as appearance, pose, and size, and introduce an optimal transport-based alignment module to enforce view consistency. Simultaneously, we construct an interior design-specific dataset, DesignHelper, consisting of over 400 solutions across more than 15 spatial types and 15 design styles. This dataset helps fine-tune DiffDesign. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness and robustness of DiffDesign.
zh
[CV-43] A Performance Increment Strategy for Semantic Segmentation of Low-Resolution Images from Damaged Roads
【速读】: 该论文试图解决新兴国家道路状况复杂、数据集质量低下的问题,特别是在自动驾驶领域中,现有语义分割数据集主要基于高分辨率、维护良好的城市道路图像,而忽视了低分辨率、维护不良的道路图像。解决方案的关键在于提出了性能提升策略(Performance Increment Strategy for Semantic Segmentation, PISSS),通过14个训练实验来提升模型性能,特别是在处理像素少、形状不确定和类别高度不平衡的对象时。该策略在Road Traversing Knowledge (RTK)和Technik Autonomer Systeme 500 (TAS500)测试集上分别达到了79.8和68.8的mIoU,达到了当前最先进的结果,并分析了DeepLabV3+在小对象分割中的不足。
链接: https://arxiv.org/abs/2411.16295
作者: Rafael S. Toledo,Cristiano S. Oliveira,Vitor H. T. Oliveira,Eric A. Antonelo,Aldo von Wangenheim
关键词-EN: Autonomous driving, deep learning models, well-maintained urban roads, semantic segmentation datasets, Brazilian roads
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Autonomous driving needs good roads, but 85% of Brazilian roads have damage that deep learning models may not account for, since most semantic segmentation datasets for autonomous driving consist of high-resolution images of well-maintained urban roads. A representative dataset for emerging countries consists of low-resolution images of poorly maintained roads and includes labels of damage classes; in this scenario, three challenges arise: objects with few pixels, objects with undefined shapes, and highly underrepresented classes. To tackle these challenges, this work proposes the Performance Increment Strategy for Semantic Segmentation (PISSS) as a methodology of 14 training experiments to boost performance. With PISSS, we reached state-of-the-art results of 79.8 and 68.8 mIoU on the Road Traversing Knowledge (RTK) and Technik Autonomer Systeme 500 (TAS500) test sets, respectively. Furthermore, we also offer an analysis of DeepLabV3+ pitfalls for small object segmentation.
zh
[CV-44] Utilizing Uncertainty in 2D Pose Detectors for Probabilistic 3D Human Mesh Recovery WACV2025
【速读】: 该论文试图解决单目3D人体姿态和形状估计中的深度模糊、遮挡和截断问题。解决方案的关键在于提出了一种新的监督学习方法,通过最小化学习到的3D人体网格分布与2D姿态检测器生成的热图分布之间的距离,来增强模型对真实分布的捕捉能力。此外,论文还揭示了现有方法在不可见关节上生成错误假设的问题,并提出利用人体分割掩码在训练过程中减少无效样本的数量,同时引入两个新的评估指标来衡量这一改进。最终,基于归一化流的方法能够生成与图像证据一致且对模糊身体部位保持高多样性的合理3D人体网格假设。
链接: https://arxiv.org/abs/2411.16289
作者: Tom Wehrbein,Marco Rudolph,Bodo Rosenhahn,Bastian Wandt
关键词-EN: inherently ill-posed problem, ill-posed problem due, depth ambiguities, shape estimation, inherently ill-posed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2025
点击查看摘要
Abstract:Monocular 3D human pose and shape estimation is an inherently ill-posed problem due to depth ambiguities, occlusions, and truncations. Recent probabilistic approaches learn a distribution over plausible 3D human meshes by maximizing the likelihood of the ground-truth pose given an image. We show that this objective function alone is not sufficient to best capture the full distributions. Instead, we propose to additionally supervise the learned distributions by minimizing the distance to distributions encoded in heatmaps of a 2D pose detector. Moreover, we reveal that current methods often generate incorrect hypotheses for invisible joints which is not detected by the evaluation protocols. We demonstrate that person segmentation masks can be utilized during training to significantly decrease the number of invalid samples and introduce two metrics to evaluate it. Our normalizing flow-based approach predicts plausible 3D human mesh hypotheses that are consistent with the image evidence while maintaining high diversity for ambiguous body parts. Experiments on 3DPW and EMDB show that we outperform other state-of-the-art probabilistic methods. Code is available for research purposes at this https URL.
zh
[CV-45] Open-Vocabulary Octree-Graph for 3D Scene Understanding
【速读】: 该论文试图解决开放词汇3D场景理解中的存储效率和空间关系表达问题。现有方法依赖于点云数据,虽然能够进行物体分割,但点云数据的无序性和高存储需求限制了其在下游任务(如路径规划和复杂文本对象检索)中的效率。论文提出的解决方案之关键是Octree-Graph,它通过以下步骤实现:首先,设计了时间顺序分组分割合并策略(Chronological Group-wise Segment Merging, CGSM)和实例特征聚合算法(Instance Feature Aggregation, IFA)来获取3D实例及其语义特征;接着,开发了一种自适应八叉树结构,根据物体形状动态调整存储语义信息和占用状态;最后,构建了Octree-Graph,其中每个自适应八叉树作为图节点,节点间的边描述了空间关系。这种方法在多个广泛使用的数据集上进行了广泛实验,展示了其多功能性和有效性。
链接: https://arxiv.org/abs/2411.16253
作者: Zhigang Wang,Yifei Su,Chenhui Li,Dong Wang,Yan Huang,Bin Zhao,Xuelong Li
关键词-EN: embodied agents, indispensable for embodied, Group-wise Segment Merging, Chronological Group-wise Segment, scene understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11pages,7figures
点击查看摘要
Abstract:Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and complex text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed where each adaptive-octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely-used datasets, demonstrating the versatility and effectiveness of our method.
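下面是对“自适应八叉树作为图节点、边描述空间关系”这一表示的极简数据结构草图(纯示意,字段与论文实现并无一一对应,CGSM/IFA 等细节请以论文为准):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class AdaptiveOctree:
    """自适应八叉树:只有包含点的卦限才继续细分,叶子记录占据状态。"""
    center: np.ndarray
    size: float
    occupied: bool = False
    children: list = field(default_factory=list)

    def build(self, points, min_size=0.1):
        if len(points) == 0:
            return
        self.occupied = True
        if self.size <= min_size:
            return
        half = self.size / 2.0
        for dx in (-1, 1):
            for dy in (-1, 1):
                for dz in (-1, 1):
                    c = self.center + 0.25 * self.size * np.array([dx, dy, dz], dtype=float)
                    mask = np.all(np.abs(points - c) <= half / 2.0, axis=1)
                    child = AdaptiveOctree(c, half)
                    child.build(points[mask], min_size)
                    if child.occupied:
                        self.children.append(child)

@dataclass
class OctreeGraph:
    """图节点为各实例的自适应八叉树(附语义特征),边记录实例间的空间关系。"""
    nodes: dict = field(default_factory=dict)   # instance_id -> (octree, 语义特征向量)
    edges: list = field(default_factory=list)   # (id_a, id_b, 关系描述)

    def add_instance(self, inst_id, octree, feature):
        self.nodes[inst_id] = (octree, feature)

    def add_relation(self, a, b, relation):
        self.edges.append((a, b, relation))

if __name__ == "__main__":
    pts = np.random.rand(200, 3)                # 某个实例的点云(示意)
    tree = AdaptiveOctree(center=np.array([0.5, 0.5, 0.5]), size=1.0)
    tree.build(pts)
    graph = OctreeGraph()
    graph.add_instance("chair_0", tree, np.random.rand(512))
    graph.add_relation("chair_0", "table_0", "next to")
    print(len(tree.children), len(graph.nodes), len(graph.edges))
```

相比无序点云,这类按形状自适应细分的结构在表达占据信息与空间关系时更紧凑,也更便于路径规划等下游任务直接查询。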
zh
[CV-46] Diagnosis of diabetic retinopathy using machine learning deep learning technique
【速读】: 该论文试图解决眼底图像(fundus images)在诊断多种眼病(如糖尿病视网膜病变、青光眼和年龄相关性黄斑变性)时,手动分析耗时且易出错的问题。解决方案的关键在于采用目标检测(object detection)和机器学习分类技术。具体来说,论文使用YOLO_V8进行眼底图像的目标检测,定位视盘(optic disc)、视杯(optic cup)和病灶(lesions)等感兴趣区域(ROIs),然后利用支持向量机(SVM)分类算法根据病理特征(如渗出物、微动脉瘤和出血等)将ROIs分类为不同的糖尿病视网膜病变(DR)阶段。该方法在眼底检测中达到了84%的准确率和效率,特别适用于全球偏远地区的眼底疾病筛查。
链接: https://arxiv.org/abs/2411.16250
作者: Eric Shah,Jay Patel,Mr.Vishal Katheriya,Parth Pataliya
关键词-EN: age-related macular degeneration, Fundus images, diabetic retinopathy, macular degeneration, diagnosing various eye
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages, 11 figures, Journal Paper
点击查看摘要
Abstract:Fundus images are widely used for diagnosing various eye diseases, such as diabetic retinopathy, glaucoma, and age-related macular degeneration. However, manual analysis of fundus images is time-consuming and prone to errors. In this report, we propose a novel method for fundus detection using object detection and machine learning classification techniques. We use YOLO_V8 to perform object detection on fundus images and locate the regions of interest (ROIs) such as the optic disc, optic cup, and lesions. We then use machine learning SVM classification algorithms to classify the ROIs into different DR stages based on the presence or absence of pathological signs such as exudates, microaneurysms, and haemorrhages. Our method achieves 84% accuracy for fundus detection with high efficiency and can be applied for retinal fundus disease triage, especially in remote areas around the world.
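下面给出“检测 ROI 后再用 SVM 按 DR 分期分类”这一流程的假设性示意:detect_rois 为占位函数(实际应替换为 YOLO_V8 的检测输出),ROI 特征也以简单的颜色统计代替论文中的病理特征,仅用于说明两阶段的衔接方式。

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def detect_rois(fundus_image):
    """占位函数:实际应调用 YOLO_V8,返回视盘/视杯/病灶等 ROI 框 (x, y, w, h)。"""
    h, w = fundus_image.shape[:2]
    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w // 2, h // 2)]

def roi_features(image, box):
    """对 ROI 提取简单的颜色均值/方差特征(示意,论文中依据病理征象构造特征)。"""
    x, y, bw, bh = box
    roi = image[y:y + bh, x:x + bw].reshape(-1, image.shape[-1]).astype(np.float64)
    return np.concatenate([roi.mean(axis=0), roi.std(axis=0)])

# 训练:X 为各 ROI 的特征,y 为对应的 DR 分期(0~4),此处用随机数据演示
X_train = np.random.rand(100, 6)
y_train = np.random.randint(0, 5, size=100)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)

# 推理:对一张眼底图的每个 ROI 预测 DR 分期
image = (np.random.rand(512, 512, 3) * 255).astype(np.uint8)
stages = [int(clf.predict(roi_features(image, b)[None])[0]) for b in detect_rois(image)]
print(stages)
```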
zh
[CV-47] Weakly supervised image segmentation for defect-based grading of fresh produce
【速读】: 该论文试图解决农业中基于图像的机器学习应用在数据稀缺和标注不足的情况下,难以实现高质量模型预测的问题。具体而言,研究聚焦于在分散供应链中对香蕉的采后质量评估,特别是表面缺陷的检测与分割。解决方案的关键在于采用弱监督学习方法,利用粗略标签而非耗时的像素级标注,结合Segment Anything Model (SAM) 生成密集标注,从而显著减少人工标注工作量,同时实现了77.6%的panoptic quality评分。这一方法展示了在数据有限的农业环境中,通过低成本、高精度的分割技术进行缺陷量化评估的潜力。
链接: https://arxiv.org/abs/2411.16219
作者: Manuel Knott,Divinefavour Odion,Sameer Sontakke,Anup Karwa,Thijs Defraeye
关键词-EN: Implementing image-based machine, image-based machine learning, Implementing image-based, high-quality model predictions, achieve high-quality model
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Implementing image-based machine learning in agriculture is often limited by scarce data and annotations, making it hard to achieve high-quality model predictions. This study tackles the issue of postharvest quality assessment of bananas in decentralized supply chains. We propose a method to detect and segment surface defects in banana images using panoptic segmentation to quantify defect size and number. Instead of time-consuming pixel-level annotations, we use weak supervision with coarse labels. A dataset of 476 smartphone images of bananas was collected under real-world field conditions and annotated for bruises and scars. Using the Segment Anything Model (SAM), a recently published foundation model for image segmentation, we generated dense annotations from coarse bounding boxes to train a segmentation model, significantly reducing manual effort while achieving a panoptic quality score of 77.6%. This demonstrates SAM’s potential for low-effort, accurate segmentation in agricultural settings with limited data.
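其中“用粗框提示 SAM 生成稠密掩码”一步可以大致写成如下示意代码;segment_anything 的接口与权重文件名以官方仓库为准,这里的 checkpoint 路径与框坐标均为假设。

```python
import numpy as np
# pip install segment-anything;接口与权重文件名以官方仓库为准
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # 假设的权重路径
predictor = SamPredictor(sam)

image = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)    # 应替换为真实香蕉照片
predictor.set_image(image)

# 粗标注:一个 [x0, y0, x1, y1] 的缺陷包围框(假设坐标)
coarse_box = np.array([100, 120, 220, 260])
masks, scores, _ = predictor.predict(box=coarse_box, multimask_output=False)

dense_mask = masks[0]        # 得到的稠密掩码即可作为分割模型的训练标注
print(dense_mask.shape, float(scores[0]))
```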
zh
[CV-48] Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding
【速读】: 该论文试图解决多重退化图像恢复(Multiple-in-one Image Restoration)中存在的退化多样性和提示单一性问题。解决方案的关键在于设计了一个局部动态优化模块(Local Dynamic Optimization, LDO)和一个条件特征嵌入模块(Conditional Feature Embedding, CFE)。LDO模块能够动态处理不同类型和粒度的退化区域,而CFE模块则通过引导解码器利用与退化类型相关的特征,显著提升了模型在混合退化恢复场景中的性能。
链接: https://arxiv.org/abs/2411.16217
作者: Yubin Gu,Yuan Meng,Xiaoshuai Sun,Jiayi Ji,Weijian Ruan,Rongrong Ji
关键词-EN: made significant progress, significant progress, aiming to handle, made significant, Local Dynamic Optimization
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 8 tables
点击查看摘要
Abstract:Multiple-in-one image restoration (IR) has made significant progress, aiming to handle all types of single degraded image restoration with a single model. However, in real-world scenarios, images often suffer from combinations of multiple degradation factors. Existing multiple-in-one IR models encounter challenges related to degradation diversity and prompt singularity when addressing this issue. In this paper, we propose a novel multiple-in-one IR model that can effectively restore images with both single and mixed degradations. To address degradation diversity, we design a Local Dynamic Optimization (LDO) module which dynamically processes degraded areas of varying types and granularities. To tackle the prompt singularity issue, we develop an efficient Conditional Feature Embedding (CFE) module that guides the decoder in leveraging degradation-type-related features, significantly improving the model’s performance in mixed degradation restoration scenarios. To validate the effectiveness of our model, we introduce a new dataset containing both single and mixed degradation elements. Experimental results demonstrate that our proposed model achieves state-of-the-art (SOTA) performance not only on mixed degradation tasks but also on classic single-task restoration benchmarks.
zh
[CV-49] SMGDiff: Soccer Motion Generation using diffusion probabilistic models
【速读】: 该论文试图解决生成逼真足球运动的问题,特别是在视频游戏和VR/AR应用中,由于球员与球之间复杂的交互关系,生成实时且用户可控的足球动作具有挑战性。解决方案的关键在于引入SMGDiff,这是一个两阶段框架,结合了实时角色控制与基于扩散的生成模型。第一阶段将粗略的用户控制即时转换为多样化的角色全局轨迹,第二阶段利用基于Transformer的自回归扩散模型,根据轨迹条件生成足球动作,并在推理过程中通过接触引导模块优化球与脚的接触细节,以确保动作的高质量和多样性。此外,论文还贡献了一个包含超过108万帧多样化足球动作的大规模数据集。
链接: https://arxiv.org/abs/2411.16216
作者: Hongdi Yang,Chengyang Li,Zhenxuan Wu,Gaozheng Li,Jingya Wang,Jingyi Yu,Zhuo Su,Lan Xu
关键词-EN: globally renowned sport, globally renowned, renowned sport, sport with significant, significant applications
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Soccer is a globally renowned sport with significant applications in video games and VR/AR. However, generating realistic soccer motions remains challenging due to the intricate interactions between the human player and the ball. In this paper, we introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character control with a powerful diffusion-based generative model, ensuring high-quality and diverse output motion. In the first stage, we instantly transform coarse user controls into diverse global trajectories of the character. In the second stage, we employ a transformer-based autoregressive diffusion model to generate soccer motions based on trajectory conditioning. We further incorporate a contact guidance module during inference to optimize the contact details for realistic ball-foot interactions. Moreover, we contribute a large-scale soccer motion dataset consisting of over 1.08 million frames of diverse soccer motions. Extensive experiments demonstrate that our SMGDiff significantly outperforms existing methods in terms of motion quality and condition alignment.
zh
[CV-50] SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context
【速读】: 该论文试图解决现有大型语言模型(Video-LLMs)在理解和解释长视频时,难以有效整合视频中丰富的视听信息的问题。解决方案的关键在于:(i) 引入首个长音频-视觉视频数据集SAVEn-Vid,包含超过58k的音频-视觉指令;(ii) 提出时间感知的音频-视觉大型语言模型(AV-LLM)SAVEnVideo,并在SAVEn-Vid上进行微调;(iii) 创建AVBench基准,包含2,500个问答对,用于评估模型在长视频中增强的音频-视觉理解任务中的表现。实验结果表明,SAVEnVideo在零样本长视频任务(Video-MME)和零样本音频-视觉任务(Music-AVQA)中分别超越了现有最佳模型3.61%和1.29%,在7B参数规模下达到最先进水平。
链接: https://arxiv.org/abs/2411.16213
作者: Jungang Li,Sicheng Tao,Yibo Yan,Xiaojie Gu,Haodong Xu,Xu Zheng,Yuanhuiyi Lyu,Linfeng Zhang,Xuming Hu
关键词-EN: explore Large Language, Large Language Models, Large Language, Audio-Visual Large Language, interpreting long videos
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Endeavors have been made to explore Large Language Models for video analysis (Video-LLMs), particularly in understanding and interpreting long videos. However, existing Video-LLMs still face challenges in effectively integrating the rich and diverse audio-visual information inherent in long videos, which is crucial for comprehensive understanding. This raises the question: how can we leverage embedded audio-visual information to enhance long video understanding? Therefore, (i) we introduce SAVEn-Vid, the first-ever long audio-visual video dataset comprising over 58k audio-visual instructions. (ii) From the model perspective, we propose a time-aware Audio-Visual Large Language Model (AV-LLM), SAVEnVideo, fine-tuned on SAVEn-Vid. (iii) Besides, we present AVBench, a benchmark containing 2,500 QAs designed to evaluate models on enhanced audio-visual comprehension tasks within long video, challenging their ability to handle intricate audio-visual interactions. Experiments on AVBench reveal the limitations of current AV-LLMs. Experiments also demonstrate that SAVEnVideo outperforms the best Video-LLM by 3.61% on the zero-shot long video task (Video-MME) and surpasses the leading audio-visual LLM by 1.29% on the zero-shot audio-visual task (Music-AVQA). Consequently, at the 7B parameter scale, SAVEnVideo can achieve state-of-the-art performance. Our dataset and code will be released at this https URL upon acceptance.
zh
[CV-51] VIRES: Video Instance Repainting with Sketch and Text Guidance
【速读】: 该论文试图解决视频实例重绘、替换、生成和移除中的时序一致性和与提供草图序列的精确对齐问题。解决方案的关键在于引入VIRES方法,该方法利用文本到视频生成模型的生成先验来维持时序一致性,并生成视觉上令人满意的结果。具体技术包括:1) 提出顺序控制网络(Sequential ControlNet),通过标准化自缩放有效提取结构布局并自适应捕捉高对比度草图细节;2) 增强扩散变换器骨干网络(diffusion transformer backbone),加入草图注意力(sketch attention)以解释和注入细粒度草图语义;3) 设计草图感知编码器(sketch-aware encoder),确保重绘结果与提供的草图序列对齐。此外,论文还贡献了VireSet数据集,用于训练和评估视频实例编辑方法。实验结果表明,VIRES在视觉质量、时序一致性、条件对齐和人类评分方面优于现有最先进方法。
链接: https://arxiv.org/abs/2411.16199
作者: Shuchen Weng,Haojie Zheng,Peixuan Zhan,Yuchen Hong,Han Jiang,Si Li,Boxin Shi
关键词-EN: video instance repainting, instance repainting, text guidance, enabling video instance, instance repainting method
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.
zh
[CV-52] Interpreting Object-level Foundation Models via Visual Precision Search
【速读】: 该论文试图解决多模态预训练模型(如 Grounding DINO 和 Florence-2)在视觉定位和物体检测任务中决策解释的难题。现有解释方法(如基于梯度的方法和基于扰动的方法)存在显著局限性:(1) 基于梯度的方法由于模型内部视觉-文本融合导致定位不精确;(2) 基于扰动的方法生成的显著性图噪声较大,限制了细粒度解释能力。论文提出的解决方案是视觉精确搜索方法(Visual Precision Search),该方法通过将输入划分为稀疏子区域,并利用一致性和协作评分来准确识别关键决策区域,从而生成更精确的归因图。此方法绕过了模型内部参数,克服了多模态融合带来的归因问题,显著提升了对象级任务的解释性,实验结果表明在多个评估指标上超越了现有最先进方法。
链接: https://arxiv.org/abs/2411.16198
作者: Ruoyu Chen,Siyuan Liang,Jingzhi Li,Shiming Liu,Maosen Li,Zheng Huang,Hua Zhang,Xiaochun Cao
关键词-EN: Grounding DINO, propelled object-level foundation, pre-training have propelled, Visual Precision Search, object-level foundation models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Advances in multimodal pre-training have propelled object-level foundation models, such as Grounding DINO and Florence-2, in tasks like visual grounding and object detection. However, interpreting these models’ decisions has grown increasingly challenging. Existing interpretable attribution methods for object-level task interpretation have notable limitations: (1) gradient-based methods lack precise localization due to visual-textual fusion in foundation models, and (2) perturbation-based methods produce noisy saliency maps, limiting fine-grained interpretability. To address these, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. Our method bypasses internal model parameters to overcome attribution issues from multimodal fusion, dividing inputs into sparse sub-regions and using consistency and collaboration scores to accurately identify critical decision-making regions. We also conducted a theoretical analysis of the boundary guarantees and scope of applicability of our method. Experiments on RefCOCO, MS COCO, and LVIS show our approach enhances object-level task interpretability over SOTA for Grounding DINO and Florence-2 across various evaluation metrics, with faithfulness gains of 23.7%, 31.6%, and 20.1% on MS COCO, LVIS, and RefCOCO for Grounding DINO, and 102.9% and 66.9% on MS COCO and RefCOCO for Florence-2. Additionally, our method can interpret failures in visual grounding and object detection tasks, surpassing existing methods across multiple evaluation metrics. The code will be released at this https URL.
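为帮助理解“划分稀疏子区域并搜索关键决策区域”的思路,下面给出一个高度简化的贪心搜索示意:score_fn 为假设的黑盒得分接口(例如目标框的检测置信度),论文中基于一致性与协作得分的具体搜索策略请以原文为准。

```python
import numpy as np

def split_regions(h, w, grid=8):
    """把图像划为 grid x grid 个子区域,返回每个子区域的布尔掩码。"""
    masks = []
    for yy in np.array_split(np.arange(h), grid):
        for xx in np.array_split(np.arange(w), grid):
            m = np.zeros((h, w), dtype=bool)
            m[np.ix_(yy, xx)] = True
            masks.append(m)
    return masks

def precision_search(image, score_fn, grid=8, k=10):
    """贪心选择 k 个子区域:每步加入使目标得分提升最大的区域(简化示意)。"""
    h, w = image.shape[:2]
    regions = split_regions(h, w, grid)
    selected = np.zeros((h, w), dtype=bool)
    chosen = []
    for _ in range(k):
        base = score_fn(image * selected[..., None])
        gains = [score_fn(image * (selected | r)[..., None]) - base
                 if i not in chosen else -np.inf
                 for i, r in enumerate(regions)]
        best = int(np.argmax(gains))
        chosen.append(best)
        selected |= regions[best]
    return selected          # 归因图:被保留的关键决策区域

if __name__ == "__main__":
    img = np.random.rand(64, 64, 3)
    # 假设的黑盒得分:这里用平均亮度代替检测/定位置信度,仅作演示
    attribution = precision_search(img, score_fn=lambda x: float(x.mean()), grid=4, k=3)
    print(int(attribution.sum()))
```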
zh
[CV-53] Learn from Foundation Model: Fruit Detection Model without Manual Annotation
【速读】: 该论文试图解决农业领域数据稀缺的问题,特别是在水果检测任务中缺乏足够的标注数据。解决方案的关键在于提出了一种名为SDM-D(Segmentation-Description-Matching-Distilling)的框架,该框架利用基础模型(如SAM2和OpenCLIP)进行分割和零样本开放词汇分类,并通过知识蒸馏机制从这些基础模型中提取出高效、可部署于边缘设备的小型模型。SDM-D方法在无需手动标注的情况下,在水果检测任务(包括目标检测、语义分割和实例分割)中表现出色,几乎达到了使用大量标注数据训练的模型的性能,并且在开放集检测方法(如Grounding SAM和YOLO-World)中表现更优。
链接: https://arxiv.org/abs/2411.16196
作者: Yanan Wang,Zhenghao Fei,Ruichen Li,Yibin Ying
关键词-EN: limited data availability, Recent breakthroughs, transferring knowledge pre-trained, breakthroughs in large, enabled the possibility
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages, 12 figures, conference or other essential info
点击查看摘要
Abstract:Recent breakthroughs in large foundation models have enabled the possibility of transferring knowledge pre-trained on vast datasets to domains with limited data availability. Agriculture is one of the domains that lacks sufficient data. This study proposes a framework to train effective, domain-specific, small models from foundation models without manual annotation. Our approach begins with SDM (Segmentation-Description-Matching), a stage that leverages two foundation models: SAM2 (Segment Anything in Images and Videos) for segmentation and OpenCLIP (Open Contrastive Language-Image Pretraining) for zero-shot open-vocabulary classification. In the second stage, a novel knowledge distillation mechanism is utilized to distill compact, edge-deployable models from SDM, enhancing both inference speed and perception accuracy. The complete method, termed SDM-D (Segmentation-Description-Matching-Distilling), demonstrates strong performance across various fruit detection tasks (object detection, semantic segmentation, and instance segmentation) without manual annotation. It nearly matches the performance of models trained with abundant labels. Notably, SDM-D outperforms open-set detection methods such as Grounding SAM and YOLO-World on all tested fruit detection datasets. Additionally, we introduce MegaFruits, a comprehensive fruit segmentation dataset encompassing over 25,000 images, and all code and datasets are made publicly available at this https URL.
zh
[CV-54] Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation
【速读】: 该论文试图解决从单张图像生成3D网格(3D meshes)时存在的多视图不一致性、网格保真度不足以及生成的网格模糊等问题。解决方案的关键在于提出了Fancy123方法,该方法包含两个增强模块和一个反投影操作:外观增强模块用于调整2D多视图图像以纠正像素对齐问题,从而提高多视图一致性;保真度增强模块用于调整3D网格以更好地匹配输入图像;反投影操作则将输入图像和调整后的多视图图像投影到LRM生成的网格上,以确保高清晰度并去除LRM预测的模糊颜色。这些模块在推理时可即插即用,能够无缝集成到现有的单图像到3D方法中,并通过广泛的定性和定量实验验证了其显著优于现有技术的性能。
链接: https://arxiv.org/abs/2411.16185
作者: Qiao Yu,Xianzhi Li,Yuan Tang,Xu Han,Long Hu,Yixue Hao,Min Chen
关键词-EN: Large Reconstruction Model, multiview images, ill-posed task, important but ill-posed, Large Reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM’s generated mesh ensures high clarity, discarding LRM’s predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123’s SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.
zh
[CV-55] Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking
【速读】: 该论文试图解决现有3D实例分割方法中常见的过度分割问题,即由于无监督合并方法导致的冗余和不准确的3D提案,这些问题增加了下游任务的复杂性。解决方案的关键在于提出了两个模块:3D-Aware 2D Mask Tracking模块和3D Mask Optimization模块。前者利用2D掩码分割和跟踪基础模型(SAM-2)的鲁棒3D先验,确保视频帧间对象掩码的一致性;后者通过动态规划算法选择最佳视图集,优化超点以生成每个对象的最终3D提案,从而在减少不必要提案的同时实现场景内对象的全面覆盖。
链接: https://arxiv.org/abs/2411.16183
作者: Phuc Nguyen,Minh Luu,Anh Tran,Cuong Pham,Khoi Nguyen
关键词-EN: frequently encounter issues, methods frequently encounter, issues with over-segmentation, instance segmentation, Instance Segmentation tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.
zh
[CV-56] Event-boosted Deformable 3D Gaussians for Fast Dynamic Scene Reconstruction
【速读】: 该论文试图解决3D高斯喷射 (3D Gaussian Splatting, 3D-GS) 在实时渲染中因RGB相机时间分辨率低而难以处理快速运动的问题。解决方案的关键在于结合事件相机 (event cameras) 的高时间分辨率、连续运动数据与可变形3D-GS,以实现快速动态场景重建。具体策略包括:1) 提出高斯-阈值联合建模 (GS-Threshold Joint Modeling, GTJM) 策略,通过相互增强的过程显著提升3D重建和阈值建模的质量;2) 引入动态-静态分解 (Dynamic-Static Decomposition, DSD) 策略,通过识别动态区域并应用基于缓冲区的软分解,加速渲染并提高动态区域的保真度。这些方法使得在RTX 3090 GPU上以400×400分辨率实现156 FPS的高保真动态重建成为可能。
链接: https://arxiv.org/abs/2411.16180
作者: Wenhao Xu,Wenming Weng,Yueyi Zhang,Ruikang Xu,Zhiwei Xiong
关键词-EN: Gaussian Splatting, RGB cameras, enables real-time rendering, low temporal resolution, enables real-time
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3D-GS) enables real-time rendering but struggles with fast motion due to low temporal resolution of RGB cameras. To address this, we introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for fast dynamic scene reconstruction. We observe that threshold modeling for events plays a crucial role in achieving high-quality reconstruction. Therefore, we propose a GS-Threshold Joint Modeling (GTJM) strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. Moreover, we introduce a Dynamic-Static Decomposition (DSD) strategy that first identifies dynamic areas by exploiting the inability of static Gaussians to represent motions, then applies a buffer-based soft decomposition to separate dynamic and static areas. This strategy accelerates rendering by avoiding unnecessary deformation in static areas, and focuses on dynamic areas to enhance fidelity. Our approach achieves high-fidelity dynamic reconstruction at 156 FPS with a 400×400 resolution on an RTX 3090 GPU.
zh
[CV-57] SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
【速读】: 该论文试图解决长视频内容处理中的两个主要问题:一是现有大型多模态模型(Large Multi-modal Models, LMMs)在处理长且未剪辑的视频时,由于上下文长度和内存开销的限制,导致信息丢失和模型响应的相关性降低;二是随着网络平台上视频数据的指数增长,理解长视频内容对于推进通用智能至关重要。解决方案的关键在于引入了一种名为SALOVA(Segment-Augmented LOng Video Assistant)的新型视频-LLM框架,通过以下两个关键技术来增强长视频内容的理解:(i) 提出了SceneWalk数据集,这是一个包含87.8K个长视频的高质量集合,每个视频在片段级别上进行了密集标注,以帮助模型捕捉场景连续性和保持丰富的描述性上下文;(ii) 开发了结合动态路由机制和时空投影器的稳健架构设计,以根据用户查询高效地检索和处理相关视频片段。SALOVA通过精确识别和检索相关视频片段来响应查询,从而提高了生成响应的上下文相关性。
链接: https://arxiv.org/abs/2411.16173
作者: Junho Kim,Hyunjun Kim,Hosu Lee,Yong Man Ro
关键词-EN: Large Multi-modal Models, Large Multi-modal, substantial memory overhead, remains challenging due, advances in Large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
点击查看摘要
Abstract:Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through targeted retrieval process. We address two main challenges to achieve it: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating dynamic routing mechanism and spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos, showing significant capability to maintain contextual integrity across extended sequences.
zh
[CV-58] U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields ICLR
【速读】: 该论文试图解决水下图像因光线吸收、折射和散射导致的色彩偏移、低对比度和模糊问题。解决方案的关键在于提出了一种无监督的水下神经辐射场 (Unsupervised Underwater Neural Radiance Field, U2NeRF),这是一种基于transformer的架构,能够在多视角几何条件下同时学习渲染和恢复新视角。通过将恢复能力隐式地融入NeRF流程,并将其预测的颜色分解为场景辐射、直接透射图、后向散射透射图和全局背景光等多个组件,U2NeRF能够在自监督的方式下重建水下图像。此外,论文还发布了一个包含12个水下场景的UVS数据集,用于验证其方法的有效性。实验结果表明,U2NeRF在单一场景优化时,相比多个基线方法在LPIPS、UIQM和UCIQE指标上分别提升了11%、5%和4%(平均值),展示了其优越的渲染和恢复能力。
链接: https://arxiv.org/abs/2411.16172
作者: Vinayak Gupta,Manoj S,Mukund Varma T,Kaushik Mitra
关键词-EN: Unsupervised Underwater Neural, Neural Radiance Field, Underwater images suffer, Underwater Neural Radiance, low contrast
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR Tiny Papers 2024. arXiv admin note: text overlap with arXiv:2207.13298
点击查看摘要
Abstract:Underwater images suffer from colour shifts, low contrast, and haziness due to light absorption, refraction, and scattering, and restoring these images has warranted much attention. In this work, we present Unsupervised Underwater Neural Radiance Field (U2NeRF), a transformer-based architecture that learns to render and restore novel views conditioned on multi-view geometry simultaneously. Due to the absence of supervision, we attempt to implicitly bake restoring capabilities onto the NeRF pipeline and disentangle the predicted color into several components - scene radiance, direct transmission map, backscatter transmission map, and global background light, and when combined reconstruct the underwater image in a self-supervised manner. In addition, we release an Underwater View Synthesis (UVS) dataset consisting of 12 underwater scenes, containing both synthetically-generated and real-world data. Our experiments demonstrate that when optimized on a single scene, U2NeRF outperforms several baselines by as much as 11% in LPIPS, 5% in UIQM, and 4% in UCIQE (on average) and showcases improved rendering and restoration capabilities. Code will be made available upon acceptance.
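论文把预测颜色分解为场景辐射、直接透射图、后向散射透射图与全局背景光,再组合重建水下图像。下面按常见的水下成像模型写出这一组合步骤的示意(符号约定可能与论文略有出入):

```python
import numpy as np

def compose_underwater(scene_radiance, direct_T, backscatter_T, background_light):
    """
    常见水下成像模型(示意):I = J * T_d + B_inf * (1 - T_b)
    J: 场景辐射  T_d: 直接透射图  T_b: 后向散射透射图  B_inf: 全局背景光
    """
    return scene_radiance * direct_T + background_light * (1.0 - backscatter_T)

# 随机张量仅演示维度:H x W x 3 的图,透射图为单通道,背景光为每通道一个标量
J   = np.random.rand(32, 32, 3)
T_d = np.random.rand(32, 32, 1)
T_b = np.random.rand(32, 32, 1)
B   = np.random.rand(3)

I = compose_underwater(J, T_d, T_b, B)
print(I.shape)   # (32, 32, 3);自监督训练时用 I 与观测像素做重建损失
```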
zh
[CV-59] Image Generation Diversity Issues and How to Tame Them
【速读】: 该论文试图解决生成式模型(Generative Models)在多样性(diversity)方面的不足问题,特别是现有模型在生成数据时未能充分捕捉真实数据分布的多样性,且缺乏有效的评估指标。解决方案的关键在于提出了一个新的评估指标——图像检索分数(Image Retrieval Score, IRS),通过将多样性问题框架化为图像检索问题,利用合成数据作为查询来检索真实图像,从而量化生成模型输出的多样性。此外,论文还引入了多样性感知扩散模型(Diversity-Aware Diffusion Models, DiADM),通过解耦多样性与图像质量,使用多样性感知模块(diversity aware module)输入伪无条件特征(pseudo-unconditional features),在不损失图像质量的前提下提升生成模型的多样性。
链接: https://arxiv.org/abs/2411.16171
作者: Mischa Dombrowski,Weitong Zhang,Sarah Cechnicka,Hadrien Reynaud,Bernhard Kainz
关键词-EN: Generative, generative models, diversity, models, methods now produce
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 tables, 12 figures
点击查看摘要
Abstract:Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. Unlike quality issues, diversity limitations in generative models are hard to detect visually, requiring specific metrics for assessment. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model’s output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art models surpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves the diversity of unconditional diffusion models without loss of image quality. We do this by disentangling diversity from image quality by using a diversity-aware module that uses pseudo-unconditional features as input. We provide a Python package offering unified feature extraction and metric computation to further facilitate the evaluation of generative models at this https URL.
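IRS 的核心是把多样性度量转化为检索问题:用合成样本作查询,在真实样本特征库中检索最近邻,统计能被检索到的真实样本比例。下面是该思想的极简实现(特征提取器、相似度度量等均为示意,与论文实际设置可能不同):

```python
import numpy as np

def image_retrieval_score(real_feats, synth_feats):
    """
    IRS 思想示意:每个合成样本在特征空间检索最近的真实样本,
    IRS = 被检索到的不同真实样本数 / 真实样本总数。
    """
    r = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
    s = synth_feats / np.linalg.norm(synth_feats, axis=1, keepdims=True)
    nearest = np.argmax(s @ r.T, axis=1)          # 每个查询命中的真实样本下标
    return len(np.unique(nearest)) / len(real_feats)

# 演示:生成模型若塌缩到少数模式,合成特征只会命中少量真实样本,IRS 偏低
real = np.random.randn(1000, 128)
collapsed = real[np.random.randint(0, 50, size=1000)] + 0.01 * np.random.randn(1000, 128)
diverse = real + 0.01 * np.random.randn(1000, 128)
print("collapsed IRS:", image_retrieval_score(real, collapsed))
print("diverse   IRS:", image_retrieval_score(real, diverse))
```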
zh
[CV-60] CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction
【速读】: 该论文试图解决现有线性复杂度视觉Transformer在资源受限的移动设备上部署时,面临效率提升有限或精度显著下降的问题。解决方案的关键在于提出了一种新的解耦双交互线性注意力机制 (deCoupled duAl-interactive lineaR attEntion, CARE),通过不对称特征解耦策略和动态记忆单元,有效分离局部归纳偏置和长程依赖的学习过程,同时设计双交互模块促进不同层特征间的有效交互,从而在保持高精度的同时显著提升模型效率。实验结果表明,该方法在ImageNet-1K、COCO和ADE20K数据集上均表现出色,例如在ImageNet-1K上以仅0.7/1.9 GMACs的计算成本达到78.4/82.1%的top-1准确率。
链接: https://arxiv.org/abs/2411.16170
作者: Yuan Zhou,Qingshan Xu,Jiequan Cui,Junbao Zhou,Jing Zhang,Richang Hong,Hanwang Zhang
关键词-EN: linear-complexity visual Transformers, efficient linear-complexity visual, visual Transformers, design efficient linear-complexity, large efforts
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, large efforts have been made to design efficient linear-complexity visual Transformers. However, current linear attention models are generally unsuitable for deployment on resource-constrained mobile devices, as they suffer from either limited efficiency gains or significant accuracy drops. In this paper, we propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism, revealing that features’ decoupling and interaction can fully unleash the power of linear attention. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies, thereby preserving sufficient local and global information while effectively enhancing the efficiency of models. Then, a dynamic memory unit is employed to maintain critical information along the network pipeline. Moreover, we design a dual interaction module to effectively facilitate interaction between local inductive bias and long-range information as well as among features at different layers. By adopting a decoupled learning scheme and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy. Extensive experiments on ImageNet-1K, COCO, and ADE20K datasets demonstrate the effectiveness of our approach, e.g., achieving 78.4/82.1% top-1 accuracy on ImageNet-1K at the cost of only 0.7/1.9 GMACs. Codes will be released on GitHub.
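作为背景,下面给出核函数线性注意力的通用计算形式示意(注意:这只是一般的线性注意力,并非 CARE 的解耦与双交互模块实现),用于说明其线性复杂度的来源:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """
    核函数线性注意力的通用形式(示意,非 CARE 本体):
    O = φ(Q) (φ(K)^T V) / (φ(Q) φ(K)^T 1),这里取 φ = elu + 1。
    计算量随序列长度 N 线性增长,而非 softmax 注意力的 O(N^2)。
    """
    q = F.elu(q) + 1.0                                   # (B, N, d)
    k = F.elu(k) + 1.0                                   # (B, N, d)
    kv = torch.einsum("bnd,bne->bde", k, v)              # 先算 K^T V,代价 O(N d^2)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

B, N, d = 2, 196, 64
out = linear_attention(torch.randn(B, N, d), torch.randn(B, N, d), torch.randn(B, N, d))
print(out.shape)   # torch.Size([2, 196, 64])
```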
zh
[CV-61] Local and Global Feature Attention Fusion Network for Face Recognition
【速读】: 该论文试图解决低质量人脸图像识别中的问题,特别是由于部分面部区域缺失或变形导致的识别困难。解决方案的关键在于提出了一个基于特征质量的局部和全局特征注意力融合网络(Local and Global Feature Attention Fusion, LGAF)。该网络能够根据特征质量自适应地分配局部和全局特征的注意力,通过局部和全局信息的互补,提取更具判别力和高质量的人脸特征。此外,论文还引入了一个多头多尺度局部特征提取模块(Multi-Head Multi-Scale Local Feature Extraction, MHMS),以增强在高维空间中人脸特征的可分性,并有效获取多尺度的细粒度信息。实验结果表明,LGAF在多个验证集上均取得了最佳的平均性能,并在TinyFace和SCFace数据集上超越了当前最先进的方法(SoTA)。
链接: https://arxiv.org/abs/2411.16169
作者: Wang Yu,Wei Wei
关键词-EN: partial facial regions, face images remains, partial facial, remains a challenge, challenge due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recognition of low-quality face images remains a challenge due to invisible or deformation in partial facial regions. For low-quality images dominated by missing partial facial regions, local region similarity contributes more to face recognition (FR). Conversely, in cases dominated by local face deformation, excessive attention to local regions may lead to misjudgments, while global features exhibit better robustness. However, most of the existing FR methods neglect the bias in feature quality of low-quality images introduced by different factors. To address this issue, we propose a Local and Global Feature Attention Fusion (LGAF) network based on feature quality. The network adaptively allocates attention between local and global features according to feature quality and obtains more discriminative and high-quality face features through local and global information complementarity. In addition, to effectively obtain fine-grained information at various scales and increase the separability of facial features in high-dimensional space, we introduce a Multi-Head Multi-Scale Local Feature Extraction (MHMS) module. Experimental results demonstrate that the LGAF achieves the best average performance on 4 validation sets (CFP-FP, CPLFW, AgeDB, and CALFW), and the performance on TinyFace and SCFace outperforms the state-of-the-art methods (SoTA).
zh
[CV-62] xt-to-Image Synthesis: A Decade Survey
【速读】: 该论文试图解决文本到图像合成 (Text-to-Image Synthesis, T2I) 的问题,即如何从文本描述生成高质量的图像。解决方案的关键在于利用基础模型 (Foundation Models) 在生成式 AI (Generative AI) 中的重要作用。论文回顾了超过440篇相关研究,探讨了生成对抗网络 (GANs)、自回归模型 (Autoregressive Models) 和扩散模型 (Diffusion Models) 在图像生成中的应用,并重点讨论了这些模型在文本条件下的生成能力和多样性。此外,论文还探讨了T2I在性能、可控性、个性化生成、安全性和内容及空间关系一致性等方面的前沿研究,并总结了常用的数据集和评估指标。最终,论文讨论了T2I在人工智能生成内容 (AIGC) 中的潜在应用及其面临的挑战和未来研究方向。
链接: https://arxiv.org/abs/2411.16164
作者: Nonghai Zhang,Hao Tang
关键词-EN: Artificial Intelligence Generated, Intelligence Generated Content, Artificial Intelligence, humans read, read a specific
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: In this survey, we review over 440 recent works on T2I
点击查看摘要
Abstract:When humans read a specific text, they often visualize the corresponding images, and we hope that computers can do the same. Text-to-image synthesis (T2I), which focuses on generating high-quality images from textual descriptions, has become a significant aspect of Artificial Intelligence Generated Content (AIGC) and a transformative direction in artificial intelligence research. Foundation models play a crucial role in T2I. In this survey, we review over 440 recent works on T2I. We start by briefly introducing how GANs, autoregressive models, and diffusion models have been used for image generation. Building on this foundation, we discuss the development of these models for T2I, focusing on their generative capabilities and diversity when conditioned on text. We also explore cutting-edge research on various aspects of T2I, including performance, controllability, personalized generation, safety concerns, and consistency in content and spatial relationships. Furthermore, we summarize the datasets and evaluation metrics commonly used in T2I research. Finally, we discuss the potential applications of T2I within AIGC, along with the challenges and future research opportunities in this field.
zh
[CV-63] Sparse patches adversarial attacks via extrapolating point-wise information NEURIPS24
【速读】: 该论文试图解决稀疏对抗攻击(Sparse Adversarial Attacks)和补丁对抗攻击(Patch Adversarial Attacks)中无法同时优化多个补丁位置和扰动的问题。解决方案的关键在于提出了一种通过逐点修剪密集对抗扰动(Dense Adversarial Perturbations)来生成稀疏补丁对抗攻击的新方法。该方法能够同时优化任意数量和形状的稀疏补丁的位置和扰动,并且在标准稀疏对抗攻击中也显著提升了现有技术的性能。
链接: https://arxiv.org/abs/2411.16162
作者: Yaniv Nemcovsky,Avi Mendelson,Chaim Baskin
关键词-EN: adversarial attacks, patch adversarial attacks, adversarial, autonomous systems, patch adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: AdvML-Frontiers 24: The 3nd Workshop on New Frontiers in Adversarial Machine Learning, NeurIPS 24
点击查看摘要
Abstract:Sparse and patch adversarial attacks were previously shown to be applicable in realistic settings and are considered a security risk to autonomous systems. Sparse adversarial perturbations constitute a setting in which the adversarial perturbations are limited to affecting a relatively small number of points in the input. Patch adversarial attacks denote the setting where the sparse attacks are limited to a given structure, i.e., sparse patches with a given shape and number. However, previous patch adversarial attacks do not simultaneously optimize multiple patches’ locations and perturbations. This work suggests a novel approach for sparse patches adversarial attacks via point-wise trimming dense adversarial perturbations. Our approach enables simultaneous optimization of multiple sparse patches’ locations and perturbations for any given number and shape. Moreover, our approach is also applicable for standard sparse adversarial attacks, where we show that it significantly improves the state-of-the-art over multiple extensive settings. A reference implementation of the proposed method and the reported experiments is provided at \urlthis https URL
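论文的出发点是“先优化稠密对抗扰动,再逐点裁剪为稀疏/patch 形式”。下面给出逐点裁剪这一步的最简示意(不含多 patch 位置与扰动的联合优化,稀疏预算 k 为假设值):

```python
import torch

def trim_to_sparse(delta, k):
    """
    把稠密扰动 delta (C, H, W) 逐点裁剪为只保留 k 个像素位置的稀疏扰动(示意)。
    按每个像素位置跨通道的扰动能量排序,保留能量最大的前 k 个位置。
    """
    c, h, w = delta.shape
    energy = delta.pow(2).sum(dim=0).flatten()
    keep = torch.zeros(h * w, dtype=torch.bool)
    keep[energy.topk(k).indices] = True
    return delta * keep.view(1, h, w)

dense_delta = torch.randn(3, 32, 32) * 0.05       # 假设由 PGD 等得到的稠密扰动
sparse_delta = trim_to_sparse(dense_delta, k=50)
print(int((sparse_delta.abs().sum(dim=0) > 0).sum()))   # 非零像素位置数 == 50
```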
zh
[CV-64] MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model
【速读】: 该论文试图解决多视角新视图合成 (Novel View Synthesis, NVS) 任务中的泛化性和3D一致性问题。解决方案的关键在于引入了一个多视角扩散模型 (MVGenMaster),并通过结合3D先验信息(使用度量深度和相机姿态进行扭曲)来显著增强模型的泛化能力和3D一致性。该模型能够在一个前向过程中生成多达100个新视图,且支持任意参考视图和相机姿态。此外,论文还开发了一个包含多达120万场景的大规模多视角图像数据集,并提出了针对大规模数据集的训练和模型改进方法,以进一步提升模型的性能。
链接: https://arxiv.org/abs/2411.16157
作者: Chenjie Cao,Chaohui Yu,Shang Liu,Fan Wang,Xiangyang Xue,Yanwei Fu
关键词-EN: View Synthesis, diffusion model enhanced, address versatile, introduce MVGenMaster, NVS
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Models and codes will be released at this https URL
点击查看摘要
Abstract:We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on arbitrary reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset comprising up to 1.2 million scenes, equipped with well-aligned metric depth. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released at this https URL.
zh
[CV-65] VideoOrion: Tokenizing Object Dynamics in Videos
【速读】: 该论文试图解决视频大语言模型 (Video-LLM) 中高效压缩高维视频数据并提取关键语义信息的问题。解决方案的关键在于引入 VideoOrion,这是一个专门设计的视频大语言模型,通过检测-分割-跟踪 (detect-segment-track) 流水线,利用专家视觉模型提取视频中的对象动态,并将这些动态编码为一组对象标记 (object tokens)。这种方法不仅提供了一种更自然和高效的方式来生成紧凑且解耦的语义表示,还能够在最小计算成本下显式地建模视频内容中的对象。此外,对象标记的引入使得 VideoOrion 能够自然地处理基于视频的指代任务。实验结果表明,VideoOrion 能够有效利用这些对象标记,并在一般视频问答和基于视频的指代基准测试中取得竞争性结果。
链接: https://arxiv.org/abs/2411.16156
作者: Yicheng Feng,Yijiang Li,Wanpeng Zhang,Sipeng Zheng,Zongqing Lu
关键词-EN: Large Language Model, Video Large Language, Large Language, Language Model, Video Large
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos–the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
zh
[CV-66] Revisiting Marr in Face: The Building of 2D–2.5D–3D Representations in Deep Neural Networks
【速读】: 该论文试图解决的问题是深度神经网络(DNN)在视觉感知任务中是否遵循David Marr的视觉理论,即从2D草图到2.5D草图再到3D模型的逐步构建过程。解决方案的关键在于引入了一个图形探针(graphics probe),这是一个专门设计的子网络,用于从DNN的中间层重建原始图像。图形探针的关键特性是其灵活的架构,能够支持2D和3D格式的图像重建,以及介于两者之间的过渡状态。通过在神经网络中注入图形探针并分析其在图像重建中的行为,研究发现DNN在低层首先编码为2D表示,在高层最终构建3D表示,而在中层则表现出一种混合状态,类似于2.5D表示,即在狭窄深度范围内构建几何表示,类似于低浮雕雕塑的外观。这一发现为Marr的理论提供了实证支持,并揭示了DNN在视觉感知过程中从2D到3D的演变机制。
链接: https://arxiv.org/abs/2411.16148
作者: Xiangyu Zhu,Chang Yu,Jiankuo Zhao,Zhaoxiang Zhang,Stan Z. Li,Zhen Lei
关键词-EN: David Marr seminal, visual system operates, human visual system, David Marr, Marr seminal theory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:David Marr’s seminal theory of vision proposes that the human visual system operates through a sequence of three stages, known as the 2D sketch, the 2.5D sketch, and the 3D model. In recent years, Deep Neural Networks (DNN) have been widely thought to have reached a level comparable to human vision. However, the mechanisms by which DNNs accomplish this and whether they adhere to Marr’s 2D–2.5D–3D construction theory remain unexplored. In this paper, we delve into the perception task to explore these questions and find evidence supporting Marr’s theory. We introduce a graphics probe, a sub-network crafted to reconstruct the original image from the network’s intermediate layers. The key to the graphics probe is its flexible architecture that supports images in both 2D and 3D formats, as well as in a transitional state between them. By injecting graphics probes into neural networks, and analyzing their behavior in reconstructing images, we find that DNNs initially encode images as 2D representations in low-level layers, and finally construct 3D representations in high-level layers. Intriguingly, in mid-level layers, DNNs exhibit a hybrid state, building a geometric representation that captures surface normals within a narrow depth range, akin to the appearance of a low-relief sculpture. This stage resembles the 2.5D representation, providing a view of how DNNs evolve from 2D to 3D in the perception process. The graphics probe therefore serves as a tool for peering into the mechanisms of DNNs, providing empirical support for Marr’s theory.
zh
[CV-67] reeFormer: Single-view Plant Skeleton Estimation via Tree-constrained Graph Generation WACV2025
【速读】: 该论文试图解决从图像中准确估计植物骨架结构(如分支结构)的问题,这在智能农业和植物科学中至关重要。与人类骨骼具有固定拓扑结构不同,植物骨架估计的挑战在于从图像中推断出任意树形图。尽管最近的图生成方法能够成功地从图像中推断出细小结构,但严格地将输出图约束为树形结构仍然具有挑战性。为此,论文提出了TreeFormer,一种通过树形约束图生成来估计植物骨架的方法。其关键在于结合基于学习的图生成与传统图算法,在训练过程中施加约束。具体而言,该方法在训练过程中将无约束图投影到最小生成树(Minimum Spanning Tree, MST)上,并通过抑制不需要的特征值将这种先验知识融入梯度下降优化中。实验表明,该方法能够准确估计多个领域的目标植物骨架结构,包括合成树模式、真实植物根系和葡萄藤分支。
链接: https://arxiv.org/abs/2411.16132
作者: Xinpeng Liu,Hiroaki Santo,Yosuke Toda,Fumio Okura
关键词-EN: Accurate estimation, essential for smart, smart agriculture, Accurate, graph
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)
点击查看摘要
Abstract:Accurate estimation of plant skeletal structure (e.g., branching structure) from images is essential for smart agriculture and plant science. Unlike human skeletons with fixed topology, plant skeleton estimation presents a unique challenge, i.e., estimating arbitrary tree graphs from images. While recent graph generation methods successfully infer thin structures from images, it is challenging to constrain the output graph strictly to a tree structure. To address this problem, we present TreeFormer, a plant skeleton estimator via tree-constrained graph generation. Our approach combines learning-based graph generation with traditional graph algorithms to impose the constraints during the training loop. Specifically, our method projects an unconstrained graph onto a minimum spanning tree (MST) during the training loop and incorporates this prior knowledge into the gradient descent optimization by suppressing unwanted feature values. Experiments show that our method accurately estimates target plant skeletal structures for multiple domains: synthetic tree patterns, real botanical roots, and grapevine branches. Our implementations are available at this https URL.
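论文在训练循环中把无约束图投影到最小生成树(MST)上。下面用 scipy 对一张对称的边得分矩阵做 MST 投影示意;论文中的边得分来自网络预测,这里用随机矩阵代替:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def project_to_tree(edge_scores):
    """
    把预测的对称边得分(越大越像树边)投影为一棵生成树(示意)。
    scipy 求的是最小生成树,因此把得分取反作为代价;加上小偏移避免零代价边被当作"无边"。
    返回 0/1 邻接矩阵,可在训练循环中用来抑制非树边对应的特征/梯度。
    """
    cost = edge_scores.max() - edge_scores + 1e-6
    np.fill_diagonal(cost, 0.0)
    mst = minimum_spanning_tree(cost).toarray()
    tree = (mst > 0) | (mst.T > 0)               # 对称化为无向树
    return tree.astype(np.float32)

scores = np.random.rand(6, 6)
scores = (scores + scores.T) / 2.0               # 假设的对称边得分
tree_adj = project_to_tree(scores)
print(tree_adj.sum() / 2)                        # 5.0,即 n-1 条树边
```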
zh
[CV-68] Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion
【速读】: 该论文试图解决基于相机的语义场景补全 (Semantic Scene Completion, SSC) 在3D感知领域中,由于透视和遮挡导致的远距离区域几何信息低估的问题。解决方案的关键在于提出了ScanSSC模型,该模型包含Scan模块和Scan损失函数,旨在通过利用近视角场景的上下文信息来增强远距离场景的感知。Scan模块采用轴向掩码注意力机制,通过近到远的级联掩码使远距离体素能够捕捉与先前体素的关系。Scan损失函数则沿每个轴计算累积对数与相应类别分布之间的交叉熵,从而将近视角的丰富上下文信号传播到远距离体素。这种协同作用使得ScanSSC在SemanticKITTI和SSCBench-KITTI-360基准测试中达到了最先进的性能,IoU分别为44.54和48.29,mIoU分别为17.40和20.14。
链接: https://arxiv.org/abs/2411.16129
作者: Jongseong Bae,Junwoo Ha,Ha Young Kim
关键词-EN: Semantic Scene Completion, Camera-based Semantic Scene, Scene Completion, Camera-based Semantic, Semantic Scene
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Camera-based Semantic Scene Completion (SSC) is gaining attention in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of the geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and Scan Loss, both designed to enhance distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, where each axis employs a near-to-far cascade masking that enables distant voxels to capture relationships with preceding voxels. In addition, the Scan Loss computes the cross-entropy along each axis between cumulative logits and corresponding class distributions in a near-to-far direction, thereby propagating rich context-aware signals to distant voxels. Leveraging the synergy between these components, ScanSSC achieves state-of-the-art performance, with IoUs of 44.54 and 48.29, and mIoUs of 17.40 and 20.14 on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.
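"沿轴累积 logits 并与类别分布做交叉熵"这一思想可以用下面的 PyTorch 片段示意。这只是笔者对单一轴情形的简化理解(函数签名与归一化方式均为假设),并非论文的官方 Scan Loss 实现:

```python
import torch
import torch.nn.functional as F

def scan_loss_1d(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """沿单个轴的 Scan Loss 示意。
    logits: (D, C)  沿该轴由近到远排列的 D 个体素、C 个类别的 logits
    target: (D,)    对应的类别标签
    """
    D, C = logits.shape
    cum_logits = torch.cumsum(logits, dim=0)                  # 由近到远累积 logits
    onehot = F.one_hot(target, C).float()                     # (D, C)
    cum_dist = torch.cumsum(onehot, dim=0)                    # 累积的类别计数
    cum_dist = cum_dist / cum_dist.sum(dim=1, keepdim=True)   # 归一化为类别分布
    log_prob = F.log_softmax(cum_logits, dim=1)
    # 近处体素的监督信号随累积传递到远处体素
    return -(cum_dist * log_prob).sum(dim=1).mean()

loss = scan_loss_1d(torch.randn(8, 5), torch.randint(0, 5, (8,)))
print(loss)
```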
zh
[CV-69] CIA: Controllable Image Augmentation Framework Based on Stable Diffusion
【速读】: 该论文试图解决在计算机视觉任务中,如目标检测和分割,由于数据集标注不足或质量不高而导致的性能瓶颈问题。解决方案的关键在于提出了一个名为CIA的模块化流水线,该流水线包括三个主要步骤:(1) 使用Stable Diffusion生成合成图像以增强数据集;(2) 通过定义的质量指标过滤掉低质量样本;(3) 通过精确的提示和ControlNet确保生成图像中存在特定模式。通过在COCO和Flickr30k数据集上使用YOLOv8n进行实验,研究结果表明,CIA生成的图像显著提升了目标检测性能,接近于将真实图像数量翻倍的效果。这一发现表明,CIA框架能够显著增强目标检测系统,并为未来在数据受限场景下的研究提供了可能性。
链接: https://arxiv.org/abs/2411.16128
作者: Mohamed Benkedadra,Dany Rimez,Tiffanie Godelaine,Natarajan Chidambaram,Hamed Razavi Khosroshahi,Horacio Tellez,Matei Mancas,Benoit Macq,Sidi Ahmed Mahmoudi
关键词-EN: Computer vision tasks, Computer vision, accurately annotated datasets, availability of extensive, accurately annotated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Computer vision tasks such as object detection and segmentation rely on the availability of extensive, accurately annotated datasets. In this work, we present CIA, a modular pipeline for (1) generating synthetic images for dataset augmentation using Stable Diffusion, (2) filtering out low-quality samples using defined quality metrics, (3) forcing the existence of specific patterns in generated images using accurate prompting and ControlNet. In order to show how CIA can be used to search for an optimal augmentation pipeline of training data, we study human object detection in a data-constrained scenario, using YOLOv8n on COCO and Flickr30k datasets. We have recorded significant improvement using CIA-generated images, approaching the performances obtained when doubling the amount of real images in the dataset. Our findings suggest that our modular framework can significantly enhance object detection systems, and make it possible for future research to be done on data-constrained scenarios. The framework is available at: this http URL.
zh
[CV-70] Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain
【速读】: 该论文试图解决在医学领域中使用Segment Anything Model (SAM)进行“一次性”学习时,由于视觉提示生成依赖于像素相似性而导致的提示生成不准确和点提示聚类问题。解决方案的关键在于引入了一种名为Med-PerSAM的新型一次性框架,该框架通过视觉提示工程和轻量级基于扭曲的提示调优模型,实现了自动化的提示生成和迭代优化,从而在不依赖额外训练或人工干预的情况下,提升了预训练SAM在医学影像数据集上的性能。
链接: https://arxiv.org/abs/2411.16123
作者: Hangyul Yoon,Doohyuk Jang,Jungeun Kim,Eunho Yang
关键词-EN: proven highly effective, NLP tasks, effective in NLP, Leveraging pre-trained models, Leveraging pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Leveraging pre-trained models with tailored prompts for in-context learning has proven highly effective in NLP tasks. Building on this success, recent studies have applied a similar approach to the Segment Anything Model (SAM) within a “one-shot” framework, where only a single reference image and its label are employed. However, these methods face limitations in the medical domain, primarily due to SAM’s essential requirement for visual prompts and the over-reliance on pixel similarity for generating them. This dependency may lead to (1) inaccurate prompt generation and (2) clustering of point prompts, resulting in suboptimal outcomes. To address these challenges, we introduce Med-PerSAM, a novel and straightforward one-shot framework designed for the medical domain. Med-PerSAM uses only visual prompt engineering and eliminates the need for additional training of the pretrained SAM or human intervention, owing to our novel automated prompt generation process. By integrating our lightweight warping-based prompt tuning model with SAM, we enable the extraction and iterative refinement of visual prompts, enhancing the performance of the pre-trained SAM. This advancement is particularly meaningful in the medical domain, where creating visual prompts poses notable challenges for individuals lacking medical expertise. Our model outperforms various foundational models and previous SAM-based approaches across diverse 2D medical imaging datasets.
zh
[CV-71] FUN-AD: Fully Unsupervised Learning for Anomaly Detection with Noisy Training Data WACV2025
【速读】: 该论文试图解决在实际工业环境中,由于标注错误或新/翻新产品缺乏标签导致的训练数据噪声问题,特别是在无监督异常检测场景下。解决方案的关键在于提出了一种基于学习的方法,利用未标记且可能受污染的训练数据进行全无监督异常检测。具体来说,该方法基于两个观察:1) 正常样本之间的成对特征距离平均上可能小于异常样本或异质样本之间的距离;2) 相互最接近的特征对很可能是同质对,前提是正常数据的方差小于异常数据。基于第一个观察,论文提出了使用迭代重建的记忆库(IRMB)进行伪标签策略;基于第二个观察,引入了一种新的损失函数,以促进相互最接近的特征对之间的类同质性,从而减轻任务的病态性。实验结果表明,该方法在不同场景和异常与正常样本比例下均有效。
链接: https://arxiv.org/abs/2411.16110
作者: Jiin Im,Yongho Son,Je Hyeong Hong
关键词-EN: incur noisy training, noisy training data, training data due, practical industrial environments, one-class classification
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2025. Supplementary material included after references. 17 pages, 7 figures, 14 tables
点击查看摘要
Abstract:While the mainstream research in anomaly detection has mainly followed the one-class classification, practical industrial environments often incur noisy training data due to annotation errors or lack of labels for new or refurbished products. To address these issues, we propose a novel learning-based approach for fully unsupervised anomaly detection with unlabeled and potentially contaminated training data. Our method is motivated by two observations, that i) the pairwise feature distances between the normal samples are on average likely to be smaller than those between the anomaly samples or heterogeneous samples and ii) pairs of features mutually closest to each other are likely to be homogeneous pairs, which hold if the normal data has smaller variance than the anomaly data. Building on the first observation that nearest-neighbor distances can distinguish between confident normal samples and anomalies, we propose a pseudo-labeling strategy using an iteratively reconstructed memory bank (IRMB). The second observation is utilized as a new loss function to promote class-homogeneity between mutually closest pairs thereby reducing the ill-posedness of the task. Experimental results on two public industrial anomaly benchmarks and semantic anomaly examples validate the effectiveness of FUN-AD across different scenarios and anomaly-to-normal ratios. Our code is available at this https URL.
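论文中"用最近邻距离区分可信正常样本与异常、并把互为最近邻的样本对视为同质对"的两条观察,可以用下面的 PyTorch 片段粗略示意。函数名与阈值策略均为笔者假设,仅展示基本思路,并非 FUN-AD/IRMB 的官方实现:

```python
import torch

def mutual_nn_pairs(feats: torch.Tensor):
    """feats: (N, d) 未标注(可能被污染)样本的特征。
    返回: nn_dist (N,) 每个样本到最近邻的距离, 距离小者更可能是正常样本;
          pairs   互为最近邻的样本对索引, 可用于类同质性约束。"""
    dist = torch.cdist(feats, feats)               # (N, N) 两两欧氏距离
    dist.fill_diagonal_(float("inf"))              # 排除自身
    nn_dist, nn_idx = dist.min(dim=1)
    pairs = []
    for i, j in enumerate(nn_idx.tolist()):
        if nn_idx[j].item() == i and i < j:        # i 与 j 互为最近邻
            pairs.append((i, j))
    return nn_dist, pairs

feats = torch.randn(16, 32)
nn_dist, pairs = mutual_nn_pairs(feats)
print(nn_dist.shape, pairs[:3])
```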
zh
[CV-72] UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image
【速读】: 该论文试图解决在仅有一张未标注的RGB-D参考图像的情况下,对未见过的物体进行姿态估计的问题。解决方案的关键在于提出了一种名为UNOPose的新方法,该方法通过构建一个SE(3)不变的参考框架来标准化物体表示,从而克服了姿态和尺寸变化带来的挑战。此外,UNOPose通过重新校准每个对应点的权重,基于其预测的重叠区域内的可能性,来缓解不同视角之间重叠区域较小的问题。这种方法在仅有一张参考图像的设置下,显著优于传统的和基于学习的方法,并且在性能上与基于CAD模型的方法相当。
链接: https://arxiv.org/abs/2411.16106
作者: Xingyu Liu,Gu Wang,Ruida Zhang,Chenyangguang Zhang,Federico Tombari,Xiangyang Ji
关键词-EN: onboarding stage costly, multiple reference views, rely on CAD, CAD models, Unseen object pose
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures
点击查看摘要
Abstract:Unseen object pose estimation methods often rely on CAD models or multiple reference views, making the onboarding stage costly. To simplify reference acquisition, we aim to estimate the unseen object’s pose through a single unposed RGB-D reference image. While previous works leverage reference images as pose anchors to limit the range of relative pose, our scenario presents significant challenges since the relative transformation could vary across the entire SE(3) space. Moreover, factors like occlusion, sensor noise, and extreme geometry could result in low viewpoint overlap. To address these challenges, we present a novel approach and benchmark, termed UNOPose, for unseen one-reference-based object pose estimation. Building upon a coarse-to-fine paradigm, UNOPose constructs an SE(3)-invariant reference frame to standardize object representation despite pose and size variations. To alleviate small overlap across viewpoints, we recalibrate the weight of each correspondence based on its predicted likelihood of being within the overlapping region. Evaluated on our proposed benchmark based on the BOP Challenge, UNOPose demonstrates superior performance, significantly outperforming traditional and learning-based methods in the one-reference setting and remaining competitive with CAD-model-based methods. The code and dataset will be available.
zh
[CV-73] ENCLIP: Ensembling and Clustering-Based Contrastive Language-Image Pretraining for Fashion Multimodal Search with Limited Data and Low-Quality Images
【速读】: 该论文试图解决在时尚智能领域中,由于数据稀缺和图像质量低下的问题,导致对比语言-图像预训练模型(Contrastive Language-Image Pretraining, CLIP)在多模态搜索中的性能受限的问题。解决方案的关键在于提出了一种名为ENCLIP的创新方法,该方法通过训练和集成多个CLIP模型的实例,并利用聚类技术将相似图像分组,从而增强CLIP模型在时尚智能领域的应用效果。这种方法有效应对了数据不足和图像质量差的问题,显著提升了CLIP模型在多模态搜索中的表现。
链接: https://arxiv.org/abs/2411.16096
作者: Prithviraj Purushottam Naik,Rohit Agarwal
关键词-EN: Multimodal search, explore fashion items, Multimodal Search targeted, providing a seamless, seamless and intuitive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Multimodal search has revolutionized the fashion industry, providing a seamless and intuitive way for users to discover and explore fashion items. Based on their preferences, style, or specific attributes, users can search for products by combining text and image information. Text-to-image searches enable users to find visually similar items or describe products using natural language. This paper presents an innovative approach called ENCLIP, for enhancing the performance of the Contrastive Language-Image Pretraining (CLIP) model, specifically in Multimodal Search targeted towards the domain of fashion intelligence. This method focuses on addressing the challenges posed by limited data availability and low-quality images. This paper proposes an algorithm that involves training and ensembling multiple instances of the CLIP model, and leveraging clustering techniques to group similar images together. The experimental findings presented in this study provide evidence of the effectiveness of the methodology. This approach unlocks the potential of CLIP in the domain of fashion intelligence, where data scarcity and image quality issues are prevalent. Overall, the ENCLIP method represents a valuable contribution to the field of fashion intelligence and provides a practical solution for optimizing the CLIP model in scenarios with limited data and low-quality images.
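"集成多个 CLIP 实例的嵌入并对相似图像聚类"这一思路可用如下示意代码表达。这里用随机向量代替真实的 CLIP 图像嵌入,模型数量、归一化与聚类参数均为笔者假设,并非 ENCLIP 的官方流程:

```python
import numpy as np
from sklearn.cluster import KMeans

def ensemble_and_cluster(embeddings_per_model, n_clusters=8):
    """embeddings_per_model: 长度为 M 的列表, 每项为 (N, d) 的图像嵌入,
    对应 M 个独立训练/微调的 CLIP 实例。"""
    # 逐模型 L2 归一化后取平均, 得到集成嵌入
    normed = [e / np.linalg.norm(e, axis=1, keepdims=True) for e in embeddings_per_model]
    ensemble = np.mean(normed, axis=0)
    # 将相似图像聚为一簇, 检索时可先定位簇再在簇内排序
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(ensemble)
    return ensemble, labels

embs = [np.random.randn(100, 512) for _ in range(3)]   # 假设 3 个 CLIP 实例、100 张图
ensemble, labels = ensemble_and_cluster(embs)
print(ensemble.shape, np.bincount(labels))
```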
zh
[CV-74] Very Basics of Tensors with Graphical Notations: Unfolding Calculations and Decompositions
【速读】: 该论文旨在解决读者在阅读使用张量(tensor)的文献时,由于缺乏对张量及其操作的详细定义和解释而感到困惑的问题。解决方案的关键在于通过张量网络图(Tensor network diagram)这一图形表示法,直观地展示张量之间的复杂乘法操作,包括内积(inner product)、外积(outer product)、哈达玛积(Hadamard product)、克罗内克积(Kronecker product)和Khatri-Rao积(Khatri-Rao product)等。通过这种图形表示法,读者可以更清晰地理解张量乘法的本质,从而更好地掌握张量在信号处理和机器学习中的应用。
链接: https://arxiv.org/abs/2411.16094
作者: Tatsuya Yokota
关键词-EN: graphical notation, Tensor network diagram, network diagram, nodes and edges, graphically represents multiplications
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Tensor network diagram (graphical notation) is a useful tool that graphically represents multiplications between multiple tensors using nodes and edges. Using the graphical notation, complex multiplications between tensors can be described simply and intuitively, and it also helps to understand the essence of tensor products. In fact, most of matrix/tensor products including inner product, outer product, Hadamard product, Kronecker product, and Khatri-Rao product can be written in graphical notation. These matrix/tensor operations are essential building blocks for the use of matrix/tensor decompositions in signal processing and machine learning. The purpose of this lecture note is to learn the very basics of tensors and how to represent them in mathematical symbols and graphical notation. Many papers using tensors omit these detailed definitions and explanations, which can be difficult for the reader. I hope this note will be of help to such readers.
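讲义中提到的几种矩阵/张量乘积,可以用 NumPy 直接验证其形状与定义,便于与图形表示对照。以下示例中的矩阵尺寸为任意选取:

```python
import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(3, 4)
C = np.random.rand(5, 4)

inner = np.sum(A * B)                      # 内积: 对应元素相乘后求和 (标量)
hadamard = A * B                           # Hadamard 积: 逐元素相乘, 形状不变 (3, 4)
outer = np.einsum("ij,kl->ijkl", A, B)     # 外积: 得到 4 阶张量 (3, 4, 3, 4)
kron = np.kron(A, C)                       # Kronecker 积: (3*5, 4*4) = (15, 16)
# Khatri-Rao 积: 按列做 Kronecker 积, 要求两矩阵列数相同 -> (3*5, 4)
khatri_rao = np.einsum("ir,jr->ijr", A, C).reshape(-1, A.shape[1])

for name, x in [("inner", inner), ("hadamard", hadamard), ("outer", outer),
                ("kron", kron), ("khatri_rao", khatri_rao)]:
    print(name, np.shape(x))
```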
zh
[CV-75] AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity
【速读】: 该论文试图解决AI生成图像(AIGIs)在感知质量和文本-图像对齐质量评估中存在的问题。现有评估方法过于依赖初始提示(initial prompts),并使用相同的提示来指导感知和对齐质量的评估,忽略了这两项任务之间的区别。论文提出的解决方案之关键是TSP-MGS方法,该方法设计了任务特定的提示(task-specific prompts),并测量AIGIs与提示之间的多粒度相似性(multi-granularity similarity)。具体来说,TSP-MGS首先构建描述感知和对齐质量程度的任务特定提示,并引入初始提示以进行详细的质量感知。然后,计算AIGIs与任务特定提示之间的粗粒度相似性,以促进整体质量意识;同时,测量图像与初始提示之间的细粒度相似性,以增强对AIGI细节的理解。最终,通过整合多粒度相似性来实现精确的质量预测。
链接: https://arxiv.org/abs/2411.16087
作者: Jili Xia,Lihuo He,Fei Gao,Kaifan Zhang,Leida Li,Xinbo Gao
关键词-EN: garnered widespread attention, quality, prompts, widespread attention, garnered widespread
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, AI-generated images (AIGIs) created by given prompts (initial prompts) have garnered widespread attention. Nevertheless, due to technical nonproficiency, they often suffer from poor perception quality and Text-to-Image misalignment. Therefore, assessing the perception quality and alignment quality of AIGIs is crucial to improving the generative model’s performance. Existing assessment methods overly rely on the initial prompts in the task prompt design and use the same prompts to guide both perceptual and alignment quality evaluation, overlooking the distinctions between the two tasks. To address this limitation, we propose a novel quality assessment method for AIGIs named TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, task-specific prompts are first constructed to describe perception and alignment quality degrees separately, and the initial prompt is introduced for detailed quality perception. Then, the coarse-grained similarity between AIGIs and task-specific prompts is calculated, which facilitates holistic quality awareness. In addition, to improve the understanding of AIGI details, the fine-grained similarity between the image and the initial prompt is measured. Finally, precise quality prediction is acquired by integrating the multi-granularity similarities. Experiments on the commonly used AGIQA-1K and AGIQA-3K benchmarks demonstrate the superiority of the proposed TSP-MGS.
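"粗粒度相似度(图像与质量等级提示)+ 细粒度相似度(图像与初始提示)再加权整合"的计算流程,可用下面的示意片段表达。此处用随机向量代替 CLIP 嵌入,等级取值与权重 alpha 均为笔者假设,并非 TSP-MGS 的官方公式:

```python
import torch
import torch.nn.functional as F

def quality_score(img_emb, level_prompt_embs, init_prompt_emb, alpha=0.7):
    """img_emb: (d,) 图像嵌入; level_prompt_embs: (K, d) K 个质量等级的任务特定提示嵌入;
    init_prompt_emb: (d,) 初始提示嵌入。"""
    img = F.normalize(img_emb, dim=-1)
    levels = F.normalize(level_prompt_embs, dim=-1)
    init = F.normalize(init_prompt_emb, dim=-1)
    # 粗粒度: 图像与各质量等级提示的相似度, 按等级取值加权得到整体质量感知
    probs = torch.softmax(levels @ img, dim=0)                 # (K,)
    level_values = torch.linspace(0, 1, steps=len(probs))      # 等级由差到好映射到 [0, 1]
    coarse = (probs * level_values).sum()
    # 细粒度: 图像与初始提示的相似度, 反映文图对齐细节
    fine = (init @ img + 1) / 2
    return alpha * coarse + (1 - alpha) * fine

score = quality_score(torch.randn(512), torch.randn(5, 512), torch.randn(512))
print(float(score))
```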
zh
[CV-76] Leverage Task Context for Object Affordance Ranking
【速读】: 该论文试图解决智能代理在复杂环境中根据任务上下文选择合适对象的问题。当前研究将同一功能类别的对象视为等价,忽略了不同任务上下文中对象功能优先级的差异,导致决策不准确。解决方案的关键在于提出了一种基于任务上下文的对象功能排序方法,即通过任务关系挖掘模块和图组更新模块,深入整合任务上下文并进行全局相对关系传递,从而揭示任务与对象之间的关系并明确检测对象的优先级。该方法的核心是利用任务上下文进行对象功能排序,并通过构建大规模任务导向的功能排序数据集来验证其可行性和优越性。
链接: https://arxiv.org/abs/2411.16082
作者: Haojie Huang,Hongchen Luo,Wei Zhai,Yang Cao,Zheng-Jun Zha
关键词-EN: Intelligent agents accomplish, task context, Intelligent agents, task, affordance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Intelligent agents accomplish different tasks by utilizing various objects based on their affordance, but how to select appropriate objects according to task context is not well-explored. Current studies treat objects within the affordance category as equivalent, ignoring that object affordances vary in priority with different task contexts, hindering accurate decision-making in complex environments. To enable agents to develop a deeper understanding of the objects required to perform tasks, we propose to leverage task context for object affordance ranking, i.e., given image of a complex scene and the textual description of the affordance and task context, revealing task-object relationships and clarifying the priority rank of detected objects. To this end, we propose a novel Context-embed Group Ranking Framework with task relation mining module and graph group update module to deeply integrate task context and perform global relative relationship transmission. Due to the lack of such data, we construct the first large-scale task-oriented affordance ranking dataset with 25 common tasks, over 50k images and more than 661k objects. Experimental results demonstrate the feasibility of the task context based affordance learning paradigm and the superiority of our model over state-of-the-art models in the fields of saliency ranking and multimodal object detection. The source code and dataset will be made available to the public.
zh
[CV-77] Boosting 3D Object Generation through PBR Materials SIGGRAPH
【速读】: 该论文试图解决现有生成式 3D 内容创建方法在生成高质量、逼真 3D 物体时面临的挑战,特别是材质(materials)与纹理(textures)之间的不一致性问题,以及几何(geometry)与高频纹理细节(high-frequency texture details)之间的严重错位。解决方案的关键在于引入基于物理的渲染(Physics-Based Rendering, PBR)材质分析,并结合扩散模型(diffusion models)和多模态模型(multimodal models),通过精细调整的 Stable Diffusion 模型提取 3D 一致的反照率(albedo)和凹凸贴图(bump maps),同时采用半自动流程生成粗糙度(roughness)和金属度(metalness)贴图,以实现更自然的光照效果和显著提升的几何精度。
链接: https://arxiv.org/abs/2411.16080
作者: Yitong Wang,Xudong Xu,Li Ma,Haoran Wang,Bo Dai
关键词-EN: increasing attention recently, gained increasing attention, film industry, content creation, attention recently
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Accepted to SIGGRAPH Asia 2024 Conference Papers
点击查看摘要
Abstract:Automatic 3D content creation has gained increasing attention recently, due to its potential in various applications such as video games, film industry, and AR/VR. Recent advancements in diffusion models and multimodal models have notably improved the quality and efficiency of 3D object generation given a single RGB image. However, 3D objects generated even by state-of-the-art methods are still unsatisfactory compared to human-created assets. Considering only textures instead of materials makes these methods encounter challenges in photo-realistic rendering, relighting, and flexible appearance editing. And they also suffer from severe misalignment between geometry and high-frequency texture details. In this work, we propose a novel approach to boost the quality of generated 3D objects from the perspective of Physics-Based Rendering (PBR) materials. By analyzing the components of PBR materials, we choose to consider albedo, roughness, metalness, and bump maps. For albedo and bump maps, we leverage Stable Diffusion fine-tuned on synthetic data to extract these values, with novel usages of these fine-tuned models to obtain 3D consistent albedo UV and bump UV for generated objects. In terms of roughness and metalness maps, we adopt a semi-automatic process to provide room for interactive adjustment, which we believe is more practical. Extensive experiments demonstrate that our model is generally beneficial for various state-of-the-art generation methods, significantly boosting the quality and realism of their generated 3D objects, with natural relighting effects and substantially improved geometry.
zh
[CV-78] Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models
【速读】: 该论文试图解决神经网络在图像分类任务中因学习到偏差而影响其泛化能力和性能的问题。解决方案的关键在于引入了一种名为 DiffuBias 的新型文本到图像生成管道,该管道通过生成偏差冲突样本(bias-conflict samples)来增强分类器的鲁棒性,而不需要在生成阶段进行训练。DiffuBias 利用预训练的扩散模型和图像字幕生成模型,通过偏差分类器(f_B)的 top-K 损失来生成更具代表性的数据样本,从而有效地去偏并提升分类器的泛化能力。据我们所知,DiffuBias 是首个利用稳定扩散模型在去偏任务中生成偏差冲突样本的方法。
链接: https://arxiv.org/abs/2411.16079
作者: Donggeun Ko,Dongjun Lee,Namjun Park,Wonkyeong Shim,Jaekwang Kim
关键词-EN: Neural networks struggle, Neural networks, Generative Adversarial Networks, misleads correlations, learned and misleads
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages + Appendix
点击查看摘要
Abstract:Neural networks struggle with image classification when biases are learned and mislead correlations, affecting their generalization and performance. Previous methods require attribute labels (e.g. background, color) or utilize Generative Adversarial Networks (GANs) to mitigate biases. We introduce DiffuBias, a novel pipeline for text-to-image generation that enhances classifier robustness by generating bias-conflict samples, without requiring training during the generation phase. Utilizing pretrained diffusion and image captioning models, DiffuBias generates images that challenge the biases of classifiers, using the top-K losses from a biased classifier (f_B) to create more representative data samples. This method not only debiases effectively but also boosts classifier generalization capabilities. To the best of our knowledge, DiffuBias is the first approach leveraging a stable diffusion model to generate bias-conflict samples in debiasing tasks. Our comprehensive experimental evaluations demonstrate that DiffuBias achieves state-of-the-art performance on benchmark datasets. We also conduct a comparative analysis of various generative models in terms of carbon emissions and energy consumption to highlight the significance of computational efficiency.
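"利用有偏分类器 f_B 的 top-K 损失挑选偏差冲突候选样本"这一步可以用下面的片段示意。分类器在此用一个线性层代替,k 的取值与后续如何用于文生图提示均为笔者假设,并非论文官方实现:

```python
import torch
import torch.nn.functional as F

def topk_bias_conflict(biased_model, images, labels, k=16):
    """返回有偏分类器 f_B 损失最大的 k 个样本的索引与损失值。
    这些样本往往与数据集中的主导偏差不一致, 可作为生成偏差冲突图像的参考。"""
    with torch.no_grad():
        logits = biased_model(images)
        losses = F.cross_entropy(logits, labels, reduction="none")   # 逐样本损失 (N,)
    topk = torch.topk(losses, k=min(k, len(losses)))
    return topk.indices, topk.values

model = torch.nn.Linear(64, 10)        # 用线性层模拟有偏分类器 f_B
idx, vals = topk_bias_conflict(model, torch.randn(100, 64), torch.randint(0, 10, (100,)))
print(idx[:5], vals[:5])
```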
zh
[CV-79] Geometry Distributions
【速读】: 该论文试图解决传统坐标基网络在处理3D数据时面临的挑战,如薄结构和非水密几何体的处理问题,这些问题限制了其灵活性和准确性。解决方案的关键在于提出了一种新的几何数据表示方法,即将几何体建模为分布(distributions),这种表示方法不依赖于表面拓扑、连通性或边界条件。具体实现上,论文采用了扩散模型(diffusion models)结合一种新颖的网络架构来学习表面点分布,从而捕捉精细的几何细节。这种方法在多种物体类型上进行了定性和定量评估,展示了其在实现高几何保真度方面的有效性,并探索了其在纹理网格表示、神经表面压缩、动态物体建模和渲染等应用中的潜力。
链接: https://arxiv.org/abs/2411.16076
作者: Biao Zhang,Jing Ren,Peter Wonka
关键词-EN: recent work leveraging, work leveraging coordinate-based, leveraging coordinate-based networks, vector fields, widely adopted
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: For the project site, see this https URL
点击查看摘要
Abstract:Neural representations of 3D data have been widely adopted across various applications, particularly in recent work leveraging coordinate-based networks to model scalar or vector fields. However, these approaches face inherent challenges, such as handling thin structures and non-watertight geometries, which limit their flexibility and accuracy. In contrast, we propose a novel geometric data representation that models geometry as distributions, a powerful representation that makes no assumptions about surface genus, connectivity, or boundary conditions. Our approach uses diffusion models with a novel network architecture to learn surface point distributions, capturing fine-grained geometric details. We evaluate our representation qualitatively and quantitatively across various object types, demonstrating its effectiveness in achieving high geometric fidelity. Additionally, we explore applications using our representation, such as textured mesh representation, neural surface compression, dynamic object modeling, and rendering, highlighting its potential to advance 3D geometric learning.
zh
[CV-80] Soft-TransFormers for Continual Learning
【速读】: 该论文试图解决持续学习(Continual Learning, CL)中的灾难性遗忘(Catastrophic Forgetting, CF)问题,特别是在类增量学习(Class-Incremental Learning, CIL)和任务增量学习(Task-Incremental Learning, TIL)场景下。解决方案的关键在于提出了一种名为Soft-TransFormers(Soft-TF)的全新全微调持续学习方法。Soft-TF通过顺序学习和选择每个任务的最优软网络或子网络,在训练过程中联合优化稀疏层的权重,以获得任务自适应的软(实值)网络或子网络(二进制掩码),同时保持预训练层参数冻结。在推理阶段,Soft-TF通过识别的任务自适应网络掩码预训练网络的参数,映射到每个任务的最优解,从而最小化灾难性遗忘,并保留预训练网络的知识。实验结果表明,Soft-TF在Vision Transformer (ViT)和CLIP模型上表现出色,达到了各种持续学习场景下的最先进性能。
链接: https://arxiv.org/abs/2411.16073
作者: Haeyong Kang,Chang D. Yoo
关键词-EN: Lottery Ticket Hypothesis, Well-initialized Lottery Ticket, Inspired by Well-initialized, Ticket Hypothesis, Well-initialized Lottery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Inspired by Well-initialized Lottery Ticket Hypothesis (WLTH), which provides suboptimal fine-tuning solutions, we propose a novel fully fine-tuned continual learning (CL) method referred to as Soft-TransFormers (Soft-TF). Soft-TF sequentially learns and selects an optimal soft-network or subnetwork for each task. During sequential training in CL, Soft-TF jointly optimizes the weights of sparse layers to obtain task-adaptive soft (real-valued) networks or subnetworks (binary masks), while keeping the well-pre-trained layer parameters frozen. In inference, the identified task-adaptive network of Soft-TF masks the parameters of the pre-trained network, mapping to an optimal solution for each task and minimizing Catastrophic Forgetting (CF) - the soft-masking preserves the knowledge of the pre-trained network. Extensive experiments on Vision Transformer (ViT) and CLIP demonstrate the effectiveness of Soft-TF, achieving state-of-the-art performance across various CL scenarios, including Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL), supported by convergence theory.
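"冻结预训练参数、为每个任务学习一个实值软掩码"的做法可用下面的最小 PyTorch 模块示意。模块名、掩码初始化方式与任务数均为笔者假设,仅表达思路,并非 Soft-TF 的官方实现:

```python
import torch
import torch.nn as nn

class SoftMaskedLinear(nn.Module):
    """冻结预训练线性层, 为每个任务学习一个与权重同形状的软(实值)掩码。"""
    def __init__(self, pretrained: nn.Linear, num_tasks: int):
        super().__init__()
        self.register_buffer("weight", pretrained.weight.detach().clone())  # 冻结
        self.register_buffer("bias",
                             pretrained.bias.detach().clone() if pretrained.bias is not None else None)
        # 每个任务一个软掩码, 初始化为 1 即等价于原网络
        self.masks = nn.Parameter(torch.ones(num_tasks, *self.weight.shape))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        w = self.weight * self.masks[task_id]      # 任务自适应的软网络
        return nn.functional.linear(x, w, self.bias)

layer = SoftMaskedLinear(nn.Linear(16, 8), num_tasks=3)
print(layer(torch.randn(4, 16), task_id=1).shape)   # torch.Size([4, 8])
```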
zh
[CV-81] Language Driven Occupancy Prediction
【速读】: 该论文试图解决开放词汇占用预测 (Open-Vocabulary Occupancy, OVO) 中由于监督信号不准确导致模型泛化能力不足的问题。解决方案的关键在于提出了一种语义传递标注流程 (semantic transitive labeling pipeline),通过将图像中的文本标签传递到LiDAR点云,最终映射到体素 (voxel) 上,生成密集且细粒度的3D语言占用真值。这一流程有效缓解了传统方法中基于图像特征或体素模型视图投影产生的噪声和稀疏对应关系。此外,论文通过替换监督占用模型中的预测头,引入几何头 (geometry head) 和语言头 (language head),利用生成的语言真值指导3D语言体积的学习,从而显著提升了模型的预测精度和泛化能力。
链接: https://arxiv.org/abs/2411.16072
作者: Zhu Yu,Bowen Pang,Lizhe Liu,Runmin Zhang,Qihao Peng,Maochun Luo,Sheng Yang,Mingxia Chen,Si-Yuan Cao,Hui-Liang Shen
关键词-EN: effective and generalizable, generalizable framework, framework for open-vocabulary, OVO, semantic transitive labeling
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates or noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to dig into the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our semantic transitive labeling pipeline can produce more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotations. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset. Notably, even based on the simpler BEVDet model, with an input resolution of 256×704, LOcc-BEVDet achieves an mIoU of 20.29, surpassing previous approaches that rely on temporal images, higher-resolution inputs, or larger backbone networks. The code for the proposed method is available at this https URL.
zh
[CV-82] Multi-Granularity Class Prototype Topology Distillation for Class-Incremental Source-Free Unsupervised Domain Adaptation
【速读】: 该论文试图解决的是类增量无源无监督领域自适应 (Class-Incremental Source-Free Unsupervised Domain Adaptation, CI-SFUDA) 问题,即在无法访问带标签的源数据的情况下,如何有效地将源域知识迁移到增量到达的无标签目标域。解决方案的关键在于提出了多粒度类原型拓扑蒸馏 (Multi-Granularity Class Prototype Topology Distillation, GROTO) 算法。该算法通过设计两个核心模块来应对问题中的两个挑战:1) 相似源类知识对目标类表示学习的干扰;2) 新目标知识对旧目标知识的干扰。具体来说,算法首先通过建模两种累积分布来挖掘正类,并引入多粒度类原型生成可靠的伪标签,促进正类目标特征的自组织。接着,利用正类原型构建源域和目标域特征空间的拓扑结构,并通过拓扑蒸馏持续减轻新目标知识对旧目标知识的干扰。实验结果表明,该方法在多个公开数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2411.16064
作者: Peihua Deng,Jiehua Zhang,Xichun Sheng,Chenggang Yan,Yaoqi Sun,Ying Fu,Liang Li
关键词-EN: Unsupervised Domain Adaptation, Source-Free Unsupervised Domain, Class-Incremental Source-Free Unsupervised, Source-Free Unsupervised, labeled source instances
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures
点击查看摘要
Abstract:This paper explores the Class-Incremental Source-Free Unsupervised Domain Adaptation (CI-SFUDA) problem, where the unlabeled target data come incrementally without access to labeled source instances. This problem poses two challenges, the disturbances of similar source-class knowledge to target-class representation learning and the new target knowledge to old ones. To address them, we propose the Multi-Granularity Class Prototype Topology Distillation (GROTO) algorithm, which effectively transfers the source knowledge to the unlabeled class-incremental target domain. Concretely, we design the multi-granularity class prototype self-organization module and prototype topology distillation module. Firstly, the positive classes are mined by modeling two accumulation distributions. Then, we generate reliable pseudo-labels by introducing multi-granularity class prototypes, and use them to promote the positive-class target feature self-organization. Secondly, the positive-class prototypes are leveraged to construct the topological structures of source and target feature spaces. Then, we perform the topology distillation to continually mitigate the interferences of new target knowledge to old ones. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performances on three public datasets.
zh
[CV-83] Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training
【速读】: 该论文试图解决脉冲神经网络 (Spiking Neural Networks, SNNs) 在性能和训练成本方面与传统人工神经网络 (Artificial Neural Networks, ANNs) 之间的差距问题。解决方案的关键在于提出了一种基于整数训练和脉冲驱动推理的脉冲发放近似方法 (Spike Firing Approximation, SFA),该方法优化了脉冲神经元的脉冲发放模式,从而提高了训练效率、降低了功耗、提升了性能,并使得SNNs更易于扩展和更好地利用神经形态芯片。此外,论文还开发了一种高效的脉冲驱动Transformer架构和脉冲掩码自编码器,以防止SNN在扩展过程中性能下降。实验结果表明,该方法在ImageNet-1k数据集上取得了最先进的性能,并且在训练时间和推理能效方面均有显著提升。
链接: https://arxiv.org/abs/2411.16061
作者: Man Yao,Xuerui Qiu,Tianxiang Hu,Jiakui Hu,Yuhong Chou,Keyu Tian,Jianxing Liao,Luziwei Leng,Bo Xu,Guoqi Li
关键词-EN: Artificial Neural Networks, traditional Artificial Neural, Spiking Neural Networks, Neural Networks, brain-inspired Spiking Neural
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The ambition of brain-inspired Spiking Neural Networks (SNNs) is to become a low-power alternative to traditional Artificial Neural Networks (ANNs). This work addresses two major challenges in realizing this vision: the performance gap between SNNs and ANNs, and the high training costs of SNNs. We identify intrinsic flaws in spiking neurons caused by binary firing mechanisms and propose a Spike Firing Approximation (SFA) method using integer training and spike-driven inference. This optimizes the spike firing pattern of spiking neurons, enhancing efficient training, reducing power consumption, improving performance, enabling easier scaling, and better utilizing neuromorphic chips. We also develop an efficient spike-driven Transformer architecture and a spike-masked autoencoder to prevent performance degradation during SNN scaling. On ImageNet-1k, we achieve state-of-the-art top-1 accuracy of 78.5%, 79.8%, 84.0%, and 86.2% with models containing 10M, 19M, 83M, and 173M parameters, respectively. For instance, the 10M model outperforms the best existing SNN by 7.2% on ImageNet, with training time acceleration and inference energy efficiency improved by 4.5× and 3.9×, respectively. We validate the effectiveness and efficiency of the proposed method across various tasks, including object detection, semantic segmentation, and neuromorphic vision tasks. This work enables SNNs to match ANN performance while maintaining the low-power advantage, marking a significant step towards SNNs as a general visual backbone. Code is available at this https URL.
zh
[CV-84] UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation
【速读】: 该论文试图解决在连续环境中视觉与语言导航 (VLN-CE) 中由于视觉遮挡或盲点导致的导航困难问题。解决方案的关键在于引入了一种名为 UnitedVLN 的新型 3DGS 预训练范式,通过联合渲染高保真 360 度视觉图像和语义特征,使代理能够更好地探索未来环境。UnitedVLN 采用两种关键策略:搜索-然后-查询采样和分离-然后-联合渲染,这些策略有助于有效利用神经原语,整合外观和语义信息,从而实现更稳健的导航。实验结果表明,UnitedVLN 在现有的 VLN-CE 基准测试中优于最先进的方法。
链接: https://arxiv.org/abs/2411.16053
作者: Guangzhao Dai,Jian Zhao,Yuantao Chen,Yusen Qin,Hao Zhao,Guosen Xie,Yazhou Yao,Xiangbo Shu,Xuelong Li
关键词-EN: target destination, significant advancements, instructions to reach, reach a target, recently seen significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a target destination, has recently seen significant advancements. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) presents greater challenges, as the agent is free to navigate any unobstructed location and is more vulnerable to visual occlusions or blind spots. Recent approaches have attempted to address this by imagining future environments, either through predicted future visual images or semantic features, rather than relying solely on current observations. However, these RGB-based and feature-based methods lack intuitive appearance-level information or high-level semantic complexity crucial for effective navigation. To overcome these limitations, we introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN, which enables agents to better explore future environments by unitedly rendering high-fidelity 360° visual images and semantic features. UnitedVLN employs two key schemes: search-then-query sampling and separate-then-united rendering, which facilitate efficient exploitation of neural primitives, helping to integrate both appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
zh
[CV-85] ROADS: Robust Prompt-driven Multi-Class Anomaly Detection under Domain Shift WACV2025
【速读】: 该论文试图解决多类别统一异常检测 (Multi-class Unified Anomaly Detection, MUAD) 方法中存在的类间干扰和域偏移问题。解决方案的关键在于提出了一种名为 ROADS 的新型鲁棒提示驱动 MUAD 框架。ROADS 通过层次化的类感知提示集成机制,动态地将类特定信息编码到异常检测器中,以减轻类间干扰;同时,引入域适配器来学习域不变表示,增强对域偏移的鲁棒性。实验结果表明,ROADS 在 MVTec-AD 和 VISA 数据集上的异常检测和定位性能均优于现有最先进方法,特别是在分布外设置下表现显著提升。
链接: https://arxiv.org/abs/2411.16049
作者: Hossein Kashiani,Niloufar Alipour Talemi,Fatemeh Afghah
关键词-EN: Multi-class Unified Anomaly, Multi-class Unified, practical alternatives compared, Recent advancements, Unified Anomaly Detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)
点击查看摘要
Abstract:Recent advancements in anomaly detection have shifted focus towards Multi-class Unified Anomaly Detection (MUAD), offering more scalable and practical alternatives compared to traditional one-class-one-model approaches. However, existing MUAD methods often suffer from inter-class interference and are highly susceptible to domain shifts, leading to substantial performance degradation in real-world applications. In this paper, we propose a novel robust prompt-driven MUAD framework, called ROADS, to address these challenges. ROADS employs a hierarchical class-aware prompt integration mechanism that dynamically encodes class-specific information into our anomaly detector to mitigate interference among anomaly classes. Additionally, ROADS incorporates a domain adapter to enhance robustness against domain shifts by learning domain-invariant representations. Extensive experiments on MVTec-AD and VISA datasets demonstrate that ROADS surpasses state-of-the-art methods in both anomaly detection and localization, with notable improvements in out-of-distribution settings.
zh
[CV-86] ZoomEye: Enhancing Multimodal LLM s with Human-Like Zooming Capabilities through Tree-Based Image Exploration
【速读】: 该论文试图解决多模态大语言模型 (MLLMs) 在处理高分辨率图像时,由于预训练视觉编码器的输入分辨率限制和图像密集上下文导致的对细节对象的忽视问题。解决方案的关键是提出了Zoom Eye算法,这是一种树搜索算法,通过将图像概念化为树结构,每个子节点代表父节点的放大子块,根节点代表整体图像。Zoom Eye不仅模型无关且无需训练,允许任何MLLMs模拟人类的缩放动作,通过从根节点到叶节点的搜索,捕捉相关信息,并准确响应相关查询。实验结果表明,Zoom Eye显著提升了基础MLLMs的性能,并使小型7B MLLMs能够超越强大的大型模型如GPT-4。
链接: https://arxiv.org/abs/2411.16044
作者: Haozhan Shen,Kangjia Zhao,Tiancheng Zhao,Ruochen Xu,Zilun Zhang,Mingwei Zhu,Jianwei Yin
关键词-EN: numerous visual elements, Zoom Eye, fine-grained detailed objects, dominant large objects, typically consists
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:An image, especially with high-resolution, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects. When perceiving such images, multimodal large language models (MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, resulting in a focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, with each child node representing a zoomed sub-patch of the parent node and the root representing the overall image. Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLMs to simulate human zooming actions by searching along the image tree from root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks and the results demonstrate that Zoom Eye not only consistently improves the performance of a series of base MLLMs by a large margin (e.g., LLaVA-v1.5-7B increases by 34.57% on V* Bench and 17.88% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at this https URL.
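"把图像视作一棵树、沿最相关的子块逐层放大"的搜索过程可以用下面的递归片段示意。实际中每个块的相关性由 MLLM 打分,这里用均值函数代替;四分法、停止条件与函数名均为笔者假设,并非 Zoom Eye 的官方实现:

```python
import numpy as np

def zoom_search(image: np.ndarray, score_fn, min_size: int = 64, path=()):
    """image: (H, W, C) 当前图像块; score_fn(patch) 返回该块与问题相关的置信度。
    从根节点出发, 每次选择得分最高的子块继续放大, 直到块尺寸足够小。"""
    h, w = image.shape[:2]
    if min(h, w) <= min_size:
        return image, path
    half_h, half_w = h // 2, w // 2
    children = {
        "tl": image[:half_h, :half_w], "tr": image[:half_h, half_w:],
        "bl": image[half_h:, :half_w], "br": image[half_h:, half_w:],
    }
    scores = {k: score_fn(v) for k, v in children.items()}
    best = max(scores, key=scores.get)
    # 若所有子块都不比父块更相关, 实际系统可在此提前停止(此处从简)
    return zoom_search(children[best], score_fn, min_size, path + (best,))

img = np.random.rand(512, 512, 3)
patch, path = zoom_search(img, score_fn=lambda p: p.mean())   # 用均值代替 MLLM 打分
print(patch.shape, path)
```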
zh
[CV-87] VisualLens: Personalization through Visual History
【速读】: 该论文试图解决在个性化推荐系统中,如何有效利用用户的视觉历史(visual history)来提取有价值的兴趣和偏好信息的问题。解决方案的关键在于提出了一种名为VisualLens的新方法,该方法通过提取、过滤和优化图像表示(image representations),从而在任务无关的视觉历史数据中提取出与用户兴趣相关的信号,并用于个性化推荐。这一方法在两个新创建的基准测试中展示了其优越性,相较于现有最先进的推荐系统,在Hit@3指标上提升了5-10%,并优于GPT-4o 2-5%。
链接: https://arxiv.org/abs/2411.16034
作者: Wang Bill Zhu,Deqing Fu,Kai Sun,Yi Lu,Zhaojiang Lin,Seungwhan Moon,Kanika Narang,Mustafa Canim,Yue Liu,Anuj Kumar,Xin Luna Dong
关键词-EN: offers valuable insights, daily life, offers valuable, valuable insights, user visual history
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We hypothesize that a user’s visual history, with images reflecting their daily life, offers valuable insights into their interests and preferences, and can be leveraged for personalization. Among the many challenges to achieve this goal, the foremost is the diversity and noise in the visual history, containing images not necessarily related to a recommendation task, not necessarily reflecting the user’s interest, or even not necessarily preference-relevant. Existing recommendation systems either rely on task-specific user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. We propose a novel approach, VisualLens, that extracts, filters, and refines image representations, and leverages these signals for personalization. We created two new benchmarks with task-agnostic visual histories, and show that our method improves over state-of-the-art recommendations by 5-10% on Hit@3, and improves over GPT-4o by 2-5%. Our approach paves the way for personalized recommendations in scenarios where traditional methods fail.
zh
[CV-88] From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events
【速读】: 该论文试图解决将真实世界驾驶视频自动转换为用于自动驾驶系统(ADS)测试的仿真场景的挑战。解决方案的关键在于提出了一种新颖的框架,利用提示工程的视频语言模型(VLM)将行车记录仪视频转换为SCENIC脚本,这些脚本定义了CARLA模拟器中的环境和驾驶行为,从而生成逼真的仿真场景。该框架不仅关注一对一的场景重建,还强调捕捉原始视频中的关键驾驶行为,并提供天气或道路条件等参数的灵活性,以支持基于搜索的测试。此外,引入了一种相似度度量,通过比较真实和模拟视频中的驾驶行为关键特征,迭代地优化生成的场景。初步结果显示,该方法在几分钟内完成从真实到仿真的转换,完全自动化且无需人工干预,同时保持对原始驾驶事件的高保真度。
链接: https://arxiv.org/abs/2411.16027
作者: Yan Miao,Georgios Fainekos,Bardh Hoxha,Hideki Okamoto,Danil Prokhorov,Sayan Mitra
关键词-EN: Automated Driving Systems, Testing Automated Driving, Testing Automated, Driving Systems, Automated Driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Testing Automated Driving Systems (ADS) in simulation with realistic driving scenarios is important for verifying their performance. However, converting real-world driving videos into simulation scenarios is a significant challenge due to the complexity of interpreting high-dimensional video data and the time-consuming nature of precise manual scenario reconstruction. In this work, we propose a novel framework that automates the conversion of real-world car crash videos into detailed simulation scenarios for ADS testing. Our approach leverages prompt-engineered Video Language Models (VLM) to transform dashcam footage into SCENIC scripts, which define the environment and driving behaviors in the CARLA simulator, enabling the generation of realistic simulation scenarios. Importantly, rather than solely aiming for one-to-one scenario reconstruction, our framework focuses on capturing the essential driving behaviors from the original video while offering flexibility in parameters such as weather or road conditions to facilitate search-based testing. Additionally, we introduce a similarity metric that helps iteratively refine the generated scenario through feedback by comparing key features of driving behaviors between the real and simulated videos. Our preliminary results demonstrate substantial time efficiency, finishing the real-to-sim conversion in minutes with full automation and no human intervention, while maintaining high fidelity to the original driving events.
zh
[CV-89] Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models WACV2025
【速读】: 该论文试图解决预训练视觉-语言模型(Vision-language models)在下游任务中过度拟合特定数据分布的问题,这限制了模型在新领域或未见类别上的泛化能力。解决方案的关键是提出了一种名为Style-Pro的新型风格引导提示学习框架(style-guided prompt learning framework)。Style-Pro通过使用可学习的风格基(learnable style bases)来合成多样化的分布偏移,并由两个专门的损失函数确保风格多样性和内容完整性。此外,Style-Pro将未见风格映射到已知风格表示空间,并通过加权组合风格基来最小化未见领域与源领域之间的差异。为了保持风格偏移提示模型与原始冻结的CLIP模型之间的嵌入一致性,Style-Pro引入了一致性约束,从而在适应下游任务时最小化偏差。实验结果表明,Style-Pro在多个基准数据集上显著优于现有最先进的方法。
链接: https://arxiv.org/abs/2411.16018
作者: Niloufar Alipour Talemi,Hossein Kashiani,Fatemeh Afghah
关键词-EN: Pre-trained Vision-language, downstream tasks, shown significant generalization, significant generalization ability, minimal fine-tuning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)
点击查看摘要
Abstract:Pre-trained Vision-language (VL) models, such as CLIP, have shown significant generalization ability to downstream tasks, even with minimal fine-tuning. While prompt learning has emerged as an effective strategy to adapt pre-trained VL models for downstream tasks, current approaches frequently encounter severe overfitting to specific downstream data distributions. This overfitting constrains the original behavior of the VL models to generalize to new domains or unseen classes, posing a critical challenge in enhancing the adaptability and generalization of VL models. To address this limitation, we propose Style-Pro, a novel style-guided prompt learning framework that mitigates overfitting and preserves the zero-shot generalization capabilities of CLIP. Style-Pro employs learnable style bases to synthesize diverse distribution shifts, guided by two specialized loss functions that ensure style diversity and content integrity. Then, to minimize discrepancies between unseen domains and the source domain, Style-Pro maps the unseen styles into the known style representation space as a weighted combination of style bases. Moreover, to maintain consistency between the style-shifted prompted model and the original frozen CLIP, Style-Pro introduces consistency constraints to preserve alignment in the learned embeddings, minimizing deviation during adaptation to downstream tasks. Extensive experiments across 11 benchmark datasets demonstrate the effectiveness of Style-Pro, consistently surpassing state-of-the-art methods in various settings, including base-to-new generalization, cross-dataset transfer, and domain generalization.
zh
[CV-90] DRIVE: Dual-Robustness via Information Variability and Entropic Consistency in Source-Free Unsupervised Domain Adaptation
【速读】: 该论文试图解决源数据不可访问情况下的无监督领域自适应问题(Source-Free Unsupervised Domain Adaptation, SFUDA),特别是在目标域数据无标签的情况下,如何有效适应预训练模型。解决方案的关键在于提出了一种名为DRIVE(Dual-Robustness through Information Variability and Entropy)的新型SFUDA框架,该框架采用双模型架构。两个初始权重相同的模型并行工作,以捕捉目标域的多样性特征。其中一个模型通过投影梯度下降(PGD)引入扰动,并由互信息引导,专注于高不确定性区域。此外,论文还引入了一种基于熵的伪标签策略,根据预测不确定性调整标签权重,确保模型关注可靠数据并避免噪声区域。适应过程分为两个阶段:第一阶段通过互信息一致性损失对齐模型,第二阶段根据第一阶段的损失动态调整扰动水平,鼓励模型探索更广泛的目标域特征,同时保持现有性能,从而增强模型的泛化能力和抗干扰能力。
链接: https://arxiv.org/abs/2411.15976
作者: Ruiqiang Xiao,Songning Lai,Yijun Yang,Jiemin Wu,Yutao Yue,Lei Zhu
关键词-EN: Adapting machine learning, machine learning models, autonomous driving, target domain, medical imaging
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Adapting machine learning models to new domains without labeled data, especially when source data is inaccessible, is a critical challenge in applications like medical imaging, autonomous driving, and remote sensing. This task, known as Source-Free Unsupervised Domain Adaptation (SFUDA), involves adapting a pre-trained model to a target domain using only unlabeled target data, which can lead to issues such as overfitting, underfitting, and poor generalization due to domain discrepancies and noise. Existing SFUDA methods often rely on single-model architectures, struggling with uncertainty and variability in the target domain. To address these challenges, we propose DRIVE (Dual-Robustness through Information Variability and Entropy), a novel SFUDA framework leveraging a dual-model architecture. The two models, initialized with identical weights, work in parallel to capture diverse target domain characteristics. One model is exposed to perturbations via projection gradient descent (PGD) guided by mutual information, focusing on high-uncertainty regions. We also introduce an entropy-aware pseudo-labeling strategy that adjusts label weights based on prediction uncertainty, ensuring the model focuses on reliable data while avoiding noisy regions. The adaptation process has two stages: the first aligns the models on stable features using a mutual information consistency loss, and the second dynamically adjusts the perturbation level based on the loss from the first stage, encouraging the model to explore a broader range of the target domain while preserving existing performance. This enhances generalization capabilities and robustness against interference. Evaluations on standard SFUDA benchmarks show that DRIVE consistently outperforms previous methods, delivering improved adaptation accuracy and stability across complex target domains.
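DRIVE 中"根据预测不确定性调整伪标签权重"的思想可以用下面的片段示意:预测熵越低(越确定)的样本权重越大,噪声区域被自然抑制。归一化方式与加权形式为笔者假设,并非论文官方实现:

```python
import math
import torch
import torch.nn.functional as F

def entropy_weighted_pseudo_labels(logits: torch.Tensor):
    """logits: (N, C)。返回伪标签与按熵归一化得到的样本权重 (越确定权重越大)。"""
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)   # (N,)
    max_entropy = math.log(logits.size(1))                        # 均匀分布时的最大熵
    weights = 1.0 - entropy / max_entropy                         # 归一化到 [0, 1]
    return probs.argmax(dim=1), weights

def weighted_self_training_loss(logits, pseudo_labels, weights):
    ce = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (weights * ce).mean()

logits = torch.randn(32, 10)
labels, w = entropy_weighted_pseudo_labels(logits)
print(weighted_self_training_loss(logits, labels, w))
```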
zh
[CV-91] CNNs for Style Transfer of Digital to Film Photography
【速读】: 该论文试图解决使用深度学习生成Cinestill800T胶片风格效果的问题。解决方案的关键在于采用简单的卷积神经网络(Convolutional Neural Networks)来模拟数字输入的胶片效果,并通过实验测试不同损失函数(Loss Functions)、输入噪声通道(Input Noise Channel)以及训练过程中随机缩放的图像块(Random Scales of Patches)对效果的影响。研究结果表明,结合均方误差(MSE)和VGG损失函数能够产生最佳的色彩效果,尽管能够生成一些颗粒感,但质量不高,且未能生成光晕(Halation)效果。此外,论文还贡献了一个对齐的胶片和数字相机拍摄的图像数据集,以供进一步研究使用。
链接: https://arxiv.org/abs/2411.15967
作者: Pierre Mackenzie,Mika Senghaas,Raphael Achddou
关键词-EN: stylistic effect generation, recent years, deep learning, learning in stylistic, stylistic effect
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The use of deep learning in stylistic effect generation has seen increasing use over recent years. In this work, we use simple convolutional neural networks to model Cinestill800T film given a digital input. We test the effect of different loss functions, the addition of an input noise channel and the use of random scales of patches during training. We find that a combination of MSE/VGG loss gives the best colour production and that some grain can be produced, but it is not of a high quality, and no halation is produced. We contribute our dataset of aligned paired images taken with a film and digital camera for further work.
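文中效果最好的"MSE + VGG 感知损失"组合,大致可写成如下 PyTorch 模块。取 VGG16 前若干层特征、权重系数 0.1 以及省略 ImageNet 归一化均为笔者的假设与简化(需要联网下载预训练权重),并非论文的精确配置:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class MSEVGGLoss(nn.Module):
    """像素 MSE 与 VGG 感知损失的加权组合示意。"""
    def __init__(self, vgg_weight: float = 0.1):
        super().__init__()
        # 取预训练 VGG16 的前 16 层卷积特征作为感知特征提取器
        self.features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.vgg_weight = vgg_weight
        self.mse = nn.MSELoss()

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        pixel = self.mse(pred, target)
        # 简化起见省略 ImageNet 均值/方差归一化
        perceptual = self.mse(self.features(pred), self.features(target))
        return pixel + self.vgg_weight * perceptual

loss_fn = MSEVGGLoss()
print(loss_fn(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)))
```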
zh
[CV-92] Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
【速读】: 该论文试图解决从有限数量的未校准2D图像中进行无姿态(pose-free)的360°场景重建问题。解决方案的关键在于提出了一种指令跟随的RGBD扩散模型,该模型能够填补缺失细节并去除新视角渲染和深度图中的伪影。此外,论文还引入了一种新的高斯表示置信度度量,以更好地检测这些伪影。通过逐步整合这些新视角,论文采用了一种类似于高斯SLAM(Gaussian-SLAM)的过程,实现了多视角一致的高斯表示。实验结果表明,该方法在复杂360°场景中超越了现有的无姿态重建技术,并与最先进的已知姿态重建方法表现相当。
链接: https://arxiv.org/abs/2411.15966
作者: Soumava Paul,Prakhar Kaushik,Alan Yuille
关键词-EN: number of uncalibrated, introduce a generative, generative approach, limited number, reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures, 3 tables
点击查看摘要
Abstract:In this work, we introduce a generative approach for pose-free reconstruction of 360^\circ scenes from a limited number of uncalibrated 2D images. Pose-free scene reconstruction from incomplete, unposed observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of unbounded scenes with known camera poses using diffusion priors, these methods rely on explicit camera embeddings for extrapolating unobserved regions. This reliance limits their application in pose-free settings, where view-specific data is only implicitly available. To address this, we propose an instruction-following RGBD diffusion model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We also propose a novel confidence measure for Gaussian representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent Gaussian representation. Evaluations on the MipNeRF360 dataset demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed reconstruction methods in complex 360^\circ scenes.
zh
[CV-93] MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
【速读】: 该论文试图解决轻量级模型在处理高分辨率图像时面临的效率与性能平衡问题。解决方案的关键在于提出了MobileMamba框架,通过设计三阶段网络结构显著提升推理速度,并引入Multi-Receptive Field Feature Interaction (MRFFI)模块,该模块包括Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba)、Efficient Multi-Kernel Depthwise Convolution (MK-DeConv)和Eliminate Redundant Identity组件,以集成多感受野信息并增强高频细节提取。此外,通过采用特定的训练和测试策略,进一步提升了模型的性能和效率。实验结果表明,MobileMamba在速度和准确性上均优于现有的最先进方法。
链接: https://arxiv.org/abs/2411.15941
作者: Haoyang He,Jiangning Zhang,Yuxuan Cai,Hongxu Chen,Xiaobin Hu,Zhenye Gan,Yabiao Wang,Chengjie Wang,Yunsheng Wu,Lei Xie
关键词-EN: Previous research, primarily focused, Transformer-based designs, CNNs and Transformer-based, Previous
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages
点击查看摘要
Abstract:Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to enhance inference speed significantly. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Convolution (MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods, and is up to 21× faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.
zh
[CV-94] Segment to Recognize Robustly – Enhancing Recognition by Image Decomposition
【速读】: 该论文试图解决图像识别中背景信息过度依赖的问题,特别是在实际部署环境中模型鲁棒性受限的情况。解决方案的关键在于提出了一种名为“Segment to Recognize Robustly” (S2R^2) 的新型识别方法,该方法通过解耦前景 (FG) 和背景 (BG) 的建模,并在识别过程中结合这两部分信息,从而实现简单、鲁棒且可解释的识别。S2R^2 利用零样本分割技术在识别前或识别过程中隔离前景和背景,通过结合前景、背景以及标准的全图像分类器,不仅在域内数据上达到了最先进的结果,还保持了对背景变化的鲁棒性。
链接: https://arxiv.org/abs/2411.15933
作者: Klara Janouskova,Cristian Gavrus,Jiri Matas
关键词-EN: real-world deployment settings, deep image recognition, standard deep image, limiting model robustness, deep image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In image recognition, both foreground (FG) and background (BG) play an important role; however, standard deep image recognition often leads to unintended over-reliance on the BG, limiting model robustness in real-world deployment settings. Current solutions mainly suppress the BG, sacrificing BG information for improved generalization. We propose “Segment to Recognize Robustly” (S2R^2), a novel recognition approach which decouples the FG and BG modelling and combines them in a simple, robust, and interpretable manner. S2R^2 leverages recent advances in zero-shot segmentation to isolate the FG and the BG before or during recognition. By combining FG and BG, potentially also with a standard full-image classifier, S2R^2 achieves state-of-the-art results on in-domain data while maintaining robustness to BG shifts. The results confirm that segmentation before recognition is now possible.
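"解耦前景/背景建模、再以简单可解释的方式组合"可以用下面的小片段示意:三个分类器(前景遮罩输入、背景遮罩输入、整图输入)各自给出概率,再做加权平均。权重取值为笔者假设,并非论文给出的组合方式:

```python
import torch

def s2r2_combine(logits_fg, logits_bg, logits_full, w=(0.6, 0.1, 0.3)):
    """logits_*: (N, C) 分别来自前景、背景与整图分类器; w 为假设的组合权重。"""
    probs = [torch.softmax(l, dim=1) for l in (logits_fg, logits_bg, logits_full)]
    combined = sum(wi * pi for wi, pi in zip(w, probs))
    return combined.argmax(dim=1), combined

preds, probs = s2r2_combine(torch.randn(4, 10), torch.randn(4, 10), torch.randn(4, 10))
print(preds)
```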
zh
[CV-95] Improving Pre-Trained Self-Supervised Embeddings Through Effective Entropy Maximization
【速读】: 该论文试图解决自监督学习 (Self-Supervised Learning, SSL) 中嵌入向量在高维空间中熵估计不准确的问题。解决方案的关键在于提出了一种有效的熵最大化准则 (Effective Entropy Maximization Criterion, E2MC),该准则基于易于估计的低维约束。通过在已经训练好的SSL模型上继续训练几个周期,使用E2MC能够显著提升下游任务的性能,而其他替代准则则未能带来显著改进,甚至在某些情况下会降低性能。
链接: https://arxiv.org/abs/2411.15931
作者: Deep Chakraborty,Yann LeCun,Tim G. J. Rudner,Erik Learned-Miller
关键词-EN: supervised downstream tasks, lightly supervised downstream, self-supervised learning, lightly supervised, architectures and loss
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Applications (stat.AP); Machine Learning (stat.ML)
备注: 19 pages including appendix, 5 figures
点击查看摘要
Abstract:A number of different architectures and loss functions have been applied to the problem of self-supervised learning (SSL), with the goal of developing embeddings that provide the best possible pre-training for as-yet-unknown, lightly supervised downstream tasks. One of these SSL criteria is to maximize the entropy of a set of embeddings in some compact space. But the goal of maximizing the embedding entropy often depends, whether explicitly or implicitly, upon high-dimensional entropy estimates, which typically perform poorly in more than a few dimensions. In this paper, we motivate an effective entropy maximization criterion (E2MC), defined in terms of easy-to-estimate, low-dimensional constraints. We demonstrate that using it to continue training an already-trained SSL model for only a handful of epochs leads to a consistent and, in some cases, significant improvement in downstream performance. We perform careful ablation studies to show that the improved performance is due to the proposed add-on criterion. We also show that continued pre-training with alternative criteria does not lead to notable improvements, and in some cases, even degrades performance.
zh
[CV-96] Making Images from Images: Interleaving Denoising and Transformation
【速读】: 该论文试图解决通过重新排列图像区域来生成新图像的问题,特别是如何将现有图像(如《蒙娜丽莎》)转换为全新的主题。解决方案的关键在于提出了一种同时学习图像内容和参数化变换的方法,通过将图像扩散与能量最小化步骤交替进行,来解决这一约束优化问题。与以往方法不同,增加区域数量不仅不会增加问题复杂性,反而能提升结果质量。该方法在像素空间和潜在空间中均得到了验证,并展示了使用无限复制源图像和多源图像的创意扩展。
链接: https://arxiv.org/abs/2411.15925
作者: Shumeet Baluja,David Marwood,Ashwin Baluja
关键词-EN: Simply by rearranging, image, Simply, subject matter, regions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
点击查看摘要
Abstract:Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user definable, ranging from regularly and irregularly-shaped blocks, concentric rings, or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g. the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it through interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.
zh
[CV-97] Deep Learning for automated multi-scale functional field boundaries extraction using multi-date Sentinel-2 and PlanetScope imagery: Case Study of Netherlands and Pakistan
【速读】: 该论文试图解决在不同地理和多尺度农业系统中,利用多时相卫星影像进行功能性田地边界划分的有效性问题。解决方案的关键在于采用深度学习语义分割架构,结合多时相影像和归一化植被指数(NDVI)堆栈,以捕捉作物生长季节的不同时间点信息。研究通过在荷兰和巴基斯坦两个不同地区的实验,评估了基于UNET架构的四种深度学习模型,并比较了不同组合的多时相影像和NDVI堆栈的效果。结果表明,多时相NDVI堆栈提供了额外的季节性上下文信息,显著提高了田地边界划分的准确性。此外,研究还强调了多尺度地面信息在不同地理区域中的重要性,以及高空间分辨率在小型农田区域边界提取中的关键作用。通过迁移学习和结合多源数据,研究展示了在异质农业环境中实现自动田地边界划分的潜力。
链接: https://arxiv.org/abs/2411.15923
作者: Saba Zahid,Sajid Ghuffar,Obaid-ur-Rehman,Syed Roshaan Ali Shah
关键词-EN: multi-temporal satellite imagery, Pakistan, semantic segmentation architecture, learning semantic segmentation, Netherlands
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 09 pages, To be published
点击查看摘要
Abstract:This study explores the effectiveness of multi-temporal satellite imagery for better functional field boundary delineation using deep learning semantic segmentation architecture on two distinct geographical and multi-scale farming systems of Netherlands and Pakistan. Multidate images of April, August and October 2022 were acquired for PlanetScope and Sentinel-2 in sub regions of Netherlands and November 2022, February and March 2023 for selected area of Dunyapur in Pakistan. For Netherlands, Basic registration crop parcels (BRP) vector layer was used as labeled training data, while self-crafted field boundary vector data were utilized for Pakistan. Four deep learning models with UNET architecture were evaluated using different combinations of multi-date images and NDVI stacks in the Netherlands subregions. A comparative analysis of IoU scores assessed the effectiveness of the proposed multi-date NDVI stack approach. These findings were then applied for transfer learning, using pre-trained models from the Netherlands on the selected area in Pakistan. Additionally, separate models were trained using self-crafted field boundary data for Pakistan, and combined models were developed using data from both the Netherlands and Pakistan. Results indicate that multi-date NDVI stacks provide additional temporal context, reflecting crop growth over different times of the season. The study underscores the critical role of multi-scale ground information from diverse geographical areas in developing robust and universally applicable models for field boundary delineation. The results also highlight the importance of fine spatial resolution for extraction of field boundaries in regions with small-scale farming. The findings can be extended to multi-scale implementations for improved automatic field boundary delineation in heterogeneous agricultural environments.
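摘要中的关键输入是"多时相 NDVI 堆栈"。下面是一个极简的 numpy 示意:对每个日期的影像计算 NDVI = (NIR - Red) / (NIR + Red),再沿时间维堆叠成分割网络的多通道输入;波段索引与日期数量均为示例假设。

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """NDVI = (NIR - Red) / (NIR + Red),取值范围约为 [-1, 1]。"""
    return (nir - red) / (nir + red + eps)

def build_ndvi_stack(images, nir_band=3, red_band=2):
    """images: (dates, bands, H, W) 的多时相影像,
    返回 (dates, H, W) 的 NDVI 堆栈,可直接作为 UNET 等网络的多通道输入。"""
    return np.stack([ndvi(img[nir_band], img[red_band]) for img in images], axis=0)

# 三个日期、4 波段、256x256 的假数据
imgs = np.random.rand(3, 4, 256, 256).astype(np.float32)
stack = build_ndvi_stack(imgs)
print(stack.shape)  # (3, 256, 256)
```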
zh
[CV-98] A Tunable Despeckling Neural Network Stabilized via Diffusion Equation
【速读】: 该论文试图解决合成孔径雷达(SAR)成像中乘性Gamma噪声去除的问题,特别是在实际数据与理论模型不符时,神经网络易受干扰和对抗攻击的情况。解决方案的关键在于利用扩散方程的耗散性质,设计了一种可调节的正则化神经网络,该网络将去噪单元和正则化单元整合为一个端到端的训练网络。去噪单元由去噪网络构成,而正则化单元则基于最简单的线性扩散方程,增强了网络的稳定性,允许在训练后调整时间步长以有效缓解对抗攻击的不利影响。该模型在理论上的稳定性和收敛性得到了证明,并在实验中与几种最先进的去噪方法进行了比较,结果显示在定量和视觉评估方面均表现优异。
链接: https://arxiv.org/abs/2411.15921
作者: Yi Ran,Zhichang Guo,Jia Li,Yao Li,Martin Burger,Boying Wu
关键词-EN: Multiplicative Gamma noise, Multiplicative Gamma, Gamma noise remove, synthetic aperture radar, critical research area
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Multiplicative Gamma noise remove is a critical research area in the application of synthetic aperture radar (SAR) imaging, where neural networks serve as a potent tool. However, real-world data often diverges from theoretical models, exhibiting various disturbances, which makes the neural network less effective. Adversarial attacks work by finding perturbations that significantly disrupt functionality of neural networks, as the inherent instability of neural networks makes them highly susceptible. A network designed to withstand such extreme cases can more effectively mitigate general disturbances in real SAR data. In this work, the dissipative nature of diffusion equations is employed to underpin a novel approach for countering adversarial attacks and improve the resistance of real noise disturbance. We propose a tunable, regularized neural network that unrolls a denoising unit and a regularization unit into a single network for end-to-end training. In the network, the denoising unit and the regularization unit are composed of the denoising network and the simplest linear diffusion equation respectively. The regularization unit enhances network stability, allowing post-training time step adjustments to effectively mitigate the adverse impacts of adversarial attacks. The stability and convergence of our model are theoretically proven, and in the experiments, we compare our model with several state-of-the-art denoising methods on simulated images, adversarial samples, and real SAR images, yielding superior results in both quantitative and visual evaluations.
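正则化单元基于最简单的线性扩散(热传导)方程。下面用 PyTorch 给出显式欧拉步的示意:u ← u + τ·Δu,其中拉普拉斯算子用固定卷积核实现,τ 即摘要中"训练后可调"的时间步长;展开步数与 τ 的取值属于假设。

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def linear_diffusion_step(u, tau=0.1):
    """对单通道图像 u (B,1,H,W) 做一步显式线性扩散:u + tau * Laplacian(u)。
    tau 越大平滑越强,可在训练后调节以抵御对抗扰动(示意)。"""
    lap = F.conv2d(F.pad(u, (1, 1, 1, 1), mode='replicate'), LAPLACIAN)
    return u + tau * lap

u = torch.rand(1, 1, 64, 64)
for _ in range(5):            # 展开若干步,相当于一个轻量的正则化单元
    u = linear_diffusion_step(u)
print(u.shape)
```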
zh
[CV-99] Bimanual Grasp Synthesis for Dexterous Robot Hands
【速读】: 该论文试图解决机器人双臂操作中双臂抓取姿态合成的问题,特别是在灵巧手操作器上的应用。解决方案的关键在于提出了BimanGrasp算法,该算法通过优化能量函数来生成考虑抓取稳定性和可行性的抓取姿态,并使用Isaac Gym物理模拟引擎进行验证。此外,论文还创建了BimanGrasp-Dataset,这是首个大规模合成的双臂灵巧手抓取姿态数据集,包含超过150k个验证过的抓取姿态。最后,论文提出了基于扩散模型(BimanGrasp-DDPM)的数据驱动方法,显著提高了抓取合成的成功率和计算速度。
链接: https://arxiv.org/abs/2411.15903
作者: Yanming Shao,Chenxi Xiao
关键词-EN: Humans naturally perform, Humans naturally, naturally perform bimanual, perform bimanual skills, naturally perform
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in RA-L 24’, 8 pages, 9 figures, 3 tables
点击查看摘要
Abstract:Humans naturally perform bimanual skills to handle large and heavy objects. To enhance robots’ object manipulation capabilities, generating effective bimanual grasp poses is essential. Nevertheless, bimanual grasp synthesis for dexterous hand manipulators remains underexplored. To bridge this gap, we propose the BimanGrasp algorithm for synthesizing bimanual grasps on 3D objects. The BimanGrasp algorithm generates grasp poses by optimizing an energy function that considers grasp stability and feasibility. Furthermore, the synthesized grasps are verified using the Isaac Gym physics simulation engine. These verified grasp poses form the BimanGrasp-Dataset, the first large-scale synthesized bimanual dexterous hand grasp pose dataset to our knowledge. The dataset comprises over 150k verified grasps on 900 objects, facilitating the synthesis of bimanual grasps through a data-driven approach. Last, we propose BimanGrasp-DDPM, a diffusion model trained on the BimanGrasp-Dataset. This model achieved a grasp synthesis success rate of 69.87% and significant acceleration in computational speed compared to BimanGrasp algorithm.
zh
[CV-100] Highly Efficient and Unsupervised Framework for Moving Object Detection in Satellite Videos
【速读】: 该论文试图解决卫星视频中移动目标检测 (SVMOD) 的问题,特别是由于目标极小且亮度极低所带来的挑战。当前基于学习的方法通过从多帧密集表示中提取时空信息来应对这一问题,但需要大量人工标注,且由于前景与背景区域的不平衡导致计算冗余。论文提出的解决方案关键在于:1) 引入一个通用的无监督框架,其中伪标签由传统方法生成并在训练过程中进化,以提升检测性能;2) 通过将密集的多帧图像形式采样为稀疏的时空点云表示,并跳过背景区域的冗余计算,设计了一种高效且有效的稀疏卷积无锚点检测网络。这些设计使得该方法在保持高效性的同时,实现了最先进的检测性能。
链接: https://arxiv.org/abs/2411.15895
作者: C. Xiao,W. An,Y. Zhang,Z. Su,M. Li,W. Sheng,M. Pietikäinen,L. Liu
关键词-EN: small target characteristics, Moving object detection, challenging task due, Moving object, satellite videos
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures
点击查看摘要
Abstract:Moving object detection in satellite videos (SVMOD) is a challenging task due to the extremely dim and small target characteristics. Current learning-based methods extract spatio-temporal information from multi-frame dense representation with labor-intensive manual labels to tackle SVMOD, which needs high annotation costs and contains tremendous computational redundancy due to the severe imbalance between foreground and background regions. In this paper, we propose a highly efficient unsupervised framework for SVMOD. Specifically, we propose a generic unsupervised framework for SVMOD, in which pseudo labels generated by a traditional method can evolve with the training process to promote detection performance. Furthermore, we propose a highly efficient and effective sparse convolutional anchor-free detection network by sampling the dense multi-frame image form into a sparse spatio-temporal point cloud representation and skipping the redundant computation on background regions. Combining these two designs, we can achieve both high efficiency (label and computation efficiency) and effectiveness. Extensive experiments demonstrate that our method can not only process 98.8 frames per second on 1024x1024 images but also achieve state-of-the-art performance. The relabeled dataset and code are available at this https URL.
zh
[CV-101] Optimization-Driven Statistical Models of Anatomies using Radial Basis Function Shape Representation
【速读】: 该论文试图解决粒子基形状建模 (Particle-based shape modeling, PSM) 中自动量化解剖结构形状变异性的问题。解决方案的关键在于结合传统的优化方法与深度学习技术,通过利用特征形状 (eigenshape) 和对应损失 (correspondence loss) 来实现对模型特性的更精确控制。这种方法不仅避免了黑箱模型的使用,还允许粒子在表面上有更大的自由度,从而生成更具信息量的统计模型。通过在两个真实数据集上与最先进方法的比较,证明了该方法的有效性,并通过实验验证了所选损失函数的合理性。
链接: https://arxiv.org/abs/2411.15882
作者: Hong Xu,Shireen Y. Elhabian
关键词-EN: Particle-based shape modeling, quantify shape variability, automatically quantify shape, Particle-based shape, variability in populations
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Particle-based shape modeling (PSM) is a popular approach to automatically quantify shape variability in populations of anatomies. The PSM family of methods employs optimization to automatically populate a dense set of corresponding particles (as pseudo landmarks) on 3D surfaces to allow subsequent shape analysis. A recent deep learning approach leverages implicit radial basis function representations of shapes to better adapt to the underlying complex geometry of anatomies. Here, we propose an adaptation of this method using a traditional optimization approach that allows more precise control over the desired characteristics of models by leveraging both an eigenshape and a correspondence loss. Furthermore, the proposed approach avoids using a black-box model and allows more freedom for particles to navigate the underlying surfaces, yielding more informative statistical models. We demonstrate the efficacy of the proposed approach to state-of-the-art methods on two real datasets and justify our choice of losses empirically.
zh
[CV-102] Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
【速读】: 该论文试图解决预训练视觉-语言模型(如CLIP)在开放词汇分割任务中因图像级预训练而难以捕捉局部细节的问题。解决方案的关键在于提出了一种无需训练的方法——自校准CLIP (Self-Calibrated CLIP, SC-CLIP),通过识别并解决前向传播过程中的异常标记(anomaly tokens),减少其对正常标记的干扰,从而增强空间感知能力。此外,通过利用CLIP中间特征的语义一致性来提升特征的区分度和注意力相关性,并采用多层次特征融合来丰富细节,最终在不引入新参数或依赖额外骨干网络的情况下,显著提升了CLIP的特征表示粒度和一致性,实验结果表明SC-CLIP在多个语义分割数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2411.15869
作者: Sule Bai,Yong Liu,Yifei Han,Haoji Zhang,Yansong Tang
关键词-EN: pre-trained vision-language models, Recent advancements, CLIP, advancements in pre-trained, pre-trained vision-language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in pre-trained vision-language models like CLIP, have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to its image-level pre-training, CLIP struggles to capture local details, resulting in poor performance in segmentation tasks. Our analysis reveals that anomaly tokens emerge during the forward pass, drawing excessive attention from normal patch tokens, thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to produce finer-grained representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we first identify and resolve the anomaly tokens to mitigate their negative impact. Next, we enhance feature discriminability and attention correlation by leveraging the semantic consistency found in CLIP’s intermediate features. Furthermore, we employ multi-level feature fusion to enrich details. Collectively, these strategies enhance CLIP’s feature representation with greater granularity and coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across eight semantic segmentation datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at this https URL.
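摘要指出前向过程中会出现吸引过多注意力的"异常 token"。下面给出一个假设性的检测与修复草图:以 token 特征范数偏离中位数的程度判定异常,再用正常 token 的均值替换;SC-CLIP 实际的判定与修复细节请以论文为准,此处仅示意"先识别、再解决异常 token"这一步骤。

```python
import torch

def fix_anomaly_tokens(tokens, k=6.0):
    """tokens: (N, D) 的 patch token 特征。
    以范数超过 中位数 + k*MAD 视为异常(假设性准则),
    用所有正常 token 的均值替换,返回修复后的特征与异常掩码。"""
    norms = tokens.norm(dim=-1)
    med = norms.median()
    mad = (norms - med).abs().median() + 1e-6
    is_anomaly = norms > med + k * mad
    if is_anomaly.any():
        normal_mean = tokens[~is_anomaly].mean(dim=0)
        tokens = tokens.clone()
        tokens[is_anomaly] = normal_mean
    return tokens, is_anomaly

t = torch.randn(196, 768)
t[10] *= 50.0                       # 人为制造一个异常 token
fixed, mask = fix_anomaly_tokens(t)
print(mask.nonzero().flatten())     # 被放大的 token(索引 10)会被标记
```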
zh
[CV-103] PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLM s
【速读】: 该论文试图解决全景图像生成中的多层次一致性挑战和扩散模型实现复杂性问题。解决方案的关键在于引入PanoLlama框架,将全景图像生成重新定义为下一个token预测任务,基于预训练的LlamaGen架构,采用自回归方式生成图像,并通过扩展策略处理尺寸限制。该方法以裁剪方式和无需训练的方式与图像token结构对齐,生成高质量全景图像,具有最小接缝和最大可扩展性,从而克服了扩散模型无法解决的问题。
链接: https://arxiv.org/abs/2411.15867
作者: Teng Zhou,Xiaoyu Zhang,Yongchuan Tang
关键词-EN: Panoramic Image Generation, driven by growing, technical applications, Panoramic Image, Image Generation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at this https URL.
zh
[CV-104] Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching WACV2025
【速读】: 该论文试图解决的是在仅使用单张RGB图像的情况下,对未见过的物体进行姿态估计的问题。解决方案的关键在于利用扩散模型(diffusion model)生成新视角的图像,并通过在这些生成的图像上进行双边匹配(two-sided matching)来确定物体的姿态。这种方法无需依赖实例级别的物体姿态估计和大量的训练数据,也不需要3D物体模型或多视角图像,从而实现了对未见过物体的泛化能力。实验结果表明,该方法在合成数据集和真实世界数据集上都优于现有的姿态估计技术,特别是在视角变化较大的情况下表现出色,显示出其鲁棒性和广泛适用性。
链接: https://arxiv.org/abs/2411.15860
作者: Yujing Sun,Caiyi Sun,Yuan Liu,Yuexin Ma,Siu Ming Yiu
关键词-EN: object pose estimation, generalizable object pose, object pose, pose estimation, RGB image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WACV 2025, not published yet
点击查看摘要
Abstract:In this paper, we present a novel generalizable object pose estimation method to determine the object pose using only one RGB image. Unlike traditional approaches that rely on instance-level object pose estimation and necessitate extensive training data, our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object. These characteristics are achieved by utilizing a diffusion model to generate novel-view images and conducting a two-sided matching on these generated images. Quantitative experiments demonstrate the superiority of our method over existing pose estimation techniques across both synthetic and real-world datasets. Remarkably, our approach maintains strong performance even in scenarios with significant viewpoint changes, highlighting its robustness and versatility in challenging conditions. The code will be released at this https URL.
zh
[CV-105] SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
【速读】: 该论文试图解决基于连接主义时间分类(CTC)的场景文本识别(STR)方法在面对复杂和多样化文本实例时准确性较低的问题。解决方案的关键在于提出SVTRv2模型,该模型通过引入多尺寸重调整(Multi-size Resizing, MSR)策略和特征重排模块(Feature Rearrangement Module, FRM)来处理文本的不规则性,并通过语义指导模块(Semantic Guidance Module, SGM)整合语言上下文信息,从而提高识别精度和速度。SGM在推理阶段可被省略,不会增加推理成本,使得SVTRv2在保持高效推理的同时,显著提升了在各种复杂场景下的识别性能。
链接: https://arxiv.org/abs/2411.15858
作者: Yongkun Du,Zhineng Chen,Hongtao Xie,Caiyan Jia,Yu-Gang Jiang
关键词-EN: Connectionist temporal classification, CTC-aligned linear classifier, OCR applications, Connectionist temporal, employed in OCR
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally have worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endows it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and would not increase the inference cost. We evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across the scenarios in terms of accuracy and speed. Code is available at this https URL.
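SVTRv2 属于 CTC 路线:视觉模型逐时间步输出字符分布,训练用 CTC 损失对齐,推理只需贪心解码(逐帧取最大、合并重复并去掉 blank),因此没有自回归解码开销。下面是通用的 PyTorch CTC 用法示意(字符表大小、序列长度均为假设,与论文的 MSR/FRM/SGM 模块无关)。

```python
import torch
import torch.nn as nn

vocab_size, T, B = 37, 32, 4          # 36 个字符 + 1 个 blank(索引 0),假设值
logits = torch.randn(T, B, vocab_size).log_softmax(-1)   # 视觉模型 + 线性分类器的输出
targets = torch.randint(1, vocab_size, (B, 10))
input_lens = torch.full((B,), T, dtype=torch.long)
target_lens = torch.full((B,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(logits, targets, input_lens, target_lens)      # 训练目标

def greedy_decode(log_probs):
    """贪心 CTC 解码:逐帧 argmax,合并相邻重复并去掉 blank。"""
    ids = log_probs.argmax(-1).transpose(0, 1)            # (B, T)
    results = []
    for seq in ids.tolist():
        out, prev = [], None
        for i in seq:
            if i != 0 and i != prev:
                out.append(i)
            prev = i
        results.append(out)
    return results

print(loss.item(), greedy_decode(logits)[0][:5])
```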
zh
[CV-106] ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
【速读】: 该论文试图解决视觉-语言模型(如CLIP)在密集预测任务中的不足,特别是由于自注意力机制在最终块中的局限性,导致无法有效捕捉空间对应关系的问题。解决方案的关键在于提出了残差交叉相关自注意力(Residual Cross-correlation Self-attention, RCS)模块和语义反馈细化(Semantic Feedback Refinement, SFR)模块。RCS模块利用中间层的交叉相关自注意力来重塑最终块的注意力,从而有效重组空间信息,释放CLIP在密集视觉-语言推理中的定位潜力。SFR模块则通过语义分割图进一步调整注意力分数,增强对同一类别区域的关注和局部一致性。通过集成这两种策略,ResCLIP方法能够显著提升现有方法在密集视觉-语言推理任务中的性能,并在多个标准基准测试中超越了最先进的无训练方法。
链接: https://arxiv.org/abs/2411.15851
作者: Yuhang Yang,Jinhong Deng,Wen Li,Lixin Duan
关键词-EN: shown remarkable success, open-vocabulary tasks, image-level tasks, Residual Cross-correlation Self-attention, shown remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention, (e.g., query-query and key-key attention). However, these methods overlook the cross-correlation attention (query-key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP’s non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at this https URL.
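RCS 模块的核心想法是:用中间层 query-key 互相关得到的注意力图去重塑最后一个 block 的注意力,再作用到最后一层的 value 上。下面是一个假设性的张量级示意(中间层的选择、是否加残差等细节以论文为准)。

```python
import torch

def residual_cross_correlation_attention(qs, ks, v_final, scale):
    """qs, ks: 若干中间层的 query / key,均为 (N, D);v_final: 最后一层的 value (N, D)。
    用中间层 q-k 互相关注意力的平均值重塑最终注意力,输出 (N, D) 的密集特征(示意)。"""
    attns = []
    for q, k in zip(qs, ks):
        attn = torch.softmax(q @ k.t() * scale, dim=-1)   # 互相关(query-key)注意力
        attns.append(attn)
    remolded = torch.stack(attns, dim=0).mean(dim=0)      # 融合多个中间层
    return remolded @ v_final

N, D = 196, 768
qs = [torch.randn(N, D) for _ in range(3)]
ks = [torch.randn(N, D) for _ in range(3)]
v = torch.randn(N, D)
out = residual_cross_correlation_attention(qs, ks, v, scale=D ** -0.5)
print(out.shape)  # torch.Size([196, 768])
```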
zh
[CV-107] Unveiling the Superior Paradigm: A Comparative Study of Source-Free Domain Adaptation and Unsupervised Domain Adaptation
【速读】: 该论文试图解决在实际应用中,无监督领域自适应(Unsupervised Domain Adaptation, UDA)与无源领域自适应(Source-Free Domain Adaptation, SFDA)之间的性能比较问题。解决方案的关键在于通过预测编码理论和多基准数据集的广泛实验,证明了SFDA在实际场景中通常优于UDA,特别是在时间效率、存储需求、学习目标的针对性、减少负迁移风险和提高抗过拟合能力方面。此外,论文提出了一种新的数据-模型融合场景,并引入了一种新颖的权重估计方法,以有效整合可用的源数据到多SFDA(Multi-Source-Free Domain Adaptation, MSFDA)方法中,从而在该场景下提升模型性能。
链接: https://arxiv.org/abs/2411.15844
作者: Fan Wang,Zhongyi Han,Xingbo Liu,Xin Gao,Yilong Yin
关键词-EN: Unsupervised Domain Adaptation, UDA versus SFDA, Unsupervised Domain, Source-Free Domain Adaptation, leverages pre-trained source
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
点击查看摘要
Abstract:In domain adaptation, there are two popular paradigms: Unsupervised Domain Adaptation (UDA), which aligns distributions using source data, and Source-Free Domain Adaptation (SFDA), which leverages pre-trained source models without accessing source data. Evaluating the superiority of UDA versus SFDA is an open and timely question with significant implications for deploying adaptive algorithms in practical applications. In this study, we demonstrate through predictive coding theory and extensive experiments on multiple benchmark datasets that SFDA generally outperforms UDA in real-world scenarios. Specifically, SFDA offers advantages in time efficiency, storage requirements, targeted learning objectives, reduced risk of negative transfer, and increased robustness against overfitting. Notably, SFDA is particularly effective in mitigating negative transfer when there are substantial distribution discrepancies between source and target domains. Additionally, we introduce a novel data-model fusion scenario, where data sharing among stakeholders varies (e.g., some provide raw data while others provide only models), and reveal that traditional UDA and SFDA methods do not fully exploit their potential in this context. To address this limitation and capitalize on the strengths of SFDA, we propose a novel weight estimation method that effectively integrates available source data into multi-SFDA (MSFDA) approaches, thereby enhancing model performance within this scenario. This work provides a thorough analysis of UDA versus SFDA and advances a practical approach to model adaptation across diverse real-world environments.
zh
[CV-108] Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
【速读】: 该论文试图解决在利用流变换器(flow transformer)进行无调优图像编辑时,现有的扩散反演(diffusion inversion)方法在流模型中表现不佳,以及不变性控制机制无法协调多种刚性和非刚性编辑任务的问题。解决方案的关键在于:1) 提出了一种两阶段反演方法,首先优化速度估计,然后补偿剩余误差,以更接近模型先验并有利于编辑;2) 提出了通过在自适应层归一化(adaptive layer normalization)中操纵文本特征的不变性控制机制,将文本提示的变化与图像语义连接起来,从而在保留非目标内容的同时,实现刚性和非刚性编辑,支持多种编辑类型,如视觉文本、数量、面部表情等。
链接: https://arxiv.org/abs/2411.15843
作者: Pengcheng Xu,Boyuan Jiang,Xiaobin Hu,Donghao Luo,Qingdong He,Jiangning Zhang,Chengjie Wang,Yunsheng Wu,Charles Ling,Boyu Wang
关键词-EN: requires authentic inversion, Leveraging the large, editing requires authentic, large generative prior, invariance control
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model’s domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the \textbfinversion and invariance control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.
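流模型的反演可以看作对 ODE dx/dt = v(x, t) 的反向欧拉积分;论文的两阶段反演先细化速度估计、再补偿残余误差。下面给出"欧拉反演 + 中点式速度细化"的一个假设性草图,v_theta 代表任意速度预测网络,细化策略只是一种可能的实现方式。

```python
import torch

def euler_inversion(x1, v_theta, num_steps=50, refine=True):
    """把干净图像 x1 反演回噪声端:时间从 1 走到 0,每步减去速度。
    refine=True 时先做半步预测、再在中点重估速度(假设性的细化策略)。"""
    dt = 1.0 / num_steps
    x = x1
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = v_theta(x, t)
        if refine:
            x_mid = x - 0.5 * dt * v          # 半步预测
            v = v_theta(x_mid, t - 0.5 * dt)  # 在中点重新估计速度
        x = x - dt * v
    return x

# 用一个线性"速度场"演示接口(真实场景中 v_theta 是流 Transformer)
v_theta = lambda x, t: 0.1 * x
x1 = torch.randn(1, 3, 64, 64)
print(euler_inversion(x1, v_theta).shape)
```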
zh
[CV-109] VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding
【速读】: 该论文试图解决大视觉-语言模型 (Large Vision-Language Models, LVLMs) 在多模态任务推理中产生的幻觉 (hallucination) 问题。幻觉表现为模型生成的响应看似合理,但实际上并未准确反映视觉内容。论文分析指出,视觉编码过程中的畸变是导致幻觉的重要原因,具体表现为视觉信息在从底层向输出层传播过程中逐渐失真。解决方案的关键在于提出了一种新的幻觉缓解方法——视觉层融合对比解码 (Visual Layer Fusion Contrastive Decoding, VaLiD)。该方法通过利用不确定性来指导选择视觉隐藏层,从而纠正视觉编码过程中的畸变,提高生成文本的可靠性。实验结果表明,VaLiD在多个基准测试中有效减少了幻觉现象,达到了最先进的性能。
链接: https://arxiv.org/abs/2411.15839
作者: Jiaqi Wang,Yifei Gao,Jitao Sang
关键词-EN: Large Vision-Language Models, Large Vision-Language, multimodal task reasoning, demonstrated outstanding performance, demonstrated outstanding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in multimodal task reasoning. However, they often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods that mitigate hallucinations by adjusting the decoding strategy during inference stage, typically attributing hallucination to the language model itself. Our analysis, however, reveals that distortions in the visual encoding process significantly affect the model’s reasoning accuracy. Specifically, earlier visual layers may retain key features but gradually distort as the information propagates toward the output layer. Building on these findings, we propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD). This method utilizes uncertainty to guide the selection of visual hidden layers, correcting distortions in the visual encoding process and thereby improving the reliability of generated text. Experimental results show that VaLiD effectively reduces hallucinations across various benchmarks, achieving state-of-the-art performance compared to multiple baseline methods.
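VaLiD 的解码思想可以理解为一种对比解码:用"可靠视觉层"与"失真视觉层"各自得到的下一个 token 分布做对比,放大两者差异以抑制幻觉。下面给出通用对比解码打分的假设性示意(α 的取值、"按熵选层"的准则均非论文原文)。

```python
import torch

def contrastive_decoding_logits(logits_good, logits_distorted, alpha=1.0):
    """logits_good / logits_distorted: (vocab,) 两套视觉输入下的下一 token logits。
    返回 (1+alpha)*good - alpha*distorted 的对比 logits(常见对比解码形式,仅示意)。"""
    return (1 + alpha) * logits_good - alpha * logits_distorted

def select_layer_by_uncertainty(layer_logits):
    """在多个视觉层得到的 logits 中选熵最小(最确定)的一层作为"可靠层"(假设性准则)。"""
    entropies = []
    for lg in layer_logits:
        p = torch.softmax(lg, dim=-1)
        entropies.append(-(p * torch.log(p + 1e-9)).sum())
    return int(torch.stack(entropies).argmin())

vocab = 32000
layers = [torch.randn(vocab) for _ in range(4)]
good = layers[select_layer_by_uncertainty(layers)]
final = contrastive_decoding_logits(good, layers[0])
print(final.shape)
```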
zh
[CV-110] Modality Alignment Meets Federated Broadcasting
【速读】: 该论文试图解决在联邦学习(Federated Learning, FL)中,由于数据异质性导致的模型收敛性下降和计算成本增加的问题。解决方案的关键在于引入了一种新的联邦学习框架,通过模态对齐(modality alignment)来处理跨客户端的学习。具体来说,该框架在服务器端部署文本编码器,而在本地设备上运行图像编码器,借鉴了多模态学习(multi-modal learning)如CLIP的范式,将服务器与客户端之间的通信类比为多模态广播。此外,通过使用预训练模型和低秩适应(Low-Rank Adaptation, LoRA)更新部分参数,既减少了过拟合风险,又满足了计算需求和性能效率。本地模型独立训练并将其更新传递给服务器,服务器通过基于查询的方法聚合参数,从而促进跨客户端的知识共享和在极端异质性下的性能提升。
链接: https://arxiv.org/abs/2411.15837
作者: Yuting Ma,Shengeng Tang,Xiaohua Xu,Lechao Cheng
关键词-EN: safeguard data privacy, distributed edge devices, Federated learning, centralizing local data, powerful approach
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Federated learning (FL) has emerged as a powerful approach to safeguard data privacy by training models across distributed edge devices without centralizing local data. Despite advancements in homogeneous data scenarios, maintaining performance between the global and local clients in FL over heterogeneous data remains challenging due to data distribution variations that degrade model convergence and increase computational costs. This paper introduces a novel FL framework leveraging modality alignment, where a text encoder resides on the server, and image encoders operate on local devices. Inspired by multi-modal learning paradigms like CLIP, this design aligns cross-client learning by treating server-client communications akin to multi-modal broadcasting. We initialize with a pre-trained model to mitigate overfitting, updating select parameters through low-rank adaptation (LoRA) to meet computational demand and performance efficiency. Local models train independently and communicate updates to the server, which aggregates parameters via a query-based method, facilitating cross-client knowledge sharing and performance improvement under extreme heterogeneity. Extensive experiments on benchmark datasets demonstrate the efficacy in maintaining generalization and robustness, even in highly heterogeneous settings.
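摘要中"只通过 LoRA 更新部分参数"的做法可以用几行 PyTorch 说明:冻结预训练权重 W,只训练低秩增量 B·A,客户端与服务器只需交换这部分小参数。以下是通用 LoRA 线性层的极简示意(秩 r 与缩放系数为常见默认值,并非论文设定)。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha/r) * B A x,其中 W 冻结,仅 A、B 参与训练。"""
    def __init__(self, in_dim, out_dim, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)           # 冻结预训练权重
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, r))   # 零初始化,训练起点等于原模型
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 仅 A 与 B 的参数量:8*768 + 768*8 = 12288
```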
zh
[CV-111] FastTrackTr:Towards Fast Multi-Object Tracking with Transformers
【速读】: 该论文试图解决基于Transformer的多目标跟踪(MOT)方法在推理速度上的瓶颈问题。解决方案的关键在于重新审视并改进传统的联合检测与跟踪(JDT, Joint Detection and Tracking)方法,通过在DETR框架中引入高效的信息传递机制,构建了一个名为FastTrackTr的新型JDT-type MOT框架。这一信息传递机制不仅减少了跟踪过程中所需的查询数量,还避免了过度引入网络结构,从而在保证模型简洁性的同时,提升了推理速度,使其具备实现实时跟踪的潜力,并在多个数据集上展示了竞争性的跟踪精度。
链接: https://arxiv.org/abs/2411.15811
作者: Pan Liao,Feng Yang,Di Wu,Jinwen Yu,Wenhui Zhao,Bo Liu
关键词-EN: Transformer-based multi-object tracking, Transformer-based multi-object, recent years, captured the attention, researchers in recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Transformer-based multi-object tracking (MOT) methods have captured the attention of many researchers in recent years. However, these models often suffer from slow inference speeds due to their structure or other issues. To address this problem, we revisited the Joint Detection and Tracking (JDT) method by looking back at past approaches. By integrating the original JDT approach with some advanced theories, this paper employs an efficient method of information transfer between frames on the DETR, constructing a fast and novel JDT-type MOT framework: FastTrackTr. Thanks to the superiority of this information transfer method, our approach not only reduces the number of queries required during tracking but also avoids the excessive introduction of network structures, ensuring model simplicity. Experimental results indicate that our method has the potential to achieve real-time tracking and exhibits competitive tracking accuracy across multiple datasets.
zh
[CV-112] LRSAA: Large-scale Remote Sensing Image Target Recognition and Automatic Annotation
【速读】: 该论文试图解决在大面积遥感图像中进行物体识别和自动标注的问题,提出了名为LRSAA的方法。解决方案的关键在于通过集成学习将YOLOv11和MobileNetV3-SSD物体检测算法相结合,以提升模型性能。此外,采用泊松盘采样分割技术和EIOU度量来优化分割图像的训练和推理过程,并通过结果集成进一步提高效率。这种方法不仅减少了计算资源的消耗,还在准确性和速度之间实现了良好的平衡。
链接: https://arxiv.org/abs/2411.15808
作者: Yujuan Zhu,Wuzheng Dong
关键词-EN: images called LRSAA, large-area remote sensing, remote sensing images, sensing images called, called LRSAA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2411.07802
点击查看摘要
Abstract:This paper presents a method for object recognition and automatic labeling in large-area remote sensing images called LRSAA. The method integrates YOLOv11 and MobileNetV3-SSD object detection algorithms through ensemble learning to enhance model performance. Furthermore, it employs Poisson disk sampling segmentation techniques and the EIOU metric to optimize the training and inference processes of segmented images, followed by the integration of results. This approach not only reduces the demand for computational resources but also achieves a good balance between accuracy and speed. The source code for this project has been made publicly available on this https URL.
zh
[CV-113] PG-SLAM: Photo-realistic and Geometry-aware RGB-D SLAM in Dynamic Environments
【速读】: 该论文试图解决动态环境中同时定位与地图构建(SLAM)的问题,特别是在处理动态物体时如何实现高质量的场景重建和相机定位。解决方案的关键在于提出了一种基于高斯光栅化的RGB-D SLAM方法,该方法通过三个主要模块来实现:1) 动态前景(包括非刚性人体和刚性物体)的建图;2) 静态背景的重建;3) 相机定位。关键技术包括利用几何和外观约束对动态物体进行建模,通过优化策略整合相邻局部地图的外观约束,以及利用静态背景和动态前景来增加噪声补偿的观测数据。通过结合3D高斯与2D光流和像素块的关联,该方法在几何和外观约束上进行了深入探索,从而在相机定位和场景表示方面优于现有最先进的方法。
链接: https://arxiv.org/abs/2411.15800
作者: Haoang Li,Xiangqi Meng,Xingxing Zuo,Zhe Liu,Hesheng Wang,Daniel Cremers
关键词-EN: achieved impressive performance, Simultaneous localization, achieved impressive, impressive performance, SLAM
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Simultaneous localization and mapping (SLAM) has achieved impressive performance in static environments. However, SLAM in dynamic environments remains an open question. Many methods directly filter out dynamic objects, resulting in incomplete scene reconstruction and limited accuracy of camera localization. The other works express dynamic objects by point clouds, sparse joints, or coarse meshes, which fails to provide a photo-realistic representation. To overcome the above limitations, we propose a photo-realistic and geometry-aware RGB-D SLAM method by extending Gaussian splatting. Our method is composed of three main modules to 1) map the dynamic foreground including non-rigid humans and rigid items, 2) reconstruct the static background, and 3) localize the camera. To map the foreground, we focus on modeling the deformations and/or motions. We consider the shape priors of humans and exploit geometric and appearance constraints of humans and items. For background mapping, we design an optimization strategy between neighboring local maps by integrating appearance constraint into geometric alignment. As to camera localization, we leverage both static background and dynamic foreground to increase the observations for noise compensation. We explore the geometric and appearance constraints by associating 3D Gaussians with 2D optical flows and pixel patches. Experiments on various real-world datasets demonstrate that our method outperforms state-of-the-art approaches in terms of camera localization and scene representation. Source codes will be publicly available upon paper acceptance.
zh
[CV-114] Symmetric Perception and Ordinal Regression for Detecting Scoliosis Natural Image
【速读】: 该论文试图解决青少年脊柱侧弯(scoliosis)的广泛筛查问题,传统方法依赖于放射性检查,需要专业医疗设备和专家,且存在辐射风险。论文提出的解决方案关键在于利用人体背部的自然图像,通过双路径脊柱侧弯检测网络(dual-path scoliosis detection network)来实现。该网络包含两个主要模块:对称特征匹配模块(Symmetric Feature Matching Module, SFMM)和序数回归头(Ordinal Regression Head, ORH)。SFMM用于捕捉输入图像与其水平翻转图像之间的对称关系,而ORH则将序数回归问题转化为一系列二分类子问题。实验结果表明,该方法在脊柱侧弯严重程度的粗略和细粒度估计上均优于现有方法和人类表现,分别为95.11%和81.46%的准确率,为广泛筛查提供了有前景且经济的解决方案。
链接: https://arxiv.org/abs/2411.15799
作者: Xiaojia Zhu,Rui Chen,Xiaoqi Guo,Zhiwen Shao,Yuhu Dai,Ming Zhang,Chuandong Lang
关键词-EN: diseases in adolescents, common diseases, Scoliosis, wide-range scoliosis screening, scoliosis screening
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by Applied Intelligence
点击查看摘要
Abstract:Scoliosis is one of the most common diseases in adolescents. Traditional screening methods for the scoliosis usually use radiographic examination, which requires certified experts with medical instruments and brings the radiation risk. Considering such requirement and inconvenience, we propose to use natural images of the human back for wide-range scoliosis screening, which is a challenging problem. In this paper, we notice that the human back has a certain degree of symmetry, and asymmetrical human backs are usually caused by spinal lesions. Besides, scoliosis severity levels have ordinal relationships. Taking inspiration from this, we propose a dual-path scoliosis detection network with two main modules: symmetric feature matching module (SFMM) and ordinal regression head (ORH). Specifically, we first adopt a backbone to extract features from both the input image and its horizontally flipped image. Then, we feed the two extracted features into the SFMM to capture symmetric relationships. Finally, we use the ORH to transform the ordinal regression problem into a series of binary classification sub-problems. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods as well as human performance, which provides a promising and economic solution to wide-range scoliosis screening. In particular, our method achieves accuracies of 95.11% and 81.46% in estimation of general severity level and fine-grained severity level of the scoliosis, respectively.
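摘要中的 ORH 把序数回归拆成一系列二分类子问题:对 K 个严重程度等级训练 K-1 个"是否超过等级 k"的二分类器,推理时统计有多少个分类器输出"超过"。下面是这一经典做法的 PyTorch 示意(等级数与特征维度为假设值)。

```python
import torch
import torch.nn as nn

class OrdinalRegressionHead(nn.Module):
    """K 个等级 -> K-1 个二分类 logits;标签 y 转成累计二值目标 [y > k]。"""
    def __init__(self, feat_dim, num_levels=4):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_levels - 1)

    def forward(self, feat):
        return self.fc(feat)                             # (B, K-1) 个二分类 logits

    @staticmethod
    def targets(y, num_levels=4):
        ks = torch.arange(num_levels - 1, device=y.device)
        return (y.unsqueeze(1) > ks).float()             # (B, K-1)

    @staticmethod
    def predict(logits):
        return (torch.sigmoid(logits) > 0.5).sum(dim=1)  # 预测等级 = 通过的阈值个数

head = OrdinalRegressionHead(512)
feat, y = torch.randn(8, 512), torch.randint(0, 4, (8,))
logits = head(feat)
loss = nn.functional.binary_cross_entropy_with_logits(logits, head.targets(y))
print(loss.item(), head.predict(logits).shape)
```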
zh
[CV-115] Multi-Token Enhancing for Vision Representation Learning
【速读】: 该论文试图解决传统集成学习策略在视觉表示学习(尤其是自监督学习)中不切实际的问题,因为这些策略需要k倍的训练和推理计算成本。论文提出的解决方案是引入多标记增强(Multi-Token Enhancing, MTE),通过从单个模型中同时提取多个辅助标记(包括辅助CLS标记和自适应池化标记)来增强表示学习,同时仅增加极少的额外训练成本且不增加推理成本。这些辅助标记由于其差异性能够捕捉互补信息。此外,为了应对推理成本的增加,论文提出在预训练期间将辅助标记获得的知识蒸馏到全局标记中,从而在推理时可以丢弃辅助标记而不增加额外成本。MTE方法兼容各种自监督损失函数和架构,并在不同下游任务中持续提升性能。
链接: https://arxiv.org/abs/2411.15787
作者: Zhong-Yu Li,Yu-Song Hu,Bo-Wen Yin,Ming-Ming Cheng
关键词-EN: vision applications, representation learning, learning, Vision representation learning, Vision
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision representation learning, especially self-supervised learning, is pivotal for various vision applications. Ensemble learning has also succeeded in enhancing the performance and robustness of the vision models. However, traditional ensemble strategies are impractical for representation learning, especially self-supervised representation learning that requires large-scale datasets and long schedules. This is because they require k times more training and inference computation costs for an ensemble of k models. Differently, we introduce Multi-Token Enhancing (MTE) that extracts multiple auxiliary tokens simultaneously from a single model to enhance representation learning, while incurring minimal additional training costs and no additional inference costs. These auxiliary tokens, including auxiliary CLS tokens and adaptively pooled tokens, capture complementary information due to their differences. Meanwhile, to address the increase in inference costs, we distill the knowledge acquired by the auxiliary tokens into a global token during pre-training. Consequently, we can discard the auxiliary tokens during inference without incurring additional costs. Our MTE is compatible with various self-supervised loss functions and architectures, consistently improving performances across different downstream tasks. Our source code will be made publicly available.
zh
[CV-116] ZeroGS: Training 3D Gaussian Splatting from Unposed Images
【速读】: 该论文试图解决从大量无序和未标定的图像中训练3D高斯喷射(3D Gaussian Splatting, 3DGS)模型的问题。解决方案的关键在于利用预训练的基础模型作为神经场景表示,并通过初始化种子图像和逐步注册新图像来微调模型。此外,通过最小化多视图点对相机射线一致性损失来优化相机姿态和点图,从而提高图像注册的准确性和图像渲染的质量。实验结果表明,该方法在恢复相机姿态和渲染图像质量方面优于现有的无姿态NeRF/3DGS方法。
链接: https://arxiv.org/abs/2411.15779
作者: Yu Chen,Rolandos Alexandros Potamias,Evangelos Ververas,Jifei Song,Jiankang Deng,Gim Hee Lee
关键词-EN: Gaussian Splatting, Neural radiance fields, radiance fields, popular techniques, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 12 figures
点击查看摘要
Abstract:Neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS) are popular techniques to reconstruct and render photo-realistic images. However, the pre-requisite of running Structure-from-Motion (SfM) to get camera poses limits their completeness. While previous methods can reconstruct from a few unposed images, they are not applicable when images are unordered or densely captured. In this work, we propose ZeroGS to train 3DGS from hundreds of unposed and unordered images. Our method leverages a pretrained foundation model as the neural scene representation. Since the accuracy of the predicted pointmaps does not suffice for accurate image registration and high-fidelity image rendering, we propose to mitigate the issue by initializing and finetuning the pretrained model from a seed image. Images are then progressively registered and added to the training buffer, which is further used to train the model. We also propose to refine the camera poses and pointmaps by minimizing a point-to-camera ray consistency loss across multiple views. Experiments on the LLFF dataset, the MipNeRF360 dataset, and the Tanks-and-Temples dataset show that our method recovers more accurate camera poses than state-of-the-art pose-free NeRF/3DGS methods, and even renders higher quality images than 3DGS with COLMAP poses. Our project page is available at this https URL.
zh
[CV-117] Context-Aware Detection of Mixed Critical Events using Video Classification
【速读】: 该论文试图解决通过计算机视觉检测混合关键事件(mixed-critical events)的挑战,特别是需要上下文理解来准确评估事件的关键性。解决方案的关键在于提出了一种适用于智能城市应用的多功能检测系统,该系统能够在交通和火灾检测场景中进行测试,并具备适应不同应用需求的灵活性。论文的主要贡献包括对检测需求的分析以及开发出能够适应多样化应用的系统,从而推动智能城市的自动化监控技术。
链接: https://arxiv.org/abs/2411.15773
作者: Filza Akhlaq,Alina Arshad,Muhammad Yehya Hayati,Jawwad A. Shamsi,Muhammad Burhan Khan
关键词-EN: Detecting mixed-critical events, Detecting mixed-critical, event criticality accurately, assess event criticality, criticality accurately
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Detecting mixed-critical events through computer vision is challenging due to the need for contextual understanding to assess event criticality accurately. Mixed critical events, such as fires of varying severity or traffic incidents, demand adaptable systems that can interpret context to trigger appropriate responses. This paper addresses these challenges by proposing a versatile detection system for smart city applications, offering a solution tested across traffic and fire detection scenarios. Our contributions include an analysis of detection requirements and the development of a system adaptable to diverse applications, advancing automated surveillance for smart cities.
zh
[CV-118] Corner2Net: Detecting Objects as Cascade Corners ECAI2024
【速读】: 该论文试图解决基于角点检测范式中的三个主要问题:1) 角点匹配困难,启发式角点匹配算法容易导致错误框,特别是在相似物体共存时;2) 实例上下文信息不足,两个独立的角点保留的实例语义信息较少,难以保证在同一热图通道上获取两个类别特定的角点;3) 不友好的骨干网络,沙漏网络的训练成本高。解决方案的关键在于构建了一个名为Corner2Net的新型角点检测框架,通过设计级联角点管道(cascade corner pipeline),逐步预测关联的角点对,而不是通过并行头同步搜索两个独立的角点。Corner2Net将角点定位与目标分类解耦,两个角点均为类别无关,实例特定的右下角点进一步简化了搜索空间。同时,提取具有丰富语义的RoI特征进行分类,并可轻松连接流行的骨干网络(如ResNeXt)。实验结果表明,Corner2Net在COCO数据集上在准确性和速度方面均显著超越了现有的基于角点的检测器。
链接: https://arxiv.org/abs/2411.15772
作者: Chenglong Liu,Jintao Liu,Haorao Wei,Jinze Yang,Liangyu Xu,Yuchen Guo,Lu Fang
关键词-EN: detection paradigm enjoys, produce high-quality boxes, corner-based detection paradigm, detection paradigm, paradigm enjoys
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by 27th EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2024)
点击查看摘要
Abstract:The corner-based detection paradigm enjoys the potential to produce high-quality boxes. But the development is constrained by three factors: 1) Hard to match corners. Heuristic corner matching algorithms can lead to incorrect boxes, especially when similar-looking objects co-occur. 2) Poor instance context. Two separate corners preserve few instance semantics, so it is difficult to guarantee getting both two class-specific corners on the same heatmap channel. 3) Unfriendly backbone. The training cost of the hourglass network is high. Accordingly, we build a novel corner-based framework, named Corner2Net. To achieve the corner-matching-free manner, we devise the cascade corner pipeline which progressively predicts the associated corner pair in two steps instead of synchronously searching two independent corners via parallel heads. Corner2Net decouples corner localization and object classification. Both two corners are class-agnostic and the instance-specific bottom-right corner further simplifies its search space. Meanwhile, RoI features with rich semantics are extracted for classification. Popular backbones (e.g., ResNeXt) can be easily connected to Corner2Net. Experimental results on COCO show Corner2Net surpasses all existing corner-based detectors by a large margin in accuracy and speed.
zh
[CV-119] xt-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering
【速读】: 该论文试图解决在光学传感器成像受限的挑战条件下(如云覆盖和低光场景),遥感视觉问答(RSVQA)性能下降的问题。解决方案的关键在于提出了一种文本引导的粗到细融合网络(Text-guided Coarse-to-Fine Fusion Network, TGFNet),通过利用问题文本与多源图像之间的语义关系,在特征层面进行互补融合。具体来说,论文开发了文本引导的粗到细注意力优化模块(Text-guided Coarse-to-Fine Attention Refinement, CFAR),通过关键区域路由逐步从广泛区域聚焦到细节,增强模型对相关区域的注意力。此外,提出了自适应多专家融合模块(Adaptive Multi-Expert Fusion, AMEF),动态集成不同专家,实现光学与SAR特征的自适应融合。论文还创建了首个大规模光学-SAR RSVQA评估基准数据集,包含6,008对对齐的光学-SAR图像和1,036,694个标注的问答对,涵盖16种多样的问题类型。实验结果表明,TGFNet在挑战场景下显著提升了模型的性能。
链接: https://arxiv.org/abs/2411.15770
作者: Zhicheng Zhao,Changfu Zhou,Yu Zhang,Chenglong Li,Xiaoliang Ma,Jin Tang
关键词-EN: significant research interest, gained significant research, Remote Sensing Visual, Sensing Visual Question, Remote Sensing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Remote Sensing Visual Question Answering (RSVQA) has gained significant research interest. However, current RSVQA methods are limited by the imaging mechanisms of optical sensors, particularly under challenging conditions such as cloud-covered and low-light scenarios. Given the all-time and all-weather imaging capabilities of Synthetic Aperture Radar (SAR), it is crucial to investigate the integration of optical-SAR images to improve RSVQA performance. In this work, we propose a Text-guided Coarse-to-Fine Fusion Network (TGFNet), which leverages the semantic relationships between question text and multi-source images to guide the network toward complementary fusion at the feature level. Specifically, we develop a Text-guided Coarse-to-Fine Attention Refinement (CFAR) module to focus on key areas related to the question in complex remote sensing images. This module progressively directs attention from broad areas to finer details through key region routing, enhancing the model’s ability to focus on relevant regions. Furthermore, we propose an Adaptive Multi-Expert Fusion (AMEF) module that dynamically integrates different experts, enabling the adaptive fusion of optical and SAR features. In addition, we create the first large-scale benchmark dataset for evaluating optical-SAR RSVQA methods, comprising 6,008 well-aligned optical-SAR image pairs and 1,036,694 well-labeled question-answer pairs across 16 diverse question types, including complex relational reasoning questions. Extensive experiments on the proposed dataset demonstrate that our TGFNet effectively integrates complementary information between optical and SAR images, significantly improving the model’s performance in challenging scenarios. The dataset is available at: this https URL. Index Terms: Remote Sensing Visual Question Answering, Multi-source Data Fusion, Multimodal, Remote Sensing, OPT-SAR.
zh
[CV-120] Integrating Deep Metric Learning with Coreset for Active Learning in 3D Segmentation NEURIPS2024
【速读】: 该论文试图解决3D医学图像分割中标注数据需求量大、成本高的问题。解决方案的关键在于引入了一种新的度量学习方法,用于在3D医学分割中进行基于切片的主动学习 (Active Learning, AL)。通过将对比学习 (Contrastive Learning) 与医学影像中的固有数据分组相结合,该方法学习了一种度量,强调了样本间在训练3D医学分割模型时相关差异的重要性。这种方法不仅在弱标注和全标注的情况下均优于现有的主动学习技术,而且在低标注预算下也能获得优越的性能,这对于医学影像领域尤为重要。
链接: https://arxiv.org/abs/2411.15763
作者: Arvind Murari Vepa,Zukang Yang,Andrew Choi,Jungseock Joo,Fabien Scalzo,Yizhou Sun
关键词-EN: demands extensive annotated, Deep learning, extensive annotated data, remarkable advancements, advancements in machine
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To be published in NeurIPS 2024
点击查看摘要
Abstract:Deep learning has seen remarkable advancements in machine learning, yet it often demands extensive annotated data. Tasks like 3D semantic segmentation impose a substantial annotation burden, especially in domains like medicine, where expert annotations drive up the cost. Active learning (AL) holds great potential to alleviate this annotation burden in 3D medical segmentation. The majority of existing AL methods, however, are not tailored to the medical domain. While weakly-supervised methods have been explored to reduce annotation burden, the fusion of AL with weak supervision remains unexplored, despite its potential to significantly reduce annotation costs. Additionally, there is little focus on slice-based AL for 3D segmentation, which can also significantly reduce costs in comparison to conventional volume-based AL. This paper introduces a novel metric learning method for Coreset to perform slice-based active learning in 3D medical segmentation. By merging contrastive learning with inherent data groupings in medical imaging, we learn a metric that emphasizes the relevant differences in samples for training 3D medical segmentation models. We perform comprehensive evaluations using both weak and full annotations across four datasets (medical and non-medical). Our findings demonstrate that our approach surpasses existing active learning techniques on both weak and full annotations and obtains superior performance with low-annotation budgets which is crucial in medical imaging. Source code for this project is available in the supplementary materials and on GitHub: this https URL.
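Coreset 式主动学习通常用贪心 k-center 在嵌入空间里挑选与已标注集合"最远"的样本;论文进一步用度量学习改造该嵌入空间。下面给出通用 k-center 贪心选择的 numpy 示意,与论文的度量学习部分无关。

```python
import numpy as np

def k_center_greedy(embeddings, labeled_idx, budget):
    """embeddings: (N, D);labeled_idx: 已标注样本下标;budget: 新增标注数。
    每轮选择"到已标注集合的最小距离"最大的样本(贪心 k-center)。"""
    selected = list(labeled_idx)
    # 每个样本到已标注集合的最小距离
    dist = np.linalg.norm(embeddings[:, None] - embeddings[selected][None], axis=-1).min(axis=1)
    picked = []
    for _ in range(budget):
        idx = int(dist.argmax())
        picked.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=-1)
        dist = np.minimum(dist, new_d)
    return picked

emb = np.random.randn(1000, 64)     # 例如 2D 切片的嵌入
print(k_center_greedy(emb, labeled_idx=[0, 1, 2], budget=5))
```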
zh
[CV-121] MambaTrack: Exploiting Dual-Enhancement for Night UAV Tracking
【速读】: 该论文试图解决夜间无人飞行器(UAV)跟踪中由于光照不足导致的性能下降问题。解决方案的关键在于提出了一种基于mamba的高效跟踪器,利用双增强技术提升夜间UAV跟踪性能。具体来说,该方法包括一个mamba低光增强器,通过光照估计器和损伤修复器实现全局图像增强,同时保留低光图像的细节和结构。此外,还引入了一种跨模态mamba网络,实现视觉和语言模态之间的高效交互学习。实验结果表明,该方法在性能上显著优于现有方法,并且在计算和内存效率上也有显著提升。
链接: https://arxiv.org/abs/2411.15761
作者: Chunhui Zhang,Li Liu,Hao Wen,Xi Zhou,Yanfeng Wang
关键词-EN: unmanned aerial vehicle, Night unmanned aerial, night UAV tracking, boost night UAV, demonstrating suboptimal performance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Night unmanned aerial vehicle (UAV) tracking is impeded by the challenges of poor illumination, with previous daylight-optimized methods demonstrating suboptimal performance in low-light conditions, limiting the utility of UAV applications. To this end, we propose an efficient mamba-based tracker, leveraging dual enhancement techniques to boost night UAV tracking. The mamba-based low-light enhancer, equipped with an illumination estimator and a damage restorer, achieves global image enhancement while preserving the details and structure of low-light images. Additionally, we advance a cross-modal mamba network to achieve efficient interactive learning between vision and language modalities. Extensive experiments showcase that our method achieves advanced performance and exhibits significantly improved computation and memory efficiency. For instance, our method is 2.8x faster than CiteTracker and reduces GPU memory usage by 50.2%. Codes will be made publicly available.
zh
[CV-122] Advanced Learning-Based Inter Prediction for Future Video Coding
【速读】: 该论文试图解决在第四代音视频编码标准 (Audio Video coding Standard, AVS4) 中,传统帧间预测滤波器 (Inter Prediction Filter, INTERPF) 在处理预测与相邻重建像素间的不连续性时存在的复杂度和效率问题。解决方案的关键在于提出了一种基于学习的低复杂度帧间预测方法 (Low Complexity Learning-based Inter Prediction, LLIP),通过利用轻量级神经网络模型来替代传统的 INTERPF。具体来说,LLIP 通过提取传统 INTERPF 使用的像素和坐标来构建训练数据集,训练后导出神经网络的权重和偏置,实现无需第三方依赖的高效推理过程,从而在不依赖 Libtorch 的情况下无缝集成到视频编解码器中,显著提升了推理速度。最终,LLIP 用学习到的最优滤波参数替代了传统的手工设计的滤波参数,使得深度学习编码工具与传统视频编码方案的结合更加高效。实验结果表明,该方法在随机接入 (Random Access, RA) 配置下,分别在 Y、U 和 V 分量上平均获得了 0.01%、0.31% 和 0.25% 的编码增益。
链接: https://arxiv.org/abs/2411.15759
作者: Yanchen Zhao,Wenhong Duan,Chuanmin Jia,Shanshe Wang,Siwei Ma
关键词-EN: Inter Prediction Filter, fourth generation Audio, generation Audio Video, Inter Prediction, Prediction Filter
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In the fourth generation Audio Video coding Standard (AVS4), the Inter Prediction Filter (INTERPF) reduces discontinuities between prediction and adjacent reconstructed pixels in inter prediction. The paper proposes a low complexity learning-based inter prediction (LLIP) method to replace the traditional INTERPF. LLIP enhances the filtering process by leveraging a lightweight neural network model, where parameters can be exported for efficient inference. Specifically, we extract pixels and coordinates utilized by the traditional INTERPF to form the training dataset. Subsequently, we export the weights and biases of the trained neural network model and implement the inference process without any third-party dependency, enabling seamless integration into video codec without relying on Libtorch, thus achieving faster inference speed. Ultimately, we replace the traditional handcraft filtering parameters in INTERPF with the learned optimal filtering parameters. This practical solution makes the combination of deep learning encoding tools with traditional video encoding schemes more efficient. Experimental results show that our approach achieves 0.01%, 0.31%, and 0.25% coding gain for the Y, U, and V components under the random access (RA) configuration on average.
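LLIP 的一个工程要点是:训练后把轻量网络的权重与偏置导出为普通数组,推理时不依赖 Libtorch。下面用"训练用 PyTorch、推理用纯 numpy 前向"的方式示意这一流程;网络结构(两层 MLP、输入为像素值加坐标)是假设的,并非 AVS4 参考软件中的实际实现。

```python
import numpy as np
import torch
import torch.nn as nn

# 假设的轻量滤波网络:输入为 [预测像素, 相邻重建像素, 归一化坐标x, 坐标y]
net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

# 训练结束后把权重 / 偏置导出为 numpy 数组(可再写成 C 数组嵌入编解码器)
W1, b1, W2, b2 = [p.detach().numpy() for p in net.parameters()]

def infer_numpy(x):
    """无深度学习框架依赖的前向:两次矩阵乘加一次 ReLU。"""
    h = np.maximum(x @ W1.T + b1, 0.0)
    return h @ W2.T + b2

x = np.random.rand(8, 4).astype(np.float32)
with torch.no_grad():
    ref = net(torch.from_numpy(x)).numpy()
print(np.allclose(infer_numpy(x), ref, atol=1e-5))   # True:两种前向结果一致
```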
zh
[CV-123] PR-MIM: Delving Deeper into Partial Reconstruction in Masked Image Modeling
【速读】: 该论文试图解决在掩码图像建模(Masked Image Modeling)中由于部分重建(Partial Reconstruction)策略导致的计算成本降低与表示质量下降之间的矛盾。解决方案的关键在于提出了一种渐进重建策略(Progressive Reconstruction Strategy)和最远采样策略(Furthest Sampling Strategy),以极其轻量级的方式逐步重建被丢弃的掩码标记,而不是完全放弃它们。这种方法确保了所有掩码标记在预训练过程中得到充分的监督,同时保持了部分重建策略在降低计算成本方面的优势。通过这种方法,论文在丢弃50%的图像块时,能够在不损失性能的情况下,将ViT-B/16模型的计算量(FLOPs)减少28%,内存使用量减少36%。
链接: https://arxiv.org/abs/2411.15746
作者: Zhong-Yu Li,Yunheng Li,Deng-Ping Fan,Ming-Ming Cheng
关键词-EN: achieved great success, huge computational costs, Masked image modeling, modeling has achieved, achieved great
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Masked image modeling has achieved great success in learning representations but is limited by the huge computational costs. One cost-saving strategy makes the decoder reconstruct only a subset of masked tokens and throw the others, and we refer to this method as partial reconstruction. However, it also degrades the representation quality. Previous methods mitigate this issue by throwing tokens with minimal information using temporal redundancy inaccessible for static images or attention maps that incur extra costs and complexity. To address these limitations, we propose a progressive reconstruction strategy and a furthest sampling strategy to reconstruct those thrown tokens in an extremely lightweight way instead of completely abandoning them. This approach involves all masked tokens in supervision to ensure adequate pre-training, while maintaining the cost-reduction benefits of partial reconstruction. We validate the effectiveness of the proposed method across various existing frameworks. For example, when throwing 50% patches, we can achieve lossless performance of the ViT-B/16 while saving 28% FLOPs and 36% memory usage compared to standard MAE. Our source code will be made publicly available
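"最远采样"可以理解为在被丢弃的掩码 token 位置上做最远点采样,让轻量重建覆盖的空间位置尽量分散。下面是二维网格坐标上最远点采样的通用 numpy 示意(与论文的渐进重建权重无关,仅说明采样这一步)。

```python
import numpy as np

def farthest_point_sampling(coords, num_samples, start=0):
    """coords: (N, 2) 的 patch 坐标;每轮选取与已选集合最小距离最大的点。"""
    selected = [start]
    dist = np.linalg.norm(coords - coords[start], axis=1)
    for _ in range(num_samples - 1):
        idx = int(dist.argmax())
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(coords - coords[idx], axis=1))
    return selected

# 14x14 patch 网格中被掩码的一半位置里,再挑 20 个空间上最分散的位置(示例数字)
grid = np.stack(np.meshgrid(np.arange(14), np.arange(14)), -1).reshape(-1, 2).astype(float)
masked = grid[np.random.choice(196, 98, replace=False)]
print(farthest_point_sampling(masked, 20)[:5])
```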
zh
[CV-124] PEnG: Pose-Enhanced Geo-Localisation
【速读】: 该论文试图解决跨视角地理定位(Cross-view Geo-localisation)中由于密集采样导致的重叠问题,从而限制了定位精度的提升。解决方案的关键在于结合跨视角地理定位和相对姿态估计(relative pose estimation),通过开发PEnG系统,该系统首先预测查询图像在城市尺度图表示中最可能的边缘,然后在这些边缘内进行相对姿态估计以确定精确位置。PEnG首次利用跨视角地理定位数据集中的双重视角来将精度提升至亚米级,甚至达到厘米级。该方法显著提高了定位精度,相对于之前的工作,Top-5m检索的改进达到了213%,并将中位欧几里得距离误差从734米降低到22.77米。
链接: https://arxiv.org/abs/2411.15742
作者: Tavis Shore,Oscar Mendez,Simon Hadfield
关键词-EN: densely sampled satellite, patches overlap heavily, coarse granularity, typically performed, overlap heavily
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 pages, 6 figures
点击查看摘要
Abstract:Cross-view Geo-localisation is typically performed at a coarse granularity, because densely sampled satellite image patches overlap heavily. This heavy overlap would make disambiguating patches very challenging. However, by opting for sparsely sampled patches, prior work has placed an artificial upper bound on the localisation accuracy that is possible. Even a perfect oracle system cannot achieve accuracy greater than the average separation of the tiles. To solve this limitation, we propose combining cross-view geo-localisation and relative pose estimation to increase precision to a level practical for real-world application. We develop PEnG, a 2-stage system which first predicts the most likely edges from a city-scale graph representation upon which a query image lies. It then performs relative pose estimation within these edges to determine a precise position. PEnG presents the first technique to utilise both viewpoints available within cross-view geo-localisation datasets to enhance precision to a sub-metre level, with some examples achieving centimetre level accuracy. Our proposed ensemble achieves state-of-the-art precision, with relative Top-5m retrieval improvements on previous works of 213%, decreasing the median euclidean distance error by 96.90% from the previous best of 734m down to 22.77m when evaluating with 90 degree horizontal FOV images. Code will be made available: this http URL
zh
[CV-125] Proceedings of the 6th International Workshop on Reading Music Systems
【速读】: 该论文是第六届国际阅读音乐系统研讨会(WoRMS)的会议记录,旨在连接开发音乐阅读系统(如光学音乐识别 (Optical Music Recognition))的研究人员与其他可能从这些系统中受益的研究人员和实践者(如图书馆员或音乐学家)。研讨会关注的主题包括但不限于:音乐阅读系统、光学音乐识别、数据集和性能评估、音乐乐谱的图像处理、作者识别、音乐乐谱的创作、编辑、存储和展示系统、多模态系统、生成书面音乐的新输入方法、基于网络的音乐信息检索服务、应用和项目,以及与书面音乐相关的用例。解决方案的关键在于促进跨学科合作,推动音乐阅读系统的发展和应用,以满足不同领域的需求。
链接: https://arxiv.org/abs/2411.15741
作者: Jorge Calvo-Zaragoza,Alexander Pacha,Elona Shatri(Eds.)
关键词-EN: Optical Music Recognition, Reading Music Systems, Music reading systems, Web-based Music Information, Reading Music
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Proceedings edited by Jorge Calvo-Zaragoza, Alexander Pacha and Elona Shatri
点击查看摘要
Abstract:The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 6th International Workshop on Reading Music Systems, held Online on November 22nd 2024.
zh
[CV-126] LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration
【速读】: 该论文试图解决低光图像增强问题,解决方案的关键在于引入了一种新颖的网络架构LTCF-Net。该架构通过利用LAB和YUV两种颜色空间,有效地分离和处理图像的亮度和色度信息,同时结合Transformer架构以全面理解图像内容并保持计算效率。此外,论文还引入了一个傅里叶变换模块,用于在频域动态调整亮度通道,从而在不同区域均匀平衡亮度并消除背景噪声,显著提升图像的视觉质量。通过这些创新组件的结合,LTCF-Net在保持模型轻量化的同时,有效提高了低光图像的质量,并在多个评估指标和数据集上超越了当前最先进的方法。
链接: https://arxiv.org/abs/2411.15740
作者: Gaojing Zhang,Jinglun Feng
关键词-EN: network architecture designed, LAB and YUV, Unlike Retinex-based methods, network architecture, architecture designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce LTCF-Net, a novel network architecture designed for enhancing low-light images. Unlike Retinex-based methods, our approach utilizes two color spaces - LAB and YUV - to efficiently separate and process color information, by leveraging the separation of luminance from chromatic components in color images. In addition, our model incorporates the Transformer architecture to comprehensively understand image content while maintaining computational efficiency. To dynamically balance the brightness in output images, we also introduce a Fourier transform module that adjusts the luminance channel in the frequency domain. This mechanism could uniformly balance brightness across different regions while eliminating background noises, and thereby enhancing visual quality. By combining these innovative components, LTCF-Net effectively improves low-light image quality while keeping the model lightweight. Experimental results demonstrate that our method outperforms current state-of-the-art approaches across multiple evaluation metrics and datasets, achieving more natural color restoration and a balanced brightness distribution.
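为帮助理解“在频域动态调整亮度通道”这一模块,下面给出一个极简的numpy草图(非LTCF-Net官方实现):对亮度分量做FFT后放大低频幅值、保持高频不变;增益系数与半径均为演示用的假设,论文中应为可学习或动态决定的参数:

```python
# 示意性代码草图(非LTCF-Net官方实现):在频域对亮度通道做简单增益调整,
# 演示"傅里叶变换模块动态调节亮度通道"的思路。
import numpy as np

def rgb_to_y(img):
    # BT.601亮度分量, img为(H, W, 3), 取值[0,1]
    return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]

def fourier_luminance_adjust(y, low_gain=1.8, high_gain=1.0, radius=0.1):
    """对亮度y做FFT, 放大低频幅值(整体亮度), 高频(细节/噪声)不被放大。"""
    F = np.fft.fftshift(np.fft.fft2(y))
    fy = np.fft.fftshift(np.fft.fftfreq(y.shape[0]))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(y.shape[1]))[None, :]
    gain = np.where(np.sqrt(fy ** 2 + fx ** 2) < radius, low_gain, high_gain)
    y_out = np.real(np.fft.ifft2(np.fft.ifftshift(F * gain)))
    return np.clip(y_out, 0.0, 1.0)

img = np.random.rand(64, 64, 3) * 0.2      # 模拟低光图像
y_enhanced = fourier_luminance_adjust(rgb_to_y(img))
print(y_enhanced.mean() > rgb_to_y(img).mean())  # 亮度整体提升
```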
zh
[CV-127] AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
【速读】: 该论文试图解决基于自然语言指令的图像编辑中,现有模型在执行复杂用户指令时表现不佳的问题。主要原因是这些模型通常在低质量、编辑类型有限的数据上进行训练。论文提出的解决方案是引入一个名为AnyEdit的综合多模态指令编辑数据集,该数据集包含250万对高质量编辑样本,涵盖超过20种编辑类型和五个领域。关键在于通过初始数据多样性、自适应编辑过程和自动化编辑结果选择来确保数据集的多样性和质量。基于此数据集,论文进一步训练了一种新型的AnyEdit Stable Diffusion模型,该模型采用任务感知路由和可学习的任务嵌入,以实现统一的图像编辑。实验结果表明,AnyEdit数据集显著提升了基于扩散的编辑模型的性能,为开发支持人类创造力的指令驱动图像编辑模型提供了前景。
链接: https://arxiv.org/abs/2411.15738
作者: Qifan Yu,Wei Chow,Zhongqi Yue,Kaihang Pan,Yang Wu,Xiaoyang Wan,Juncheng Li,Siliang Tang,Hanwang Zhang,Yueting Zhuang
关键词-EN: natural language instructions, Instruction-based image editing, specific image elements, modify specific image, Instruction-based image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 41 pages, 24 figures
点击查看摘要
Abstract:Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity.
zh
[CV-128] Enhancing Few-Shot Out-of-Distribution Detection with Gradient Aligned Context Optimization
【速读】: 该论文试图解决少样本分布外检测(Few-shot out-of-distribution (OOD) detection)中存在的梯度冲突问题,即在分布内样本分类优化(ID classification optimization)与分布外正则化(OOD regularization)之间由于识别偏差导致的梯度冲突。解决方案的关键是提出了一种名为梯度对齐上下文优化(Gradient Aligned Context Optimization, GaCoOp)的方法,通过分解优化梯度来识别冲突发生的场景,并通过梯度投影技术缓解分布内样本中的冲突,同时优化提示(prompts),从而有效减轻了梯度冲突并显著提升了性能。
链接: https://arxiv.org/abs/2411.15736
作者: Baoshun Tong,Kaiyu Song,Hanjiang Lai
关键词-EN: detect OOD images, detect OOD, OOD images, OOD, OOD regularization
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Few-shot out-of-distribution (OOD) detection aims to detect OOD images from unseen classes with only a few labeled in-distribution (ID) images. To detect OOD images and classify ID samples, prior methods have been proposed by regarding the background regions of ID samples as the OOD knowledge and performing OOD regularization and ID classification optimization. However, the gradient conflict still exists between ID classification optimization and OOD regularization caused by biased recognition. To address this issue, we present Gradient Aligned Context Optimization (GaCoOp) to mitigate this gradient conflict. Specifically, we decompose the optimization gradient to identify the scenario when the conflict occurs. Then we alleviate the conflict in inner ID samples and optimize the prompts via leveraging gradient projection. Extensive experiments over the large-scale ImageNet OOD detection benchmark demonstrate that our GaCoOp can effectively mitigate the conflict and achieve great performance. Code will be available at this https URL.
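速读中提到的“梯度投影”可以用类似PCGrad的方式理解:当ID分类梯度与OOD正则梯度内积为负(即发生冲突)时,把前者投影到后者的正交补空间。下面是一个示意性草图(非GaCoOp官方实现,损失函数与参数均为演示假设):

```python
# 示意性代码草图(非GaCoOp官方实现):消解ID分类梯度与OOD正则梯度的冲突。
import torch

def project_conflicting(g_id: torch.Tensor, g_ood: torch.Tensor) -> torch.Tensor:
    """g_id, g_ood: 展平后的梯度向量; 返回消解冲突后的ID梯度。"""
    dot = torch.dot(g_id, g_ood)
    if dot < 0:  # 存在冲突
        g_id = g_id - dot / (g_ood.norm() ** 2 + 1e-12) * g_ood
    return g_id

# 用法示意:分别对两个损失求梯度后合并更新(以单个提示参数为例)
prompt = torch.randn(16, requires_grad=True)
loss_id  = (prompt ** 2).sum()          # 假设的ID分类损失
loss_ood = -(prompt.sum())              # 假设的OOD正则项
g_id  = torch.autograd.grad(loss_id,  prompt, retain_graph=True)[0]
g_ood = torch.autograd.grad(loss_ood, prompt)[0]
g = project_conflicting(g_id.flatten(), g_ood.flatten()).view_as(prompt) + g_ood
with torch.no_grad():
    prompt -= 1e-2 * g
```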
zh
[CV-129] Test-time Alignment-Enhanced Adapter for Vision-Language Models
【速读】: 该论文试图解决预训练视觉-语言模型(Vision-Language Models, VLMs)在测试阶段面临的分布偏移问题。解决方案的关键在于提出了一种新的测试时对齐增强适配器(Test-time Alignment-Enhanced Adapter, TAEA),通过在测试阶段利用测试样本训练适配器来调整文本特征,从而增强文本与图像的对齐预测。此外,论文还采用了来自TDA方法的负缓存(negative cache)作为增强模块,进一步提升了TAEA的性能。该方法在分布外基准和跨域基准上分别比现有的最先进测试时适应方法提升了0.75%和2.5%,同时保持了可接受的训练时间。
链接: https://arxiv.org/abs/2411.15735
作者: Baoshun Tong,Kaiyu Song,Hanjiang Lai
关键词-EN: attracted increasing attention, pre-trained vision-language models, vision-language models, addressing distribution shift, distribution shift
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Test-time adaptation with pre-trained vision-language models (VLMs) has attracted increasing attention for tackling the issue of distribution shift during the test phase. While prior methods have shown effectiveness in addressing distribution shift by adjusting classification logits, they are not optimal due to keeping text features unchanged. To address this issue, we introduce a new approach called Test-time Alignment-Enhanced Adapter (TAEA), which trains an adapter with test samples to adjust text features during the test phase. We can enhance the text-to-image alignment prediction by utilizing an adapter to adapt text features. Furthermore, we also propose to adopt the negative cache from TDA as enhancement module, which further improves the performance of TAEA. Our approach outperforms the state-of-the-art TTA method of pre-trained VLMs by an average of 0.75% on the out-of-distribution benchmark and 2.5% on the cross-domain benchmark, with an acceptable training time. Code will be available at this https URL.
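下面给出一个示意性草图说明“测试阶段用测试样本训练适配器以调整文本特征”的思路(非TAEA官方实现):用残差式MLP适配文本特征,并以熵最小化作为无监督目标;目标函数、特征维度与迭代步数均为演示用的假设:

```python
# 示意性代码草图(非TAEA官方实现):测试阶段用残差适配器调整文本特征。
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAdapter(nn.Module):
    def __init__(self, dim=512, ratio=0.2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim))
        self.ratio = ratio
    def forward(self, t):
        return F.normalize(t + self.ratio * self.mlp(t), dim=-1)  # 残差式调整

num_classes, dim = 10, 512
text_feat  = F.normalize(torch.randn(num_classes, dim), dim=-1)   # 冻结的类文本特征
image_feat = F.normalize(torch.randn(32, dim), dim=-1)            # 一批测试图像特征

adapter = TextAdapter(dim)
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
for _ in range(10):                       # 测试时少量迭代
    logits = 100.0 * image_feat @ adapter(text_feat).t()
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    opt.zero_grad(); entropy.backward(); opt.step()
```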
zh
[CV-130] DynamicAvatars: Accurate Dynamic Facial Avatars Reconstruction and Precise Editing with Diffusion Models
【速读】: 该论文试图解决在虚拟现实和电影制作中生成和编辑动态3D头部化身时面临的面部扭曲、头部运动不准确以及精细编辑能力有限的问题。解决方案的关键在于提出了DynamicAvatars模型,该模型通过视频片段和与面部位置及表情相关的参数生成逼真的动态3D头部化身。其核心创新包括:1) 基于提示的编辑模型,结合用户提供的提示和大语言模型(LLMs)导出的指导参数,实现精确编辑;2) 双跟踪框架,基于高斯平滑技术,确保编辑稳定性;3) 提示预处理模块,增强编辑稳定性;4) 专用GAN算法与控制模块的结合,生成精确的指导参数;5) 动态编辑策略,选择性利用特定训练数据集,提高模型在动态编辑任务中的效率和适应性。
链接: https://arxiv.org/abs/2411.15732
作者: Yangyang Qian,Yuan Sun,Yu Guo
关键词-EN: film production, virtual reality, reality and film, head avatars, Generating
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generating and editing dynamic 3D head avatars are crucial tasks in virtual reality and film production. However, existing methods often suffer from facial distortions, inaccurate head movements, and limited fine-grained editing capabilities. To address these challenges, we present DynamicAvatars, a dynamic model that generates photorealistic, moving 3D head avatars from video clips and parameters associated with facial positions and expressions. Our approach enables precise editing through a novel prompt-based editing model, which integrates user-provided prompts with guiding parameters derived from large language models (LLMs). To achieve this, we propose a dual-tracking framework based on Gaussian Splatting and introduce a prompt preprocessing module to enhance editing stability. By incorporating a specialized GAN algorithm and connecting it to our control module, which generates precise guiding parameters from LLMs, we successfully address the limitations of existing methods. Additionally, we develop a dynamic editing strategy that selectively utilizes specific training datasets to improve the efficiency and adaptability of the model for dynamic editing tasks.
zh
[CV-131] OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions
【速读】: 该论文试图解决现有动作识别视频数据集中遮挡数据不足的问题,这限制了模型的鲁棒性并阻碍了性能的持续提升。解决方案的关键在于构建了一个大规模的遮挡视频数据集OccludeNet,该数据集包含真实世界和合成遮挡场景视频,涵盖动态跟踪遮挡、静态场景遮挡和多视角交互遮挡,填补了现有数据的空白。论文进一步提出了Causal Action Recognition (CAR)框架,通过结构因果模型和反事实推理,增强关键演员信息,从而提高模型对遮挡的鲁棒性。这一框架的引入旨在激发对遮挡场景中因果关系的进一步探索,并促使重新评估类别间的关联,最终推动性能的可持续提升。
链接: https://arxiv.org/abs/2411.15729
作者: Guanyu Zhou,Wenxuan Liu,Wenxin Huang,Xuemei Jia,Xian Zhong,Chia-Wen Lin
关键词-EN: video datasets limits, impedes sustained performance, recognition video datasets, impedes sustained, action recognition video
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The lack of occlusion data in commonly used action recognition video datasets limits model robustness and impedes sustained performance improvements. We construct OccludeNet, a large-scale occluded video dataset that includes both real-world and synthetic occlusion scene videos under various natural environments. OccludeNet features dynamic tracking occlusion, static scene occlusion, and multi-view interactive occlusion, addressing existing gaps in data. Our analysis reveals that occlusion impacts action classes differently, with actions involving low scene relevance and partial body visibility experiencing greater accuracy degradation. To overcome the limitations of current occlusion-focused approaches, we propose a structural causal model for occluded scenes and introduce the Causal Action Recognition (CAR) framework, which employs backdoor adjustment and counterfactual reasoning. This framework enhances key actor information, improving model robustness to occlusion. We anticipate that the challenges posed by OccludeNet will stimulate further exploration of causal relations in occlusion scenarios and encourage a reevaluation of class correlations, ultimately promoting sustainable performance improvements. The code and full dataset will be released soon.
zh
[CV-132] GSurf: 3D Reconstruction via Signed Distance Fields with Direct Gaussian Supervision
【速读】: 该论文试图解决从多视角图像进行表面重建时,传统方法如神经辐射场 (NeRF) 中使用的有符号距离场 (SDF) 存在的训练和渲染速度慢的问题,以及3D高斯光栅化 (3DGS) 方法中由于深度数据噪声或缺失导致的表面不完整和碎片化问题。解决方案的关键在于提出了GSurf,一种端到端的方法,直接从高斯基元中学习有符号距离场。GSurf利用高斯光栅化进行渲染,避免了其他方法中冗余的体积渲染,从而在保持与神经隐式表面方法(如VolSDF和NeuS)相当的3D重建质量的同时,显著提高了训练和渲染速度。
链接: https://arxiv.org/abs/2411.15723
作者: Baixin Xu,Jiangbei Hu,Jiaze Li,Ying He
关键词-EN: multi-view images, core challenge, Neural Radiance Fields, Radiance Fields, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: see this https URL
点击查看摘要
Abstract:Surface reconstruction from multi-view images is a core challenge in 3D vision. Recent studies have explored signed distance fields (SDF) within Neural Radiance Fields (NeRF) to achieve high-fidelity surface reconstructions. However, these approaches often suffer from slow training and rendering speeds compared to 3D Gaussian splatting (3DGS). Current state-of-the-art techniques attempt to fuse depth information to extract geometry from 3DGS, but frequently result in incomplete reconstructions and fragmented surfaces. In this paper, we introduce GSurf, a novel end-to-end method for learning a signed distance field directly from Gaussian primitives. The continuous and smooth nature of SDF addresses common issues in the 3DGS family, such as holes resulting from noisy or missing depth data. By using Gaussian splatting for rendering, GSurf avoids the redundant volume rendering typically required in other GS and SDF integrations. Consequently, GSurf achieves faster training and rendering speeds while delivering 3D reconstruction quality comparable to neural implicit surface methods, such as VolSDF and NeuS. Experimental results across various benchmark datasets demonstrate the effectiveness of our method in producing high-fidelity 3D reconstructions.
zh
[CV-133] Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
【速读】: 该论文试图解决预训练视觉-语言模型(Vision-Language Models, VLMs)在面对对抗性攻击时的鲁棒性问题。解决方案的关键在于提出了一种名为“攻击链(Chain of Attack, CoA)”的新策略,该策略通过一系列中间攻击步骤,基于多模态语义更新迭代增强对抗样本的生成,从而实现更高的对抗转移性和效率。CoA方法特别强调了视觉和文本模态之间的语义关联,以优化对抗样本的生成和攻击性能。此外,论文还提出了一种统一的攻击成功率计算方法,用于自动化的规避评估。实验结果表明,CoA策略能够在不依赖受害者模型任何知识的情况下,仅通过黑盒攻击有效地误导模型生成目标响应,从而揭示了VLMs的潜在脆弱性,并为未来模型开发的安全性考虑提供了参考。
链接: https://arxiv.org/abs/2411.15720
作者: Peng Xie,Yequan Bie,Jianda Mao,Yangqiu Song,Yang Wang,Hao Chen,Kani Chen
关键词-EN: natural language understanding, Pre-trained vision-language models, Pre-trained vision-language, showcased remarkable performance, image captioning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario, demonstrate that our attacking strategy can effectively mislead models to generate targeted responses using only black-box attacks without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments.
zh
[CV-134] ROOT: VLM based System for Indoor Scene Understanding and Beyond
【速读】: 该论文试图解决视觉语言模型(Vision Language Models, VLMs)在室内场景中空间层次推理能力不足的问题。解决方案的关键在于引入ROOT系统,该系统基于VLM并结合GPT-4V进行迭代对象感知算法,以检测室内场景中的对象实体,并通过视觉基础模型获取场景的元信息(如边界框)。随后,提出了一种专门用于室内场景的VLM,即SceneVLM,能够生成空间层次的场景图并提供对象间的距离信息,从而增强对室内场景空间布局的理解。为训练SceneVLM,研究团队收集了超过61万张来自多个公开室内数据集的图像,并采用半自动化技术构建场景数据生成管道,以建立对象间的关系和估计距离。实验结果表明,ROOT系统在室内场景理解方面表现出色,并在3D场景生成和具身AI等下游应用中展现出有效性。
链接: https://arxiv.org/abs/2411.15714
作者: Yonghui Wang,Shi-Yong Chen,Zhenxing Zhou,Siyi Li,Haoran Li,Wengang Zhou,Houqiang Li
关键词-EN: Vision Language Models, experienced significant advancements, Vision Language, Language Models, significant advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. By utilizing this enriched data, we conduct various training recipes and finish SceneVLM. Our experiments demonstrate that ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at this https URL.
zh
[CV-135] Fixing the Perspective: A Critical Examination of Zero-1-to-3
【速读】: 该论文试图解决在图像到3D生成中的新视角合成问题,特别是在处理多张条件图像时,现有方法如Zero-1-to-3在生成一致且准确的新视角图像时面临的挑战。解决方案的关键在于对Zero-1-to-3中的跨注意力机制(cross-attention mechanism)在扩散2D条件UNet的空间变换器(Spatial Transformer)中的实现进行深入分析,并揭示了理论框架与实际实现之间的关键差异。论文提出了两项重要改进:一是修正跨注意力机制的实现,以有效利用图像条件上下文;二是增强架构,使其能够同时利用多个条件视图。这些改进有望提高新视角合成的连贯性和准确性。
链接: https://arxiv.org/abs/2411.15706
作者: Jack Yu,Xueying Jia,Charlie Sun,Prince Wang
关键词-EN: target view images, relative poses, conditioning images, fundamental challenge, multiple conditioning images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Novel view synthesis is a fundamental challenge in image-to-3D generation, requiring the generation of target view images from a set of conditioning images and their relative poses. While recent approaches like Zero-1-to-3 have demonstrated promising results using conditional latent diffusion models, they face significant challenges in generating consistent and accurate novel views, particularly when handling multiple conditioning images. In this work, we conduct a thorough investigation of Zero-1-to-3’s cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet. Our analysis reveals a critical discrepancy between Zero-1-to-3’s theoretical framework and its implementation, specifically in the processing of image-conditional context. We propose two significant improvements: (1) a corrected implementation that enables effective utilization of the cross-attention mechanism, and (2) an enhanced architecture that can leverage multiple conditional views simultaneously. Our theoretical analysis and preliminary results suggest potential improvements in novel view synthesis consistency and accuracy.
zh
[CV-136] Editable-DeepSC: Reliable Cross-Modal Semantic Communications for Facial Editing
【速读】: 该论文试图解决实时计算机视觉(Real-time Computer Vision, CV)应用中,特别是社交媒体上的语义面部编辑(Semantic Facial Editing)任务中,传统通信方式与实时CV任务需求不匹配的问题。解决方案的关键在于提出了一种名为Editable-DeepSC的新型跨模态语义通信方法。该方法通过联合编辑-信道编码(Joint Editing-Channel Coding, JECC),将编辑过程集成到通信链路中,以保留更多的语义互信息。此外,利用预训练的StyleGAN先验进行语义编码,以及通过模型微调实现信噪比(SNR)感知的信道编码,来应对动态信道噪声条件。这些创新使得Editable-DeepSC在保持高质量编辑效果的同时,显著节省了传输带宽,即使在高分辨率和分布外(Out-of-Distribution, OOD)设置下也能表现出色。
链接: https://arxiv.org/abs/2411.15702
作者: Bin Chen,Wenbo Yu,Qinshan Zhang,Shu-Tao Xia
关键词-EN: Real-time computer vision, computer vision, plays a crucial, crucial role, performance is highly
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
备注:
点击查看摘要
Abstract:Real-time computer vision (CV) plays a crucial role in various real-world applications, whose performance is highly dependent on communication networks. Nonetheless, the data-oriented characteristics of conventional communications often do not align with the special needs of real-time CV tasks. To alleviate this issue, the recently emerged semantic communications only transmit task-related semantic information and exhibit a promising landscape to address this problem. However, the communication challenges associated with Semantic Facial Editing, one of the most important real-time CV applications on social media, still remain largely unexplored. In this paper, we fill this gap by proposing Editable-DeepSC, a novel cross-modal semantic communication approach for facial editing. Firstly, we theoretically discuss different transmission schemes that separately handle communications and editings, and emphasize the necessity of Joint Editing-Channel Coding (JECC) via iterative attributes matching, which integrates editings into the communication chain to preserve more semantic mutual information. To compactly represent the high-dimensional data, we leverage inversion methods via pre-trained StyleGAN priors for semantic coding. To tackle the dynamic channel noise conditions, we propose SNR-aware channel coding via model fine-tuning. Extensive experiments indicate that Editable-DeepSC can achieve superior editings while significantly saving the transmission bandwidth, even under high-resolution and out-of-distribution (OOD) settings.
zh
[CV-137] owards RAW Object Detection in Diverse Conditions
【速读】: 该论文试图解决现有目标检测方法在复杂光照和天气条件下,由于使用压缩后的sRGB图像(从RAW数据通过图像信号处理(ISP)生成)而可能丢失关键信息的问题。解决方案的关键在于引入AODRaw数据集,该数据集包含7,785张高分辨率真实RAW图像,涵盖62个类别和135,601个标注实例,捕捉了9种不同光照和天气条件下的室内外场景。通过在RAW域上进行直接预训练,并利用从sRGB域预训练模型中提取的知识进行知识蒸馏(Knowledge Distillation),论文提出了一种在不依赖额外预处理模块的情况下,显著提升在多样和恶劣条件下目标检测性能的方法。
链接: https://arxiv.org/abs/2411.15678
作者: Zhong-Yu Li,Xin Jin,Boyuan Sun,Chun-Le Guo,Ming-Ming Cheng
关键词-EN: ISP originally designed, Existing object detection, data using ISP, ISP originally, Existing object
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing object detection methods often consider sRGB input, which was compressed from RAW data using ISP originally designed for visualization. However, such compression might lose crucial information for detection, especially under complex light and weather conditions. We introduce the AODRaw dataset, which offers 7,785 high-resolution real RAW images with 135,601 annotated instances spanning 62 categories, capturing a broad range of indoor and outdoor scenes under 9 distinct light and weather conditions. Based on AODRaw that supports RAW and sRGB object detection, we provide a comprehensive benchmark for evaluating current detection methods. We find that sRGB pre-training constrains the potential of RAW object detection due to the domain gap between sRGB and RAW, prompting us to directly pre-train on the RAW domain. However, it is harder for RAW pre-training to learn rich representations than sRGB pre-training due to the camera noise. To assist RAW pre-training, we distill the knowledge from an off-the-shelf model pre-trained on the sRGB domain. As a result, we achieve substantial improvements under diverse and adverse conditions without relying on extra pre-processing modules. Code and dataset are available at this https URL.
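论文提到用sRGB域预训练模型向RAW域预训练蒸馏知识。下面是一个极简的示意草图(非AODRaw官方实现):对同一场景的RAW/sRGB成对输入,让学生特征逼近冻结教师的特征;网络结构与损失形式均为演示假设,教师在此用随机初始化占位,实际应为sRGB域预训练模型:

```python
# 示意性代码草图(非AODRaw官方实现):从sRGB教师向RAW学生做特征蒸馏。
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
student = nn.Sequential(nn.Conv2d(4, 32, 3, 2, 1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in teacher.parameters():
    p.requires_grad_(False)               # 教师冻结

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
raw  = torch.randn(8, 4, 64, 64)          # 假设RAW为4通道(RGGB)
srgb = torch.randn(8, 3, 64, 64)          # 对应的sRGB图像(经ISP)

feat_t = teacher(srgb)
feat_s = student(raw)
loss_kd = 1.0 - F.cosine_similarity(feat_s, feat_t.detach(), dim=-1).mean()
opt.zero_grad(); loss_kd.backward(); opt.step()
```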
zh
[CV-138] Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment CVPR2024
【速读】: 该论文试图解决在自监督训练的视觉-语言模型中,由于使用从网络抓取的大规模数据集而导致的潜在安全威胁,如后门攻击和中毒攻击。解决方案的关键在于利用语言模型提取的外部知识来防止模型学习与外部知识不强相关的图像区域之间的关联。具体来说,通过施加约束,确保模型对视觉区域的注意力与其与外部知识的对齐程度成正比,从而有效防御此类攻击,同时保持模型效用,且无需在推理时进行任何更改。
链接: https://arxiv.org/abs/2411.15673
作者: Alvi Md Ishmam,Christopher Thomas
关键词-EN: self-supervised objectives, enormous interest, external knowledge, attacks, vision-language models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2024
点击查看摘要
Abstract:In recent years there has been enormous interest in vision-language models trained using self-supervised objectives. However, the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats, such as backdooring and poisoning attacks. In this paper, we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions which lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge. We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings, while maintaining model utility and without requiring any changes at inference time
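速读中“注意力与外部知识对齐程度成正比”的约束,可以用如下示意性损失来理解(非Semantic Shield官方实现,对齐分数的来源与温度参数均为假设):

```python
# 示意性代码草图(非Semantic Shield官方实现):用KL散度约束模型对各视觉区域的
# 注意力分布接近由外部知识对齐分数归一化得到的目标分布。
import torch
import torch.nn.functional as F

def knowledge_alignment_loss(attn, align_score, tau=0.5):
    """attn: (B, R) 模型对R个图像区域的注意力权重(每行和为1);
    align_score: (B, R) 各区域与外部知识(由语言模型抽取)的对齐分数。"""
    target = F.softmax(align_score / tau, dim=-1)        # 归一化成目标分布
    return F.kl_div(attn.clamp_min(1e-8).log(), target, reduction="batchmean")

attn = F.softmax(torch.randn(4, 49), dim=-1)             # 例如7x7=49个区域
align = torch.rand(4, 49)                                 # 假设的对齐分数
print(knowledge_alignment_loss(attn, align))
```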
zh
[CV-139] SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution
【速读】: 该论文试图解决在基于CPU的架构中加速卷积运算的问题。现有方法主要依赖于将图像数据打包到矩阵列中(im2col),并通过通用矩阵乘法(GEMM)进行计算,但存在两个主要缺点:一是im2col需要大量内存缓冲区,且可能导致内存访问效率低下;二是GEMM虽然针对科学矩阵乘法进行了高度优化,但并不完全适用于卷积运算。论文提出的解决方案关键在于利用标量-矩阵乘法,减少内存开销,从而显著提升卷积运算的速度。实验结果表明,该方法在常见网络架构中相比现有间接方法有显著的加速效果。
链接: https://arxiv.org/abs/2411.15659
作者: Amir Ofir,Gil Ben-Artzi
关键词-EN: inference for CPU-based, accelerating convolutions, CPU-based architectures, performing general matrix, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present a novel approach for accelerating convolutions during inference for CPU-based architectures. The most common method of computation involves packing the image into the columns of a matrix (im2col) and performing general matrix multiplication (GEMM) with a matrix of weights. This results in two main drawbacks: (a) im2col requires a large memory buffer and can experience inefficient memory access, and (b) while GEMM is highly optimized for scientific matrices multiplications, it is not well suited for convolutions. We propose an approach that takes advantage of scalar-matrix multiplication and reduces memory overhead. Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
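下面用一个朴素的numpy草图说明“标量-矩阵乘法”计算卷积、从而避免im2col大缓冲区的基本思想(非SMM-Conv官方实现,未包含多通道、stride与底层优化):

```python
# 示意性代码草图(非SMM-Conv官方实现):不构造im2col缓冲区,而是按核位置逐一
# 用"标量 x 平移后的输入窗口"累加得到卷积结果(stride=1, 无padding, 单通道演示)。
import numpy as np

def smm_conv2d(x, w):
    """x: (H, W) 输入; w: (kh, kw) 卷积核; 返回 (H-kh+1, W-kw+1)。"""
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # 核的一个标量权重乘以输入的对应平移窗口, 直接累加, 无需展开成大矩阵
            out += w[i, j] * x[i:i + out.shape[0], j:j + out.shape[1]]
    return out

x = np.random.rand(8, 8).astype(np.float32)
w = np.random.rand(3, 3).astype(np.float32)
ref = np.array([[np.sum(x[r:r+3, c:c+3] * w) for c in range(6)] for r in range(6)])
print(np.allclose(smm_conv2d(x, w), ref, atol=1e-5))   # True
```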
zh
[CV-140] raining an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data NEURIPS2024
【速读】: 该论文试图解决开放词汇3D物体检测(open-vocabulary 3D object detection)在点云数据依赖下的高部署成本问题。解决方案的关键在于提出了一种名为OVM3D-Det的新型开放词汇单目3D物体检测框架,该框架仅使用RGB图像进行训练,从而降低了成本并提高了可扩展性。OVM3D-Det通过利用开放词汇2D模型和伪激光雷达(pseudo-LiDAR)来自动标注RGB图像中的3D物体,并引入了自适应伪激光雷达腐蚀(adaptive pseudo-LiDAR erosion)和基于大型语言模型(large language models)的边界框细化(bounding box refinement)技术,以校准3D标签并实现仅使用RGB图像的3D检测器训练。
链接: https://arxiv.org/abs/2411.15657
作者: Rui Huang,Henry Zheng,Yan Wang,Zhuofan Xia,Marco Pavone,Gao Huang
关键词-EN: previously unseen domains, recently attracted considerable, attracted considerable attention, considerable attention due, driving and robotics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:Open-vocabulary 3D object detection has recently attracted considerable attention due to its broad applications in autonomous driving and robotics, which aims to effectively recognize novel classes in previously unseen domains. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However, training 3D models with labels directly derived from pseudo-LiDAR is inadequate due to imprecise boxes estimated from noisy point clouds and severely occluded objects. To address these issues, we introduce two innovative designs: adaptive pseudo-LiDAR erosion and bounding box refinement with prior knowledge from large language models. These techniques effectively calibrate the 3D labels and enable RGB-only training for 3D detectors. Extensive experiments demonstrate the superiority of OVM3D-Det over baselines in both indoor and outdoor scenarios. The code will be released.
zh
[CV-141] Machine Learning-based sEMG Signal Classification for Hand Gesture Recognition
【速读】: 该论文旨在通过引入新的特征提取方法,即融合时域描述符、时空描述符和小波变换特征,结合先进的机器学习和深度学习模型,来评估基于肌电信号(EMG)的手势识别性能。解决方案的关键在于采用1D Dilated CNN和随机森林等模型,分别在Grabmyo和FORS-EMG数据集上实现了高达97%和94.95%的准确率,其中融合时域描述符(如功率谱矩、稀疏性、不规则因子及波形长度比)和时空描述符(包括时域特征及变异系数COV和Teager-Kaiser能量算子TKEO)的特征提取方法显著提升了手势识别的准确性。
链接: https://arxiv.org/abs/2411.15655
作者: Parshuram N. Aarotale,Ajita Rattani
关键词-EN: analyzing electrical activity, electrical activity generated, classify hand movements, movements by analyzing, analyzing electrical
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE BIBM 2024
点击查看摘要
Abstract:EMG-based hand gesture recognition uses electromyographic (EMG) signals to interpret and classify hand movements by analyzing electrical activity generated by muscle contractions. It has wide applications in prosthesis control, rehabilitation training, and human-computer interaction. Using electrodes placed on the skin, the EMG sensor captures muscle signals, which are processed and filtered to reduce noise. Numerous feature extraction and machine learning algorithms have been proposed to extract and classify muscle signals to distinguish between various hand gestures. This paper aims to benchmark the performance of EMG-based hand gesture recognition using novel feature extraction methods, namely, fused time-domain descriptors, temporal-spatial descriptors, and wavelet transform-based features, combined with the state-of-the-art machine and deep learning models. Experimental investigations on the Grabmyo dataset demonstrate that the 1D Dilated CNN performed the best with an accuracy of 97% using fused time-domain descriptors such as power spectral moments, sparsity, irregularity factor and waveform length ratio. Similarly, on the FORS-EMG dataset, random forest performed the best with an accuracy of 94.95% using temporal-spatial descriptors (which include time domain features along with additional features such as coefficient of variation (COV), and Teager-Kaiser energy operator (TKEO)).
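摘要中提到的若干时域描述符可以直接用numpy实现。下面给出一个示意性草图(非论文官方实现,仅演示波形长度、变异系数与TKEO三个特征的计算):

```python
# 示意性代码草图(非论文官方实现):计算几个常见的sEMG时域描述符。
import numpy as np

def waveform_length(x):
    return np.sum(np.abs(np.diff(x)))

def coefficient_of_variation(x):
    return np.std(x) / (np.abs(np.mean(x)) + 1e-12)

def mean_tkeo(x):
    # Teager-Kaiser能量算子: psi[n] = x[n]^2 - x[n-1]*x[n+1]
    psi = x[1:-1] ** 2 - x[:-2] * x[2:]
    return float(np.mean(psi))

emg = np.random.randn(2000) * 0.1          # 模拟一段sEMG窗口信号
features = {
    "WL":   waveform_length(emg),
    "COV":  coefficient_of_variation(emg),
    "TKEO": mean_tkeo(emg),
}
print(features)
```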
zh
[CV-142] OCDet: Object Center Detection via Bounding Box-Aware Heatmap Prediction on Edge Devices with NPUs
【速读】: 该论文试图解决在资源受限的边缘设备上进行实时目标定位的问题。传统框架如目标检测、分割和关键点检测在资源受限环境中表现不佳,常导致显著的目标遗漏。解决方案的关键在于引入了一种轻量级的对象中心检测框架OCDet,该框架针对配备NPU的边缘设备进行了优化。OCDet通过预测表示对象中心概率的热图,并通过峰值识别提取中心点。与使用固定高斯分布的先前方法不同,OCDet引入了广义中心性(Generalized Centerness, GC)来从边界框注释生成地面真值热图,提供更精细的空间细节而不需要额外的人工标注。此外,OCDet基于NPU友好的语义FPN和MobileNetV4骨干网络,并采用平衡连续焦点损失(Balanced Continuous Focal Loss, BCFL)进行训练,以缓解数据不平衡问题并专注于概率回归任务中的困难负样本。通过结合中心对齐分数(Center Alignment Score, CAS)和匈牙利匹配算法,OCDet在对象中心检测方面显著优于YOLO11,同时减少了参数数量、计算量和NPU延迟。
链接: https://arxiv.org/abs/2411.15653
作者: Chen Xin,Thomas Motz,Andreas Hartel,Enkelejda Kasneci
关键词-EN: Real-time object localization, Real-time object, numerous applications, ranging from surveillance, industrial automation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Real-time object localization on edge devices is fundamental for numerous applications, ranging from surveillance to industrial automation. Traditional frameworks, such as object detection, segmentation, and keypoint detection, struggle in resource-constrained environments, often resulting in substantial target omissions. To address these challenges, we introduce OCDet, a lightweight Object Center Detection framework optimized for edge devices with NPUs. OCDet predicts heatmaps representing object center probabilities and extracts center points through peak identification. Unlike prior methods using fixed Gaussian distribution, we introduce Generalized Centerness (GC) to generate ground truth heatmaps from bounding box annotations, providing finer spatial details without additional manual labeling. Built on NPU-friendly Semantic FPN with MobileNetV4 backbones, OCDet models are trained by our Balanced Continuous Focal Loss (BCFL), which alleviates data imbalance and focuses training on hard negative examples for probability regression tasks. Leveraging the novel Center Alignment Score (CAS) with Hungarian matching, we demonstrate that OCDet consistently outperforms YOLO11 in object center detection, achieving up to 23% higher CAS while requiring 42% fewer parameters, 34% less computation, and 64% lower NPU latency. When compared to keypoint detection frameworks, OCDet achieves substantial CAS improvements up to 186% using identical models. By integrating GC, BCFL, and CAS, OCDet establishes a new paradigm for efficient and robust object center detection on edge devices with NPUs. The code is released at this https URL.
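论文的Generalized Centerness由边界框标注生成中心度热图,其精确定义以论文为准;下面给出一种FCOS风格centerness加指数alpha的假设形式作为示意(非OCDet官方实现):

```python
# 示意性代码草图(非OCDet官方实现):由边界框生成中心度热图, 峰值位置即物体中心。
import numpy as np

def centerness_heatmap(h, w, box, alpha=1.0):
    """box = (x1, y1, x2, y2); 框内像素的中心度 = sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b))^alpha。"""
    x1, y1, x2, y2 = box
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    l, r = xs - x1, x2 - xs
    t, b = ys - y1, y2 - ys
    inside = (l > 0) & (r > 0) & (t > 0) & (b > 0)
    c = np.zeros((h, w))
    lr = np.minimum(l, r) / np.maximum(np.maximum(l, r), 1e-6)
    tb = np.minimum(t, b) / np.maximum(np.maximum(t, b), 1e-6)
    c[inside] = np.sqrt(lr[inside] * tb[inside]) ** alpha
    return c

heat = centerness_heatmap(64, 64, (10, 12, 40, 50))
cy, cx = np.unravel_index(np.argmax(heat), heat.shape)   # 峰值识别提取中心点
print(cy, cx)
```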
zh
[CV-143] Sample- and Parameter-Efficient Auto-Regressive Image Models
【速读】: 该论文试图解决现有自回归图像模型在样本和参数效率上的不足,特别是对比学习或掩码图像建模方法在处理不平衡互联网数据时缺乏一致的扩展性问题。解决方案的关键在于引入了一种新的自回归目标函数,即XTRA模型,该模型采用了一种称为Block Causal Mask的创新方法,通过块级(Block)而非传统的单个token级进行像素值的重建。这种块级重建机制使得模型能够捕捉更大图像区域的高级结构模式,从而在更广泛的像素区域上学习关系,生成更抽象和语义上有意义的表示。这一简单但有效的修改显著提升了XTRA的样本和参数效率,使其在训练数据量大幅减少的情况下(13.1M vs. 2B),仍能在15个多样化的图像识别基准测试中超越先前的自回归模型,同时在参数使用上也更为高效(85M vs. 1.36B/0.63B)。
链接: https://arxiv.org/abs/2411.15648
作者: Elad Amrani,Leonid Karlinsky,Alex Bronstein
关键词-EN: XTRA, vision model pre-trained, Causal Mask, objective that significantly, significantly enhances
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: for code, see this https URL
点击查看摘要
Abstract:We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been demonstrated as having consistent scaling behavior on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents k \times k tokens rather than relying on a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically meaningful representations than traditional next-token prediction. This simple modification yields two key results. First, XTRA is sample-efficient. Despite being trained on 152 \times fewer samples (13.1M vs. 2B), XTRA ViT-H/14 surpasses the top-1 average accuracy of the previous state-of-the-art auto-regressive model across 15 diverse image recognition benchmarks. Second, XTRA is parameter-efficient. Compared to auto-regressive models trained on ImageNet-1k, XTRA ViT-B/16 outperforms in linear and attentive probing tasks, using 7-16 \times fewer parameters (85M vs. 1.36B/0.63B).
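Block Causal Mask的构造可以用几行代码说明:把每k*k个token视为一个Block,token可关注同一Block及之前Block中的全部token。下面是一个示意性草图(非XTRA官方实现):

```python
# 示意性代码草图(非XTRA官方实现):构造块级因果掩码(Block Causal Mask)。
import torch

def block_causal_mask(num_tokens: int, block_size: int) -> torch.Tensor:
    """返回 (num_tokens, num_tokens) 的bool掩码, True表示允许注意。"""
    idx = torch.arange(num_tokens)
    block_id = idx // block_size                 # 每个token所属的Block编号
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

mask = block_causal_mask(num_tokens=16, block_size=4)   # 例如k=2时一个Block含4个token
print(mask.int())
# 用法:attn_scores.masked_fill_(~mask, float("-inf")) 后再做softmax
```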
zh
[CV-144] Effort: Efficient Orthogonal Modeling for Generalizable AI-Generated Image Detection
【速读】: 该论文试图解决现有AI生成图像(AIGI)检测方法泛化性能不足的问题。解决方案的关键在于识别并利用AIGI检测中存在的关键不对称现象:模型在训练过程中容易过度拟合于训练集中的特定伪造模式,而未能充分捕捉其他信息,导致面对新伪造方法时泛化能力差。论文提出通过引入大规模视觉基础模型(VFMs)中嵌入的丰富语义知识,扩展原有的基于伪造模式的判别空间,使得判别不仅依赖于伪造模式,还依赖于语义线索,从而减少对特定伪造模式的过度拟合。具体解决方案是设计了一种名为Effort的新方法:通过奇异值分解(SVD)构建正交的语义和伪造子空间,冻结主成分并调整剩余成分(约0.19M参数),以保留原始语义子空间,并在其正交子空间中学习伪造特征。实验结果表明,该方法在AIGI检测基准上具有优越的有效性。
链接: https://arxiv.org/abs/2411.15633
作者: Zhiyuan Yan,Jiangming Wang,Zhendong Wang,Peng Jin,Ke-Yue Zhang,Shen Chen,Taiping Yao,Shouhong Ding,Baoyuan Wu,Li Yuan
关键词-EN: Existing AI-generated image, limited generalization performance, Existing AI-generated, AIGI detection, AI-generated image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing AI-generated image (AIGI) detection methods often suffer from limited generalization performance. In this paper, we identify a crucial yet previously overlooked asymmetry phenomenon in AIGI detection: during training, models tend to quickly overfit to specific fake patterns in the training set, while other information is not adequately captured, leading to poor generalization when faced with new fake methods. A key insight is to incorporate the rich semantic knowledge embedded within large-scale vision foundation models (VFMs) to expand the previous discriminative space (based on forgery patterns only), such that the discrimination is decided by both forgery and semantic cues, thereby reducing the overfitting to specific forgery patterns. A straightforward solution is to fully fine-tune VFMs, but it risks distorting the well-learned semantic knowledge, pushing the model back toward overfitting. To this end, we design a novel approach called Effort: Efficient orthogonal modeling for generalizable AIGI detection. Specifically, we employ Singular Value Decomposition (SVD) to construct the orthogonal semantic and forgery subspaces. By freezing the principal components and adapting the residual components ( \sim 0.19M parameters), we preserve the original semantic subspace and use its orthogonal subspace for learning forgeries. Extensive experiments on AIGI detection benchmarks demonstrate the superior effectiveness of our approach.
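下面用一个示意性草图说明“SVD分解后冻结主成分、仅训练残差成分”的做法(非Effort官方实现,层的维度与秩r均为演示假设):

```python
# 示意性代码草图(非Effort官方实现):冻结权重SVD的前r个主成分(保留语义子空间),
# 仅把剩余成分作为可训练参数(参数量很小)。
import torch
import torch.nn as nn

class SVDResidualLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, r: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # 主成分部分: 冻结(register_buffer不参与梯度更新)
        self.register_buffer("W_main", U[:, :r] @ torch.diag(S[:r]) @ Vh[:r])
        # 残差部分: 以原权重的剩余奇异方向初始化, 可训练
        self.U_res = nn.Parameter(U[:, r:].clone())
        self.S_res = nn.Parameter(S[r:].clone())
        self.V_res = nn.Parameter(Vh[r:].clone())
    def forward(self, x):
        W = self.W_main + self.U_res @ torch.diag(self.S_res) @ self.V_res
        return x @ W.t()

layer = SVDResidualLinear(torch.randn(768, 768), r=700)
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable)   # 仅残差部分可训练
```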
zh
[CV-145] ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos WACV2025
【速读】: 该论文试图解决现有视觉-语言模型(Vision-language models, VLMs)在程序性动作分类中缺乏对动作概念的内在理解,导致对固定标签的过拟合以及对未见动作同义词的不变性不足的问题。解决方案的关键是提出了一种简单的微调技术,称为动作概念增强(Action Concept Enhancement, ACE)。ACE通过在训练过程中随机替换固定标签,并引入增强的动作同义词和负样本,在辅助分类损失中持续整合这些信息,从而创建新的动作标签组合,防止模型对固定动作表示的过拟合,增强模型对动作概念的理解。实验结果表明,ACE在零样本动作分类中显著提升了性能,同时在已见动作分类中保持了竞争性表现。
链接: https://arxiv.org/abs/2411.15628
作者: Reza Ghoddoosian,Nakul Agarwal,Isht Dwivedi,Behzad Dariush
关键词-EN: Vision-language models, action, recognizing unseen actions, capable of recognizing, unseen action synonyms
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2025
点击查看摘要
Abstract:Vision-language models (VLMs) are capable of recognizing unseen actions. However, existing VLMs lack intrinsic understanding of procedural action concepts. Hence, they overfit to fixed labels and are not invariant to unseen action synonyms. To address this, we propose a simple fine-tuning technique, Action Concept Enhancement (ACE), to improve the robustness and concept understanding of VLMs in procedural action classification. ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss by stochastically replacing fixed labels during training. This creates new combinations of action labels over the course of fine-tuning and prevents overfitting to fixed action representations. We show the enhanced concept understanding of our VLM, by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space. Our experiments on the ATA, IKEA and GTEA datasets demonstrate the efficacy of ACE in domains of cooking and assembly leading to significant improvements in zero-shot action classification while maintaining competitive performance on seen actions.
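ACE的核心操作是训练时随机用同义词替换固定标签并混入负样本,可用如下示意代码理解(非官方实现,同义词表、负样本与文本编码器均为演示用的假设):

```python
# 示意性代码草图(非ACE官方实现):以一定概率把固定动作标签替换为同义词,
# 并混入负样本词, 构成辅助分类损失, 防止过拟合到固定标签。
import random
import torch
import torch.nn.functional as F

synonyms = {"tighten screw": ["fasten screw", "screw in"],
            "pour water":    ["fill with water", "add water"]}
negatives = ["open drawer", "cut paper"]

def sample_label(label, p_syn=0.5):
    """以概率p_syn用同义词替换固定标签。"""
    if random.random() < p_syn and synonyms.get(label):
        return random.choice(synonyms[label])
    return label

def auxiliary_loss(video_feat, text_encoder, label, num_neg=2):
    texts = [sample_label(label)] + random.sample(negatives, num_neg)
    text_feat = F.normalize(text_encoder(texts), dim=-1)          # (1+num_neg, D)
    logits = 100.0 * F.normalize(video_feat, dim=-1) @ text_feat.t()
    target = torch.zeros(video_feat.size(0), dtype=torch.long)    # 第0项为正样本
    return F.cross_entropy(logits, target)

# 用法示意:text_encoder为假设的文本编码器(例如CLIP文本编码器的封装)
fake_encoder = lambda texts: torch.randn(len(texts), 512)
print(auxiliary_loss(torch.randn(1, 512), fake_encoder, "pour water"))
```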
zh
[CV-146] On the importance of local and global feature learning for automated measurable residual disease detection in flow cytometry data ICPR2024
【速读】: 该论文试图解决在流式细胞术 (Flow Cytometry, FCM) 数据中检测可测量残留病 (Measurable Residual Disease, MRD) 的问题,重点关注深度学习方法在捕捉长程依赖、获取全局信息以及学习局部特征方面的优势。解决方案的关键在于对当前最先进 (State-of-the-Art, SOTA) 模型的两项改进:一是增强模型对长程依赖的建模能力,二是优化获取全局信息和局部特征学习的方法。这些改进不仅在公开数据集上展示了优越的性能,还提高了模型在不同实验室间的泛化能力,为FCM社区提供了宝贵的指导,推动了未来深度学习架构在FCM数据分析中的设计。
链接: https://arxiv.org/abs/2411.15621
作者: Lisa Weijler,Michael Reiter,Pedro Hermosilla,Margarita Maurer-Granofszky,Michael Dworzak
关键词-EN: learning local features, measurable residual disease, modeling long-range dependencies, obtaining global information, deep learning methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICPR 2024
点击查看摘要
Abstract:This paper evaluates various deep learning methods for measurable residual disease (MRD) detection in flow cytometry (FCM) data, addressing questions regarding the benefits of modeling long-range dependencies, methods of obtaining global information, and the importance of learning local features. Based on our findings, we propose two adaptations to the current state-of-the-art (SOTA) model. Our contributions include an enhanced SOTA model, demonstrating superior performance on publicly available datasets and improved generalization across laboratories, as well as valuable insights for the FCM community, guiding future DL architecture designs for FCM data analysis. The code is available at this https URL.
zh
[CV-147] Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation
【速读】: 该论文试图解决现有基于视觉的基础模型在对象检测中难以捕捉整体对象中的细小部分以及无法充分考虑用户意图的问题。解决方案的关键在于提出了一种名为FOCUS(Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation)的新方法。FOCUS通过结合视觉基础模型的能力,实现了灵活粒度的开放词汇对象检测,并允许用户通过自然语言直接引导检测过程。这种方法不仅擅长识别和定位细粒度的组成部分,还能在减少不必要用户干预的同时,赋予用户显著的控制权。通过FOCUS,用户可以提出可解释的请求,主动引导检测过程朝着预期方向进行,从而有效提升基线模型的检测能力,并在不同对象类型上表现出一致的性能。
链接: https://arxiv.org/abs/2411.15620
作者: Jinwoo Ahn,Hyeokjoon Kwon,Hwiyeon Yoo
关键词-EN: Recent advent, high-quality object detection, advent of vision-based, enabled efficient, efficient and high-quality
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advent of vision-based foundation models has enabled efficient and high-quality object detection at ease. Despite the success of previous studies, object detection models face limitations on capturing small components from holistic objects and taking user intention into account. To address these challenges, we propose a novel foundation model-based detection method called FOCUS: Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation. FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and allow users to directly guide the detection process via natural language. It not only excels at identifying and locating granular constituent elements but also minimizes unnecessary user intervention yet grants them significant control. With FOCUS, users can make explainable requests to actively guide the detection process in the intended direction. Our results show that FOCUS effectively enhances the detection capabilities of baseline models and shows consistent performance across varying object types.
zh
[CV-148] Knowledge Transfer Across Modalities with Natural Language Supervision
【速读】: 该论文试图解决通过仅使用文本描述来学习新概念的问题,提出了名为“知识转移 (Knowledge Transfer)”的方法。解决方案的关键在于利用预训练视觉编码器中已学习的低级特征(如形状、外观、颜色),通过跨模态交互将这些低级特征与新概念的高级文本描述对齐。该方法通过单一的文本描述即可高效地引入新概念,并适用于独立的文本和视觉编码器(如CLIP)以及跨模态共享参数的模型。此外,知识转移还能提升模型已知概念的表现,并在零样本分类、分割、图像-文本检索和图像描述等任务中提高性能。
链接: https://arxiv.org/abs/2411.15611
作者: Carlo Alberto Barbano,Luca Molinaro,Emanuele Aiello,Marco Grangetto
关键词-EN: Knowledge Transfer, Transfer, method Knowledge Transfer, Knowledge, Leveraging Knowledge Transfer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 7 figures, 17 tables
点击查看摘要
Abstract:We present a way to learn novel concepts by only using their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We hypothesize that in a pre-trained visual encoder there are enough low-level features already learned (e.g. shape, appearance, color) that can be used to describe previously unknown high-level concepts. Provided with a textual description of the novel concept, our method works by aligning the known low-level features of the visual encoder to its high-level textual description. We show that Knowledge Transfer can successfully introduce novel concepts in multimodal models, in a very efficient manner, by only requiring a single description of the target concept. Our approach is compatible with both separate textual and visual encoders (e.g. CLIP) and shared parameters across modalities. We also show that, following the same principle, Knowledge Transfer can improve concepts already known by the model. Leveraging Knowledge Transfer we improve zero-shot performance across different tasks such as classification, segmentation, image-text retrieval, and captioning.
zh
[CV-149] GIFT: A Framework for Global Interpretable Faithful Textual Explanations of Vision Classifiers
【速读】: 该论文试图解决在安全关键应用中部署深度模型时,如何生成后验、全局、可解释且忠实的文本解释的问题。解决方案的关键在于引入了一个名为GIFT的框架,该框架从局部忠实的视觉反事实解释出发,利用(视觉)语言模型将其转化为全局文本解释。GIFT框架还包含一个验证阶段,用于测量所提出的解释对分类器决策的因果效应,从而确保解释的忠实性。通过在多个数据集(如CLEVR、CelebA和BDD)上的实验,GIFT展示了其有效性,揭示了深度视觉分类器所使用的任务、概念和偏见。
链接: https://arxiv.org/abs/2411.15605
作者: Éloi Zablocki,Valentin Gerard,Amaia Cardiel,Eric Gaussier,Matthieu Cord,Eduardo Valle
关键词-EN: Understanding deep models, Understanding deep, safety-critical applications, crucial for deploying, Understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Understanding deep models is crucial for deploying them in safety-critical applications. We introduce GIFT, a framework for deriving post-hoc, global, interpretable, and faithful textual explanations for vision classifiers. GIFT starts from local faithful visual counterfactual explanations and employs (vision) language models to translate those into global textual explanations. Crucially, GIFT provides a verification stage measuring the causal effect of the proposed explanations on the classifier decision. Through experiments across diverse datasets, including CLEVR, CelebA, and BDD, we demonstrate that GIFT effectively reveals meaningful insights, uncovering tasks, concepts, and biases used by deep vision classifiers. Our code, data, and models are released at this https URL.
zh
[CV-150] FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video
【速读】: 该论文试图解决从单目视频中重建高保真、可动画的3D头部化身的问题,特别是针对不完全重建和低效的高斯表示等挑战。解决方案的关键在于引入了一种名为FATE的新方法,该方法通过以下几个关键技术来实现:1) 基于采样的密集化策略,以确保点的最佳位置分布,从而提高渲染效率;2) 神经烘焙技术,将离散的高斯表示转换为连续的属性图,便于直观的外貌编辑;3) 通用完成框架,用于恢复非正面视角的外貌,最终实现360°可渲染的3D头部化身。FATE在定性和定量评估中均优于先前的方法,达到了最先进的性能。
链接: https://arxiv.org/abs/2411.15604
作者: Jiawei Zhang,Zijian Wu,Zhiyang Liang,Yicheng Gong,Dongfang Hu,Yao Yao,Xun Cao,Hao Zhu
关键词-EN: effortlessly captured monocular, captured monocular videos, effortlessly captured, pivotal yet formidable, Reconstructing high-fidelity
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:Reconstructing high-fidelity, animatable 3D head avatars from effortlessly captured monocular videos is a pivotal yet formidable challenge. Although significant progress has been made in rendering performance and manipulation capabilities, notable challenges remain, including incomplete reconstruction and inefficient Gaussian representation. To address these challenges, we introduce FATE, a novel method for reconstructing an editable full-head avatar from a single monocular video. FATE integrates a sampling-based densification strategy to ensure optimal positional distribution of points, improving rendering efficiency. A neural baking technique is introduced to convert discrete Gaussian representations into continuous attribute maps, facilitating intuitive appearance editing. Furthermore, we propose a universal completion framework to recover non-frontal appearance, culminating in a 360°-renderable 3D head avatar. FATE outperforms previous approaches in both qualitative and quantitative evaluations, achieving state-of-the-art performance. To the best of our knowledge, FATE is the first animatable and 360° full-head monocular reconstruction method for a 3D head avatar. The code will be publicly released upon publication.
zh
[CV-151] Enhancing Object Detection Accuracy in Autonomous Vehicles Using Synthetic Data
【速读】: 该论文试图解决机器学习模型在实际应用中因训练数据稀缺、噪声和失衡而导致的性能受限问题。解决方案的关键在于利用合成数据(synthetic data)来增强训练数据集的质量和多样性,从而提高模型的预测准确性。论文通过创建合成数据集并将其应用于自动驾驶场景中的目标检测系统,验证了合成数据对提升模型性能的有效性。实验结果表明,结合真实数据和合成数据训练的模型(System-2)在准确性、精确度、召回率和平均精度均值等关键性能指标上均优于仅使用真实数据训练的模型(System-1),具体表现为准确性提高了3%。
链接: https://arxiv.org/abs/2411.15602
作者: Sergei Voronin,Abubakar Siddique,Muhammad Iqbal
关键词-EN: machine learning models, machine learning, disease diagnoses, learning models, learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 Pages, 7 figures, 1 table
点击查看摘要
Abstract:The rapid progress in machine learning models has significantly boosted the potential for real-world applications such as autonomous vehicles, disease diagnoses, and recognition of emergencies. The performance of many machine learning models depends on the nature and size of the training data sets. These models often face challenges due to the scarcity, noise, and imbalance in real-world data, limiting their performance. Nonetheless, high-quality, diverse, relevant and representative training data is essential to build accurate and reliable machine learning models that adapt well to real-world scenarios. It is hypothesised that well-designed synthetic data can improve the performance of a machine learning algorithm. This work aims to create a synthetic dataset and evaluate its effectiveness to improve the prediction accuracy of object detection systems. This work considers autonomous vehicle scenarios as an illustrative example to show the efficacy of synthetic data. The effectiveness of these synthetic datasets in improving the performance of state-of-the-art object detection models is explored. The findings demonstrate that incorporating synthetic data improves model performance across all performance matrices. Two deep learning systems, System-1 (trained on real-world data) and System-2 (trained on a combination of real and synthetic data), are evaluated using the state-of-the-art YOLO model across multiple metrics, including accuracy, precision, recall, and mean average precision. Experimental results revealed that System-2 outperformed System-1, showing a 3% improvement in accuracy, along with superior performance in all other metrics.
zh
[CV-152] How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking
【速读】: 该论文试图解决视觉语言跟踪 (Vision-language tracking, VLT) 中语义信息在复杂场景下可能成为“干扰”的问题,导致现有VLT跟踪器在多基准测试中表现不如单一模态方法。解决方案的关键在于提出了VLTVerse,这是一个细粒度的评估框架,通过引入10个序列级挑战标签和6种多粒度语义信息,创建了一个灵活且多维的评估空间。该框架利用60个子空间对三种主流SOTA VLT跟踪器进行系统性细粒度评估,揭示了它们在复杂场景中的性能瓶颈,并提供了关于VLT评估的新视角。通过实验结果的解耦分析,论文还探讨了不同语义类型对特定挑战因素的影响,为提升VLT在数据、评估和算法维度上的性能提供了重要指导。
链接: https://arxiv.org/abs/2411.15600
作者: Xuchen Li,Shiyu Hu,Xiaokun Feng,Dailing Zhang,Meiqi Wu,Jing Zhang,Kaiqi Huang
关键词-EN: extends traditional single, single object tracking, traditional single object, Vision-language tracking, incorporating textual information
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint, Under Review
点击查看摘要
Abstract:Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information, providing semantic guidance to enhance tracking performance under challenging conditions like fast motion and deformations. However, current VLT trackers often underperform compared to single-modality methods on multiple benchmarks, with semantic information sometimes becoming a “distraction.” To address this, we propose VLTVerse, the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, hoping to reveal the role of language in VLT. Our contributions include: (1) VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT; (2) leveraging 60 subspaces formed by combinations of challenge factors and semantic types, we conduct systematic fine-grained evaluations of three mainstream SOTA VLT trackers, uncovering their performance bottlenecks across complex scenarios and offering a novel perspective on VLT evaluation; (3) through decoupled analysis of experimental results, we examine the impact of various semantic types on specific challenge factors in relation to different algorithms, providing essential guidance for enhancing VLT across data, evaluation, and algorithmic dimensions. The VLTVerse, toolkit, and results will be available at this http URL.
zh
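下面给出一个极简的示意代码(并非 VLTVerse 官方工具包),说明如何把逐序列的跟踪结果按“挑战标签 × 语义类型”组织成摘要中提到的 60 个子空间并分别汇总得分;其中的标签名与字段名均为示意性假设。

```python
# Minimal sketch (not the official VLTVerse toolkit): bucket per-sequence tracking
# results into the 10 challenge-label x 6 semantic-type subspaces and aggregate a
# score per subspace. Label names below are placeholders.
from collections import defaultdict
from statistics import mean

CHALLENGES = [f"challenge_{i}" for i in range(10)]     # e.g., fast motion, occlusion, ...
SEMANTIC_TYPES = [f"semantic_{j}" for j in range(6)]   # e.g., initial description, dense caption, ...

def evaluate_subspaces(results):
    """results: list of dicts with keys 'challenges' (set of labels),
    'semantic' (semantic type of the text input), 'score' (float)."""
    buckets = defaultdict(list)
    for r in results:
        for c in r["challenges"]:
            buckets[(c, r["semantic"])].append(r["score"])
    # One aggregate score per (challenge, semantic) subspace; empty subspaces stay None.
    return {
        (c, s): (mean(buckets[(c, s)]) if buckets[(c, s)] else None)
        for c in CHALLENGES for s in SEMANTIC_TYPES
    }

if __name__ == "__main__":
    demo = [{"challenges": {"challenge_0"}, "semantic": "semantic_1", "score": 0.62}]
    print(evaluate_subspaces(demo)[("challenge_0", "semantic_1")])
```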
[CV-153] An adversarial feature learning based semantic communication method for Human 3D Reconstruction
【速读】: 该论文试图解决在网络带宽有限且需要低延迟的场景下,人体3D重建技术中的数据传输和处理效率问题。解决方案的关键在于提出了一种基于对抗特征学习(Adversarial Feature Learning)的语义通信方法(AFLSC),通过提取和传输对3D重建任务至关重要的语义信息,优化数据流并缓解带宽压力。具体来说,发送端采用多任务学习(Multitask Learning)特征提取方法捕捉2D人体图像的空间布局、关键点、姿态和深度信息,并设计了基于对抗特征学习的语义编码技术进行高效编码和动态压缩传输。接收端则通过多层次语义特征解码方法将语义数据转换回关键图像特征,最终利用改进的ViT-diffusion模型生成人体3D网格模型。实验结果表明,该方法在数据传输效率和重建质量方面具有显著优势,适用于带宽受限的环境。
链接: https://arxiv.org/abs/2411.15595
作者: Shaojiang Liu,Jiajun Zou,Zhendan Liu,Meixia Dong,Zhiping Wan
关键词-EN: processing efficiency continue, human body, continue to rise, Learning-based Semantic Communication, scenarios where network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:With the widespread application of human body 3D reconstruction technology across various fields, the demands for data transmission and processing efficiency continue to rise, particularly in scenarios where network bandwidth is limited and low latency is required. This paper introduces an Adversarial Feature Learning-based Semantic Communication method (AFLSC) for human body 3D reconstruction, which focuses on extracting and transmitting semantic information crucial for the 3D reconstruction task, thereby significantly optimizing data flow and alleviating bandwidth pressure. At the sender’s end, we propose a multitask learning-based feature extraction method to capture the spatial layout, keypoints, posture, and depth information from 2D human images, and design a semantic encoding technique based on adversarial feature learning to encode this feature information into semantic data. We also develop a dynamic compression technique to efficiently transmit this semantic data, greatly enhancing transmission efficiency and reducing latency. At the receiver’s end, we design an efficient multi-level semantic feature decoding method to convert semantic data back into key image features. Finally, an improved ViT-diffusion model is employed for 3D reconstruction, producing human body 3D mesh models. Experimental results validate the advantages of our method in terms of data transmission efficiency and reconstruction quality, demonstrating its excellent potential for application in bandwidth-limited environments.
zh
[CV-154] Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing
【速读】: 该论文试图解决现有场景文本识别(Scene Text Recognition, STR)方法在识别艺术性和严重扭曲字符时表现不佳的问题。其核心在于通过增强模型对字符形态(character morphologies)的理解来提升识别能力。解决方案的关键在于:1) 提出在线生成策略(Online Generation Strategy),通过生成无背景的多样化字符样本,弥补合成数据简单性的不足,增强模型对字符形态的专注和泛化能力;2) 提出新的字符单向对齐损失(Character Unidirectional Alignment Loss),修正先前字符对比损失中的推导错误,统一同一字符在不同样本中的表示,从而减少类内分布的稀疏性和挑战性样本的模糊性。这些改进使得模型在常见基准和Union14M-Benchmark上达到了最先进的性能(94.7%和70.9%的平均准确率)。
链接: https://arxiv.org/abs/2411.15585
作者: Yadong Qu,Yuxin Wang,Bangbang Zhou,Zixiao Wang,Hongtao Xie,Yongdong Zhang
关键词-EN: Existing scene text, scene text recognition, severely distorted characters, Existing scene, text recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters. The limitation lies in the insufficient exploration of character morphologies, including the monotonousness of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and generalize the ability to recognize complex samples when trained with only simple synthetic data. To boost the summarizing process, we theoretically demonstrate the derivation error in the previous character contrastive loss, which mistakenly causes the sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experiment results show that our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark). Code will be available at this https URL.
zh
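摘要没有给出 Character Unidirectional Alignment Loss 的具体公式,下面按其文字描述给出一个推测性的极简示意:把学生模型的字符特征单向对齐到教师模型的参考特征(教师侧停止梯度,即“单向”),函数名与特征维度均为假设。

```python
import torch
import torch.nn.functional as F

def character_unidirectional_alignment_loss(student_feats, teacher_feats):
    """One-way alignment of per-character features (illustrative form only).

    student_feats, teacher_feats: (N, D) tensors with one row per character
    instance, where rows at the same index correspond to the same character.
    The teacher branch is detached so gradients only pull the student toward
    the reference representation.
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

# toy usage
student = torch.randn(32, 256, requires_grad=True)
teacher = torch.randn(32, 256)
character_unidirectional_alignment_loss(student, teacher).backward()
```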
[CV-155] FLD: Data-efficient Evaluation Metric for Generative Models
【速读】: 该论文试图解决现有生成图像质量评估指标(如Fréchet Inception Distance (FID))在可靠性、数据效率、计算效率和适应新领域方面的不足。解决方案的关键在于提出了一种基于归一化流 (normalizing flows) 的新指标——Flow-based Likelihood Distance Plus (FLD+)。FLD+ 通过计算图像的密度(精确对数似然)来评估图像质量,具有以下优势:1) 对不同类型的图像退化(如噪声、遮挡、扩散步骤和生成模型大小)表现出强单调性;2) 训练稳定且高效,所需图像数量比FID少两个数量级;3) 计算效率更高,通过在低维潜在空间中应用归一化流实现;4) 易于在新领域(如医学图像)上重新训练,无需依赖预训练网络(如InceptionNetV3)。
链接: https://arxiv.org/abs/2411.15584
作者: Pranav Jeevan,Neeraj Nixon,Amit Sethi
关键词-EN: Fréchet Inception Distance, Fréchet Inception, Inception Distance, Flow-based Likelihood Distance, assess the quality
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 13 pages, 10 figures
点击查看摘要
Abstract:We introduce a new metric to assess the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than the previous metrics, such as Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allows for the computation of density (exact log-likelihood) of images from any domain. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradations, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flow can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID (which requires more images to reliably compute Fréchet distance between features of large samples of real and generated images). We made FLD+ computationally even more efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics – such as InceptionNetV3 pre-trained on ImageNet.
zh
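下面是一个体现 FLD+ 思路的示意片段(非官方实现):假设已有一个在真实图像特征上训练好、提供 log_prob 接口的归一化流模型,用真实特征与生成特征的平均对数似然之差作为分数;FLD+ 的确切定义可能与此不同。示例中临时用多元高斯充当“流模型”,仅为保证可直接运行。

```python
import torch

@torch.no_grad()
def likelihood_gap(flow, real_feats, gen_feats):
    """Sketch of a likelihood-based distance in the spirit of FLD+ (assumed form).

    `flow` is any trained density model exposing log_prob(x) for feature vectors
    (e.g., a normalizing flow fit on features of real images in a low-dimensional
    latent space). Returns the drop in mean log-likelihood of generated features
    relative to held-out real features.
    """
    return (flow.log_prob(real_feats).mean() - flow.log_prob(gen_feats).mean()).item()

if __name__ == "__main__":
    # Degenerate stand-in for a trained flow: a Gaussian with a log_prob method.
    real = torch.randn(1000, 64)
    gen = torch.randn(1000, 64) + 0.5
    stand_in = torch.distributions.MultivariateNormal(real.mean(0), covariance_matrix=torch.eye(64))
    print(likelihood_gap(stand_in, real, gen))
```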
[CV-156] EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting
【速读】: 该论文试图解决在自动驾驶场景中,基于3D/4D高斯喷射(Gaussian Splatting, GS)方法在复杂街道场景中对动态物体运动建模不足的问题。现有方法通常将街道场景分解为静态和动态物体,并采用监督学习(如使用3D边界框)或自监督学习(如不使用3D边界框)的方式学习高斯分布,但这些方法未能有效建模动态物体的运动特性(如行人和车辆的移动速度差异),导致场景分解效果不佳。论文提出的解决方案是显式运动分解(Explicit Motion Decomposition, EMD),通过引入可学习的运动嵌入(motion embeddings)到高斯分布中,增强对动态物体运动的建模,从而提升场景分解的效果。EMD是一种即插即用的方法,适用于多种基线方法,并提出了针对性的训练策略以应用于监督和自监督基线。
链接: https://arxiv.org/abs/2411.15582
作者: Xiaobao Wei,Qingpo Wuwu,Zhongyu Zhao,Zhuangzhe Wu,Nan Huang,Ming Lu,Ningning MA,Shanghang Zhang
关键词-EN: developing real-world simulators, Photorealistic reconstruction, street scenes, dynamic objects, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Photorealistic reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. While recent methods based on 3D/4D Gaussian Splatting (GS) have demonstrated promising results, they still encounter challenges in complex street scenes due to the unpredictable motion of dynamic objects. Current methods typically decompose street scenes into static and dynamic objects, learning the Gaussians in either a supervised manner (e.g., w/ 3D bounding-box) or a self-supervised manner (e.g., w/o 3D bounding-box). However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. The proposed EMD is a plug-and-play approach applicable to various baseline methods. We also propose tailored training strategies to apply EMD to both supervised and self-supervised baselines. Through comprehensive experimentation, we illustrate the effectiveness of our approach with various established baselines. The code will be released at: this https URL.
zh
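下面是按摘要描述推测的 EMD 式可学习运动嵌入示意(非官方实现):为每个动态高斯分配一个可学习嵌入,由共享 MLP 根据(嵌入, 归一化时间)输出平移偏移,叠加到高斯的规范中心上;实际 EMD 的参数化与监督方式可能不同。

```python
import torch
import torch.nn as nn

class MotionEmbeddingField(nn.Module):
    """Sketch of per-Gaussian learnable motion embeddings (details assumed)."""

    def __init__(self, num_gaussians, embed_dim=16, hidden=64):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_gaussians, embed_dim) * 0.01)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # xyz offset per Gaussian
        )

    def forward(self, centers, t):
        # centers: (N, 3) canonical Gaussian centers; t: scalar time in [0, 1].
        time = torch.full((centers.shape[0], 1), float(t), device=centers.device)
        offset = self.mlp(torch.cat([self.embeddings, time], dim=-1))
        return centers + offset

field = MotionEmbeddingField(num_gaussians=1024)
moved_centers = field(torch.randn(1024, 3), t=0.3)
```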
[CV-157] TKG-DM: Training-free Chroma Key Content Generation Diffusion Model
【速读】: 该论文试图解决大规模文本到图像生成模型(如 Stable Diffusion)在生成前景物体置于色键背景上的图像时,难以分离前景和背景元素的问题。解决方案的关键在于提出了一种无需训练的色键内容生成扩散模型(Training-Free Chroma Key Content Generation Diffusion Model, TKG-DM),通过优化初始随机噪声以生成前景物体在指定颜色背景上的图像。该方法首次探索了在初始噪声中操控颜色属性以实现背景的精确控制,从而无需微调即可实现前景和背景的精确分离。实验结果表明,该无需训练的方法在定性和定量评估中均优于现有方法,甚至达到或超越了微调模型的效果,并展示了其在其他生成任务(如一致性模型和文本到视频生成)中的广泛应用潜力。
链接: https://arxiv.org/abs/2411.15580
作者: Ryugo Morita,Stanislav Frolov,Brian Bernhard Moser,Takahiro Shirakawa,Ko Watanabe,Andreas Dengel,Jinjia Zhou
关键词-EN: Content Generation Diffusion, Chroma Key Content, textual fidelity, Generation Diffusion Model, Stable Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.
zh
[CV-158] Reassessing Layer Pruning in LLMs: New Insights and Methods
【速读】: 该论文试图解决在大语言模型(LLMs)中如何有效进行层剪枝(layer pruning)的问题,特别是在资源受限的环境中如何减少计算资源的需求。解决方案的关键在于发现了一种简单而有效的层剪枝策略:即剪枝模型最后25%的层,然后对 lm_head 和剩余的最后三层进行微调。这种策略不仅显著减少了计算开销,而且在性能上超越了许多同规模的流行LLMs,如ChatGLM2-6B、Vicuna-7B-v1.5、Qwen1.5-7B和Baichuan2-7B。
链接: https://arxiv.org/abs/2411.15558
作者: Yao Lu,Hao Cheng,Yujie Fang,Zeyu Wang,Jiaheng Wei,Dongwei Xu,Qi Xuan,Xiaoniu Yang,Zhaowei Zhu
关键词-EN: posing significant challenges, achieved remarkable success, considerable scale necessitates, scale necessitates substantial, substantial computational resources
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Approximation) family, widely regarded as a leading method for pruned model fine-tuning, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the lm_head and the remaining last three layers, yields remarkably strong performance. Following this guide, we prune Llama-3.1-8B-It and obtain a model that outperforms many popular LLMs of similar size, such as ChatGLM2-6B, Vicuna-7B-v1.5, Qwen1.5-7B and Baichuan2-7B. We release the optimal model weights on Huggingface, and the code is available on GitHub.
zh
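下面用纯 PyTorch 给出该剪枝配方的示意(假设模型为 Llama 风格、解码层位于 model.model.layers 的 nn.ModuleList 中且存在 lm_head;属性名因实现而异,真实的 HuggingFace 模型还需同步更新 config 中的层数):

```python
import torch.nn as nn

def prune_and_prepare(model, prune_ratio=0.25, trainable_tail=3):
    """Sketch of the recipe from the abstract: drop the final 25% of decoder
    layers, then make only lm_head and the last few remaining layers trainable.
    Attribute names (model.model.layers, model.lm_head) are assumptions."""
    layers = model.model.layers
    keep = len(layers) - int(len(layers) * prune_ratio)
    model.model.layers = nn.ModuleList(list(layers[:keep]))   # prune the tail

    for p in model.parameters():                              # freeze everything ...
        p.requires_grad = False
    for p in model.lm_head.parameters():                      # ... unfreeze lm_head
        p.requires_grad = True
    for layer in model.model.layers[-trainable_tail:]:        # ... and the last kept layers
        for p in layer.parameters():
            p.requires_grad = True
    return model
```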
[CV-159] LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces
【速读】: 该论文试图解决无监督领域自适应(Unsupervised Domain Adaptation)中的关键问题,即如何在保持领域特定特征的同时,实现模型在未见领域间的知识迁移。现有方法难以平衡领域不变表示与领域特定特征的保留,主要原因是它们采用的对齐方法在潜在空间中将语义相似但领域差异显著的样本投影得过于接近。论文提出的解决方案是 LAGUNA,其关键在于从绝对坐标中的表示对齐转向潜在空间中等价概念的相对定位对齐。LAGUNA 通过在语言空间中定义类标签间的语义/几何关系来构建领域无关的结构,并指导自适应过程,确保视觉空间中样本的组织反映参考的类间关系,同时保留领域特定特征。
链接: https://arxiv.org/abs/2411.15557
作者: Anxhelo Diko,Antonino Furnari,Luigi Cinque,Giovanni Maria Farinella
关键词-EN: Unsupervised domain adaptation, Unsupervised domain, domain adaptation remains, remains a critical, critical challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Unsupervised domain adaptation remains a critical challenge in enabling the knowledge transfer of models across unseen domains. Existing methods struggle to balance the need for domain-invariant representations with preserving domain-specific features, which is often due to alignment approaches that impose the projection of samples with similar semantics close in the latent space despite their drastic domain differences. We introduce LAGUNA (LAnguage Guided UNsupervised Adaptation), a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. LAGUNA defines a domain-agnostic structure upon the semantic/geometric relationships between class labels in language space and guides adaptation, ensuring that the organization of samples in visual space reflects reference inter-class relationships while preserving domain-specific characteristics. Remarkably, LAGUNA surpasses previous works in 18 different adaptation scenarios across four diverse image and video datasets with average accuracy improvements of +3.32% on DomainNet, +5.75% in GeoPlaces, +4.77% on GeoImnet, and +1.94% mean class accuracy improvement on EgoExo4D.
zh
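摘要只描述了“对齐相对位置关系”的思想,下面给出一个推测性的示意损失(并非论文的原始目标函数):以语言空间中类别标签嵌入的两两相似度矩阵为参考结构,约束视觉空间中类别原型的相似度结构与之一致,而不直接对齐绝对坐标。

```python
import torch
import torch.nn.functional as F

def relative_structure_loss(visual_protos, text_protos):
    """Match pairwise inter-class similarity structure instead of absolute positions.

    visual_protos: (C, D_v) class prototypes in visual space (trainable);
    text_protos:   (C, D_t) class-label embeddings in language space (reference).
    """
    sv = F.normalize(visual_protos, dim=-1)
    st = F.normalize(text_protos, dim=-1)
    sim_v = sv @ sv.t()                 # (C, C) visual inter-class similarities
    sim_t = (st @ st.t()).detach()      # (C, C) fixed language-space reference structure
    return F.mse_loss(sim_v, sim_t)

loss = relative_structure_loss(torch.randn(10, 512, requires_grad=True), torch.randn(10, 768))
loss.backward()
```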
[CV-160] ReWind: Understanding Long Videos with Instructed Learnable Memory
【速读】: 该论文试图解决现有视觉语言模型(Vision-Language Models, VLMs)在处理长视频时面临的计算效率低、内存限制和时间连贯性理解困难的问题。解决方案的关键在于引入了一种名为ReWind的新型基于记忆的VLM,其核心创新包括:1) 一个动态可学习的记忆模块,采用独特的“读-感知-写”循环(read-perceive-write cycle)来存储和更新与指令相关的视觉信息,通过可学习的查询和跨注意力机制(cross-attentions)确保内存需求随token数量线性增长;2) 一种自适应帧选择机制,根据记忆内容识别指令相关的关键时刻,并选择高分辨率帧来丰富记忆表示,最终结合记忆内容和大型语言模型(Large Language Model, LLM)生成最终答案。这些创新使得ReWind在视觉问答(VQA)和时间定位任务中表现出优越性能,显著超越了现有方法。
链接: https://arxiv.org/abs/2411.15556
作者: Anxhelo Diko,Tinghuai Wang,Wassim Swaileh,Shiyan Sun,Ioannis Patras
关键词-EN: applications requiring integrated, requiring integrated understanding, integrated understanding textual, crucial for applications, applications requiring
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind’s superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13% score gain and a +12% accuracy improvement on the MovieChat-1K VQA dataset and an +8% mIoU increase on Charades-STA for temporal grounding.
zh
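以下是按“读-感知-写”描述推测的记忆模块极简示意(非官方实现):固定数量的记忆槽通过交叉注意力读取当前帧块的视觉 token,并以门控残差方式写回;记忆随视频流式更新而规模保持不变。

```python
import torch
import torch.nn as nn

class LearnableMemory(nn.Module):
    """Sketch of a read-perceive-write style memory update (details assumed)."""

    def __init__(self, num_slots=64, dim=768, heads=8):
        super().__init__()
        self.init_memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)

    def forward(self, frame_tokens, memory=None):
        # frame_tokens: (B, N, dim) tokens of the current chunk of frames;
        # memory: (B, num_slots, dim) carried over from the previous chunk.
        if memory is None:
            memory = self.init_memory.unsqueeze(0).expand(frame_tokens.shape[0], -1, -1)
        read, _ = self.attn(query=memory, key=frame_tokens, value=frame_tokens)  # read / perceive
        return memory + torch.sigmoid(self.gate(read)) * read                    # gated write

mem = LearnableMemory()
m1 = mem(torch.randn(2, 196, 768))        # first chunk
m2 = mem(torch.randn(2, 196, 768), m1)    # memory carried to the next chunk
```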
[CV-161] Enhancing the Transferability of Adversarial Attacks on Face Recognition with Diverse Parameters Augmentation
【速读】: 该论文试图解决现有面部识别模型(Face Recognition, FR)在对抗样本(adversarial examples)攻击下的脆弱性问题,特别是提高对抗攻击的可迁移性(transferability)以揭示这些系统的盲点。解决方案的关键在于提出了一种名为多样化参数增强(Diverse Parameters Augmentation, DPA)的新方法,通过引入多样化的参数初始化来增强代理模型(surrogate models),从而生成更具迁移性的对抗样本。DPA方法包括两个核心阶段:多样化参数优化(Diverse Parameters Optimization, DPO)和硬模型聚合(Hard Model Aggregation, HMA)。在DPO阶段,通过使用预训练和随机参数初始化代理模型的参数,并在训练过程中保存中间模型,以获得多样化的代理模型集合。在HMA阶段,通过引入有益的扰动来增强多样化代理模型的特征图,进一步提高对抗样本的迁移性。实验结果表明,该方法能有效提升生成的对抗面部样本的可迁移性。
链接: https://arxiv.org/abs/2411.15555
作者: Fengfan Zhou,Bangjie Yin,Hefei Ling,Qianyu Zhou,Wenxuan Wang
关键词-EN: subtly manipulate benign, benign face images, manipulate benign face, surrogate models, Diverse Parameters Augmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Face Recognition (FR) models are vulnerable to adversarial examples that subtly manipulate benign face images, underscoring the urgent need to improve the transferability of adversarial attacks in order to expose the blind spots of these systems. Existing adversarial attack methods often overlook the potential benefits of augmenting the surrogate model with diverse initializations, which limits the transferability of the generated adversarial examples. To address this gap, we propose a novel method called Diverse Parameters Augmentation (DPA) attack method, which enhances surrogate models by incorporating diverse parameter initializations, resulting in a broader and more diverse set of surrogate models. Specifically, DPA consists of two key stages: Diverse Parameters Optimization (DPO) and Hard Model Aggregation (HMA). In the DPO stage, we initialize the parameters of the surrogate model using both pre-trained and random parameters. Subsequently, we save the models in the intermediate training process to obtain a diverse set of surrogate models. During the HMA stage, we enhance the feature maps of the diversified surrogate models by incorporating beneficial perturbations, thereby further improving the transferability. Experimental results demonstrate that our proposed attack method can effectively enhance the transferability of the crafted adversarial face examples.
zh
[CV-162] Improving Transferable Targeted Attacks with Feature Tuning Mixup
【速读】: 该论文试图解决深度神经网络在对抗样本攻击中的可转移性问题,特别是针对目标攻击的可转移性。解决方案的关键在于提出了一种名为特征调谐混合 (Feature Tuning Mixup, FTM) 的新方法,该方法通过在特征空间中结合随机噪声和优化噪声来增强目标攻击的可转移性。FTM 引入了可学习的特征扰动,并采用高效的随机更新策略进行优化,从而生成更具鲁棒性和可转移性的对抗样本。此外,通过集成多个经过 FTM 扰动的代理模型,进一步提升了攻击性能。实验结果表明,该方法在保持低计算成本的同时,显著优于现有最先进的方法。
链接: https://arxiv.org/abs/2411.15553
作者: Kaisheng Liang,Xuelong Dai,Yanjie Li,Dong Wang,Bin Xiao
关键词-EN: Deep neural networks, neural networks exhibit, networks exhibit vulnerability, Deep neural, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep neural networks exhibit vulnerability to adversarial examples that can transfer across different models. A particularly challenging problem is developing transferable targeted attacks that can mislead models into predicting specific target classes. While various methods have been proposed to enhance attack transferability, they often incur substantial computational costs while yielding limited improvements. Recent clean feature mixup methods use random clean features to perturb the feature space but lack optimization for disrupting adversarial examples, overlooking the advantages of attack-specific perturbations. In this paper, we propose Feature Tuning Mixup (FTM), a novel method that enhances targeted attack transferability by combining both random and optimized noises in the feature space. FTM introduces learnable feature perturbations and employs an efficient stochastic update strategy for optimization. These learnable perturbations facilitate the generation of more robust adversarial examples with improved transferability. We further demonstrate that attack performance can be enhanced through an ensemble of multiple FTM-perturbed surrogate models. Extensive experiments on the ImageNet-compatible dataset across various models demonstrate that our method achieves significant improvements over state-of-the-art methods while maintaining low computational cost.
zh
[CV-163] NeRF Inpainting with Geometric Diffusion Prior and Balanced Score Distillation
【速读】: 该论文试图解决现有NeRF(Neural Radiance Fields)图像修复方法在利用预训练扩散模型时表现不佳的问题,主要体现在两个方面:预训练扩散模型对几何信息的捕捉不足,以及现有Score Distillation Sampling (SDS)方法提供的指导不够优化。解决方案的关键在于引入了一种名为GB-NeRF的新框架,通过改进2D扩散先验的利用来增强NeRF图像修复。具体创新包括:同时学习外观和几何先验的微调策略,以及将这些几何先验整合到NeRF图像修复中的专用法向蒸馏损失。此外,论文提出了一种名为Balanced Score Distillation (BSD)的技术,该技术在外观和几何方面的修复质量上优于现有的SDS和Conditional Score Distillation (CSD)方法。
链接: https://arxiv.org/abs/2411.15551
作者: Menglin Zhang,Xin Luo,Yunwei Lan,Chang Liu,Rui Li,Kaidong Zhang,Ganlin Yang,Dong Liu
关键词-EN: Recent advances, Score Distillation, Score Distillation Sampling, pretrained diffusion models, leveraged pretrained diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in NeRF inpainting have leveraged pretrained diffusion models to enhance performance. However, these methods often yield suboptimal results due to their ineffective utilization of 2D diffusion priors. The limitations manifest in two critical aspects: the inadequate capture of geometric information by pretrained diffusion models and the suboptimal guidance provided by existing Score Distillation Sampling (SDS) methods. To address these problems, we introduce GB-NeRF, a novel framework that enhances NeRF inpainting through improved utilization of 2D diffusion priors. Our approach incorporates two key innovations: a fine-tuning strategy that simultaneously learns appearance and geometric priors and a specialized normal distillation loss that integrates these geometric priors into NeRF inpainting. We propose a technique called Balanced Score Distillation (BSD) that surpasses existing methods such as Score Distillation (SDS) and the improved version, Conditional Score Distillation (CSD). BSD offers improved inpainting quality in appearance and geometric aspects. Extensive experiments show that our method provides superior appearance fidelity and geometric consistency compared to existing approaches.
zh
[CV-164] Hierarchical Cross-Attention Network for Virtual Try-On
【速读】: 该论文试图解决虚拟试衣任务中的挑战,提出了一种名为分层交叉注意力网络 (Hierarchical Cross-Attention Network, HCANet) 的创新解决方案。解决方案的关键在于两个主要阶段:几何匹配和试衣,以及在这两个阶段中引入的新型分层交叉注意力 (Hierarchical Cross-Attention, HCA) 模块。HCA 模块能够有效捕捉个体与服装模态之间的长程相关性,增强网络的深度和鲁棒性,通过分层方法细致地表示人与服装之间的交互,从而生成高度逼真的虚拟试衣效果。实验结果表明,HCANet 在定量指标和视觉真实性评估中均表现出色,成为虚拟试衣技术领域的先进解决方案。
链接: https://arxiv.org/abs/2411.15542
作者: Hao Tang,Bin Ren,Pingping Wu,Nicu Sebe
关键词-EN: Hierarchical Cross-Attention Network, virtual try-on task, virtual try-on, present an innovative, Hierarchical Cross-Attention
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we present an innovative solution for the challenges of the virtual try-on task: our novel Hierarchical Cross-Attention Network (HCANet). HCANet is crafted with two primary stages: geometric matching and try-on, each playing a crucial role in delivering realistic virtual try-on outcomes. A key feature of HCANet is the incorporation of a novel Hierarchical Cross-Attention (HCA) block into both stages, enabling the effective capture of long-range correlations between individual and clothing modalities. The HCA block enhances the depth and robustness of the network. By adopting a hierarchical approach, it facilitates a nuanced representation of the interaction between the person and clothing, capturing intricate details essential for an authentic virtual try-on experience. Our experiments establish the prowess of HCANet. The results showcase its performance across both quantitative metrics and subjective evaluations of visual realism. HCANet stands out as a state-of-the-art solution, demonstrating its capability to generate virtual try-on results that excel in accuracy and realism. This marks a significant step in advancing virtual try-on technologies.
zh
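论文摘要未给出 HCA block 的具体结构,下面是一个通用交叉注意力块的示意(假设人物 token 作为 query、服装 token 作为 key/value);在多个特征分辨率上堆叠此类模块即可得到“分层”的变体。

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Generic cross-attention block in the spirit of HCA (structure assumed)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, person_tokens, cloth_tokens):
        # Person tokens query clothing tokens to gather garment details per body location.
        q, kv = self.norm_q(person_tokens), self.norm_kv(cloth_tokens)
        fused, _ = self.attn(q, kv, kv)
        x = person_tokens + fused
        return x + self.ffn(x)

out = CrossAttentionBlock()(torch.randn(1, 1024, 256), torch.randn(1, 1024, 256))
```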
[CV-165] Optical-Flow Guided Prompt Optimization for Coherent Video Generation
【速读】: 该论文试图解决文本到视频扩散模型在生成过程中面临的时间一致性问题。解决方案的关键在于提出了一种名为MotionPrompt的新框架,通过光流(optical flow)来引导视频生成过程。具体来说,论文训练了一个判别器来区分真实视频和生成视频中随机帧对之间的光流差异,并在反向采样步骤中优化可学习的token嵌入,利用训练好的判别器对随机帧对的梯度进行优化。这种方法能够在不降低生成内容保真度的情况下,生成视觉上连贯且符合自然运动动态的视频序列。
链接: https://arxiv.org/abs/2411.15540
作者: Hyelin Nam,Jaemin Kim,Dohun Lee,Jong Chul Ye
关键词-EN: made significant strides, significant strides, temporal consistency, made significant, face challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: project page: this https URL
点击查看摘要
Abstract:While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.
zh
[CV-166] Large Language Model with Region-guided Referring and Grounding for CT Report Generation
【速读】: 该论文试图解决CT报告生成中现有方法仅考虑全局特征而忽略特定区域细节的问题,导致可能遗漏异常情况。解决方案的关键在于提出了Reg2RG,这是一个区域引导的参考和定位框架,通过聚焦于CT体积中的解剖区域来提升诊断性能。具体来说,Reg2RG利用通用分割模块的掩码来捕捉每个参考区域的局部特征,并通过局部特征解耦(LFD)策略保留局部高分辨率细节。随后,将局部特征与全局特征整合,以捕捉区域间的关系。此外,论文提出了区域-报告对齐(RRA)训练策略,利用区域识别来指导生成区域特定的报告,增强模型的参考和定位能力,同时提高报告的可解释性。最后,使用大型语言模型(LLM)作为语言解码器,从整合的视觉特征中生成报告,促进区域级别的理解。
链接: https://arxiv.org/abs/2411.15539
作者: Zhixuan Chen,Yequan Bie,Haibo Jin,Hao Chen
关键词-EN: Computed tomography, time-consuming and labor-intensive, crucial to assist, assist radiologists, radiologists in interpreting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages
点击查看摘要
Abstract:Computed tomography (CT) report generation is crucial to assist radiologists in interpreting CT volumes, which can be time-consuming and labor-intensive. Existing methods primarily consider only the global features of the entire volume, making them struggle to focus on specific regions and potentially miss abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve the local high-resolution details with little computational overhead. Then the local features are integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model’s referring and grounding capabilities while also improving the report’s interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code will be made publicly available.
zh
[CV-167] MUNBa: Machine Unlearning via Nash Bargaining
【速读】: 该论文试图解决机器遗忘 (Machine Unlearning, MU) 中遗忘特定概念/数据与保留模型整体效用之间的目标冲突问题。解决方案的关键在于将MU重新构建成一个双玩家合作博弈,其中遗忘玩家和保留玩家通过各自的梯度提议来最大化整体收益。基于纳什谈判理论,论文推导出一个封闭形式的解,指导模型向帕累托前沿移动,从而有效避免梯度冲突。该方法确保了均衡解,即任何偏离最终状态的行为都会导致双方整体目标的减少,从而在每个目标上实现最优性。
链接: https://arxiv.org/abs/2411.15537
作者: Jing Wu,Mehrtash Harandi
关键词-EN: Machine Unlearning, selectively erase harmful, erase harmful behaviors, aims to selectively, selectively erase
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto front, effectively avoiding the gradient conflicts. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm’s effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving superior performance on several benchmarks. For example, in the challenging scenario of sample-wise forgetting, our algorithm approaches the gold standard retrain baseline. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.
zh
[CV-168] CellPilot
【速读】: 该论文试图解决在数字病理学中细胞和腺体分割的自动化与交互性之间的平衡问题。解决方案的关键在于引入了一个名为CellPilot的框架,该框架结合了自动分割和交互式细化的优势。CellPilot通过提供初始的自动分割结果,并允许用户在图形用户界面(GUI)中进行引导性的交互式修正,从而提高了分割的准确性和效率。该模型在超过675,000个掩码的九个多样化细胞和腺体分割数据集上进行了训练,涵盖了16个器官,展示了其在三个独立病理学数据集上的优越性能。此外,CellPilot的开源发布有助于推动更强大和通用的诊断模型的开发。
链接: https://arxiv.org/abs/2411.15514
作者: Philipp Endres,Valentin Koch,Julia A. Schnabel,Carsten Marr
关键词-EN: enabling improved visualization, increasingly digitized, streamlined workflows, microscopic study, study of diseased
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Histopathology, the microscopic study of diseased tissue, is increasingly digitized, enabling improved visualization and streamlined workflows. An important task in histopathology is the segmentation of cells and glands, essential for determining shape and frequencies that can serve as indicators of disease. Deep learning tools are widely used in histopathology. However, variability in tissue appearance and cell morphology presents challenges for achieving reliable segmentation, often requiring manual correction to improve accuracy. This work introduces CellPilot, a framework that bridges the gap between automatic and interactive segmentation by providing initial automatic segmentation as well as guided interactive refinement. Our model was trained on over 675,000 masks of nine diverse cell and gland segmentation datasets, spanning 16 organs. CellPilot demonstrates superior performance compared to other interactive tools on three held-out histopathological datasets while enabling automatic segmentation. We make the model and a graphical user interface designed to assist practitioners in creating large-scale annotated datasets available as open-source, fostering the development of more robust and generalized diagnostic models.
zh
[CV-169] Interactive Visual Assessment for Text-to-Image Generation Models
【速读】: 该论文试图解决现有视觉生成模型评估方法在实际部署中面临的挑战,特别是评估框架的固定覆盖范围、不断变化的难度以及数据泄露风险等问题。解决方案的关键是提出了DyEval,一个基于大型语言模型(LLM)的动态交互式视觉评估框架。DyEval通过直观的视觉界面,使用户能够与生成模型进行协作评估,并根据模型反馈动态生成层次化、细粒度和多样化的文本输入,以持续探测模型的能力边界。此外,DyEval还包含一个上下文反思模块,用于挖掘测试输入的失败触发因素,并通过LLM的逻辑推理能力反映模型的潜在失败模式,从而支持深入分析。实验结果表明,DyEval能够有效帮助用户识别比传统方法多2.56倍的生成失败,并揭示复杂和罕见的失败模式,如代词生成和文化背景生成问题。
链接: https://arxiv.org/abs/2411.15509
作者: Xiaoyue Mi,Fan Tang,Juan Cao,Qiang Sheng,Ziyao Huang,Peng Li,Yang Liu,Tong-Yee Lee
关键词-EN: achieved remarkable progress, computer graphics applications, face significant challenges, real-world deployment, achieved remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Review
点击查看摘要
Abstract:Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These approaches suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis for users to further improve tested models, we develop a contextual reflection module that mines failure triggers from test inputs and reflects the model’s potential failure patterns, supporting in-depth analysis using the logical reasoning ability of LLMs. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify up to 2.56 times more generation failures than conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.
zh
[CV-170] AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation
【速读】: 该论文试图解决遥感图像目标检测 (Remote Sensing Image Object Detection, RSIOD) 中标注数据稀缺的问题。解决方案的关键在于提出了一个布局可控的扩散生成模型 (Layout-controllable Diffusion Generative Model, AeroGen),该模型能够同时支持水平和旋转边界框的条件生成,从而生成符合特定布局和目标类别要求的高质量合成图像。此外,论文还提出了一种端到端的数据增强框架,该框架集成了多样性条件生成器和过滤机制,以增强生成数据的多样性和质量。实验结果表明,该方法生成的合成数据不仅质量高且多样性丰富,还能显著提升现有RSIOD模型的检测性能。
链接: https://arxiv.org/abs/2411.15497
作者: Datao Tang,Xiangyong Cao,Xuan Wu,Jialin Li,Jing Yao,Xueru Bai,Dongsheng Jiang,Yin Li,Deyu Meng
关键词-EN: Remote sensing image, Remote sensing, aims to identify, aerial imagery, identify and locate
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e. AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7%, 4.3%, and 2.43%, respectively. The code is available at this https URL.
zh
[CV-171] Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation
【速读】: 该论文试图解决急性缺血性卒中(Acute Ischemic Stroke, AIS)的自动化诊断报告生成问题,特别是在扩散加权成像(Diffusion Weighted Imaging, DWI)图像与放射学报告之间的跨模态映射难题。解决方案的关键在于提出了配对图像域检索与文本域增强(Paired Image-domain Retrieval and Text-domain Augmentation, PIRTA)框架,这是一个跨模态检索增强生成(Retrieval-Augmented Generation, RAG)方法。PIRTA通过将跨模态映射问题转化为在已有配对DWI图像和放射学报告的数据库中检索相似图像,从而避免了直接学习跨模态映射的困难。利用检索到的报告来增强查询图像的报告生成过程,实验结果表明PIRTA能够从3D DWI图像中准确检索相关报告,显著提高了报告生成的准确性,相比于直接使用最先进的跨模态语言模型的图像到文本生成方法。
链接: https://arxiv.org/abs/2411.15490
作者: Junhyeok Lee,Yujin Oh,Dahyoun Lee,Hyon Keun Joh,Chul-Ho Sohn,Sung Hyun Baik,Cheol Kyu Jung,Jung Hyun Park,Kyu Sung Choi,Byung-Hoon Kim,Jong Chul Ye
关键词-EN: Acute ischemic stroke, requires time-critical management, delayed intervention leading, Acute ischemic, AIS radiology reports
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretative AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. By exploiting the retrieved radiology reports to augment the report generation process of the query image, we show by experiments with extensive in-house and public datasets that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.
zh
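下面是体现 PIRTA“图像域检索 + 文本域增强”思路的极简示意(非官方实现):假设已有查询 DWI 体数据与数据库样本的特征向量(3D 图像编码器不在此展示),按余弦相似度检索 top-k 相似病例,并把它们配对的真实报告拼进生成提示。

```python
import torch
import torch.nn.functional as F

def retrieve_reports(query_emb, db_embs, db_reports, k=3):
    """In-domain retrieval step of a PIRTA-like pipeline (embeddings assumed given).

    query_emb: (D,) embedding of the query DWI volume; db_embs: (N, D) embeddings of
    database volumes, each paired with a ground-truth report in db_reports."""
    sims = F.cosine_similarity(db_embs, query_emb.unsqueeze(0), dim=-1)
    return [db_reports[i] for i in sims.topk(k).indices.tolist()]

def build_augmented_prompt(retrieved_reports):
    context = "\n\n".join(f"[Similar case {i + 1}]\n{r}" for i, r in enumerate(retrieved_reports))
    return ("You are drafting a DWI radiology report. Reports of visually similar cases:\n"
            f"{context}\n\nWrite the report for the current study.")

reports = retrieve_reports(torch.randn(512), torch.randn(100, 512),
                           [f"report {i}" for i in range(100)])
print(build_augmented_prompt(reports)[:120])
```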
[CV-172] SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving
【速读】: 该论文试图解决现有动态高斯溅射方法在复杂动态城市场景中依赖于昂贵的手动标注进行对象级监督,从而限制了其在实际应用中的可扩展性的问题。解决方案的关键在于引入SplatFlow,这是一种在神经运动流场(Neural Motion Flow Fields, NMFF)内实现自监督动态高斯溅射的方法。SplatFlow通过设计一个统一的框架,将时间依赖的4D高斯表示无缝集成到NMFF中,其中NMFF是一组隐式函数,用于建模LiDAR点和Gaussians的连续运动流场。这种方法能够有效分解静态背景和动态对象,分别用3D和4D高斯基元表示,并通过建模每个4D高斯在时间上的状态对应关系,聚合时间特征以增强动态组件的跨视图一致性。此外,SplatFlow通过从2D基础模型中提取特征到4D时空表示,进一步提高了动态场景的识别能力。实验结果表明,SplatFlow在Waymo Open Dataset和KITTI Dataset上的图像重建和新视图合成方面达到了最先进的性能。
链接: https://arxiv.org/abs/2411.15482
作者: Su Sun,Cheng Zhao,Zhuoyang Sun,Yingjie Victor Chen,Mei Chen
关键词-EN: Gaussian Splatting methods, Dynamic Gaussian Splatting, expensive manual labeling, Motion Flow Fields, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Most existing Dynamic Gaussian Splatting methods for complex dynamic urban scenarios rely on accurate object-level supervision from expensive manual labeling, limiting their scalability in real-world applications. In this paper, we introduce SplatFlow, a Self-Supervised Dynamic Gaussian Splatting within Neural Motion Flow Fields (NMFF) to learn 4D space-time representations without requiring tracked 3D bounding boxes, enabling accurate dynamic scene reconstruction and novel view RGB, depth and flow synthesis. SplatFlow designs a unified framework to seamlessly integrate time-dependent 4D Gaussian representation within NMFF, where NMFF is a set of implicit functions to model temporal motions of both LiDAR points and Gaussians as continuous motion flow fields. Leveraging NMFF, SplatFlow effectively decomposes static background and dynamic objects, representing them with 3D and 4D Gaussian primitives, respectively. NMFF also models the status correspondences of each 4D Gaussian across time, which aggregates temporal features to enhance cross-view consistency of dynamic components. SplatFlow further improves dynamic scene identification by distilling features from 2D foundational models into 4D space-time representation. Comprehensive evaluations conducted on the Waymo Open Dataset and KITTI Dataset validate SplatFlow’s state-of-the-art (SOTA) performance for both image reconstruction and novel view synthesis in dynamic urban scenarios.
zh
[CV-173] KinMo: Kinematic-aware Human Motion Understanding and Generation
【速读】: 该论文试图解决基于文本控制人体运动时难以捕捉和操控局部身体部位细微运动的问题。解决方案的关键在于提出了一种新的运动表示方法,该方法从运动学角度将运动分解为不同的身体关节组运动及其相互作用。论文设计了一个自动数据集收集流程,通过引入细粒度的局部关节组运动和交互描述,增强了现有的文本-运动基准。此外,论文引入了一种层次化的运动语义方法,逐步将关节级别的交互信息融合到全局动作级别的语义中,以实现模态对齐。通过这种层次结构,论文提出了一种从粗到细的运动合成过程,适用于各种生成和编辑下游应用。实验结果表明,该方法不仅提高了文本-运动检索中的关节空间理解能力,还实现了更精确的关节运动生成和控制。
链接: https://arxiv.org/abs/2411.15472
作者: Pengfei Zhang,Pinxin Liu,Hyeongwoo Kim,Pablo Garrido,Bindita Chaudhuri
关键词-EN: Controlling human motion, Controlling human, human motion based, computer vision, presents an important
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Controlling human motion based on text presents an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control. Project Page: this https URL
zh
[CV-174] Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning
【速读】: 该论文试图解决持续学习 (Continual Learning, CL) 中的灾难性遗忘问题,即在模型学习新任务时如何避免遗忘之前任务的知识。解决方案的关键在于引入 Mamba-CL 框架,通过在特征子空间之外更新大规模 Mamba 基础模型中的状态空间模型 (State Space Models, SSMs) 的核心参数,确保每个 SSM 模块在当前和之前任务中输出的一致性。具体实现上,通过推导 Mamba 模型中四个关键的时间不变参数的整体一致性约束,并利用零空间投影 (null-space projection) 高效实现参数的正交性,从而在理论和实践中有效克服灾难性遗忘问题。
链接: https://arxiv.org/abs/2411.15469
作者: De Cheng,Yue Lu,Lingfeng He,Shizhou Zhang,Xi Yang,Nannan Wang,Xinbo Gao
关键词-EN: Continual Learning, previously learned knowledge, State Space Models, Mamba model, forgetting previously learned
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Continual Learning (CL) aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. Recently, State Space Models (SSMs), particularly the Mamba model, have achieved notable success in computer vision. Building on the strengths of SSMs, this study explores leveraging the Mamba model for CL. Therefore, we introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model by updating parameters orthogonal to the feature subspace of previous tasks. This approach theoretically guarantees the consistency objective, aiming to preserve consistent output for each SSM module across both previous and current tasks, so as to overcome the catastrophic forgetting issue. Specifically, we achieve this goal by deducing the overall consistency constraints on four key time-invariant parameters in the Mamba model, streamlining its recurrent state-space structure and non-linear discretization process in SSM. In practice, we apply the null-space projection to efficiently implement the orthogonality within the Mamba model. Extensive experiments on four class-incremental benchmarks demonstrate the effectiveness of Mamba-CL for anti-forgetting, achieving superior performance compared to state-of-the-art methods. Code is available in the supplementary materials.
zh
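下面给出零空间投影(null-space projection)这一通用技巧的示意实现(做法类似 Adam-NSCL),并非 Mamba-CL 对四个 SSM 时不变参数的完整约束推导:对旧任务特征的非中心协方差做 SVD,取近零奇异值对应的方向构造投影矩阵,把权重梯度投影到该零空间后再更新,从而近似保持旧任务输出不变。

```python
import torch

def null_space_projector(prev_features, eps=1e-2):
    """Projector onto the (approximate) null space of previous-task features.

    prev_features: (N, D) inputs seen by a layer on earlier tasks. Directions whose
    singular values of the uncentered covariance are near zero span the subspace in
    which weight updates leave previous outputs (approximately) unchanged."""
    cov = prev_features.t() @ prev_features / prev_features.shape[0]   # (D, D)
    U, S, _ = torch.linalg.svd(cov)
    null = U[:, S < eps * S.max()]                                     # near-null directions
    return null @ null.t()                                             # (D, D) projector

def project_grad(grad, projector):
    # grad: (out_dim, D) gradient of a linear map acting on those features.
    return grad @ projector

P = null_space_projector(torch.randn(1000, 64))
g_safe = project_grad(torch.randn(32, 64), P)
```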
[CV-175] SplatSDF: Boosting Neural Implicit SDF via Gaussian Splatting Fusion
【速读】: 该论文试图解决场景级有符号距离函数(Signed Distance Function, SDF)重建的准确性和收敛速度问题。解决方案的关键在于提出了一种名为“SplatSDF”的新型神经隐式SDF,通过在架构层面上融合3D高斯喷射(3D Gaussian Splatting, 3DGS)和SDF-NeRF,显著提升了几何和光度准确性以及收敛速度。SplatSDF在训练阶段仅依赖3DGS作为输入,而在推理阶段保持与原始SDF-NeRF相同的复杂度和效率。
链接: https://arxiv.org/abs/2411.15468
作者: Runfa Blark Li,Keito Suzuki,Bang Du,Ki Myung Brian Le,Nikolay Atanasov,Truong Nguyen
关键词-EN: signed distance function, SDF, collision checking, neural implicit SDF, distance function
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:A signed distance function (SDF) is a useful representation for continuous-space geometry and many related operations, including rendering, collision checking, and mesh generation. Hence, reconstructing SDF from image observations accurately and efficiently is a fundamental problem. Recently, neural implicit SDF (SDF-NeRF) techniques, trained using volumetric rendering, have gained a lot of attention. Compared to earlier truncated SDF (TSDF) fusion algorithms that rely on depth maps and voxelize continuous space, SDF-NeRF enables continuous-space SDF reconstruction with better geometric and photometric accuracy. However, the accuracy and convergence speed of scene-level SDF reconstruction require further improvements for many applications. With the advent of 3D Gaussian Splatting (3DGS) as an explicit representation with excellent rendering quality and speed, several works have focused on improving SDF-NeRF by introducing consistency losses on depth and surface normals between 3DGS and SDF-NeRF. However, loss-level connections alone lead to incremental improvements. We propose a novel neural implicit SDF called “SplatSDF” to fuse 3DGS and SDF-NeRF at an architecture level with significant boosts to geometric and photometric accuracy and convergence speed. Our SplatSDF relies on 3DGS as input only during training, and keeps the same complexity and efficiency as the original SDF-NeRF during inference. Our method outperforms state-of-the-art SDF-NeRF models on geometric and photometric evaluation by the time of submission.
zh
[CV-176] Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
【速读】: 该论文试图解决在零样本(zero-shot)条件下生成特定主题(subject-driven)图像时,传统方法需要大量时间和资源进行微调(fine-tuning),而现有零样本方法在主题对齐(subject alignment)方面表现不佳的问题。解决方案的关键在于引入了一种名为“Diptych Prompting”的新型零样本方法,该方法通过将任务重新解释为修复(inpainting)任务,利用大规模文本到图像模型中的双联画生成(diptych generation)的涌现特性,实现了精确的主题对齐。具体来说,Diptych Prompting将参考图像放置在左面板,并在右面板进行文本条件下的修复,通过去除参考图像的背景和增强面板间注意力权重来防止内容泄露并提升生成图像的细节质量。实验结果表明,该方法在视觉上优于现有的零样本图像提示方法,并支持多种图像生成应用,如主题驱动生成、风格化图像生成和主题驱动图像编辑。
链接: https://arxiv.org/abs/2411.15466
作者: Chaehun Shin,Jooyoung Choi,Heeseung Kim,Sungroh Yoon
关键词-EN: text prompt, aims to produce, desired context, context by accurately, accurately capturing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: this https URL
zh
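下面的示意代码只负责“拼双联画布 + 右侧修复掩码”这一步(假设参考图已去除背景、带 alpha 通道);后续的文本条件修复调用依赖具体模型,未在此展示。

```python
import numpy as np
from PIL import Image

def make_diptych_and_mask(reference_rgba, panel_size=512):
    """Compose the diptych layout: subject-only reference on the left panel and an
    inpainting mask covering the right panel (the inpainting model call is omitted)."""
    ref = reference_rgba.convert("RGBA").resize((panel_size, panel_size))
    canvas = Image.new("RGB", (2 * panel_size, panel_size), "white")
    canvas.paste(ref, (0, 0), mask=ref.split()[-1])       # left panel: subject only

    mask = np.zeros((panel_size, 2 * panel_size), dtype=np.uint8)
    mask[:, panel_size:] = 255                            # right panel is to be inpainted
    return canvas, Image.fromarray(mask)

# canvas, mask = make_diptych_and_mask(Image.open("subject_rgba.png"))  # hypothetical input
```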
[CV-177] MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
【速读】: 该论文试图解决视觉-语言跟踪任务中现有Transformer方法在利用时间信息和动态更新参考特征方面的不足。解决方案的关键在于引入基于状态空间模型(State Space Model, SSM)的Mamba模型,即MambaVLT,通过其时间演化混合状态空间块和选择性局部增强块,有效捕捉多模态上下文信息并自适应更新参考特征。此外,引入的模态选择模块动态调整视觉和语言参考之间的权重,以减少单一模态可能带来的歧义。
链接: https://arxiv.org/abs/2411.15459
作者: Xinqi Liu,Li Zhou,Zikun Zhou,Jianqiu Chen,Zhenyu He
关键词-EN: tracking task aims, object tracking based, Existing Transformer-based vision-language, State Space, Transformer-based vision-language tracking
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block, to capture contextual information for multimodal modeling and adaptive reference feature update. Besides, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.
zh
[CV-178] Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在指令跟随能力上与单一语言模型(Large Language Models, LLMs)之间存在的显著差距问题。解决方案的关键在于提出了视觉模态令牌压缩(Visual-Modality Token Compression, VMTC)和跨模态注意力抑制(Cross-Modality Attention Inhibition, CMAI)策略。VMTC通过保留主要令牌并压缩冗余令牌来减少视觉模态的冗余信息,而CMAI则通过聚合文本到图像的注意力,生成文本到图像的焦点分数,并对低分数的文本-图像令牌对进行注意力抑制,从而在增强MLLMs的指令跟随能力的同时,保留其多模态理解和处理能力。
链接: https://arxiv.org/abs/2411.15453
作者: Te Yang,Jian Jia,Xiangyu Zhu,Weisong Zhao,Bo Wang,Yanhua Cheng,Yan Li,Shengyuan Liu,Quan Chen,Peng Jiang,Kun Gai,Zhen Lei
关键词-EN: Large Language Models, Language Models, Large Language, Multimodal Large Language, strong instruction-following capability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs), in contrast, have markedly inferior instruction-following ability, leaving a significant gap between MLLMs and LLMs. In this study, we conduct a pilot experiment, which demonstrates that spatially down-sampling visual tokens significantly enhances the instruction-following capability of MLLMs. This is attributed to the substantial redundancy in the visual modality. However, this intuitive method severely impairs the MLLM’s multimodal understanding capability. In this paper, we propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap between MLLMs and LLMs by inhibiting the influence of irrelevant visual tokens during content generation, increasing the instruction-following ability of the MLLMs while retaining their multimodal understanding capacity. In the VMTC module, the primary tokens are retained and the redundant tokens are condensed by token clustering and merging. In the CMAI process, we aggregate text-to-image attentions by text-to-text attentions to obtain a text-to-image focus score. Attention inhibition is performed on the text-image token pairs with low scores. Our comprehensive experiments on instruction-following capability and five benchmarks (VQA-v2, GQA, TextVQA, MME, and MMBench) demonstrate that the proposed strategy significantly enhances the instruction-following capability of MLLMs while preserving their ability to understand and process multimodal inputs.
zh
[CV-179] Gotta Hear Them All: Sound Source Aware Vision to Audio Generation
【速读】: 该论文试图解决视觉到音频(V2A)合成中存在的沉浸感和表现力不足的问题,主要原因是现有方法仅依赖全局场景信息而忽略了局部声源(sound sources)的细节。解决方案的关键在于提出了一个声源感知视觉到音频生成器(Sound Source-Aware V2A, SSV2A)。SSV2A通过视觉检测和跨模态转换来局部感知场景中的多模态声源,并对比学习一个跨模态声源(Cross-Modal Sound Source, CMSS)流形,以语义区分每个声源。随后,通过注意力机制将这些CMSS语义混合成丰富的音频表示,最终由预训练的音频生成器输出声音。此外,论文还构建了一个新的单声源视觉-音频数据集VGGS3,并设计了声源匹配评分(Sound Source Matching Score)来评估局部音频的相关性。这是首次在声源级别上解决V2A生成问题,实验结果表明SSV2A在生成保真度和相关性方面优于现有最先进的方法。
链接: https://arxiv.org/abs/2411.15447
作者: Wei Guo,Heng Wang,Weidong Cai,Jianbo Ma
关键词-EN: synthesis has broad, applications in multimedia, broad applications, sound, sound sources
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 16 pages, 9 figures, source code released at this https URL
点击查看摘要
Abstract:Vision-to-audio (V2A) synthesis has broad applications in multimedia. Recent advancements of V2A methods have made it possible to generate relevant audios from inputs of videos or still images. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods solely rely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware V2A (SSV2A) generator. SSV2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate a novel single-sound-source visual-audio dataset VGGS3 from VGGSound. We also design a Sound Source Matching Score to measure localized audio relevance. This is to our knowledge the first work to address V2A generation at the sound-source level. Extensive experiments show that SSV2A surpasses state-of-the-art methods in both generation fidelity and relevance. We further demonstrate SSV2A’s ability to achieve intuitive V2A control by compositing vision, text, and audio conditions. Our SSV2A generation can be tried and heard at this https URL .
zh
[CV-180] freePruner: A Training-free Approach for Large Multimodal Model Acceleration
【速读】: 该论文试图解决大型多模态模型(Large Multimodal Models, LMMs)在视觉-语言任务中由于高计算需求而面临的部署挑战。解决方案的关键是提出了一种无需训练的token减少方法,称为freePruner。freePruner通过两阶段token选择策略实现加速:首先使用设计的贡献度指标识别捕捉高层语义信息的关键token,然后通过注意力模式分析选择保留低层视觉细节的补充token。这种方法无需重新训练或微调,可直接应用于任何开源LMM,并在主流视觉问答基准测试中实现了2倍加速,同时保持了相当的性能。此外,freePruner与其他后训练加速技术(如后训练量化)正交,可结合使用,为高效LMM部署提供了实用解决方案。
链接: https://arxiv.org/abs/2411.15446
作者: Bingxin Xu,Yuzhang Shang,Yunhao Ge,Qian Lou,Yan Yan
关键词-EN: Large Multimodal Models, Large Multimodal, high computational demands, demonstrated impressive capabilities, Multimodal Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Multimodal Models (LMMs) have demonstrated impressive capabilities in visual-language tasks but face significant deployment challenges due to their high computational demands. While recent token reduction methods show promise for accelerating LMMs, they typically require extensive retraining or fine-tuning, making them impractical for many state-of-the-art models, especially those with proprietary training data. We propose freePruner, a training-free token reduction approach that can be directly applied to any open-source LMM without additional training. Unlike existing methods that rely heavily on token merging operations, freePruner employs a two-stage token selection strategy: (1) identifying pivotal tokens that capture high-level semantic information using our designed contribution degree metric, and (2) selecting complementary tokens that preserve essential low-level visual details through attention pattern analysis. Extensive experiments demonstrate that freePruner achieves 2x acceleration while maintaining comparable performance across mainstream visual question-answering benchmarks in the training-free setting. Moreover, freePruner is orthogonal to and can be combined with other post-training acceleration techniques, such as post-training quantization, providing a practical solution for efficient LMM deployment.
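The two-stage selection in freePruner can be pictured with a short training-free routine: rank visual tokens by a contribution score (here simply the attention each token receives from the [CLS]/query side, as a stand-in for the paper's contribution-degree metric), then add complementary tokens that are least similar to those already kept. A minimal sketch under these assumptions:

```python
import torch
import torch.nn.functional as F

def select_tokens(tokens, cls_attn, keep_pivotal=64, keep_detail=32):
    """tokens: [N, D] visual tokens; cls_attn: [N] attention received by each token.
    Stage 1 keeps the highest-contribution tokens; stage 2 adds the tokens that are
    least redundant with the kept set, to preserve low-level detail."""
    piv_idx = cls_attn.topk(keep_pivotal).indices
    kept = tokens[piv_idx]
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[piv_idx] = False
    remaining = mask.nonzero(as_tuple=True)[0]
    # redundancy of each remaining token = max cosine similarity to the kept set
    sim = F.cosine_similarity(
        tokens[remaining].unsqueeze(1), kept.unsqueeze(0), dim=-1
    ).max(dim=1).values
    detail_idx = remaining[(-sim).topk(keep_detail).indices]
    return torch.cat([piv_idx, detail_idx])

# toy usage: 576 patch tokens of dimension 1024
idx = select_tokens(torch.randn(576, 1024), torch.rand(576))
```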
zh
[CV-181] Twin Trigger Generative Networks for Backdoor Attacks against Object Detection
【速读】: 该论文试图解决对象检测模型在训练和推理阶段易受后门攻击的问题。解决方案的关键在于提出了一种新颖的频率域双触发生成网络,用于生成不可见的触发器(invisible triggers)和可见的触发器(visible triggers)。不可见触发器在训练阶段植入模型,通过高斯平滑层和高频伪影分类器增强植入的隐蔽性;可见触发器在推理阶段激活后门,通过设计的新对齐损失优化,使其与原始模式不同但仍与不可见触发器的恶意激活行为对齐。这种双触发机制使得攻击过程难以追踪,显著降低了对象检测模型的mAP_0.5指标,分别达到70.0%和84.5%的降低效果。
链接: https://arxiv.org/abs/2411.15439
作者: Zhiying Li,Zhi Liu,Guanggang Geng,Shreyank N Gowda,Shuyuan Lin,Jian Weng,Xiaobo Jin
关键词-EN: invisible trigger generative, trigger generative, trigger generative networks, real-world applications, backdoor attacks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures
点击查看摘要
Abstract:Object detectors, which are widely used in real-world applications, are vulnerable to backdoor attacks. This vulnerability arises because many users rely on datasets or pre-trained models provided by third parties due to constraints on data and resources. However, most research on backdoor attacks has focused on image classification, with limited investigation into object detection. Furthermore, the triggers for most existing backdoor attacks on object detection are manually generated, requiring prior knowledge and consistent patterns between the training and inference stages. This approach makes the attacks either easy to detect or difficult to adapt to various scenarios. To address these limitations, we propose novel twin trigger generative networks in the frequency domain to generate invisible triggers for implanting stealthy backdoors into models during training, and visible triggers for steady activation during inference, making the attack process difficult to trace. Specifically, for the invisible trigger generative network, we deploy a Gaussian smoothing layer and a high-frequency artifact classifier to enhance the stealthiness of backdoor implantation in object detectors. For the visible trigger generative network, we design a novel alignment loss to optimize the visible triggers so that they differ from the original patterns but still align with the malicious activation behavior of the invisible triggers. Extensive experimental results and analyses prove the possibility of using different triggers in the training stage and the inference stage, and demonstrate the attack effectiveness of our proposed visible trigger and invisible trigger generative networks, significantly reducing the mAP_0.5 of the object detectors by 70.0% and 84.5%, including YOLOv5 and YOLOv7 with different settings, respectively.
zh
[CV-182] ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance
【速读】: 该论文试图解决现有扩散模型在生成对话头像时存在的时间、3D和表情不一致的问题。解决方案的关键在于提出了一种名为ConsistentAvatar的新框架,通过引入时间敏感细节(Temporally-Sensitive Detail, TSD)映射来捕捉相邻帧之间的高频特征和轮廓变化,并利用时间一致性扩散模块将初始结果的TSD与视频帧的地面实况对齐。最终的头像生成依赖于对齐的TSD、粗糙的头部法线以及情感提示嵌入,从而约束扩散过程以生成时间上稳定的对话头像,同时抑制误差累积并提高多方面的连续性。
链接: https://arxiv.org/abs/2411.15436
作者: Haijie Yang,Zhenyu Zhang,Hao Tang,Jianjun Qian,Jian Yang
关键词-EN: shown impressive potential, shown impressive, impressive potential, consistent diffusion module, talking head generation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have shown impressive potential on talking head generation. While plausible appearance and talking effect are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to the error accumulation and inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly employing multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency feature and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking head. Further, its reliable guidance complements the inaccuracy of other conditions, suppressing the accumulated error while improving the consistency on various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency. Project page: this https URL
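The Temporally-Sensitive Detail (TSD) map is described as capturing high-frequency content that varies along the time axis. A crude stand-in (not the paper's construction) is to take an inter-frame difference and keep only its high-frequency residual:

```python
import torch
import torch.nn.functional as F

def tsd_map(prev_frame: torch.Tensor, cur_frame: torch.Tensor, blur_kernel: int = 5):
    """prev_frame, cur_frame: [C, H, W] float frames. Returns a [1, H, W] map that
    highlights moving contours and fine detail: the inter-frame difference minus
    its blurred (low-frequency) copy. Purely illustrative of the TSD idea."""
    diff = (cur_frame - prev_frame).abs().mean(dim=0, keepdim=True)   # what changed
    pad = blur_kernel // 2
    low = F.avg_pool2d(diff.unsqueeze(0), blur_kernel, stride=1, padding=pad).squeeze(0)
    return (diff - low).clamp(min=0)                                   # keep high-frequency part
```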
zh
[CV-183] What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation
【速读】: 该论文试图解决从场景图生成图像时面临的挑战,特别是准确建模空间关系和对象交互的问题。解决方案的关键在于引入了一个名为Scene-Bench的综合基准,其中包括一个大规模数据集MegaSG,该数据集包含一百万张带有场景图注释的图像,用于训练和公平比较模型在多样化和复杂场景中的表现。此外,论文提出了一个名为SGScore的新评估指标,利用多模态大语言模型(LLMs)的链式思维推理能力来评估对象存在和关系准确性,从而提供比传统指标(如FID和CLIPScore)更有效的真实性一致性衡量方法。基于此评估框架,论文还开发了一个场景图反馈管道,通过迭代识别和纠正场景图与图像之间的差异来优化生成的图像。实验结果表明,Scene-Bench提供了比现有基准更全面和有效的评估框架,特别是在复杂场景生成方面,并且反馈策略显著提高了图像生成模型的真实性一致性。
链接: https://arxiv.org/abs/2411.15435
作者: Zuyao Chen,Jinlin Wu,Zhen Lei,Chang Wen Chen
关键词-EN: accurately modeling spatial, modeling spatial relationships, scene graphs remains, extensively studied, remains relatively underexplored
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.
zh
[CV-184] LDM-Morph: Latent diffusion model guided deformable image registration
【速读】: 该论文试图解决现有基于深度学习的可变形图像配准方法中存在的语义信息缺失和相似性度量仅限于像素空间的问题。解决方案的关键在于提出了LDM-Morph算法,该算法通过集成潜在扩散模型(Latent Diffusion Model, LDM)提取的特征来丰富语义信息,并设计了基于潜在特征和全局特征的交叉注意力模块(Latent and Global Feature-based Cross-Attention, LGCA)以增强LDM和多头自注意力操作之间的语义信息交互。此外,论文还提出了一种分层度量方法,用于在原始像素空间和潜在特征空间中评估图像对的相似性,从而在提高配准精度的同时增强拓扑结构的保持。
链接: https://arxiv.org/abs/2411.15426
作者: Jiong Wu,Kuang Gong
关键词-EN: plays an essential, essential role, image registration plays, medical image tasks, deformable registration
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deformable image registration plays an essential role in various medical image tasks. Existing deep learning-based deformable registration frameworks primarily utilize convolutional neural networks (CNNs) or Transformers to learn features to predict the deformations. However, the lack of semantic information in the learned features limits the registration performance. Furthermore, the similarity metric of the loss function is often evaluated only in the pixel space, which ignores the matching of high-level anatomical features and can lead to deformation folding. To address these issues, in this work, we proposed LDM-Morph, an unsupervised deformable registration algorithm for medical image registration. LDM-Morph integrated features extracted from the latent diffusion model (LDM) to enrich the semantic information. Additionally, a latent and global feature-based cross-attention module (LGCA) was designed to enhance the interaction of semantic information from LDM and global information from multi-head self-attention operations. Finally, a hierarchical metric was proposed to evaluate the similarity of image pairs in both the original pixel space and latent-feature space, enhancing topology preservation while improving registration accuracy. Extensive experiments on four public 2D cardiac image datasets show that the proposed LDM-Morph framework outperformed existing state-of-the-art CNNs- and Transformers-based registration methods regarding accuracy and topology preservation with comparable computational efficiency. Our code is publicly available at this https URL.
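The hierarchical similarity idea, measuring agreement both in pixel space and in a latent-feature space, can be sketched as a simple weighted sum. The encoder below is any frozen feature extractor standing in for the LDM encoder, MSE is used purely for brevity (registration losses often use NCC instead), and the weights are made-up defaults:

```python
import torch
import torch.nn.functional as F

def hierarchical_similarity(warped, fixed, encoder, w_pixel=1.0, w_latent=0.5):
    """warped, fixed: [B, C, H, W] moving-image-after-warp and fixed image.
    encoder: frozen callable returning a feature map. Combines a pixel-space term
    with a latent-feature term, mirroring the two-level metric described above."""
    pixel_loss = F.mse_loss(warped, fixed)
    with torch.no_grad():
        f_fixed = encoder(fixed)        # target features, no gradient needed
    f_warped = encoder(warped)
    latent_loss = F.mse_loss(f_warped, f_fixed)
    return w_pixel * pixel_loss + w_latent * latent_loss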
zh
[CV-185] OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
【速读】: 该论文试图解决眼科手术视觉语言预训练(VLP)中的复杂性和标注数据稀缺的问题。解决方案的关键在于提出了OphCLIP,这是一个分层检索增强的视觉语言预训练框架,专门用于眼科手术流程的理解。OphCLIP利用了OphVL数据集,该数据集包含了超过375K个分层结构的视频-文本对,具有数万个不同属性的组合,涵盖手术、阶段/操作/动作、器械、药物以及更高级的方面如眼病原因、手术目标和术后恢复建议等。通过将短视频片段与详细的叙述描述以及完整视频与结构化标题对齐,OphCLIP能够学习细粒度和长期的视觉表示,捕捉复杂的手术细节和高级程序洞察。此外,OphCLIP设计了一个检索增强的预训练框架,利用未充分探索的大规模无声手术视频,自动检索语义相关内容以增强叙述视频的表示学习。实验结果表明,OphCLIP在阶段识别和多器械识别任务中表现出强大的泛化能力和优越的性能。
链接: https://arxiv.org/abs/2411.15421
作者: Ming Hu,Kun Yuan,Yaling Shen,Feilong Tang,Xiaohao Xu,Lin Zhou,Wei Li,Ying Chen,Zhongxing Xu,Zelin Peng,Siyuan Yan,Vinkle Srivastav,Diping Song,Tianbin Li,Danli Shi,Jin Ye,Nicolas Padoy,Nassir Navab,Junjun He
关键词-EN: practice involves complex, advanced medical knowledge, complex visual interpretation, Surgical practice involves, involves complex visual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address the gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages the OphVL dataset we constructed, a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs with tens of thousands of different combinations of attributes (surgeries, phases/operations/actions, instruments, medications, as well as more advanced aspects like the causes of eye diseases, surgical objectives, and postoperative recovery recommendations, etc). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP’s robust generalization and superior performance.
zh
[CV-186] Semi-supervised Single-view 3D Reconstruction via Multi Shape Prior Fusion Strategy and Self-Attention
【速读】: 该论文试图解决单视图三维重建(single-view 3D reconstruction)中对大量标注数据依赖的问题。解决方案的关键在于引入了一种创新的半监督学习框架,该框架通过多形状先验融合策略(multi shape prior fusion strategy)来指导生成更真实的物体结构,并结合自注意力模块(self-attention module)增强解码器的形状生成质量。实验结果表明,该方法在ShapeNet数据集上显著优于现有的监督学习方法,并在不同标注比例(1%、10%、20%)下表现出色,同时在Pix3D真实数据集上也展示了优异性能。
链接: https://arxiv.org/abs/2411.15420
作者: Wei Zhou,Xinzhe Shi,Yunfeng She,Kunlong Liu,Yongqin Zhang
关键词-EN: domain of single-view, expensive and time-intensive, techniques have frequently, frequently relied, relied on expensive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In the domain of single-view 3D reconstruction, traditional techniques have frequently relied on expensive and time-intensive 3D annotation data. Facing the challenge of annotation acquisition, semi-supervised learning strategies offer an innovative approach to reduce the dependence on labeled data. Despite these developments, the utilization of this learning paradigm in 3D reconstruction tasks remains relatively constrained. In this research, we created an innovative semi-supervised framework for 3D reconstruction that distinctively introduces a multi shape prior fusion strategy, intended to guide the creation of more realistic object structures. Additionally, to improve the quality of shape generation, we integrated a self-attention module into the traditional decoder. In benchmark tests on the ShapeNet dataset, our method substantially outperformed existing supervised learning methods at diverse labeled ratios of 1%, 10%, and 20%. Moreover, it showcased excellent performance on the real-world Pix3D dataset. Through comprehensive experiments on ShapeNet, our framework demonstrated a 3.3% performance improvement over the baseline. Furthermore, stringent ablation studies confirmed the notable effectiveness of our approach. Our code has been released on this https URL
zh
[CV-187] FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation ACCV2024
【速读】: 该论文试图解决在计算机辅助诊断 (CAD) 系统中,胸部X光 (CXR) 报告生成模型的解释性问题。现有方法生成的报告与实际放射科医生的解读之间存在显著差距,主要原因是这些模型无法准确反映放射科医生在诊断过程中使用的注意力机制和详细信息。解决方案的关键在于引入细粒度胸部X光 (FG-CXR) 数据集,该数据集提供了放射科医生生成的描述与对应解剖部位的注视注意力热图之间的精细匹配。论文进一步提出了一个可解释的放射科医生注意力生成网络 (Gen-XAI),该网络通过模拟放射科医生的诊断过程,明确约束其输出与放射科医生的注视注意力和诊断记录紧密对齐,从而提高报告生成的准确性和解释性。
链接: https://arxiv.org/abs/2411.15413
作者: Trong Thang Pham,Ngoc-Vuong Ho,Nhat-Tan Bui,Thinh Phan,Patel Brijesh,Donald Adjeroh,Gianfranco Doretto,Anh Nguyen,Carol C. Wu,Hien Nguyen,Ngan Le
关键词-EN: Developing an interpretable, chest X-ray, crucial in Computer-aided, Computer-aided Diagnosis, interpretable system
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ACCV 2024
点击查看摘要
Abstract:Developing an interpretable system for generating reports in chest X-ray (CXR) analysis is becoming increasingly crucial in Computer-aided Diagnosis (CAD) systems, enabling radiologists to comprehend the decisions made by these systems. Despite the growth of diverse datasets and methods focusing on report generation, there remains a notable gap in how closely these models' generated reports align with the interpretations of real radiologists. In this study, we tackle this challenge by initially introducing the Fine-Grained CXR (FG-CXR) dataset, which provides fine-grained paired information between the captions generated by radiologists and the corresponding gaze attention heatmaps for each anatomy. Unlike existing datasets that include a raw sequence of gaze alongside a report, with significant misalignment between gaze location and report content, our FG-CXR dataset offers a more fine-grained alignment between gaze attention and diagnosis transcript. Furthermore, our analysis reveals that simply applying black-box image captioning methods to generate reports cannot adequately explain which information in the CXR is utilized and for how long it needs to be attended to in order to accurately generate reports. Consequently, we propose a novel explainable radiologist's attention generator network (Gen-XAI) that mimics the diagnosis process of radiologists, explicitly constraining its output to closely align with both the radiologist's gaze attention and transcript. Finally, we perform extensive experiments to illustrate the effectiveness of our method. Our datasets and checkpoint are available at this https URL.
zh
[CV-188] FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
【速读】: 该论文试图解决大视觉语言模型(Vision-Language Models, VLMs)在细粒度图像区域组合信息感知方面的不足,特别是难以准确对齐分割掩码与相应语义以及精确描述所指区域组合特征的问题。解决方案的关键在于提出了FINECAPTION,这是一种新型VLM,能够识别任意掩码作为参考输入,并处理高分辨率图像以在不同粒度级别上进行组合图像描述。此外,论文还引入了COMPOSITIONCAP数据集,用于多粒度区域组合图像描述任务,特别是组合属性感知的区域图像描述,从而支持FINECAPTION模型的开发和评估。通过实验结果,论文展示了所提出模型相对于其他最先进VLMs的有效性,并分析了当前VLMs在识别各种视觉提示以进行组合区域图像描述方面的能力,指出了VLM设计和训练中需要改进的领域。
链接: https://arxiv.org/abs/2411.15411
作者: Hang Hua,Qing Liu,Lingzhi Zhang,Jing Shi,Zhifei Zhang,Yilin Wang,Jianming Zhang,Jiebo Luo
关键词-EN: significantly advanced multimodal, large Vision-Language Models, visual question answering, advanced multimodal tasks, enabling more sophisticated
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
点击查看摘要
Abstract:The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs struggle with fine-grained image regional composition information perception. Specifically, they have difficulty accurately aligning the segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. However, compositionality - the ability to understand and generate novel combinations of known visual and textual components - is critical for facilitating coherent reasoning and understanding across modalities by VLMs. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different granularity levels. To support this endeavor, we introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning. Empirical results demonstrate the effectiveness of our proposed model compared to other state-of-the-art VLMs. Additionally, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.
zh
[CV-189] Efficient Online Inference of Vision Transformers by Training-Free Tokenization
【速读】: 该论文试图解决视觉变换器(Vision Transformers)部署成本高的问题,特别是如何在不显著影响性能和运行时间的情况下降低能耗。解决方案的关键是引入了一种名为视觉词令牌器(Visual Word Tokenizer, VWT)的无训练方法。VWT通过将频繁使用的图像块(visual subwords)分组为视觉词(visual words),同时保持不频繁的图像块不变,从而实现能耗的降低。该方法利用图像内或图像间的统计信息来识别相似的视觉概念以进行压缩。实验结果表明,VWT能够在最多增加20%运行时间的情况下,实现高达19%的能耗降低,相较于8-bit量化和令牌合并等现有方法,VWT在保持较高能效的同时,显著减少了运行时间的牺牲。
链接: https://arxiv.org/abs/2411.15397
作者: Leonidas Gee,Wing Yan Li,Viktoriia Sharmanska,Novi Quadrianto
关键词-EN: wider industrial adoption, deploying vision transformers, vision transformers increasingly, transformers increasingly represents, industrial adoption
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression methods require additional end-to-end fine-tuning or incur a significant drawback to runtime, thus making them ill-suited for online inference. We introduce the Visual Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups patches (visual subwords) that are frequently used into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for compression. Experimentally, we demonstrate a reduction in wattage of up to 19% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to 2x or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.
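One way to picture the visual-word grouping is a k-means codebook over patch embeddings from a reference set: patches close to a centroid are merged into their visual word, while distant (infrequent) patches stay intact. The clustering choice and distance threshold below are assumptions, not the paper's procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_visual_words(patch_embs: np.ndarray, n_words: int = 256) -> KMeans:
    """patch_embs: [N, D] patch embeddings pooled from a reference image set.
    Returns a fitted k-means codebook acting as the 'visual words'."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(patch_embs)

def tokenize(patches: np.ndarray, codebook: KMeans, dist_thresh: float = 0.5):
    """Replace patches that sit close to a visual word with the word centroid
    (merging frequent content); keep distant, infrequent patches intact."""
    words = codebook.cluster_centers_[codebook.predict(patches)]
    dists = np.linalg.norm(patches - words, axis=1)
    out = patches.copy()
    near = dists < dist_thresh
    out[near] = words[near]
    return out
```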
zh
[CV-190] Gradient-Free Classifier Guidance for Diffusion Model Sampling
【速读】: 该论文试图解决扩散模型在图像生成过程中,如何在保持高图像保真度的同时,提高类别标签对齐精度的问题。解决方案的关键在于提出了一种名为无梯度分类器引导 (Gradient-free Classifier Guidance, GFCG) 的新方法,该方法在推理阶段完全利用预训练的分类器,而不使用梯度下降。通过在每个时间步动态确定时间自适应的参考类别标签和相应的引导尺度,GFCG 不仅提高了类别预测的准确性,还与现有的引导采样方法(如 CFG)互补,甚至在结合最先进的自动引导 (Autoguidance, ATG) 方法时,无需额外计算开销即可提升图像保真度并保持多样性。实验结果表明,GFCG 在 ImageNet 512×512 数据集上达到了 23.09 的 FD_DINOv2 值,同时分类精度达到 94.3%,超过了 ATG 的 90.2%。
链接: https://arxiv.org/abs/2411.15393
作者: Rahul Shenoy,Zhihong Pan,Kaushik Balakrishnan,Qisen Cheng,Yongmoon Jeon,Heejune Yang,Jaewon Kim
关键词-EN: outstanding learning capabilities, demonstrated outstanding learning, learning capabilities, effectively capturing, training dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Image generation using diffusion models has demonstrated outstanding learning capabilities, effectively capturing the full distribution of the training dataset. These models are known to generate wide variations in sampled images, albeit with a trade-off in image fidelity. Guided sampling methods, such as classifier guidance (CG) and classifier-free guidance (CFG), focus sampling in well-learned high-probability regions to generate images of high fidelity, but each has its limitations. CG is computationally expensive due to the use of back-propagation for classifier gradient descent, while CFG, being gradient-free, is more efficient but compromises class label alignment compared to CG. In this work, we propose an efficient guidance method that fully utilizes a pre-trained classifier without using gradient descent. By using the classifier solely in inference mode, a time-adaptive reference class label and corresponding guidance scale are determined at each time step for guided sampling. Experiments on both class-conditioned and text-to-image generation diffusion models demonstrate that the proposed Gradient-free Classifier Guidance (GFCG) method consistently improves class prediction accuracy. We also show GFCG to be complementary to other guided sampling methods like CFG. When combined with the state-of-the-art Autoguidance (ATG), without additional computational overhead, it enhances image fidelity while preserving diversity. For ImageNet 512×512, we achieve a record FD_DINOv2 of 23.09, while simultaneously attaining a higher classification precision (94.3%) compared to ATG (90.2%).
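The abstract's key ingredients, a classifier used only in inference mode, a time-adaptive reference class, and a per-step guidance scale, can be sketched roughly as below. `denoiser` and `classifier` are assumed callables, and the specific rules for picking the reference class and scale are illustrative guesses, not the paper's formulas:

```python
import torch

@torch.no_grad()
def gfcg_step(x_t, t, denoiser, classifier, target_class, base_scale=3.0):
    """One illustrative guided-sampling step. The classifier is queried only in
    inference mode (no back-propagation): the strongest competing class serves as
    a reference, and the guidance scale adapts to how much probability mass the
    target class is still missing at this time step."""
    eps_cond = denoiser(x_t, t, class_label=target_class)
    probs = classifier(x_t).softmax(-1)                                  # [B, C]
    idx = torch.full((probs.size(0), 1), target_class,
                     dtype=torch.long, device=probs.device)
    ref_class = probs.scatter(-1, idx, -1.0).argmax(-1)                  # competing class
    eps_ref = denoiser(x_t, t, class_label=ref_class)
    scale = base_scale * (1.0 - probs[:, target_class])                  # time-adaptive
    return eps_cond + scale.view(-1, *[1] * (eps_cond.dim() - 1)) * (eps_cond - eps_ref)
```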
zh
[CV-191] Hatching-Box: Monitoring the Rearing Process of Drosophila Using an Embedded Imaging and in-vial Detection System
【速读】: 该论文试图解决果蝇(Drosophila)发育行为自动监测和量化的问题,特别是在标准饲养瓶和常规饲养过程中,以消除显式实验的必要性。解决方案的关键在于结合定制的成像硬件与专用的检测和跟踪算法,形成名为Hatching-Box的新型成像和分析系统。该系统能够连续多天量化幼虫、满/空蛹和成虫的数量,并通过通用的客户端/服务器软件架构,实现对任意数量饲养瓶的同时监控。通过在近47万标注对象的数据集上评估系统,并在实际实验中验证,论文展示了Hatching-Box在长期实验中的应用潜力,以及在一般培养过程中自动化监测的优势。
链接: https://arxiv.org/abs/2411.15390
作者: Julian Bigge,Maite Ogueta,Luis Garcia,Benjamin Risse
关键词-EN: regular rearing routines, rendering explicit experiments, explicit experiments obsolete, Drosophila in standard, standard rearing vials
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures
点击查看摘要
Abstract:In this paper we propose the Hatching-Box, a novel imaging and analysis system to automatically monitor and quantify the developmental behavior of Drosophila in standard rearing vials and during regular rearing routines, rendering explicit experiments obsolete. This is achieved by combining custom tailored imaging hardware with dedicated detection and tracking algorithms, enabling the quantification of larvae, filled/empty pupae and flies over multiple days. Given the affordable and reproducible design of the Hatching-Box in combination with our generic client/server-based software, the system can easily be scaled to monitor an arbitrary number of rearing vials simultaneously. We evaluated our system on a curated image dataset comprising nearly 470,000 annotated objects and performed several studies on real world experiments. We successfully reproduced results from well-established circadian experiments by comparing the eclosion periods of wild type flies to the clock mutants per^short, per^long and per^0 without involvement of any manual labor. Furthermore, we show that the Hatching-Box is able to extract additional information about group behavior as well as to reconstruct the whole life-cycle of the individual specimens. These results not only demonstrate the applicability of our system for long-term experiments but also indicate its benefits for automated monitoring in the general cultivation process.
zh
[CV-192] A Constrast-Agnostic Method for Ultra-High Resolution Claustrum Segmentation
【速读】: 该论文试图解决在典型分辨率下(如1 mm各向同性)的磁共振成像(MRI)中,由于其薄片状结构而难以自动分割的claustrum(屏状核)的问题。解决方案的关键在于提出了一种对比度和分辨率无关的方法,该方法基于SynthSeg分割框架,利用合成训练强度图像实现优异的泛化能力。具体而言,该方法仅需要标签图进行训练,因为对应的强度图像是在训练过程中动态合成的,具有随机对比度和分辨率。通过使用18个超高分辨率MRI扫描(主要为离体扫描)获得的claustrum手动标签,训练了一个深度学习网络进行自动分割,并在高分辨率案例中展示了其有效性(Dice系数=0.632,平均表面距离=0.458 mm,体积相似性=0.867,使用6折交叉验证)。此外,该方法还展示了在典型分辨率下的体内T1加权MRI扫描中的应用,以及在多模态成像(如T2加权、质子密度和定量T1扫描)中的鲁棒性。这是首次提出的一种准确且对对比度和分辨率变化鲁棒的超高分辨率claustrum自动分割方法。
链接: https://arxiv.org/abs/2411.15388
作者: Chiara Mauri,Ryan Fritz,Jocelyn Mora,Benjamin Billot,Juan Eugenio Iglesias,Koen Van Leemput,Jean Augustinack,Douglas N Greve
关键词-EN: band-like gray matter, gray matter structure, matter structure located, Magnetic Resonance Imaging, vivo Magnetic Resonance
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 14 pages, 10 figures, 3 tables
点击查看摘要
Abstract:The claustrum is a band-like gray matter structure located between putamen and insula whose exact functions are still actively researched. Its sheet-like structure makes it barely visible in in vivo Magnetic Resonance Imaging (MRI) scans at typical resolutions and neuroimaging tools for its study, including methods for automatic segmentation, are currently very limited. In this paper, we propose a contrast- and resolution-agnostic method for claustrum segmentation at ultra-high resolution (0.35 mm isotropic); the method is based on the SynthSeg segmentation framework (Billot et al., 2023), which leverages the use of synthetic training intensity images to achieve excellent generalization. In particular, SynthSeg requires only label maps to be trained, since corresponding intensity images are synthesized on the fly with random contrast and resolution. We trained a deep learning network for automatic claustrum segmentation, using claustrum manual labels obtained from 18 ultra-high resolution MRI scans (mostly ex vivo). We demonstrated the method to work on these 18 high resolution cases (Dice score = 0.632, mean surface distance = 0.458 mm, and volumetric similarity = 0.867 using 6-fold Cross Validation (CV)), and also on in vivo T1-weighted MRI scans at typical resolutions (~1 mm isotropic). We also demonstrated that the method is robust in a test-retest setting and when applied to multimodal imaging (T2-weighted, Proton Density and quantitative T1 scans). To the best of our knowledge this is the first accurate method for automatic ultra-high resolution claustrum segmentation, which is robust against changes in contrast and resolution. The method is released at this https URL and as part of the neuroimaging package Freesurfer (Fischl, 2012).
zh
[CV-193] Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage
【速读】: 该论文试图检验基于水印的保护方法在生成式文本到图像扩散模型(如 Stable Diffusion)训练中使用未经授权数据(可能导致知识产权侵权或隐私侵犯)场景下的鲁棒性。解决方案的关键在于提出了一种名为 RATTAN 的方法,该方法利用扩散过程进行受控的图像生成,保留输入图像的高级特征同时忽略水印所利用的低级细节。通过生成少量图像并对受保护模型进行微调,RATTAN 能够有效规避现有最先进的水印保护措施,从而揭示这些保护方法并不鲁棒。
链接: https://arxiv.org/abs/2411.15367
作者: Soumil Datta,Shih-Chieh Dai,Leo Yu,Guanhong Tao
关键词-EN: shown exceptional potential, generating high-quality images, Stable Diffusion, shown exceptional, exceptional potential
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Text-to-image diffusion models, such as Stable Diffusion, have shown exceptional potential in generating high-quality images. However, recent studies highlight concerns over the use of unauthorized data in training these models, which may lead to intellectual property infringement or privacy violations. A promising approach to mitigate these issues is to apply a watermark to images and subsequently check if generative models reproduce similar watermark features. In this paper, we examine the robustness of various watermark-based protection methods applied to text-to-image models. We observe that common image transformations are ineffective at removing the watermark effect. Therefore, we propose RATTAN, which leverages the diffusion process to conduct controlled image generation on the protected input, preserving the high-level features of the input while ignoring the low-level details utilized by watermarks. A small number of generated images are then used to fine-tune protected models. Our experiments on three datasets and 140 text-to-image diffusion models reveal that existing state-of-the-art protections are not robust against RATTAN.
zh
[CV-194] Personalization of Wearable Sensor-Based Joint Kinematic Estimation Using Computer Vision for Hip Exoskeleton Applications
【速读】: 该论文试图解决在患者监测、康复和外骨骼控制等应用中,准确估计下肢关节运动学的问题。现有基于可穿戴传感器和深度学习(DL)的方法通常需要大量新数据来适应未见过的步态模式,而计算机视觉领域的人体姿态估计模型虽然易于部署且能实时推理,但在无法使用摄像头的场景中不可行。论文提出的解决方案是一个基于计算机视觉的深度学习适应框架,用于实时关节运动学估计。该框架的关键在于仅使用少量数据(1-2个步态周期)和无需专业运动捕捉设备,通过迁移学习将时间卷积网络(TCN)适应于僵硬膝步态数据,从而在减少均方根误差方面表现出色,分别为9.7%和19.9%。这一框架展示了智能手机摄像头训练的深度学习模型在临床人群中实时估计关节运动学的潜力,特别是在可穿戴机器人应用中。
链接: https://arxiv.org/abs/2411.15366
作者: Changseob Song,Bogdan Ivanyuk-Skulskyi,Adrian Krieger,Kaitao Luo,Inseung Kang
关键词-EN: Accurate lower-limb joint, Accurate lower-limb, lower-limb joint kinematic, joint kinematic estimation, patient monitoring
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate lower-limb joint kinematic estimation is critical for applications such as patient monitoring, rehabilitation, and exoskeleton control. While previous studies have employed wearable sensor-based deep learning (DL) models for estimating joint kinematics, these methods often require extensive new datasets to adapt to unseen gait patterns. Meanwhile, researchers in computer vision have advanced human pose estimation models, which are easy to deploy and capable of real-time inference. However, such models are infeasible in scenarios where cameras cannot be used. To address these limitations, we propose a computer vision-based DL adaptation framework for real-time joint kinematic estimation. This framework requires only a small dataset (i.e., 1-2 gait cycles) and does not depend on professional motion capture setups. Using transfer learning, we adapted our temporal convolutional network (TCN) to stiff knee gait data, allowing the model to further reduce root mean square error by 9.7% and 19.9% compared to a TCN trained on only able-bodied and stiff knee datasets, respectively. Our framework demonstrates a potential for smartphone camera-trained DL models to estimate real-time joint kinematics across novel users in clinical populations with applications in wearable robots.
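The adaptation recipe, reusing a pretrained temporal convolutional network and fine-tuning it on only one or two gait cycles from the new user, can be sketched as freezing the convolutional trunk and updating a small head. The architecture and hyperparameters below are placeholders, not the study's model:

```python
import torch
import torch.nn as nn

class TinyTCN(nn.Module):
    """Minimal temporal convolutional network mapping a window of sensor channels
    to a joint-angle sequence; a stand-in for the paper's TCN."""
    def __init__(self, in_ch: int = 6, hidden: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_ch, hidden, kernel_size=5, padding="same"), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding="same", dilation=2), nn.ReLU(),
        )
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, x):                 # x: [B, in_ch, T] -> [B, 1, T]
        return self.head(self.features(x))

def adapt_to_new_user(model, new_x, new_y, epochs=50, lr=1e-3):
    """Transfer learning on 1-2 gait cycles: freeze the feature extractor and
    fine-tune only the regression head. new_x: [B, in_ch, T]; new_y: [B, 1, T]."""
    for p in model.features.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(new_x), new_y)
        loss.backward()
        opt.step()
    return model
```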
zh
[CV-195] UniGaussian: Driving Scene Reconstruction from Multiple Camera Models via Unified Gaussian Representations
【速读】: 该论文试图解决在自动驾驶场景重建中,现有方法主要关注针孔相机而忽视鱼眼相机的问题。解决方案的关键在于提出了UniGaussian方法,该方法通过学习统一的3D高斯表示(3D Gaussian representation)来处理多种相机模型。具体来说,论文提出了一个新的可微分渲染方法,通过一系列针对鱼眼相机模型的仿射变换(affine transformations)来扭曲3D高斯,从而解决了3D高斯拼接与鱼眼相机兼容性的问题,同时保持了实时渲染的可微分性。此外,基于这种可微分渲染方法,设计了一个新的框架,通过适应不同相机模型的仿射变换和多模态监督来学习统一的3D高斯表示,从而实现了对多种传感器(针孔和鱼眼相机)和模态(深度、语义、法线和LiDAR点云)的综合理解。
链接: https://arxiv.org/abs/2411.15355
作者: Yuan Ren,Guile Wu,Runhao Li,Zheyuan Yang,Yibo Liu,Xingxin Chen,Tongtong Cao,Bingbing Liu
关键词-EN: Urban scene reconstruction, crucial for real-world, fisheye cameras, autonomous driving simulators, Gaussian representation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical report
点击查看摘要
Abstract:Urban scene reconstruction is crucial for real-world autonomous driving simulators. Although existing methods have achieved photorealistic reconstruction, they mostly focus on pinhole cameras and neglect fisheye cameras. In fact, how to effectively simulate fisheye cameras in driving scenes remains an unsolved problem. In this work, we propose UniGaussian, a novel approach that learns a unified 3D Gaussian representation from multiple camera models for urban scene reconstruction in autonomous driving. Our contributions are two-fold. First, we propose a new differentiable rendering method that distorts 3D Gaussians using a series of affine transformations tailored to fisheye camera models. This addresses the compatibility issue of 3D Gaussian splatting with fisheye cameras, which is hindered by light ray distortion caused by lenses or mirrors. Besides, our method maintains real-time rendering while ensuring differentiability. Second, built on the differentiable rendering method, we design a new framework that learns a unified Gaussian representation from multiple camera models. By applying affine transformations to adapt different camera models and regularizing the shared Gaussians with supervision from different modalities, our framework learns a unified 3D Gaussian representation with input data from multiple sources and achieves holistic driving scene understanding. As a result, our approach models multiple sensors (pinhole and fisheye cameras) and modalities (depth, semantic, normal and LiDAR point clouds). Our experiments show that our method achieves superior rendering quality and fast rendering speed for driving scene simulation.
zh
[CV-196] Zero-Shot Coreset Selection: Efficient Pruning for Unlabeled Data
【速读】: 该论文试图解决深度学习中大规模数据存储、标注和模型训练的高成本问题,特别是如何在不依赖标注数据的情况下选择具有代表性的数据子集(即无标注数据的核心集选择)。解决方案的关键是提出了Zero-Shot Coreset Selection (ZCore)方法,该方法利用现有的基础模型生成无标注数据的零样本嵌入空间,并通过量化嵌入分布中的覆盖度和冗余度来评估每个数据样本的相对重要性,从而高效地选择核心集。ZCore无需依赖标注数据或对候选数据进行训练,显著降低了标注成本,并在多个数据集上表现优于现有的基于标签的方法。
链接: https://arxiv.org/abs/2411.15349
作者: Brent A. Griffin,Jacob Marks,Jason J. Corso
关键词-EN: learning increasingly relies, Deep learning increasingly, coreset selection, increasingly relies, relies on massive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep learning increasingly relies on massive data with substantial costs for storage, annotation, and model training. To reduce these costs, coreset selection aims to find a representative subset of data to train models while ideally performing on par with the full data training. State-of-the-art coreset methods use carefully-designed criteria to quantify the importance of each data example via ground truth labels and dataset-specific training, then select examples whose scores lie in a certain range to construct a coreset. These methods work well in their respective settings, however, they cannot select data that are unlabeled, which is the majority of real-world data. To that end, this paper motivates and formalizes the problem of unlabeled coreset selection to enable greater scale and reduce annotation costs for deep learning. As a solution, we develop Zero-Shot Coreset Selection (ZCore), a method that efficiently selects coresets without ground truth labels or training on candidate data. Instead, ZCore uses existing foundation models to generate a zero-shot embedding space for unlabeled data, then quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, leading to a strong baseline for future research in unlabeled coreset selection. On ImageNet, ZCore selections achieve a downstream model accuracy of 53.99% with only 10% training data, which outperforms label-based methods while removing annotation requirements for 1.15 million images. Our code is publicly available at this https URL.
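The coverage-plus-redundancy scoring over a zero-shot embedding space can be approximated with a small greedy routine: a point scores high when it sits in a dense, representative region of the embedding distribution and far from everything already selected. The exact scoring rule below is an assumption for illustration, not ZCore's metric:

```python
import numpy as np

def zero_shot_coreset(embs: np.ndarray, budget: int, k: int = 10):
    """embs: [n, D] embeddings of unlabeled data from any frozen foundation model.
    Greedily picks `budget` indices that are representative (high local density,
    i.e. good coverage) yet non-redundant with what is already selected."""
    n = len(embs)
    d = np.linalg.norm(embs[:, None] - embs[None, :], axis=-1)        # [n, n] pairwise
    coverage = 1.0 / (np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1) + 1e-8)
    selected, min_dist = [], np.full(n, d.max() + 1.0)
    for _ in range(budget):
        i = int(np.argmax(coverage * min_dist))                        # best trade-off
        selected.append(i)
        min_dist = np.minimum(min_dist, d[:, i])                       # update redundancy
    return selected

# e.g. keep 10% of 1,000 unlabeled embeddings
# idx = zero_shot_coreset(np.random.randn(1000, 384), budget=100)
```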
zh
[CV-197] here is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks
【速读】: 该论文试图解决的问题是Segment Anything Model (SAM)在缺乏标签的情况下生成掩码的能力是否具备固有的语义理解,从而适用于更广泛的视觉任务。解决方案的关键在于通过多阶段的方法探索如何增强SAM的语义理解能力。首先,论文通过比较SAM的基础图像编码器在分类任务中的效能与已建立的模型(如CLIP和DINOv2),发现SAM在特征表示中缺乏语义区分能力,这限制了其在需要类别区分任务中的应用。基于这一发现,论文进一步探讨了通过轻量级微调进行上下文学习以引入语义信息的方法,但发现其对未见类别的泛化能力有限。最终,论文提出了一种无需训练的方法,通过利用DINOv2的特征来增强SAM的语义理解,实现基于特征相似性的实例级类别区分。研究表明,结合外部语义源是提升SAM在复杂视觉任务中实用性的一个有前景的方向。
链接: https://arxiv.org/abs/2411.15288
作者: Miguel Espinosa,Chenhongyi Yang,Linus Ericsson,Steven McDonagh,Elliot J. Crowley
关键词-EN: label-agnostic mask generation, mask generation, originally designed, designed for label-agnostic, label-agnostic mask
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Work in progress
点击查看摘要
Abstract:The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding, of value to broader visual tasks? In this work we follow a multi-staged approach towards exploring this question. We firstly quantify SAM’s semantic capabilities by comparing base image encoder efficacy under classification tasks, in comparison with established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM feature representations, limiting potential for tasks that require class differentiation. This initial result motivates our exploratory study that attempts to enable semantic information via in-context learning with lightweight fine-tuning where we observe that generalisability to unseen classes remains limited. Our observations culminate in the proposal of a training-free approach that leverages DINOv2 features, towards better endowing SAM with semantic understanding and achieving instance-level class differentiation through feature-based similarity. Our study suggests that incorporation of external semantic sources provides a promising direction for the enhancement of SAM’s utility with respect to complex visual tasks that require semantic understanding.
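The training-free class differentiation described at the end can be pictured as mask-pooling a semantic feature map (e.g. from DINOv2) over each SAM mask and matching against a few prototype embeddings. Everything below, including shapes, prototype construction, and cosine matching, is an illustrative assumption rather than the paper's pipeline:

```python
import torch
import torch.nn.functional as F

def classify_masks(feat_map, masks, prototypes):
    """feat_map: [D, H, W] dense features from a semantic backbone resized to the
    mask resolution; masks: [M, H, W] binary SAM masks (assumed non-empty);
    prototypes: dict name -> [D] class embedding built from a few reference crops.
    Assigns each mask the prototype with the highest cosine similarity."""
    names = list(prototypes.keys())
    proto = F.normalize(torch.stack([prototypes[n] for n in names]), dim=-1)  # [C, D]
    labels = []
    for m in masks.bool():
        pooled = feat_map[:, m].mean(dim=1)                 # [D] average inside the mask
        sim = proto @ F.normalize(pooled, dim=0)            # [C] cosine similarities
        labels.append(names[int(sim.argmax())])
    return labels
```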
zh
[CV-198] When Spatial meets Temporal in Action Recognition
【速读】: 该论文试图解决视频动作识别中空间信息与时间信息有效整合的问题。现有方法通常侧重于空间特征(如物体外观)或时间动态(如运动),而很少全面整合两者。论文提出的解决方案的关键是引入了一种名为"时间整合与运动增强 (Temporal Integration and Motion Enhancement, TIME) 层"的新型预处理技术。该层通过重新排列原始视频序列,将 N^2 个时间演变的帧嵌入到一个 N×N 的空间网格中,生成新的视频帧,从而在保留时间顺序的同时平衡了空间和时间信息。这种变换使得新帧既包含丰富的空间细节,又突出了时间动态,从而提高了现有视频模型的兼容性和识别精度。
链接: https://arxiv.org/abs/2411.15284
作者: Huilin Chen,Lei Wang,Yifan Chen,Tom Gedeon,Piotr Koniusz
关键词-EN: made significant strides, temporal, significant strides, TIME layer, made significant
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Research report
点击查看摘要
Abstract:Video action recognition has made significant strides, but challenges remain in effectively using both spatial and temporal information. While existing methods often focus on either spatial features (e.g., object appearance) or temporal dynamics (e.g., motion), they rarely address the need for a comprehensive integration of both. Capturing the rich temporal evolution of video frames, while preserving their spatial details, is crucial for improving accuracy. In this paper, we introduce the Temporal Integration and Motion Enhancement (TIME) layer, a novel preprocessing technique designed to incorporate temporal information. The TIME layer generates new video frames by rearranging the original sequence, preserving temporal order while embedding N^2 temporally evolving frames into a single spatial grid of size N×N. This transformation creates new frames that balance both spatial and temporal information, making them compatible with existing video models. When N=1, the layer captures rich spatial details, similar to existing methods. As N increases (N ≥ 2), temporal information becomes more prominent, while the spatial information decreases to ensure compatibility with model inputs. We demonstrate the effectiveness of the TIME layer by integrating it into popular action recognition models, such as ResNet-50, Vision Transformer, and Video Masked Autoencoders, for both RGB and depth video data. Our experiments show that the TIME layer enhances recognition accuracy, offering valuable insights for video processing tasks.
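The frame rearrangement itself is simple enough to sketch directly: sample N^2 frames in temporal order and tile them into an N×N spatial grid, so a standard image backbone receives a single frame that also carries motion. A minimal version (resizing the grid back to the backbone's input size is left to the caller):

```python
import torch

def time_layer(frames: torch.Tensor, n: int) -> torch.Tensor:
    """frames: [T, C, H, W] with T >= n*n. Samples n*n frames uniformly in time and
    tiles them row by row, in temporal order, into one [C, n*H, n*W] grid frame."""
    t_idx = torch.linspace(0, frames.size(0) - 1, n * n).long()
    picked = frames[t_idx]                                          # [n*n, C, H, W]
    rows = [torch.cat(list(picked[r * n:(r + 1) * n]), dim=-1) for r in range(n)]
    return torch.cat(rows, dim=-2)                                  # [C, n*H, n*W]

# toy usage: a 16-frame clip tiled into a 2x2 grid
grid = time_layer(torch.randn(16, 3, 112, 112), n=2)               # [3, 224, 224]
```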
zh
[CV-199] Foundation Cures Personalization: Recovering Facial Personalized Models Prompt Consistency
【速读】: 该论文试图解决在文本到图像生成领域中面部个性化任务中,身份嵌入(identity embedding)机制导致的提示一致性(prompt consistency)问题。具体来说,身份嵌入机制在处理包含多个面部属性的提示时,会削弱其他属性的效果,从而影响生成图像与提示的一致性。论文提出的解决方案是FreeCure,一个无需训练的框架,通过利用基础模型(foundation models)的内在知识来提高个性化模型的一致性。关键在于通过提取基础模型去噪过程中的交叉注意力(cross-attention)和语义图(semantic maps),识别并增强容易定位的属性(如头发、配饰等),并通过噪声混合策略和基于反转的过程来增强个性化模型输出中的多个属性。该方法无需额外训练,能够非侵入性地增强多种面部属性,并可无缝集成到现有的流行个性化模型中。
链接: https://arxiv.org/abs/2411.15277
作者: Yiyang Cai,Zhengkai Jiang,Yulong Liu,Chunyang Jiang,Wei Xue,Wenhan Luo,Yike Guo
关键词-EN: crucial downstream task, Facial personalization represents, Facial personalization, represents a crucial, crucial downstream
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Facial personalization represents a crucial downstream task in the domain of text-to-image generation. To preserve identity fidelity while ensuring alignment with user-defined prompts, current mainstream frameworks for facial personalization predominantly employ identity embedding mechanisms to associate identity information with textual embeddings. However, our experiments show that identity embeddings compromise the effectiveness of other tokens within the prompt, thereby hindering high prompt consistency, particularly when prompts involve multiple facial attributes. Moreover, previous works overlook the fact that their corresponding foundation models hold great potential to generate faces that align well with prompts and can be easily leveraged to cure these ill-aligned attributes in personalized models. Building upon these insights, we propose FreeCure, a training-free framework that harnesses the intrinsic knowledge from the foundation models themselves to improve the prompt consistency of personalization models. First, by extracting cross-attention and semantic maps from the denoising process of foundation models, we identify easily localized attributes (e.g., hair, accessories, etc). Second, we enhance multiple attributes in the outputs of personalization models through a novel noise-blending strategy coupled with an inversion-based process. Our approach offers several advantages: it eliminates the need for training; it effectively facilitates the enhancement of a wide array of facial attributes in a non-intrusive manner; and it can be seamlessly integrated into existing popular personalization models. FreeCure has demonstrated significant improvements in prompt consistency across a diverse set of state-of-the-art facial personalization models while maintaining the integrity of original identity fidelity.
zh
[CV-200] Event USKT : U-State Space Model in Knowledge Transfer for Event Cameras
【速读】: 该论文试图解决事件相机(Event cameras)数据量有限的问题,通过引入一个定制的U形状态空间模型知识转移(U-shaped State Space Model Knowledge Transfer, USKT)框架,实现事件数据到RGB数据的转换。解决方案的关键在于USKT框架,它能够生成与RGB帧兼容的输入,使得事件数据能够有效复用预训练的RGB模型,从而在参数调优最小化的前提下实现竞争性性能。此外,USKT架构中提出的双向反向状态空间模型(Bidirectional Reverse State Space Model, BiR-SSM)通过共享权重策略,在提高模型效率的同时节省计算资源。
链接: https://arxiv.org/abs/2411.15276
作者: Yuhui Lin,Jiahao Zhang,Siyuan Li,Jimin Xiao,Ding Xu,Wenjun Wu,Jiaxuan Lu
关键词-EN: emerging imaging technology, offer distinct advantages, including reduced energy, traditional RGB cameras, reduced energy consumption
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Event cameras, as an emerging imaging technology, offer distinct advantages over traditional RGB cameras, including reduced energy consumption and higher frame rates. However, the limited quantity of available event data presents a significant challenge, hindering their broader development. To alleviate this issue, we introduce a tailored U-shaped State Space Model Knowledge Transfer (USKT) framework for Event-to-RGB knowledge transfer. This framework generates inputs compatible with RGB frames, enabling event data to effectively reuse pre-trained RGB models and achieve competitive performance with minimal parameter tuning. Within the USKT architecture, we also propose a bidirectional reverse state space model. Unlike conventional bidirectional scanning mechanisms, the proposed Bidirectional Reverse State Space Model (BiR-SSM) leverages a shared weight strategy, which facilitates efficient modeling while conserving computational resources. In terms of effectiveness, integrating USKT with ResNet50 as the backbone improves model performance by 0.95%, 3.57%, and 2.9% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively, underscoring USKT’s adaptability and effectiveness. The code will be made available upon acceptance.
zh
[CV-201] EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration
【速读】: 该论文试图解决在户外LiDAR点云配准(PCR)任务中,由于点云的稀疏性、不规则性和大规模尺度导致的难以建立密集全局点对点对应关系的问题。解决方案的关键在于提出了一种名为EADReg的新框架,该框架基于自回归扩散模型,采用从粗到细的配准范式。在粗配准阶段,使用双向高斯混合模型(Bi-directional Gaussian Mixture Model, BGMM)来剔除异常点并获得纯净的点云对,通过建立源和目标帧的高斯混合模型(GMMs)之间的对应关系,实现基于过滤特征和几何信息的可靠粗配准。在精细配准阶段,将扩散模型应用于PCR视为自回归过程,生成鲁棒的点对应关系,并在上层进行迭代细化。尽管扩散模型通常被批评为推理速度慢,但EADReg实现了与基于卷积的方法相当的运行时间。
链接: https://arxiv.org/abs/2411.15271
作者: Linrui Gong,Jiuming Liu,Junyi Ma,Lihao Liu,Yaonan Wang,Hesheng Wang
关键词-EN: Gaussian Mixture Model, challenging cases, Diffusion models, Bi-directional Gaussian Mixture, shown the great
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Diffusion models have shown the great potential in the point cloud registration (PCR) task, especially for enhancing the robustness to challenging cases. However, existing diffusion-based PCR methods primarily focus on instance-level scenarios and struggle with outdoor LiDAR points, where the sparsity, irregularity, and huge point scale inherent in LiDAR points pose challenges to establishing dense global point-to-point correspondences. To address this issue, we propose a novel framework named EADReg for efficient and robust registration of LiDAR point clouds based on autoregressive diffusion models. EADReg follows a coarse-to-fine registration paradigm. In the coarse stage, we employ a Bi-directional Gaussian Mixture Model (BGMM) to reject outlier points and obtain purified point cloud pairs. BGMM establishes correspondences between the Gaussian Mixture Models (GMMs) from the source and target frames, enabling reliable coarse registration based on filtered features and geometric information. In the fine stage, we treat diffusion-based PCR as an autoregressive process to generate robust point correspondences, which are then iteratively refined on upper layers. Despite common criticisms of diffusion-based methods regarding inference speed, EADReg achieves runtime comparable to convolutional-based methods. Extensive experiments on the KITTI and NuScenes benchmark datasets highlight the state-of-the-art performance of our proposed method. Codes will be released upon publication.
zh
[CV-202] Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI
【速读】: 该论文试图解决基于梯度的解释性技术在图像模型中的几个主要缺陷:(1)需要白盒访问模型;(2)易受对抗攻击;(3)生成的归因偏离图像流形,导致解释不忠实于模型且不符合人类感知。解决方案的关键是引入了一种名为“无导数扩散流形约束梯度 (FreeMCG)”的新方法,通过利用集合卡尔曼滤波器和扩散模型,实现对模型梯度的无导数近似,并将其投影到数据流形上,仅依赖于模型的输出。这种方法在反事实生成和特征归因任务中展示了其有效性,并取得了最先进的结果,同时保留了可解释性AI工具的基本特性。
链接: https://arxiv.org/abs/2411.15265
作者: Won Jun Kim,Hyungjin Chung,Jaemin Kim,Sangmin Lee,Byeongsu Sim,Jong Chul Ye
关键词-EN: Gradient-based methods, prototypical family, explainability techniques, image-based models, Gradient-based
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 5 figures
点击查看摘要
Abstract:Gradient-based methods are a prototypical family of explainability techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG), a novel method that serves as a better basis for explainability of a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model's gradient projected onto the data manifold, requiring access only to the model's outputs. We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks, we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.
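The derivative-free ingredient, estimating a model gradient from output values alone using an ensemble of perturbed inputs in the spirit of ensemble Kalman updates, can be sketched as below. The projection onto the data manifold through a diffusion model is omitted, and `f` is any black-box scalar score (e.g. a class logit); this is an illustrative estimator, not the paper's exact update:

```python
import torch

@torch.no_grad()
def ensemble_gradient(x, f, n_particles: int = 32, sigma: float = 0.05):
    """Derivative-free estimate of d f / d x for a black-box scalar score f(x):
    correlate random input perturbations with the resulting output deviations
    (a covariance-style estimate that needs only forward evaluations)."""
    eps = sigma * torch.randn(n_particles, *x.shape)       # ensemble of perturbations
    ys = torch.stack([f(x + e) for e in eps])              # [n_particles] scores
    y_dev = (ys - ys.mean()).view(-1, *[1] * x.dim())
    return (y_dev * eps).mean(dim=0) / (sigma ** 2)        # same shape as x
```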
zh
[CV-203] AI-Driven Real-Time Monitoring of Ground-Nesting Birds: A Case Study on Curlew Detection Using YOLOv10
【速读】: 该论文试图解决野生鸟类,特别是地巢鸟类如长嘴杓鹬(curlew, Numenius arquata)的实时监测问题,以应对其种群数量显著下降的现状。解决方案的关键在于开发并应用一种基于AI的实时物种检测系统,该系统利用定制训练的YOLOv10模型,通过3/4G连接的摄像头与Conservation AI平台结合,实现对长嘴杓鹬及其雏鸟的高效检测与分类。该系统能够在11个威尔士的巢址中实现高精度的实时数据处理,显著提高了监测效率,并为生物多样性评估和早期保护干预提供了及时、准确的数据支持。
链接: https://arxiv.org/abs/2411.15263
作者: Carl Chalmers,Paul Fergus,Serge Wich,Steven N Longmore,Naomi Davies Walsh,Lee Oliver,James Warrington,Julieanne Quinlan,Katie Appleby
关键词-EN: signal significant environmental, Effective monitoring, ecosystem health, wildlife is critical, critical for assessing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Effective monitoring of wildlife is critical for assessing biodiversity and ecosystem health, as declines in key species often signal significant environmental changes. Birds, particularly ground-nesting species, serve as important ecological indicators due to their sensitivity to environmental pressures. Camera traps have become indispensable tools for monitoring nesting bird populations, enabling data collection across diverse habitats. However, the manual processing and analysis of such data are resource-intensive, often delaying the delivery of actionable conservation insights. This study presents an AI-driven approach for real-time species detection, focusing on the curlew (Numenius arquata), a ground-nesting bird experiencing significant population declines. A custom-trained YOLOv10 model was developed to detect and classify curlews and their chicks using 3/4G-enabled cameras linked to the Conservation AI platform. The system processes camera trap data in real-time, significantly enhancing monitoring efficiency. Across 11 nesting sites in Wales, the model achieved high performance, with a sensitivity of 90.56%, specificity of 100%, and F1-score of 95.05% for curlew detections, and a sensitivity of 92.35%, specificity of 100%, and F1-score of 96.03% for curlew chick detections. These results demonstrate the capability of AI-driven monitoring systems to deliver accurate, timely data for biodiversity assessments, facilitating early conservation interventions and advancing the use of technology in ecological research.
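The reported sensitivity, specificity and F1 values follow directly from confusion counts; the helper below shows the arithmetic (the example counts in the comment are made up, not the study's data):

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recompute the metrics quoted in the abstract from raw confusion counts,
    e.g. tallied per camera-trap event after matching detections to ground truth."""
    sensitivity = tp / (tp + fn)          # recall on true curlew events
    specificity = tn / (tn + fp)          # correct rejection of non-curlew events
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}

# e.g. detection_metrics(tp=96, fp=0, tn=40, fn=10)
```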
zh
[CV-204] MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
【速读】: 该论文试图解决现有视频生成模型在生成包含多个场景、连贯叙事和一致角色的长视频时面临的挑战。解决方案的关键在于提出了MovieBench,这是一个层次化的电影级数据集,专门用于长视频生成的分析、评估和训练。MovieBench的主要贡献包括:(1) 提供具有丰富连贯故事线和多场景叙事的电影长度视频;(2) 确保角色外观和音频在场景间的一致性;(3) 包含高层次电影信息和详细镜头级别描述的分层数据结构。通过这些创新,MovieBench旨在为长视频生成领域带来新的见解和挑战,例如在多个场景中保持角色身份的一致性。
链接: https://arxiv.org/abs/2411.15262
作者: Weijia Wu,Mingyu Liu,Zeyu Zhu,Xi Xia,Haoen Feng,Wen Wang,Kevin Qinghong Lin,Chunhua Shen,Mike Zheng Shou
关键词-EN: Stable Video Diffusion, show promising results, long video generation, Recent advancements, video generation models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project website is at: this https URL . Code: this https URL
点击查看摘要
Abstract:Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives, (2) consistency of character appearance and audio across scenes, and (3) hierarchical data structure contains high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench brings some new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be public and continuously maintained, aiming to advance the field of long video generation. Data can be found at: this https URL.
zh
[CV-205] VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing
【速读】: 该论文试图解决高质量视频编辑面临的挑战,特别是缺乏基于真实数据的大规模开源视频编辑数据集、视频数据表示所需的大量标记导致的高训练成本,以及现有视频编辑模型交互性有限的问题。解决方案的关键在于引入了一个名为VIVID-10M的大规模混合图像-视频局部编辑数据集,以及一个基于此数据集训练的通用交互式视频局部编辑模型VIVID。VIVID-10M数据集包含9.7M样本,旨在降低数据构建和模型训练成本,涵盖广泛的编辑任务。VIVID模型支持实体的添加、修改和删除,并通过关键帧引导的交互式视频编辑机制,使用户能够迭代编辑关键帧并将其传播到其他帧,从而减少达到预期效果的延迟。实验评估表明,该方法在视频局部编辑方面达到了最先进的性能。
链接: https://arxiv.org/abs/2411.15260
作者: Jiahao Hu,Tianxiong Zhong,Xuebo Wang,Boyuan Jiang,Xingye Tian,Fei Yang,Pengfei Wan,Di Zhang
关键词-EN: Diffusion-based image editing, made remarkable progress, Diffusion-based image, video editing, editing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 14 figures
点击查看摘要
Abstract:Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing datasets based on real-world data, as constructing such datasets is both time-consuming and costly. Moreover, video data requires a significantly larger number of tokens for representation, which substantially increases the training costs for video editing models. Lastly, current video editing models offer limited interactivity, often making it difficult for users to express their editing requirements effectively in a single attempt. To address these challenges, this paper introduces a dataset VIVID-10M and a baseline model VIVID. VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs, which comprises 9.7M samples that encompass a wide range of video editing tasks. VIVID is a Versatile and Interactive VIdeo local eDiting model trained on VIVID-10M, which supports entity addition, modification, and deletion. At its core, a keyframe-guided interactive video editing mechanism is proposed, enabling users to iteratively edit keyframes and propagate the edits to other frames, thereby reducing latency in achieving desired outcomes. Extensive experimental evaluations show that our approach achieves state-of-the-art performance in video local editing, surpassing baseline methods in both automated metrics and user studies. The VIVID-10M dataset and the VIVID editing model will be available at this https URL.
zh
[CV-206] LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation
【速读】: 该论文试图解决基于扩散模型的文本到图像生成中个性化和可控生成实例的问题。解决方案的关键在于提出了LocRef-Diffusion模型,该模型通过引入Layout-net和appearance-net来实现实例位置和外观的精确控制。Layout-net利用显式的实例布局信息和实例区域交叉注意力模块来控制实例的生成位置,而appearance-net则通过提取实例外观特征并通过交叉注意力机制将其集成到扩散模型中,从而提高生成图像的外观保真度。实验结果表明,该方法在布局和外观引导生成方面达到了最先进的性能。
链接: https://arxiv.org/abs/2411.15252
作者: Fan Deng,Yaguang Wu,Xinyang Yu,Xiangjun Huang,Jian Yang,Guangyu Yan,Qiang Xu
关键词-EN: achieved remarkable success, generating high-quality images, achieved remarkable, remarkable success, success in generating
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances’ appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation.
zh
[CV-207] AnyText2: Visual Text Generation and Editing With Customizable Attributes
【速读】: 该论文试图解决在文本到图像 (Text-to-Image, T2I) 生成领域中,难以精确控制多语言文本的字体和颜色属性的问题。解决方案的关键在于提出了AnyText2方法,该方法包含两个主要组件:首先,引入WriteNet+AttnX架构,将文本渲染能力注入预训练的T2I模型,相比前作AnyText,不仅提升了图像真实感,还提高了19.8%的推理速度;其次,开发了文本嵌入模块 (Text Embedding Module),用于从场景图像中提取字体和颜色,并分别编码为条件,从而实现对每行文本属性的定制化控制,使得中文和英文的文本准确性分别提高了3.3%和9.3%。
链接: https://arxiv.org/abs/2411.15245
作者: Yuxiang Tuo,Yifeng Geng,Liefeng Bo
关键词-EN: garnered significant attention, domain progresses, significant attention, seamlessly integrates, integrates with visual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As the text-to-image (T2I) domain progresses, generating text that seamlessly integrates with visual content has garnered significant attention. However, even with accurate text generation, the inability to control font and color can greatly limit certain applications, and this issue remains insufficiently addressed. This paper introduces AnyText2, a novel method that enables precise control over multilingual text attributes in natural scene image generation and editing. Our approach consists of two main components. First, we propose a WriteNet+AttnX architecture that injects text rendering capabilities into a pre-trained T2I model. Compared to its predecessor, AnyText, our new approach not only enhances image realism but also achieves a 19.8% increase in inference speed. Second, we explore techniques for extracting fonts and colors from scene images and develop a Text Embedding Module that encodes these text attributes separately as conditions. As an extension of AnyText, this method allows for customization of attributes for each line of text, leading to improvements of 3.3% and 9.3% in text accuracy for Chinese and English, respectively. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method. The code and model will be made open-source in this https URL.
zh
[CV-208] Adversarial Prompt Distillation for Vision-Language Models
【速读】: 该论文试图解决预训练视觉-语言模型(Vision-Language Models, VLMs)如对比语言-图像预训练模型(Contrastive Language-Image Pre-Training, CLIP)在对抗攻击下的脆弱性问题。解决方案的关键在于提出了一种新的方法——对抗提示蒸馏(Adversarial Prompt Distillation, APD),该方法结合了对抗提示调优(Adversarial Prompt Tuning, APT)和知识蒸馏(Knowledge Distillation),通过为视觉和文本模态同时添加提示,并利用一个干净的预训练教师CLIP模型来提升学生CLIP模型在下游任务中的表现,从而增强模型的对抗鲁棒性和自然性能。实验结果表明,APD在自然和对抗性能方面均优于当前最先进的APT方法。
链接: https://arxiv.org/abs/2411.15244
作者: Lin Luo,Xin Wang,Bojia Zi,Shihao Zhao,Xingjun Ma
关键词-EN: Contrastive Language-Image Pre-Training, Large pre-trained Vision-Language, Contrastive Language-Image, Adversarial Prompt Tuning, Large pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large pre-trained Vision-Language Models (VLMs) such as Contrastive Language-Image Pre-Training (CLIP) have been shown to be susceptible to adversarial attacks, raising concerns about their deployment in safety-critical scenarios like autonomous driving and medical diagnosis. One promising approach for improving the robustness of pre-trained VLMs is Adversarial Prompt Tuning (APT), which combines adversarial training with prompt tuning. However, existing APT methods are mostly single-modal methods that design prompt(s) for only the visual or textual modality, limiting their effectiveness in either robustness or clean accuracy. In this work, we propose a novel method called Adversarial Prompt Distillation (APD) that combines APT with knowledge distillation to boost the adversarial robustness of CLIP. Specifically, APD is a bimodal method that adds prompts for both the visual and textual modalities while leveraging a cleanly pre-trained teacher CLIP model to distill and boost the performance of the student CLIP model on downstream tasks. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our APD over the current state-of-the-art APT methods in terms of both natural and adversarial performances. The effectiveness of our APD method validates the possibility of using a non-robust teacher to improve the generalization and robustness of VLMs.
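A minimal sketch of the distillation objective is given below, assuming `student` is a CLIP model whose visual/textual prompts are the only trainable parameters and `teacher` is the frozen, cleanly pre-trained CLIP; both are assumed to return class logits. It illustrates the general APT-plus-knowledge-distillation idea, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def apd_style_loss(student, teacher, x_clean, x_adv, y, alpha=1.0, temperature=4.0):
    """Cross-entropy on adversarial images plus distillation from the clean teacher."""
    s_adv = student(x_adv)                        # student logits on adversarial images
    with torch.no_grad():
        t_clean = teacher(x_clean)                # frozen teacher logits on clean images
    ce = F.cross_entropy(s_adv, y)                # adversarial prompt-tuning term
    kd = F.kl_div(F.log_softmax(s_adv / temperature, dim=-1),
                  F.softmax(t_clean / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    return ce + alpha * kd                        # gradients flow only into the prompts
```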
zh
[CV-209] EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
【速读】: 该论文试图解决在资源受限环境中部署神经网络时,如何高效捕捉全局依赖关系的问题。解决方案的关键在于提出了基于隐藏状态混合器的状态空间对偶模型(HSM-SSD),通过重新设计SSD层以实现隐藏状态内的通道混合操作,并引入多阶段隐藏状态融合来增强表示能力,从而显著降低计算成本并提高模型性能。Efficient Vision Mamba (EfficientViM) 架构在ImageNet-1k上实现了新的速度-准确性权衡,相比SHViT模型在速度更快的同时提供了高达0.7%的性能提升。
链接: https://arxiv.org/abs/2411.15241
作者: Sanghyeok Lee,Joonmyung Choi,Hyunwoo J. Kim
关键词-EN: resource-constrained environments, built lightweight architectures, deployment of neural, neural networks, networks in resource-constrained
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint
点击查看摘要
Abstract:For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model has emerged as an effective global token interaction with its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSM have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide the design alleviating the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training. Code is available at this https URL.
zh
[CV-210] Faithful Label-free Knowledge Distillation
【速读】: 该论文试图解决在大规模计算机视觉基础模型中,如何通过知识蒸馏(Knowledge Distillation)技术训练出高性能且轻量化的学生模型的问题。解决方案的关键在于提出了一种名为“Teacher in the Middle (TinTeM)”的无标签知识蒸馏方法,该方法通过学习教师网络潜在空间到学生网络的近似正交映射,从而生成一个更忠实于教师网络行为的学生模型。这种方法不仅在模型鲁棒性、泛化能力和分布外检测(OOD detection)方面表现优异,还能在特定任务数据集上训练出更准确、更具泛化能力和OOD检测性能的模型,为在小数据集上训练高性能轻量级模型提供了竞争性路径。
链接: https://arxiv.org/abs/2411.15239
作者: Evelyn J. Mannix,Liam Hodgkinson,Howard Bondell
关键词-EN: inductive bias, Knowledge distillation approaches, Knowledge distillation, model compression techniques, teacher network
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Knowledge distillation approaches are model compression techniques, with the goal of training a highly performant student model by using a teacher network that is larger or contains a different inductive bias. These approaches are particularly useful when applied to large computer vision foundation models, which can be compressed into smaller variants that retain desirable properties such as improved robustness. This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM), which improves on previous methods by learning an approximately orthogonal mapping from the latent space of the teacher to the student network. This produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection. It is further shown that knowledge distillation with TinTeM on task specific datasets leads to more accurate models with greater generalisability and OOD detection performance, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets.
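The core step, learning an approximately orthogonal linear map from the teacher's latent space to the student's, can be sketched as below. The dimensions, penalty weight and feature extractors are assumptions; the actual TinTeM objective may differ in detail.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Linear map from teacher latents to student latents, kept near-orthogonal."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, z_teacher):
        return self.W(z_teacher)

def tintem_style_loss(mapper, z_teacher, z_student, lam=0.1):
    distill = torch.mean((mapper(z_teacher) - z_student) ** 2)  # match student features
    W = mapper.W.weight
    eye = torch.eye(W.shape[0], device=W.device)
    ortho = torch.mean((W @ W.T - eye) ** 2)                    # orthogonality penalty
    return distill + lam * ortho
```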
zh
[CV-211] Stain-Invariant Representation for Tissue Classification in Histology Images
【速读】: 该论文试图解决在计算病理学 (CPath) 中,由于数字化病理切片 (WSI) 的多样性(包括染色协议、扫描仪和组织类型等因素)导致的领域偏移问题,这使得深度学习 (DL) 算法在多队列设置中的训练和测试面临显著挑战。解决方案的关键在于提出了一种框架,通过染色矩阵扰动生成训练图像的染色增强版本,并采用染色正则化损失来强制源图像和增强图像的特征表示之间的一致性。这种方法促使模型学习染色不变性,从而实现领域不变性的特征表示,最终在跨领域的结直肠癌图像的多类组织类型分类任务中取得了优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2411.15237
作者: Manahil Raza,Saad Bashir,Talha Qaiser,Nasir Rajpoot
关键词-EN: digitising histology slides, histology slides involves, slides involves multiple, involves multiple factors, final appearance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The process of digitising histology slides involves multiple factors that can affect a whole slide image’s (WSI) final appearance, including the staining protocol, scanner, and tissue type. This variability constitutes a domain shift and results in significant problems when training and testing deep learning (DL) algorithms in multi-cohort settings. As such, developing robust and generalisable DL models in computational pathology (CPath) remains an open challenge. In this regard, we propose a framework that generates stain-augmented versions of the training images using stain matrix perturbation. Thereafter, we employed a stain regularisation loss to enforce consistency between the feature representations of the source and augmented images. Doing so encourages the model to learn stain-invariant and, consequently, domain-invariant feature representations. We evaluate the performance of the proposed model on cross-domain multi-class tissue type classification of colorectal cancer images and have achieved improved performance compared to other state-of-the-art methods.
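A rough sketch of the stain-regularisation idea follows. `stain_augment` stands in for the stain-matrix perturbation and `model.features` / `model.classifier` are assumed interfaces; the snippet illustrates the consistency loss rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def stain_consistency_loss(model, x, y, stain_augment, lam=1.0):
    x_aug = stain_augment(x)                     # stain-matrix-perturbed copy of the batch
    f_src = model.features(x)                    # assumed feature-extraction hook
    f_aug = model.features(x_aug)
    consistency = F.mse_loss(f_aug, f_src.detach())   # push features to be stain-invariant
    task = F.cross_entropy(model.classifier(f_src), y)
    return task + lam * consistency
```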
zh
[CV-212] xt Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps
【速读】: 该论文试图解决文本到图像扩散模型中,由于文本嵌入(text embeddings)未能准确捕捉语法关系,导致交叉注意力图(cross-attention maps)错误地聚焦于相同图像区域,从而引发生成图像中对象缺失或属性绑定错误的问题。解决方案的关键在于提出一种方法,通过测试时优化(test-time optimization)直接将文本注意力图(text attention maps)中的语法关系转移到交叉注意力模块中,从而增强图像与文本之间的语义对齐,无需依赖外部指导。
链接: https://arxiv.org/abs/2411.15236
作者: Jeeyung Kim,Erfan Esmaeili,Qiang Qiu
关键词-EN: text attention maps, image regions attended, maps, specific image regions, attention maps
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions attended. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt, “a black car and a white clock”, the cross-attention maps for “black” and “car” should focus on overlapping regions to depict a black car, while “car” and “clock” should not. Incorrect overlapping in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes the following key observations while investigating this issue in existing text-to-image models: (1) the similarity in text embeddings between different tokens – used as conditioning inputs – can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture syntactic relations already within text attention maps. As a result, such syntactic relationships can be overlooked in the cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.
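A highly simplified sketch of the test-time objective is shown below: the spatial overlap between two tokens' cross-attention maps is pushed towards their relatedness in the text self-attention map. Tensor shapes, the token-pair selection and the variable being optimized are assumptions for illustration only.

```python
import torch

def attention_alignment_loss(cross_attn, text_attn, token_pairs):
    # cross_attn: (num_tokens, H, W) cross-attention maps from the diffusion U-Net
    # text_attn:  (num_tokens, num_tokens) self-attention of the text encoder
    loss = cross_attn.new_zeros(())
    for i, j in token_pairs:
        a_i = cross_attn[i] / (cross_attn[i].sum() + 1e-8)
        a_j = cross_attn[j] / (cross_attn[j].sum() + 1e-8)
        overlap = torch.minimum(a_i, a_j).sum()     # spatial overlap in [0, 1]
        target = text_attn[i, j]                    # syntactic relatedness of the tokens
        loss = loss + (overlap - target) ** 2
    return loss
```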
zh
[CV-213] CODE-CL: COnceptor-Based Gradient Projection for DEep Continual Learning
【速读】: 该论文试图解决深度神经网络在持续学习(Continual Learning)过程中面临的灾难性遗忘(catastrophic forgetting)问题。解决方案的关键在于引入基于概念矩阵(conceptor matrix)的梯度投影方法,即COnceptor-based gradient projection for DEep Continual Learning (CODE-CL)。CODE-CL通过利用概念矩阵表示法,灵活处理高度相关的任务,保留过去任务的重要权重子空间,并通过限制更新到正交子空间来减少遗忘。该方法通过编码过去任务输入空间中的方向重要性,允许在新知识集成时根据任务相关性进行调节,从而在不显著破坏先前知识的情况下,增强对相关任务的学习能力。实验结果表明,CODE-CL在持续学习图像分类基准测试中表现优异,显著减少了遗忘现象,超越了大多数最先进的方法。
链接: https://arxiv.org/abs/2411.15235
作者: Marco Paul E. Apolinario,Kaushik Roy
关键词-EN: integrate new concepts, enabling adaptability, dynamic environments, ability to progressively, progressively integrate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: 10 pages, 2 figures
点击查看摘要
Abstract:Continual learning, or the ability to progressively integrate new concepts, is fundamental to intelligent beings, enabling adaptability in dynamic environments. In contrast, artificial deep neural networks face the challenge of catastrophic forgetting when learning new tasks sequentially. To alleviate the problem of forgetting, recent approaches aim to preserve essential weight subspaces for previous tasks by limiting updates to orthogonal subspaces via gradient projection. While effective, this approach can lead to suboptimal performance, particularly when tasks are highly correlated. In this work, we introduce COnceptor-based gradient projection for DEep Continual Learning (CODE-CL), a novel method that leverages conceptor matrix representations, a computational model inspired by neuroscience, to more flexibly handle highly correlated tasks. CODE-CL encodes directional importance within the input space of past tasks, allowing new knowledge integration in directions modulated by 1 - S, where S represents the direction’s relevance for prior tasks. Additionally, we analyze task overlap using conceptor-based representations to identify highly correlated tasks, facilitating efficient forward knowledge transfer through scaled projection within their intersecting subspace. This strategy enhances flexibility, allowing learning in correlated tasks without significantly disrupting previous knowledge. Extensive experiments on continual learning image classification benchmarks validate CODE-CL’s efficacy, showcasing superior performance with minimal forgetting, outperforming most state-of-the-art methods.
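For readers unfamiliar with conceptors, the sketch below uses the standard definition C = R(R + alpha^-2 I)^-1, where R is the correlation matrix of activations collected from past tasks. CODE-CL's per-direction scaling by 1 - S is simplified here to a projection onto the complementary subspace I - C; this is an illustration, not the paper's algorithm.

```python
import torch

def conceptor(X: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Conceptor matrix from past-task activations X of shape (num_samples, dim)."""
    R = X.T @ X / X.shape[0]
    eye = torch.eye(R.shape[0], device=X.device)
    return R @ torch.linalg.inv(R + (alpha ** -2) * eye)

def modulated_gradient(grad: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Suppress update components along input directions important to past tasks."""
    eye = torch.eye(C.shape[0], device=grad.device)
    return grad @ (eye - C)        # grad: (out_dim, in_dim); modulation acts on inputs
```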
zh
[CV-214] LPLgrad: Optimizing Active Learning Through Gradient Norm Sample Selection and Auxiliary Model Training
【速读】: 该论文试图解决机器学习模型在缺乏大量标注数据的情况下表现不佳的问题。解决方案的关键在于提出了一种新的主动学习方法——损失预测损失与梯度范数(Loss Prediction Loss with Gradient Norm, LPLgrad),该方法通过两个阶段来提高图像分类任务的准确性:(i) 训练阶段,通过联合训练主模型和辅助模型来预测输入特征的损失,从而有效提取复杂特征和学习数据内在模式;(ii) 查询阶段,通过计算未标注数据集中样本的熵值的梯度范数来量化模型的不确定性,优先选择梯度范数最高的样本进行标注,从而在最小化标注工作量的同时提升模型性能。该方法在多个真实世界数据集上的广泛评估表明,其在少量标注图像的情况下,准确性显著优于现有最先进的方法。
链接: https://arxiv.org/abs/2411.15217
作者: Shreen Gul,Mohamed Elmahallawy,Sanjay Madria,Ardhendu Tripathy
关键词-EN: strong generalization capabilities, Machine learning models, Machine learning, Loss Prediction Loss, generalization capabilities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Machine learning models are increasingly being utilized across various fields and tasks due to their outstanding performance and strong generalization capabilities. Nonetheless, their success hinges on the availability of large volumes of annotated data, the creation of which is often labor-intensive, time-consuming, and expensive. Many active learning (AL) approaches have been proposed to address these challenges, but they often fail to fully leverage the information from the core phases of AL, such as training on the labeled set and querying new unlabeled samples. To bridge this gap, we propose a novel AL approach, Loss Prediction Loss with Gradient Norm (LPLgrad), designed to quantify model uncertainty effectively and improve the accuracy of image classification tasks. LPLgrad operates in two distinct phases: (i) a Training Phase that aims to predict the loss for input features by jointly training a main model and an auxiliary model. Both models are trained on the labeled data to maximize the efficiency of the learning process, an aspect often overlooked in previous AL methods. This dual-model approach enhances the ability to extract complex input features and learn intrinsic patterns from the data effectively; (ii) a Querying Phase that quantifies the uncertainty of the main model to guide sample selection. This is achieved by calculating the gradient norm of the entropy values for samples in the unlabeled dataset. Samples with the highest gradient norms are prioritized for labeling and subsequently added to the labeled set, improving the model’s performance with minimal labeling effort. Extensive evaluations on real-world datasets demonstrate that the LPLgrad approach outperforms state-of-the-art methods by an order of magnitude in terms of accuracy on a small number of labeled images, while achieving comparable training and querying times in multiple image classification tasks.
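The querying criterion, the gradient norm of each unlabeled sample's predictive entropy, can be sketched as below. The per-sample loop, the loader format and the model interface are assumptions made for clarity rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def entropy_gradient_scores(model, unlabeled_loader, device="cpu"):
    """Score each unlabeled sample by the gradient norm of its predictive entropy."""
    params = [p for p in model.parameters() if p.requires_grad]
    scores = []
    for x, idx in unlabeled_loader:                 # assumed (image, index) batches
        for xi, i in zip(x, idx):
            logits = model(xi.unsqueeze(0).to(device))
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * torch.log(probs + 1e-8)).sum()
            grads = torch.autograd.grad(entropy, params)
            norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
            scores.append((norm.item(), int(i)))
    scores.sort(reverse=True)                       # largest gradient norms queried first
    return scores
```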
zh
[CV-215] Image Harmonization using Robust Restricted CDF Matching
【速读】: 该论文试图解决机器学习算法在实际应用中面临的输入数据变异性问题,特别是在不同用户、机构和扫描设备之间数据差异显著的情况下。解决方案的关键在于采用基于累积分布函数(Cumulative Distribution Function, CDF)匹配的图像调和方法,通过曲线拟合实现图像强度的非线性变换。这种非线性变换不仅能够“平滑且弹性”地匹配模板,还能保留输入数据中的局部变异性,这对于后续的机器学习处理至关重要。与传统的直方图匹配算法相比,该方法在保持重要特征的同时,提供了更好的控制和直观性,尤其在与基于机器学习的方法相比时更为明显。尽管该方法在MRI图像上进行了演示,但其通用性足以应用于其他类型的成像数据。
链接: https://arxiv.org/abs/2411.15213
作者: Roman Stoklasa
关键词-EN: Deployment of machine, machine learning algorithms, difficult task, Cumulative Distribution Function, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI 2025)
点击查看摘要
Abstract:Deployment of machine learning algorithms into real-world practice is still a difficult task. One of the challenges lies in the unpredictable variability of input data, which may differ significantly among individual users, institutions, scanners, etc. The input data variability can be decreased by using suitable data preprocessing with robust data harmonization. In this paper, we present a method of image harmonization using Cumulative Distribution Function (CDF) matching based on curve fitting. This approach does not ruin local variability or important individual features. The transformation of image intensities is non-linear but still "smooth and elastic", as compared to other known histogram matching algorithms. Non-linear transformation allows for a very good match to the template. At the same time, elasticity constraints help to preserve local variability among individual inputs, which may encode important features for subsequent machine-learning processing. The pre-defined template CDF offers a better and more intuitive control for the input data transformation compared to other methods, especially ML-based ones. Even though we demonstrate our method for MRI images, the method is generic enough to apply to other types of imaging data.
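For context, a plain CDF-matching baseline via piecewise-linear interpolation is sketched below; the paper replaces this direct mapping with a smooth, elasticity-constrained curve fit, so the snippet shows only the uncorrected starting point.

```python
import numpy as np

def cdf_match(image: np.ndarray, template_values: np.ndarray,
              template_cdf: np.ndarray) -> np.ndarray:
    """Map image intensities so their CDF matches a pre-defined template CDF."""
    flat = image.ravel()
    src_values, counts = np.unique(flat, return_counts=True)
    src_cdf = np.cumsum(counts).astype(np.float64)
    src_cdf /= src_cdf[-1]
    # For each source intensity, find the template intensity with the same quantile.
    mapped = np.interp(src_cdf, template_cdf, template_values)
    return np.interp(flat, src_values, mapped).reshape(image.shape).astype(np.float32)
```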
zh
[CV-216] LightLLM : A Versatile Large Language Model for Predictive Light Sensing
【速读】: 该论文试图解决将预训练的大型语言模型(LLMs)适应于基于光感知的特定任务的问题。解决方案的关键在于提出了LightLLM模型,该模型通过微调预训练的LLMs,结合传感器数据编码器、上下文提示和融合层,将传感器数据与文本提示融合成统一的表示形式。关键的创新点在于保持预训练LLM的参数不变,仅通过添加轻量级的可训练组件来进行微调,从而在最小计算开销和重新训练成本的情况下,实现对新任务的灵活适应。实验结果表明,LightLLM在光基定位、户外太阳能预测和室内太阳能估计等任务中显著优于现有最先进的方法,并且在未见过的环境中表现出色。
链接: https://arxiv.org/abs/2411.15211
作者: Jiawei Hu,Hong Jia,Mahbub Hassan,Lina Yao,Brano Kusy,Wen Hu
关键词-EN: fine tunes pre-trained, tunes pre-trained large, pre-trained large language, large language models, fine tunes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 15 pages, 14 figures, 5 tables
点击查看摘要
Abstract:We propose LightLLM, a model that fine tunes pre-trained large language models (LLMs) for light-based sensing tasks. It integrates a sensor data encoder to extract key features, a contextual prompt to provide environmental information, and a fusion layer to combine these inputs into a unified representation. This combined input is then processed by the pre-trained LLM, which remains frozen while being fine-tuned through the addition of lightweight, trainable components, allowing the model to adapt to new tasks without altering its original parameters. This approach enables flexible adaptation of LLM to specialized light sensing tasks with minimal computational overhead and retraining effort. We have implemented LightLLM for three light sensing tasks: light-based localization, outdoor solar forecasting, and indoor solar estimation. Using real-world experimental datasets, we demonstrate that LightLLM significantly outperforms state-of-the-art methods, achieving 4.4x improvement in localization accuracy and 3.4x improvement in indoor solar estimation when tested in previously unseen environments. We further demonstrate that LightLLM outperforms ChatGPT-4 with direct prompting, highlighting the advantages of LightLLM’s specialized architecture for sensor data fusion with textual prompts.
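A conceptual sketch of the wiring is given below: a trainable sensor encoder and fusion layer feed a frozen pre-trained LLM. The class name, layer sizes, prompt handling and the HF-style `inputs_embeds` call are assumptions; LightLLM's actual modules are not reproduced here.

```python
import torch
import torch.nn as nn

class LightSensingAdapter(nn.Module):
    """Trainable encoder + fusion layer in front of a frozen pre-trained LLM."""
    def __init__(self, sensor_dim: int, d_model: int, llm: nn.Module):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(sensor_dim, d_model), nn.GELU())
        self.fusion = nn.Linear(2 * d_model, d_model)
        self.llm = llm
        for p in self.llm.parameters():            # keep the pre-trained LLM frozen
            p.requires_grad_(False)

    def forward(self, sensor_x, prompt_embeds):
        # sensor_x: (batch, sensor_dim); prompt_embeds: (batch, seq_len, d_model)
        s = self.encoder(sensor_x).unsqueeze(1).expand(-1, prompt_embeds.size(1), -1)
        fused = self.fusion(torch.cat([s, prompt_embeds], dim=-1))
        return self.llm(inputs_embeds=fused)       # assumed HF-style interface
```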
zh
[CV-217] owards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks
【速读】: 该论文试图解决在安全关键应用中深度学习模型对抗扰动的鲁棒性评估问题,特别是在大规模测试和确保评估反映真实世界对抗风险方面的挑战。解决方案的关键在于提出了一种新的个体攻击方法——概率边界攻击 (Probability Margin Attack, PMA),该方法在概率空间而非logits空间定义对抗边界。PMA不仅在性能上优于当前最先进的个体攻击方法,还为后续的集成攻击提供了基础。此外,论文通过构建百万级数据集CC1M,首次对对抗训练的ImageNet模型进行了百万级白盒对抗鲁棒性评估,揭示了个体与集成攻击以及小规模与大规模评估之间的鲁棒性差距。
链接: https://arxiv.org/abs/2411.15210
作者: Yong Xie,Weijie Zheng,Hanxun Huang,Guangnan Ye,Xingjun Ma
关键词-EN: deep learning models, safety-critical applications, evaluating their vulnerabilities, reliability and trustworthiness, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerabilities to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness evaluation methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. We analyze the relationship between PMA and existing cross-entropy or logits-margin-based attacks, and show that PMA can outperform the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we create a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially-trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual versus ensemble attacks and small-scale versus million-scale evaluations.
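A sketch of a probability-margin objective inside a basic PGD loop is shown below; the exact margin definition, step schedule and ensembling used by PMA are simplified here, so treat this only as an illustration of moving the adversarial margin from logit space to probability space.

```python
import torch
import torch.nn.functional as F

def probability_margin(logits, y):
    """Highest wrong-class probability minus true-class probability (>0 = misclassified)."""
    probs = F.softmax(logits, dim=-1)
    p_true = probs.gather(1, y.unsqueeze(1)).squeeze(1)
    masked = probs.clone()
    masked.scatter_(1, y.unsqueeze(1), -1.0)       # exclude the true class from the max
    return masked.max(dim=1).values - p_true

def pgd_with_margin(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = probability_margin(model(x_adv), y).sum()   # maximise the margin
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv
```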
zh
[CV-218] DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh
【速读】: 该论文试图解决现有文本驱动虚拟形象生成方法中,将人体与服装作为一个整体3D模型处理,导致服装替换困难和用户对生成过程控制不足的问题。解决方案的关键在于提出了DAGSM(Disentangled Avatar Generation with Semantic Modeling)这一新流程,通过将穿着衣物的人体各部分(如身体、上衣/下衣)分别建模为GS增强网格(GSM),并引入语义算法实现人体与服装以及服装之间的更好分离。此外,通过视图一致性纹理优化模块,包括跨视图注意力机制和入射角加权去噪(IAW-DE)策略,提升纹理质量,从而生成高质量、可替换服装并支持逼真动画的虚拟形象。
链接: https://arxiv.org/abs/2411.15205
作者: Jingyu Zhuang,Di Kang,Linchao Bao,Liang Lin,Guanbin Li
关键词-EN: Text-driven avatar generation, gained significant attention, significant attention owing, Text-driven avatar, gained significant
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Text-driven avatar generation has gained significant attention owing to its convenience. However, existing methods typically model the human body with all garments as a single 3D model, limiting its usability, such as clothing replacement, and reducing user control over the generation process. To overcome the limitations above, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. Specifically, we model each part (e.g., body, upper/lower clothes) of the clothed human as one GS-enhanced mesh (GSM), which is a traditional mesh attached with 2D Gaussians to better handle complicated textures (e.g., woolen, translucent clothes) and produce realistic cloth animations. During the generation, we first create the unclothed body, followed by a sequence of individual cloth generation based on the body, where we introduce a semantic-based algorithm to achieve better human-cloth and garment-garment separation. To improve texture quality, we propose a view-consistent texture refinement module, including a cross-view attention mechanism for texture style consistency and an incident-angle-weighted denoising (IAW-DE) strategy to update the appearance. Extensive experiments have demonstrated that DAGSM generates high-quality disentangled avatars, supports clothing replacement and realistic animation, and outperforms the baselines in visual quality.
zh
[CV-219] Label Distribution Shift-Aware Prediction Refinement for Test-Time Adaptation
【速读】: 该论文试图解决测试时适应(Test-time adaptation, TTA)方法在面对标签分布偏移(label distribution shifts)时性能显著下降的问题。解决方案的关键在于引入了一种名为标签分布偏移感知预测优化(label Distribution shift-Aware prediction Refinement for Test-time adaptation, DART)的新方法。DART通过在训练数据中使用具有多样类分布的批次来训练一个预测优化模块,该模块在测试时用于检测和纠正类分布偏移,从而显著提高测试数据的伪标签准确性。这种方法在CIFAR-10C上展示了5-18%的准确性提升,并且在没有标签分布偏移的情况下不会导致性能下降,使其成为现有TTA方法中一个有价值的插件工具。
链接: https://arxiv.org/abs/2411.15204
作者: Minguk Jang,Hye Won Chung
关键词-EN: encountering input distribution, distribution shifts, TTA methods, TTA, label distribution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Test-time adaptation (TTA) is an effective approach to mitigate performance degradation of trained models when encountering input distribution shifts at test time. However, existing TTA methods often suffer significant performance drops when facing additional class distribution shifts. We first analyze TTA methods under label distribution shifts and identify the presence of class-wise confusion patterns commonly observed across different covariate shifts. Based on this observation, we introduce label Distribution shift-Aware prediction Refinement for Test-time adaptation (DART), a novel TTA method that refines the predictions by focusing on class-wise confusion patterns. DART trains a prediction refinement module during an intermediate time by exposing it to several batches with diverse class distributions using the training dataset. This module is then used during test time to detect and correct class distribution shifts, significantly improving pseudo-label accuracy for test data. Our method exhibits 5-18% gains in accuracy under label distribution shifts on CIFAR-10C, without any performance degradation when there is no label distribution shift. Extensive experiments on CIFAR, PACS, OfficeHome, and ImageNet benchmarks demonstrate DART’s ability to correct inaccurate predictions caused by test-time distribution shifts. This improvement leads to enhanced performance in existing TTA methods, making DART a valuable plug-in tool.
zh
[CV-220] Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking COLING2025
【速读】: 该论文试图解决当前视觉语言模型 (Vision Language Models, VLMs) 评估基准在测试模型对复杂视觉和文本内容的理解和处理能力方面的不足。现有基准通常侧重于简单任务,无法全面评估模型在深度推理和多模态数据整合方面的能力。论文提出的解决方案是引入 PARROT-360V 基准,这是一个包含 2487 个复杂视觉谜题的综合性基准,旨在测试 VLMs 在复杂视觉推理任务中的表现。关键在于通过 PARROT-360V 评估领先模型(如 GPT-4o、Claude-3.5-Sonnet 和 Gemini-1.5-Pro)在结合视觉线索与语言技能解决任务方面的能力,从而揭示当前 VLMs 在处理复杂、多步骤推理任务中的局限性,并强调需要更强大的评估框架来推动该领域的发展。
链接: https://arxiv.org/abs/2411.15201
作者: Harsha Vardhan Khurdula,Basem Rizk,Indus Khaitan,Janit Anjaria,Aviral Srivastava,Rajvardhan Khaitan
关键词-EN: evaluating Vision Language, Vision Language Models, evaluating Vision, assessing model abilities, Vision Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures, Accepted at COLING 2025
点击查看摘要
Abstract:Current benchmarks for evaluating Vision Language Models (VLMs) often fall short in thoroughly assessing model abilities to understand and process complex visual and textual content. They typically focus on simple tasks that do not require deep reasoning or the integration of multiple data modalities to solve an original problem. To address this gap, we introduce the PARROT-360V Benchmark, a novel and comprehensive benchmark featuring 2487 challenging visual puzzles designed to test VLMs on complex visual reasoning tasks. We evaluated leading models: GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro, using PARROT-360V to assess their capabilities in combining visual clues with language skills to solve tasks in a manner akin to human problem-solving. Our findings reveal a notable performance gap: state-of-the-art models scored between 28% and 56% on our benchmark, significantly lower than their performance on popular benchmarks. This underscores the limitations of current VLMs in handling complex, multi-step reasoning tasks and highlights the need for more robust evaluation frameworks to advance the field.
zh
[CV-221] Deep Learning-Based Classification of Hyperkinetic Movement Disorders in Children
【速读】: 该论文试图解决儿童超动力运动障碍(Hyperkinetic Movement Disorders, HMDs)的诊断难题,特别是区分肌张力障碍(dystonia)和舞蹈症(chorea)。解决方案的关键在于开发了一种基于神经网络的模型,该模型通过视频记录的儿童执行运动任务的影像来区分这两种疾病。模型结合了图卷积网络(Graph Convolutional Network, GCN)和长短期记忆网络(Long Short-Term Memory, LSTM),分别用于捕捉空间关系和时间动态,并加入了注意力机制(Attention mechanisms)以提高模型的可解释性。该模型在50个视频数据集上进行了训练和验证,取得了85%的准确率、81%的敏感性和88%的特异性,展示了深度学习在提高HMD诊断准确性和效率方面的潜力。
链接: https://arxiv.org/abs/2411.15200
作者: Nandika Ramamurthy,Dr Daniel Lumsden,Dr Rachel Sparks
关键词-EN: Hyperkinetic movement disorders, pose significant diagnostic, overlapping clinical features, significant diagnostic challenges, abnormal twisting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 59 pages, 20 figures
点击查看摘要
Abstract:Hyperkinetic movement disorders (HMDs) in children, including dystonia (abnormal twisting) and chorea (irregular, random movements), pose significant diagnostic challenges due to overlapping clinical features. The prevalence of dystonia ranges from 2 to 50 per million, and chorea from 5 to 10 per 100,000. These conditions are often diagnosed with delays averaging 4.75 to 7.83 years. Traditional diagnostic methods depend on clinical history and expert physical examinations, but specialized tests are ineffective due to the complex pathophysiology of these disorders. This study develops a neural network model to differentiate between dystonia and chorea from video recordings of paediatric patients performing motor tasks. The model integrates a Graph Convolutional Network (GCN) to capture spatial relationships and Long Short-Term Memory (LSTM) networks to account for temporal dynamics. Attention mechanisms were incorporated to improve model interpretability. The model was trained and validated on a dataset of 50 videos (31 chorea-predominant, 19 dystonia-predominant) collected under regulatory approval from Guy’s and St Thomas’ NHS Foundation Trust. The model achieved 85% accuracy, 81% sensitivity, and 88% specificity at 15 frames per second. Attention maps highlighted the model’s ability to correctly identify involuntary movement patterns, with misclassifications often due to occluded body parts or subtle movement variations. This work demonstrates the potential of deep learning to improve the accuracy and efficiency of HMD diagnosis and could contribute to more reliable, interpretable clinical tools.
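An illustrative skeleton of a GCN-plus-LSTM classifier over pose sequences is sketched below, with a simple graph convolution per frame followed by an LSTM over time and a two-class head. Layer sizes, the normalized adjacency `a_hat` and the omission of the attention module are assumptions, not the study's model.

```python
import torch
import torch.nn as nn

class GCNLSTMClassifier(nn.Module):
    def __init__(self, in_feats=3, hid=64, num_joints=17, num_classes=2):
        super().__init__()
        self.gcn_w = nn.Linear(in_feats, hid)
        self.lstm = nn.LSTM(hid * num_joints, hid, batch_first=True)
        self.head = nn.Linear(hid, num_classes)

    def forward(self, x, a_hat):
        # x: (batch, time, joints, in_feats); a_hat: (joints, joints) normalized adjacency
        b, t, j, _ = x.shape
        h = torch.relu(torch.einsum("jk,btkf->btjf", a_hat, self.gcn_w(x)))
        h = h.reshape(b, t, -1)            # flatten per-frame joint features
        out, _ = self.lstm(h)
        return self.head(out[:, -1])       # dystonia-vs-chorea logits from last step
```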
zh
[CV-222] Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation
【速读】: 该论文试图解决扩散模型在生成过程中的可控性问题,即如何不仅控制生成结果的类型,还能控制生成过程的长度和参数。解决方案的关键在于提出了一个名为自适应可控扩散 (Adaptively Controllable Diffusion, AC-Diff) 模型的新框架。该框架通过条件时间步 (Conditional Time-Step, CTS) 模块来确定生成所需的步数,并通过自适应混合噪声调度 (Adaptive Hybrid Noise Schedule, AHNS) 模块来估计扩散率参数。此外,该模型还采用了相应的自适应采样机制进行训练,以根据条件调整自身,从而提升整体性能。最终目标是大幅减少生成步骤的平均数量和执行时间,同时保持与现有文献中扩散模型相同的性能。
链接: https://arxiv.org/abs/2411.15199
作者: Yucheng Xing,Xiaodong Liu,Xin Wang
关键词-EN: artificial intelligence, represent the creativity, development of artificial, put onto generative, important aspect
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:With the development of artificial intelligence, more and more attention has been put onto generative models, which represent creativity, a very important aspect of intelligence. In recent years, diffusion models have been studied and proven to be more reasonable and effective than previous methods. However, common diffusion frameworks suffer from controllability problems. Although extra conditions have been considered by some work to guide the diffusion process for a specific target generation, it only controls the generation result but not its process. In this work, we propose a new adaptive framework, Adaptively Controllable Diffusion (AC-Diff) Model, to automatically and fully control the generation process, including not only the type of generation result but also the length and parameters of the generation process. Both inputs and conditions will be first fed into a Conditional Time-Step (CTS) Module to determine the number of steps needed for a generation. Then according to the length of the process, the diffusion rate parameters will be estimated through our Adaptive Hybrid Noise Schedule (AHNS) Module. We further train the network with the corresponding adaptive sampling mechanism to learn how to adjust itself according to the conditions for the overall performance improvement. To enable its practical applications, AC-Diff is expected to largely reduce the average number of generation steps and execution time while maintaining the same performance as existing diffusion models in the literature.
zh
[CV-223] Gradient-Weighted Feature Back-Projection: A Fast Alternative to Feature Distillation in 3D Gaussian Splatting
【速读】: 该论文试图解决在高斯光栅化中进行特征场渲染的问题,特别是如何在无需训练的情况下实现高质量的2D和3D分割。解决方案的关键在于提出了一种训练无关的方法,通过将2D特征反投影到预训练的3D高斯分布中,并基于每个高斯分布在最终渲染中的影响进行加权求和。这种方法不仅在2D分割上表现出色,而且在3D分割上也取得了高质量的结果,无需后续处理,且在速度和可扩展性方面与基于训练的方法相当。
链接: https://arxiv.org/abs/2411.15193
作者: Joji Joseph,Bharadwaj Amrutur,Shalabh Bhatnagar
关键词-EN: Gaussian splatting, feature field rendering, introduce a training-free, feature field, field rendering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce a training-free method for feature field rendering in Gaussian splatting. Our approach back-projects 2D features into pre-trained 3D Gaussians, using a weighted sum based on each Gaussian’s influence in the final rendering. While most training-based feature field rendering methods excel at 2D segmentation but perform poorly at 3D segmentation without post-processing, our method achieves high-quality results in both 2D and 3D segmentation. Experimental results demonstrate that our approach is fast, scalable, and offers performance comparable to training-based methods.
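The back-projection rule itself is a weighted average, as sketched below: each Gaussian's 3D feature is the influence-weighted mean of the 2D features at the pixels it touches. The per-pixel influence weights would come from the splatting rasterizer and are assumed to be given here.

```python
import torch

def backproject_features(feat_2d: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # feat_2d: (num_pixels, feat_dim) 2D features, e.g. from a segmentation model
    # weights: (num_gaussians, num_pixels) influence of each Gaussian on each pixel
    num = weights @ feat_2d                                   # (num_gaussians, feat_dim)
    denom = weights.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return num / denom                                        # per-Gaussian 3D feature
```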
zh
[CV-224] LegoPET: Hierarchical Feature Guided Conditional Diffusion for PET Image Reconstruction
【速读】: 该论文试图解决传统迭代技术在正电子发射断层扫描(PET)图像重建中存在的局限性,特别是深度学习(DL)方法在直接从原始数据(sinograms)重建PET图像时可能产生的过度平滑或引入伪影的问题。解决方案的关键是引入了一种分层特征引导的条件扩散概率模型(cDPM),称为LegoPET。该模型通过分层特征引导,能够在保持输入与输出图像之间的一致性和对应关系的同时,提高图像重建的感知质量和像素级别的PSNR/SSIM指标。实验结果表明,LegoPET不仅改进了cDPM的性能,还在视觉质量和定量指标上超越了现有的基于DL的PET图像重建技术。
链接: https://arxiv.org/abs/2411.16629
作者: Yiran Sun,Osama Mawlawi
关键词-EN: Positron emission tomography, cancer detection due, Positron emission, emission tomography, processes in vivo
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures
点击查看摘要
Abstract:Positron emission tomography (PET) is widely utilized for cancer detection due to its ability to visualize functional and biological processes in vivo. PET images are usually reconstructed from histogrammed raw data (sinograms) using traditional iterative techniques (e.g., OSEM, MLEM). Recently, deep learning (DL) methods have shown promise by directly mapping raw sinogram data to PET images. However, DL approaches that are regression-based or GAN-based often produce overly smoothed images or introduce various artifacts, respectively. Image-conditioned diffusion probabilistic models (cDPMs) are another class of likelihood-based DL techniques capable of generating highly realistic and controllable images. While cDPMs have notable strengths, they still face challenges such as maintaining correspondence and consistency between input and output images when they are from different domains (e.g., sinogram vs. image domain), as well as slow convergence rates. To address these limitations, we introduce LegoPET, a hierarchical feature guided conditional diffusion model for high-perceptual quality PET image reconstruction from sinograms. We conducted several experiments demonstrating that LegoPET not only improves the performance of cDPMs but also surpasses recent DL-based PET image reconstruction techniques in terms of visual quality and pixel-level PSNR/SSIM metrics. Our code is available at this https URL.
zh
[CV-225] PriorPath: Coarse-To-Fine Approach for Controlled De-Novo Pathology Semantic Masks Generation
【速读】: 该论文试图解决在数字病理学中,由于组织样本多样性和图像标注的复杂性导致的偏差数据集问题,从而限制了基于这些数据集训练的算法的应用性。解决方案的关键在于提出了一种名为PriorPath的管道,该管道能够从粗粒度图像中生成详细的、现实的语义掩码,从而实现对生成掩码的空间排列的控制,进而控制合成图像的细胞特征。PriorPath不仅能够覆盖语义掩码空间,还能提供比先前方法更好的真实掩码相似性,从而在一个平台上同时生成逼真的掩码和图像,为计算病理学中的AI应用提供了一种先进的、可控的解决方案。
链接: https://arxiv.org/abs/2411.16515
作者: Nati Daniel,May Nathan,Eden Azeroual,Yael Fisher,Yonatan Savir
关键词-EN: Incorporating artificial intelligence, offers promising prospects, Incorporating artificial, digital pathology offers, pathology offers promising
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Incorporating artificial intelligence (AI) into digital pathology offers promising prospects for automating and enhancing tasks such as image analysis and diagnostic processes. However, the diversity of tissue samples and the necessity for meticulous image labeling often result in biased datasets, constraining the applicability of algorithms trained on them. To harness synthetic histopathological images to cope with this challenge, it is essential not only to produce photorealistic images but also to be able to exert control over the cellular characteristics they depict. Previous studies used methods to generate, from random noise, semantic masks that captured the spatial distribution of the tissue. These masks were then used as a prior for conditional generative approaches to produce photorealistic histopathological images. However, as with many other generative models, this solution exhibits mode collapse as the model fails to capture the full diversity of the underlying data distribution. In this work, we present a pipeline, coined PriorPath, that generates detailed, realistic, semantic masks derived from coarse-grained images delineating tissue regions. This approach enables control over the spatial arrangement of the generated masks and, consequently, the resulting synthetic images. We demonstrated the efficacy of our method across three cancer types, skin, prostate, and lung, showcasing PriorPath’s capability to cover the semantic mask space and to provide better similarity to real masks compared to previous methods. Our approach allows for specifying desired tissue distributions and obtaining both photorealistic masks and images within a single platform, thus providing a state-of-the-art, controllable solution for generating histopathological images to facilitate AI for computational pathology.
zh
[CV-226] Comparison of Generative Learning Methods for Turbulence Modeling
【速读】: 该论文试图解决湍流流动数值模拟中的高计算成本问题,特别是在技术相关问题中难以负担高分辨率技术如直接数值模拟 (DNS) 和大涡模拟 (LES) 的计算资源。解决方案的关键在于利用生成式概率模型 (Generative Probabilistic Models) 的最新进展,特别是变分自编码器 (VAE)、深度卷积生成对抗网络 (DCGAN) 和去噪扩散概率模型 (DDPM),来模拟二维卡门涡街。通过使用大涡模拟 (LES) 获取训练数据,论文评估了这些模型捕捉湍流流动统计特性和空间结构的能力。结果表明,DDPM 和 DCGAN 能够有效复制流动分布,显示出它们作为高效且准确的湍流建模工具的潜力。特别是 DCGAN,尽管训练难度较大(如模式崩溃问题),但其推理和训练时间最短,所需数据较少,且生成的结果与输入流最为一致。
链接: https://arxiv.org/abs/2411.16417
作者: Claudia Drygala,Edmund Ross,Francesca di Mare,Hanno Gottschalk
关键词-EN: Direct Numerical Simulation, Large Eddy Simulation, present significant challenges, high computational cost, Numerical simulations
类目: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Numerical simulations of turbulent flows present significant challenges in fluid dynamics due to their complexity and high computational cost. High resolution techniques such as Direct Numerical Simulation (DNS) and Large Eddy Simulation (LES) are generally not computationally affordable, particularly for technologically relevant problems. Recent advances in machine learning, specifically in generative probabilistic models, offer promising alternatives for turbulence modeling. This paper investigates the application of three generative models - Variational Autoencoders (VAE), Deep Convolutional Generative Adversarial Networks (DCGAN), and Denoising Diffusion Probabilistic Models (DDPM) - in simulating a 2D Kármán vortex street around a fixed cylinder. Training data was obtained by means of LES. We evaluate each model’s ability to capture the statistical properties and spatial structures of the turbulent flow. Our results demonstrate that DDPM and DCGAN effectively replicate the flow distribution, highlighting their potential as efficient and accurate tools for turbulence modeling. We find a strong argument for DCGANs: although they are more difficult to train (due to problems such as mode collapse), they give the fastest inference and training times, require less data to train compared to VAE and DDPM, and provide the results most closely aligned with the input stream. In contrast, VAEs train quickly (and can generate samples quickly) but do not produce adequate results, and DDPM, whilst effective, is significantly slower at both inference and training time.
zh
[CV-227] Privacy-Preserving Federated Foundation Model for Generalist Ultrasound Artificial Intelligence
【速读】: 该论文试图解决传统超声诊断中依赖医生经验、图像质量欠佳以及诊断错误率高等问题。解决方案的关键在于提出了 UltraFedFM,一种创新的隐私保护超声基础模型。UltraFedFM 通过联邦学习(Federated Learning)在分布于9个国家的16家医疗机构中进行协作预训练,利用超过100万张涵盖19个器官和10种超声模态的图像数据。这一方法不仅解决了大规模标注数据带来的隐私问题,还通过广泛的多样化数据和安全训练框架,使模型展现出强大的泛化能力和诊断能力,显著提升了临床诊断的准确性和隐私保护水平。
链接: https://arxiv.org/abs/2411.16380
作者: Yuncheng Jiang,Chun-Mei Feng,Jinke Ren,Jun Wei,Zixun Zhang,Yiwen Hu,Yunbi Liu,Rui Sun,Xuemei Tang,Juan Du,Xiang Wan,Yong Xu,Bo Du,Xin Gao,Guangyu Wang,Shaohua Zhou,Shuguang Cui,Rick Siow Mong Goh,Yong Liu,Zhen Li
关键词-EN: non-invasive nature, nature and real-time, Ultrasound, Ultrasound imaging, clinical diagnosis due
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Ultrasound imaging is widely used in clinical diagnosis due to its non-invasive nature and real-time capabilities. However, conventional ultrasound diagnostics face several limitations, including high dependence on physician expertise and suboptimal image quality, which complicates interpretation and increases the likelihood of diagnostic errors. Artificial intelligence (AI) has emerged as a promising solution to enhance clinical diagnosis, particularly in detecting abnormalities across various biomedical imaging modalities. Nonetheless, current AI models for ultrasound imaging face critical challenges. First, these models often require large volumes of labeled medical data, raising concerns over patient privacy breaches. Second, most existing models are task-specific, which restricts their broader clinical utility. To overcome these challenges, we present UltraFedFM, an innovative privacy-preserving ultrasound foundation model. UltraFedFM is collaboratively pre-trained using federated learning across 16 distributed medical institutions in 9 countries, leveraging a dataset of over 1 million ultrasound images covering 19 organs and 10 ultrasound modalities. This extensive and diverse data, combined with a secure training framework, enables UltraFedFM to exhibit strong generalization and diagnostic capabilities. It achieves an average area under the receiver operating characteristic curve of 0.927 for disease diagnosis and a dice similarity coefficient of 0.878 for lesion segmentation. Notably, UltraFedFM surpasses the diagnostic accuracy of mid-level ultrasonographers and matches the performance of expert-level sonographers in the joint diagnosis of 8 common systemic diseases. These findings indicate that UltraFedFM can significantly enhance clinical diagnostics while safeguarding patient privacy, marking an advancement in AI-driven ultrasound imaging for future clinical applications.
zh
[CV-228] WTDUN: Wavelet Tree-Structured Sampling and Deep Unfolding Network for Image Compressed Sensing
【速读】: 该论文试图解决现有深度展开网络在压缩感知(Compressed Sensing, CS)中面临的两个主要问题:1) 直接从单通道图像中学习,导致特征表示简单,未能充分捕捉复杂特征;2) 对不同图像成分进行均匀处理,忽视了各成分的特性。解决方案的关键在于提出了一种新的基于小波域的深度展开框架,命名为WTDUN。该框架直接在多尺度小波子带上操作,利用小波系数的固有稀疏性和多尺度结构,实现树状采样和重建,