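arXiv Daily Papers | 2024-11-26

This post lists the latest papers retrieved from arXiv.org on 2024-11-26. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR.

Note: the daily paper data comes from arXiv.org and is refreshed automatically at around 12:00 each day.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.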

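[Quick read]: This paper asks whether large language models (LLMs), when answering multi-hop queries, rely on shortcuts in their training data rather than performing genuine latent reasoning. The key to the approach is an evaluation dataset named SOCRATES (ShOrtCut-fRee lATent rEaSoning): it excludes queries whose head and answer entities co-appear in the training data and systematically removes cases where a model might guess the answer or exploit partial matches, so that success reflects latent multi-hop reasoning without shortcuts. The study finds that LLMs show fairly good latent multi-hop reasoning on some query types but perform poorly on others, especially queries that require latently recalling a year as the intermediate answer.

Link: https://arxiv.org/abs/2411.16679
Authors: Sohee Yang,Nora Kassner,Elena Gribovskaya,Sebastian Riedel,Mor Geva
Keywords: Large Language Models, Large Language, Summer Olympics, Scarlett Johansson, year Scarlett Johansson
Categories: Computation and Language (cs.CL)
Comments: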

Abstract:We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like “In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of”. One major challenge in evaluating this ability is that LLMs may have developed shortcuts by encounters of the head entity “Scarlett Johansson” and the answer entity “United States” in the same training sequences or merely guess the answer based on frequency-based priors. To prevent shortcuts, we exclude test queries where the head and answer entities co-appear in pretraining corpora. Through careful selection of relations and facts and systematic removal of cases where models might guess answers or exploit partial matches, we construct an evaluation dataset SOCRATES (ShOrtCut-fRee lATent rEaSoning). We observe that LLMs demonstrate promising latent multi-hop reasoning abilities without exploiting shortcuts, but only for certain types of queries. For queries requiring latent recall of countries as the intermediate answer, the best models achieve 80% latent composability, but this drops to just 5% for the recall of years. Comparisons with Chain-of-Thought composability highlight a significant gap between the ability of models to reason latently versus explicitly. Analysis reveals that latent representations of the intermediate answer are constructed more often in queries with higher latent composability, and shows the emergence of latent multi-hop reasoning during pretraining.
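
The co-occurrence filter described in the abstract can be illustrated with a minimal sketch. It assumes the corpus fits in memory as a list of strings and uses plain substring matching; the entity "Ada Example" and the three toy documents are hypothetical, and the paper's actual filtering over pretraining corpora is certainly more involved.

```python
def cooccur(head: str, answer: str, documents: list[str]) -> bool:
    """Return True if both entity strings appear in the same document."""
    head_l, answer_l = head.lower(), answer.lower()
    return any(head_l in doc.lower() and answer_l in doc.lower() for doc in documents)


def shortcut_free(queries: list[dict], documents: list[str]) -> list[dict]:
    """Keep only queries whose head and answer entities never co-appear."""
    return [q for q in queries if not cooccur(q["head"], q["answer"], documents)]


# Toy corpus and queries (hypothetical data for illustration only).
corpus = [
    "Scarlett Johansson attended a premiere in the United States.",
    "The 1984 Summer Olympics were hosted in the United States.",
    "Ada Example was born in 1984.",
]
queries = [
    {"head": "Scarlett Johansson", "answer": "United States"},
    {"head": "Ada Example", "answer": "United States"},
]

kept = shortcut_free(queries, corpus)
print([q["head"] for q in kept])
```

Only the second query survives here, because its head entity never shares a document with the answer entity, so answering it would require composing the two facts.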

[NLP-1] DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

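[Quick read]: This paper addresses key challenges in storytelling video generation (SVG): complex, fine-grained motion; consistency of multiple objects across scenes; and seamless transitions between multiple motions within a single scene. The proposed method, DreamRunner, works in three steps. First, a large language model (LLM) structures the input script for coarse-grained scene planning and fine-grained object-level layout and motion planning. Second, retrieval-augmented test-time adaptation captures target motion priors for the objects in each scene, supporting diverse motion customization based on retrieved videos and thereby the generation of new videos with complex, scripted motions. Third, a novel spatial-temporal region-based 3D attention and prior injection module (SR3AI) provides fine-grained object-motion binding and frame-by-frame semantic control. DreamRunner shows state-of-the-art performance in character consistency, text alignment, and smooth transitions, and significantly outperforms baselines on compositional text-to-video generation.

Link: https://arxiv.org/abs/2411.16657
Authors: Zun Wang,Jialu Li,Han Lin,Jaehong Yoon,Mohit Bansal
Keywords: Storytelling video generation, Storytelling video, create long, recently emerged, task to create
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project website: this https URL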

Abstract:Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner’s robust ability to generate multi-object interactions with qualitative examples.

[NLP-2] Self-Generated Critiques Boost Reward Modeling for Language Models

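[Quick read]: This paper addresses the fact that current reward models used in reinforcement learning from human feedback (RLHF) mainly produce scalar scores and struggle to incorporate natural-language critiques. It proposes Critic-RM, a framework that improves reward models with self-generated critiques and no extra supervision. Critic-RM uses a two-stage process: it first generates and filters high-quality critiques, then fine-tunes jointly on reward prediction and critique generation. Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% over standard reward models and LLM judges, and that the generated critiques help rectify flawed reasoning steps, yielding 2.5%-3.2% gains in reasoning accuracy.

Link: https://arxiv.org/abs/2411.16646
Authors: Yue Yu,Zhengxing Chen,Aston Zhang,Liang Tan,Chenguang Zhu,Richard Yuanzhe Pang,Yundi Qian,Xuewei Wang,Suchin Gururangan,Chao Zhang,Melanie Kambadur,Dhruv Mahajan,Rui Hou
Keywords: aligning large language, human preferences, human feedback, large language models, Reward
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages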

Abstract:Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of generated critiques in rectifying flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.

[NLP-3] Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective

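[Quick read]: This paper addresses “jailbreak prompts” in generative AI: prompts crafted to bypass the ethical safeguards of large language models, which cybercriminals could exploit. It analyzes jailbreak prompts from a cyber defense perspective and proposes strategies to strengthen AI resilience, including advanced prompt analysis, dynamic safety protocols, and continuous model fine-tuning. It also calls for collaboration among AI researchers, cybersecurity experts, and policymakers to set standards for protecting AI systems, and uses case studies to illustrate these defenses in support of responsible AI practice, system integrity, and public trust.

Link: https://arxiv.org/abs/2411.16642
Authors: Jean Marie Tshimula,Xavier Ndona,D’Jeff K. Nkashama,Pierre-Martin Tardif,Froduald Kabanza,Marc Frappier,Shengrui Wang
Keywords: potentially enabling misuse, bypass ethical safeguards, Jailbreak prompts pose, large language models, potentially enabling
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: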

Abstract:Jailbreak prompts pose a significant threat in AI and cybersecurity, as they are crafted to bypass ethical safeguards in large language models, potentially enabling misuse by cybercriminals. This paper analyzes jailbreak prompts from a cyber defense perspective, exploring techniques like prompt injection and context manipulation that allow harmful content generation, content filter evasion, and sensitive information extraction. We assess the impact of successful jailbreaks, from misinformation and automated social engineering to hazardous content creation, including bioweapons and explosives. To address these threats, we propose strategies involving advanced prompt analysis, dynamic safety protocols, and continuous model fine-tuning to strengthen AI resilience. Additionally, we highlight the need for collaboration among AI researchers, cybersecurity experts, and policymakers to set standards for protecting AI systems. Through case studies, we illustrate these cyber defense approaches, promoting responsible AI practices to maintain system integrity and public trust. Warning: This paper contains content which the reader may find offensive.

[NLP-4] Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

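[Quick read]: This paper examines whether automatic factuality metrics are valid. Modern LLMs produce highly readable summaries but can still introduce content that is inconsistent with the source, i.e., hallucinations. Traditional summary-quality metrics such as ROUGE have saturated and struggle to detect these subtle errors. The authors stress-test existing factuality metrics and find that a supervised model using only shallow features is competitive with state-of-the-art metrics. They also find that many metrics respond little to factual corrections yet are more sensitive to benign, non-factual edits, and that scores can be inflated by appending innocuous sentences. These findings call into question how reliable existing metrics are and what they actually measure.

Link: https://arxiv.org/abs/2411.16638
Authors: Sanjana Ramprasad,Byron C. Wallace
Keywords: produce highly readable, highly readable abstractive, Modern LLMs, readable abstractive summaries, evaluating summary quality
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: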

Abstract:Modern LLMs can now produce highly readable abstractive summaries, to the point where traditional automated metrics for evaluating summary quality, such as ROUGE, have become saturated. However, LLMs still sometimes introduce unwanted content into summaries, i.e., information inconsistent with or unsupported by their source. Measuring the occurrence of these often subtle “hallucinations” automatically has proved to be challenging. This in turn has motivated development of a variety of metrics intended to measure the factual consistency of generated summaries against their source. But are these approaches measuring what they purport to do? In this work, we stress-test automatic factuality metrics. Specifically, we investigate whether and to what degree superficial attributes of summary texts suffice to predict “factuality”, finding that a (supervised) model using only such shallow features is reasonably competitive with SOTA factuality scoring methods. We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements. In contrast, some metrics are more sensitive to benign, non-factual edits. Motivated by these insights, we show that one can “game” (most) automatic factuality metrics, i.e., reliably inflate “factuality” scores by appending innocuous sentences to generated summaries. Taken together, our results raise questions about the degree to which we should rely on existing automated factuality metrics and what exactly we want “factuality metrics” to measure.

[NLP-5] StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training

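[Quick read]: This paper starts from the observation that the computational requirements of Transformer-based language models (LMs) grow rapidly with input length (exponentially, in the abstract's wording), which confines them to short passages. Selective attention mechanisms, notably local and global attention, were introduced to address this, and the paper empirically evaluates the practical effect of global attention on BERT pre-training. The authors build a structure-aware text corpus from arXiv data together with a text-only counterpart, pre-train on both, and analyze how attention patterns shift and how downstream tasks are affected. The results underline the value of incorporating document structure into language models, which then do better on more abstract tasks such as document understanding.

Link: https://arxiv.org/abs/2411.16618
Authors: Kaustubh Ponkshe,Venkatapathy Subramanian,Natwar Modani,Ganesh Ramakrishnan
Keywords: techniques for Language, Language Models, ubiquitous attention mechanism, today rely, rely on transformer-based
Categories: Computation and Language (cs.CL)
Comments: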

Abstract:Most state-of-the-art techniques for Language Models (LMs) today rely on transformer-based architectures and their ubiquitous attention mechanism. However, the exponential growth in computational requirements with longer input sequences confines Transformers to handling short passages. Recent efforts have aimed to address this limitation by introducing selective attention mechanisms, notably local and global attention. While sparse attention mechanisms, akin to full attention in being Turing-complete, have been theoretically established, their practical impact on pre-training remains unexplored. This study focuses on empirically assessing the influence of global attention on BERT pre-training. The primary steps involve creating an extensive corpus of structure-aware text through arXiv data, alongside a text-only counterpart. We carry out pre-training on these two datasets, investigate shifts in attention patterns, and assess their implications for downstream tasks. Our analysis underscores the significance of incorporating document structure into LM models, demonstrating their capacity to excel in more abstract tasks, such as document understanding.
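
To make “local and global attention” concrete, here is a minimal sketch of how a sparse attention mask of that kind is commonly built. It is not taken from the paper: the window size, the function name `build_attention_mask`, and the choice of position 0 as the global token (standing in for a structure token such as a section header) are all assumptions for illustration.

```python
def build_attention_mask(seq_len: int, global_positions: set[int], window: int) -> list[list[bool]]:
    """Return mask where mask[i][j] is True when token i may attend to token j."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            # Local attention: each token sees neighbours within the window.
            local = abs(i - j) <= window
            # Global attention: global tokens see, and are seen by, every token.
            is_global = i in global_positions or j in global_positions
            mask[i][j] = local or is_global
    return mask


# Assumed setup: 6 tokens, token 0 is a structure token with global attention.
mask = build_attention_mask(seq_len=6, global_positions={0}, window=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With one global token and a window of 1, the number of allowed pairs grows linearly with sequence length instead of quadratically, which is the practical appeal of such masks.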

[NLP-6] Recent Trends in Linear Text Segmentation: a Survey

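[Quick read]: This survey covers linear text segmentation, the task of automatically marking the points in a text where the topic changes. It draws on natural language processing research and on concepts from linguistics and computational linguistics to describe the state of the art in resources and approaches. Beyond summarizing recent advances, it points out the limitations of existing resources and of the task itself, and proposes ways forward based on the latest literature and under-explored research directions.

Link: https://arxiv.org/abs/2411.16613
Authors: Iacopo Ghinassi,Lin Wang,Chris Newell,Matthew Purver
Keywords: Linear Text Segmentation, tagging text documents, automatically tagging text, Natural Language Processing, Text Segmentation
Categories: Computation and Language (cs.CL)
Comments: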

Abstract:Linear Text Segmentation is the task of automatically tagging text documents with topic shifts, i.e. the places in the text where the topics change. A well-established area of research in Natural Language Processing, drawing from well-understood concepts in linguistic and computational linguistic research, the field has recently seen a lot of interest as a result of the surge of text, video, and audio available on the web, which in turn require ways of summarising and categorizing the mole of content for which linear text segmentation is a fundamental step. In this survey, we provide an extensive overview of current advances in linear text segmentation, describing the state of the art in terms of resources and approaches for the task. Finally, we highlight the limitations of available resources and of the task itself, while indicating ways forward based on the most recent literature and under-explored research directions.
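
As a rough illustration of the task, the sketch below places a topic boundary wherever the bag-of-words cosine similarity between adjacent sentences falls below a threshold. This is a simple lexical-cohesion baseline in the spirit of TextTiling, not a method taken from the survey; the threshold value and the example sentences are assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment(sentences: list[str], threshold: float = 0.1) -> list[int]:
    """Return each index i such that a topic boundary falls before sentence i."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [i for i in range(1, len(bags)) if cosine(bags[i - 1], bags[i]) < threshold]


# Hypothetical four-sentence document with one topic shift.
sentences = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock markets fell sharply today",
    "markets fell after weak earnings",
]
boundaries = segment(sentences)
print(boundaries)
```

The only boundary found is before the third sentence, where the vocabulary changes completely.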

[NLP-7] From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

[Quick read]: This paper addresses the shortcomings of traditional evaluation methods in artificial intelligence (AI) and natural language processing (NLP), which often fail to judge subtle attributes or deliver satisfactory results. It surveys the emerging “LLM-as-a-judge” paradigm, in which LLMs perform scoring, ranking, or selection. After giving detailed definitions from both input and output perspectives, it proposes a taxonomy along three dimensions: what to judge, how to judge, and where to judge. It also compiles benchmarks for evaluating LLM-as-a-judge and highlights key challenges and future research directions.

Link: https://arxiv.org/abs/2411.16594
Authors: Dawei Li,Bohan Jiang,Liangjie Huang,Alimohammad Beigi,Chengshuai Zhao,Zhen Tan,Amrita Bhattacharjee,Yuxuan Jiang,Canyu Chen,Tianhao Wu,Kai Shu,Lu Cheng,Huan Liu
Keywords: natural language processing, Large Language Models, artificial intelligence, evaluation have long, long been critical
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 32 pages, 5 figures

Abstract:Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the “LLM-as-a-judge” paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this promising research area. Paper list and more resources about LLM-as-a-judge can be found at this https URL and this https URL.

[NLP-8] Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

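[Quick read]: This paper studies how to make LLMs better at complex reasoning tasks in science, coding, and mathematics by having them spend more time thinking and reflecting. It adopts a two-player paradigm in which a reasoning (actor) model and a critique model play separate roles: the critique model gives step-level feedback that supervises the actor at both test time and training time. The authors propose AutoMathCritique, a framework that automatically collects critique data; language models fine-tuned on it can generate natural-language feedback on mathematical reasoning. Experiments show that critique supervision improves the actor on difficult queries, especially as inference-time computation is scaled up. A critique-in-the-loop self-training method further improves the actor's exploration efficiency and solution diversity.

Link: https://arxiv.org/abs/2411.16579
Authors: Zhiheng Xi,Dingwen Yang,Jixuan Huang,Jiafu Tang,Guanyu Li,Yiwen Ding,Wei He,Boyang Hong,Shihan Do,Wenyu Zhan,Xiao Wang,Rui Zheng,Tao Ji,Xiaowei Shi,Yitao Zhai,Rongxiang Weng,Jingang Wang,Xunliang Cai,Tao Gui,Zuxuan Wu,Qi Zhang,Xipeng Qiu,Xuanjing Huang,Yu-Gang Jiang
Keywords: effectively solving complex, complex reasoning tasks, solving complex reasoning, large language models, Training large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint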

Abstract:Training large language models (LLMs) to spend more time thinking and reflection before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model’s capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test-time and train-time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of 76,321 responses paired with step-level feedback. Fine-tuning language models with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor’s performance on difficult queries at test-time, especially when scaling up inference-time computation. Motivated by these findings, we introduce the critique-based supervision to the actor’s self-training process, and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor’s exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take the preliminary step to explore training self-talk reasoning models via critique supervision and showcase its potential. Our code and datasets are at this https URL.
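
The test-time side of the two-player setup can be sketched as a simple refinement loop. This is only an illustration under stated assumptions: `actor` and `critic` are hypothetical callables standing in for the two models, the critic signals “no flawed step” by returning `None`, and the toy functions below replace real model calls.

```python
from typing import Callable, Optional


def refine_with_critique(
    question: str,
    actor: Callable[[str, Optional[str]], str],
    critic: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Let the actor answer, then revise while the critic reports a flawed step."""
    response = actor(question, None)
    for _ in range(max_rounds):
        feedback = critic(question, response)
        if feedback is None:  # critic found nothing to fix
            break
        response = actor(question, feedback)
    return response


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_actor(question: str, feedback: Optional[str]) -> str:
    return "2 + 2 = 4" if feedback else "2 + 2 = 5"


def toy_critic(question: str, response: str) -> Optional[str]:
    return "Step 1: the sum is wrong." if response.endswith("5") else None


answer = refine_with_critique("What is 2 + 2?", toy_actor, toy_critic)
print(answer)
```

Raising `max_rounds` is one simple way to spend more inference-time computation on a hard query.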

[NLP-9] EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code

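[Quick read]: This paper addresses the difficulty existing automated vulnerability detection methods have with the complexity and diversity of modern codebases. It introduces EnStack, an ensemble stacking framework that applies natural language processing (NLP) techniques to vulnerability detection. EnStack combines several pre-trained large language models (LLMs): CodeBERT for semantic analysis, GraphCodeBERT for structural representation, and UniXcoder for cross-modal capabilities. Meta-classifiers such as Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost integrate their outputs. The ensemble captures intricate code patterns and vulnerabilities that individual models may overlook, and is reported to significantly outperform existing methods.

Link: https://arxiv.org/abs/2411.16561
Authors: Shahriyar Zaman Ridoy,Md. Shazzad Hossain Shaon,Alfredo Cuzzocrea,Mst Shapna Akter
Keywords: enhancing security, modern codebases, critical for enhancing, complexity and diversity, diversity of modern
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Accepted in 2024 IEEE International Conference on Big Data (IEEE BigData 2024)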

Abstract:Automated detection of software vulnerabilities is critical for enhancing security, yet existing methods often struggle with the complexity and diversity of modern codebases. In this paper, we introduce EnStack, a novel ensemble stacking framework that enhances vulnerability detection using natural language processing (NLP) techniques. Our approach synergizes multiple pre-trained large language models (LLMs) specialized in code understanding: CodeBERT for semantic analysis, GraphCodeBERT for structural representation, and UniXcoder for cross-modal capabilities. By fine-tuning these models on the Draper VDISC dataset and integrating their outputs through meta-classifiers such as Logistic Regression, Support Vector Machines (SVM), Random Forest, and XGBoost, EnStack effectively captures intricate code patterns and vulnerabilities that individual models may overlook. The meta-classifiers consolidate the strengths of each LLM, resulting in a comprehensive model that excels in detecting subtle and complex vulnerabilities across diverse programming contexts. Experimental results demonstrate that EnStack significantly outperforms existing methods, achieving notable improvements in accuracy, precision, recall, and F1-score. This work highlights the potential of ensemble LLM approaches in code analysis tasks and offers valuable insights into applying NLP techniques for advancing automated vulnerability detection.
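
The stacking idea can be sketched in a few lines: each base model emits a vulnerability probability for a code sample, and a meta-classifier is trained on those probabilities. The sketch below is an assumption-laden toy, not the authors' pipeline: the three feature columns merely stand in for the outputs of the fine-tuned base models, the numbers are made up, and the logistic-regression meta-classifier is trained with plain-Python stochastic gradient descent rather than a library implementation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_classifier(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression on base-model probabilities with per-sample SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x) -> int:
    """Return 1 (vulnerable) when the stacked probability is at least 0.5."""
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)


# Hypothetical base-model probabilities: one row per code sample,
# one column per base model; label 1 means vulnerable.
features = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.6, 0.9, 0.8], [0.3, 0.2, 0.1]]
labels = [1, 0, 1, 0]

weights, bias = fit_meta_classifier(features, labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

In practice the meta-classifier would be fit on held-out base-model predictions so that it does not simply learn the base models' training-set overconfidence.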

[NLP-10] RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

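[Quick read]: This paper addresses the weak spatial understanding of robots that rely on vision-language models: because the training data lacks sophisticated spatial scene understanding, these models perform poorly on spatial reasoning in practice. The authors introduce RoboSpatial, a large-scale spatial understanding dataset of real indoor and tabletop scenes captured as 3D scans and egocentric 2D images and annotated with rich, robotics-relevant spatial information. It contains 1M images, 5K 3D scans, and 3M annotated spatial relationships, and pairs 2D images with 3D scans so that it suits both 2D and 3D tasks. Models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.

Link: https://arxiv.org/abs/2411.16537
Authors: Chan Hee Song,Valts Blukis,Jonathan Tremblay,Stephen Tyree,Yu Su,Stan Birchfield
Keywords: grounded decisions based, make grounded decisions, crucial capability, grounded decisions, decisions based
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: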

Abstract:Spatial understanding is a crucial capability for robots to make grounded decisions based on their environment. This foundational skill enables robots not only to perceive their surroundings but also to reason about and interact meaningfully within the world. In modern robotics, these capabilities are taken on by visual language models, and they face significant challenges when applied to spatial reasoning context due to their training data sources. These sources utilize general-purpose image datasets, and they often lack sophisticated spatial scene understanding capabilities. For example, the datasets do not address reference frame comprehension - spatial relationships require clear contextual understanding, whether from an ego-centric, object-centric, or world-centric perspective, which allow for effective real-world interaction. To address this issue, we introduce RoboSpatial, a large-scale spatial understanding dataset consisting of real indoor and tabletop scenes captured as 3D scans and egocentric images, annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M annotated spatial relationships, with paired 2D egocentric images and 3D scans to make it both 2D and 3D ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robotics manipulation.

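[NLP-11] Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

[Quick read]: This paper addresses bias in large language models (LLMs) and how to communicate its risks effectively so as to encourage mitigation. It proposes bias profiles along stereotype dimensions drawn from dictionaries used in social psychology research. Along these dimensions it analyzes gender bias in contextual embeddings across contexts and layers, and generates stereotype profiles for twelve LLMs, giving an intuitive way to expose and visualize bias for all audiences of AI.

Link: https://arxiv.org/abs/2411.16527
Authors: Carolin M. Schuster,Maria-Alexandra Dinisor,Shashwat Ghatiwala,Georg Groh
Keywords: Large language models, Large language, artificial intelligence, unavoidably biased, current successes
Categories: Computation and Language (cs.CL)
Comments: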

Abstract:Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI), however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuition and use case for exposing and visualizing bias.

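[NLP-12] Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

[Quick read]: This paper studies the statistical and computational limits of prompt tuning for Transformer-based foundation models, focusing on single-head Transformers with a single self-attention layer. It shows that prompt tuning on this simplest architecture is universal and admits efficient (even almost-linear time) algorithms, with the efficiency results stated under the Strong Exponential Time Hypothesis (SETH). Statistically, prompt tuning on such Transformers is a universal approximator of sequence-to-sequence Lipschitz functions, and the paper gives an exponential lower bound on the number of soft-prompt tokens needed to memorize any dataset with a 1-layer, 1-head Transformer. Computationally, it identifies a phase transition in efficiency governed by the norm of the soft-prompt-induced keys and queries and provides an upper-bound criterion: beyond it, no sub-quadratic algorithm for prompt tuning exists under SETH; within it, almost-linear time prompt tuning inference algorithms exist. These limits give necessary conditions for designing expressive and efficient prompt tuning methods.

Link: https://arxiv.org/abs/2411.16525
Authors: Jerry Yao-Chieh Hu,Wei-Po Wang,Ammar Gilani,Chenyang Li,Zhao Song,Han Liu
Keywords: transformer-based foundation models, prompt tuning, Exponential Time Hypothesis, Strong Exponential Time, prompt
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: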

Abstract:We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are prompt tuning on single-head transformers with only a single self-attention layer: (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers are universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-$dL$ and -in-$(1/\epsilon)$ lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the soft-prompt-induced keys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.

[NLP-13] LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation

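[Quick read]: This paper challenges the image-captioning paradigm in which deep learning models depend on high-dimensional latent feature vectors that require model fine-tuning. It proposes Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach that uses image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). Simple linear classifiers turn extracted image embeddings into radiology-specific labels, which are combined with standard RAG so that general-domain LLMs can generate radiology reports. Without training the generative language model or the image feature encoders, and without ever showing the LLM an X-ray directly, LaB-RAG beats other retrieval-based radiology report generation methods on natural language and radiology language metrics and is competitive with fine-tuned vision-language models.

Link: https://arxiv.org/abs/2411.16523
Authors: Steven Song,Anirudh Subramanyam,Irene Madejski,Robert L. Grossman
Keywords: Retrieval Augmented Generation, deep learning models, Boosted Retrieval Augmented, deep learning, Augmented Generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: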

Abstract:In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors which require model fine tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician’s report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text-space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly “showing” the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods, while attaining competitive results compared to other fine-tuned vision-language RRG models. We further present results of our experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data-leakage.
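
The label-boosting step can be sketched as follows: linear classifiers map an image embedding to text labels, and those labels, together with retrieved example reports, are placed in a text prompt for a general-domain LLM. Everything concrete here is an assumption made for illustration: the two-dimensional embedding, the classifier weights, the label names, the example report, and the prompt wording are hypothetical, and the retrieval step itself is omitted.

```python
def predict_labels(embedding, classifiers, threshold=0.0):
    """Apply one linear classifier per label; keep labels whose score w.x + b exceeds the threshold."""
    labels = []
    for name, (weights, bias) in classifiers.items():
        score = sum(w * x for w, x in zip(weights, embedding)) + bias
        if score > threshold:
            labels.append(name)
    return labels


def build_prompt(labels, retrieved):
    """Compose a text-only prompt from predicted labels and retrieved (labels, report) pairs."""
    lines = [
        "Write a radiology report for an image with these predicted labels.",
        "Labels: " + (", ".join(labels) if labels else "none"),
    ]
    for i, (example_labels, report) in enumerate(retrieved, start=1):
        lines.append(f"Example {i} (labels: {', '.join(example_labels)}): {report}")
    return "\n".join(lines)


# Hypothetical classifiers over a toy 2-dimensional image embedding.
classifiers = {
    "cardiomegaly": ([1.0, 0.0], -0.5),
    "edema": ([0.0, 1.0], -0.5),
}
embedding = [0.9, 0.1]
labels = predict_labels(embedding, classifiers)

# Hypothetical retrieved example; a real system would fetch similar cases.
retrieved = [(["cardiomegaly"], "Enlarged cardiac silhouette.")]
prompt = build_prompt(labels, retrieved)
print(prompt)
```

The LLM only ever sees text, which is why neither the language model nor the image encoder needs to be trained in this setup.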

[NLP-14] All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

[Quick read]: This paper addresses the limited ability of existing large multimodal models (LMMs) to understand multilingual, culturally diverse content. It proposes the All Languages Matter Benchmark (ALM-bench), described as the largest and most comprehensive effort to date for evaluating LMMs across 100 languages, with particular attention to low-resource languages and cultural diversity. ALM-bench uses several question formats (true/false, multiple choice, and open-ended), further divided into short- and long-answer categories, to assess visual and linguistic reasoning at varied levels of difficulty. It also covers 13 distinct cultural aspects, with the aim of encouraging more inclusive and broadly applicable multimodal models.

Link: https://arxiv.org/abs/2411.16508
Authors: Ashmal Vayani,Dinura Dissanayake,Hasindri Watawana,Noor Ahsan,Nevasini Sasikumar,Omkar Thawakar,Henok Biadglign Ademtew,Yahya Hmaiti,Amandeep Kumar,Kartik Kuckreja,Mykola Maslych,Wafa Al Ghallabi,Mihail Mihaylov,Chao Qin,Abdelrahman M Shaker,Mike Zhang,Mahardika Krisna Ihsani,Amiel Esplana,Monil Gokani,Shachar Mirkin,Harsh Singh,Ashay Srivastava,Endre Hamerlik,Fathinah Asma Izzati,Fadillah Adamsyah Maani,Sebastian Cavada,Jenny Chim,Rohit Gupta,Sanjay Manjunath,Kamila Zhumakhanova,Feno Heriniaina Rabevohitra,Azril Amirudin,Muhammad Ridzuan,Daniya Kareem,Ketan More,Kunyang Li,Pramesh Shakya,Muhammad Saad,Amirpouya Ghasemaghaei,Amirbek Djanibekov,Dilshod Azizov,Branislava Jankovic,Naman Bhatia,Alvaro Cabrera,Johan Obando-Ceron,Olympiah Otieno,Fabian Farestam,Muztoba Rabbani,Sanoojan Baliah,Santosh Sanjeev,Abduragim Shtanchaev,Maheen Fatima,Thao Nguyen,Amrin Kareem,Toluwani Aremu,Nathan Xavier,Amit Bhatkal,Hawau Toyin,Aman Chadha,Hisham Cholakkal,Rao Muhammad Anwer,Michael Felsberg,Jorma Laaksonen,Thamar Solorio,Monojit Choudhury,Ivan Laptev,Mubarak Shah,Salman Khan,Fahad Khan
Keywords: Existing Large Multimodal, Large Multimodal Models, Large Multimodal, Existing Large, generally focus
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: A Multilingual Multimodal cultural benchmark for 100 languages

Abstract:Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model’s ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.

[NLP-15] AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning

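Quick read: This paper addresses the difficulty large language models (LLMs) have with knowledge-intensive complex question answering, which stems from weak reasoning planning and from hallucination. The key to the solution is AtomR, a heterogeneous knowledge reasoning framework that performs multi-source reasoning at the atomic level. Drawing on the graph modeling of knowledge, AtomR uses LLMs to decompose a complex question into a combination of three atomic knowledge operators, which strengthens both the planning and the execution stages of reasoning. The paper also introduces BlendQA, an evaluation benchmark designed for complex heterogeneous knowledge reasoning. Experiments show that AtomR significantly outperforms state-of-the-art baselines on three single-source and two multi-source reasoning benchmarks.

Link: https://arxiv.org/abs/2411.16495
Authors: Amy Xin,Jinxin Liu,Zijun Yao,Zhicheng Li,Shulin Cao,Lei Hou,Juanzi Li
Keywords: question answering due, Recent advancements, language processing tasks, perform knowledge-intensive complex, natural language processing
Categories: Computation and Language (cs.CL)
Comments: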

Abstract:Recent advancements in large language models (LLMs) have led to significant improvements in various natural language processing tasks, but it is still challenging for LLMs to perform knowledge-intensive complex question answering due to LLMs’ inefficacy in reasoning planning and the hallucination problem. A typical solution is to employ retrieval-augmented generation (RAG) coupled with chain-of-thought (CoT) reasoning, which decomposes complex questions into chain-like sub-questions and applies iterative RAG at each sub-question. However, prior works exhibit sub-optimal reasoning planning and overlook dynamic knowledge retrieval from heterogeneous sources. In this paper, we propose AtomR, a novel heterogeneous knowledge reasoning framework that conducts multi-source reasoning at the atomic level. Drawing inspiration from the graph modeling of knowledge, AtomR leverages large language models (LLMs) to decompose complex questions into combinations of three atomic knowledge operators, significantly enhancing the reasoning process at both the planning and execution stages. We also introduce BlendQA, a novel evaluation benchmark tailored to assess complex heterogeneous knowledge reasoning. Experiments show that AtomR significantly outperforms state-of-the-art baselines across three single-source and two multi-source reasoning benchmarks, with notable performance gains of 9.4% on 2WikiMultihop and 9.5% on BlendQA.

[NLP-16] O1 Replication Journey – Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?

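Quick read: This paper critically examines current attempts to replicate the capabilities of OpenAI's O1 model, focusing on the widespread but often undisclosed use of knowledge distillation techniques. Its key finding is that simple distillation from O1's API, combined with supervised fine-tuning, can outperform O1-preview on complex mathematical reasoning. Specifically, a base model fine-tuned on tens of thousands of O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. The study also examines how O1-distilled models generalize to other tasks (hallucination, safety, and open-domain QA): even when trained only on mathematical problem-solving data, the models generalize well to open-ended QA and become significantly less susceptible to sycophancy after fine-tuning. The authors publish these findings to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field.

Link: https://arxiv.org/abs/2411.16489
Authors: Zhen Huang,Haoyang Zou,Xuefeng Li,Yixiu Liu,Yuxiang Zheng,Ethan Chern,Shijie Xia,Yiwei Qin,Weizhe Yuan,Pengfei Liu
Keywords: knowledge distillation techniques, Invitational Mathematics Examination, American Invitational Mathematics, replicating OpenAI, paper presents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages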

Abstract:This paper presents a critical examination of current approaches to replicating OpenAI’s O1 model capabilities, with particular focus on the widespread but often undisclosed use of knowledge distillation techniques. While our previous work explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1’s API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on simply tens of thousands of samples O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Moreover, our investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks: hallucination, safety and open-domain QA. Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning. We deliberately make this finding public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field. Our work includes: (1) A detailed technical exposition of the distillation process and its effectiveness, (2) A comprehensive benchmark framework for evaluating and categorizing O1 replication attempts based on their technical transparency and reproducibility, (3) A critical discussion of the limitations and potential risks of over-relying on distillation approaches, our analysis culminates in a crucial bitter lesson: while the pursuit of more capable AI systems is important, the development of researchers grounded in first-principles thinking is paramount.

[NLP-17] When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets? EMNLP2024 CONLL

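Quick read: This paper addresses data-efficient language model pretraining, as a submission to the BabyLM challenge. The method builds on deep mutual learning and adds a student model search for diverse initialization. Weighted mutual learning is formulated as a bi-level optimization problem: the inner loop learns compact students through online distillation, while the outer loop optimizes the weights for better knowledge distillation from diverse students. This dynamic weighting strategy removes the need for a teacher model and so reduces computational requirements. The evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.

Link: https://arxiv.org/abs/2411.16487
Authors: Srikrishna Iyer
Keywords: language model pretraining, data-efficient language model, BabyLM challenge, aiming to push, present our submission
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to BabyLM challenge, CoNLL Workshop, EMNLP 2024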

Abstract:We present our submission to the BabyLM challenge, aiming to push the boundaries of data-efficient language model pretraining. Our method builds upon deep mutual learning, introducing a student model search for diverse initialization. We address the limitation of treating students equally by formulating weighted mutual learning as a bi-level optimization problem. The inner loop learns compact students through online distillation, while the outer loop optimizes weights for better knowledge distillation from diverse students. This dynamic weighting strategy eliminates the need for a teacher model, reducing computational requirements. Our evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.
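
The inner-loop idea of weighted mutual learning can be sketched in plain Python. This is a minimal illustration, not the submission's implementation: the function names, the leave-one-out weighted peer average, and the use of KL divergence as the distillation loss are assumptions made here for clarity, and the outer loop that optimizes the weights is omitted.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def weighted_peer_target(student_probs, weights, exclude):
    """Weighted average of the peers' distributions, leaving out one student."""
    peers = [(w, p) for i, (w, p) in enumerate(zip(weights, student_probs))
             if i != exclude]
    norm = sum(w for w, _ in peers)
    n_classes = len(student_probs[0])
    return [sum(w * p[c] for w, p in peers) / norm for c in range(n_classes)]


def mutual_distillation_losses(student_logits, weights):
    """One distillation loss per student: KL(weighted peer target || student)."""
    probs = [softmax(logits) for logits in student_logits]
    return [kl_divergence(weighted_peer_target(probs, weights, i), probs[i])
            for i in range(len(probs))]
```

In a bi-level setup, an outer loop would adjust `weights` (hypothetical here) based on how well the students perform after the inner updates; a student that already agrees with its weighted peers receives a loss near zero.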

[NLP-18] Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval

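Quick read: This paper addresses the difficulty large language models (LLMs) have with complicated reasoning tasks such as math word problems (MWPs). The key idea is to use analogy from similarly structured questions: problems whose computational graphs are similar to that of the given question are retrieved and placed in the prompt as exemplars, giving the generation model a correct reasoning path to refer to. Across six math word problem datasets, the method improves over baseline methods by up to 6.7 percent on average in absolute value.

Link: https://arxiv.org/abs/2411.16454
Authors: Xiaocong Yang,Jiacheng Lin,Ziqi Wang,Chengxiang Zhai
Keywords: Large language models, Large language, complicated reasoning tasks, struggle with complicated, math word problems
Categories: Computation and Language (cs.CL)
Comments: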

Abstract:Large language models (LLMs) are known to struggle with complicated reasoning tasks such as math word problems (MWPs). In this paper, we present how analogy from similarly structured questions can improve LLMs’ problem-solving capabilities for MWPs. Specifically, we rely on the retrieval of problems with similar computational graphs to the given question to serve as exemplars in the prompt, providing the correct reasoning path for the generation model to refer to. Empirical results across six math word problem datasets demonstrate the effectiveness of our proposed method, which achieves a significant improvement of up to 6.7 percent on average in absolute value, compared to baseline methods. These results highlight our method’s potential in addressing the reasoning challenges in current LLMs.
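
To make "similar computational graphs" concrete, the following sketch ranks candidate exemplars by the structural similarity of their solution expressions. It is an illustration under stated assumptions, not the paper's retriever: the graph of the query is assumed to be known (in practice it would have to be predicted from the question text), the `skeleton` serialization and the `difflib`-based similarity are choices made here, and the `pool` record format is hypothetical.

```python
import ast
from difflib import SequenceMatcher

_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def skeleton(expr):
    """Serialize an arithmetic expression tree with every number replaced by N."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return "(" + walk(node.left) + _OPS[type(node.op)] + walk(node.right) + ")"
        if isinstance(node, ast.Constant):
            return "N"
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)


def retrieve(query_expr, pool, k=2):
    """Return the k pool items whose expression structure best matches the query."""
    q = skeleton(query_expr)

    def similarity(item):
        return SequenceMatcher(None, q, skeleton(item["expr"])).ratio()

    return sorted(pool, key=similarity, reverse=True)[:k]
```

The retrieved items would then be formatted as worked examples in the few-shot prompt. Because numbers are abstracted away, "(3 + 5) * 2" and "(10 + 4) * 7" count as structurally identical.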

[NLP-19] Finding Structure in Language Models

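Quick read: The central question of this thesis is whether large language models possess a deep understanding of grammatical structure similar to that of humans. To address it, the thesis develops new interpretability techniques for large-scale language models and approaches the question from three directions: exploring the presence of abstract linguistic information through structural priming; examining linguistic phenomena such as adjective order and negative polarity items and connecting a model's comprehension of them to its training data distribution; and introducing a controlled testbed for studying hierarchical structure in language models, using synthetic languages of increasing complexity and examining the role of feature interactions in modelling this structure. Together these give a detailed account of the grammatical knowledge embedded in language model representations and suggest directions for investigating fundamental linguistic questions with computational methods.

Link: https://arxiv.org/abs/2411.16433
Authors: Jaap Jumelet
Keywords: continuously make predictions, make predictions based, write or listen, continuously make, make predictions
Categories: Computation and Language (cs.CL)
Comments: PhD Thesis at ILLC, University of Amsterdam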

Abstract:When we speak, write or listen, we continuously make predictions based on our knowledge of a language’s grammar. Remarkably, children acquire this grammatical knowledge within just a few years, enabling them to understand and generalise to novel constructions that have never been uttered before. Language models are powerful tools that create representations of language by incrementally predicting the next word in a sentence, and they have had a tremendous societal impact in recent years. The central research question of this thesis is whether these models possess a deep understanding of grammatical structure similar to that of humans. This question lies at the intersection of natural language processing, linguistics, and interpretability. To address it, we will develop novel interpretability techniques that enhance our understanding of the complex nature of large-scale language models. We approach our research question from three directions. First, we explore the presence of abstract linguistic information through structural priming, a key paradigm in psycholinguistics for uncovering grammatical structure in human language processing. Next, we examine various linguistic phenomena, such as adjective order and negative polarity items, and connect a model’s comprehension of these phenomena to the data distribution on which it was trained. Finally, we introduce a controlled testbed for studying hierarchical structure in language models using various synthetic languages of increasing complexity and examine the role of feature interactions in modelling this structure. Our findings offer a detailed account of the grammatical knowledge embedded in language model representations and provide several directions for investigating fundamental linguistic questions using computational methods.

[NLP-20] Adapter-based Approaches to Knowledge-enhanced Language Models – A Survey

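Quick read: This paper surveys knowledge-enhanced language models (KELMs), which bridge the gap between large-scale language models and domain-specific knowledge and aim for higher factual accuracy and fewer hallucinations. KELMs achieve this by leveraging knowledge graphs (KGs), and they are frequently combined with adapter modules, which reduce the computational load and the risk of catastrophic forgetting. Through a systematic literature review (SLR), the paper analyzes existing adapter-based KELM approaches quantitatively and qualitatively, discusses their strengths and potential shortcomings, and gives particular attention to the biomedical domain, where it compares the performance of existing KELMs.

Link: https://arxiv.org/abs/2411.16403
Authors: Alexander Fichtl,Juraj Vladika,Georg Groh
Keywords: Knowledge-enhanced language models, large-scale language models, language models, Knowledge-enhanced language, large-scale language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures. Published at KEOD24 via SciTePress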

Abstract:Knowledge-enhanced language models (KELMs) have emerged as promising tools to bridge the gap between large-scale language models and domain-specific knowledge. KELMs can achieve higher factual accuracy and mitigate hallucinations by leveraging knowledge graphs (KGs). They are frequently combined with adapter modules to reduce the computational load and risk of catastrophic forgetting. In this paper, we conduct a systematic literature review (SLR) on adapter-based approaches to KELMs. We provide a structured overview of existing methodologies in the field through quantitative and qualitative analysis and explore the strengths and potential shortcomings of individual approaches. We show that general knowledge and domain-specific approaches have been frequently explored along with various adapter architectures and downstream tasks. We particularly focused on the popular biomedical domain, where we provided an insightful performance comparison of existing KELMs. We outline the main trends and propose promising future directions.

[NLP-21] Human-Calibrated Automated Testing and Validation of Generative Language Models

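Quick read: This paper addresses the evaluation and validation of generative language models (GLMs) in high-stakes domains such as banking, with a focus on Retrieval-Augmented Generation (RAG) systems. Because GLM outputs are open-ended and quality assessment is subjective, the paper proposes the Human-Calibrated Automated Testing (HCAT) framework. HCAT combines: 1) automated test generation using stratified sampling; 2) embedding-based metrics for explainable assessment of functionality, risk, and safety attributes; and 3) a two-stage calibration approach, using probability calibration and conformal prediction, that aligns machine-generated evaluations with human judgments. The framework also includes robustness testing and targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This multi-layered framework offers a scalable, transparent, and interpretable approach to GLM assessment for applications where accuracy, transparency, and regulatory compliance are paramount.

Link: https://arxiv.org/abs/2411.16391
Authors: Agus Sudjianto,Aijun Zhang,Srinivas Neppalli,Tarun Joshi,Michal Malohlava
Keywords: generative language models, paper introduces, introduces a comprehensive, validation of generative, deployed in high-stakes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: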

Abstract:This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.
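
The conformal prediction step can be illustrated with standard split conformal prediction on absolute residuals. The abstract does not specify HCAT's exact procedure, so this is a generic sketch under assumptions: the function names are hypothetical, and the scores stand for an automated metric and the matching human rating on a held-out calibration set.

```python
import math


def conformal_quantile(residuals, alpha=0.1):
    """Finite-sample (1 - alpha) quantile used by split conformal prediction."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this coverage level.
        return float("inf")
    return sorted(residuals)[rank - 1]


def calibrate(machine_scores, human_scores, alpha=0.1):
    """Half-width q such that |machine - human| <= q with about 1 - alpha coverage."""
    residuals = [abs(m - h) for m, h in zip(machine_scores, human_scores)]
    return conformal_quantile(residuals, alpha)


def predict_interval(machine_score, q):
    """Interval expected to contain the human rating for a new response."""
    return (machine_score - q, machine_score + q)
```

With a calibrated `q`, each new automated score can be reported together with an interval for the human judgment, and wide intervals flag cases that may need manual review.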

[NLP-22] FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

Quick read: This paper addresses the relative lack of pretraining datasets for large language models (LLMs) aimed at Traditional Chinese users. It introduces FineWeb-zhtw, a dataset tailored to Traditional Chinese users and built with multiple stages of carefully designed filters that account for the linguistic differences between English and Traditional Chinese, to ensure comprehensiveness and quality.

Link: https://arxiv.org/abs/2411.16387
Authors: Cheng-Wei Lin,Wan-Hsuan Hsieh,Kai-Xin Guan,Chan-Jan Hsu,Chia-Chen Kuo,Chuan-Lin Lai,Chung-Wei Chung,Ming-Jen Wang,Da-Shan Shiu
Keywords: large language models, Traditional Chinese, Traditional Chinese users, pretraining dataset significantly, dataset significantly influence
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments:

Abstract:The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While there have been numerous efforts in the curation of such a dataset for English users, there is a relative lack of similar initiatives for Traditional Chinese. Building upon this foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We came up with multiple stages of meticulously designed filters to cater to the linguistic difference between English and Traditional Chinese, to ensure comprehensiveness and quality. We determined effectiveness from querying dataset samples with three main objectives. Our code and datasets are publicly available.

[NLP-23] Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines

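Quick read: This paper studies Multi-modal Retrieval Augmented Multi-modal Generation (M^2RAG), a task in which foundation models browse multi-modal web pages containing mixed text and images and generate multi-modal responses that solve user queries. Because the task is at an early research stage and lacks systematic study, the paper constructs a benchmark equipped with a suite of text-modal and multi-modal metrics to analyze the capabilities of existing foundation models. Based on the evaluation results, it proposes several effective methods for foundation models to accomplish the task, and the experiments reveal several phenomena worth further research.

Link: https://arxiv.org/abs/2411.16365
Authors: Zi-Ao Ma,Tian Lan,Rong-Cheng Tu,Yong Hu,Heyan Huang,Xian-Ling Mao
Keywords: Augmented Multi-modal Generation, Multi-modal Retrieval Augmented, Retrieval Augmented Multi-modal, Retrieval Augmented, Multi-modal Generation
Categories: Computation and Language (cs.CL)
Comments: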

Abstract:This paper investigates an intriguing task of Multi-modal Retrieval Augmented Multi-modal Generation (M^2RAG). This task requires foundation models to browse multi-modal web pages, with mixed text and images, and generate multi-modal responses for solving user queries, which exhibits better information density and readability. Given the early researching stage of M^2RAG task, there is a lack of systematic studies and analysis. To fill this gap, we construct a benchmark for M^2RAG task, equipped with a suite of text-modal metrics and multi-modal metrics to analyze the capabilities of existing foundation models. Besides, we also propose several effective methods for foundation models to accomplish this task, based on the comprehensive evaluation results on our benchmark. Extensive experimental results reveal several intriguing phenomena worth further research.

[NLP-24] The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C

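Quick read: This paper studies whether large language models (LLMs) can perform two-hop reasoning without chain-of-thought (CoT). It introduces a controlled setting in which above-chance performance constitutes evidence of latent reasoning. Models can perform latent reasoning when the facts appear together during training or in the prompt; however, when the learned facts appear only in different documents, models fail completely at two-hop reasoning without CoT, reaching only chance-level accuracy and chance-level test loss. The authors call this the Two-Hop Curse. An evaluation of 9 frontier LLMs on real-world facts finds that models fail completely at two-hop no-CoT reasoning for over half of the question categories while remaining partially successful with CoT across most categories. These results suggest that LLMs lack a general capability for latent multi-hop reasoning independent of the question type.

Link: https://arxiv.org/abs/2411.16353
Authors: Mikita Balesni,Tomek Korbak,Owain Evans
Keywords: performer of Imagine, reason internally, struggle when forced, forced to reason, Imagine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: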

Abstract:While LLMs excel at multi-hop questions (e.g. “Who is the spouse of the performer of Imagine?”) when using chain-of-thought reasoning (CoT), they struggle when forced to reason internally (without CoT). Previous work on the size and nature of this gap produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where the above-chance performance constitutes undeniable evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B Instruct and GPT-4o) on fictional facts and confirm that they generalize to answering two-hop questions about them using CoT. We find that models can perform latent reasoning when facts appear together during training or in the prompt. However, to our surprise, models completely fail at two-hop reasoning without CoT when learned facts only appear in different documents, achieving chance-level accuracy and chance-level test loss. We call this complete failure to compose separately learned facts the Two-Hop Curse. Moreover, we evaluate 9 frontier LLMs on real-world facts, finding that models completely fail at two-hop no-CoT reasoning for over half of question categories while maintaining partial success with CoT across most categories. These results suggest that LLMs lack a general capability for latent multi-hop reasoning independent of the question type.

[NLP-25] Preference Optimization for Reasoning with Pseudo Feedback

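Quick read: This paper addresses the limited availability of reasoning datasets with human-verified labels, which constrains preference optimization of large language models (LLMs) for mathematical reasoning and coding. The key idea is to generate pseudo feedback by framing the labeling of solutions as an evaluation against associated test cases. Two forms of test-case-based pseudo feedback are explored: one generated by frontier LLMs and one obtained by extending self-consistency to multiple test cases. Experiments show that preference optimization with this pseudo feedback improves performance on both mathematical reasoning and coding tasks.

Link: https://arxiv.org/abs/2411.16345
Authors: Fangkai Jiao,Geyang Guo,Xingxing Zhang,Nancy F. Chen,Shafiq Joty,Furu Wei
Keywords: Direct Preference Optimization, Preference optimization techniques, Direct Preference, large language models, Preference optimization
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 11 figures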

Abstract:Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reason problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multi-test-case. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.6 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
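
The self-consistency variant of pseudo feedback can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' pipeline: sampled solutions are modeled as plain Python callables, the majority output on each test input serves as the pseudo label, and the function names and the `margin` parameter are assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def pseudo_labels(solutions, test_inputs):
    """For each test input, take the most common output across sampled solutions."""
    labels = []
    for x in test_inputs:
        outputs = [solve(x) for solve in solutions]
        labels.append(Counter(outputs).most_common(1)[0][0])
    return labels


def pass_rates(solutions, test_inputs, labels):
    """Fraction of pseudo-labeled test cases that each solution passes."""
    return [sum(solve(x) == y for x, y in zip(test_inputs, labels)) / len(test_inputs)
            for solve in solutions]


def preference_pairs(rates, margin=0.0):
    """(chosen, rejected) index pairs where one solution beats another by > margin."""
    pairs = []
    for i, j in combinations(range(len(rates)), 2):
        if rates[i] - rates[j] > margin:
            pairs.append((i, j))
        elif rates[j] - rates[i] > margin:
            pairs.append((j, i))
    return pairs
```

The resulting pairs could feed a preference-optimization objective such as DPO. Solutions with equal pass rates produce no pair, so they contribute no training signal.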

[NLP-26] Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring

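Quick read: This paper addresses the time teachers spend assessing student essays and examines the potential of generative AI for essay scoring. It evaluates the performance and reliability of open-source and closed-source large language models (LLMs) in assessing German student essays, comparing their ratings with those of teachers. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly on language-related criteria; the o1 model reaches Spearman's r = .74 with human assessments on the overall score and an internal consistency of ICC = .80. This suggests that LLMs can help reduce teacher workload, especially for language-related criteria, although their scoring of content quality needs further refinement.

Link: https://arxiv.org/abs/2411.16337
Authors: Kathrin Seßler,Maurice Fürstenberg,Babette Bühler,Enkelejda Kasneci
Keywords: time-consuming yet critical, critical task, facilitate essay-scoring tasks, student writing, German student essays
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted at LAK '25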

Abstract:The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (i.e., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs’ scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman’s r = .74 with human assessments in the overall score, and an internal consistency of ICC = .80. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency for higher scores, the models require further refinement to better capture aspects of content quality.
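
For readers unfamiliar with the agreement statistic reported above, Spearman's rank correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below is a generic stdlib implementation, not the study's analysis code, and the function names are chosen here for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation between two equally long score lists."""
    return pearson(average_ranks(x), average_ranks(y))
```

Applied to essay scoring, `x` would hold the model's scores and `y` the teachers' scores for the same essays; a value near 1 means the model orders the essays much as the teachers do, even if its absolute scores run higher.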

[NLP-27] Learning from Relevant Subgoals in Successful Dialogs using Iterative Training for Task-oriented Dialog Systems

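Quick read: This paper addresses a difficulty in task-oriented dialog (ToD) systems: they must solve multiple subgoals to accomplish user goals, yet feedback is often obtained only at the end of the dialog, which makes intermediate subgoals hard to optimize. It proposes SUIT (SUbgoal-aware ITerative Training), an iterative training approach that samples dialogs from the model and uses distant supervision to determine which subgoals contribute to dialog success, yielding high-quality training samples. This data improves supervised fine-tuning or, alternatively, preference learning, and SUIT can iteratively generate more data rather than relying on fixed static sets. SUIT reaches new state-of-the-art performance on a popular ToD benchmark.

Link: https://arxiv.org/abs/2411.16305
Authors: Magdalena Kaiser,Patrick Ernst,György Szarvas
Keywords: accomplish user goals, solve multiple subgoals, Task-oriented Dialog, user goals, solve multiple
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: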

Abstract:Task-oriented Dialog (ToD) systems have to solve multiple subgoals to accomplish user goals, whereas feedback is often obtained only at the end of the dialog. In this work, we propose SUIT (SUbgoal-aware ITerative Training), an iterative training approach for improving ToD systems. We sample dialogs from the model we aim to improve and determine subgoals that contribute to dialog success using distant supervision to obtain high quality training samples. We show how this data improves supervised fine-tuning or, alternatively, preference learning results. SUIT is able to iteratively generate more data instead of relying on fixed static sets. SUIT reaches new state-of-the-art performance on a popular ToD benchmark.

[NLP-28] BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

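Quick read: This paper addresses the weaker generative capabilities and more limited knowledge of large language models (LLMs) in low-resource languages. The key idea is to transfer generative capabilities and knowledge from high-resource to low-resource languages efficiently through language alignment. The authors construct a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions, and perform instruction tuning on it to facilitate capability transfer between languages. BayLing outperforms open-source models of similar scale on multilingual translation and achieves significant improvements on multilingual knowledge and understanding benchmarks across over 20 low-resource languages.

Link: https://arxiv.org/abs/2411.16300
Authors: Shaolei Zhang,Kehao Zhang,Qingkai Fang,Shoutao Guo,Yan Zhou,Xiaodong Liu,Yang Feng
Keywords: Large language models, powerful generative capabilities, Large language, languages, generative capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: BayLing 2’s online demo: this http URL . BayLing 2’s code and models: this https URL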

Abstract:Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhance the multilingual capabilities would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages and performed instruction tuning based on the dataset to facilitate the capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. For multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating its capability of effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while enhancing the performance in low-resource languages. Demo, homepage, code and models of BayLing are available.

[NLP-29] Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures

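Quick read: This paper investigates the mechanism by which large language models (LLMs) perform arithmetic within a single step of chain-of-thought (CoT) reasoning. Existing studies debate whether LLMs encode numerical values or rely on symbolic reasoning; this paper proposes that LLMs learn arithmetic by capturing algebraic structures such as commutativity and identity. Because these structures are observable through input-output relationships, they can generalize to unseen data. Experiments on a custom dataset of arithmetic problems show that LLMs can learn algebraic structures, and the findings indicate that leveraging them can enhance LLMs' arithmetic capabilities.

Link: https://arxiv.org/abs/2411.16260
Authors: Fu-Chieh Chang,Pei-Yuan Wu
Keywords: Large language models, demonstrated remarkable mathematical, Large language, remarkable mathematical capabilities, decomposes complex reasoning
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: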

Abstract:Large language models (LLMs) have demonstrated remarkable mathematical capabilities, largely driven by chain-of-thought (CoT) prompting, which decomposes complex reasoning into step-by-step solutions. This approach has enabled significant advancements, as evidenced by performance on benchmarks like GSM8K and MATH. However, the mechanisms underlying LLMs’ ability to perform arithmetic in a single step of CoT remain poorly understood. Existing studies debate whether LLMs encode numerical values or rely on symbolic reasoning, while others explore attention and multi-layered processing in arithmetic tasks. In this work, we propose that LLMs learn arithmetic by capturing algebraic structures, such as Commutativity and Identity properties. Since these structures are observable through input-output relationships, they can generalize to unseen data. We empirically demonstrate that LLMs can learn algebraic structures using a custom dataset of arithmetic problems. Our findings indicate that leveraging algebraic structures can enhance the LLMs’ arithmetic capabilities, offering insights into improving their arithmetic performance.
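
The claim that these structures are observable through input-output relationships can be made concrete: both properties can be checked by querying a binary operation as a black box, without inspecting how it is computed. The sketch below is an illustration only; the function names and the finite element set are assumptions, and the operation could equally be a model queried with arithmetic prompts.

```python
from itertools import product


def is_commutative(op, elements):
    """True if op(a, b) == op(b, a) for every pair drawn from elements."""
    return all(op(a, b) == op(b, a) for a, b in product(elements, repeat=2))


def find_identity(op, elements):
    """Return an element e with op(e, a) == op(a, e) == a for all a, else None."""
    for e in elements:
        if all(op(e, a) == a and op(a, e) == a for a in elements):
            return e
    return None
```

For addition modulo 7 on the elements 0 to 6, both checks succeed and the identity found is 0, while subtraction modulo 7 fails both: it is not commutative, and 0 acts only as a right identity.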

[NLP-30] NormXLogit: The Head-on-Top Never Lies

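Quick read: This paper addresses the fact that many existing interpretability methods for large language models (LLMs) are tied to a specific model architecture and are computationally expensive. It proposes NormXLogit, a technique that assesses the significance of individual input tokens from the input and output representations associated with each token. The paper shows that the norms of word embeddings capture the importance of input tokens, and that a token's importance is related to how closely its representation can resemble the model's final prediction. NormXLogit outperforms existing gradient-based methods in faithfulness and outperforms the most prominent architecture-specific methods in layer-wise explanations.

Link: https://arxiv.org/abs/2411.16252
Authors: Sina Abbasi,Mohammad Reza Modarres,Mohammad Taher Pilehvar
Keywords: building large language, large language models, Transformer architecture, dominant choice, choice for building
Categories: Computation and Language (cs.CL)
Comments: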

Abstract:The Transformer architecture has emerged as the dominant choice for building large language models (LLMs). However, with new LLMs emerging on a frequent basis, it is important to consider the potential value of architecture-agnostic approaches that can provide interpretability across a variety of architectures. Despite recent successes in the interpretability of LLMs, many existing approaches rely on complex methods that are often tied to a specific model design and come with a significant computational cost. To address these limitations, we propose a novel technique, called NormXLogit, for assessing the significance of individual input tokens. This method operates based on the input and output representations associated with each token. First, we demonstrate that during the pre-training of LLMs, the norms of word embeddings capture the importance of input tokens. Second, we reveal a significant relationship between a token’s importance and the extent to which its representation can resemble the model’s final prediction. Through extensive analysis, we show that our approach consistently outperforms existing gradient-based methods in terms of faithfulness. Additionally, our method achieves better performance in layer-wise explanations compared to the most prominent architecture-specific methods.
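
One plausible reading of the method's name, offered here as an assumption rather than the paper's exact definition, is a per-token score that multiplies the norm of the token's input embedding by the logit that the model's prediction head assigns to the predicted class when applied to that token's final representation. The sketch below implements that reading with plain lists; all names and the toy shapes are hypothetical.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm_x_logit(input_embeddings, final_states, head_weights, target):
    """Per-token score: ||input embedding|| * logit of `target` from the head.

    input_embeddings: one input embedding vector per token
    final_states:     one final-layer representation per token
    head_weights:     one weight row per output class (bias omitted for brevity)
    target:           index of the class the model predicted
    """
    w = head_weights[target]
    return [l2_norm(e) * dot(w, h)
            for e, h in zip(input_embeddings, final_states)]
```

Under this reading, the score needs only a forward pass plus one application of the head per token, with no gradients, which fits the abstract's emphasis on low cost and architecture independence.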

[NLP-31] Transparent Neighborhood Approximation for Text Classifier Explanation

【速读】：该论文试图解决生成式模型在解释文本分类器时缺乏透明性和可解释性的问题。解决方案的关键在于引入一种基于概率的编辑方法 (probability-based editing method)，替代传统的黑箱文本生成器。通过在文本上下文中实施基于概率的操作来生成邻近文本，这种方法不仅提高了解释的质量，还增强了整个解释过程的透明度和可控性。论文提出的XPROB方法在两个实际数据集上的评估中表现出与生成式解释器相当的性能，同时具有更高的稳定性和透明度。

Link: https://arxiv.org/abs/2411.16251
Authors: Yi Cai,Arthur Zimek,Eirini Ntoutsi,Gerhard Wunder
Keywords: Recent literature highlights, deploying generative models, improve synthetic instance, synthetic instance quality, explaining text classifiers
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: IEEE DSAA’24

Abstract:Recent literature highlights the critical role of neighborhood construction in deriving model-agnostic explanations, with a growing trend toward deploying generative models to improve synthetic instance quality, especially for explaining text classifiers. These approaches overcome the challenges in neighborhood construction posed by the unstructured nature of texts, thereby improving the quality of explanations. However, the deployed generators are usually implemented via neural networks and lack inherent explainability, sparking arguments over the transparency of the explanation process itself. To address this limitation while preserving neighborhood quality, this paper introduces a probability-based editing method as an alternative to black-box text generators. This approach generates neighboring texts by implementing manipulations based on in-text contexts. Substituting the generator-based construction process with recursive probability-based editing, the resultant explanation method, XPROB (explainer with probability-based editing), exhibits competitive performance according to the evaluation conducted on two real-world datasets. Additionally, XPROB’s fully transparent and more controllable construction process leads to superior stability compared to the generator-based explainers.

[NLP-32] DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings

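[Quick read]: This paper addresses the robustness of foundation models to group-based biases. The key to the solution is a method called DoubleCCA, which uses random sentences and Canonical Correlation Analysis (CCA) to enrich the foundation model's text embeddings. The steps are: first, generate various random sentences that extend the original prompts; second, use an additional sentence embedding model to produce different text embeddings for these random sentences; finally, apply CCA twice to align the representations and reconstruct them back into the original representation space. The method is shown to be effective across a variety of tasks and datasets, outperforming existing methods in both performance and robustness. DoubleCCA is simple to implement and easy to integrate into existing models, making it a practical way to improve the robustness of foundation models to group-based biases.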

Link: https://arxiv.org/abs/2411.16236
Authors: Hong Liu,Yitong Lu
Keywords: Canonical Correlation Analysis, foundation models, Correlation Analysis, paper presents, Canonical Correlation
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 6 figures, 2 tables

Abstract:This paper presents a novel method to improve the robustness of foundation models to group-based biases. We propose a simple yet effective method, called DoubleCCA, that leverages random sentences and Canonical Correlation Analysis (CCA) to enrich the text embeddings of the foundation model. First, we generate various random sentences that augment the original prompts, which extends the original prompts with random words or character sequences. Second, we use an additional sentence embedding model to generate different text embeddings with respect to these random sentences. We then apply CCA twice to align the representations and reconstruct them back to the original representation space. We demonstrate the effectiveness of our method on a variety of tasks and datasets, showing that it outperforms existing methods in terms of both performance and robustness. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of foundation models to group-based biases.

[NLP-33] MH-MoE: Multi-Head Mixture-of-Experts

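[Quick read]: This paper addresses how to improve the performance of Mixture-of-Experts (MoE) models while keeping FLOPs and parameter counts on par with sparse MoE models. The key to the solution is a new implementation of Multi-Head Mixture-of-Experts (MH-MoE), which uses a multi-head mechanism to collectively attend to information from various representation spaces within different experts. In language modeling experiments it improves quality over both vanilla MoE and fine-grained MoE models, and it is shown to be compatible with 1-bit large language models (LLMs) such as BitNet.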

Link: https://arxiv.org/abs/2411.16205
Authors: Shaohan Huang,Xun Wu,Shuming Ma,Furu Wei
Keywords: demonstrates superior performance, superior performance, mechanism to collectively, collectively attend, attend to information
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 0 figures

Abstract:Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.

[NLP-34] Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models

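[Quick read]: This paper addresses the scarcity of high-quality preference data for aligning multimodal large language models (MLLMs) on video-text tasks. The key to the solution is MMAIP-V, a high-quality video question answering (VQA) preference dataset built by sampling from the response distribution set and evaluating responses with an external scoring function. The paper also proposes the Iter-W2S-RLAIF framework, which gradually strengthens the alignment capabilities of MLLMs by iteratively updating the reference model and performing parameter extrapolation. Finally, it proposes an unbiased and information-complete evaluation scheme for VQA. Experiments show that MMAIP-V benefits preference learning in MLLMs and that Iter-W2S-RLAIF fully exploits the alignment information in MMAIP-V.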

Link: https://arxiv.org/abs/2411.16201
Authors: Hao Yi,Qingyang Li,Yulan Hu,Fuzheng Zhang,Di Zhang,Yong Liu
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-quality video-text preference data is crucial for Multimodal Large Language Models (MLLMs) alignment. However, existing preference data is very scarce. Obtaining VQA preference data for preference training is costly, and manually annotating responses is highly unreliable, which could result in low-quality pairs. Meanwhile, AI-generated responses controlled by temperature adjustment lack diversity. To address these issues, we propose a high-quality VQA preference dataset, called Multiple Multimodal Artificial Intelligence Preference Datasets in VQA (MMAIP-V), which is constructed by sampling from the response distribution set and using an external scoring function for response evaluation. Furthermore, to fully leverage the preference knowledge in MMAIP-V and ensure sufficient optimization, we propose Iterative Weak-to-Strong Reinforcement Learning from AI Feedback for video MLLMs (Iter-W2S-RLAIF), a framework that gradually enhances MLLMs’ alignment capabilities by iteratively updating the reference model and performing parameter extrapolation. Finally, we propose an unbiased and information-complete evaluation scheme in VQA evaluation. Experiments demonstrate that MMAIP-V is beneficial for MLLMs in preference learning and Iter-W2S-RLAIF fully exploits the alignment information in MMAIP-V. We believe that the proposed automatic VQA preference data generation pipeline based on AI feedback can greatly promote future work in the MLLMs alignment. Code and dataset are available at this https URL (MMAIP-V_Iter-W2S-RLAIF-702F).

[NLP-35] Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models

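[Quick read]: This paper addresses the hallucinations that large language models (LLMs) produce on complex reasoning tasks, which limit their practical application. The key to the solution is to introduce third-party LLMs that adjust the attention weights of agents, using uncertainty estimation and confidence analysis to optimize consensus formation in multi-agent systems. Specifically: 1) the third-party LLMs' uncertainty estimates and confidence analysis are used to adjust each agent's attention weight and promote in-depth debate among agents, optimizing consensus formation; 2) experiments on arithmetic datasets validate the method, which surpasses traditional multi-agent baselines. The study offers a new perspective on mitigating hallucinations when large models handle complex tasks.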

Link: https://arxiv.org/abs/2411.16189
Authors: Zhihua Duan,Jialin Wang
Keywords: Large Language Models, Language Models, Large Language, complex reasoning tasks, face challenges
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Models (LLMs) still face challenges when dealing with complex reasoning tasks, often resulting in hallucinations, which limit the practical application of LLMs. To alleviate this issue, this paper proposes a new method that integrates different LLMs to expand the knowledge boundary, reduce dependence on a single model, and promote in-depth debate among agents. The main contributions include: 1) Introducing third-party LLMs to adjust the attention weights of agents through uncertainty estimation and confidence analysis, optimizing consensus formation in multi-agent systems; 2) Experiments on arithmetic datasets have validated the effectiveness of the method, surpassing traditional multi-agent baselines. This research provides a new perspective for large models to alleviate hallucination phenomena when dealing with complex tasks.

[NLP-36] LLM Augmentations to support Analytical Reasoning over Multiple Documents

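[Quick read]: This paper addresses how large language models (LLMs) can be used to enhance in-depth analytical reasoning in intelligence analysis. The key to the solution is a memory module called dynamic evidence trees (DETs), which augments an LLM so that it can develop and track multiple investigation threads. Through extensive experiments on multiple datasets, the paper shows that current LLMs are still inadequate for supporting intelligence analysis and offers recommendations for improving LLMs for such intricate reasoning applications.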

Link: https://arxiv.org/abs/2411.16116
Authors: Raquib Bin Yousuf,Nicholas Defelice,Mandar Sharma,Shengzhe Xu,Naren Ramakrishnan
Keywords: large language models, enhance in-depth analytical, in-depth analytical reasoning, language models, demonstrated ability
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2024 IEEE International Conference on Big Data (IEEE BigData 2024)

Abstract:Building on their demonstrated ability to perform a variety of tasks, we investigate the application of large language models (LLMs) to enhance in-depth analytical reasoning within the context of intelligence analysis. Intelligence analysts typically work with massive dossiers to draw connections between seemingly unrelated entities, and uncover adversaries’ plans and motives. We explore if and how LLMs can be helpful to analysts for this task and develop an architecture to augment the capabilities of an LLM with a memory module called dynamic evidence trees (DETs) to develop and track multiple investigation threads. Through extensive experiments on multiple datasets, we highlight how LLMs, as-is, are still inadequate to support intelligence analysts and offer recommendations to improve LLMs for such intricate reasoning applications.

[NLP-37] Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability

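[Quick read]: This paper asks how well the circuits identified by mechanistic interpretability in large language models (LLMs) generalize across different prompt formats. Specifically, it studies the generality of the indirect object identification (IOI) circuit in GPT-2 small, particularly on prompt variants that challenge the assumptions of the original algorithm. The key finding, established experimentally, is that the IOI circuit generalizes surprisingly well to these variants, reusing all of its components and mechanisms and adding only extra input edges. The paper also discovers a mechanism it calls S2 Hacking, which explains why the circuit generalizes even where the original algorithm should fail. These findings suggest that circuits in LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.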

Link: https://arxiv.org/abs/2411.16105
Authors: Jatin Nainani,Sankaran Vaidyanathan,AJ Yeung,Kartik Gupta,David Jensen
Keywords: Mechanistic interpretability aims, performing specific tasks, Mechanistic interpretability, large neural networks, interpretability aims
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 8 figures

Abstract:Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the abilities of large language models (LLMs) to generalize across various prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the model’s generalization results from reusing the same circuit components, the components behaving differently, or the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well-studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while only adding additional input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism that explains this which we term S2 Hacking. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.

[NLP-38] Cautious Optimizers: Improving Training with One Line of Code

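[Quick read]: This paper addresses the shortfall in speed and stability of existing optimizers for transformer pretraining. The key to the solution is a single-line code modification, named the Cautious Optimizer, that applies to momentum-based optimizers and yields variants such as C-AdamW and C-Lion. In theory, the modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under Lyapunov analysis. This theoretical insight reveals a whole new family of optimizers, and experiments show speed-ups of up to 1.47x on Llama and MAE pretraining.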

Link: https://arxiv.org/abs/2411.16085
Authors: Kaizhao Liang,Lizhang Chen,Bo Liu,Qiang Liu
Keywords: default optimizer, Abstract, transformer pretraining, Adam Hamiltonian function, preserves Adam Hamiltonian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Discrete Mathematics (cs.DM)
Comments:

Abstract:AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers with only constrained positive outcomes. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we rename Cautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam’s Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing speed-up on Llama and MAE pretraining up to 1.47×. Code is available at this https URL
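
The abstract does not spell out what the single-line modification is. The sketch below assumes it masks out the components of a momentum-based update whose sign disagrees with the current gradient, then rescales by the kept fraction. It is written in plain Python rather than PyTorch, and the function name `cautious_step`, the rescaling choice, and the toy values are hypothetical.

```python
def cautious_step(params, grads, updates, lr, eps=1e-8):
    """One parameter step that keeps only update components aligned with the gradient."""
    # 1.0 where the proposed update and the gradient share a sign, else 0.0
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(updates, grads)]
    # Assumption: rescale by the kept fraction so the mean step size is roughly preserved
    scale = 1.0 / max(sum(mask) / len(mask), eps)
    return [p - lr * u * m * scale for p, u, m in zip(params, updates, mask)]

params = [1.0, 1.0, 1.0, 1.0]
grads = [0.5, -0.2, 0.1, 0.3]
updates = [0.4, 0.3, 0.2, -0.1]  # e.g. a momentum-based direction from AdamW or Lion
new_params = cautious_step(params, grads, updates, lr=0.1)
```

In this toy case the second and fourth coordinates are left untouched because their update and gradient point in opposite directions, while the remaining two take a doubled step.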

[NLP-39] SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text

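[Quick read]: This paper addresses how to evaluate the quality and relevance of natural language generation (NLG) output from LLM-integrated applications when references or sufficient labeled data are unavailable. The key to the solution is a new framework called SAGEval, which uses a critiquing Agent to give feedback on, and correct, the scores produced by LLM evaluators. In this way the critiquing Agent can rectify LLM evaluator scores even without references or ground-truth labels, reducing the dependence on labeled data, especially in complex NLG evaluation scenarios such as generating JSON-structured forms or surveys with responses in different styles.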

Link: https://arxiv.org/abs/2411.16077
Authors: Reshmi Ghosh,Tianyi Yao,Lizzy Chen,Sadid Hasan,Tianwei Chen,Dario Bernal,Huitian Jiao,H M Sajjad Hossain
Keywords: Large Language Model, Google Workspace, suite and Google, Workspace for creating, Large Language
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Model (LLM) integrations into applications like Microsoft365 suite and Google Workspace for creating/processing documents, emails, presentations, etc. have led to considerable enhancements in productivity and time savings. But as these integrations become more complex, it is paramount to ensure that the output from the LLM-integrated applications is relevant and appropriate for use. Identifying the need to develop robust evaluation approaches for natural language generation, wherein references/ground labels don’t exist or aren’t amply available, this paper introduces a novel framework called “SAGEval” which utilizes a critiquing Agent to provide feedback on scores generated by LLM evaluators. We show that the critiquing Agent is able to rectify scores from LLM evaluators, in absence of references/ground-truth labels, thereby reducing the need for labeled data even for complex NLG evaluation scenarios, like the generation of JSON-structured forms/surveys with responses in different styles like multiple choice, Likert ratings, single choice questions, etc.

[NLP-40] Predicting Emergent Capabilities by Finetuning

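[Quick read]: This paper addresses a fundamental open challenge in modern LLM scaling: the lack of understanding of emergent capabilities. Although pretraining loss is highly predictable as a function of compute, downstream capabilities are far less predictable and sometimes exhibit emergent jumps, which makes it hard to anticipate the capabilities of future models. The paper poses the task of emergence prediction: given current LLMs that have random few-shot accuracy on a task, predict whether future models (GPT-N+1) will have non-trivial accuracy on that task. It finds that finetuning LLMs on a given task shifts the point in scaling at which emergence occurs towards less capable models. To operationalize this, the paper finetunes LLMs with varying amounts of data and fits a parametric function that predicts when emergence will occur (the "emergence laws"). The approach is validated on four standard NLP benchmarks (MMLU, GSM8K, CommonsenseQA, and CoLA), and the paper demonstrates its potential in practical applications.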

Link: https://arxiv.org/abs/2411.16035
Authors: Charlie Snell,Eric Wallace,Dan Klein,Sergey Levine
Keywords: fundamental open challenge, fundamental open, open challenge, challenge in modern, lack of understanding
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:A fundamental open challenge in modern LLM scaling is the lack of understanding around emergent capabilities. In particular, language model pretraining loss is known to be highly predictable as a function of compute. However, downstream capabilities are far less predictable – sometimes even exhibiting emergent jumps – which makes it challenging to anticipate the capabilities of future models. In this work, we first pose the task of emergence prediction: given access to current LLMs that have random few-shot accuracy on a task, can we predict whether future models (GPT-N+1) will have non-trivial accuracy on that task? We then discover a simple insight for this problem: finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable models. To operationalize this insight, we can finetune LLMs with varying amounts of data and fit a parametric function that predicts when emergence will occur (i.e., “emergence laws”). We validate this approach using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence (MMLU, GSM8K, CommonsenseQA, and CoLA). Using only small-scale LLMs, we find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged. Finally, we present a case study of two realistic uses for emergence prediction.

[NLP-41] TransCompressor: LLM-Powered Multimodal Data Compression for Smart Transportation

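[Quick read]: This paper addresses the efficient compression and decompression of multimodal sensor data in smart transportation systems. The key to the solution is a new framework called TransCompressor, which uses large language models (LLMs) to compress and decompress several types of sensor data, such as barometer, speed, and altitude measurements. With well-crafted prompts, LLMs can draw on their broad knowledge base to contribute to the compression process, improving data storage, analysis, and retrieval in smart transportation settings.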

Link: https://arxiv.org/abs/2411.16020
Authors: Huanqi Yang,Rucheng Wu,Weitao Xu
Keywords: Large Language Models, Language Models, Large Language, incorporation of Large, improving data management
Categories: Computation and Language (cs.CL)
Comments: 6 pages

Abstract:The incorporation of Large Language Models (LLMs) into smart transportation systems has paved the way for improving data management and operational efficiency. This study introduces TransCompressor, a novel framework that leverages LLMs for efficient compression and decompression of multimodal transportation sensor data. TransCompressor has undergone thorough evaluation with diverse sensor data types, including barometer, speed, and altitude measurements, across various transportation modes like buses, taxis, and MTRs. Comprehensive evaluation illustrates the effectiveness of TransCompressor in reconstructing transportation sensor data at different compression ratios. The results highlight that, with well-crafted prompts, LLMs can utilize their vast knowledge base to contribute to data compression processes, enhancing data storage, analysis, and retrieval in smart transportation settings.

[NLP-42] Exploring Performance Contrasts in TableQA: Step-by-Step Reasoning Boosts Bigger Language Models, Limits Smaller Language Models

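[Quick read]: This paper investigates the performance contrast between larger and smaller language models (LMs) when step-by-step reasoning is used in the TableQA task. The key to the approach is a detailed prompting flow called Table-Logic, which handles a task by sequentially identifying the critical columns and rows, determining the necessary aggregations, calculations, or comparisons, and finally inferring the result to produce a precise prediction. Experiments show that larger LMs such as Llama-3-70B gain 7.8% accuracy over the vanilla baseline on HybridQA, whereas smaller LMs such as Llama-2-7B suffer an 11% performance decline. Through a multi-dimensional analysis, the study reveals the limitations of step-by-step reasoning in small models and suggests potential directions for improvement.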

Link: https://arxiv.org/abs/2411.16002
Authors: Haoyan Yang,Yixuan Wang,Keyue Tong,Hongjin Zhu,Yuanxin Zhang
Keywords: detailed prompting flow, termed Table-Logic, prompting flow, paper proposes, proposes a detailed
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper proposes a detailed prompting flow, termed Table-Logic, to investigate the performance contrasts between bigger and smaller language models (LMs) utilizing step-by-step reasoning methods in the TableQA task. The method processes tasks by sequentially identifying critical columns and rows given the question and the table with its structure, determining necessary aggregations, calculations, or comparisons, and finally inferring the results to generate a precise prediction. By deploying this method, we observe a 7.8% accuracy improvement in bigger LMs like Llama-3-70B compared to the vanilla on HybridQA, while smaller LMs like Llama-2-7B show an 11% performance decline. We empirically investigate the potential causes of performance contrasts by exploring the capabilities of bigger and smaller LMs from various dimensions in the TableQA task. Our findings highlight the limitations of the step-by-step reasoning method in small models and provide potential insights for making improvements.

[NLP-43] Multi-ToM: Evaluating Multilingual Theory of Mind Capabilities in Large Language Models

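[Quick read]: This paper addresses the following question: as large language models (LLMs) are increasingly evaluated for social and cognitive capabilities, it remains unclear to what extent they exhibit Theory of Mind (ToM) across different languages and cultural contexts. The key to the solution is twofold: (1) translating existing ToM datasets into multiple languages to create a multilingual ToM dataset; (2) enriching these translations with culturally specific elements that reflect the social and cognitive scenarios relevant to different populations. Using these two components, the paper conducts extensive evaluations of six state-of-the-art LLMs on both the translated and the culturally adapted datasets, revealing how linguistic and cultural diversity affects the models' ability to exhibit ToM and raising questions about their social reasoning capabilities.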

Link: https://arxiv.org/abs/2411.15999
Authors: Jayanta Sadhu,Ayan Antik Khan,Noshin Nawal,Sanju Basak,Abhik Bhattacharjee,Rifat Shahriyar
Keywords: Theory of Mind, attribute mental states, infer and attribute, attribute mental, mental states
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Theory of Mind (ToM) refers to the cognitive ability to infer and attribute mental states to oneself and others. As large language models (LLMs) are increasingly evaluated for social and cognitive capabilities, it remains unclear to what extent these models demonstrate ToM across diverse languages and cultural contexts. In this paper, we introduce a comprehensive study of multilingual ToM capabilities aimed at addressing this gap. Our approach includes two key components: (1) We translate existing ToM datasets into multiple languages, effectively creating a multilingual ToM dataset and (2) We enrich these translations with culturally specific elements to reflect the social and cognitive scenarios relevant to diverse populations. We conduct extensive evaluations of six state-of-the-art LLMs to measure their ToM performance across both the translated and culturally adapted datasets. The results highlight the influence of linguistic and cultural diversity on the models’ ability to exhibit ToM, and call their social reasoning capabilities into question. This work lays the groundwork for future research into enhancing LLMs’ cross-cultural social cognition and contributes to the development of more culturally aware and socially intelligent AI systems. All our data and code are publicly available.

[NLP-44] Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

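[Quick read]: This paper addresses the lack of factuality in long-form text generated by large language models (LLMs). The core of the work is an analysis of long-form factuality across several LLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral), which shows that factuality scores decline in later sentences of the generated text while the number of unsupported claims rises. The paper further examines how two evaluation settings, Self-Known and Self-Unknown, reflect the accuracy of LLMs' self-judgment: even the most advanced models fail to reach perfect Self-Known scores, and their Self-Unknown scores stay above zero, indicating persistent uncertainty in self-assessment. The study also finds that higher Self-Known scores correlate with improved factuality, whereas higher Self-Unknown scores are associated with lower factuality. These findings expose the limitations of current LLMs in long-form generation and offer useful insights for improving its factuality.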

Link: https://arxiv.org/abs/2411.15993
Authors: Lifu Tu,Rui Meng,Shafiq Joty,Yingbo Zhou,Semih Yavuz
Keywords: demonstrated strong capabilities, Large language models, long-form text generation, Large language, long-form generation
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality scores tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). The results indicate that even advanced models like GPT-4 and Gemini-1.5-Pro fail to achieve perfect Self-Known scores, while their Self-Unknown scores remain notably above zero, reflecting ongoing uncertainty in their self-assessments. Moreover, we find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality. Interestingly, even without significant changes in the models’ self-judgment (Self-Known and Self-Unknown), the number of unsupported claims can increase, likely as an artifact of long-form generation. These findings show the limitations of current LLMs in long-form generation, and provide valuable insights for improving factuality in long-form text generation.
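
The two scores are defined in the abstract as percentages over atomic claims, which can be written out directly. This is a minimal sketch of those definitions only; the function name `self_known_unknown`, the `(is_supported, judged_correct)` pair layout, and the sample claims are hypothetical, and the claim decomposition and support checking that the paper relies on are not modelled here.

```python
def self_known_unknown(claims):
    """claims: list of (is_supported, model_judged_correct) boolean pairs."""
    supported = [judged for ok, judged in claims if ok]
    unsupported = [judged for ok, judged in claims if not ok]
    # Self-Known: share of supported claims that the model itself judges correct
    self_known = (100.0 * sum(supported) / len(supported)
                  if supported else float("nan"))
    # Self-Unknown: share of unsupported claims that the model itself judges incorrect
    self_unknown = (100.0 * sum(not j for j in unsupported) / len(unsupported)
                    if unsupported else float("nan"))
    return self_known, self_unknown

# Four supported claims (three judged correct), two unsupported (one judged incorrect)
claims = [(True, True), (True, True), (True, False), (True, True),
          (False, False), (False, True)]
sk, su = self_known_unknown(claims)
```

With these sample claims the function returns a Self-Known score of 75.0 and a Self-Unknown score of 50.0.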

[NLP-45] Generative Context Distillation

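[Quick read]: This paper addresses the significant computational overhead caused by the fixed, lengthy prompts used in large language model applications. The key to the solution is a lightweight prompt internalization method called Generative Context Distillation (GCD). Using a joint training approach, the method not only replicates the behavior of a model given the prompt as input, but also generates the content of the prompt together with reasons why the model's behavior should change accordingly. In addition, the paper introduces a data synthesis technique that automatically collects conversational datasets by swapping the roles of the agent and the environment, enabling effective training without an interactive environment. This is especially useful when only a predefined prompt is available and there is no corresponding training dataset. By internalizing complex prompts, GCD achieves high-performance, efficient inference without explicit prompts.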

Link: https://arxiv.org/abs/2411.15927
Authors: Haebin Shin,Lei Ji,Yeyun Gong,Sungdong Kim,Eunbi Choi,Minjoon Seo
Keywords: significant computational overhead, recent large language, Generative Context Distillation, large language model, language model based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Context Distillation (GCD), a lightweight prompt internalization method that employs a joint training approach. This method not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model’s behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Context Distillation enables high-performance and efficient inference without the need for explicit prompts.

[NLP-46] Evaluating Large Language Models for Causal Modeling

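[Quick read]: This paper addresses the transformation of causal domain knowledge into a representation that aligns more closely with guidelines from causal data science. The key to the solution is the introduction of two new tasks: distilling causal domain knowledge into causal variables, and detecting interaction entities using large language models (LLMs). The study shows that contemporary LLMs such as GPT-4-turbo and Llama3-70b outperform sparse expert models such as Mixtral-8x22b at distilling causal domain knowledge into causal variables, whereas sparse expert models are more effective at identifying interaction entities. The paper also highlights the dependency between the domain in which the entities are generated and the performance of the chosen LLM for causal modeling.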

Link: https://arxiv.org/abs/2411.15888
Authors: Houssam Razouk,Leonie Benischke,Georg Niess,Roman Kern
Keywords: causal domain knowledge, causal data science, domain knowledge, transforming causal domain, data science
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 6 figures, 4 tables

Abstract:In this paper, we consider the process of transforming causal domain knowledge into a representation that aligns more closely with guidelines from causal data science. To this end, we introduce two novel tasks related to distilling causal domain knowledge into causal variables and detecting interaction entities using LLMs. We have determined that contemporary LLMs are helpful tools for conducting causal modeling tasks in collaboration with human experts, as they can provide a wider perspective. Specifically, LLMs, such as GPT-4-turbo and Llama3-70b, perform better in distilling causal domain knowledge into causal variables compared to sparse expert models, such as Mixtral-8x22b. On the contrary, sparse expert models such as Mixtral-8x22b stand out as the most effective in identifying interaction entities. Finally, we highlight the dependency between the domain where the entities are generated and the performance of the chosen LLM for causal modeling.

[NLP-47] LLMs Do Not Think Step-by-step In Implicit Reasoning

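[Quick read]: This paper asks whether implicit Chain-of-Thought (CoT) is equivalent to explicit CoT. The key to the approach is to probe the model's hidden states for information about intermediate steps while it performs implicit CoT. The results indicate that large language models (LLMs) hardly consider intermediate steps during implicit CoT and instead rely on experience rather than strict step-by-step reasoning. The study also finds that LLMs' implicit reasoning capabilities are unstable and easily disrupted, reaffirming the need for explicit CoT to support complex tasks.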

Link: https://arxiv.org/abs/2411.15862
Authors: Yijiong Yu
Keywords: remarkably enhance LLMs’, enhance LLMs’ performance, remarkably enhance, CoT, explicit CoT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:It is well known that Chain-of-Thought (CoT) can remarkably enhance LLMs’ performance on complex tasks. However, because it also introduces slower inference speeds and higher computational costs, many studies have attempted to use implicit CoT, which does not need LLMs to explicitly generate the intermediate steps. But there is still a gap between their efficacy and that of typical explicit CoT methods. This raises the question: is implicit CoT really equivalent to explicit CoT? Therefore, in this study, we address this question through experiments. We probe the information of intermediate steps from the model’s hidden states when it is performing implicit CoT. The results surprisingly indicate that LLMs hardly think about intermediate steps, suggesting they may just rely on experience rather than strict step-by-step reasoning. Moreover, we find LLMs’ implicit reasoning capabilities are susceptible and unstable, reaffirming the necessity of explicit CoT to effectively support complex tasks.

[NLP-48] Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?

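[Quick read]: This paper addresses the relative impact of training data quality versus quantity on the performance of small language models (SLMs). The key to the approach is an experimental analysis of how dataset variants, which differ in size and duplication rate, affect model performance, measured by validation loss, accuracy, and perplexity. The results show that data quality has the more significant effect on overall SLM performance: a moderate amount of duplication slightly improves accuracy without significantly increasing perplexity, but excessive duplication causes a pronounced performance drop. Beyond optimizing model performance, the findings support efforts to reduce the financial and computational burden and the environmental impact of large-scale training, making AI technology more democratic and sustainable.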

Link: https://arxiv.org/abs/2411.15821
Authors: Aryan Sajith,Krishna Chaitanya Rao Kathala
Keywords: small language models, utilizing the TinyStories, study investigates, small language, data quality versus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures

Abstract:This study investigates the relative impact of training data quality versus quantity on the performance of small language models (SLMs), utilizing the TinyStories dataset for empirical analysis. Analyses of dataset variations with respect to size (25% and 50% of the original size) and duplication (controlled rates of 25%, 50%, 75%, and 100%) were performed. Model performance was evaluated based on the validation loss, accuracy, and perplexity metrics. Results indicate training data quality plays a more significant role in the overall performance of SLMs, especially given the scale of this experiment. Minimal duplication positively impacted model accuracy (+0.87% increase in accuracy at 25% duplication) without significantly increasing perplexity (+0.52% increase going from 0% to 25% duplication) but excessive duplication led to pronounced performance degradation (-40% drop in accuracy at 100% duplication). The implications of this exploration extend beyond just model performance; training large-scale models imposes significant financial and computational burdens, which can be prohibitive for organizations, individuals, and the public at large, especially in developing countries. Additionally, the energy consumption associated with large-scale training raises environmental concerns. Understanding the relative importance of data quality versus quantity could democratize AI technology, making advanced models more accessible and sustainable for all.
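
The abstract reports validation loss and perplexity side by side without restating how they relate. Assuming the usual definition, perplexity is the exponential of the mean per-token cross-entropy loss measured in nats. The sketch below illustrates only that relation; the function name `perplexity` and the sample losses are hypothetical and are not taken from the paper.

```python
import math

def perplexity(token_losses):
    """Perplexity from per-token cross-entropy losses given in nats."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)

# A model that is, on average, as uncertain as a uniform choice among 4 tokens
losses = [math.log(4.0)] * 10
ppl = perplexity(losses)
```

Because the mapping is exponential, a small relative change in validation loss can correspond to a larger relative change in perplexity, which is worth keeping in mind when comparing the percentage changes quoted for the two metrics.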

[NLP-49] LoRA-Mini : Adaptation Matrices Decomposition and Selective Training

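TL;DR: This paper addresses the computational and storage cost of task-specific fine-tuning of large language models (LLMs). Traditional fine-tuning updates a large number of parameters, which is computationally expensive and memory-intensive. The proposed solution, LoRA-Mini, is an optimization of Low-Rank Adaptation (LoRA). The key idea is to split the low-rank matrices into four parts and train only the two inner matrices, which reduces the number of trainable parameters by up to 20x while keeping performance comparable to standard LoRA.

Link: https://arxiv.org/abs/2411.15804
Authors: Ayush Singh, Rajdeep Aher, Shivank Garg
Keywords: natural language processing, revolutionized natural language, large language models, task-specific fine-tuning methods, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages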

Abstract:The rapid advancements in large language models (LLMs) have revolutionized natural language processing, creating an increased need for efficient, task-specific fine-tuning methods. Traditional fine-tuning of LLMs involves updating a large number of parameters, which is computationally expensive and memory-intensive. Low-Rank Adaptation (LoRA) has emerged as a promising solution, enabling parameter-efficient fine-tuning by reducing the number of trainable parameters. However, while LoRA reduces the number of trainable parameters, LoRA modules still create significant storage challenges. We propose LoRA-Mini, an optimized adaptation of LoRA that improves parameter efficiency by splitting low-rank matrices into four parts, with only the two inner matrices being trainable. This approach achieves up to a 20x reduction compared to standard LoRA in the number of trainable parameters while preserving performance levels comparable to standard LoRA, addressing both computational and storage efficiency in LLM fine-tuning.
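
To make the parameter saving concrete, here is a minimal sketch of one way a four-part split could look. The factorization below (frozen outer matrices `B_out` and `A_out`, trainable inner matrices `B_in` and `A_in`, and an intermediate width `m`) is an assumption made for illustration; the abstract does not give the shapes or the initialization, and the sizes are not taken from the paper.

```python
import numpy as np


def lora_trainable(d_out, d_in, r):
    """Standard LoRA trains B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


def lora_mini_trainable(r, m):
    """Assumed split: only B_in (m x r) and A_in (r x m) are trained."""
    return 2 * m * r


d_out, d_in, r, m = 1024, 1024, 8, 64   # illustrative sizes only
rng = np.random.default_rng(0)

B_out = rng.normal(size=(d_out, m))   # frozen outer matrix
B_in = np.zeros((m, r))               # trainable; zero init so the update starts at 0
A_in = rng.normal(size=(r, m))        # trainable
A_out = rng.normal(size=(m, d_in))    # frozen outer matrix

delta_W = B_out @ B_in @ A_in @ A_out  # weight update added to the frozen base weight
print(delta_W.shape)                   # (1024, 1024)
print(lora_trainable(d_out, d_in, r))  # 16384
print(lora_mini_trainable(r, m))       # 1024
```

Under these assumed sizes the trainable count drops by 16x; the actual ratio depends on the chosen ranks and widths.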

[NLP-50] A Method for Building Large Language Models with Predefined KV Cache Capacity

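TL;DR: This paper addresses the excessive memory consumption of traditional Key-Value (KV) caches in Transformer decoder architectures when handling infinite contexts. The key to the solution is a fixed-length KV cache: by dynamically updating the sequence of key-value vectors, inference runs efficiently within a limited cache capacity, significantly reducing memory usage while maintaining model performance and system throughput.

Link: https://arxiv.org/abs/2411.15785
Authors: Zhonghua Yi, Ge Niu, Lei Wang, Wei Tang, Liqiu Zhang
Keywords: Transformer decode-only architectures, building large language, layers in Transformer, Transformer decode-only, large language models
Categories: Computation and Language (cs.CL)
Comments: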

Abstract:This paper proposes a method for building large language models with predefined Key-Value (KV) cache capacity, particularly suitable for the attention layers in Transformer decode-only architectures. This method introduces fixed-length KV caches to address the issue of excessive memory consumption in traditional KV caches when handling infinite contexts. By dynamically updating the key-value vector sequences, it achieves efficient inference within limited cache capacity, significantly reducing memory usage while maintaining model performance and system throughput. Experimental results show that this method significantly reduces memory usage while maintaining the model’s inference quality.
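
A fixed-capacity cache can be sketched as a ring buffer. The update rule here (overwrite the oldest entry) is an assumption chosen for simplicity; the abstract does not say how the paper updates the key-value sequence, and `FixedKVCache` is a hypothetical name.

```python
import numpy as np


class FixedKVCache:
    """KV cache with a predefined capacity; the oldest entry is overwritten."""

    def __init__(self, capacity, head_dim):
        self.keys = np.zeros((capacity, head_dim))
        self.values = np.zeros((capacity, head_dim))
        self.capacity = capacity
        self.count = 0  # total number of tokens seen so far

    def append(self, key, value):
        slot = self.count % self.capacity  # ring-buffer position
        self.keys[slot] = key
        self.values[slot] = value
        self.count += 1

    def attend(self, query):
        """Scaled dot-product attention over the cached entries only."""
        n = min(self.count, self.capacity)
        scores = self.keys[:n] @ query / np.sqrt(query.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values[:n]


rng = np.random.default_rng(0)
cache = FixedKVCache(capacity=4, head_dim=8)
for _ in range(10):                      # 10 tokens, memory stays at 4 slots
    cache.append(rng.normal(size=8), rng.normal(size=8))
out = cache.attend(rng.normal(size=8))
print(cache.keys.shape, out.shape)       # (4, 8) (8,)
```

Memory is allocated once at construction, so it does not grow with sequence length. The slot order differs from token order, which is harmless here because this toy attention uses no positional information.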

[NLP-51] Detecting Turkish Synonyms Used in Different Time Periods

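TL;DR: This paper addresses the performance degradation in processing historical texts caused by the changing structure of language, focusing on Turkish, which changed rapidly after the 20th-century language reform. The key contribution is two methods for detecting synonyms used in different time periods: the first aligns the embedding spaces built from documents of different periods using the Orthogonal Procrustes method; the second extends it by incorporating Spearman's correlation between word frequencies over the years. Experiments show that both methods outperform the baseline and stay consistently effective as the target period shifts from the 1960s to the 1980s, with a slight decrease for later periods.

Link: https://arxiv.org/abs/2411.15768
Authors: Umur Togay Yazar, Mucahid Kutlu
Keywords: poses significant challenges, languages poses significant, applying natural language, natural language processing, language processing models
Categories: Computation and Language (cs.CL)
Comments: published at Innovations in Intelligent Systems and Applications Conference (Akıllı Sistemlerde Yenilikler ve Uygulamaları Konferansı - ASYU) 2024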

Abstract:Dynamic structure of languages poses significant challenges in applying natural language processing models on historical texts, causing decreased performance in various downstream tasks. Turkish is a prominent example of rapid linguistic transformation due to the language reform in the 20th century. In this paper, we propose two methods for detecting synonyms used in different time periods, focusing on Turkish. In our first method, we use Orthogonal Procrustes method to align the embedding spaces created using documents written in the corresponding time periods. In our second method, we extend the first one by incorporating Spearman’s correlation between frequencies of words throughout the years. In our experiments, we show that our proposed methods outperform the baseline method. Furthermore, we observe that the efficacy of our methods remains consistent when the target time period shifts from the 1960s to the 1980s. However, their performance slightly decreases for subsequent time periods.
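
The alignment step of the first method can be sketched with the closed-form Orthogonal Procrustes solution: the orthogonal matrix R minimizing ||XR - Y||_F is UV^T, where U and V^T come from the SVD of X^T Y. The data below is a toy setup (one space is an exact rotation of the other), and `nearest_neighbor` is a hypothetical helper for looking up candidate synonyms, not the paper's procedure.

```python
import numpy as np


def procrustes_align(source, target):
    """Return the orthogonal R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def nearest_neighbor(vector, matrix):
    """Index of the row of `matrix` most cosine-similar to `vector`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return int(np.argmax(matrix @ vector / norms))


rng = np.random.default_rng(0)
old_space = rng.normal(size=(50, 16))               # toy embeddings, earlier period
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
new_space = old_space @ rotation                    # same words in a rotated space

R = procrustes_align(old_space, new_space)
aligned = old_space @ R
print(np.allclose(aligned, new_space))              # True
print(nearest_neighbor(aligned[3], new_space))      # 3
```

With real corpora the two spaces are trained independently, so the alignment is only approximate and is usually fitted on words assumed to be stable across periods.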

[NLP-52] TableTime: Reformulating Time Series Classification as Zero-Shot Table Understanding via Large Language Models

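TL;DR: This paper addresses three bottlenecks of large language models (LLMs) in multivariate time series classification (MTSC): (1) temporal and channel-specific information is hard to encode losslessly; (2) the learned representation space is hard to align with the semantic space of LLMs; (3) task-specific retraining is required, which is computationally expensive and labor-intensive. The proposed method, TableTime, reformulates MTSC as a table understanding task. Its strategies are: (1) convert the multivariate time series into tabular form to minimize information loss; (2) represent the tabular time series as text so that it aligns naturally with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference, and problem decomposition to strengthen LLM reasoning and enable zero-shot classification.

Link: https://arxiv.org/abs/2411.15737
Authors: Jiahao Wang, Mingyue Cheng, Qingyang Mao, Qi Liu, Feiyang Xu, Xin Li, Enhong Chen
Keywords: Large language models, Large language, multivariate time series, time series, language models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: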

Abstract:Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with the semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is very difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 representative public datasets from the UEA archive verify the superiority of TableTime.
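
The first two strategies (tabular form, then text) can be illustrated with a small serializer. The pipe-delimited layout, the two-decimal rounding, and the channel names are assumptions for this sketch; the abstract does not specify the text format TableTime uses.

```python
def series_to_table_text(series, channel_names):
    """Serialize a (timesteps x channels) series as a pipe-delimited text table."""
    header = "| t | " + " | ".join(channel_names) + " |"
    divider = "|" + "---|" * (len(channel_names) + 1)
    rows = [
        f"| {t} | " + " | ".join(f"{v:.2f}" for v in step) + " |"
        for t, step in enumerate(series)
    ]
    return "\n".join([header, divider, *rows])


sample = [[0.1, 1.5], [0.2, 1.25], [0.35, 1.0]]   # 3 timesteps, 2 channels
print(series_to_table_text(sample, ["acc_x", "acc_y"]))
# | t | acc_x | acc_y |
# |---|---|---|
# | 0 | 0.10 | 1.50 |
# | 1 | 0.20 | 1.25 |
# | 2 | 0.35 | 1.00 |
```

Each row keeps the time index and each column keeps the channel identity, so both kinds of structure remain visible to a model that reads the table as plain text.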

[NLP-53] Development of Pre-Trained Transformer-based Models for the Nepali Language

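TL;DR: This paper addresses the scarcity of data resources and the limited exploration of models for the Nepali language in natural language processing (NLP). The authors collected 27.5 GB of Nepali text, approximately 2.4x larger than any previously available Nepali corpus, and used it to pre-train three models exclusively for Nepali: BERT, RoBERTa, and GPT-2. They also performed instruction tuning and explored its potential on monolingual Nepali data, laying a foundation for future research. The models outperform the existing best model by 2 points on the Nep-gLUE benchmark, scoring 95.60, and also outperform existing models on text generation tasks, improving both understanding and generation of Nepali text.

Link: https://arxiv.org/abs/2411.15734
Authors: Prajwal Thapa, Jinu Nyachhyon, Mridul Sharma, Bal Krishna Bal
Keywords: Natural Language Processing, Transformer-based pre-trained language, field of Natural, Language Processing, Nepali language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: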

Abstract:Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali Language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.

[NLP-54] LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training

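TL;DR: This paper studies how the concept of sparsity can be used to scale model size while keeping the number of activated parameters constant. The approach builds Mixture-of-Experts (MoE) models for both the attention (Attention MoE) and MLP (MLP MoE) modules in the Transformer blocks of the dense LLaMA model. It compares expert construction methods and granularities to analyze the impact of sparsification, and designs a two-stage post-training strategy to counteract the performance degradation caused by increased sparsity, evaluating capabilities across domains such as conversation, code, and math. Experiments indicate the potential effectiveness of this approach for instructed large language models (LLMs).

Link: https://arxiv.org/abs/2411.15708
Authors: Xiaoye Qu, Daize Dong, Xuyang Hu, Tong Zhu, Weigao Sun, Yu Cheng
Keywords: activated parameters constant, gained increasing popularity, scaling model size, parameters constant, gained increasing
Categories: Computation and Language (cs.CL)
Comments: Technical report, 13 pages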

Abstract:Recently, inspired by the concept of sparsity, Mixture-of-Experts (MoE) models have gained increasing popularity for scaling model size while keeping the number of activated parameters constant. In this study, we thoroughly investigate the sparsity of the dense LLaMA model by constructing MoE for both the attention (i.e., Attention MoE) and MLP (i.e., MLP MoE) modules in the transformer blocks. Specifically, we investigate different expert construction methods and granularities under the same activation conditions to analyze the impact of sparsifying the model. Additionally, to comprehensively evaluate the model’s capabilities across various domains (e.g., conversation, code, math) after sparsification, we apply sparsity to the instructed large language models (LLMs) and construct instructed MoE models. To counteract the performance degradation resulting from increased sparsity, we design a two-stage post-training strategy to enhance model performance. Experiments on the LLaMA3 model demonstrate the potential effectiveness of this approach for future developments of instructed MoE models. The source codes and models are available at: this https URL.
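
The phrase "keeping the number of activated parameters constant" refers to sparse routing: each token uses only a few experts. The sketch below shows generic top-k MoE routing, not the expert construction method of LLaMA-MoE v2; the linear experts, the gate, and all sizes are illustrative assumptions.

```python
import numpy as np


def moe_forward(x, gate_w, experts, k=2):
    """Route token x to the top-k experts and mix their outputs."""
    logits = gate_w @ x                        # one routing score per expert
    top = np.argsort(logits)[-k:]              # indices of the k largest scores
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    output = sum(w * (experts[i] @ x) for w, i in zip(weights, top))
    return output, top


rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy linear experts
gate_w = rng.normal(size=(n_experts, d))

y, used = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
print(y.shape, len(used))   # (16,) 2
```

Only 2 of the 8 expert matrices are multiplied per token, so adding experts grows the total parameter count without growing per-token compute.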

[NLP-55] RAMIE: Retrieval-Augmented Multi-task Information Extraction with Large Language Models on Dietary Supplements

TL;DR: This paper aims to develop an advanced multi-task large language model (LLM) framework for extracting multiple types of information about dietary supplements (DS) from clinical records. The key is a new framework, Retrieval-Augmented Multi-task Information Extraction (RAMIE), which combines instruction fine-tuning, multi-task learning (MTL), and retrieval-augmented generation (RAG). Specifically, RAMIE applies instruction fine-tuning with task-specific prompts, trains LLMs on multiple tasks for better storage efficiency and lower training cost, and augments generation by retrieving similar examples from the training set, which substantially improves multi-task information extraction performance.

Link: https://arxiv.org/abs/2411.15700
Authors: Zaifu Zhan, Shuang Zhou, Mingchen Li, Rui Zhang
Keywords: advanced multi-task large, large language model, Multi-task Information Extraction, extract multiple types, information extraction
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Objective: We aimed to develop an advanced multi-task large language model (LLM) framework to extract multiple types of information about dietary supplements (DS) from clinical records. Methods: We used four core DS information extraction tasks - namely, named entity recognition (NER: 2,949 clinical sentences), relation extraction (RE: 4,892 sentences), triple extraction (TE: 2,949 sentences), and usage classification (UC: 2,460 sentences) as our multitasks. We introduced a novel Retrieval-Augmented Multi-task Information Extraction (RAMIE) Framework, which 1) employs instruction fine-tuning techniques with task-specific prompts, 2) trains LLMs for multiple tasks with improved storage efficiency and lower training costs, and 3) incorporates retrieval-augmented generation (RAG) techniques by retrieving similar examples from the training set. We compared RAMIE’s performance to LLMs with instruction fine-tuning alone and conducted an ablation study to assess the contributions of multi-task learning and RAG to improved multitasking performance. Results: With the aid of the RAMIE framework, Llama2-13B achieved an F1 score of 87.39 (3.51% improvement) on the NER task and demonstrated outstanding performance on the RE task with an F1 score of 93.74 (1.15% improvement). For the TE task, Llama2-7B scored 79.45 (14.26% improvement), and MedAlpaca-7B achieved the highest F1 score of 93.45 (0.94% improvement) on the UC task. The ablation study revealed that while MTL increased efficiency with a slight trade-off in performance, RAG significantly boosted overall accuracy. Conclusion: This study presents a novel RAMIE framework that demonstrates substantial improvements in multi-task information extraction for DS-related data from clinical records. Our framework can potentially be applied to other domains.
Cite as: arXiv:2411.15700 [cs.CL]. Submission history: [v1] Sun, 24 Nov 2024 03:56:43 UTC (450 KB), from Zaifu Zhan.
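
The retrieval step ("retrieving similar examples from the training set") can be sketched as follows. The bag-of-words cosine similarity, the prompt layout, and the example sentences and labels are all assumptions made for illustration; the abstract does not say which retriever or prompt template RAMIE uses.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, train_set, k=2):
    """Return the k training examples whose text is most similar to the query."""
    q = Counter(query.lower().split())
    return sorted(
        train_set,
        key=lambda ex: cosine(q, Counter(ex["text"].lower().split())),
        reverse=True,
    )[:k]


def build_prompt(instruction, query, examples):
    """Place retrieved examples before the query as demonstrations."""
    demos = "\n".join(f"Input: {ex['text']}\nOutput: {ex['label']}" for ex in examples)
    return f"{instruction}\n{demos}\nInput: {query}\nOutput:"


train = [  # hypothetical usage-classification examples
    {"text": "patient takes fish oil daily", "label": "continuing"},
    {"text": "stopped taking melatonin last week", "label": "discontinued"},
    {"text": "started vitamin d yesterday", "label": "started"},
]
query = "patient takes vitamin d daily"
print(build_prompt("Classify the supplement usage status.", query, retrieve(query, train)))
```

A dense sentence encoder could replace the word-count similarity without changing the rest of the pipeline.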

[NLP-56] Deep Sparse Latent Feature Models for Knowledge Graph Completion

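TL;DR: This paper addresses knowledge graph completion (KGC) for large-scale knowledge graphs (KGs), where the intricate interconnections between entities are often overlooked. The key is a new framework of sparse latent feature models optimized through a deep variational autoencoder (VAE). The method not only completes missing triples effectively but also reveals latent community structures and produces interpretable representations, significantly improving performance on the WN18RR, FB15k-237, and Wikidata5M datasets.

Link: https://arxiv.org/abs/2411.15694
Authors: Haotian Li, Rui Zhang, Lingzhi Wang, Bin Yu, Youwei Wang, Yuliang Wei, Kai Wang, Richard Yi Da Xu, Bailing Wang
Keywords: knowledge graph completion, large-scale knowledge graphs, Recent progress, knowledge graph, graph completion
Categories: Computation and Language (cs.CL)
Comments: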

Abstract:Recent progress in knowledge graph completion (KGC) has focused on text-based approaches to address the challenges of large-scale knowledge graphs (KGs). Despite their achievements, these methods often overlook the intricate interconnections between entities, a key aspect of the underlying topological structure of a KG. Stochastic blockmodels (SBMs), particularly the latent feature relational model (LFRM), offer robust probabilistic frameworks that can dynamically capture latent community structures and enhance link prediction. In this paper, we introduce a novel framework of sparse latent feature models for KGC, optimized through a deep variational autoencoder (VAE). Our approach not only effectively completes missing triples but also provides clear interpretability of the latent structures, leveraging textual information. Comprehensive experiments on the WN18RR, FB15k-237, and Wikidata5M datasets show that our method significantly improves performance by revealing latent communities and producing interpretable representations.

[NLP-57] Ontology-Constrained Generation of Domain-Specific Clinical Summaries

TL;DR: This paper addresses two main problems that generative large language models (LLMs) face when summarizing in specific domains such as medicine: the summaries lack domain-specific information, and the generated content contains hallucinations. The key is to use ontologies to guide generation: an ontology-guided constrained decoding process improves domain relevance while reducing hallucinations. Applied to summarizing Electronic Health Records (EHRs) in the medical domain and evaluated on the MIMIC-III dataset, the method produces better domain-adapted summaries of clinical notes with fewer hallucinations.

Link: https://arxiv.org/abs/2411.15666
Authors: Gaya Mehenni, Amal Zouaq
Keywords: Large Language Models, Large Language, Language Models, offer promising solutions, offer promising
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands

Abstract:Large Language Models (LLMs) offer promising solutions for text summarization. However, some domains require specific information to be available in the summaries. Generating these domain-adapted summaries is still an open challenge. Similarly, hallucinations in generated content are a major drawback of current approaches, preventing their deployment. This study proposes a novel approach that leverages ontologies to create domain-adapted summaries both structured and unstructured. We employ an ontology-guided constrained decoding process to reduce hallucinations while improving relevance. When applied to the medical domain, our method shows potential in summarizing Electronic Health Records (EHRs) across different specialties, allowing doctors to focus on the most relevant information to their domain. Evaluation on the MIMIC-III dataset demonstrates improvements in generating domain-adapted summaries of clinical notes and hallucination reduction.
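
Constrained decoding in general works by restricting which tokens may be chosen at a given step. The sketch below shows that generic mechanism with a single greedy step; treating the constraint as "only tokens that match ontology terms" is an assumption for illustration, and the vocabulary and terms are invented. The abstract does not describe how the paper derives its constraints from the ontology.

```python
import numpy as np


def constrained_greedy_step(logits, allowed_ids):
    """Pick the highest-scoring token among those the constraint allows."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return int(np.argmax(masked))


vocab = ["the", "hypertension", "banana", "aspirin"]   # toy vocabulary
ontology_terms = {"hypertension", "aspirin"}           # hypothetical ontology concepts
allowed = [i for i, tok in enumerate(vocab) if tok in ontology_terms]

logits = np.array([2.0, 1.2, 3.5, 0.7])
print(vocab[int(np.argmax(logits))])                     # banana
print(vocab[constrained_greedy_step(logits, allowed)])   # hypertension
```

Masking disallowed tokens to negative infinity guarantees they are never selected, which is how a constraint can rule out off-domain or unsupported terms.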

[NLP-58] Improving Next Tokens via Second-Last Predictions with Generate and Refine

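TL;DR: This paper aims to improve the accuracy of next-token prediction in generative models such as GPT. The key is to train a decoder-only model to predict the second-last token of a sequence, using a structured, deterministic masking approach that makes training more computationally efficient than BERT-style models. A "generate-then-refine" strategy then combines the second-last-token prediction with a standard GPT's next-token prediction. Across different GPT-2 variants and datasets, second-last-token prediction is more than 15% more accurate than ordinary next-token prediction, and the combined approach yields smaller but consistent and significant gains in next-token prediction.

Link: https://arxiv.org/abs/2411.15661
Authors: Johannes Schneider
Keywords: Autoregressive language models, Autoregressive language, BERT are trained, predicting masked tokens, trained on tasks
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: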

Abstract:Autoregressive language models like GPT aim at predicting next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture for predicting the second last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models by employing a structured deterministic approach towards masking tokens. We use our model to improve the next token predictions of a standard GPT by combining both predictions in a “generate-then-refine” approach. We show on different variants of GPT-2 and different datasets that (not unexpectedly) second last token predictions are much more accurate, i.e., more than 15% higher accuracy than ordinary next token predictors. The “generate-then-refine” approach also demonstrates notable improvements in next-token predictions, yielding smaller yet consistent and significant gains.

[NLP-59] AfriMed-QA: A Pan-African Multi-Specialty Medical Question-Answering Benchmark Dataset

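TL;DR: This paper asks how large language models (LLMs) can improve healthcare access and reduce costs in low- and middle-income countries (LMICs) that face physician shortages and a lack of specialists. It introduces AfriMed-QA, the first large-scale Pan-African English multi-specialty medical question-answering (QA) dataset, with 15,000 questions (open- and closed-ended) sourced from over 60 medical schools across 16 countries and covering 32 medical specialties. An evaluation of 30 LLMs along multiple axes, including correctness and demographic bias, finds significant performance variation across specialties and geographies, with MCQ performance clearly lagging USMLE (MedQA). Biomedical LLMs underperform general models, and smaller edge-friendly LLMs struggle to reach a passing score. Interestingly, human evaluations show a consistent consumer preference for LLM answers and explanations over clinician answers.

Link: https://arxiv.org/abs/2411.15640
Authors: Tobi Olatunji, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, Bonaventure F. P. Dossou, Moshood Yekini, Jonas Kemp, Katherine Heller, Jude Chidubem Omeke, Chidi Asuzu MD, Naome A. Etori, Aimérou Ndiaye, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael Best, Irfan Essa, Stephen Edward Moore, Chris Fourie, Mercy Nyamewaa Asiedu
Keywords: Recent advancements, benchmarks have stimulated, patients globally, stimulated interest, providers and patients
Categories: Computation and Language (cs.CL)
Comments: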

Abstract:Recent advancements in large language model (LLM) performance on medical multiple choice question (MCQ) benchmarks have stimulated interest from healthcare providers and patients globally. Particularly in low- and middle-income countries (LMICs) facing acute physician shortages and lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, their effectiveness in the Global South, especially across the African continent, remains to be established. In this work, we introduce AfriMed-QA, the first large scale Pan-African English multi-specialty medical Question-Answering (QA) dataset, 15,000 questions (open and closed-ended) sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We further evaluate 30 LLMs across multiple axes including correctness and demographic bias. Our findings show significant performance variation across specialties and geographies, and MCQ performance clearly lags USMLE (MedQA). We find that biomedical LLMs underperform general models and smaller edge-friendly LLMs struggle to achieve a passing score. Interestingly, human evaluations show a consistent consumer preference for LLM answers and explanations when compared with clinician answers.

[NLP-60] “All that Glitters”: Approaches to Evaluations with Unreliable Model and Human Annotations

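TL;DR: This paper addresses the problem that "gold" and "ground truth" human labels contain errors, so that evaluation metrics may fail to reflect label quality and model performance accurately. The key is a set of novel evaluation approaches that assess label quality and model behavior along six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. The study first shows that, with poor labels, standard metrics can mask the true quality of both labels and models: encoder models appear to reach "super-human" results on classroom annotation tasks, but more rigorous evaluation reveals spurious correlations and nonrandom racial biases. It then extends these methods to estimate how model use would affect human label quality in a human-in-the-loop setting, and identifies areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.

Link: https://arxiv.org/abs/2411.15634
Authors: Michael Hardy
Keywords: ground truth, Gold, quality, model, label quality
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 20 pages, 15 figures, 58 pages with references and appendices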

Abstract:“Gold” and “ground truth” human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families–encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieve state-of-the-art, even “super-human”, results across all classroom annotation tasks. But not all these positive results remain after using more rigorous evaluation measures which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change to human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.

[NLP-61] Multi-label Sequential Sentence Classification via Large Language Model EMNLP2024

TL;DR: This paper addresses the limits that model size, sequence length, and the single-label setting place on sequential sentence classification (SSC) in scientific publications. The key is LLM-SSC, a large language model (LLM)-based framework that generates SSC labels through designed prompts, which combine demonstrations with a query describing the prediction target to improve task understanding. The paper also introduces a multi-label contrastive learning loss with an auto-weighting scheme to support multi-label classification, and releases biorc800, a new biomedical-domain dataset, to support the multi-label SSC analysis.

Link: https://arxiv.org/abs/2411.15623
Authors: Mengfei Lan, Lecheng Zheng, Shufan Ming, Halil Kilicoglu
Keywords: Sequential sentence classification, fine-grained information retrieval, Sequential sentence, supporting downstream tasks, extractive summarization
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024

Abstract:Sequential sentence classification (SSC) in scientific publications is crucial for supporting downstream tasks such as fine-grained information retrieval and extractive summarization. However, current SSC methods are constrained by model size, sequence length, and single-label setting. To address these limitations, this paper proposes LLM-SSC, a large language model (LLM)-based framework for both single- and multi-label SSC tasks. Unlike previous approaches that employ small- or medium-sized language models, the proposed framework utilizes LLMs to generate SSC labels through designed prompts, which enhance task understanding by incorporating demonstrations and a query to describe the prediction target. We also present a multi-label contrastive learning loss with auto-weighting scheme, enabling the multi-label classification task. To support our multi-label SSC analysis, we introduce and release a new dataset, biorc800, which mainly contains unstructured abstracts in the biomedical domain with manual annotations. Experiments demonstrate LLM-SSC’s strong performance in SSC under both in-context learning and task-specific tuning settings. We release biorc800 and our code at: this https URL.

[NLP-62] A Survey on LLM-as-a-Judge

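TL;DR: This paper addresses how to build reliable LLM-as-a-Judge systems, in which large language models (LLMs) act as evaluators. The key lies in improving consistency, mitigating biases, and adapting to diverse assessment scenarios. The survey presents strategies for enhancing reliability and proposes methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark, offering a foundational reference for researchers and practitioners.

Link: https://arxiv.org/abs/2411.15594
Authors: Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, Jian Guo
Keywords: challenging task due, Large Language Models, Accurate and consistent, inherent subjectivity, crucial for decision-making
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 33 pages, 9 figures. arXiv admin note: text overlap with arXiv:2310.05470 by other authors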

Abstract:Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of “LLM-as-a-Judge,” where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discussed practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

[NLP-63] Transparent but Powerful: Explainability Accuracy and Generalizability in ADHD Detection from Social Media Data

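TL;DR: This paper addresses the underdiagnosis of attention-deficit/hyperactivity disorder (ADHD), in particular through scalable, non-invasive screening based on social media data. The key is to analyze linguistic patterns in social media text using Natural Language Processing (NLP) and Machine Learning (ML). The paper compares shallow machine learning models with deep learning models (BiLSTM and transformer-based models), assessing their performance and interpretability for ADHD detection. BiLSTM offers a good balance of transparency and accuracy, and cross-platform data from Reddit and Twitter reveal key linguistic features associated with ADHD that could contribute to more effective digital screening tools.

Link: https://arxiv.org/abs/2411.15586
Authors: D. Wiechmann, E. Kempa, E. Kerz, Y. Qiao
Keywords: remains severely underdiagnosed, prevalent mental health, mental health condition, health condition affecting, Natural Language Processing
Categories: Computation and Language (cs.CL)
Comments: 12 pages (including references and appendix)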

Abstract:Attention-deficit/hyperactivity disorder (ADHD) is a prevalent mental health condition affecting both children and adults, yet it remains severely underdiagnosed. Recent advances in artificial intelligence, particularly in Natural Language Processing (NLP) and Machine Learning (ML), offer promising solutions for scalable and non-invasive ADHD screening methods using social media data. This paper presents a comprehensive study on ADHD detection, leveraging both shallow machine learning models and deep learning approaches, including BiLSTM and transformer-based models, to analyze linguistic patterns in ADHD-related social media text. Our results highlight the trade-offs between interpretability and performance across different models, with BiLSTM offering a balance of transparency and accuracy. Additionally, we assess the generalizability of these models using cross-platform data from Reddit and Twitter, uncovering key linguistic features associated with ADHD that could contribute to more effective digital screening tools.

[NLP-64] From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars COLING2025

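TL;DR: This paper addresses how to evaluate and improve the ability of language models to handle low-resource languages, specifically their ability to extract and classify information from complex descriptions in linguistic grammars. The key is a set of benchmarks covering 248 languages across 142 language families, focusing on typological features from WALS and Grambank. The paper also presents a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. The benchmarks offer the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling natural language processing (NLP) to low-resource languages.

Link: https://arxiv.org/abs/2411.15577
Authors: Albert Kornilov, Tatiana Shavrina
Keywords: demonstrated significant improvements, Recent advances, including in-context learning, extremely under-resourced languages, zero-shot capabilities
Categories: Computation and Language (cs.CL)
Comments: submitted to COLING 2025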

Abstract:Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first comprehensive evaluation of language models’ in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at this https URL.
MSC classes: 68-06, 68T50, 68T01; ACM classes: G.3; I.2.7

[NLP-65] Do LLMs Agree on the Creativity Evaluation of Alternative Uses?

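[Quick read]: This paper asks whether large language models (LLMs) evaluate creative responses to the Alternative Uses Test (AUT) consistently and impartially. The approach uses an oracle benchmark set of AUT responses categorized by creativity level (common, creative, and highly creative), which four state-of-the-art LLMs score and rank. Across two evaluation settings (comprehensive and segmented), the results show high inter-model agreement, with Spearman correlations averaging above 0.7 across models and exceeding 0.77 with respect to the oracle. The models do not favour their own responses; they give similar creativity scores or rankings to responses generated by other models. The authors take these findings as support for the reliability and impartiality of LLMs in creativity evaluation, with promising implications for automated creativity assessment.

Link: https://arxiv.org/abs/2411.15560
Authors: Abdullah Al Rabeyah,Fabrício Góes,Marco Volpe,Talles Medeiros
Keywords: large language models, investigates whether large, large language, creativity, LLMs
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 7 figures, 15 tables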

Abstract:This paper investigates whether large language models (LLMs) show agreement in assessing creativity in responses to the Alternative Uses Test (AUT). While LLMs are increasingly used to evaluate creative content, previous studies have primarily focused on a single model assessing responses generated by the same model or humans. This paper explores whether LLMs can impartially and accurately evaluate creativity in outputs generated by both themselves and other models. Using an oracle benchmark set of AUT responses, categorized by creativity level (common, creative, and highly creative), we experiment with four state-of-the-art LLMs evaluating these outputs. We test both scoring and ranking methods and employ two evaluation settings (comprehensive and segmented) to examine if LLMs agree on the creativity evaluation of alternative uses. Results reveal high inter-model agreement, with Spearman correlations averaging above 0.7 across models and reaching over 0.77 with respect to the oracle, indicating a high level of agreement and validating the reliability of LLMs in creativity assessment of alternative uses. Notably, models do not favour their own responses, instead they provide similar creativity assessment scores or rankings for alternative uses generated by other models. These findings suggest that LLMs exhibit impartiality and high alignment in creativity evaluation, offering promising implications for their use in automated creativity assessment.
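
The agreement figures above are Spearman rank correlations. As a rough illustration of what that statistic measures, here is a minimal sketch that computes it for two lists of scores. The function names and the sample scores are hypothetical, not taken from the paper, and the code says nothing about how the authors ran their own analysis.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical creativity scores from two raters for the same five responses.
rater_a = [1, 3, 2, 5, 4]
rater_b = [2, 3, 1, 5, 4]
agreement = spearman(rater_a, rater_b)
```

Because only ranks enter the computation, two raters who use different score scales but order the responses the same way still get a correlation of 1.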

[NLP-66] QEQR: An Exploration of Query Expansion Methods for Question Retrieval in CQA Services

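[Quick read]: This paper addresses the difficulty of retrieving similar questions in Community Question Answering (CQA) services caused by the lexical gap between relevant questions. The approach applies query expansion to the question a user submits: word-similarity-based methods, a proposed question-similarity-based method, and selective expansion with these methods. The best method achieves a significant relative improvement of 1.8% over the best-performing baseline without query expansion.

Link: https://arxiv.org/abs/2411.15530
Authors: Yasin Ghafourian,Sajad Movahedi,Azadeh Shakery
Keywords: CQA services, valuable sources, sources of knowledge, find answers, answers to users’
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: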

Abstract:CQA services are valuable sources of knowledge that can be used to find answers to users’ information needs. In these services, question retrieval aims to help users with their information needs by finding similar questions to theirs. However, finding similar questions is obstructed by the lexical gap that exists between relevant questions. In this work, we target this problem by using query expansion methods. We use word-similarity-based methods, propose a question-similarity-based method and selective expansion of these methods to expand a question that’s been submitted and mitigate the lexical gap problem. Our best method achieves a significant relative improvement of 1.8% compared to the best-performing baseline without query expansion.
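
To make the idea of word-similarity-based query expansion concrete, here is a minimal sketch. The similarity table, the function name, and the whitespace tokenization are all assumptions for illustration; the abstract does not specify how the paper derives similar words or how it selects which expansions to keep.

```python
# Hypothetical neighbour table. A real system would more likely derive
# neighbours from word embeddings or a thesaurus.
SIMILAR_WORDS = {
    "laptop": ["notebook"],
    "fix": ["repair"],
}


def expand_query(question, similar_words=SIMILAR_WORDS, max_per_word=1):
    """Append up to max_per_word similar words for each token in the question."""
    tokens = question.lower().split()
    expanded = list(tokens)
    for token in tokens:
        for neighbour in similar_words.get(token, [])[:max_per_word]:
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded


expanded_terms = expand_query("fix laptop")
```

The expanded term list can then match a stored question phrased as "repair notebook", which shares no words with the original query.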

[NLP-67] Enhancing Grammatical Error Detection using BERT with Cleaned Lang-8 Dataset

[Quick read]: This paper addresses Grammatical Error Detection (GED), a challenging and important problem. The approach rests on rigorous data cleaning and fine-tuning Transformer-based models. The authors rigorously clean the Lang8 dataset and fine-tune BERT-base-uncased, reporting an F1 score of 0.91, with accuracy of 98.49% on training data and 90.53% on testing data. Larger models (BERT-large-uncased and RoBERTa-large) gave no noticeable improvement, which suggests that, for this task, data quality matters more than model scale.

Link: https://arxiv.org/abs/2411.15523
Authors: Rahul Nihalani,Kushal Shah
Keywords: Grammatical Error Detection, improved LLM based, Error Detection, Grammatical Error, equally important problem
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 tables, 20 references

Abstract:This paper presents an improved LLM based model for Grammatical Error Detection (GED), which is a very challenging and equally important problem for many applications. The traditional approach to GED involved hand-designed features, but recently, Neural Networks (NN) have automated the discovery of these features, improving performance in GED. Traditional rule-based systems have an F1 score of 0.50-0.60 and earlier machine learning models give an F1 score of 0.65-0.75, including decision trees and simple neural networks. Previous deep learning models, for example, Bi-LSTM, have reported F1 scores within the range from 0.80 to 0.90. In our study, we have fine-tuned various transformer models using the Lang8 dataset rigorously cleaned by us. In our experiments, the BERT-base-uncased model gave an impressive performance with an F1 score of 0.91 and accuracy of 98.49% on training data and 90.53% on testing data, also showcasing the importance of data cleaning. Increasing model size using BERT-large-uncased or RoBERTa-large did not give any noticeable improvements in performance or advantage for this task, underscoring that larger models are not always better. Our results clearly show how far rigorous data cleaning and simple transformer-based models can go toward significantly improving the quality of GED.
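
The abstract reports both F1 and accuracy. For readers who want to see how F1 is derived for a binary task such as "sentence contains an error or not", here is a minimal sketch. The labels below are made up, and the assumption that label 1 means "has an error" is ours, not the paper's.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = sentence has a grammatical error, 0 = clean.
gold_labels = [1, 1, 0, 0, 1]
model_labels = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(gold_labels, model_labels)
```

Unlike accuracy, F1 ignores true negatives, so it is less flattering when the positive class is rare.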

[NLP-68] MolMetaLM: a Physicochemical Knowledge-Guided Molecular Meta Language Model

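[Quick read]: This paper addresses the fact that existing molecular language models rely only on atom/bond symbols and overlook the important physical/chemical properties that molecules carry. It proposes MolMetaLM, a physicochemical knowledge-guided molecular meta language framework. The framework defines a molecule-specialized meta language paradigm, formatted as multiple S,P,O (subject, predicate, object) knowledge triples that share the same subject (the molecule), to strengthen learning of the semantic relationships between physicochemical knowledge and molecules. By introducing different molecular knowledge and noises, the paradigm generates tens of thousands of pretraining tasks, and the model performs well in large-scale benchmark evaluations covering property prediction, molecule generation, conformation inference, and molecular optimization. The authors present MolMetaLM as a new perspective on designing language models.

Link: https://arxiv.org/abs/2411.15500
Authors: Yifan Wu,Min Zeng,Yang Li,Yang Zhang,Min Li
Keywords: natural language processing, transfer the masked, language, language models transfer, masked language model
Categories: Emerging Technologies (cs.ET); Computation and Language (cs.CL)
Comments: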

Abstract:Most current molecular language models transfer the masked language model or image-text generation model from natural language processing to molecular field. However, molecules are not solely characterized by atom/bond symbols; they encapsulate important physical/chemical properties. Moreover, normal language models bring grammar rules that are irrelevant for understanding molecules. In this study, we propose a novel physicochemical knowledge-guided molecular meta language framework MolMetaLM. We design a molecule-specialized meta language paradigm, formatted as multiple S,P,O (subject, predicate, object) knowledge triples sharing the same S (i.e., molecule) to enhance learning the semantic relationships between physicochemical knowledge and molecules. By introducing different molecular knowledge and noises, the meta language paradigm generates tens of thousands of pretraining tasks. By recovering the token/sequence/order-level noises, MolMetaLM exhibits proficiency in large-scale benchmark evaluations involving property prediction, molecule generation, conformation inference, and molecular optimization. Through MolMetaLM, we offer a new insight for designing language models.

[NLP-69] Traditional Chinese Medicine Case Analysis System for High-Level Semantic Abstraction: Optimized with Prompt and RAG

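[Quick read]: This paper describes building a clinical case database for Traditional Chinese Medicine (TCM), with over 5,000 TCM clinical cases gathered by web scraping from multiple platforms, including 360doc. The key steps are data cleaning and structuring, with fields such as patient details, pathogenesis, syndromes, and annotations. The Baidu_ERNIE_Speed_128K API removes redundant information, and the DeepSeekv2 API generates the final answers in standard JSON format. Data recall is optimized with RAG and rerank techniques, and combining a two-stage retrieval method with Jieba keyword matching significantly improves the accuracy of model outputs.

Link: https://arxiv.org/abs/2411.15491
Authors: Peng Xu,Hongjin Wu,Jinle Wang,Rongjia Lin,Liwei Tan
Keywords: Traditional Chinese Medicine, Chinese Medicine, Traditional Chinese, TCM clinical cases, clinical case database
Categories: Computation and Language (cs.CL)
Comments: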

Abstract:This paper details a technical plan for building a clinical case database for Traditional Chinese Medicine (TCM) using web scraping. Leveraging multiple platforms, including 360doc, we gathered over 5,000 TCM clinical cases, performed data cleaning, and structured the dataset with crucial fields such as patient details, pathogenesis, syndromes, and annotations. Using the Baidu_ERNIE_Speed_128K API, we removed redundant information and generated the final answers through the DeepSeekv2 API, outputting results in standard JSON format. We optimized data recall with RAG and rerank techniques during retrieval and developed a hybrid matching scheme. By combining two-stage retrieval method with keyword matching via Jieba, we significantly enhanced the accuracy of model outputs.

[NLP-70] Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework Distilled Training and Meta-evaluation Benchmark

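[Quick read]: This paper addresses the cost and performance problems of automatically evaluating text-to-image generation quality. It proposes a task decomposition evaluation framework based on GPT-4o, which splits the complex evaluation task into simpler sub-tasks to reduce learning complexity and is used to automatically construct a new training dataset. On this dataset, the authors design training strategies that distill GPT-4o's evaluation capabilities into a 7B open-source multi-modal large language model (MLLM), MiniCPM-V-2.6. They also manually annotate a meta-evaluation benchmark with chain-of-thought explanations and quality scores to assess existing methods and the proposed model. Experiments show that the distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline, VIEScore, with over 4.6% improvement in Spearman and Kendall correlations with human judgments.

Link: https://arxiv.org/abs/2411.15488
Authors: Rong-Cheng Tu,Zi-Ao Ma,Tian Lan,Yuehao Zhao,Heyan Huang,Xian-Ling Mao
Keywords: made significant strides, Multi-modal Large Language, Large Language Models, generation has made, creating a pressing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: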

Abstract:Driven by the remarkable progress in diffusion models, text-to-image generation has made significant strides, creating a pressing demand for automatic quality evaluation of generated images. Current state-of-the-art automatic evaluation methods heavily rely on Multi-modal Large Language Models (MLLMs), particularly powerful commercial models like GPT-4o. While these models are highly effective, their substantial costs limit scalability in large-scale evaluations. Adopting open-source MLLMs is an alternative; however, their performance falls short due to significant limitations in processing multi-modal data compared to commercial MLLMs. To tackle these problems, we first propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset, where the complex evaluation task is decoupled into simpler sub-tasks, effectively reducing the learning complexity. Based on this dataset, we design innovative training strategies to effectively distill GPT-4o’s evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Furthermore, to reliably and comprehensively assess prior works and our proposed model, we manually annotate a meta-evaluation benchmark that includes chain-of-thought explanations alongside quality scores for generated images. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline, VIEScore, with over 4.6% improvement in Spearman and Kendall correlations with human judgments.

[NLP-71] Transition Network Analysis: A Novel Framework for Modeling Visualizing and Identifying the Temporal Patterns of Learners and Learning Processes

[Quick read]: This paper addresses modeling, visualizing, and identifying transition patterns in learning process data. It proposes Transition Network Analysis (TNA), a novel analytical framework that integrates Stochastic Process Mining and probabilistic graph representation, combining the relational and temporal aspects in a single lens. TNA supports centralities to capture important learning events, community finding to identify patterns of behavior, and clustering to reveal temporal patterns. In a case study, TNA revealed regulatory processes and identified important events and temporal patterns, and bootstrap validation established which transitions were significant.

Link: https://arxiv.org/abs/2411.15486
Authors: Mohammed Saqr,Sonsoles López-Pernas,Tiina Törmänen,Rogers Kaliisa,Kamila Misiejuk,Santtu Tikka
Keywords: Stochastic Process Mining, integrates Stochastic Process, Stochastic Process, Process Mining, learning process data
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted at Learning Analytics Knowledge (LAK '25)

Abstract:This paper proposes a novel analytical framework: Transition Network Analysis (TNA), an approach that integrates Stochastic Process Mining and probabilistic graph representation to model, visualize, and identify transition patterns in the learning process data. Combining the relational and temporal aspects into a single lens offers capabilities beyond either framework, including centralities to capture important learning events, community finding to identify patterns of behavior, and clustering to reveal temporal patterns. This paper introduces the theoretical and mathematical foundations of TNA. To demonstrate the functionalities of TNA, we present a case study with students (n=191) engaged in small-group collaboration to map patterns of group dynamics using the theories of co-regulation and socially-shared regulated learning. The analysis revealed that TNA could reveal the regulatory processes and identify important events, temporal patterns and clusters. Bootstrap validation established the significant transitions and eliminated spurious transitions. In doing so, we showcase TNA’s utility to capture learning dynamics and provide a robust framework for investigating the temporal evolution of learning processes. Future directions include advancing estimation methods, expanding reliability assessment, exploring longitudinal TNA, and comparing TNA networks using permutation tests.
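
A transition network of the kind described above can be pictured as a weighted directed graph whose edge weights are transition probabilities between coded events. The sketch below estimates those probabilities from event sequences by simple counting. The event labels and function name are hypothetical, and this is only the basic first-order Markov estimate, not the TNA estimation procedure itself.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next event | current event) from ordered event sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Count each adjacent pair (current, next) within a sequence.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    probabilities = {}
    for current, following_counts in counts.items():
        total = sum(following_counts.values())
        probabilities[current] = {
            following: n / total for following, n in following_counts.items()
        }
    return probabilities


# Hypothetical coded events from two collaboration sessions.
sessions = [
    ["plan", "discuss", "plan"],
    ["plan", "monitor"],
]
network = transition_probabilities(sessions)
```

Each outer key is a node, and each inner entry is a weighted outgoing edge, so the result can feed directly into graph measures such as centrality.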

[NLP-72] Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai ACL

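[Quick read]: This paper addresses how to instruction-tune large language models (LLMs) for low-resource languages, specifically Thai, in a data-efficient manner when data is scarce. It proposes a seed-data-free synthetic data generation framework that generates diverse topics, retrieves relevant contexts from Wikipedia, and creates instructions for various tasks (such as question answering, summarization, and conversation), yielding an instruction-tuning dataset with fluency, diversity, and cultural context. Experiments show that the best synthetic dataset, using only 5,000 instructions, achieves performance competitive with state-of-the-art Thai LLMs trained on hundreds of thousands of instructions.

Link: https://arxiv.org/abs/2411.15484
Authors: Parinthapat Pengpun,Can Udomcharoenchaikit,Weerayut Buaphet,Peerat Limkonchotiwat
Keywords: large language models, instruction-tuning large language, data-efficient manner, specifically focusing, language models
Categories: Computation and Language (cs.CL)
Comments: ACL-SRW 2024. Our code and dataset are publicly available at this https URL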

Abstract:We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at this https URL.

[NLP-73] Towards Robust Evaluation of Unlearning in LLMs via Data Transformations EMNLP2024

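[Quick read]: This paper addresses how to achieve reliable Machine Unlearning (MUL) in large language models (LLMs), so that a model forgets specific information (such as personally identifiable information, PII) without losing performance on regular tasks. It examines the robustness of existing MUL techniques, in particular the effect of data format transformations on forgetting: can an unlearned LLM recall forgotten information when the input format changes? Experiments on the TOFU dataset highlight the need to use diverse data formats to quantify unlearning in LLMs more reliably.

Link: https://arxiv.org/abs/2411.15477
Authors: Abhinav Joshi,Shaswati Saha,Divyaksh Shukla,Sriram Vema,Harsh Jhamtani,Manas Gaur,Ashutosh Modi
Keywords: Large Language Models, Large Language, Language Models, great success, wide range
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted at EMNLP 2024 Findings; 21 pages (5 page main content + references + appendix)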

Abstract:Large Language Models (LLMs) have shown to be a great success in a wide range of applications ranging from regular NLP-based use cases to AI agents. LLMs have been trained on a vast corpus of texts from various sources; despite the best efforts during the data pre-processing stage while training the LLMs, they may pick some undesirable information such as personally identifiable information (PII). Consequently, in recent times research in the area of Machine Unlearning (MUL) has become active, the main idea is to force LLMs to forget (unlearn) certain information (e.g., PII) without suffering from performance loss on regular tasks. In this work, we examine the robustness of the existing MUL techniques for their ability to enable leakage-proof forgetting in LLMs. In particular, we examine the effect of data transformation on forgetting, i.e., is an unlearned LLM able to recall forgotten information if there is a change in the format of the input? Our findings on the TOFU dataset highlight the necessity of using diverse data formats to quantify unlearning in LLMs more reliably.

[NLP-74] HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter

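[Quick read]: This paper addresses the evaluation of online hate speech detection models in real-world use, where systematic biases in evaluation datasets leave performance across languages and geographies unclear. It introduces HateDay, the first global hate speech dataset representative of social media settings, randomly sampled from all tweets posted on September 21, 2022 for eight languages and four English-speaking countries. Using HateDay, the study shows how the prevalence and composition of hate speech vary across languages and countries, and finds that evaluation on academic datasets overestimates real-world detection performance, especially for non-European languages. It also identifies models' inability to distinguish hate speech from offensive speech, and a misalignment between academic target focus and real-world target prevalence. The authors emphasize that future detection models need to be evaluated in real-world settings to address this global challenge.

Link: https://arxiv.org/abs/2411.15462
Authors: Manuel Tonneau,Diyi Liu,Niyati Malhotra,Scott A. Hale,Samuel P. Fraiberger,Victor Orozco-Olvera,Paul Röttger
Keywords: online content, hate speech, sea of online, online hate speech, large body
Categories: Computation and Language (cs.CL)
Comments: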

Abstract:To tackle the global challenge of online hate speech, a large body of research has developed detection models to flag hate speech in the sea of online content. Yet, due to systematic biases in evaluation datasets, detection performance in real-world settings remains unclear, let alone across geographies. To address this issue, we introduce HateDay, the first global hate speech dataset representative of social media settings, randomly sampled from all tweets posted on September 21, 2022 for eight languages and four English-speaking countries. Using HateDay, we show how the prevalence and composition of hate speech varies across languages and countries. We also find that evaluation on academic hate speech datasets overestimates real-world detection performance, which we find is very low, especially for non-European languages. We identify several factors explaining poor performance, including models’ inability to distinguish between hate and offensive speech, and the misalignment between academic target focus and real-world target prevalence. We finally argue that such low performance renders hate speech moderation with public detection models unfeasible, even in a human-in-the-loop setting which we find is prohibitively costly. Overall, we emphasize the need to evaluate future detection models from academia and platforms in real-world settings to address this global challenge.

[NLP-75] Efficient Ternary Weight Embedding Model: Bridging Scalability and Performance

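[Quick read]: This paper addresses the high memory and computational demands that embedding models face when deployed in resource-constrained environments. It proposes a novel finetuning framework for ternary-weight embedding models, introducing self-taught knowledge distillation to finalize the ternary weights of the linear layers, which reduces memory and computational overhead while maintaining high performance. Experiments show that the ternarized model has low memory usage and low latency at inference, and that combining it with Approximate Nearest Neighbor (ANN) search improves both accuracy and computational efficiency.

Link: https://arxiv.org/abs/2411.15438
Authors: Jiayi Chen,Chen Wu,Shaoqun Zhang,Nan Li,Liangjie Zhang,Qi Zhang
Keywords: enabling efficient semantic, natural language processing, efficient semantic search, enabling efficient, essential tools
Categories: Computation and Language (cs.CL)
Comments: Technical Report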

Abstract:Embedding models have become essential tools in both natural language processing and computer vision, enabling efficient semantic search, recommendation, clustering, and more. However, the high memory and computational demands of full-precision embeddings pose challenges for deployment in resource-constrained environments, such as real-time recommendation systems. In this work, we propose a novel finetuning framework to ternary-weight embedding models, which reduces memory and computational overhead while maintaining high performance. To apply ternarization to pre-trained embedding models, we introduce self-taught knowledge distillation to finalize the ternary-weights of the linear layers. With extensive experiments on public text and vision datasets, we demonstrated that without sacrificing effectiveness, the ternarized model consumes low memory usage and has low latency in the inference stage with great efficiency. In practical implementations, embedding models are typically integrated with Approximate Nearest Neighbor (ANN) search. Our experiments combining ternary embedding with ANN search yielded impressive improvement in both accuracy and computational efficiency. The repository is available at here.
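
To show what a ternary weight representation looks like, the sketch below maps a list of full-precision weights to codes in {-1, 0, +1} plus one shared scale. It uses a simple magnitude threshold, which is a swapped-in technique chosen for brevity: the paper determines ternary weights through self-taught knowledge distillation during finetuning, and none of the names or the 0.7 threshold ratio here come from it.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Map weights to codes in {-1, 0, +1} and return (codes, scale).

    Assumption: weights whose magnitude is at most threshold_ratio times
    the mean magnitude are zeroed; the rest keep only their sign.
    """
    mean_magnitude = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_magnitude
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, code in zip(weights, codes) if code != 0]
    # One shared scale approximates the magnitude of the surviving weights.
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [code * scale for code in codes]


codes, scale = ternarize([0.9, -0.8, 0.05, -0.1])
approx = dequantize(codes, scale)
```

Each code needs under two bits instead of 16 or 32, and multiplying by a ternary code reduces to an add, a subtract, or a skip, which is where the memory and latency savings come from.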

[NLP-76] Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts

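[Quick read]: This paper addresses knowledge editing for Vision LLMs (VLLMs) in lifelong scenarios: continuously correcting inaccurate knowledge, updating outdated information, and incorporating new data without retraining the model. It proposes LiveEdit, which has three main modules: 1) an editing expert generator, trained to independently produce low-rank experts for each editing instance in order to correct the relevant responses of the VLLM; 2) a hard filtering mechanism that uses visual semantic knowledge to coarsely eliminate experts that are visually irrelevant to the input query at inference time; 3) a soft routing mechanism based on textual semantic relevance that fuses the visually relevant experts. With these designs, LiveEdit shows significant advantages in lifelong VLLM editing scenarios.

Link: https://arxiv.org/abs/2411.15432
Authors: Qizhou Chen,Chengyu Wang,Dakan Wang,Taolin Zhang,Wangyue Li,Xiaofeng He
Keywords: update outdated information, Large Language Models, data into Large, Large Language, correct inaccurate knowledge
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: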

Abstract:Model editing aims to correct inaccurate knowledge, update outdated information, and incorporate new data into Large Language Models (LLMs) without the need for retraining. This task poses challenges in lifelong scenarios where edits must be continuously applied for real-world applications. While some editors demonstrate strong robustness for lifelong editing in pure LLMs, Vision LLMs (VLLMs), which incorporate an additional vision modality, are not directly adaptable to existing LLM editors. In this paper, we propose LiveEdit, a LIfelong Vision language modEl Edit to bridge the gap between lifelong LLM editing and VLLMs. We begin by training an editing expert generator to independently produce low-rank experts for each editing instance, with the goal of correcting the relevant responses of the VLLM. A hard filtering mechanism is developed to utilize visual semantic knowledge, thereby coarsely eliminating visually irrelevant experts for input queries during the inference stage of the post-edited model. Finally, to integrate visually relevant experts, we introduce a soft routing mechanism based on textual semantic relevance to achieve multi-expert fusion. For evaluation, we establish a benchmark for lifelong VLLM editing. Extensive experiments demonstrate that LiveEdit offers significant advantages in lifelong VLLM editing scenarios. Further experiments validate the rationality and effectiveness of each module design in LiveEdit.
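
The two-step expert selection described above (hard filtering on visual relevance, then soft routing on textual relevance) can be pictured with a small sketch. Everything below is an assumption made for illustration: the cosine similarity measure, the 0.8 threshold, the softmax weighting, and the dictionary layout of an expert are ours, not LiveEdit's actual design.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query_visual, query_text, experts, visual_threshold=0.8):
    """Return fusion weights for the experts that survive the visual filter."""
    # Hard filter: drop experts whose visual key is too far from the query.
    survivors = [
        e for e in experts if cosine(query_visual, e["visual"]) >= visual_threshold
    ]
    if not survivors:
        return {}
    # Soft routing: softmax over textual similarity of the survivors.
    scores = [cosine(query_text, e["text"]) for e in survivors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {e["name"]: w / total for e, w in zip(survivors, exps)}


# Hypothetical experts, each keyed by a visual and a textual embedding.
experts = [
    {"name": "A", "visual": [1.0, 0.0], "text": [1.0, 0.0]},
    {"name": "B", "visual": [0.0, 1.0], "text": [1.0, 0.0]},
    {"name": "C", "visual": [1.0, 0.1], "text": [0.0, 1.0]},
]
weights = route([1.0, 0.0], [1.0, 0.0], experts)
```

Expert B matches the query text perfectly but is removed by the hard filter because its visual key is unrelated, which is the point of filtering before routing.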

[NLP-77] Exploring Large Language Models for Multimodal Sentiment Analysis: Challenges Benchmarks and Future Directions

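[Quick read]: This paper addresses the adaptability and performance of large language models (LLMs) on Multimodal Aspect-Based Sentiment Analysis (MABSA). It constructs a benchmark to evaluate LLMs on MABSA tasks and compares them with state-of-the-art supervised learning methods. The results show that although LLMs demonstrate potential in multimodal understanding, they face significant challenges on MABSA, particularly in accuracy and inference time. Based on these findings, the paper discusses the limitations of current LLMs and outlines directions for future research to enhance their capabilities in multimodal sentiment analysis.

Link: https://arxiv.org/abs/2411.15408
Authors: Shezheng Song
Keywords: extract aspect terms, Aspect-Based Sentiment Analysis, aims to extract, including text, text and images
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: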

Abstract:Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect terms and their corresponding sentiment polarities from multimodal information, including text and images. While traditional supervised learning methods have shown effectiveness in this task, the adaptability of large language models (LLMs) to MABSA remains uncertain. Recent advances in LLMs, such as Llama2, LLaVA, and ChatGPT, demonstrate strong capabilities in general tasks, yet their performance in complex and fine-grained scenarios like MABSA is underexplored. In this study, we conduct a comprehensive investigation into the suitability of LLMs for MABSA. To this end, we construct a benchmark to evaluate the performance of LLMs on MABSA tasks and compare them with state-of-the-art supervised learning methods. Our experiments reveal that, while LLMs demonstrate potential in multimodal understanding, they face significant challenges in achieving satisfactory results for MABSA, particularly in terms of accuracy and inference time. Based on these findings, we discuss the limitations of current LLMs and outline directions for future research to enhance their capabilities in multimodal sentiment analysis.

[NLP-78] ML-SPEAK: A Theory-Guided Machine Learning Method for Studying and Predicting Conversational Turn-taking Patterns

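[Quick read]: This paper addresses predicting team dynamics from team members' personality traits. It develops a computational model of conversational turn-taking within self-organized teams. The model analyzes turn-taking patterns between team members, independent of content, to relate personality traits to team communication dynamics. Trained on conversational data, it learns the relationships between individual traits and speaking behaviors and can predict group-wide communication patterns from team trait composition alone. Compared with baselines, it predicts speaking turn sequences more accurately and can reveal new relationships between traits and communication patterns, offering a more data-driven and dynamic understanding of team processes and practical guidance for team staffing and training.

Link: https://arxiv.org/abs/2411.15405
Authors: Lisa R. O’Bryan,Madeline Navarro,Juan Segundo Hevia,Santiago Segarra
Keywords: team, personality traits remains, remains a fundamental, fundamental challenge, psychological sciences
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 64 pages, 9 figures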

Abstract:Predicting team dynamics from personality traits remains a fundamental challenge for the psychological sciences and team-based organizations. Understanding how team composition generates team processes can significantly advance team-based research along with providing practical guidelines for team staffing and training. Although the Input-Process-Output (IPO) model has been useful for studying these connections, the complex nature of team member interactions demands a more dynamic approach. We develop a computational model of conversational turn-taking within self-organized teams that can provide insight into the relationships between team member personality traits and team communication dynamics. We focus on turn-taking patterns between team members, independent of content, which can significantly influence team emergent states and outcomes while being objectively measurable and quantifiable. As our model is trained on conversational data from teams of given trait compositions, it can learn the relationships between individual traits and speaking behaviors and predict group-wide patterns of communication based on team trait composition alone. We first evaluate the performance of our model using simulated data and then apply it to real-world data collected from self-organized student teams. In comparison to baselines, our model is more accurate at predicting speaking turn sequences and can reveal new relationships between team member traits and their communication patterns. Our approach offers a more data-driven and dynamic understanding of team processes. By bridging the gap between individual personality traits and team communication patterns, our model has the potential to inform theories of team processes and provide powerful insights into optimizing team staffing and training.

[NLP-79] A Comparative Analysis of Transformer and LSTM Models for Detecting Suicidal Ideation on Reddit ICML

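[Quick read]: This paper addresses detecting suicidal ideation from user posts on social media platforms such as Reddit. It evaluates and compares deep learning Transformer-based models (BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA) and various Long Short-Term Memory (LSTM) based models. RoBERTa performed best, with an accuracy of 93.22% and an F1 score of 93.14%, followed by an LSTM model using attention and BERT embeddings, with an accuracy of 92.65% and an F1 score of 92.69%. The findings point to the potential of Transformer-based models for suicidal ideation detection, providing a path toward robust mental health monitoring tools for social media and improved suicide prevention efforts.

Link: https://arxiv.org/abs/2411.15404
Authors: Khalid Hasan,Jamil Saquer
Keywords: critical global health, global health problem, health problem involving, deaths yearly, young adults
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: 23rd IEEE International Conference on Machine Learning and Applications, ICMLA 2024 (camera-ready)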

Abstract:Suicide is a critical global health problem involving more than 700,000 deaths yearly, particularly among young adults. Many people express their suicidal thoughts on social media platforms such as Reddit. This paper evaluates the effectiveness of the deep learning transformer-based models BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA and various Long Short-Term Memory (LSTM) based models in detecting suicidal ideation from user posts on Reddit. Toward this objective, we curated an extensive dataset from diverse subreddits and conducted linguistic, topic modeling, and statistical analyses to ensure data quality. Our results indicate that each model could reach high accuracy and F1 scores, but among them, RoBERTa emerged as the most effective model with an accuracy of 93.22% and F1 score of 93.14%. An LSTM model that uses attention and BERT embeddings performed as the second best, with an accuracy of 92.65% and an F1 score of 92.69%. Our findings show that transformer-based models have the potential to improve suicide ideation detection, thereby providing a path to develop robust mental health monitoring tools from social media. This research, therefore, underlines the undeniable prospect of advanced techniques in Natural Language Processing (NLP) while improving suicide prevention efforts.

[NLP-80] ChatBCI: A P300 Speller BCI Leveraging Large Language Models for Improved Sentence Composition in Realistic Scenarios

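[Quick read]: This paper addresses the high keystroke demands of sentence composition with P300 speller brain-computer interfaces (BCIs), which increase time, cognitive load, and fatigue. It introduces ChatBCI, a P300 speller BCI that uses the zero-shot learning capabilities of large language models (LLMs) to reduce keystrokes and accelerate sentence composition. ChatBCI retrieves word suggestions through remote queries to the GPT-3.5 API and displays them as extra keys in a new graphical user interface (GUI). This significantly reduces keystrokes and time and raises the information transfer rate, both when copy-spelling self-composed sentences and when improvising sentences.

Link: https://arxiv.org/abs/2411.15395
Authors: Jiazhen Hong,Weinan Wang,Laleh Najafizadeh
Keywords: EEG signals, selecting target keys, visual stimuli, speller BCIs, selecting target
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: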

Abstract:P300 speller BCIs allow users to compose sentences by selecting target keys on a GUI through the detection of P300 component in their EEG signals following visual stimuli. Most P300 speller BCIs require users to spell words letter by letter, or the first few initial letters, resulting in high keystroke demands that increase time, cognitive load, and fatigue. This highlights the need for more efficient, user-friendly methods for faster sentence composition. In this work, we introduce ChatBCI, a P300 speller BCI that leverages the zero-shot learning capabilities of large language models (LLMs) to suggest words from user-spelled initial letters or predict the subsequent word(s), reducing keystrokes and accelerating sentence composition. ChatBCI retrieves word suggestions through remote queries to the GPT-3.5 API. A new GUI, displaying GPT-3.5 word suggestions as extra keys is designed. SWLDA is used for the P300 classification. Seven subjects completed two online spelling tasks: 1) copy-spelling a self-composed sentence using ChatBCI, and 2) improvising a sentence using ChatBCI’s word suggestions. Results demonstrate that in Task 1, on average, ChatBCI outperforms letter-by-letter BCI spellers, reducing time and keystrokes by 62.14% and 53.22%, respectively, and increasing information transfer rate by 198.96%. In Task 2, ChatBCI achieves 80.68% keystroke savings and a record 8.53 characters/min for typing speed. Overall, ChatBCI, by employing remote LLM queries, enhances sentence composition in realistic scenarios, significantly outperforming traditional spellers without requiring local model training or storage. ChatBCI’s (multi-) word predictions, combined with its new GUI, pave the way for developing next-generation speller BCIs that are efficient and effective for real-time communication, especially for users with communication and motor disabilities.

[NLP-81] From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set

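[Quick read]: This paper addresses the limitations of conventional automatic evaluation of large language models (LLMs), which in practice is run on a small, fixed set of test sets. The key to the solution is a method called "Specialist", which uses historical ratings on a test set to construct in-context learning (ICL) examples, thereby specializing a prompted Autorater to that particular test set. On fine-grained machine translation evaluation, the method substantially outperforms the state-of-the-art XCOMET metric, by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively.

Link: https://arxiv.org/abs/2411.15387
Authors: Mara Finkelstein,Dan Deutsch,Parker Riley,Juraj Juraska,Geza Kovacs,Markus Freitag
Keywords (EN): powerful and versatile, quickly become intractable, intractable at scale, scale and reliance, LLMs continue
Categories: Computation and Language (cs.CL)
Comments: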

Abstract:As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT’23 and WMT’24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method for designing automatic metrics across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
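
The idea of turning historical ratings into ICL examples can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record fields (`source`, `translation`, `rating`), the scalar 0-100 rating scale, the prompt wording, and the function names `build_icl_index` and `build_prompt` are all assumptions made for the example.

```python
from collections import defaultdict


def build_icl_index(historical_ratings):
    """Group past ratings by source segment (hypothetical record layout)."""
    index = defaultdict(list)
    for record in historical_ratings:
        index[record["source"]].append(record)
    return index


def build_prompt(index, source, translation, max_examples=3):
    """Build a rating prompt whose ICL examples share the same source segment."""
    lines = ["Rate the translation quality on a 0-100 scale."]
    for example in index.get(source, [])[:max_examples]:
        lines.append(f"Source: {example['source']}")
        lines.append(f"Translation: {example['translation']}")
        lines.append(f"Rating: {example['rating']}")
    # The new, unrated translation goes last; the Autorater completes the rating.
    lines.append(f"Source: {source}")
    lines.append(f"Translation: {translation}")
    lines.append("Rating:")
    return "\n".join(lines)
```

Selecting examples by exact source match is only one possible selection rule; the abstract does not say how examples are chosen.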

[NLP-82] On the Impact of Fine-Tuning on Chain-of-Thought Reasoning

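[Quick read]: This paper studies how task-specific fine-tuning affects the reasoning abilities of large language models (LLMs). The key is a systematic study of the effect of fine-tuning on reasoning, in particular on Chain-of-Thought (CoT) reasoning performance and on the faithfulness of CoT reasoning. Looking at the average effect across datasets, the study finds that the faithfulness of CoT reasoning decreases after fine-tuning, which points to potential shifts in the internal mechanisms of the LLMs.

Link: https://arxiv.org/abs/2411.15382
Authors: Elita Lobo,Chirag Agarwal,Himabindu Lakkaraju
Keywords (EN): Large language models, advanced natural language, natural language processing, showcasing advanced natural, Large language
Categories: Computation and Language (cs.CL)
Comments: This paper is a work in progress with findings based on limited evidence. Please exercise discretion when interpreting the findings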

Abstract:Large language models have emerged as powerful tools for general intelligence, showcasing advanced natural language processing capabilities that find applications across diverse domains. Despite their impressive performance, recent studies have highlighted the potential for significant enhancements in LLMs’ task-specific performance through fine-tuning strategies like Reinforcement Learning with Human Feedback (RLHF), supervised fine-tuning (SFT), and Quantized Low-Rank Adapters (Q-LoRA). However, previous works have shown that while fine-tuning offers significant performance gains, it also leads to challenges such as catastrophic forgetting and privacy and safety risks. Yet there has been little to no work on understanding the impact of fine-tuning on the reasoning capabilities of LLMs. Our research investigates the effect of fine-tuning on the reasoning abilities of LLMs, addressing critical questions regarding the impact of task-specific fine-tuning on overall reasoning capabilities, the influence of fine-tuning on Chain-of-Thought (CoT) reasoning performance, and the implications for the faithfulness of CoT reasoning. By exploring these dimensions, our study shows the impact of fine-tuning on LLM reasoning capabilities, where the faithfulness of CoT reasoning, on average across four datasets, decreases, highlighting potential shifts in internal mechanisms of the LLMs resulting from fine-tuning processes.

[NLP-83] Transforming NLU with Babylon: A Case Study in Development of Real-time Edge-Efficient Multi-Intent Translation System for Automated Drive-Thru Ordering

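[Quick read]: This paper addresses the challenges that real-time conversational AI agents face when performing Natural Language Understanding (NLU) in dynamic outdoor environments such as automated drive-thru systems. These include background noise, diverse accents, and multi-intent queries, together with strict latency and memory constraints on edge devices. The key to the solution is Babylon, a transformer-based architecture that treats NLU as an intent translation task, converting natural language input into sequences of regular language units ('transcodes') that encode both intents and slot information. This design lets Babylon handle multi-intent scenarios within a single dialogue turn. Babylon also incorporates an LSTM-based token pooling mechanism that preprocesses phoneme sequences, shortening the input for low-latency, low-memory edge deployment and improving robustness to errors in upstream Automatic Speech Recognition (ASR) output.

Link: https://arxiv.org/abs/2411.15372
Authors: Mostafa Varzaneh,Pooja Voladoddi,Tanmay Bakshi,Uma Gunturi
Keywords (EN): Natural Language Understanding, agents face challenges, Language Understanding, performing Natural Language, Automatic Speech Recognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 3 figures, 2 tables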

Abstract:Real-time conversational AI agents face challenges in performing Natural Language Understanding (NLU) in dynamic, outdoor environments like automated drive-thru systems. These settings require NLU models to handle background noise, diverse accents, and multi-intent queries while operating under strict latency and memory constraints on edge devices. Additionally, robustness to errors from upstream Automatic Speech Recognition (ASR) is crucial, as ASR outputs in these environments are often noisy. We introduce Babylon, a transformer-based architecture that tackles NLU as an intent translation task, converting natural language inputs into sequences of regular language units (‘transcodes’) that encode both intents and slot information. This formulation allows Babylon to manage multi-intent scenarios in a single dialogue turn. Furthermore, Babylon incorporates an LSTM-based token pooling mechanism to preprocess phoneme sequences, reducing input length and optimizing for low-latency, low-memory edge deployment. This also helps mitigate inaccuracies in ASR outputs, enhancing system robustness. While this work focuses on drive-thru ordering, Babylon’s design extends to similar noise-prone scenarios, e.g., ticketing kiosks. Our experiments show that Babylon achieves significantly better accuracy-latency-memory footprint trade-offs than typically employed NMT models like Flan-T5 and BART, demonstrating its effectiveness for real-time NLU in edge deployment settings.

[NLP-84] Exploring Facets of Language Generation in the Limit

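[Quick read]: This paper studies how, given a sequence of examples from an unknown target language, to generate new examples such that no incorrect examples are generated beyond some point. The key is the distinction between two modes of generation: uniform generation and non-uniform generation. The paper shows that every countable collection of languages admits a generator with the non-uniform generation property, but that no algorithm using only membership queries can generate non-uniformly, even for collections of just two languages. By introducing the notion of exhaustive generation, it further shows that a tradeoff between validity and breadth is inherent in generation. Finally, it considers uniform generation in a feedback model and completely characterizes, in terms of a complexity measure, the language collections for which uniform generation with feedback is possible.

Link: https://arxiv.org/abs/2411.15364
Authors: Moses Charikar,Chirag Pabbaraju
Keywords (EN): Kleinberg and Mullainathan, unknown target language, generation, target language, language
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages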

Abstract:The recent work of Kleinberg and Mullainathan [KM24] provides a concrete model for language generation in the limit: given a sequence of examples from an unknown target language, the goal is to generate new examples from the target language such that no incorrect examples are generated beyond some point. In sharp contrast to strong negative results for the closely related problem of language identification, they establish positive results for language generation in the limit for all countable collections of languages. Follow-up work by Raman and Tewari [RT24] studies bounds on the number of distinct inputs required by an algorithm before correct language generation is achieved – namely, whether this is a constant for all languages in the collection (uniform generation) or a language-dependent constant (non-uniform generation). We show that every countable language collection has a generator which has the stronger property of non-uniform generation in the limit. However, while the generation algorithm of [KM24] can be implemented using membership queries, we show that any algorithm cannot non-uniformly generate even for collections of just two languages, using only membership queries. We also formalize the tension between validity and breadth in the generation algorithm of [KM24] by introducing a definition of exhaustive generation, and show a strong negative result for exhaustive generation. Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit. Finally, inspired by algorithms that can choose to obtain feedback, we consider a model of uniform generation with feedback, completely characterizing language collections for which such uniform generation with feedback is possible in terms of a complexity measure of the collection. 

[NLP-85] PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models

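[Quick read]: This paper addresses unsupervised evaluation of the response quality of generative large language models (LLMs). The key is PPLqa, an easy-to-compute, language-independent, information-theoretic metric that measures response quality without ground-truth annotations or human supervision. PPLqa subsumes coherence and fluency (quality of writing) as well as relevance and consistency (appropriateness of the response), and can be used to rank generative language models in order to select the best one for a given task. It works better on long-form Q&A, removes the need for the annotation process required by ground-truth evaluation, and correlates well with human and LLM rankings.

Link: https://arxiv.org/abs/2411.15320
Authors: Gerald Friedland,Xin Huang,Yueying Cui,Vishaal Kapoor,Ashish Khetan,Sanjiv Das
Keywords (EN): Large Language Models, generative Large Language, generative language models, Large Language, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: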

Abstract:We propose PPLqa, an easy-to-compute, language-independent, information-theoretic metric to measure the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enable users to rank generative language models for quality of responses, so as to make a selection of the best model for a given task. Our single metric assesses LLMs with an approach that subsumes, but is not explicitly based on, coherence and fluency (quality of writing) and relevance and consistency (appropriateness of response) to the query. PPLqa performs as well as other related metrics, and works better with long-form Q&A. Thus, PPLqa enables bypassing the lengthy annotation process required for ground truth evaluations, and it also correlates well with human and LLM rankings.

[NLP-86] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

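[Quick read]: This paper addresses the evaluation of Multimodal Large Language Models (MLLMs). The key is a systematic summary and categorization of existing benchmark types, covering the evaluation of foundation capabilities, model self-analysis, and extended applications; a detailed description of the typical benchmark construction process, including data collection, annotation, and precautions; and a systematic account of the evaluation manner, comprising judge, metric, and toolkit. With these, the paper aims to give researchers a comprehensive framework for evaluating MLLMs effectively according to different needs and to inspire better evaluation methods, thereby advancing MLLM research.

Link: https://arxiv.org/abs/2411.15296
Authors: Chaoyou Fu,Yi-Fan Zhang,Shukang Yin,Bo Li,Xinyu Fang,Sirui Zhao,Haodong Duan,Xing Sun,Ziwei Liu,Liang Wang,Caifeng Shan,Ran He
Keywords (EN): Artificial General Intelligence, Multimodal Large Language, Large Language Models, General Intelligence, Artificial General
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Produced by MME+MMBench+LLaVA Teams. Project Page: this https URL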

Abstract:As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops multimodal perception and reasoning capabilities that are impressive, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarised benchmark types divided by the evaluation capabilities, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.

[NLP-87] Sycophancy in Large Language Models: Causes and Mitigations

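[Quick read]: This paper addresses sycophancy in large language models (LLMs), i.e. the tendency of a model to excessively agree with or flatter users, which undermines reliability and ethical deployment. The key is to measure and quantify sycophantic tendencies, analyze their relationship to other challenges such as hallucination and bias, and explore effective mitigation strategies. The main approaches covered are improved training data, novel fine-tuning methods, post-deployment control mechanisms, and decoding strategies. The paper also discusses the broader implications of sycophancy for AI alignment and proposes directions for future research.

Link: https://arxiv.org/abs/2411.15287
Authors: Lars Malmqvist
Keywords (EN): demonstrated remarkable capabilities, language processing tasks, Large language models, natural language processing, processing tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: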

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their tendency to exhibit sycophantic behavior - excessively agreeing with or flattering users - poses significant risks to their reliability and ethical deployment. This paper provides a technical survey of sycophancy in LLMs, analyzing its causes, impacts, and potential mitigation strategies. We review recent work on measuring and quantifying sycophantic tendencies, examine the relationship between sycophancy and other challenges like hallucination and bias, and evaluate promising techniques for reducing sycophancy while maintaining model performance. Key approaches explored include improved training data, novel fine-tuning methods, post-deployment control mechanisms, and decoding strategies. We also discuss the broader implications of sycophancy for AI alignment and propose directions for future research. Our analysis suggests that mitigating sycophancy is crucial for developing more robust, reliable, and ethically-aligned language models.

[NLP-88] BanglaEmbed: Efficient Sentence Embedding Models for a Low-Resource Language Using Cross-Lingual Distillation Techniques

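[Quick read]: This paper addresses the lack of efficient sentence embedding models for natural language understanding tasks in low-resource languages such as Bengali. The key is a lightweight cross-lingual knowledge distillation approach that distills knowledge from a pre-trained, high-performing English sentence embedding model to build lightweight sentence transformers for Bangla. The resulting models perform well on several downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. Their lightweight architecture and shorter inference time make them well suited to deployment in resource-constrained environments, offering a practical solution for NLP applications in low-resource languages.

Link: https://arxiv.org/abs/2411.15270
Authors: Muhammad Rafsan Kabir,Md. Mohibur Rahman Nabil,Mohammad Ashrafuzzaman Khan
Keywords (EN): require understanding natural, Sentence-level embedding, understanding natural language, require understanding, understanding natural
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted in ACAI 2024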

Abstract:Sentence-level embedding is essential for various tasks that require understanding natural language. Many studies have explored such embeddings for high-resource languages like English. However, low-resource languages like Bengali (a language spoken by almost two hundred and thirty million people) are still under-explored. This work introduces two lightweight sentence transformers for the Bangla language, leveraging a novel cross-lingual knowledge distillation approach. This method distills knowledge from a pre-trained, high-performing English sentence transformer. Proposed models are evaluated across multiple downstream tasks, including paraphrase detection, semantic textual similarity (STS), and Bangla hate speech detection. The new method consistently outperformed existing Bangla sentence transformers. Moreover, the lightweight architecture and shorter inference time make the models highly suitable for deployment in resource-constrained environments, making them valuable for practical NLP applications in low-resource languages.
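
The abstract does not spell out the distillation objective. A common form of cross-lingual distillation, assumed here purely for illustration, trains the student so that its embedding of a Bangla sentence matches the teacher's embedding of the parallel English sentence under a mean squared error loss. The function names and the toy vectors below are hypothetical.

```python
def mse(u, v):
    """Mean squared error between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_embeddings, student_embeddings):
    """Average MSE over parallel sentence pairs.

    teacher_embeddings[i]: teacher embedding of the English sentence i
    student_embeddings[i]: student embedding of its Bangla translation
    (an assumed pairing; the paper's actual objective may differ).
    """
    pairs = list(zip(teacher_embeddings, student_embeddings))
    return sum(mse(t, s) for t, s in pairs) / len(pairs)
```

Minimizing such a loss pulls the student's Bangla embeddings into the teacher's English embedding space, which is why a small student can inherit much of the teacher's behaviour.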

[NLP-89] ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models

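[Quick read]: This paper addresses the hallucination tendency of Large Vision Language Models (LVLMs) when understanding and responding to complex visual-textual contexts, particularly in real-world scenarios that demand high precision. The key is a lightweight, training-free method called ICT (Image-Object Cross-Level Trusted Intervention). ICT computes an intervention direction that shifts the model's attention towards different levels of visual information. During the forward pass, the intervention is applied to the attention heads that encode overall image information and fine-grained object details, which reduces over-reliance on language priors and thereby mitigates hallucination. The method performs strongly with a small amount of data and generalizes across datasets and models.

Link: https://arxiv.org/abs/2411.15268
Authors: Junzhe Chen,Tianshu Zhang,Shiyu Huang,Yuwei Niu,Linfeng Zhang,Lijie Wen,Xuming Hu
Keywords (EN): Large Vision Language, complex visual-textual contexts, recent breakthroughs achieved, Vision Language Models, Large Vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: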

Abstract:Despite the recent breakthroughs achieved by Large Vision Language Models (LVLMs) in understanding and responding to complex visual-textual contexts, their inherent hallucination tendencies limit their practical application in real-world scenarios that demand high levels of precision. Existing methods typically either fine-tune the LVLMs using additional data, which incurs extra costs in manual annotation and computational resources, or perform comparisons at the decoding stage, which may eliminate useful language priors for reasoning while introducing inference time overhead. Therefore, we propose ICT, a lightweight, training-free method that calculates an intervention direction to shift the model’s focus towards different levels of visual information, enhancing its attention to high-level and fine-grained visual details. During the forward pass stage, the intervention is applied to the attention heads that encode the overall image information and the fine-grained object details, effectively mitigating the over-reliance on language priors, and thereby alleviating hallucinations. Extensive experiments demonstrate that ICT achieves strong performance with a small amount of data and generalizes well across different datasets and models. Our code will be public.
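
A rough sketch of what "calculating an intervention direction" and applying it to a head's output could look like is given below. It assumes, without confirmation from the abstract, that the direction is the difference between mean head activations on trusted and untrusted inputs and that it is added with a scalar strength `alpha`; all names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def intervention_direction(trusted, untrusted):
    """Assumed direction: mean trusted activation minus mean untrusted one."""
    mean_trusted = mean_vector(trusted)
    mean_untrusted = mean_vector(untrusted)
    return [t - u for t, u in zip(mean_trusted, mean_untrusted)]


def apply_intervention(head_output, direction, alpha=1.0):
    """Shift one attention head's output along the direction at inference time."""
    return [h + alpha * d for h, d in zip(head_output, direction)]
```

Because the shift is a fixed vector added during the forward pass, no weights are updated, which is consistent with the method being described as training-free.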

[NLP-90] TPLogAD: Unsupervised Log Anomaly Detection Based on Event Templates and Key Parameters

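[Quick read]: This paper addresses anomaly detection in logs, in particular the failure of existing methods to capture the features and semantic information in log entries, which leads to missed and false alarms. The key is TPLogAD, a universal unsupervised log analysis method based on event templates and key parameters. TPLogAD uses two efficient semantic representation methods, itemplate2vec and para2vec, to detect anomalies in event templates and in parameters respectively, which previous work had not achieved. TPLogAD also avoids the interference of log diversity and dynamics with anomaly detection, improving detection accuracy. Experiments on four public log datasets show that TPLogAD outperforms existing log anomaly detection methods.

Link: https://arxiv.org/abs/2411.15250
Authors: Jiawei Lu,Chengrong Wu
Keywords (EN): Web service systems, Web service, anomaly detection, log anomaly detection, anomaly detection methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: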

Abstract:Log-system is an important mechanism for recording the runtime status and events of Web service systems, and anomaly detection in logs is an effective method of detecting problems. However, manual anomaly detection in logs is inefficient, error-prone, and unrealistic. Existing log anomaly detection methods either use the indexes of event templates, or form vectors by embedding the fixed string part of the template as a sentence, or use time parameters for sequence analysis. However, log entries often contain features and semantic information that cannot be fully represented by these methods, resulting in missed and false alarms. In this paper, we propose TPLogAD, a universal unsupervised method for analyzing unstructured logs, which performs anomaly detection based on event templates and key parameters. The itemplate2vec and para2vec included in TPLogAD are two efficient and easy-to-implement semantic representation methods for logs, detecting anomalies in event templates and parameters respectively, which has not been achieved in previous work. Additionally, TPLogAD can avoid the interference of log diversity and dynamics on anomaly detection. Our experiments on four public log datasets show that TPLogAD outperforms existing log anomaly detection methods.

[NLP-91] The Zamba2 Suite: Technical Report

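[Quick read]: This paper aims to address the performance bottlenecks of existing open-source models in inference latency, throughput, and memory efficiency. The key is the Zamba2 series, a suite of hybrid Mamba2-transformer models with 1.2B, 2.7B, and 7.4B parameters. By optimizing the architecture, the training datasets, and the training process (up to three trillion tokens), the models match the leading models of their class in performance while substantially improving inference efficiency. The paper also releases the weights of the Zamba2 models, their instruction-tuned variants, and the Zyda-2 dataset used for pretraining.

Link: https://arxiv.org/abs/2411.15242
Authors: Paolo Glorioso,Quentin Anthony,Yury Tokpanov,Anna Golubeva,Vasudev Shyam,James Whittington,Jonathan Pilault,Beren Millidge
Keywords (EN): achieving substantial gains, leading open-weights models, parameter hybrid, technical report, inference latency
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21/11/24 initial upload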

Abstract:In this technical report, we present the Zamba2 series – a suite of 1.2B, 2.7B, and 7.4B parameter hybrid Mamba2-transformer models that achieve state of the art performance against the leading open-weights models of their class, while achieving substantial gains in inference latency, throughput, and memory efficiency. The Zamba2 series builds upon our initial work with Zamba1-7B, optimizing its architecture, training and annealing datasets, and training for up to three trillion tokens. We provide open-source weights for all models of the Zamba2 series as well as instruction-tuned variants that are strongly competitive against comparable instruct-tuned models of their class. We additionally open-source the pretraining dataset, which we call Zyda-2, used to train the Zamba2 series of models. The models and datasets used in this work are openly available at this https URL

[NLP-92] BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

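[Quick read]: This paper addresses the poor adaptation of existing vision-language models (VLMs) such as CLIP to downstream biomedical image classification, where annotated data is limited, image contrasts are unintuitive, and visual features are nuanced. The key is BiomedCoOp, a new prompt learning framework that leverages semantic consistency with averaged prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. This enables efficient adaptation of BiomedCLIP and significantly improves the accuracy and generalizability of few-shot biomedical image classification.

Link: https://arxiv.org/abs/2411.15232
Authors: Taha Koleilat,Hojat Asgariandehkordi,Hassan Rivaz,Yiming Xiao
Keywords (EN): demonstrated substantial success, self-supervised representation learning, vision tasks, advancements in vision-language, demonstrated substantial
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 18 pages, 5 figures, 10 tables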

Abstract:Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp) intend to tackle these issues, but still fall short in generalizability. Meanwhile, explorations in prompt learning for biomedical image analysis are still highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code will be publicly available at this https URL.

[NLP-93] Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training BMVC’24

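[Quick read]: This paper addresses the difficulty of obtaining multimodal data in the medical domain, and the resulting data scarcity, which stem from privacy, sensitivity, and annotation complexity. The key is Uni-Mlip, a unified self-supervision framework that integrates cross-modality, uni-modality, and fused-modality self-supervision techniques at the data level and the feature level, and tailors uni-modal image self-supervision to the characteristics of medical images. Uni-Mlip significantly surpasses current state-of-the-art methods on three downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).

Link: https://arxiv.org/abs/2411.15207
Authors: Ameera Bawazir,Kebin Wu,Wenbin Li
Keywords (EN): Recent advancements, computer vision tasks, contrastive learning, computer vision, Recent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 2 figures, accepted by BMVC’24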

Abstract:Recent advancements in vision-language pre-training via contrastive learning have significantly improved performance across computer vision tasks. However, in the medical domain, obtaining multimodal data is often costly and challenging due to privacy, sensitivity, and annotation complexity. To mitigate data scarcity while boosting model performance, we introduce Uni-Mlip, a unified self-supervision framework specifically designed to enhance medical vision-language pre-training. Uni-Mlip seamlessly integrates cross-modality, uni-modality, and fused-modality self-supervision techniques at the data-level and the feature-level. Additionally, Uni-Mlip tailors uni-modal image self-supervision to accommodate the unique characteristics of medical images. Our experiments across datasets of varying scales demonstrate that Uni-Mlip significantly surpasses current state-of-the-art methods in three key downstream tasks: image-text retrieval, image classification, and visual question answering (VQA).

[NLP-94] Multimodal large language model for wheat breeding: a new exploration of smart breeding

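[Quick read]: This paper addresses the difficulty of mining knowledge from cross-domain multimodal data in crop breeding, in particular how to make efficient and accurate use of crop phenotyping data collected by UAV remote sensing. The key is a smart breeding tool: supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) are used to inject cross-domain knowledge into multimodal large language models (MLLMs), yielding multimodal large language models for wheat breeding (WBLMs). The paper evaluates WBLMs built on different pre-trained MLLMs (e.g., Qwen-VL, InternVL, Deepseek-VL). The model that combines SFT, RAG, and RLHF with InternVL2-8B performs best, notably in wheat yield prediction and in generating decision support answers for multiple tasks.

Link: https://arxiv.org/abs/2411.15203
Authors: Guofeng Yang,Yu Li,Yong He,Zhenjiang Zhou,Lingzhen Ye,Hui Fang,Yiqi Luo,Xuping Feng
Keywords (EN): UAV remote sensing, key technology, crop phenotyping data, UAV remote, achieve high-throughput
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: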

Abstract:UAV remote sensing technology has become a key technology in crop breeding, which can achieve high-throughput and non-destructive collection of crop phenotyping data. However, the multidisciplinary nature of breeding has brought technical barriers and efficiency challenges to knowledge mining. Therefore, it is important to develop a smart breeding goal tool to mine cross-domain multimodal data. Based on different pre-trained open-source multimodal large language models (MLLMs) (e.g., Qwen-VL, InternVL, Deepseek-VL), this study used supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) technologies to inject cross-domain knowledge into MLLMs, thereby constructing multiple multimodal large language models for wheat breeding (WBLMs). The above WBLMs were evaluated using the newly created evaluation benchmark in this study. The results showed that the WBLM constructed using SFT, RAG and RLHF technologies and InternVL2-8B has leading performance. Then, subsequent experiments were conducted using the WBLM. Ablation experiments indicated that the combination of SFT, RAG, and RLHF technologies can improve the overall generation performance, enhance the generated quality, balance the timeliness and adaptability of the generated answer, and reduce hallucinations and biases. The WBLM performed best in wheat yield prediction using cross-domain data (remote sensing, phenotyping, weather, germplasm) simultaneously, with R2 and RMSE of 0.821 and 489.254 kg/ha, respectively. Furthermore, the WBLM can generate professional decision support answers for phenotyping estimation, environmental stress assessment, target germplasm screening, cultivation technique recommendation, and seed price query tasks.
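
For readers unfamiliar with the two yield-prediction metrics quoted above, the following sketch shows how R2 (coefficient of determination) and RMSE are computed from observed and predicted values using their standard definitions. The function names are illustrative, and no data from the paper is used.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (e.g. kg/ha)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

RMSE is reported in the unit of the target, so 489.254 kg/ha reads directly as a typical yield error, while R2 is unitless and describes the share of variance explained.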

[NLP-95] Graph Neural Network-Based Entity Extraction and Relationship Reasoning in Complex Knowledge Graphs

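[Quick read]: This paper addresses entity extraction and relationship reasoning in knowledge graphs. The key is to model the complex structure of the knowledge graph with graph neural networks, specifically a graph convolutional network and a graph attention network. An end-to-end joint model performs efficient recognition and reasoning over entities and relationships. Experimental results show that the model generalizes better and is more stable on complex knowledge graphs, which supports further research on knowledge graphs and demonstrates the potential of graph neural networks for entity extraction and relationship reasoning.

Link: https://arxiv.org/abs/2411.15195
Authors: Junliang Du,Guiran Liu,Jia Gao,Xiaoxuan Liao,Jiacheng Hu,Linxiao Wu
Keywords (EN): reasoning algorithm based, relationship reasoning algorithm, relationship reasoning, extraction and relationship, graph convolutional network
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: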

Abstract:This study proposed a knowledge graph entity extraction and relationship reasoning algorithm based on a graph neural network, using a graph convolutional network and graph attention network to model the complex structure in the knowledge graph. By building an end-to-end joint model, this paper achieves efficient recognition and reasoning of entities and relationships. In the experiment, this paper compared the model with a variety of deep learning algorithms and verified its superiority through indicators such as AUC, recall rate, precision rate, and F1 value. The experimental results show that the model proposed in this paper performs well in all indicators, especially in complex knowledge graphs, it has stronger generalization ability and stability. This provides strong support for further research on knowledge graphs and also demonstrates the application potential of graph neural networks in entity extraction and relationship reasoning.
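
As background for the graph convolutional network mentioned above, here is a dependency-free sketch of a single GCN layer in its widely used symmetric-normalized form, H' = ReLU(D^-1/2 (A + I) D^-1/2 H W). It illustrates the building block only; the paper's joint model, its attention component, and its hyperparameters are not reproduced, and the function names are illustrative.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adjacency, features, weights):
    """One graph convolution: normalize, aggregate neighbours, project, ReLU."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization D^-1/2 (A + I) D^-1/2.
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    projected = matmul(matmul(norm, features), weights)
    return [[max(0.0, value) for value in row] for row in projected]
```

In a knowledge graph setting the nodes would be entities and the adjacency matrix would come from their relations; stacking several such layers lets information propagate over multi-hop neighbourhoods.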

[NLP-96] Guiding Word Equation Solving using Graph Neural Networks (Extended Technical Report)

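[Quick read]: This paper addresses the solving of word equations using the Nielsen transformation. The key is an algorithm guided by Graph Neural Networks (GNNs). The algorithm iteratively rewrites the first term of each side of an equation, producing a tree-like search space, and uses GNNs to make split decisions at each split point. Split decisions are encoded as multi-classification tasks, and five graph representations are introduced to encode the structural information of word equations for the GNNs. Experiments show that the algorithm performs particularly well on satisfiable problems: for single word equations, the DragonLi solver solves more problems than well-established string solvers, and for conjunctions of multiple word equations it is competitive with state-of-the-art string solvers.

Link: https://arxiv.org/abs/2411.15194
Authors: Parosh Aziz Abdulla,Mohamed Faouzi Atig,Julie Cailler,Chencheng Liang,Philipp Rümmer
Keywords (EN): well-known Nielsen transformation, Graph Neural Network-guided, Neural Network-guided algorithm, Neural Network-guided, Graph Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: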

Abstract:This paper proposes a Graph Neural Network-guided algorithm for solving word equations, based on the well-known Nielsen transformation for splitting equations. The algorithm iteratively rewrites the first terms of each side of an equation, giving rise to a tree-like search space. The choice of path at each split point of the tree significantly impacts solving time, motivating the use of Graph Neural Networks (GNNs) for efficient split decision-making. Split decisions are encoded as multi-classification tasks, and five graph representations of word equations are introduced to encode their structural information for GNNs. The algorithm is implemented as a solver named DragonLi. Experiments are conducted on artificial and real-world benchmarks. The algorithm performs particularly well on satisfiable problems. For single word equations, DragonLi can solve significantly more problems than well-established string solvers. For the conjunction of multiple word equations, DragonLi is competitive with state-of-the-art string solvers.
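
The tree-like search space produced by Nielsen-style splitting can be illustrated with a small depth-bounded solver. This is a sketch under stated assumptions, not DragonLi: uppercase characters are treated as variables, lowercase characters as letters, branches are tried in a fixed order (the place where the paper would use a GNN to rank them), and the `depth` budget is an arbitrary cut-off added so the example always terminates.

```python
def is_var(symbol):
    """Convention for this sketch: uppercase = variable, lowercase = letter."""
    return symbol.isupper()


def substitute(word, var, replacement):
    """Replace every occurrence of `var` in `word` by `replacement`."""
    out = []
    for symbol in word:
        if symbol == var:
            out.extend(replacement)
        else:
            out.append(symbol)
    return tuple(out)


def solve(lhs, rhs, depth=12):
    """Return True (satisfiable), False (refuted) or None (budget exhausted)."""
    lhs, rhs = tuple(lhs), tuple(rhs)
    # Strip identical leading symbols from both sides.
    while lhs and rhs and lhs[0] == rhs[0]:
        lhs, rhs = lhs[1:], rhs[1:]
    if not lhs and not rhs:
        return True
    if not lhs or not rhs:
        # One side is empty: solvable only if the rest can be erased.
        return all(is_var(symbol) for symbol in (lhs or rhs))
    left, right = lhs[0], rhs[0]
    if not is_var(left) and not is_var(right):
        return False  # two different letters clash
    if depth == 0:
        return None
    # Split point: each branch substitutes one leading variable.
    branches = []
    if is_var(left):
        branches += [(left, ()), (left, (right, left))]
    if is_var(right):
        branches += [(right, ()), (right, (left, right))]
    unknown = False
    for var, replacement in branches:
        result = solve(substitute(lhs, var, replacement),
                       substitute(rhs, var, replacement), depth - 1)
        if result is True:
            return True
        if result is None:
            unknown = True
    return None if unknown else False
```

For example, `solve("Xa", "aX")` finds a solution immediately, while `solve("aX", "Xb")` keeps regenerating the same equation and returns `None` once the budget runs out. This kind of unproductive path is what a learned branch ordering is meant to avoid.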

[NLP-97] Can Open-source LLMs Enhance Data Augmentation for Toxic Detection?: An Experimental Study

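[Quick Read]: This paper addresses the generation of high-quality, diverse harmful data for content moderation, specifically for toxic content detection. The key is to apply prompt engineering and fine-tuning to open-source large language models (LLMs) to strengthen harmful-data generation. The study runs in two stages: stage 1 evaluates six open-source LLMs across multiple datasets using prompt engineering only, and stage 2 focuses on fine-tuning. Mistral is found to generate harmful data with a low hallucination rate. Fine-tuning improves data quality and diversity, but data duplication and overfitting remain challenges. The results indicate that the approach is a scalable, cost-effective way to improve toxic content detection systems and show the potential of open-source LLMs for building robust content moderation tools.

Link: https://arxiv.org/abs/2411.15175
Authors: Zheng Hui,Zhaoxiao Guo,Hang Zhao,Juanyong Duan,Lin Ai,Yinheng Li,Julia Hirschberg,Congrui Huang
Keywords: toxic content detection, addressing real-time applications, toxic content, content detection, essential to addressing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract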

Abstract:High-quality, diverse harmful data is essential to addressing real-time applications in content moderation. Current state-of-the-art approaches to toxic content detection using GPT series models are costly and lack explainability. This paper investigates the use of prompt engineering and fine-tuning techniques on open-source LLMs to enhance harmful data augmentation specifically for toxic content detection. We conduct a two-stage empirical study, with stage 1 evaluating six open-source LLMs across multiple datasets using only prompt engineering and stage 2 focusing on fine-tuning. Our findings indicate that Mistral can excel in generating harmful data with minimal hallucination. While fine-tuning these models improves data quality and diversity, challenges such as data duplication and overfitting persist. Our experimental results highlight scalable, cost-effective strategies for enhancing toxic content detection systems. These findings not only demonstrate the potential of open-source LLMs in creating robust content moderation tools. The application of this method in real industrial scenarios further proves the feasibility and efficiency of the fine-tuned open-source LLMs for data augmentation. We hope our study will aid in understanding the capabilities and limitations of current models in toxic content detection and drive further advancements in this field.

[NLP-98] Kleene algebra with commutativity conditions is undecidable

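[Quick Read]: This paper addresses the decidability of the equational theory of Kleene algebra with commutativity conditions on primitives (atomic terms). The key result is a proof that this equational theory is undecidable, even for weaker theories that do not support the induction axioms of Kleene algebra. This settles a longstanding open question in the theory of Kleene algebra and agrees with the result obtained independently by Kuznetsov.

Link: https://arxiv.org/abs/2411.15979
Authors: Arthur Azevedo de Amorim,Cheng Zhang,Marco Gaboardi
Keywords: Kleene algebra, theory of Kleene, longstanding open question, Kleene
Categories: Logic (math.LO); Computational Complexity (cs.CC); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: Published at CSL 2025

Click to view abstract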

Abstract:We prove that the equational theory of Kleene algebra with commutativity conditions on primitives (or atomic terms) is undecidable, thereby settling a longstanding open question in the theory of Kleene algebra. While this question has also been recently solved independently by Kuznetsov, our results hold even for weaker theories that do not support the induction axioms of Kleene algebra.

[NLP-99] Bio-inspired AI: Integrating Biological Complexity into Artificial Intelligence

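[Quick Read]: This paper asks how to design more adaptable and robust AI systems. The key is to draw on fundamental principles of biological computation, in particular context-dependent hierarchical information processing, trial-and-error heuristics, and multi-scale organization. By examining the nuanced mechanisms of biological intelligence, such as top-down causality and adaptive interaction with the environment, the paper aims to expose potential limitations of current AI systems and to offer a biologically inspired framework for designing more intelligent and flexible ones.

Link: https://arxiv.org/abs/2411.15243
Authors: Nima Dehghani,Michael Levin
Keywords: mirrors our longstanding, creating artificial intelligence, pursuit of creating, longstanding fascination, fascination with understanding
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
Comments:

Click to view abstract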

Abstract:The pursuit of creating artificial intelligence (AI) mirrors our longstanding fascination with understanding our own intelligence. From the myths of Talos to Aristotelian logic and Heron’s inventions, we have sought to replicate the marvels of the mind. While recent advances in AI hold promise, singular approaches often fall short in capturing the essence of intelligence. This paper explores how fundamental principles from biological computation–particularly context-dependent, hierarchical information processing, trial-and-error heuristics, and multi-scale organization–can guide the design of truly intelligent systems. By examining the nuanced mechanisms of biological intelligence, such as top-down causality and adaptive interaction with the environment, we aim to illuminate potential limitations in artificial constructs. Our goal is to provide a framework inspired by biological systems for designing more adaptable and robust artificial intelligent systems.

Computer Vision

[CV-0] Generative Omnimatte: Learning to Decompose Video into Layers

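[Quick Read]: This paper addresses the poor performance of existing layered video decomposition (omnimatte) methods when the background is dynamic or pose and depth estimates are inaccurate, and in particular their lack of a generative prior for completing occluded dynamic regions. The key is a new generative layered video decomposition framework that does not rely on a static-scene assumption or on camera pose and depth information; instead, a video diffusion model is trained to identify and remove the scene effects caused by a specific object. The model is fine-tuned from an existing video inpainting model and produces high-quality decomposed layers, including plausible completions of occluded dynamic regions.

Link: https://arxiv.org/abs/2411.16683
Authors: Yao-Chih Lee,Erika Lu,Sarah Rumbley,Michal Geyer,Jia-Bin Huang,Tali Dekel,Forrester Cole
Keywords: input object masks, semantically meaningful layers, omnimatte method aims, set of input, aims to decompose
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract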

Abstract:Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.

[CV-1] Factorized Visual Tokenization and Generation

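[Quick Read]: This paper addresses the training instability and limited performance gains that vector quantization (VQ)-based visual tokenizers face with large vocabularies. The key is Factorized Quantization (FQ), which decomposes a large codebook into multiple independent sub-codebooks, reducing lookup complexity and making visual tokenization more efficient and scalable. The paper also proposes a disentanglement regularization that reduces redundancy between sub-codebooks and promotes diversity, and it integrates representation learning, using pretrained vision models such as CLIP and DINO to enrich the learned representations so that the tokenizer captures multiple semantic levels and yields more expressive, disentangled representations. Experiments show that the FQGAN model substantially improves the reconstruction quality of visual tokenizers, reaches state-of-the-art performance, and is effective for auto-regressive image generation.

Link: https://arxiv.org/abs/2411.16681
Authors: Zechen Bai,Jianxiong Gao,Ziteng Gao,Pichao Wang,Zheng Zhang,Tong He,Mike Zheng Shou
Keywords: image generation, generation, auto-regressive image generation, image, Visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract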

Abstract:Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation. this https URL
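
The idea of replacing one large codebook with several small ones can be sketched as follows. This is an illustrative product-quantization-style example, not the FQGAN implementation. It assumes each sub-codebook quantizes its own slice of the feature vector, which may differ from how the paper derives per-codebook inputs, and the names `nearest` and `factorized_quantize` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x in squared Euclidean distance."""
    def sq_dist(code: Vector) -> float:
        return sum((c - v) ** 2 for c, v in zip(code, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def factorized_quantize(
    x: Vector, sub_codebooks: Sequence[Sequence[Vector]]
) -> Tuple[Tuple[int, ...], List[float]]:
    """Quantize consecutive slices of x, one slice per sub-codebook.

    Returns the tuple of sub-codebook indices (the composite token) and the
    reconstruction obtained by concatenating the selected entries.
    """
    indices: List[int] = []
    recon: List[float] = []
    start = 0
    for book in sub_codebooks:
        width = len(book[0])                  # slice width = entry dimension
        idx = nearest(book, x[start:start + width])
        indices.append(idx)
        recon.extend(book[idx])
        start += width
    return tuple(indices), recon


# Two sub-codebooks with 2 and 3 entries give 2 * 3 = 6 composite tokens,
# but a lookup compares against only 2 + 3 = 5 entries.
books = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (2.0, 2.0), (4.0, 4.0)]]
token, recon = factorized_quantize([0.9, 1.2, 3.8, 4.1], books)
print(token, recon)   # (1, 2) [1.0, 1.0, 4.0, 4.0]
```

With k sub-codebooks of size n each, the composite vocabulary grows as n^k while the lookup cost grows as k·n, which is the scaling argument behind the reduced lookup complexity mentioned in the abstract.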

[CV-2] Quark: Real-time High-resolution and General Neural View Synthesis SIGGRAPH

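[Quick Read]: This paper addresses high-resolution, real-time novel view synthesis. The key is a combination of several ideas: layered depth maps (LDMs) that efficiently represent scenes with complex depth and occlusions, an iterative learned render-and-refine approach, and update steps embedded in a multi-scale UNet-style architecture. A Transformer-based network component processes multi-view information in input image space, which improves efficiency. Finally, the internal 3D geometry is created and discarded dynamically for each frame, enabling real-time reconstruction and rendering. Together these yield an efficient, high-quality view synthesis algorithm.

Link: https://arxiv.org/abs/2411.16680
Authors: John Flynn,Michael Broxton,Lukas Murmann,Lucy Chai,Matthew DuVall,Clément Godard,Kathryn Heal,Srinivas Kaza,Stephen Lombardi,Xuan Luo,Supreeth Achar,Kira Prabhu,Tiancheng Sun,Lynn Tsai,Ryan Overbeck
Keywords: performing high-quality, neural algorithm, input RGB images, quality, RGB images
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: SIGGRAPH Asia 2024 camera ready version; project page this https URL

Click to view abstract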

Abstract:We present a novel neural algorithm for performing high-quality, high-resolution, real-time novel view synthesis. From a sparse set of input RGB images or videos streams, our network both reconstructs the 3D scene and renders novel views at 1080p resolution at 30fps on an NVIDIA A100. Our feed-forward network generalizes across a wide variety of datasets and scenes and produces state-of-the-art quality for a real-time method. Our quality approaches, and in some cases surpasses, the quality of some of the top offline methods. In order to achieve these results we use a novel combination of several key concepts, and tie them together into a cohesive and effective algorithm. We build on previous works that represent the scene using semi-transparent layers and use an iterative learned render-and-refine approach to improve those layers. Instead of flat layers, our method reconstructs layered depth maps (LDMs) that efficiently represent scenes with complex depth and occlusions. The iterative update steps are embedded in a multi-scale, UNet-style architecture to perform as much compute as possible at reduced resolution. Within each update step, to better aggregate the information from multiple input views, we use a specialized Transformer-based network component. This allows the majority of the per-input image processing to be performed in the input image space, as opposed to layer space, further increasing efficiency. Finally, due to the real-time nature of our reconstruction and rendering, we dynamically create and discard the internal 3D geometry for each frame, generating the LDM for each view. Taken together, this produces a novel and effective algorithm for view synthesis. Through extensive evaluation, we demonstrate that we achieve state-of-the-art quality at real-time rates. Project page: this https URL

[CV-3] Diffusion Features for Zero-Shot 6DoF Object Pose Estimation

[Quick Read]: This paper addresses zero-shot object pose estimation, i.e., retrieving object poses from images without object-specific training data. The key is to use a Latent Diffusion Model (LDM) backbone for feature extraction within a template-based, multi-stage method for zero-shot pose estimation. On three standard datasets, the method improves Average Recall by up to 27% over a Vision Transformer (ViT) baseline.

Link: https://arxiv.org/abs/2411.16668
Authors: Bernd Von Gimborn,Philipp Ausserlechner,Markus Vincze,Stefan Thalhammer
Keywords: Zero-shot object pose, object pose estimation, enables the retrieval, images without necessitating, necessitating object-specific training
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Zero-shot object pose estimation enables the retrieval of object poses from images without necessitating object-specific training. In recent approaches this is facilitated by vision foundation models (VFM), which are pre-trained models that are effectively general-purpose feature extractors. The characteristics exhibited by these VFMs vary depending on the training data, network architecture, and training paradigm. The prevailing choice in this field are self-supervised Vision Transformers (ViT). This study assesses the influence of Latent Diffusion Model (LDM) backbones on zero-shot pose estimation. In order to facilitate a comparison between the two families of models on a common ground we adopt and modify a recent approach. Therefore, a template-based multi-staged method for estimating poses in a zero-shot fashion using LDMs is presented. The efficacy of the proposed approach is empirically evaluated on three standard datasets for object-specific 6DoF pose estimation. The experiments demonstrate an Average Recall improvement of up to 27% over the ViT baseline. The source code is available at: this https URL.

[CV-4] Edge Weight Prediction For Category-Agnostic Pose Estimation

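[Quick Read]: This paper addresses a limitation of graph-based category-agnostic pose estimation: existing methods assume a static pose graph with equal-weight edges, which leads to suboptimal results. The key is EdgeCape, a new framework that predicts the edge weights of the pose graph to improve keypoint localization. The paper also integrates a Markovian Structural Bias, which modulates the self-attention interaction between nodes according to the number of hops between them and improves the model's ability to capture global spatial dependencies. On the MP-100 benchmark, EdgeCape achieves state-of-the-art results in the 1-shot setting and leads among similar-sized methods in the 5-shot setting, significantly improving keypoint localization accuracy.

Link: https://arxiv.org/abs/2411.16665
Authors: Or Hirschorn,Shai Avidan
Keywords: Category-Agnostic Pose Estimation, Pose Estimation, annotated support images, diverse object categories, diverse object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract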

Abstract:Category-Agnostic Pose Estimation (CAPE) localizes keypoints across diverse object categories with a single model, using one or a few annotated support images. Recent works have shown that using a pose graph (i.e., treating keypoints as nodes in a graph rather than isolated points) helps handle occlusions and break symmetry. However, these methods assume a static pose graph with equal-weight edges, leading to suboptimal results. We introduce EdgeCape, a novel framework that overcomes these limitations by predicting the graph’s edge weights which optimizes localization. To further leverage structural priors, we propose integrating Markovian Structural Bias, which modulates the self-attention interaction between nodes based on the number of hops between them. We show that this improves the model’s ability to capture global spatial dependencies. Evaluated on the MP-100 benchmark, which includes 100 categories and over 20K images, EdgeCape achieves state-of-the-art results in the 1-shot setting and leads among similar-sized methods in the 5-shot setting, significantly improving keypoint localization accuracy. Our code is publicly available.
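
The abstract does not give the exact form of the Markovian Structural Bias, so the following is only a rough sketch of one way hop counts could modulate self-attention. It assumes an additive penalty on the attention logits that grows linearly with hop distance. The functions `hop_distances` and `biased_attention` and the `penalty` parameter are hypothetical and are not taken from EdgeCape.

```python
import math
from collections import deque
from typing import List, Sequence, Tuple


def hop_distances(n: int, edges: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """All-pairs hop counts on an undirected pose graph via BFS from every node."""
    adj: List[List[int]] = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist: List[List[float]] = [[math.inf] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] == math.inf:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def biased_attention(
    logits: Sequence[Sequence[float]],
    hops: Sequence[Sequence[float]],
    penalty: float,
) -> List[List[float]]:
    """Row-wise softmax of (logit - penalty * hops); use penalty > 0.

    Unreachable pairs (infinite hop count) receive zero attention weight.
    """
    weights: List[List[float]] = []
    for row, hop_row in zip(logits, hops):
        scores = [l - penalty * h for l, h in zip(row, hop_row)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Chain of three keypoints: 0 - 1 - 2.
hops = hop_distances(3, [(0, 1), (1, 2)])
attn = biased_attention([[0.0] * 3 for _ in range(3)], hops, penalty=1.0)
```

With uniform logits, keypoint 0 attends most to itself, less to its neighbour 1, and least to keypoint 2, which is two hops away.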

[CV-5] Imperceptible Adversarial Examples in the Physical World

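[Quick Read]: This paper addresses the generation of imperceptible adversarial examples in the physical world against deep learning-based computer vision models. Existing methods for physically realizable adversarial examples usually loosen the definition of an adversarial example by allowing unbounded perturbations, which results in obvious or even strange visual patterns. The key is a straight-through estimator (STE) that overcomes the non-differentiable image distortion functions in visual sensing systems: the exact, non-differentiable distortion is applied in the forward pass, and the identity function is used in the backward pass. The paper also extends STE with differentiable rendering to obtain imperceptible adversarial patches in the physical world. Experiments show that, despite the non-differentiable distortions, STE quickly generates adversarial examples with small $\ell_\infty$ norm that force zero classification accuracy in the global perturbation threat model and near-zero AP50 (4.22%) in the patch perturbation threat model.

Link: https://arxiv.org/abs/2411.16622
Authors: Weilin Xu,Sebastian Szyller,Cory Cornelius,Luis Murillo Rojas,Marius Arvinte,Alvaro Velasquez,Jason Martin,Nageen Himayat
Keywords: deep learning-based computer, learning-based computer vision, computer vision models, physical world, Adversarial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract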

Abstract:Adversarial examples in the digital domain against deep learning-based computer vision models allow for perturbations that are imperceptible to human eyes. However, producing similar adversarial examples in the physical world has been difficult due to the non-differentiable image distortion functions in visual sensing systems. The existing algorithms for generating physically realizable adversarial examples often loosen their definition of adversarial examples by allowing unbounded perturbations, resulting in obvious or even strange visual patterns. In this work, we make adversarial examples imperceptible in the physical world using a straight-through estimator (STE, a.k.a. BPDA). We employ STE to overcome the non-differentiability – applying exact, non-differentiable distortions in the forward pass of the backpropagation step, and using the identity function in the backward pass. Our differentiable rendering extension to STE also enables imperceptible adversarial patches in the physical world. Using printout photos, and experiments in the CARLA simulator, we show that STE enables fast generation of $\ell_\infty$ bounded adversarial examples despite the non-differentiable distortions. To the best of our knowledge, this is the first work demonstrating imperceptible adversarial examples bounded by small $\ell_\infty$ norms in the physical world that force zero classification accuracy in the global perturbation threat model and cause near-zero (4.22%) AP50 in object detection in the patch perturbation threat model. We urge the community to re-evaluate the threat of adversarial examples in the physical world.
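
The forward/backward asymmetry of a straight-through estimator can be shown on a toy problem. The sketch below is not the paper's pipeline. It assumes a linear scoring model and uses clamping plus quantization as a stand-in for a non-differentiable camera distortion, and all function names are hypothetical. The forward pass evaluates the exact distortion, while the gradient step treats the distortion as the identity, so the gradient of the score with respect to the perturbation is simply the weight vector.

```python
from typing import List, Sequence


def distort(x: Sequence[float], levels: int = 16) -> List[float]:
    """Non-differentiable stand-in for sensing: clamp to [0, 1], then quantize."""
    return [round(min(1.0, max(0.0, v)) * (levels - 1)) / (levels - 1) for v in x]


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Toy linear classifier score; a positive value means the original class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def ste_attack(
    x: Sequence[float], w: Sequence[float], b: float,
    eps: float, step: float, iters: int,
) -> List[float]:
    """Signed-gradient attack with an l_inf budget `eps`, using STE through `distort`."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        # Forward pass: the exact, non-differentiable distortion.
        s = score(w, b, distort([xi + di for xi, di in zip(x, delta)]))
        if s < 0:                      # the distorted input is already misclassified
            break
        # Backward pass: distort is treated as the identity, so d(score)/d(delta) = w.
        grad = w
        delta = [
            max(-eps, min(eps, di - step * sign(gi)))   # descend, then clip to the budget
            for di, gi in zip(delta, grad)
        ]
    return delta


x, w, b = [0.5, 0.5], [1.0, -1.0], 0.05
delta = ste_attack(x, w, b, eps=0.1, step=0.02, iters=20)
```

The true derivative of the quantizer is zero almost everywhere, so plain backpropagation would give no usable signal; substituting the identity in the backward pass is what lets the perturbation make progress while the forward pass still sees the real distortion.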

[CV-6] Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

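[Quick Read]: This paper addresses the visual and semantic distortions that frequently appear in AI-generated videos (AGVs) of human activities and that hinder the practical use of video generation. The key contributions are a dataset, AI-Generated Human activity Video Quality Assessment (Human-AGVQA), and an objective metric, AI-Generated Human activity Video Quality metric (GHVQ). Human-AGVQA contains 3,200 AGVs generated by 8 popular text-to-video (T2V) models from 400 text prompts describing diverse human activities. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, which makes it a comprehensive and explainable metric for human activity AGVs. Experiments show that GHVQ outperforms existing quality metrics on Human-AGVQA by a large margin.

Link: https://arxiv.org/abs/2411.16619
Authors: Zhichao Zhang,Wei Sun,Xinyue Li,Yunhao Li,Qihang Ge,Jun Jia,Zicheng Zhang,Zhongpeng Ji,Fengyu Sun,Shangling Jui,Xiongkuo Min,Guangtao Zhai
Keywords: made significant progress, human activity AGVs, human activity, Human activity Video, AI-driven video generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract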

Abstract:AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 3,200 AGVs derived from 8 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released in public at this https URL

[CV-7] GeoFormer: A Multi-Polygon Segmentation Transformer

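[Quick Read]: This paper addresses learning scale-invariant shapes of objects such as buildings in remote sensing. Prior methods rely on tuning multiple loss functions to convert segmentation maps into a final scale-invariant representation, which requires laborious design and optimization. The proposed GeoFormer is a new architecture that learns to generate multipolygons end-to-end. The key is to model keypoints as spatially dependent tokens processed auto-regressively, so that a single likelihood function is optimized; this outperforms existing work at delineating buildings in satellite imagery. The paper reports the first successful application of auto-regressive transformer models to multi-polygon prediction in remote sensing, offering a promising alternative for building vectorization.

Link: https://arxiv.org/abs/2411.16616
Authors: Maxim Khomiakov,Michael Riis Andersen,Jes Frellsen
Keywords: scale invariant shapes, learning scale invariant, scale invariant representation, final scale invariant, scale invariant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 5 figures, in proceedings of British Machine Vision Conference 2024

Click to view abstract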

Abstract:In remote sensing there exists a common need for learning scale invariant shapes of objects like buildings. Prior works relies on tweaking multiple loss functions to convert segmentation maps into the final scale invariant representation, necessitating arduous design and optimization. For this purpose we introduce the GeoFormer, a novel architecture which presents a remedy to the said challenges, learning to generate multipolygons end-to-end. By modeling keypoints as spatially dependent tokens in an auto-regressive manner, the GeoFormer outperforms existing works in delineating building objects from satellite imagery. We evaluate the robustness of the GeoFormer against former methods through a variety of parameter ablations and highlight the advantages of optimizing a single likelihood function. Our study presents the first successful application of auto-regressive transformer models for multi-polygon predictions in remote sensing, suggesting a promising methodological alternative for building vectorization.

[CV-8] Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models

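[Quick Read]: This paper addresses the limitations of existing text-to-SVG generation methods in shape regularity, generalization ability, and expressiveness. The key is Chat2SVG, a hybrid framework that combines Large Language Models (LLMs) with image diffusion models. An LLM first generates semantically meaningful SVG templates from basic geometric primitives; a dual-stage optimization pipeline guided by image diffusion models then refines paths in latent space and adjusts point coordinates to increase geometric complexity. The approach improves visual fidelity, path regularity, and semantic alignment, and supports intuitive editing through natural language instructions, making professional vector graphics creation more accessible.

Link: https://arxiv.org/abs/2411.16602
Authors: Ronghuan Wu,Wanchao Su,Jing Liao
Keywords: Scalable Vector Graphics, offering resolution independence, Scalable Vector, Vector Graphics, digital design
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL

Click to view abstract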

Abstract:Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.

[CV-9] Unlocking The Potential of Adaptive Attacks on Diffusion-Based Purification

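[Quick Read]: This paper examines whether diffusion-based purification (DBP) is an effective defense against adversarial examples (AEs). DBP is popular because it is attack-oblivious and reportedly resists strong adversaries, but the paper shows that its core foundations break down under gradient-based adaptive attacks. The key is to revisit and fix implementation flaws in the gradient back-propagation techniques used for DBP so far, and to propose a new optimization method that, combined with the adaptive attack, completely defeats DBP even in the majority-vote setting. By providing the first reliable gradient library for DBP, the paper demonstrates that adaptive attacks drastically reduce DBP's robustness and concludes that DBP, in its current state, is not a viable defense against AEs.

Link: https://arxiv.org/abs/2411.16598
Authors: Andre Kassis,Urs Hengartner,Yaoliang Yu
Keywords: Diffusion-based purification, amassing popularity, ability to protect, attack-oblivious manner, manner and resistance
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract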

Abstract:Diffusion-based purification (DBP) is a defense against adversarial examples (AEs), amassing popularity for its ability to protect classifiers in an attack-oblivious manner and resistance to strong adversaries with access to the defense. Its robustness has been claimed to ensue from the reliance on diffusion models (DMs) that project the AEs onto the natural distribution. We revisit this claim, focusing on gradient-based strategies that back-propagate the loss gradients through the defense, commonly referred to as "adaptive attacks". Analytically, we show that such an optimization method invalidates DBP’s core foundations, effectively targeting the DM rather than the classifier and restricting the purified outputs to a distribution over malicious samples instead. Thus, we reassess the reported empirical robustness, uncovering implementation flaws in the gradient back-propagation techniques used thus far for DBP. We fix these issues, providing the first reliable gradient library for DBP and demonstrating how adaptive attacks drastically degrade its robustness. We then study a less efficient yet stricter majority-vote setting where the classifier evaluates multiple purified copies of the input to make its decision. Here, DBP’s stochasticity enables it to remain partially robust against traditional norm-bounded AEs. We propose a novel adaptation of a recent optimization method against deepfake watermarking that crafts systemic malicious perturbations while ensuring imperceptibility. When integrated with the adaptive attack, it completely defeats DBP, even in the majority-vote setup. Our findings prove that DBP, in its current state, is not a viable defense against AEs.
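
The majority-vote setting described in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's evaluation code. Here `toy_purify` (additive Gaussian noise) is a hypothetical stand-in for a stochastic diffusion purifier, and `toy_classify` is a stand-in threshold classifier.

```python
import random
from collections import Counter
from typing import Callable, Hashable


def majority_vote_predict(
    x: float,
    purify: Callable[[float, random.Random], float],
    classify: Callable[[float], Hashable],
    copies: int,
    rng: random.Random,
) -> Hashable:
    """Classify `copies` independently purified versions of x; return the most common label."""
    votes = Counter(classify(purify(x, rng)) for _ in range(copies))
    return votes.most_common(1)[0][0]


def toy_purify(x: float, rng: random.Random) -> float:
    """Hypothetical stochastic purifier: the input plus Gaussian noise."""
    return x + rng.gauss(0.0, 0.5)


def toy_classify(x: float) -> int:
    """Hypothetical classifier: label 1 for positive inputs, 0 otherwise."""
    return int(x > 0.0)


label = majority_vote_predict(1.0, toy_purify, toy_classify, copies=101, rng=random.Random(0))
```

Because each purified copy is sampled independently, an attack has to flip the decision on most samples rather than on a single one, which is why the abstract calls this setting stricter and less efficient.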

[CV-10] Rethinking Diffusion for Text-Driven Human Motion Generation

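[Quick Read]: This paper addresses the information loss, reduced diversity, and limited usefulness as motion priors or generation guidance that affect vector quantization (VQ)-based discrete methods for human motion generation. The key is to build on the continuous-space generation of diffusion-based methods: the paper introduces a motion diffusion model that performs bidirectional masked autoregression and is optimized with a reformed data representation and distribution, improving generation quality and diversity. It also proposes more robust evaluation methods for comparing different families of methods fairly.

Link: https://arxiv.org/abs/2411.16575
Authors: Zichong Meng,Yiming Xie,Xiaogang Peng,Zeyu Han,Huaizu Jiang
Keywords: Vector Quantization, primarily surpassing diffusion-based, rapidly dominated human, standard performance metrics, primarily surpassing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Click to view abstract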

Abstract:Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous space generation nature of diffusion-based methods makes them well-suited to address these limitations and with even potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model enabled to perform bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we also propose more robust evaluation methods to fairly assess different-based methods. Extensive experiments on benchmark human motion generation datasets demonstrate that our method excels previous methods and achieves state-of-the-art performances.

[CV-11] J-CaPA: Joint Channel and Pyramid Attention Improves Medical Image Segmentation

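[Quick Read]: This paper addresses the difficulty that traditional CNN-based medical image segmentation models such as U-Net have in capturing long-range dependencies and global context. The key is a transformer-based architecture that jointly applies Channel Attention and Pyramid Attention to improve multi-scale feature extraction and segmentation performance. CutMix data augmentation further improves generalization. On the Synapse multi-organ segmentation dataset, the approach achieves a 6.9% improvement in mean Dice score and a 39.9% improvement in Hausdorff Distance (HD95) over an implementation without these enhancements.

Link: https://arxiv.org/abs/2411.16568
Authors: Marzia Binta Nizam,Marian Zlateva,James Davis
Keywords: treatment planning, crucial for diagnosis, diagnosis and treatment, Medical image segmentation, applies Channel Attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract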

Abstract:Medical image segmentation is crucial for diagnosis and treatment planning. Traditional CNN-based models, like U-Net, have shown promising results but struggle to capture long-range dependencies and global context. To address these limitations, we propose a transformer-based architecture that jointly applies Channel Attention and Pyramid Attention mechanisms to improve multi-scale feature extraction and enhance segmentation performance for medical images. Increasing model complexity requires more training data, and we further improve model generalization with CutMix data augmentation. Our approach is evaluated on the Synapse multi-organ segmentation dataset, achieving a 6.9% improvement in Mean Dice score and a 39.9% improvement in Hausdorff Distance (HD95) over an implementation without our enhancements. Our proposed model demonstrates improved segmentation accuracy for complex anatomical structures, outperforming existing state-of-the-art methods.
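
For readers unfamiliar with the Mean Dice score quoted above: the Dice coefficient of a predicted region A and a ground-truth region B is 2|A∩B| / (|A| + |B|), and the mean is taken over the organ labels. The following is a minimal NumPy sketch under that standard definition. The function names and the convention for a label absent from both maps are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np

def dice_score(pred, target, label):
    """Dice coefficient for one label in two integer label maps."""
    p = (pred == label)
    t = (target == label)
    denom = p.sum() + t.sum()
    if denom == 0:
        # Assumption: a label absent from both maps counts as a perfect match.
        return 1.0
    return 2.0 * np.logical_and(p, t).sum() / denom

def mean_dice(pred, target, labels):
    """Average the per-label Dice coefficients over the given labels."""
    return float(np.mean([dice_score(pred, target, l) for l in labels]))
```

Calling `mean_dice(pred, target, labels=[1, 2, 3])` on two label maps of equal shape returns a value in [0, 1], where 1 means perfect overlap for every listed label.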

[CV-12] Generating Out-Of-Distribution Scenarios Using Language Models

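[Quick Read]: The paper addresses the safety and reliability of autonomous vehicles in Out-Of-Distribution (OOD) driving scenarios. The key to the solution is a framework that uses the zero-shot generalization and common-sense reasoning of Large Language Models (LLMs) to generate diverse OOD driving scenarios. Specifically, an LLM constructs a branching tree in which each branch represents a unique OOD scenario, and the scenarios are then simulated automatically in the CARLA simulator. The paper also introduces a new "OOD-ness" metric that quantifies how far the generated scenarios deviate from typical urban driving conditions, and explores the capacity of Vision-Language Models (VLMs) to interpret and safely navigate the simulated OOD scenarios.

Link: https://arxiv.org/abs/2411.16554
Authors: Erfan Aasi,Phat Nguyen,Shiva Sreeram,Guy Rosman,Sertac Karaman,Daniela Rus
Keywords: machine learning techniques, learning techniques requires, comprehensive safety validation, diverse real-world environments, OOD scenarios
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: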

Abstract:The deployment of autonomous vehicles controlled by machine learning techniques requires extensive testing in diverse real-world environments, robust handling of edge cases and out-of-distribution scenarios, and comprehensive safety validation to ensure that these systems can navigate safely and effectively under unpredictable conditions. Addressing Out-Of-Distribution (OOD) driving scenarios is essential for enhancing safety, as OOD scenarios help validate the reliability of the models within the vehicle’s autonomy stack. However, generating OOD scenarios is challenging due to their long-tailed distribution and rarity in urban driving dataset. Recently, Large Language Models (LLMs) have shown promise in autonomous driving, particularly for their zero-shot generalization and common-sense reasoning capabilities. In this paper, we leverage these LLM strengths to introduce a framework for generating diverse OOD driving scenarios. Our approach uses LLMs to construct a branching tree, where each branch represents a unique OOD scenario. These scenarios are then simulated in the CARLA simulator using an automated framework that aligns scene augmentation with the corresponding textual descriptions. We evaluate our framework through extensive simulations, and assess its performance via a diversity metric that measures the richness of the scenarios. Additionally, we introduce a new “OOD-ness” metric, which quantifies how much the generated scenarios deviate from typical urban driving conditions. Furthermore, we explore the capacity of modern Vision-Language Models (VLMs) to interpret and safely navigate through the simulated OOD scenarios. Our findings offer valuable insights into the reliability of language models in addressing OOD scenarios within the context of urban driving.

[CV-13] Guarding the Gate: ConceptGuard Battles Concept-Level Backdoors in Concept Bottleneck Models

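[Quick Read]: The paper addresses the vulnerability of Concept Bottleneck Models (CBMs) to concept-level backdoor attacks. The key to the solution is ConceptGuard, a defense framework that protects CBMs through a multi-stage approach: it clusters concepts based on text distance measurements and applies a voting mechanism among classifiers trained on different concept subgroups to isolate and mitigate potential triggers. The framework comes with theoretical defense guarantees within a certain trigger size threshold and preserves the high performance and interpretability of CBMs, which strengthens their security and trustworthiness in critical applications.

Link: https://arxiv.org/abs/2411.16512
Authors: Songning Lai,Yu Huang,Jiayu Yang,Gaoxiang Huang,Wenshuo Chen,Yutao Yue
Keywords: Explainable Artificial Intelligence, deep learning, medical diagnostics, undermine trust, Artificial Intelligence
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures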

Abstract:The increasing complexity of AI models, especially in deep learning, has raised concerns about transparency and accountability, particularly in high-stakes applications like medical diagnostics, where opaque models can undermine trust. Explainable Artificial Intelligence (XAI) aims to address these issues by providing clear, interpretable models. Among XAI techniques, Concept Bottleneck Models (CBMs) enhance transparency by using high-level semantic concepts. However, CBMs are vulnerable to concept-level backdoor attacks, which inject hidden triggers into these concepts, leading to undetectable anomalous behavior. To address this critical security gap, we introduce ConceptGuard, a novel defense framework specifically designed to protect CBMs from concept-level backdoor attacks. ConceptGuard employs a multi-stage approach, including concept clustering based on text distance measurements and a voting mechanism among classifiers trained on different concept subgroups, to isolate and mitigate potential triggers. Our contributions are threefold: (i) we present ConceptGuard as the first defense mechanism tailored for concept-level backdoor attacks in CBMs; (ii) we provide theoretical guarantees that ConceptGuard can effectively defend against such attacks within a certain trigger size threshold, ensuring robustness; and (iii) we demonstrate that ConceptGuard maintains the high performance and interpretability of CBMs, crucial for trustworthiness. Through comprehensive experiments and theoretical proofs, we show that ConceptGuard significantly enhances the security and trustworthiness of CBMs, paving the way for their secure deployment in critical applications.
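
The intuition behind voting over concept subgroups can be illustrated with a toy example: if a trigger only touches concepts in a few subgroups, it can only change a few votes. The sketch below is an assumption-laden illustration of plain majority voting, not the ConceptGuard implementation. The subgroup layout, the per-subgroup classifiers, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter

def split_concepts(concepts, groups):
    """Return one sub-vector per concept subgroup (groups: lists of indices)."""
    return [[concepts[i] for i in g] for g in groups]

def ensemble_predict(concepts, groups, classifiers):
    """Let each subgroup classifier vote and return the majority label.

    classifiers[k] is a hypothetical callable that maps the k-th concept
    sub-vector to a class label. Ties go to the smallest label (assumption).
    """
    subs = split_concepts(concepts, groups)
    votes = [clf(sub) for clf, sub in zip(classifiers, subs)]
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)
```

With three subgroups, a trigger confined to one subgroup flips at most one of three votes, so the majority label is unchanged.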

[CV-14] Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

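[Quick Read]: The paper addresses the semantic alignment between images generated by diffusion models and their input prompts. The key to the solution is to use the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent. Specifically, the paper proposes the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. The paper provides a theoretical analysis of the condition under which the update improves semantic faithfulness, and experiments show that the framework is effective and adaptable, consistently enhancing semantic alignment across various diffusion models.

Link: https://arxiv.org/abs/2411.16503
Authors: Boming Miao,Chunxiao Li,Xiaoxiao Wang,Andi Zhang,Rui Sun,Zizhe Wang,Yao Zhu
Keywords: achieved impressive success, initial noisy latent, ensuring precise semantic, generating photorealistic images, precise semantic alignment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: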

Abstract:Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A latest approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models. The code is available at this https URL.

[CV-15] Multi-Resolution Generative Modeling of Human Motion from Limited Data

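[Quick Read]: The paper addresses the synthesis of human motion from limited training sequences. The key to the solution is a generative model that captures human motion patterns by combining skeletal convolution layers with a multi-scale architecture. The model contains a set of generative and adversarial networks along with embedding modules, each tailored to generate motion at a specific frame rate while controlling its content and details. The approach also extends to co-speech gesture synthesis, even with limited paired data. By synthesizing SMPL pose parameters directly, it avoids test-time adjustments to fit human body meshes. Experiments show that the model covers the training examples extensively while generating diverse motions.

Link: https://arxiv.org/abs/2411.16498
Authors: David Eduardo Moreno-Villamarín,Anna Hilsmann,Peter Eisert
Keywords: learns to synthesize, limited training sequences, synthesize human motion, training sequences, synthesize human
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, published in European Conference on Visual Media Production CVMP 24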

Abstract:We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model’s ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.

[CV-16] Deformable Mamba for Wide Field of View Segmentation

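[Quick Read]: The paper addresses the significant distortions that wide-FoV cameras (such as fisheye and panoramic setups) introduce in 180° and 360° images, which complicate dense prediction tasks such as panoramic semantic segmentation. The key to the solution is Deformable Mamba, a unified framework designed to handle imaging distortions in panoramic and fisheye images. At its core is a decoder built from a series of Deformable Mamba Fusion (DMF) blocks, making the whole framework more deformable, efficient, and accurate under extreme distortions. In evaluations across five datasets, the method consistently improves segmentation accuracy over previous state-of-the-art methods tailored to specific fields of view (FoVs), including a +2.5% improvement on the 360° Stanford2D3D dataset and better results across FoVs from 60° to 360°.

Link: https://arxiv.org/abs/2411.16481
Authors: Jie Hu,Junwei Zheng,Jiale Wei,Jiaming Zhang,Rainer Stiefelhagen
Keywords: dense prediction tasks, complicating dense prediction, Wide-FoV cameras, introduce significant distortions, complicating dense
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Models and code will be made publicly available at: this https URL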

Abstract:Wide-FoV cameras, like fisheye and panoramic setups, are essential for broader perception but introduce significant distortions in 180° and 360° images, complicating dense prediction tasks. For instance, existing MAMBA models lacking distortion-aware capacity cannot perform well in panoramic semantic segmentation. To address this problem, this work presents Deformable Mamba, a unified framework specifically designed to address imaging distortions within the context of panoramic and fisheye semantic segmentation. At the core is a decoder constructed with a series of Deformable Mamba Fusion (DMF) blocks, making the whole framework more deformable, efficient, and accurate, when handling extreme distortions. Extensive evaluations across five datasets demonstrate that our method consistently improves segmentation accuracy compared to the previous state-of-the-art methods tailored for specific FoVs. Notably, Deformable Mamba achieves a +2.5% performance improvement on the 360° Stanford2D3D dataset, and shows better results across FoVs from 60° to 360°.

[CV-17] Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

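[Quick Read]: The paper addresses degradations such as blurring and quantization noise that high compression ratios cause in face videos, a very common type of video on which such degradations have a particularly serious impact. The key to the solution is a novel and efficient blind video face enhancement method built on a 3D-VQGAN (3D Vector Quantized Generative Adversarial Network) backbone with spatial-temporal codebooks that record high-quality portrait features and residual-based temporal information. The model is trained in a two-stage learning framework: the first stage uses a regularizer to mitigate codebook collapse, and the second stage learns two Transformers that look up codes from the codebooks and further updates the encoder of low-quality videos. Experiments show that the method surpasses current state-of-the-art blind face video restoration and de-flickering methods in both efficiency and effectiveness.

Link: https://arxiv.org/abs/2411.16468
Authors: Yutong Wang,Jiajie Teng,Jiajiong Cao,Yuming Li,Chenguang Ma,Hongteng Xu,Dixin Luo
Keywords: talk shows, live broadcasts, common type, video face enhancement, face
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: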

Abstract:As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from i) long processing time and ii) inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method develops upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage I, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage II, we learn two transformers to lookup code from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods on both efficiency and effectiveness. Code is available at this https URL.

[CV-18] No Identity no problem: Motion through detection for people tracking

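[Quick Read]: The paper addresses the need for costly identity annotations in people-tracking methods that rely on detection and re-identification. The key to the solution is to exploit motion cues while supervising only the detections, with no motion annotations required. Specifically, the algorithm predicts detection heatmaps at two different times and estimates the 2D motion offsets between the two images. It then warps one heatmap with the motion estimate and enforces consistency with the other. This couples information from different images during training and improves tracking accuracy, especially in crowded scenes and low frame-rate sequences.

Link: https://arxiv.org/abs/2411.16466
Authors: Martin Engilberge,F. Wilke Grosche,Pascal Fua
Keywords: facto standard approach, facto standard, motion, regressing motion offset, standard approach
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in TMLR November 2024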

Abstract:Tracking-by-detection has become the de facto standard approach to people tracking. To increase robustness, some approaches incorporate re-identification using appearance models and regressing motion offset, which requires costly identity annotations. In this paper, we propose exploiting motion clues while providing supervision only for the detections, which is much easier to do. Our algorithm predicts detection heatmaps at two different times, along with a 2D motion estimate between the two images. It then warps one heatmap using the motion estimate and enforces consistency with the other one. This provides the required supervisory signal on the motion without the need for any motion annotations. In this manner, we couple the information obtained from different images during training and increase accuracy, especially in crowded scenes and when using low frame-rate sequences. We show that our approach delivers state-of-the-art results for single- and multi-view multi-target tracking on the MOT17 and WILDTRACK datasets.
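
The warp-and-compare idea can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: per-pixel offsets are rounded to whole pixels, mass is moved by forward scattering, and the consistency term is a mean squared error. The abstract does not specify the warping operator or the loss, and a trainable version would need a differentiable warp. All function names are hypothetical.

```python
import numpy as np

def warp_heatmap(heat, offsets):
    """Move each pixel's heatmap value by its (dy, dx) offset.

    heat:    (H, W) detection heatmap at time t
    offsets: (H, W, 2) estimated motion, offsets[y, x] = (dy, dx)
    Values pushed outside the image are dropped (assumption).
    """
    H, W = heat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    ty = np.rint(ys + offsets[..., 0]).astype(int)
    tx = np.rint(xs + offsets[..., 1]).astype(int)
    valid = (ty >= 0) & (ty < H) & (tx >= 0) & (tx < W)
    warped = np.zeros_like(heat, dtype=float)
    # np.add.at accumulates correctly when several pixels land on one target.
    np.add.at(warped, (ty[valid], tx[valid]), heat[valid])
    return warped

def consistency_loss(heat_t, heat_t1, offsets):
    """Mean squared difference between the warped heatmap and the next one."""
    return float(np.mean((warp_heatmap(heat_t, offsets) - heat_t1) ** 2))
```

When the offsets carry every detection at time t onto its position at time t+1, the loss is zero, which is how supervision on detections alone can constrain the motion estimate.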

[CV-19] VQ-SGen: A Vector Quantized Stroke Representation for Sketch Generation

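[Quick Read]: The paper addresses the shortcomings of existing sketch generation methods in handling the intrinsic and contextual relationships among individual strokes, in particular their neglect of stroke shape and spatial positioning. The key to the solution is VQ-SGen, a new algorithm that treats each stroke as an entity and introduces a vector-quantized (VQ) stroke representation for fine-grained sketch generation. VQ-SGen follows a two-stage framework: the first stage decouples each stroke's shape and location information so that the VQ representation prioritizes stroke shape learning; the second stage feeds the precise and compact representation into an auto-decoding Transformer that incorporates stroke semantics, positions, and shapes. The approach generates strokes with high fidelity and enables new applications such as conditional generation and semantic-aware stroke editing.

Link: https://arxiv.org/abs/2411.16446
Authors: Jiawei Wang,Zhiming Cui,Changjian Li
Keywords: paper presents VQ-SGen, high-quality sketch generation, presents VQ-SGen, paper presents, algorithm for high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: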

Abstract:This paper presents VQ-SGen, a novel algorithm for high-quality sketch generation. Recent approaches have often framed the task as pixel-based generation either as a whole or part-by-part, neglecting the intrinsic and contextual relationships among individual strokes, such as the shape and spatial positioning of both proximal and distant strokes. To overcome these limitations, we propose treating each stroke within a sketch as an entity and introducing a vector-quantized (VQ) stroke representation for fine-grained sketch generation. Our method follows a two-stage framework - in the first stage, we decouple each stroke’s shape and location information to ensure the VQ representation prioritizes stroke shape learning. In the second stage, we feed the precise and compact representation into an auto-decoding Transformer to incorporate stroke semantics, positions, and shapes into the generation process. By utilizing tokenized stroke representation, our approach generates strokes with high fidelity and facilitates novel applications, such as conditional generation and semantic-aware stroke editing. Comprehensive experiments demonstrate our method surpasses existing state-of-the-art techniques, underscoring its effectiveness. The code and model will be made publicly available upon publication.
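
As background for the vector-quantized stroke representation, the core quantization step in generic VQ replaces each continuous embedding with its nearest codebook entry and keeps the entry's index as a discrete token. The sketch below shows only that generic lookup. The embedding dimension, codebook contents, and function name are placeholders, and nothing here reflects how VQ-SGen encodes strokes or trains its codebook.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to its nearest codebook vector (squared L2 distance).

    z:        (N, D) continuous embeddings, e.g. one per stroke (assumption)
    codebook: (K, D) code vectors
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]
```

The returned indices are the tokens a sequence model such as a Transformer can consume, while the quantized vectors are what a decoder would reconstruct from.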

[CV-20] SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

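[Quick Read]: The paper addresses the lack of a unified framework for both generating and editing 3D scenes, in the setting of 3D Gaussian Splatting (3DGS), which recent work uses for high-fidelity, real-time rendering. The key to the solution is SplatFlow, a comprehensive framework with two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space and simultaneously generates multi-view images, depths, and camera poses conditioned on text prompts, which addresses the diverse scene scales and complex camera trajectories of real-world settings. The GSDecoder then efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Using training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks, such as object editing, novel view synthesis, and camera pose estimation, within a unified framework and without additional complex pipelines.

Link: https://arxiv.org/abs/2411.16443
Authors: Hyojun Go,Byeongjun Park,Jiho Jang,Jin-Young Kim,Soonwoo Kwon,Changick Kim
Keywords: intuitive user interactions, hold significant potential, streamlining content creation, scenes hold significant, Gaussian Splatting Decoder
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL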

Abstract:Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks-including object editing, novel view synthesis, and camera pose estimation-within a unified framework without requiring additional complex pipelines. We validate SplatFlow’s capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.

[CV-21] AnonyNoise: Anonymizing Event Data with Smart Noise to Outsmart Re-Identification and Preserve Privacy WACV25

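[Quick Read]: The paper addresses the threat to individual privacy posed by the growing re-identification capabilities of deep neural networks combined with the recent rise in public surveillance. The key to the solution is an event camera data anonymization pipeline that prevents re-identification not only by humans but also by neural networks. Specifically, the method introduces learnable data-dependent noise to cover personally identifiable information in raw event data, reducing attackers' re-identification capabilities by up to 60% while retaining substantial information for downstream tasks. The anonymization also generalizes well to unseen data and is robust against image reconstruction and inversion attacks.

Link: https://arxiv.org/abs/2411.16440
Authors: Katharina Bendig,René Schuster,Nicole Thiemer,Karen Joisten,Didier Stricker
Keywords: rise in public, public surveillance, neural networks, recent years, deep neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV25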

Abstract:The increasing capabilities of deep neural networks for re-identification, combined with the rise in public surveillance in recent years, pose a substantial threat to individual privacy. Event cameras were initially considered as a promising solution since their output is sparse and therefore difficult for humans to interpret. However, recent advances in deep learning prove that neural networks are able to reconstruct high-quality grayscale images and re-identify individuals using data from event cameras. In our paper, we contribute a crucial ethical discussion on data privacy and present the first event anonymization pipeline to prevent re-identification not only by humans but also by neural networks. Our method effectively introduces learnable data-dependent noise to cover personally identifiable information in raw event data, reducing attackers’ re-identification capabilities by up to 60%, while maintaining substantial information for performing downstream tasks. Moreover, our anonymization generalizes well on unseen data and is robust against image reconstruction and inversion attacks. Code: this https URL

[CV-22] Harnessing Superclasses for Learning from Hierarchical Databases

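[Quick Read]: The paper addresses supervised hierarchical classification in large-scale problems where the classes are organized in a known hierarchy. The key to the solution is a new loss that uses knowledge of the hierarchy to assign each example not only to a class but also to all of its encompassing superclasses. The loss applies to any feedforward architecture with a softmax output layer and is a proper scoring rule, in that its expectation is minimized by the true posterior class probabilities. This property makes it possible to pursue consistent classification objectives for superclasses and fine-grained classes simultaneously, without a performance trade-off between granularities. Experiments show that the approach improves accuracy and reduces the number of coarse errors, where predicted labels are far from the ground-truth labels in the tree, at no significant additional computational cost.

Link: https://arxiv.org/abs/2411.16438
Authors: Nicolas Urbani(Heudiasyc),Sylvain Rousseau(Heudiasyc),Yves Grandvalet(Heudiasyc),Leonardo Tanzi(Polito)
Keywords: large-scale classification problems, typically represented, expressing the inclusion, classification problems, large-scale classification
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: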

Abstract:In many large-scale classification problems, classes are organized in a known hierarchy, typically represented as a tree expressing the inclusion of classes in superclasses. We introduce a loss for this type of supervised hierarchical classification. It utilizes the knowledge of the hierarchy to assign each example not only to a class but also to all encompassing superclasses. Applicable to any feedforward architecture with a softmax output layer, this loss is a proper scoring rule, in that its expectation is minimized by the true posterior class probabilities. This property allows us to simultaneously pursue consistent classification objectives between superclasses and fine-grained classes, and eliminates the need for a performance trade-off between different granularities. We conduct an experimental study on three reference benchmarks, in which we vary the size of the training sets to cover a diverse set of learning scenarios. Our approach does not entail any significant additional computational cost compared with the loss of cross-entropy. It improves accuracy and reduces the number of coarse errors, with predicted labels that are distant from ground-truth labels in the tree.
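
One way to picture a loss that credits an example to its class and to every enclosing superclass is to sum leaf softmax probabilities into superclass probabilities and add one log-loss term per node on the path from the true leaf to the root. The sketch below does exactly that. It is an assumed form chosen for illustration, since the abstract does not give the paper's formula, and the node layout and function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def hierarchical_nll(logits, true_leaf, nodes):
    """Sum of negative log-probabilities over all nodes containing true_leaf.

    nodes: list of sets of leaf indices; each set is one node of the tree
           (a leaf is a singleton, a superclass is the set of its leaves).
    Assumption: a superclass probability is the sum of its leaves' softmax
    probabilities.
    """
    p = softmax(np.asarray(logits, dtype=float))
    loss = 0.0
    for leaves in nodes:
        if true_leaf in leaves:
            loss -= np.log(p[sorted(leaves)].sum())
    return float(loss)
```

With only singleton nodes the function reduces to ordinary cross-entropy, so the superclass terms act as additional coarse-level penalties computed from the same softmax output.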

[CV-23] Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack NEURIPS2024

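[Quick Read]: The paper addresses privacy protection for personalized text-to-image (T2I) diffusion models, which pose a significant privacy risk when misused for malicious purposes. The key to the solution is a novel and efficient adversarial attack method, Concept Protection by Selective Attention Manipulation (CoPSAM), which targets only the cross-attention layers of a T2I diffusion model. It carefully constructs an imperceptible noise that is added to clean samples to obtain adversarial samples. The noise is obtained during fine-tuning by maximizing the discrepancy between the cross-attention maps of the user-specific token and the class-specific token. Experiments show that the method outperforms existing methods at protecting an individual's identity from potential misuse and gives better protection at lower noise levels.

Link: https://arxiv.org/abs/2411.16437
Authors: Xide Xu,Muhammad Atif Butt,Sandesh Kamath,Bogdan Raducanu
Keywords: customized visual content, Selective Attention Manipulation, rise of personalized, growing demand, demand for customized
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at Safe Generative AI Workshop (NeurIPS 2024)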

Abstract:The growing demand for customized visual content has led to the rise of personalized text-to-image (T2I) diffusion models. Despite their remarkable potential, they pose significant privacy risk when misused for malicious purposes. In this paper, we propose a novel and efficient adversarial attack method, Concept Protection by Selective Attention Manipulation (CoPSAM) which targets only the cross-attention layers of a T2I diffusion model. For this purpose, we carefully construct an imperceptible noise to be added to clean samples to get their adversarial counterparts. This is obtained during the fine-tuning process by maximizing the discrepancy between the corresponding cross-attention maps of the user-specific token and the class-specific token, respectively. Experimental validation on a subset of CelebA-HQ face images dataset demonstrates that our approach outperforms existing methods. Besides this, our method presents two important advantages derived from the qualitative evaluation: (i) we obtain better protection results for lower noise levels than our competitors; and (ii) we protect the content from unauthorized use thereby protecting the individual’s identity from potential misuse.

[CV-24] TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

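[Quick Read]: The paper addresses the loss of spatial information in existing Large Language Model (LLM)-based methods for Zero-Shot Object Navigation (ZSON), which convert visual observations into language descriptions. The key to the solution is TopV-Nav, a method based on a multimodal large language model (MLLM) that reasons directly on a top-view map with complete spatial information. Specifically, the paper proposes Adaptive Visual Prompt Generation (AVPG) to adaptively construct a semantically rich top-view map, so that the agent can use the map's spatial information directly for thorough reasoning. It also designs a Dynamic Map Scaling (DMS) mechanism that dynamically zooms the top-view map to preferred scales, enhancing local fine-grained reasoning, and a Target-Guided Navigation (TGN) mechanism that predicts and uses target locations, facilitating global and human-like exploration. Experiments on the MP3D and HM3D benchmarks show that TopV-Nav outperforms existing methods, with absolute improvements of +3.9% SR and +2.0% SPL on HM3D.

Link: https://arxiv.org/abs/2411.16425
Authors: Linqing Zhong,Chen Gao,Zihan Ding,Yue Liao,Si Liu
Keywords: previously unseen object, Zero-Shot Object Navigation, task requires embodied, Zero-Shot Object, unseen object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 10 pages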

Abstract:The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such a goal-oriented exploration heavily relies on the ability to perceive, understand, and reason based on the spatial information of the environment. However, current LLM-based approaches convert visual observations to language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, a MLLM-based method that directly reasons on the top-view map with complete spatial information. To fully unlock the MLLM’s spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct semantically-rich top-view map. It enables the agent to directly utilize spatial information contained in the top-view map to conduct thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom top-view map at preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Target-Guided Navigation (TGN) mechanism to predict and to utilize target locations, facilitating global and human-like exploration. Experiments on MP3D and HM3D benchmarks demonstrate the superiority of our TopV-Nav, e.g., +3.9% SR and +2.0% SPL absolute improvements on HM3D.

[CV-25] Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks

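[Quick Read]: The paper addresses how to use long-term spatio-temporal data to benchmark and improve machine learning models on typhoon-related tasks. The key to the solution is the Digital Typhoon Dataset V2, which adds tropical cyclone data from the southern hemisphere to the northern hemisphere data of V1, making it possible to study differences across basins and hemispheres. The paper proposes new tasks, such as typhoon center estimation, and examines a self-supervised learning framework combined with a Long Short-Term Memory (LSTM) model on intensity forecasting and extra-tropical transition forecasting. It also studies how models generalize across hemispheres by training on northern hemisphere data and testing on southern hemisphere data.

Link: https://arxiv.org/abs/2411.16421
Authors: Asanobu Kitamoto,Erwan Dzik,Gaspar Faure
Keywords: Digital Typhoon Dataset, satellite image dataset, presents the Digital, long-term spatio-temporal data, longest typhoon satellite
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: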

Abstract:This paper presents the Digital Typhoon Dataset V2, a new version of the longest typhoon satellite image dataset for 40+ years aimed at benchmarking machine learning models for long-term spatio-temporal data. The new addition in Dataset V2 is tropical cyclone data from the southern hemisphere, in addition to the northern hemisphere data in Dataset V1. Having data from two hemispheres allows us to ask new research questions about regional differences across basins and hemispheres. We also discuss new developments in representations and tasks of the dataset. We first introduce a self-supervised learning framework for representation learning. Combined with the LSTM model, we discuss performance on intensity forecasting and extra-tropical transition forecasting tasks. We then propose new tasks, such as the typhoon center estimation task. We show that an object detection-based model performs better for stronger typhoons. Finally, we study how machine learning models can generalize across basins and hemispheres, by training the model on the northern hemisphere data and testing it on the southern hemisphere data. The dataset is publicly available at this http URL and this https URL.

[CV-26] Low-Data Classification of Historical Music Manuscripts: A Few-Shot Learning Approach

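[Quick Read]: The paper addresses the classification of musical symbols in historical manuscripts when labelled data is scarce. The key to the solution is a self-supervised learning framework that trains a neural feature extractor on unlabelled data, enabling effective classification with minimal samples. Specifically, the work optimizes crop preprocessing for a self-supervised Convolutional Neural Network and evaluates several classification methods, including support vector machines (SVM), multilayer perceptrons, and prototypical networks. The experiments yield an accuracy of 87.66%, showing the potential of AI-driven methods for the digital archiving of historical music.

Link: https://arxiv.org/abs/2411.16408
Authors: Elona Shatri,Daniel Raymond,George Fazekas
Keywords: self-supervised learning framework, explore the intersection, intersection of technology, technology and cultural, cultural preservation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, The Sixth IEEE international conference on Image Processing Applications and Systems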

Abstract:In this paper, we explore the intersection of technology and cultural preservation by developing a self-supervised learning framework for the classification of musical symbols in historical manuscripts. Optical Music Recognition (OMR) plays a vital role in digitising and preserving musical heritage, but historical documents often lack the labelled data required by traditional methods. We overcome this challenge by training a neural-based feature extractor on unlabelled data, enabling effective classification with minimal samples. Key contributions include optimising crop preprocessing for a self-supervised Convolutional Neural Network and evaluating classification methods, including SVM, multilayer perceptrons, and prototypical networks. Our experiments yield an accuracy of 87.66%, showcasing the potential of AI-driven methods to ensure the survival of historical music for future generations through advanced digital archiving techniques.
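
Of the classifiers mentioned, prototypical networks are the most specific to the few-shot setting: each class is represented by the mean embedding of its few labelled support samples, and a query takes the label of the nearest prototype. The sketch below shows only that classification rule on precomputed embeddings. The feature extractor is assumed to exist upstream, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def build_prototypes(support_emb, support_labels):
    """Compute one prototype per class as the mean of its support embeddings.

    support_emb:    (N, D) embeddings from a feature extractor (assumed given)
    support_labels: (N,) integer class labels
    Returns (classes, prototypes) with shapes (C,) and (C, D).
    """
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0)
                       for c in classes])
    return classes, protos

def classify(query_emb, classes, protos):
    """Assign each query embedding to the class of its nearest prototype."""
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[np.argmin(d, axis=1)]
```

Because the prototypes are plain averages, adding a new symbol class needs only a handful of labelled crops and no retraining of the classifier head.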

[CV-27] A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models BMVC

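[Quick read]: This paper addresses domain shift in deep-learning-based computer vision, specifically semantic segmentation for autonomous driving. The key to the solution is replacing the conventional ImageNet pre-trained encoder with a vision-language pre-trained encoder, which markedly improves the target-domain performance of unsupervised domain adaptation (UDA) methods. Concretely, swapping the encoder of existing UDA methods such as DACS for a vision-language pre-trained encoder yields gains of up to 10.0% mIoU on the GTA5-to-Cityscapes shift and up to 13.7% mIoU on unseen domains. The paper also notes that not every UDA method pairs easily with the new encoder, and that gains in UDA performance do not always carry over to generalization performance.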

Link: https://arxiv.org/abs/2411.16407
Authors: Manuel Schwonberg, Claus Werner, Hanno Gottschalk, Carsten Meyer
Keywords: based computer vision, deep learning based, learning based computer, UDA methods, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to British Machine Vision Conference (BMVC) 2024: Workshop on Robust Recognition in the Open World (RROW)

Abstract:Despite the recent progress in deep learning based computer vision, domain shifts are still one of the major challenges. Semantic segmentation for autonomous driving faces a wide range of domain shifts, e.g. caused by changing weather conditions, new geolocations and the frequent use of synthetic data in model training. Unsupervised domain adaptation (UDA) methods have emerged which adapt a model to a new target domain by only using unlabeled data of that domain. The variety of UDA methods is large but all of them use ImageNet pre-trained models. Recently, vision-language models have demonstrated strong generalization capabilities which may facilitate domain adaptation. We show that simply replacing the encoder of existing UDA methods like DACS by a vision-language pre-trained encoder can result in significant performance improvements of up to 10.0% mIoU on the GTA5-to-Cityscapes domain shift. For the generalization performance to unseen domains, the newly employed vision-language pre-trained encoder provides a gain of up to 13.7% mIoU across three unseen datasets. However, we find that not all UDA methods can be easily paired with the new encoder and that the UDA performance does not always likewise transfer into generalization performance. Finally, we perform our experiments on an adverse weather condition domain shift to further verify our findings on a pure real-to-real domain shift.
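
The gains above are reported in mIoU (mean intersection over union). For readers unfamiliar with the metric, here is a minimal sketch of how it can be computed for one label map. The tiny arrays are made up, and real benchmarks typically accumulate a confusion matrix over the whole dataset rather than averaging per image.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average the per-class intersection-over-union, skipping absent classes."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class appears in neither map
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Hypothetical 2x3 label maps with three classes.
target = np.array([[0, 0, 1], [1, 1, 2]])
pred = np.array([[0, 1, 1], [1, 1, 2]])
score = mean_iou(pred, target, num_classes=3)  # per-class IoUs: 0.5, 0.75, 1.0
```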

[CV-28] Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN

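[Quick read]: This paper addresses data scarcity in handwritten music sheet generation in order to improve Optical Music Recognition (OMR) systems. The key to the solution is applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheet images. Comparing three GAN models (DCGAN, ProGAN, and CycleWGAN), the paper finds that CycleWGAN, which improves style transfer and training stability, clearly outperforms the others, with an FID of 41.87, an IS of 2.29, and a KID of 0.05, making it promising for improving OMR systems.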

Link: https://arxiv.org/abs/2411.16405
Authors: Elona Shatri, Kalikidhar Palavala, George Fazekas
Keywords: Optical Music Recognition, enhancing Optical Music, handwritten music sheets, enhancing Optical, Music Recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 10 pages, one page references, to appear on the IEEE Big Data 2024 2nd Workshop on AI Music Generation (AIMG 2024)

Abstract:The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.
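
Of the three reported metrics, the Inception Score (IS) is the simplest to write down: it is the exponential of the average KL divergence between each image's class posterior and the marginal class distribution. The sketch below assumes the class posteriors are already available; in practice they come from an Inception-v3 classifier, and the matrices used here are invented for illustration.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) for a (num_images, num_classes) matrix."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Hypothetical posteriors: confident and diverse vs. completely uninformative.
confident_diverse = np.eye(4)
uniform = np.full((4, 4), 0.25)
high = inception_score(confident_diverse)  # close to the number of classes
low = inception_score(uniform)             # close to 1
```

The score is bounded between 1 and the number of classes, which is why sharper and more varied samples score higher.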

[CV-29] Quadratic Gaussian Splatting for Efficient and Detailed Surface Reconstruction

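[Quick read]: This paper addresses the limitations of 3D Gaussian Splatting (3DGS) in surface representation, and in particular the geometric over-smoothing caused by using disks as scene primitives in 2D Gaussian Splatting (2DGS). The key to the solution is Quadratic Gaussian Splatting (QGS), which replaces disks with quadric surfaces to improve geometric fitting. QGS defines Gaussian distributions in non-Euclidean space so that primitives can capture more complex textures, and, as a second-order surface approximation, renders spatial curvature to guide the normal consistency term, which reduces over-smoothing. Experiments show that QGS surpasses current state-of-the-art methods in geometry reconstruction.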

Link: https://arxiv.org/abs/2411.16392
Authors: Ziyu Zhang, Binbin Huang, Hanqing Jiang, Liyang Zhou, Xiaojun Xiang, Shunhan Shen
Keywords: Neural Radiance Fields, Radiance Fields, Neural Radiance, Gaussian Splatting, superior rendering quality
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, 3D Gaussian Splatting (3DGS) has attracted attention for its superior rendering quality and speed over Neural Radiance Fields (NeRF). To address 3DGS’s limitations in surface representation, 2D Gaussian Splatting (2DGS) introduced disks as scene primitives to model and reconstruct geometries from multi-view images, offering view-consistent geometry. However, the disk’s first-order linear approximation often leads to over-smoothed results. We propose Quadratic Gaussian Splatting (QGS), a novel method that replaces disks with quadric surfaces, enhancing geometric fitting. QGS defines Gaussian distributions in non-Euclidean space, allowing primitives to capture more complex textures. As a second-order surface approximation, QGS also renders spatial curvature to guide the normal consistency term, to effectively reduce over-smoothing. Moreover, QGS is a generalized version of 2DGS that achieves more accurate and detailed reconstructions, as verified by experiments on DTU and TNT, demonstrating its effectiveness in surpassing current state-of-the-art methods in geometry reconstruction. Our code will be released as open source.
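
To make the first-order versus second-order distinction concrete, the toy sketch below fits a plane and a quadric height field to samples from a curved patch by least squares and compares the residuals. It is not QGS's primitive or renderer; the synthetic surface and the helper `fit_rmse` are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(200, 2))
x, y = xy[:, 0], xy[:, 1]
z = 0.5 * x**2 + 0.3 * y**2  # hypothetical curved surface patch

def fit_rmse(design, z):
    """Least-squares fit of z on the design matrix; return the residual RMSE."""
    coef, *_ = np.linalg.lstsq(design, z, rcond=None)
    return float(np.sqrt(np.mean((design @ coef - z) ** 2)))

ones = np.ones_like(x)
# First-order (planar) model: z = a + b*x + c*y
plane_rmse = fit_rmse(np.stack([ones, x, y], axis=1), z)
# Second-order (quadric) model adds x^2, x*y and y^2 terms.
quadric_rmse = fit_rmse(np.stack([ones, x, y, x**2, x * y, y**2], axis=1), z)
```

The planar model cannot represent curvature, so its residual stays large, while the quadric model reproduces this patch almost exactly.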

[CV-30] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

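[Quick read]: This paper addresses the computational inefficiency and redundancy of existing video diffusion models (VDMs) when generating long videos. Existing autoregressive VDMs must recompute the conditional frames that overlap with the previous clip when generating each subsequent clip, so computation grows quadratically with the number of autoregression steps. The proposed solution, Ca2-VDM, rests on two mechanisms: causal generation and cache sharing. Causal generation uses unidirectional feature computation so that the cache of conditional frames precomputed in earlier autoregression steps can be reused in later steps, eliminating redundant computation. Cache sharing shares the cache across all denoising steps, avoiding a huge cache storage cost. Experiments show that Ca2-VDM achieves state-of-the-art video generation quality and significantly improves generation speed.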

Link: https://arxiv.org/abs/2411.16375
Authors: Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, Long Chen
Keywords: achieved impressive quality, today video generation, video diffusion models, impressive quality, diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. Code is available at this https URL

Abstract:With the advance of diffusion models, today’s video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at this https URL
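
The quadratic-versus-linear argument can be checked with a back-of-the-envelope count. The sketch below assumes a simplified cost model in which every autoregression step conditions on all previously generated frames and cost is measured in frames processed; it ignores denoising steps and attention cost, and `frames_processed` is a hypothetical helper rather than anything from the paper's code.

```python
def frames_processed(num_steps, clip_len, use_cache):
    """Count frames whose features must be computed over all autoregression steps."""
    total = 0
    for step in range(num_steps):
        context = step * clip_len  # conditional frames generated so far
        # With a cache, conditional-frame features are reused, so only the
        # new clip is processed; without it, the context is recomputed too.
        total += clip_len if use_cache else context + clip_len
    return total

no_cache = frames_processed(num_steps=10, clip_len=8, use_cache=False)
with_cache = frames_processed(num_steps=10, clip_len=8, use_cache=True)
```

Under these assumptions the uncached total grows as clip_len * n(n+1)/2 while the cached total grows as clip_len * n.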

[CV-31] A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation

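[Quick read]: This paper addresses uncertainty quantification in image segmentation, in particular the challenge of ensuring algorithm reliability in high-stakes applications. The key is distinguishing and quantifying two kinds of uncertainty: epistemic uncertainty, which reflects model ignorance (uncertainty over model parameters), and aleatoric uncertainty, which reflects ambiguity in the data itself. The survey identifies that quantifying aleatoric and epistemic uncertainty approximates Bayesian inference with respect to latent variables and model parameters, respectively. It also traces the literature back to four key applications: quantifying statistical inconsistencies in the annotation process, correlating prediction error with uncertainty, expanding the model hypothesis space for better generalization, and active learning.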

Link: https://arxiv.org/abs/2411.16370
Authors: M.M.A. Valiuddin, R.J.G. van Sloun, C.G.A. Viviers, P.H.N. de With, F. van der Sommen
Keywords: Deep Learning-based computer, Learning-based computer vision, Deep Learning-based, scope of Deep, Learning-based computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments: 20 pages

Abstract:Advancements in image segmentation play an integral role within the greater scope of Deep Learning-based computer vision. Furthermore, their widespread applicability in critical real-world tasks has given rise to challenges related to the reliability of such algorithms. Hence, uncertainty quantification has been extensively studied within this context, enabling expression of model ignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to prevent uninformed decision making. Due to the rapid adoption of Convolutional Neural Network (CNN)-based segmentation models in high-stakes applications, a substantial body of research has been published on this very topic, causing its swift expansion into a distinct field. This work provides a comprehensive overview of probabilistic segmentation by discussing fundamental concepts in uncertainty that govern advancements in the field as well as the application to various tasks. We identify that quantifying aleatoric and epistemic uncertainty approximates Bayesian inference w.r.t. either latent variables or model parameters, respectively. Moreover, literature on both uncertainties traces back to four key applications: (1) quantifying statistical inconsistencies in the annotation process due to ambiguous images, (2) correlating prediction error with uncertainty, (3) expanding the model hypothesis space for better generalization, and (4) active learning. Then, a discussion follows that includes an overview of utilized datasets for each of the applications and comparison of the available methods. We also highlight challenges related to architectures, uncertainty-based active learning, standardization and benchmarking, and recommendations for future work such as methods based on single forward passes and models that appropriately leverage volumetric data.
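
One widely used way to separate the two kinds of uncertainty, given several stochastic forward passes (for example from an ensemble or MC dropout), is an entropy decomposition: total predictive entropy equals the expected per-sample entropy (aleatoric part) plus the mutual information between prediction and model (epistemic part). The sketch below applies it to a single pixel. It is offered as a common textbook formulation, not necessarily the one the survey emphasises, and the sample probabilities are invented.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along the given axis."""
    return -(p * np.log(p + eps)).sum(axis=axis)

def decompose_uncertainty(probs):
    """probs: (num_samples, num_classes) softmax outputs for one pixel."""
    total = entropy(probs.mean(axis=0))  # entropy of the averaged prediction
    aleatoric = entropy(probs).mean()    # average entropy of each sample
    epistemic = total - aleatoric        # mutual information
    return float(total), float(aleatoric), float(epistemic)

# Samples agree the pixel is ambiguous: data ambiguity, no model disagreement.
agree = decompose_uncertainty(np.array([[0.5, 0.5], [0.5, 0.5]]))
# Samples are individually confident but contradict each other.
disagree = decompose_uncertainty(np.array([[0.99, 0.01], [0.01, 0.99]]))
```

Both cases have nearly the same total entropy, but only the second attributes most of it to disagreement between model samples.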

[CV-32] Cluster-based human-in-the-loop strategy for improving machine learning-based circulating tumor cell detection in liquid biopsy

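[Quick read]: This paper addresses the detection and differentiation of circulating tumor cells (CTCs) and non-CTCs in blood samples from cancer patients. The key to the solution is a human-in-the-loop (HiL) strategy that combines self-supervised deep learning with a conventional machine learning classifier and has human experts iteratively sample and label new training samples in a targeted way. Specifically, unlabelled training samples are selected based on the classification performance of local latent space clusters. The advantages of this targeted sampling over naive random sampling are demonstrated on liquid biopsy data from patients with metastatic breast cancer.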

Link: https://arxiv.org/abs/2411.16332
Authors: Hümeyra Husseini-Wüsthoff, Sabine Riethdorf, Andreas Schneeweiss, Andreas Trumpp, Klaus Pantel, Harriet Wikman, Maximilian Nielsen, René Werner
Keywords: circulating tumor cells, pose multiple challenges, patients pose multiple, tumor cells, multiple challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detection and differentiation of circulating tumor cells (CTCs) and non-CTCs in blood draws of cancer patients pose multiple challenges. While the gold standard relies on tedious manual evaluation of an automatically generated selection of images, machine learning (ML) techniques offer the potential to automate these processes. However, human assessment remains indispensable when the ML system arrives at uncertain or wrong decisions due to an insufficient set of labeled training data. This study introduces a human-in-the-loop (HiL) strategy for improving ML-based CTC detection. We combine self-supervised deep learning and a conventional ML-based classifier and propose iterative targeted sampling and labeling of new unlabeled training samples by human experts. The sampling strategy is based on the classification performance of local latent space clusters. The advantages of the proposed approach compared to naive random sampling are demonstrated for liquid biopsy data from patients with metastatic breast cancer.
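
The sampling rule described above (pick new samples according to how well each latent-space cluster is classified) can be sketched in a few lines. The version below assumes cluster assignments and per-sample correctness are already available and simply targets the worst-performing cluster. The data and the helper `pick_cluster_to_label` are hypothetical, and the paper's actual selection criterion may differ.

```python
import numpy as np

def pick_cluster_to_label(cluster_ids, is_correct):
    """Return the cluster with the lowest accuracy on labelled samples."""
    clusters = np.unique(cluster_ids)
    acc = np.array([is_correct[cluster_ids == c].mean() for c in clusters])
    return int(clusters[acc.argmin()]), dict(zip(clusters.tolist(), acc.tolist()))

# Hypothetical labelled samples: cluster membership and classifier correctness.
val_clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2])
val_correct = np.array([1, 1, 1, 1, 0, 0, 1, 1], dtype=bool)
worst, acc = pick_cluster_to_label(val_clusters, val_correct)

# Send unlabelled samples that fall into that cluster to the human expert.
unlabeled_clusters = np.array([2, 1, 0, 1, 1, 2])
to_label = np.flatnonzero(unlabeled_clusters == worst)
```

Repeating this after each round of expert labelling gives the iterative loop the abstract describes, in contrast to drawing `to_label` uniformly at random.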

[CV-33] CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain

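[Quick read]: This paper addresses the loss of detail and the pseudo-thermal-crossover artifacts that arise when infrared (IR) images are synthesised from visible light, especially under extreme lighting. The key to the solution is CapHDR2IR, a framework that takes high dynamic range (HDR) images as input and incorporates vision-language models to generate IR images. HDR images capture a wider range of luminance variations, supporting reliable IR image generation under different lighting conditions. In addition, a dense caption branch adds semantic understanding, making the generated IR images more meaningful and discernible. Experiments on the HDRT dataset show that CapHDR2IR achieves state-of-the-art performance compared with both general domain transfer methods and methods tailored to visible-to-infrared image translation.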

Link: https://arxiv.org/abs/2411.16327
Authors: Jingchao Peng, Thomas Bashford-Rogers, Zhuang Shao, Haitao Zhao, Aru Ranjan Singh, Abhishek Goswami, Kurt Debattista
Keywords: imaging offers advantages, imaging offers, offers advantages, unique ability, ability of capturing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Infrared (IR) imaging offers advantages in several fields due to its unique ability to capture content in extreme light conditions. However, the demanding hardware requirements of high-resolution IR sensors limit its widespread application. As an alternative, visible light can be used to synthesize IR images, but this causes a loss of fidelity in image details and introduces inconsistencies due to a lack of contextual awareness of the scene. This stems from using visible light with a standard dynamic range, especially under extreme lighting, combined with a lack of contextual awareness, which can result in pseudo-thermal-crossover artifacts. These artifacts occur when multiple objects with similar temperatures appear indistinguishable in the training data, further exacerbating the loss of fidelity. To solve this challenge, this paper proposes CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images. HDR images capture a wider range of luminance variations, ensuring reliable IR image generation in different light conditions. Additionally, a dense caption branch integrates semantic understanding, resulting in more meaningful and discernible IR outputs. Extensive experiments on the HDRT dataset show that the proposed CapHDR2IR achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.

[CV-34] Brain-like emergent properties in deep networks: impact of network architecture, datasets and training

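[Quick read]: This paper addresses the fact that deep networks perform well on standardized vision benchmarks yet are still outperformed by humans on real-world vision tasks, a gap that could be narrowed by making deep networks more brain-like. Systematically evaluating over 30 state-of-the-art deep networks, the paper finds that network architecture has the strongest impact on brain-like properties, while dataset and training regime have comparatively smaller effects. Networks also vary widely in their alignment to the brain, with no single network outperforming all others. These findings complement existing benchmarks by revealing brain-like properties that are either emergent or lacking in current state-of-the-art deep networks.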

Link: https://arxiv.org/abs/2411.16326
Authors: Niranjan Rajesh, Georgin Jacob, SP Arun
Keywords: real-world vision tasks, standardized vision benchmarks, deep networks, vision tasks, standardized vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the rapid pace at which deep networks are improving on standardized vision benchmarks, they are still outperformed by humans on real-world vision tasks. This paradoxical lack of generalization could be addressed by making deep networks more brain-like. Although several benchmarks have compared the ability of deep networks to predict brain responses to natural images, they do not capture subtle but important brain-like emergent properties. To resolve this issue, we report several well-known perceptual and neural emergent properties that can be tested on deep networks. To evaluate how various design factors impact brain-like properties, we systematically evaluated over 30 state-of-the-art networks with varying network architectures, training datasets and training regimes. Our main findings are as follows. First, network architecture had the strongest impact on brain-like properties compared to dataset and training regime variations. Second, networks varied widely in their alignment to the brain with no single network outperforming all others. Taken together, our results complement existing benchmarks by revealing brain-like properties that are either emergent or lacking in state-of-the-art deep networks.

[CV-35] Luminance Component Analysis for Exposure Correction

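[Quick read]: This paper addresses the difficulty existing exposure correction methods have in separating luminance-related from luminance-unrelated components, which leads to color distortion, loss of detail, and the need for extra restoration steps. The key to the solution is Luminance Component Analysis (LCA), which applies an orthogonal constraint to a U-Net structure to decouple luminance-related and luminance-unrelated features. LCA adjusts only the luminance-related components and leaves the luminance-unrelated components unchanged. To optimize under the orthogonal constraint, it uses a geometric optimization algorithm that converts the constrained problem in Euclidean space into an unconstrained problem on the orthogonal Stiefel manifold. Experiments show that LCA can decouple the luminance feature from the RGB color space and achieves the best PSNR (21.33) and SSIM (0.88) on the exposure correction dataset at 28.72 FPS.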

Link: https://arxiv.org/abs/2411.16325
Authors: Jingchao Peng, Thomas Bashford-Rogers, Jingkun Chen, Haitao Zhao, Zhengwei Hu, Kurt Debattista
Keywords: Exposure correction methods, Exposure correction, correction methods aim, current exposure correction, correction methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Exposure correction methods aim to adjust the luminance while maintaining other luminance-unrelated information. However, current exposure correction methods have difficulty in fully separating luminance-related and luminance-unrelated components, leading to distortions in color, loss of detail, and requiring extra restoration procedures. Inspired by principal component analysis (PCA), this paper proposes an exposure correction method called luminance component analysis (LCA). LCA applies the orthogonal constraint to a U-Net structure to decouple luminance-related and luminance-unrelated features. With decoupled luminance-related features, LCA adjusts only the luminance-related components while keeping the luminance-unrelated components unchanged. To optimize the orthogonal constraint problem, LCA employs a geometric optimization algorithm, which converts the constrained problem in Euclidean space to an unconstrained problem in orthogonal Stiefel manifolds. Extensive experiments show that LCA can decouple the luminance feature from the RGB color space. Moreover, LCA achieves the best PSNR (21.33) and SSIM (0.88) in the exposure correction dataset with 28.72 FPS.
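
The abstract does not say which geometric optimization algorithm is used, so the following is only a generic illustration of optimizing on the Stiefel manifold: project the Euclidean gradient onto the tangent space at the current point, take a step, and map back to the manifold with a QR retraction so the columns stay orthonormal. The matrix sizes, learning rate, and random gradient are placeholders.

```python
import numpy as np

def stiefel_step(W, euclid_grad, lr=0.1):
    """One Riemannian gradient step that keeps W's columns orthonormal."""
    # Project the Euclidean gradient onto the tangent space at W:
    # G - W * sym(W^T G).
    WtG = W.T @ euclid_grad
    riem_grad = euclid_grad - W @ (WtG + WtG.T) / 2
    # QR retraction: step in the tangent direction, then re-orthonormalize.
    Q, R = np.linalg.qr(W - lr * riem_grad)
    # Resolve the sign ambiguity of QR so that diag(R) is non-negative.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.normal(size=(6, 3)))          # random orthonormal start
W_next = stiefel_step(W, rng.normal(size=(6, 3)))     # placeholder gradient
orth_error = np.abs(W_next.T @ W_next - np.eye(3)).max()
```

Because every iterate already satisfies the orthogonality constraint, the optimizer never needs a penalty term or a separate projection step in Euclidean space.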

[CV-36] CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation

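[Quick read]: This paper addresses the reliance of traditional 2D instance segmentation algorithms on large amounts of human-annotated data, and the failure of existing unsupervised methods to separate overlapping instances. The key idea is to cut semantic masks in 3D, using a point cloud representation of the scene, to obtain the final 2D instances. The paper also derives a Spatial Importance function that resharpens the semantics along the 3D borders of instances, and augments the training of a class-agnostic detector with three Spatial Confidence components to reduce the effect of mask ambiguity. With these contributions, the method outperforms competing approaches on multiple standard benchmarks for unsupervised instance segmentation and object detection.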

Link: https://arxiv.org/abs/2411.16319
Authors: Leon Sick, Dominik Engel, Sebastian Hartwig, Pedro Hermosilla, Timo Ropinski
Keywords: algorithms that learn, human-annotated data, learn to segment, heavily relied, relied on large
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditionally, algorithms that learn to segment object instances in 2D images have heavily relied on large amounts of human-annotated data. Only recently, novel approaches have emerged tackling this problem in an unsupervised fashion. Generally, these approaches first generate pseudo-masks and then train a class-agnostic detector. While such methods deliver the current state of the art, they often fail to correctly separate instances overlapping in 2D image space since only semantics are considered. To tackle this issue, we instead propose to cut the semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. Furthermore, we derive a Spatial Importance function, which we use to resharpen the semantics along the 3D borders of instances. Nevertheless, these pseudo-masks are still subject to mask ambiguity. To address this issue, we further propose to augment the training of a class-agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal. With these contributions, our approach outperforms competing methods across multiple standard benchmarks for unsupervised instance segmentation and object detection.

[CV-37] One Diffusion to Generate Them All

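[Quick read]: This paper addresses multi-task image synthesis and understanding, in particular how to support many conditional generation and inverse tasks within one unified framework, such as text-to-image generation, image deblurring, upscaling, depth estimation, and segmentation. The key to the solution is OneDiffusion, which treats all tasks as frame sequences with varying noise scales during training, so that any frame can act as a conditioning image at inference time. This unified approach removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, improving both generalization and scalability.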

Link: https://arxiv.org/abs/2411.16318
Authors: Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu
Keywords: large-scale diffusion model, bidirectional image synthesis, seamlessly supports bidirectional, large-scale diffusion, camera pose estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: two first authors contribute equally

Abstract:We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction, such as text-to-image, multiview generation, ID preservation, depth estimation and camera pose estimation, despite a relatively small training dataset. Our code and checkpoint are freely available at this https URL

[CV-38] Monocular Lane Detection Based on Deep Learning: A Survey

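[Quick read]: This paper gives a comprehensive survey of deep-learning-based monocular lane detection and its role in autonomous driving perception systems. It organizes the field around four core design elements: (1) the task paradigm, which focuses on lane instance-level discrimination; (2) lane modeling, which represents lanes as a set of learnable parameters in the neural network; (3) global context supplementation, which improves detection of obscured lanes; and (4) perspective effect elimination, which provides 3D lanes usable by downstream applications. The survey covers both the increasingly mature 2D lane detection methods and the developing 3D lane detection work, and compares the performance and inference speed of mainstream methods on different benchmarks under a unified setting. It also presents extended work on lane detection, including multi-task perception, video lane detection, online high-definition (HD) map construction, and lane topology reasoning, giving readers a roadmap of how lane detection has evolved.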

Link: https://arxiv.org/abs/2411.16316
Authors: Xin He, Haiyun Guo, Kuan Zhu, Bingke Zhu, Xu Zhao, Jianwu Fang, Jinqiao Wang
Keywords: autonomous driving perception, Lane detection, driving perception system, Lane detection plays, Lane
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Lane detection plays an important role in autonomous driving perception systems. As deep learning algorithms gain popularity, monocular lane detection methods based on deep learning have demonstrated superior performance and emerged as a key research direction in autonomous driving perception. The core design of these algorithmic frameworks can be summarized as follows: (1) Task paradigm, focusing on lane instance-level discrimination; (2) Lane modeling, representing lanes as a set of learnable parameters in the neural network; (3) Global context supplementation, enhancing the detection of obscured lanes; (4) Perspective effect elimination, providing 3D lanes usable for downstream applications. From these perspectives, this paper presents a comprehensive overview of existing methods, encompassing both the increasingly mature 2D lane detection approaches and the developing 3D lane detection works. For a relatively fair comparison, in addition to comparing the performance of mainstream methods on different benchmarks, their inference speed is also investigated under a unified setting. Moreover, we present some extended works on lane detection, including multi-task perception, video lane detection, online high-definition (HD) map construction, and lane topology reasoning, to offer readers a comprehensive roadmap for the evolution of lane detection. Finally, we point out some potential future research directions in this field. We exhaustively collect the papers and codes of existing works at this https URL and will keep tracing the research.
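
Design element (2), lane modeling, is easiest to picture with a parametric curve: a lane is described by a handful of coefficients rather than a pixel mask. The sketch below represents one lane as a quadratic x = f(y) in normalised image coordinates and recovers the coefficients by least squares. In the surveyed methods a network would regress such parameters; the least-squares fit and the chosen coefficients are stand-ins for illustration.

```python
import numpy as np

# Hypothetical lane in normalised image coordinates: x = 0.4*y^2 - 0.3*y + 0.5
ys = np.linspace(0.0, 1.0, 20)
true_coeffs = np.array([0.4, -0.3, 0.5])
xs = np.polyval(true_coeffs, ys)

# Three numbers describe the whole lane; here they are fitted, not predicted.
coeffs = np.polyfit(ys, xs, deg=2)
x_at_mid = float(np.polyval(coeffs, 0.5))  # lane position halfway up the image
```

A compact parameter vector like `coeffs` can be sampled at any image row, which is part of why curve-based lane models are attractive for downstream planning.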

[CV-39] EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training

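[Quick read]: This paper addresses the huge computational cost of training on a large number of video frames when the overfitting property of deep neural networks (DNNs) is exploited for super-resolution (SR) reconstruction in video delivery systems. The key to the solution is EPS (Efficient Patch Sampling), an efficient patch sampling method for overfitting video SR networks. EPS introduces two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features to score the complexity of each patch directly, categorizes all candidate patches into clusters according to the histogram distribution of these features, and selects training patches from the cluster with the highest spatial-temporal information. The number of sampled patches adapts to the video content, balancing training complexity and efficiency. The method reduces the number of training patches to between 4% and 25% while maintaining high video quality and significantly improving training efficiency. Compared with the state-of-the-art patch sampling method EMT, EPS reduces overall run time by 83%.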

Link: https://arxiv.org/abs/2411.16312
Authors: Yiying Wei, Hadi Amirpour, Jong Hwan Ko, Christian Timmerer
Keywords: video delivery systems, deep neural networks, property of deep, deep neural, delivery systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Leveraging the overfitting property of deep neural networks (DNNs) is trending in video delivery systems to enhance quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams for low-resolution (LR) bitstreams, which are used to reconstruct high-resolution (HR) videos at the decoder. Although these approaches show promising results, the huge computational costs of training a large number of video frames limit their practical applications. To overcome this challenge, we propose an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames. To this end, we first present two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features to measure the complexity score of each patch directly. By analyzing the histogram distribution of these features, we then categorize all possible patches into different clusters and select training patches from the cluster with the highest spatial-temporal information. The number of sampled patches is adaptive based on the video content, addressing the trade-off between training complexity and efficiency. Our method reduces the number of patches for the training to 4% to 25%, depending on the resolution and number of clusters, while maintaining high video quality and significantly enhancing training efficiency. Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
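
The abstract does not give the exact DCT features, so the sketch below uses an assumed proxy for the spatial part only: the summed magnitude of a patch's non-DC 2-D DCT coefficients. Patches are then binned by a histogram of these scores and only those in the top bin are kept. The temporal feature, the adaptive patch count, and the paper's actual clustering rule are not modelled here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def spatial_score(patch):
    """AC energy of a square patch's 2-D DCT (an assumed texture proxy)."""
    d = dct_matrix(patch.shape[0])
    coeffs = d @ patch @ d.T
    coeffs[0, 0] = 0.0  # drop the DC term, which only encodes mean brightness
    return float(np.abs(coeffs).sum())

rng = np.random.default_rng(0)
flat = np.full((8, 8), 0.5)                 # no texture
textured = rng.uniform(size=(8, 8))         # heavy texture
nearly_flat = flat + 0.01 * rng.normal(size=(8, 8))
scores = np.array([spatial_score(p) for p in (flat, textured, nearly_flat)])

# Bin the scores and keep only patches that fall in the highest bin.
_, edges = np.histogram(scores, bins=3)
selected = np.flatnonzero(scores >= edges[-2])
```

Flat patches contribute little to super-resolution training, so discarding the low-score bins is what shrinks the training set.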

[CV-40] Functionality understanding and segmentation in 3D scenes

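[Quick read]: This paper addresses functionality understanding in 3D scenes, that is, locating functional interactive objects (such as handles and buttons) in a 3D environment from natural language descriptions. The key to the solution is Fun3DU, which uses a language model to parse the task description through Chain-of-Thought reasoning and identify the object of interest. A vision and language model then segments that object across multiple views of the captured scene, and the per-view segmentation results are lifted to 3D and aggregated into the point cloud using geometric information. Fun3DU requires no additional training and relies entirely on pre-trained models; on the SceneFun3D dataset it significantly outperforms existing open-vocabulary 3D segmentation methods.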

Link: https://arxiv.org/abs/2411.16310
Authors: Jaime Corsetti, Francesco Giuliari, Alice Fasoli, Davide Boscaini, Fabio Poiesi
Keywords: involves interpreting natural, locate functional interactive, interpreting natural language, functional interactive objects, scenes involves interpreting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report. 20 pages, 12 figures, 7 tables

Abstract:Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. For example, given a task like ‘turn on the ceiling light’, an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted in 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state-of-the-art open-vocabulary 3D segmentation approaches. Code will be released publicly.
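
The last stage, lifting per-view masks to 3D and aggregating them on the point cloud, can be approximated by a simple vote: a point is kept if it falls inside the 2D mask in a large enough fraction of the views that see it. The sketch below assumes the projection step has already produced, for each view, the indices of visible points and of points inside the mask. The function name, threshold, and data are hypothetical, and Fun3DU's actual aggregation may weight views differently.

```python
import numpy as np

def aggregate_views(num_points, visible_per_view, masked_per_view, min_ratio=0.5):
    """Keep points masked in more than min_ratio of the views that see them."""
    seen = np.zeros(num_points)
    hit = np.zeros(num_points)
    for visible, masked in zip(visible_per_view, masked_per_view):
        seen[visible] += 1  # views in which the point is visible
        hit[masked] += 1    # views in which it lands inside the 2D mask
    ratio = np.divide(hit, seen, out=np.zeros(num_points), where=seen > 0)
    return np.flatnonzero(ratio > min_ratio)

# Hypothetical scene with 5 points observed from 3 views.
visible = [[0, 1, 2, 3], [1, 2, 3, 4], [0, 2, 3]]
masked = [[1, 2], [2, 3], [2]]
functional_points = aggregate_views(5, visible, masked)
```

Voting across views suppresses masks that appear in only one image, which matters for small objects such as switches and handles.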

[CV-41] An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models

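[Quick read]: This paper addresses the challenges that existing conditional Denoising Diffusion Probabilistic Models (DDPMs) face in 3D scene understanding, where complex geometric details make it hard to fit the gradients of the data distribution (the scores) and lead to long training and inference times. The key to the solution is CDSegNet, an end-to-end robust semantic segmentation network built on a Conditional-Noise Framework (CNF). CDSegNet models the Noise Network (NN) as a learnable noise-feature generator, which lets the Conditional Network (CN) understand 3D scene semantics under multi-level feature perturbations and improves generalization to unseen scenes. Benefiting from the noise system of DDPMs, CDSegNet also shows strong robustness to noise and sparsity in experiments. Because it avoids fitting the scores directly from semantic labels, CDSegNet can generate semantic labels in a single inference step, which greatly shortens inference time, and it achieves state-of-the-art performance on public indoor and outdoor benchmarks.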

Link: https://arxiv.org/abs/2411.16308
Authors: Wentao Qu, Jing Wang, YongShun Gong, Xiaoshui Huang, Liang Xiao
Keywords: Denoising Diffusion Probabilistic, conditional Denoising Diffusion, Diffusion Probabilistic Models, Denoising Diffusion, Diffusion Probabilistic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non-DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end-to-end robust semantic Segmentation Network based on a Conditional-Noise Framework (CNF) of DDPMs, named CDSegNet. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise-feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi-level feature perturbations, enhancing the generalization in unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single-step inference like non-DDPMs, due to avoiding directly fitting the scores from semantic labels in the dominant network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state-of-the-art performance.

[CV-42] DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation

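[Quick read]: This paper addresses the inefficiency of interior design processes and the substantial gap between generative design outputs and practical needs. The key to the solution is DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. DiffDesign uses the generative priors of a pre-trained 2D diffusion model as its rendering backbone and guides the denoising process by disentangling cross-attention control over design attributes such as appearance, pose, and size. It also introduces an optimal transfer-based alignment module to enforce view consistency. The authors additionally build an interior design-specific dataset, DesignHelper, which is used to fine-tune the model across different spatial types and design styles.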

Link: https://arxiv.org/abs/2411.16301
Authors: Yuxuan Yang, Jingyao Wang, Tao Geng, Wenwen Qiang, Changwen Zheng, Fuchun Sun
Keywords: discipline involving aesthetics, creative discipline involving, involving aesthetics, materials science, complex and creative
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 32 pages

点击查看摘要

Abstract:Interior design is a complex and creative discipline involving aesthetics, functionality, ergonomics, and materials science. Effective solutions must meet diverse requirements, typically producing multiple deliverables such as renderings and design drawings from various perspectives. Consequently, interior design processes are often inefficient and demand significant creativity. With advances in machine learning, generative models have emerged as a promising means of improving efficiency by creating designs from text descriptions or sketches. However, few generative works focus on interior design, leading to substantial discrepancies between outputs and practical needs, such as differences in size, spatial scope, and the lack of controllable generation quality. To address these challenges, we propose DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. Specifically, we utilize the generative priors of a 2D diffusion model pre-trained on a large image dataset as our rendering backbone. We further guide the denoising process by disentangling cross-attention control over design attributes, such as appearance, pose, and size, and introduce an optimal transfer-based alignment module to enforce view consistency. Simultaneously, we construct an interior design-specific dataset, DesignHelper, consisting of over 400 solutions across more than 15 spatial types and 15 design styles. This dataset helps fine-tune DiffDesign. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness and robustness of DiffDesign.

[CV-43] A Performance Increment Strategy for Semantic Segmentation of Low-Resolution Images from Damaged Roads

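Quick read: This paper addresses the complex road conditions and low-quality datasets of emerging countries: in autonomous driving, existing semantic segmentation datasets are built mainly from high-resolution images of well-maintained urban roads and neglect low-resolution images of poorly maintained roads. The key to the solution is the Performance Increment Strategy for Semantic Segmentation (PISSS), a series of 14 training experiments that boost model performance, particularly on objects with few pixels, objects with undefined shapes, and highly imbalanced classes. The strategy reaches state-of-the-art mIoU of 79.8 and 68.8 on the Road Traversing Knowledge (RTK) and Technik Autonomer Systeme 500 (TAS500) test sets, respectively, and the paper also analyzes the shortcomings of DeepLabV3+ in small-object segmentation.

Link: https://arxiv.org/abs/2411.16295
Authors: Rafael S. Toledo,Cristiano S. Oliveira,Vitor H. T. Oliveira,Eric A. Antonelo,Aldo von Wangenheim
Keywords: Autonomous driving, deep learning models, well-maintained urban roads, semantic segmentation datasets, Brazilian roads
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: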

Abstract:Autonomous driving needs good roads, but 85% of Brazilian roads have damages that deep learning models may not regard as most semantic segmentation datasets for autonomous driving are high-resolution images of well-maintained urban roads. A representative dataset for emerging countries consists of low-resolution images of poorly maintained roads and includes labels of damage classes; in this scenario, three challenges arise: objects with few pixels, objects with undefined shapes, and highly underrepresented classes. To tackle these challenges, this work proposes the Performance Increment Strategy for Semantic Segmentation (PISSS) as a methodology of 14 training experiments to boost performance. With PISSS, we reached state-of-the-art results of 79.8 and 68.8 mIoU on the Road Traversing Knowledge (RTK) and Technik Autonomer Systeme 500 (TAS500) test sets, respectively. Furthermore, we also offer an analysis of DeepLabV3+ pitfalls for small object segmentation.
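
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how mean Intersection-over-Union (mIoU) can be computed for a single pair of label maps. It is not the authors' evaluation code: the function name `mean_iou`, the toy labels, and the class meanings are hypothetical, and benchmark protocols usually accumulate intersections and unions over the whole test set rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Flattened toy label maps; class meanings are hypothetical
# (0 = road, 1 = pothole, 2 = crack).
pred = [0, 0, 1, 1, 2, 0]
target = [0, 0, 1, 2, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 4))
```

On this toy input the dominant class scores an IoU of 1.0 while the two rare classes each score 0.5, so a single mislabeled pixel pulls the per-class average well below the pixel accuracy. This is why mIoU is sensitive to small, underrepresented classes such as road damage.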

[CV-44] Utilizing Uncertainty in 2D Pose Detectors for Probabilistic 3D Human Mesh Recovery WACV2025

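Quick read: This paper addresses depth ambiguity, occlusion, and truncation in monocular 3D human pose and shape estimation. The key to the solution is a new form of supervision that minimizes the distance between the learned distribution over 3D human meshes and the distributions encoded in the heatmaps of a 2D pose detector, so that the model captures the true distribution more fully. The paper also shows that existing methods often generate incorrect hypotheses for invisible joints, proposes using person segmentation masks during training to reduce the number of invalid samples, and introduces two new metrics to measure this improvement. The resulting normalizing-flow-based method produces plausible 3D human mesh hypotheses that are consistent with the image evidence while remaining highly diverse for ambiguous body parts.

Link: https://arxiv.org/abs/2411.16289
Authors: Tom Wehrbein,Marco Rudolph,Bodo Rosenhahn,Bastian Wandt
Keywords: inherently ill-posed problem, ill-posed problem due, depth ambiguities, shape estimation, inherently ill-posed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025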

Abstract:Monocular 3D human pose and shape estimation is an inherently ill-posed problem due to depth ambiguities, occlusions, and truncations. Recent probabilistic approaches learn a distribution over plausible 3D human meshes by maximizing the likelihood of the ground-truth pose given an image. We show that this objective function alone is not sufficient to best capture the full distributions. Instead, we propose to additionally supervise the learned distributions by minimizing the distance to distributions encoded in heatmaps of a 2D pose detector. Moreover, we reveal that current methods often generate incorrect hypotheses for invisible joints which is not detected by the evaluation protocols. We demonstrate that person segmentation masks can be utilized during training to significantly decrease the number of invalid samples and introduce two metrics to evaluate it. Our normalizing flow-based approach predicts plausible 3D human mesh hypotheses that are consistent with the image evidence while maintaining high diversity for ambiguous body parts. Experiments on 3DPW and EMDB show that we outperform other state-of-the-art probabilistic methods. Code is available for research purposes at this https URL.

[CV-45] Open-Vocabulary Octree-Graph for 3D Scene Understanding

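Quick read: This paper addresses storage efficiency and the representation of spatial relations in open-vocabulary 3D scene understanding. Existing methods rely on point clouds; although these support object segmentation, their unordered nature and high storage cost make them inefficient for downstream tasks such as path planning and complex text-based object retrieval. The proposed solution is Octree-Graph, built in three steps. First, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm obtain 3D instances and their semantic features. Next, an adaptive octree structure stores semantic information and occupancy, adjusting to each object's shape. Finally, the Octree-Graph is constructed, with each adaptive octree acting as a graph node and the edges describing spatial relations between nodes. Extensive experiments on several widely used datasets demonstrate the method's versatility and effectiveness.

Link: https://arxiv.org/abs/2411.16253
Authors: Zhigang Wang,Yifei Su,Chenhui Li,Dong Wang,Yan Huang,Bin Zhao,Xuelong Li
Keywords: embodied agents, indispensable for embodied, Group-wise Segment Merging, Chronological Group-wise Segment, scene understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11pages,7figures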

Abstract:Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and complex text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed where each adaptive-octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely-used datasets, demonstrating the versatility and effectiveness of our method.
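
To make the storage argument concrete, the following is a minimal sketch of a generic sparse octree that subdivides only the cells containing points. It is an assumption-laden illustration, not the paper's adaptive-octree or Octree-Graph implementation: `OctreeNode`, `build_octree`, and the toy points are hypothetical names and data, and the semantic features and graph edges described in the abstract are omitted.

```python
class OctreeNode:
    def __init__(self, center, half_size):
        self.center = center        # (x, y, z) of the cell centre
        self.half_size = half_size  # half of the cell's edge length
        self.occupied = False       # True if any point falls in the cell
        self.children = None        # list of 8 children once subdivided


def build_octree(points, center, half_size, max_depth):
    """Recursively subdivide occupied cells; empty cells stay as leaves."""
    node = OctreeNode(center, half_size)
    inside = [p for p in points
              if all(abs(p[i] - center[i]) <= half_size for i in range(3))]
    if not inside:
        return node
    node.occupied = True
    if max_depth == 0:
        return node
    h = half_size / 2
    node.children = [
        build_octree(inside, (center[0] + dx, center[1] + dy, center[2] + dz),
                     h, max_depth - 1)
        for dx in (-h, h) for dy in (-h, h) for dz in (-h, h)
    ]
    return node


def count_nodes(node):
    if node.children is None:
        return 1
    return 1 + sum(count_nodes(c) for c in node.children)


def count_occupied_leaves(node):
    if node.children is None:
        return 1 if node.occupied else 0
    return sum(count_occupied_leaves(c) for c in node.children)


# Two toy points in the unit cube, two levels of subdivision.
points = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
tree = build_octree(points, center=(0.5, 0.5, 0.5), half_size=0.5, max_depth=2)
print(count_nodes(tree), count_occupied_leaves(tree))
```

With these two points the tree holds 25 nodes, whereas a dense grid at the same depth would need 64 leaf cells, and occupancy can be read directly from each cell. That is the general property the abstract contrasts with unordered point clouds.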

[CV-46] Diagnosis of diabetic retinopathy using machine learning deep learning technique

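Quick read: This paper addresses the fact that manual analysis of fundus images, which are used to diagnose eye diseases such as diabetic retinopathy, glaucoma, and age-related macular degeneration, is time-consuming and error-prone. The key to the solution is combining object detection with machine learning classification. Specifically, the paper uses YOLO_V8 to detect regions of interest (ROIs) in fundus images, such as the optic disc, optic cup, and lesions, and then uses a support vector machine (SVM) classifier to assign the ROIs to different diabetic retinopathy (DR) stages based on pathological signs such as exudates, microaneurysms, and haemorrhages. The method reports 84% accuracy for fundus detection and is particularly suited to retinal disease screening in remote areas around the world.

Link: https://arxiv.org/abs/2411.16250
Authors: Eric Shah,Jay Patel,Mr.Vishal Katheriya,Parth Pataliya
Keywords: age-related macular degeneration, Fundus images, diabetic retinopathy, macular degeneration, diagnosing various eye
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9 pages, 11 figures, Journal Paper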

Abstract:Fundus images are widely used for diagnosing various eye diseases, such as diabetic retinopathy, glaucoma, and age-related macular degeneration. However, manual analysis of fundus images is time-consuming and prone to errors. In this report, we propose a novel method for fundus detection using object detection and machine learning classification techniques. We use a YOLO_V8 to perform object detection on fundus images and locate the regions of interest (ROIs) such as optic disc, optic cup and lesions. We then use machine learning SVM classification algorithms to classify the ROIs into different DR stages based on the presence or absence of pathological signs such as exudates, microaneurysms, and haemorrhages etc. Our method achieves 84% accuracy and efficiency for fundus detection and can be applied for retinal fundus disease triage, especially in remote areas around the world.

[CV-47] Weakly supervised image segmentation for defect-based grading of fresh produce

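Quick read: This paper addresses the difficulty of obtaining high-quality model predictions from image-based machine learning in agriculture when data and annotations are scarce. Specifically, it focuses on postharvest quality assessment of bananas in decentralized supply chains, in particular the detection and segmentation of surface defects. The key to the solution is weak supervision: instead of time-consuming pixel-level annotation, coarse labels are combined with the Segment Anything Model (SAM) to generate dense annotations, which greatly reduces manual labeling effort while achieving a panoptic quality score of 77.6%. The approach shows the potential of low-cost, accurate segmentation for quantifying defects in data-limited agricultural settings.

Link: https://arxiv.org/abs/2411.16219
Authors: Manuel Knott,Divinefavour Odion,Sameer Sontakke,Anup Karwa,Thijs Defraeye
Keywords: Implementing image-based machine, image-based machine learning, Implementing image-based, high-quality model predictions, achieve high-quality model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: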

Abstract:Implementing image-based machine learning in agriculture is often limited by scarce data and annotations, making it hard to achieve high-quality model predictions. This study tackles the issue of postharvest quality assessment of bananas in decentralized supply chains. We propose a method to detect and segment surface defects in banana images using panoptic segmentation to quantify defect size and number. Instead of time-consuming pixel-level annotations, we use weak supervision with coarse labels. A dataset of 476 smartphone images of bananas was collected under real-world field conditions and annotated for bruises and scars. Using the Segment Anything Model (SAM), a recently published foundation model for image segmentation, we generated dense annotations from coarse bounding boxes to train a segmentation model, significantly reducing manual effort while achieving a panoptic quality score of 77.6%. This demonstrates SAM’s potential for low-effort, accurate segmentation in agricultural settings with limited data.
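
The panoptic quality (PQ) score cited above follows a standard definition: predicted and ground-truth segments count as matched when their IoU exceeds 0.5, and PQ is the sum of matched IoUs divided by TP + 0.5·FP + 0.5·FN. The sketch below assumes the matching has already been done; `panoptic_quality` and the toy numbers are illustrative and are not taken from the paper.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ from the IoUs of matched segment pairs plus unmatched counts.

    matched_ious: IoU of each true-positive pair (each assumed > 0.5)
    num_fp: predicted segments with no ground-truth match
    num_fn: ground-truth segments with no predicted match
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0
    return sum(matched_ious) / denom


# Toy example: three matched defects, one spurious and one missed.
pq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
print(round(pq, 3))
```

Because the numerator rewards tight masks and the denominator penalizes both spurious and missed segments, PQ reflects the two quantities the study cares about, defect size and defect count, in a single number.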

[CV-48] Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding

Quick read: This paper addresses degradation diversity and prompt singularity in multiple-in-one image restoration. The key to the solution is a Local Dynamic Optimization (LDO) module and a Conditional Feature Embedding (CFE) module. The LDO module dynamically processes degraded regions of varying types and granularities, while the CFE module guides the decoder to exploit degradation-type-related features, significantly improving performance in mixed-degradation restoration.

Link: https://arxiv.org/abs/2411.16217
Authors: Yubin Gu,Yuan Meng,Xiaoshuai Sun,Jiayi Ji,Weijian Ruan,Rongrong Ji
Keywords: made significant progress, significant progress, aiming to handle, made significant, Local Dynamic Optimization
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 8 tables

Abstract:Multiple-in-one image restoration (IR) has made significant progress, aiming to handle all types of single degraded image restoration with a single model. However, in real-world scenarios, images often suffer from combinations of multiple degradation factors. Existing multiple-in-one IR models encounter challenges related to degradation diversity and prompt singularity when addressing this issue. In this paper, we propose a novel multiple-in-one IR model that can effectively restore images with both single and mixed degradations. To address degradation diversity, we design a Local Dynamic Optimization (LDO) module which dynamically processes degraded areas of varying types and granularities. To tackle the prompt singularity issue, we develop an efficient Conditional Feature Embedding (CFE) module that guides the decoder in leveraging degradation-type-related features, significantly improving the model’s performance in mixed degradation restoration scenarios. To validate the effectiveness of our model, we introduce a new dataset containing both single and mixed degradation elements. Experimental results demonstrate that our proposed model achieves state-of-the-art (SOTA) performance not only on mixed degradation tasks but also on classic single-task restoration benchmarks.

[CV-49] SMGDiff: Soccer Motion Generation using diffusion probabilistic models

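Quick read: This paper addresses the generation of realistic soccer motions. In video games and VR/AR applications, the intricate interactions between player and ball make it hard to generate soccer motions that are both real-time and user-controllable. The key to the solution is SMGDiff, a two-stage framework that combines real-time character control with a diffusion-based generative model. The first stage instantly converts coarse user controls into diverse global character trajectories. The second stage uses a transformer-based autoregressive diffusion model to generate soccer motions conditioned on those trajectories, and a contact guidance module applied during inference refines ball-foot contact details to keep the motion high-quality and diverse. The paper also contributes a large-scale dataset of more than 1.08 million frames of diverse soccer motions.

Link: https://arxiv.org/abs/2411.16216
Authors: Hongdi Yang,Chengyang Li,Zhenxuan Wu,Gaozheng Li,Jingya Wang,Jingyi Yu,Zhuo Su,Lan Xu
Keywords: globally renowned sport, globally renowned, renowned sport, sport with significant, significant applications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: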

Abstract:Soccer is a globally renowned sport with significant applications in video games and VR/AR. However, generating realistic soccer motions remains challenging due to the intricate interactions between the human player and the ball. In this paper, we introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character control with a powerful diffusion-based generative model, ensuring high-quality and diverse output motion. In the first stage, we instantly transform coarse user controls into diverse global trajectories of the character. In the second stage, we employ a transformer-based autoregressive diffusion model to generate soccer motions based on trajectory conditioning. We further incorporate a contact guidance module during inference to optimize the contact details for realistic ball-foot interactions. Moreover, we contribute a large-scale soccer motion dataset consisting of over 1.08 million frames of diverse soccer motions. Extensive experiments demonstrate that our SMGDiff significantly outperforms existing methods in terms of motion quality and condition alignment.

[CV-50] SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

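Quick read: This paper addresses the difficulty that existing large language models for video (Video-LLMs) have in effectively integrating the rich audio-visual information in long videos when understanding and interpreting them. The solution has three key parts: (i) SAVEn-Vid, the first long audio-visual video dataset, with more than 58k audio-visual instructions; (ii) SAVEnVideo, a time-aware Audio-Visual Large Language Model (AV-LLM) fine-tuned on SAVEn-Vid; and (iii) AVBench, a benchmark of 2,500 question-answer pairs for evaluating enhanced audio-visual understanding in long videos. Experiments show that SAVEnVideo outperforms the best Video-LLM by 3.61% on the zero-shot long-video task (Video-MME) and the leading audio-visual LLM by 1.29% on the zero-shot audio-visual task (Music-AVQA), reaching state-of-the-art performance at the 7B parameter scale.

Link: https://arxiv.org/abs/2411.16213
Authors: Jungang Li,Sicheng Tao,Yibo Yan,Xiaojie Gu,Haodong Xu,Xu Zheng,Yuanhuiyi Lyu,Linfeng Zhang,Xuming Hu
Keywords: explore Large Language, Large Language Models, Large Language, Audio-Visual Large Language, interpreting long videos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: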

Abstract:Endeavors have been made to explore Large Language Models for video analysis (Video-LLMs), particularly in understanding and interpreting long videos. However, existing Video-LLMs still face challenges in effectively integrating the rich and diverse audio-visual information inherent in long videos, which is crucial for comprehensive understanding. This raises the question: how can we leverage embedded audio-visual information to enhance long video understanding? Therefore, (i) we introduce SAVEn-Vid, the first-ever long audio-visual video dataset comprising over 58k audio-visual instructions. (ii) From the model perspective, we propose a time-aware Audio-Visual Large Language Model (AV-LLM), SAVEnVideo, fine-tuned on SAVEn-Vid. (iii) Besides, we present AVBench, a benchmark containing 2,500 QAs designed to evaluate models on enhanced audio-visual comprehension tasks within long video, challenging their ability to handle intricate audio-visual interactions. Experiments on AVBench reveal the limitations of current AV-LLMs. Experiments also demonstrate that SAVEnVideo outperforms the best Video-LLM by 3.61% on the zero-shot long video task (Video-MME) and surpasses the leading audio-visual LLM by 1.29% on the zero-shot audio-visual task (Music-AVQA). Consequently, at the 7B parameter scale, SAVEnVideo can achieve state-of-the-art performance. Our dataset and code will be released at this https URL upon acceptance.

[CV-51] VIRES: Video Instance Repainting with Sketch and Text Guidance

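Quick read: This paper addresses temporal consistency and accurate alignment with the provided sketch sequence in video instance repainting, replacement, generation, and removal. The key to the solution is VIRES, which leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. Its technical components are: 1) a Sequential ControlNet with standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details; 2) a diffusion transformer backbone augmented with sketch attention to interpret and inject fine-grained sketch semantics; and 3) a sketch-aware encoder that keeps the repainted results aligned with the provided sketch sequence. The paper also contributes VireSet, a dataset for training and evaluating video instance editing methods. Experiments show that VIRES outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.

Link: https://arxiv.org/abs/2411.16199
Authors: Shuchen Weng,Haojie Zheng,Peixuan Zhan,Yuchen Hong,Han Jiang,Si Li,Boxin Shi
Keywords: video instance repainting, instance repainting, text guidance, enabling video instance, instance repainting method
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: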

Abstract:We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.

[CV-52] Interpreting Object-level Foundation Models via Visual Precision Search

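Quick read: This paper addresses the difficulty of interpreting the decisions of multimodal pre-trained models such as Grounding DINO and Florence-2 on visual grounding and object detection tasks. Existing attribution methods have notable limitations: (1) gradient-based methods localize imprecisely because of the visual-textual fusion inside the model, and (2) perturbation-based methods produce noisy saliency maps, which limits fine-grained interpretation. The proposed solution, Visual Precision Search, divides the input into sparse sub-regions and uses consistency and collaboration scores to identify the regions critical to the decision, yielding more precise attribution maps. Because the method bypasses the model's internal parameters, it avoids the attribution problems caused by multimodal fusion and markedly improves the interpretability of object-level tasks; experiments show that it surpasses state-of-the-art methods on multiple evaluation metrics.

Link: https://arxiv.org/abs/2411.16198
Authors: Ruoyu Chen,Siyuan Liang,Jingzhi Li,Shiming Liu,Maosen Li,Zheng Huang,Hua Zhang,Xiaochun Cao
Keywords: Grounding DINO, propelled object-level foundation, pre-training have propelled, Visual Precision Search, object-level foundation models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: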

Abstract:Advances in multimodal pre-training have propelled object-level foundation models, such as Grounding DINO and Florence-2, in tasks like visual grounding and object detection. However, interpreting these models’ decisions has grown increasingly challenging. Existing interpretable attribution methods for object-level task interpretation have notable limitations: (1) gradient-based methods lack precise localization due to visual-textual fusion in foundation models, and (2) perturbation-based methods produce noisy saliency maps, limiting fine-grained interpretability. To address these, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. Our method bypasses internal model parameters to overcome attribution issues from multimodal fusion, dividing inputs into sparse sub-regions and using consistency and collaboration scores to accurately identify critical decision-making regions. We also conducted a theoretical analysis of the boundary guarantees and scope of applicability of our method. Experiments on RefCOCO, MS COCO, and LVIS show our approach enhances object-level task interpretability over SOTA for Grounding DINO and Florence-2 across various evaluation metrics, with faithfulness gains of 23.7%, 31.6%, and 20.1% on MS COCO, LVIS, and RefCOCO for Grounding DINO, and 102.9% and 66.9% on MS COCO and RefCOCO for Florence-2. Additionally, our method can interpret failures in visual grounding and object detection tasks, surpassing existing methods across multiple evaluation metrics. The code will be released at this https URL.

[CV-53] Learn from Foundation Model: Fruit Detection Model without Manual Annotation

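Quick read: This paper addresses data scarcity in agriculture, in particular the lack of sufficient annotated data for fruit detection. The key to the solution is SDM-D (Segmentation-Description-Matching-Distilling), a framework that uses foundation models (SAM2 and OpenCLIP) for segmentation and zero-shot open-vocabulary classification, and then applies knowledge distillation to obtain small, efficient models that can be deployed on edge devices. Without any manual annotation, SDM-D performs well on fruit detection tasks (object detection, semantic segmentation, and instance segmentation), nearly matching models trained with abundant labels and outperforming open-set detection methods such as Grounding SAM and YOLO-World.

Link: https://arxiv.org/abs/2411.16196
Authors: Yanan Wang,Zhenghao Fei,Ruichen Li,Yibin Ying
Keywords: limited data availability, Recent breakthroughs, transferring knowledge pre-trained, breakthroughs in large, enabled the possibility
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 12 figures, conference or other essential info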

Abstract:Recent breakthroughs in large foundation models have enabled the possibility of transferring knowledge pre-trained on vast datasets to domains with limited data availability. Agriculture is one of the domains that lacks sufficient data. This study proposes a framework to train effective, domain-specific, small models from foundation models without manual annotation. Our approach begins with SDM (Segmentation-Description-Matching), a stage that leverages two foundation models: SAM2 (Segment Anything in Images and Videos) for segmentation and OpenCLIP (Open Contrastive Language-Image Pretraining) for zero-shot open-vocabulary classification. In the second stage, a novel knowledge distillation mechanism is utilized to distill compact, edge-deployable models from SDM, enhancing both inference speed and perception accuracy. The complete method, termed SDM-D (Segmentation-Description-Matching-Distilling), demonstrates strong performance across various fruit detection tasks (object detection, semantic segmentation, and instance segmentation) without manual annotation. It nearly matches the performance of models trained with abundant labels. Notably, SDM-D outperforms open-set detection methods such as Grounding SAM and YOLO-World on all tested fruit detection datasets. Additionally, we introduce MegaFruits, a comprehensive fruit segmentation dataset encompassing over 25,000 images, and all code and datasets are made publicly available at this https URL.
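
The description-matching idea behind zero-shot open-vocabulary classification can be sketched as a nearest-neighbour search in a shared embedding space: each segmented region is assigned the text description whose embedding has the highest cosine similarity to the region's embedding. The code below is a hedged illustration only. The three-dimensional vectors are hand-written stand-ins for real image and text embeddings, and `cosine` and `match_label` are hypothetical helper names, not functions from SAM2, OpenCLIP, or the paper's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_label(region_embedding, text_embeddings):
    """Return the description most similar to the region embedding."""
    return max(text_embeddings,
               key=lambda label: cosine(region_embedding, text_embeddings[label]))


# Toy stand-ins for text embeddings of candidate descriptions.
text_embeddings = {
    "a red ripe strawberry": [0.9, 0.1, 0.0],
    "a green leaf": [0.0, 0.2, 0.9],
}
# Toy stand-in for the embedding of one segmented region.
region = [0.8, 0.3, 0.1]
label = match_label(region, text_embeddings)
print(label)
```

In a pipeline of this kind, labels produced this way could serve as pseudo-annotations for training a smaller model, which is the role the abstract assigns to the distillation stage.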

[CV-54] Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation

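Quick read: This paper addresses multiview inconsistency, insufficient mesh fidelity, and blurry meshes when generating 3D meshes from a single image. The key to the solution is Fancy123, which comprises two enhancement modules and an unprojection operation. The appearance enhancement module deforms the 2D multiview images to correct pixel misalignment and improve multiview consistency; the fidelity enhancement module deforms the 3D mesh to better match the input image; and the unprojection operation projects the input image and the deformed multiview images onto the mesh generated by the Large Reconstruction Model (LRM) to ensure high clarity, discarding the blurry colors predicted by the LRM. The modules are plug-and-play at inference time and integrate seamlessly into existing single-image-to-3D methods, and extensive qualitative and quantitative experiments confirm a significant improvement over existing techniques.

Link: https://arxiv.org/abs/2411.16185
Authors: Qiao Yu,Xianzhi Li,Yuan Tang,Xu Han,Long Hu,Yixue Hao,Min Chen
Keywords: Large Reconstruction Model, multiview images, ill-posed task, important but ill-posed, Large Reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL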

Abstract:Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM’s generated mesh ensures high clarity, discarding LRM’s predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123’s SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.

[CV-55] Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

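Quick read: This paper addresses the over-segmentation common in existing 3D instance segmentation methods, where unsupervised merging produces redundant and inaccurate 3D proposals that complicate downstream tasks. The key to the solution is two modules: a 3D-Aware 2D Mask Tracking module and a 3D Mask Optimization module. The former uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to keep object masks consistent across video frames; the latter uses a dynamic programming algorithm to select an optimal set of views and refines the superpoints to produce a final 3D proposal for each object, achieving comprehensive object coverage of the scene while reducing unnecessary proposals.

Link: https://arxiv.org/abs/2411.16183
Authors: Phuc Nguyen,Minh Luu,Anh Tran,Cuong Pham,Khoi Nguyen
Keywords: frequently encounter issues, methods frequently encounter, issues with over-segmentation, instance segmentation, Instance Segmentation tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL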

Abstract:Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.

[CV-56] Event-boosted Deformable 3D Gaussians for Fast Dynamic Scene Reconstruction

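Quick read: This paper addresses the difficulty 3D Gaussian Splatting (3D-GS) has with fast motion in real-time rendering, which stems from the low temporal resolution of RGB cameras. The key to the solution is combining the high-temporal-resolution, continuous motion data of event cameras with deformable 3D-GS for fast dynamic scene reconstruction. The approach has two strategies: 1) GS-Threshold Joint Modeling (GTJM), a mutually reinforcing process that significantly improves both 3D reconstruction and threshold modeling; and 2) Dynamic-Static Decomposition (DSD), which identifies dynamic regions and applies a buffer-based soft decomposition to accelerate rendering and improve fidelity in dynamic regions. Together they achieve high-fidelity dynamic reconstruction at 156 FPS at 400×400 resolution on an RTX 3090 GPU.

Link: https://arxiv.org/abs/2411.16180
Authors: Wenhao Xu,Wenming Weng,Yueyi Zhang,Ruikang Xu,Zhiwei Xiong
Keywords: Gaussian Splatting, RGB cameras, enables real-time rendering, low temporal resolution, enables real-time
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: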

Abstract:3D Gaussian Splatting (3D-GS) enables real-time rendering but struggles with fast motion due to low temporal resolution of RGB cameras. To address this, we introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for fast dynamic scene reconstruction. We observe that threshold modeling for events plays a crucial role in achieving high-quality reconstruction. Therefore, we propose a GS-Threshold Joint Modeling (GTJM) strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. Moreover, we introduce a Dynamic-Static Decomposition (DSD) strategy that first identifies dynamic areas by exploiting the inability of static Gaussians to represent motions, then applies a buffer-based soft decomposition to separate dynamic and static areas. This strategy accelerates rendering by avoiding unnecessary deformation in static areas, and focuses on dynamic areas to enhance fidelity. Our approach achieves high-fidelity dynamic reconstruction at 156 FPS with a 400×400 resolution on an RTX 3090 GPU.
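
For context on why threshold modeling matters, the commonly used idealized event-camera model is written out below. This is background stated as an assumption; the abstract does not give the paper's exact formulation, and the symbols u, t_ref, and C are notation introduced here.

```latex
% Idealized event-generation model (assumed background, not quoted from the paper).
% L(u, t) = \log I(u, t) is the log intensity at pixel u and time t,
% t_{\mathrm{ref}} is the time of the previous event at u,
% and C > 0 is the contrast threshold.
\[
  \Delta L(u, t) = L(u, t) - L(u, t_{\mathrm{ref}}), \qquad
  \text{an event } (u, t, p) \text{ fires when } \lvert \Delta L(u, t) \rvert \ge C, \qquad
  p = \mathrm{sign}\bigl(\Delta L(u, t)\bigr).
\]
```

Under this model, summing event polarities over a time window and multiplying by C approximates the log-intensity change in that window, so an inaccurate C translates directly into an inaccurate supervision signal for the reconstruction.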

[CV-57] SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

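Quick read: This paper addresses two main issues in processing long video content. First, when existing Large Multi-modal Models (LMMs) handle long, untrimmed videos, limits on context length and memory overhead cause information loss and reduce the relevance of model responses. Second, with video data growing exponentially across web platforms, understanding long-form video has become essential for advancing general intelligence. The key to the solution is SALOVA (Segment-Augmented LOng Video Assistant), a new video-LLM framework that improves long-video understanding through two contributions: (i) the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level so that models can capture scene continuity and retain rich descriptive context; and (ii) a robust architecture that combines a dynamic routing mechanism with a spatio-temporal projector to efficiently retrieve and process the video segments relevant to a user query. By precisely identifying and retrieving relevant segments in response to queries, SALOVA improves the contextual relevance of its generated responses.

Link: https://arxiv.org/abs/2411.16173
Authors: Junho Kim,Hyunjun Kim,Hosu Lee,Yong Man Ro
Keywords: Large Multi-modal Models, Large Multi-modal, substantial memory overhead, remains challenging due, advances in Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL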

Abstract:Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through targeted retrieval process. We address two main challenges to achieve it: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating dynamic routing mechanism and spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos, showing significant capability to maintain contextual integrity across extended sequences.

[CV-58] U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields ICLR

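[Quick Read]: This paper addresses the colour shifts, low contrast, and haziness that light absorption, refraction, and scattering cause in underwater images. The key to the solution is the Unsupervised Underwater Neural Radiance Field (U2NeRF), a transformer-based architecture that simultaneously learns to render and restore novel views conditioned on multi-view geometry. It implicitly bakes restoration into the NeRF pipeline and disentangles the predicted colour into several components (scene radiance, direct transmission map, backscatter transmission map, and global background light), which together reconstruct the underwater image in a self-supervised manner. The paper also releases the UVS dataset of 12 underwater scenes for validating the method. Experiments show that, when optimized on a single scene, U2NeRF improves over several baselines by 11% on LPIPS, 5% on UIQM, and 4% on UCIQE (on average), demonstrating improved rendering and restoration.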

Link: https://arxiv.org/abs/2411.16172
Authors: Vinayak Gupta,Manoj S,Mukund Varma T,Kaushik Mitra
Keywords: Unsupervised Underwater Neural, Neural Radiance Field, Underwater images suffer, Underwater Neural Radiance, low contrast
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR Tiny Papers 2024. arXiv admin note: text overlap with arXiv:2207.13298

Abstract:Underwater images suffer from colour shifts, low contrast, and haziness due to light absorption, refraction, and scattering, and restoring these images has warranted much attention. In this work, we present the Unsupervised Underwater Neural Radiance Field (U2NeRF), a transformer-based architecture that learns to render and restore novel views conditioned on multi-view geometry simultaneously. Due to the absence of supervision, we attempt to implicitly bake restoring capabilities onto the NeRF pipeline and disentangle the predicted color into several components (scene radiance, direct transmission map, backscatter transmission map, and global background light), which, when combined, reconstruct the underwater image in a self-supervised manner. In addition, we release an Underwater View Synthesis (UVS) dataset consisting of 12 underwater scenes, containing both synthetically-generated and real-world data. Our experiments demonstrate that when optimized on a single scene, U2NeRF outperforms several baselines by as much as 11% on LPIPS, 5% on UIQM, and 4% on UCIQE (on average) and showcases improved rendering and restoration capabilities. Code will be made available upon acceptance.
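
The abstract lists four components but does not state how they are combined. As a rough illustration only, the sketch below assumes the commonly used underwater image formation model, in which the observed colour is the scene radiance attenuated by the direct transmission plus background light weighted by one minus the backscatter transmission. The function name and the per-pixel scalar form are hypothetical and are not taken from the paper.

```python
def compose_underwater_pixel(radiance, direct_t, backscatter_t, background):
    """Recombine the four components into an observed intensity.

    ASSUMPTION: uses the common formation model
        observed = radiance * direct_t + background * (1 - backscatter_t)
    The paper may combine its components differently.
    """
    return radiance * direct_t + background * (1.0 - backscatter_t)


# Clear water (both transmissions are 1): the observation equals the scene radiance.
clear = compose_underwater_pixel(0.8, 1.0, 1.0, 0.2)
# Fully attenuated (both transmissions are 0): only background light remains.
murky = compose_underwater_pixel(0.8, 0.0, 0.0, 0.2)
```

Under this assumed model, restoration amounts to recovering `radiance` from observations, which is why a self-supervised reconstruction loss on the recombined image can train the decomposition.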

[CV-59] Image Generation Diversity Issues and How to Tame Them

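[Quick Read]: This paper addresses the limited diversity of generative models: existing models fail to fully capture the diversity of the real data distribution, and effective metrics for assessing this are lacking. The key to the solution is a new metric, the Image Retrieval Score (IRS), which frames diversity as an image retrieval problem and uses synthetic data as queries to retrieve real images, thereby quantifying the diversity of a generative model's output. The paper also introduces Diversity-Aware Diffusion Models (DiADM), which disentangle diversity from image quality by feeding pseudo-unconditional features into a diversity-aware module, improving diversity without loss of image quality.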

Link: https://arxiv.org/abs/2411.16171
Authors: Mischa Dombrowski,Weitong Zhang,Sarah Cechnicka,Hadrien Reynaud,Bernhard Kainz
Keywords: Generative, generative models, diversity, models, methods now produce
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 6 tables, 12 figures

Abstract:Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. Unlike quality issues, diversity limitations in generative models are hard to detect visually, requiring specific metrics for assessment. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model’s output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art models surpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves diversity of unconditional diffusion models without loss of image quality. We do this by disentangling diversity from image quality by using a diversity aware module that uses pseudo-unconditional features as input. We provide a Python package offering unified feature extraction and metric computation to further facilitate the evaluation of generative models: this https URL.
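
The retrieval framing can be illustrated with a toy computation. The sketch below is a simplification and not the paper's IRS: it only counts which real items are the nearest neighbour of at least one synthetic query, and omits the statistical confidence measure the abstract mentions. The function name, the Euclidean distance, and the 2-D toy features are all assumptions made for illustration.

```python
import math


def retrieved_fraction(real_feats, synth_feats):
    """Fraction of real items retrieved as the nearest neighbour of some synthetic query.

    ASSUMPTION: a simplified stand-in for a retrieval-based diversity score.
    """
    retrieved = set()
    for query in synth_feats:
        # Index of the real feature closest to this synthetic query.
        nearest = min(range(len(real_feats)),
                      key=lambda i: math.dist(query, real_feats[i]))
        retrieved.add(nearest)
    return len(retrieved) / len(real_feats)


real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
diverse = [(0.1, 0.0), (0.9, 0.1), (0.0, 1.1), (4.8, 5.1)]      # covers every real item
collapsed = [(0.1, 0.0), (0.0, 0.1), (0.05, 0.05), (0.1, 0.1)]  # all near one real item

diverse_score = retrieved_fraction(real, diverse)      # 1.0
collapsed_score = retrieved_fraction(real, collapsed)  # 0.25
```

Both synthetic sets could look equally realistic sample by sample; only the retrieval count exposes that the second set covers a single mode of the real data.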

[CV-60] CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction

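[Quick Read]: This paper addresses the limited efficiency gains or significant accuracy drops that existing linear-complexity visual Transformers suffer when deployed on resource-constrained mobile devices. The key to the solution is a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism. An asymmetrical feature decoupling strategy and a dynamic memory unit separate the learning of local inductive bias from that of long-range dependencies, while a dual interaction module promotes effective interaction among features at different layers, which markedly improves efficiency while maintaining high accuracy. Experiments show strong results on ImageNet-1K, COCO, and ADE20K, for example 78.4/82.1% top-1 accuracy on ImageNet-1K at a cost of only 0.7/1.9 GMACs.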

Link: https://arxiv.org/abs/2411.16170
Authors: Yuan Zhou,Qingshan Xu,Jiequan Cui,Junbao Zhou,Jing Zhang,Richang Hong,Hanwang Zhang
Keywords: linear-complexity visual Transformers, efficient linear-complexity visual, visual Transformers, design efficient linear-complexity, large efforts
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, large efforts have been made to design efficient linear-complexity visual Transformers. However, current linear attention models are generally unsuitable for deployment on resource-constrained mobile devices, because they suffer from either few efficiency gains or significant accuracy drops. In this paper, we propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism, revealing that features’ decoupling and interaction can fully unleash the power of linear attention. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies, thereby preserving sufficient local and global information while effectively enhancing the efficiency of models. Then, a dynamic memory unit is employed to maintain critical information along the network pipeline. Moreover, we design a dual interaction module to effectively facilitate interaction between local inductive bias and long-range information as well as among features at different layers. By adopting a decoupled learning approach and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy. Extensive experiments on ImageNet-1K, COCO, and ADE20K datasets demonstrate the effectiveness of our approach, e.g., achieving 78.4/82.1% top-1 accuracy on ImageNet-1K at the cost of only 0.7/1.9 GMACs. Codes will be released on \href…github.

[CV-61] Local and Global Feature Attention Fusion Network for Face Recognition

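[Quick Read]: This paper addresses the recognition of low-quality face images, in particular the difficulty caused by missing or deformed partial facial regions. The key to the solution is a Local and Global Feature Attention Fusion (LGAF) network based on feature quality. The network adaptively allocates attention between local and global features according to feature quality and, through the complementarity of local and global information, extracts more discriminative, higher-quality face features. The paper also introduces a Multi-Head Multi-Scale Local Feature Extraction (MHMS) module to increase the separability of facial features in high-dimensional space and to capture fine-grained information at multiple scales. Experiments show that LGAF achieves the best average performance on several validation sets and outperforms state-of-the-art (SoTA) methods on TinyFace and SCFace.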

Link: https://arxiv.org/abs/2411.16169
Authors: Wang Yu,Wei Wei
Keywords: partial facial regions, face images remains, partial facial, remains a challenge, challenge due
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recognition of low-quality face images remains a challenge due to invisibility or deformation in partial facial regions. For low-quality images dominated by missing partial facial regions, local region similarity contributes more to face recognition (FR). Conversely, in cases dominated by local face deformation, excessive attention to local regions may lead to misjudgments, while global features exhibit better robustness. However, most existing FR methods neglect the bias in feature quality of low-quality images introduced by different factors. To address this issue, we propose a Local and Global Feature Attention Fusion (LGAF) network based on feature quality. The network adaptively allocates attention between local and global features according to feature quality and obtains more discriminative and high-quality face features through local and global information complementarity. In addition, to effectively obtain fine-grained information at various scales and increase the separability of facial features in high-dimensional space, we introduce a Multi-Head Multi-Scale Local Feature Extraction (MHMS) module. Experimental results demonstrate that the LGAF achieves the best average performance on 4 validation sets (CFP-FP, CPLFW, AgeDB, and CALFW), and the performance on TinyFace and SCFace outperforms the state-of-the-art methods (SoTA).

[CV-62] Text-to-Image Synthesis: A Decade Survey

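[Quick Read]: This paper addresses Text-to-Image Synthesis (T2I), the problem of generating high-quality images from textual descriptions, and stresses the important role that foundation models play in generative AI. It reviews more than 440 related works, covering how generative adversarial networks (GANs), autoregressive models, and diffusion models have been used for image generation, with a focus on their generative capabilities and diversity when conditioned on text. It also covers frontier research on T2I performance, controllability, personalized generation, safety, and consistency in content and spatial relationships, and summarizes commonly used datasets and evaluation metrics. Finally, it discusses the potential applications of T2I within Artificial Intelligence Generated Content (AIGC), along with challenges and future research directions.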

Link: https://arxiv.org/abs/2411.16164
Authors: Nonghai Zhang,Hao Tang
Keywords: Artificial Intelligence Generated, Intelligence Generated Content, Artificial Intelligence, humans read, read a specific
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: In this survey, we review over 440 recent works on T2I

Abstract:When humans read a specific text, they often visualize the corresponding images, and we hope that computers can do the same. Text-to-image synthesis (T2I), which focuses on generating high-quality images from textual descriptions, has become a significant aspect of Artificial Intelligence Generated Content (AIGC) and a transformative direction in artificial intelligence research. Foundation models play a crucial role in T2I. In this survey, we review over 440 recent works on T2I. We start by briefly introducing how GANs, autoregressive models, and diffusion models have been used for image generation. Building on this foundation, we discuss the development of these models for T2I, focusing on their generative capabilities and diversity when conditioned on text. We also explore cutting-edge research on various aspects of T2I, including performance, controllability, personalized generation, safety concerns, and consistency in content and spatial relationships. Furthermore, we summarize the datasets and evaluation metrics commonly used in T2I research. Finally, we discuss the potential applications of T2I within AIGC, along with the challenges and future research opportunities in this field.

[CV-63] Sparse patches adversarial attacks via extrapolating point-wise information NEURIPS24

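[Quick Read]: This paper addresses the inability of existing sparse and patch adversarial attacks to simultaneously optimize the locations and perturbations of multiple patches. The key to the solution is a new method that generates sparse-patch adversarial attacks by point-wise trimming of dense adversarial perturbations. The method simultaneously optimizes the locations and perturbations of sparse patches for any given number and shape, and it also significantly improves on the state of the art in standard sparse adversarial attacks.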

Link: https://arxiv.org/abs/2411.16162
Authors: Yaniv Nemcovsky,Avi Mendelson,Chaim Baskin
Keywords: adversarial attacks, patch adversarial attacks, adversarial, autonomous systems, patch adversarial
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: AdvML-Frontiers 24: The 3rd Workshop on New Frontiers in Adversarial Machine Learning, NeurIPS 24

Abstract:Sparse and patch adversarial attacks were previously shown to be applicable in realistic settings and are considered a security risk to autonomous systems. Sparse adversarial perturbations constitute a setting in which the adversarial perturbations are limited to affecting a relatively small number of points in the input. Patch adversarial attacks denote the setting where the sparse attacks are limited to a given structure, i.e., sparse patches with a given shape and number. However, previous patch adversarial attacks do not simultaneously optimize multiple patches’ locations and perturbations. This work suggests a novel approach for sparse patches adversarial attacks via point-wise trimming of dense adversarial perturbations. Our approach enables simultaneous optimization of multiple sparse patches’ locations and perturbations for any given number and shape. Moreover, our approach is also applicable for standard sparse adversarial attacks, where we show that it significantly improves the state-of-the-art over multiple extensive settings. A reference implementation of the proposed method and the reported experiments is provided at this https URL.
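
To make "point-wise trimming of a dense perturbation" concrete, the sketch below keeps the `k` entries of a dense 2-D perturbation with the largest absolute value and zeroes the rest. The magnitude-based criterion, the function name, and the toy values are assumptions for illustration; the abstract does not specify the paper's trimming criterion or how patch shapes are enforced.

```python
def trim_to_sparse(perturbation, k):
    """Keep the k entries with the largest absolute value; zero out the rest.

    ASSUMPTION: magnitude-based top-k selection, chosen only for illustration.
    """
    ranked = sorted(
        ((abs(v), r, c) for r, row in enumerate(perturbation) for c, v in enumerate(row)),
        reverse=True,
    )
    keep = {(r, c) for _, r, c in ranked[:k]}
    return [
        [v if (r, c) in keep else 0.0 for c, v in enumerate(row)]
        for r, row in enumerate(perturbation)
    ]


dense = [[0.1, -0.9, 0.2],
         [0.05, 0.3, -0.7],
         [0.0, 0.4, 0.1]]
sparse = trim_to_sparse(dense, k=2)
```

Starting from a dense perturbation means the locations of the surviving points are chosen by the trimming step itself rather than fixed in advance, which is the property the abstract emphasizes.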

[CV-64] MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

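[Quick Read]: This paper addresses generalization and 3D consistency in Novel View Synthesis (NVS) tasks. The key to the solution is MVGenMaster, a multi-view diffusion model that incorporates 3D priors, warped using metric depth and camera poses, to significantly improve generalization and 3D consistency. The model can generate up to 100 novel views in a single forward process, conditioned on arbitrary reference views and camera poses. The paper also builds a large-scale multi-view image dataset of up to 1.2 million scenes and presents training and model modifications for scaled-up datasets to further improve performance.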

Link: https://arxiv.org/abs/2411.16157
Authors: Chenjie Cao,Chaohui Yu,Shang Liu,Fan Wang,Xiangyang Xue,Yanwei Fu
Keywords: View Synthesis, diffusion model enhanced, address versatile, introduce MVGenMaster, NVS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Models and codes will be released at this https URL

Abstract:We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on arbitrary reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset comprising up to 1.2 million scenes, equipped with well-aligned metric depth. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released at this https URL.

[CV-65] VideoOrion: Tokenizing Object Dynamics in Videos

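[Quick Read]: This paper addresses the problem, in Video Large Language Models (Video-LLMs), of efficiently compressing high-dimensional video data while extracting its key semantic information. The key to the solution is VideoOrion, a Video-LLM that uses expert vision models in a detect-segment-track pipeline to extract object dynamics from videos and encodes them into a set of object tokens. This provides a more natural and efficient way to derive compact, disentangled semantic representations and enables explicit object modeling of video content at minimal computational cost. The object tokens also allow VideoOrion to handle video-based referring tasks naturally. Experiments show that VideoOrion makes good use of the object tokens and achieves competitive results on both general video question answering and video-based referring benchmarks.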

Link: https://arxiv.org/abs/2411.16156
Authors: Yicheng Feng,Yijiang Li,Wanpeng Zhang,Sipeng Zheng,Zongqing Lu
Keywords: Large Language Model, Video Large Language, Large Language, Language Model, Video Large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos–the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.

[CV-66] Revisiting Marr in Face: The Building of 2D–2.5D–3D Representations in Deep Neural Networks

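[Quick Read]: This paper asks whether deep neural networks (DNNs) performing visual perception tasks follow David Marr's theory of vision, that is, a progressive construction from the 2D sketch through the 2.5D sketch to the 3D model. The key to the solution is a graphics probe, a purpose-built sub-network that reconstructs the original image from a DNN's intermediate layers. Its defining feature is a flexible architecture that supports image reconstruction in both 2D and 3D formats, as well as in transitional states between them. By injecting graphics probes into neural networks and analyzing their behavior during image reconstruction, the study finds that DNNs first encode 2D representations in low-level layers and finally construct 3D representations in high-level layers, while mid-level layers exhibit a hybrid state resembling a 2.5D representation: a geometric representation built within a narrow depth range, akin to the appearance of a low-relief sculpture. This finding provides empirical support for Marr's theory and reveals how DNNs evolve from 2D to 3D during visual perception.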

Link: https://arxiv.org/abs/2411.16148
Authors: Xiangyu Zhu,Chang Yu,Jiankuo Zhao,Zhaoxiang Zhang,Stan Z. Li,Zhen Lei
Keywords: David Marr seminal, visual system operates, human visual system, David Marr, Marr seminal theory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:David Marr’s seminal theory of vision proposes that the human visual system operates through a sequence of three stages, known as the 2D sketch, the 2.5D sketch, and the 3D model. In recent years, Deep Neural Networks (DNN) have been widely thought to have reached a level comparable to human vision. However, the mechanisms by which DNNs accomplish this and whether they adhere to Marr’s 2D–2.5D–3D construction theory remain unexplored. In this paper, we delve into the perception task to explore these questions and find evidence supporting Marr’s theory. We introduce a graphics probe, a sub-network crafted to reconstruct the original image from the network’s intermediate layers. The key to the graphics probe is its flexible architecture that supports image in both 2D and 3D formats, as well as in a transitional state between them. By injecting graphics probes into neural networks, and analyzing their behavior in reconstructing images, we find that DNNs initially encode images as 2D representations in low-level layers, and finally construct 3D representations in high-level layers. Intriguingly, in mid-level layers, DNNs exhibit a hybrid state, building a geometric representation that s sur normals within a narrow depth range, akin to the appearance of a low-relief sculpture. This stage resembles the 2.5D representations, providing a view of how DNNs evolve from 2D to 3D in the perception process. The graphics probe therefore serves as a tool for peering into the mechanisms of DNN, providing empirical support for Marr’s theory.

[CV-67] TreeFormer: Single-view Plant Skeleton Estimation via Tree-constrained Graph Generation WACV2025

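[Quick Read]: This paper addresses the accurate estimation of plant skeletal structure (e.g., branching structure) from images, which is essential for smart agriculture and plant science. Unlike human skeletons, which have a fixed topology, plant skeleton estimation requires inferring arbitrary tree graphs from images. Although recent graph generation methods successfully infer thin structures from images, strictly constraining the output graph to a tree remains challenging. The paper therefore proposes TreeFormer, which estimates plant skeletons via tree-constrained graph generation. The key is to combine learning-based graph generation with traditional graph algorithms so that the constraint is imposed during training. Specifically, the method projects the unconstrained graph onto a minimum spanning tree (MST) during training and incorporates this prior knowledge into gradient descent optimization by suppressing unwanted feature values. Experiments show that the method accurately estimates target plant skeletal structures in multiple domains, including synthetic tree patterns, real botanical roots, and grapevine branches.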

Link: https://arxiv.org/abs/2411.16132
Authors: Xinpeng Liu,Hiroaki Santo,Yosuke Toda,Fumio Okura
Keywords: Accurate estimation, essential for smart, smart agriculture, Accurate, graph
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)

Abstract:Accurate estimation of plant skeletal structure (e.g., branching structure) from images is essential for smart agriculture and plant science. Unlike human skeletons with fixed topology, plant skeleton estimation presents a unique challenge, i.e., estimating arbitrary tree graphs from images. While recent graph generation methods successfully infer thin structures from images, it is challenging to constrain the output graph strictly to a tree structure. To address this problem, we present TreeFormer, a plant skeleton estimator via tree-constrained graph generation. Our approach combines learning-based graph generation with traditional graph algorithms to impose the constraints during the training loop. Specifically, our method projects an unconstrained graph onto a minimum spanning tree (MST) during the training loop and incorporates this prior knowledge into the gradient descent optimization by suppressing unwanted feature values. Experiments show that our method accurately estimates target plant skeletal structures for multiple domains: synthetic tree patterns, real botanical roots, and grapevine branches. Our implementations are available at this https URL.
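
The projection step relies on a classical algorithm. The sketch below shows Kruskal's algorithm selecting a spanning tree from a set of candidate edges. It illustrates the MST projection in isolation and is not the paper's training loop: the edge costs (which could, for instance, be derived from predicted edge confidences) and all names are assumptions.

```python
def minimum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm. `edges` is a list of (cost, u, v); returns the kept edges."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for cost, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # adding this edge creates no cycle
            parent[root_u] = root_v
            tree.append((cost, u, v))
    return tree


# Hypothetical predicted graph with a cycle (0-1-2) and a redundant edge (1-3).
predicted = [(0.1, 0, 1), (0.2, 1, 2), (0.15, 0, 2), (0.3, 2, 3), (0.9, 1, 3)]
tree = minimum_spanning_tree(4, predicted)
```

Edges that the projection discards, here (1, 2) and (1, 3), are the kind of "unwanted" predictions whose feature values the abstract says are suppressed during optimization.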

[CV-68] Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion

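[Quick Read]: This paper addresses the underestimation of geometry in distant regions, caused by perspective and occlusion, in camera-based Semantic Scene Completion (SSC) for 3D perception. The key to the solution is ScanSSC, a model composed of a Scan Module and a Scan Loss, both designed to enhance the perception of distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, in which a near-to-far cascade mask lets distant voxels capture relationships with preceding voxels. The Scan Loss computes, along each axis, the cross-entropy between cumulative logits and the corresponding class distributions, propagating rich context signals from near viewpoints to distant voxels. Together these components give ScanSSC state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, with IoUs of 44.54 and 48.29 and mIoUs of 17.40 and 20.14, respectively.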

Link: https://arxiv.org/abs/2411.16129
Authors: Jongseong Bae,Junwoo Ha,Ha Young Kim
Keywords: Semantic Scene Completion, Camera-based Semantic Scene, Scene Completion, Camera-based Semantic, Semantic Scene
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Camera-based Semantic Scene Completion (SSC) is gaining attention in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of the geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and Scan Loss, both designed to enhance distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, where each axis employs a near-to-far cascade masking that enables distant voxels to capture relationships with preceding voxels. In addition, the Scan Loss computes the cross-entropy along each axis between cumulative logits and corresponding class distributions in a near-to-far direction, thereby propagating rich context-aware signals to distant voxels. Leveraging the synergy between these components, ScanSSC achieves state-of-the-art performance, with IoUs of 44.54 and 48.29, and mIoUs of 17.40 and 20.14 on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.
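
The two near-to-far ideas can be pictured along a single axis. The sketch below builds a mask in which position `i` may attend only to positions that are not farther than `i`, and accumulates values from the nearest position outward. Whether a voxel attends to itself and whether logits are summed or averaged are assumptions; the function names are hypothetical and this is not the paper's implementation.

```python
def near_to_far_mask(length):
    """mask[i][j] is True when query position i may attend to key position j.

    ASSUMPTION: index 0 is the nearest position, and a position may attend to itself.
    """
    return [[j <= i for j in range(length)] for i in range(length)]


def cumulative_along_axis(values):
    """Running sum from the nearest position to the farthest.

    ASSUMPTION: plain summation; the paper may normalise or weight the terms.
    """
    out, total = [], 0.0
    for v in values:
        total += v
        out.append(total)
    return out


mask = near_to_far_mask(3)
cumulative = cumulative_along_axis([1.0, 2.0, 0.5])
```

In both functions the farthest position receives information from every nearer one, which matches the stated goal of propagating near-viewpoint context to distant voxels.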

[CV-69] CIA: Controllable Image Augmentation Framework Based on Stable Diffusion

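[Quick Read]: This paper addresses the performance bottleneck that insufficient or low-quality dataset annotation creates for computer vision tasks such as object detection and segmentation. The key to the solution is CIA, a modular pipeline with three main steps: (1) generating synthetic images with Stable Diffusion to augment the dataset; (2) filtering out low-quality samples using defined quality metrics; and (3) ensuring that specific patterns appear in the generated images through accurate prompting and ControlNet. In experiments with YOLOv8n on the COCO and Flickr30k datasets, CIA-generated images significantly improved object detection performance, approaching the effect of doubling the number of real images. These findings suggest that the CIA framework can significantly enhance object detection systems and opens the way for future research on data-constrained scenarios.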

Link: https://arxiv.org/abs/2411.16128
Authors: Mohamed Benkedadra,Dany Rimez,Tiffanie Godelaine,Natarajan Chidambaram,Hamed Razavi Khosroshahi,Horacio Tellez,Matei Mancas,Benoit Macq,Sidi Ahmed Mahmoudi
Keywords: Computer vision tasks, Computer vision, accurately annotated datasets, availability of extensive, accurately annotated
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Computer vision tasks such as object detection and segmentation rely on the availability of extensive, accurately annotated datasets. In this work, we present CIA, a modular pipeline for (1) generating synthetic images for dataset augmentation using Stable Diffusion, (2) filtering out low-quality samples using defined quality metrics, and (3) forcing the existence of specific patterns in generated images using accurate prompting and ControlNet. To show how CIA can be used to search for an optimal augmentation pipeline of training data, we study human object detection in a data-constrained scenario, using YOLOv8n on the COCO and Flickr30k datasets. We have recorded significant improvement using CIA-generated images, approaching the performance obtained when doubling the amount of real images in the dataset. Our findings suggest that our modular framework can significantly enhance object detection systems, and make it possible for future research to be done on data-constrained scenarios. The framework is available at: this http URL.

[CV-70] Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain

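[Quick Read]: This paper addresses inaccurate prompt generation and the clustering of point prompts that arise when the Segment Anything Model (SAM) is used for one-shot learning in the medical domain, where visual prompt generation relies on pixel similarity. The key to the solution is Med-PerSAM, a new one-shot framework that combines visual prompt engineering with a lightweight warping-based prompt tuning model to generate and iteratively refine prompts automatically, improving the performance of the pretrained SAM on medical imaging datasets without additional training of SAM or human intervention.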

Link: https://arxiv.org/abs/2411.16123
Authors: Hangyul Yoon,Doohyuk Jang,Jungeun Kim,Eunho Yang
Keywords: proven highly effective, NLP tasks, effective in NLP, Leveraging pre-trained models, Leveraging pre-trained
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Leveraging pre-trained models with tailored prompts for in-context learning has proven highly effective in NLP tasks. Building on this success, recent studies have applied a similar approach to the Segment Anything Model (SAM) within a "one-shot" framework, where only a single reference image and its label are employed. However, these methods face limitations in the medical domain, primarily due to SAM’s essential requirement for visual prompts and the over-reliance on pixel similarity for generating them. This dependency may lead to (1) inaccurate prompt generation and (2) clustering of point prompts, resulting in suboptimal outcomes. To address these challenges, we introduce Med-PerSAM, a novel and straightforward one-shot framework designed for the medical domain. Med-PerSAM uses only visual prompt engineering and eliminates the need for additional training of the pretrained SAM or human intervention, owing to our novel automated prompt generation process. By integrating our lightweight warping-based prompt tuning model with SAM, we enable the extraction and iterative refinement of visual prompts, enhancing the performance of the pre-trained SAM. This advancement is particularly meaningful in the medical domain, where creating visual prompts poses notable challenges for individuals lacking medical expertise. Our model outperforms various foundational models and previous SAM-based approaches across diverse 2D medical imaging datasets.

[CV-71] FUN-AD: Fully Unsupervised Learning for Anomaly Detection with Noisy Training Data WACV2025

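[Quick Read]: This paper addresses noisy training data in practical industrial environments, caused by annotation errors or missing labels for new or refurbished products, particularly in the unsupervised anomaly detection setting. The key to the solution is a learning-based method for fully unsupervised anomaly detection that uses unlabeled and potentially contaminated training data. The method rests on two observations: 1) pairwise feature distances between normal samples are, on average, likely to be smaller than those between anomalous or heterogeneous samples; and 2) pairs of features that are mutually closest to each other are likely to be homogeneous pairs, provided that the normal data has smaller variance than the anomalous data. Based on the first observation, the paper proposes a pseudo-labeling strategy using an iteratively reconstructed memory bank (IRMB). Based on the second, it introduces a new loss function that promotes class homogeneity between mutually closest pairs, reducing the ill-posedness of the task. Experiments show that the method is effective across different scenarios and anomaly-to-normal ratios.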

Link: https://arxiv.org/abs/2411.16110
Authors: Jiin Im,Yongho Son,Je Hyeong Hong
Keywords: incur noisy training, noisy training data, training data due, practical industrial environments, one-class classification
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025. Supplementary material included after references. 17 pages, 7 figures, 14 tables

Abstract:While the mainstream research in anomaly detection has mainly followed the one-class classification, practical industrial environments often incur noisy training data due to annotation errors or lack of labels for new or refurbished products. To address these issues, we propose a novel learning-based approach for fully unsupervised anomaly detection with unlabeled and potentially contaminated training data. Our method is motivated by two observations, that i) the pairwise feature distances between the normal samples are on average likely to be smaller than those between the anomaly samples or heterogeneous samples and ii) pairs of features mutually closest to each other are likely to be homogeneous pairs, which hold if the normal data has smaller variance than the anomaly data. Building on the first observation that nearest-neighbor distances can distinguish between confident normal samples and anomalies, we propose a pseudo-labeling strategy using an iteratively reconstructed memory bank (IRMB). The second observation is utilized as a new loss function to promote class-homogeneity between mutually closest pairs thereby reducing the ill-posedness of the task. Experimental results on two public industrial anomaly benchmarks and semantic anomaly examples validate the effectiveness of FUN-AD across different scenarios and anomaly-to-normal ratios. Our code is available at this https URL.
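
The second observation concerns pairs of features that are each other's nearest neighbour. The sketch below finds such mutually closest pairs in a small feature set. It illustrates only the pairing step, not the paper's loss function; the function name, the Euclidean distance, and the toy features are assumptions.

```python
import math


def mutual_nearest_pairs(features):
    """Return index pairs (i, j), i < j, where i and j are each other's nearest neighbour."""
    nearest = []
    for i, feat in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        nearest.append(min(others, key=lambda j: math.dist(feat, features[j])))
    return sorted((i, j) for i, j in enumerate(nearest) if i < j and nearest[j] == i)


# Two tight clusters plus one outlier (hypothetical 2-D features).
feats = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
pairs = mutual_nearest_pairs(feats)
```

The outlier at index 4 has a nearest neighbour, but that neighbour's own nearest neighbour is a different point, so the outlier forms no mutual pair. This asymmetry is what makes mutual pairs a plausible signal for homogeneity.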

[CV-72] UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image

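[Quick Read]: This paper addresses pose estimation for unseen objects when only a single unposed RGB-D reference image is available. The key to the solution is a new method called UNOPose, which constructs an SE(3)-invariant reference frame to standardize object representation, overcoming the challenges of pose and size variations. To alleviate the small overlap between viewpoints, UNOPose also recalibrates the weight of each correspondence based on its predicted likelihood of lying within the overlapping region. In the one-reference setting the method significantly outperforms traditional and learning-based methods, and it remains competitive with CAD-model-based methods.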

Link: https://arxiv.org/abs/2411.16106
Authors: Xingyu Liu,Gu Wang,Ruida Zhang,Chenyangguang Zhang,Federico Tombari,Xiangyang Ji
Keywords: onboarding stage costly, multiple reference views, rely on CAD, CAD models, Unseen object pose
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures

Abstract:Unseen object pose estimation methods often rely on CAD models or multiple reference views, making the onboarding stage costly. To simplify reference acquisition, we aim to estimate the unseen object’s pose through a single unposed RGB-D reference image. While previous works leverage reference images as pose anchors to limit the range of relative pose, our scenario presents significant challenges since the relative transformation could vary across the entire SE(3) space. Moreover, factors like occlusion, sensor noise, and extreme geometry could result in low viewpoint overlap. To address these challenges, we present a novel approach and benchmark, termed UNOPose, for unseen one-reference-based object pose estimation. Building upon a coarse-to-fine paradigm, UNOPose constructs an SE(3)-invariant reference frame to standardize object representation despite pose and size variations. To alleviate small overlap across viewpoints, we recalibrate the weight of each correspondence based on its predicted likelihood of being within the overlapping region. Evaluated on our proposed benchmark based on the BOP Challenge, UNOPose demonstrates superior performance, significantly outperforming traditional and learning-based methods in the one-reference setting and remaining competitive with CAD-model-based methods. The code and dataset will be available.

[CV-73] ENCLIP: Ensembling and Clustering-Based Contrastive Language-Image Pretraining for Fashion Multimodal Search with Limited Data and Low-Quality Images

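[Quick Read]: This paper addresses the limited performance of Contrastive Language-Image Pretraining (CLIP) models in multimodal search in the fashion intelligence domain, where data is scarce and image quality is low. The key to the solution is ENCLIP, which trains and ensembles multiple instances of the CLIP model and uses clustering techniques to group similar images together, thereby enhancing CLIP in fashion intelligence applications. As reported, this approach copes effectively with limited data and low-quality images and significantly improves CLIP's multimodal search performance.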

Link: https://arxiv.org/abs/2411.16096
Authors: Prithviraj Purushottam Naik,Rohit Agarwal
Keywords: Multimodal search, explore fashion items, Multimodal Search targeted, providing a seamless, seamless and intuitive
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Multimodal search has revolutionized the fashion industry, providing a seamless and intuitive way for users to discover and explore fashion items. Based on their preferences, style, or specific attributes, users can search for products by combining text and image information. Text-to-image searches enable users to find visually similar items or describe products using natural language. This paper presents an innovative approach called ENCLIP, for enhancing the performance of the Contrastive Language-Image Pretraining (CLIP) model, specifically in Multimodal Search targeted towards the domain of fashion intelligence. This method focuses on addressing the challenges posed by limited data availability and low-quality images. This paper proposes an algorithm that involves training and ensembling multiple instances of the CLIP model, and leveraging clustering techniques to group similar images together. The experimental findings presented in this study provide evidence of the effectiveness of the methodology. This approach unlocks the potential of CLIP in the domain of fashion intelligence, where data scarcity and image quality issues are prevalent. Overall, the ENCLIP method represents a valuable contribution to the field of fashion intelligence and provides a practical solution for optimizing the CLIP model in scenarios with limited data and low-quality images.

[CV-74] Very Basics of Tensors with Graphical Notations: Unfolding Calculations and Decompositions

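[Quick read]: This lecture note aims to help readers who are confused by papers that use tensors but omit detailed definitions and explanations of tensors and their operations. The key to the solution is the tensor network diagram, a graphical notation that depicts complex multiplications between tensors intuitively, including the inner product, outer product, Hadamard product, Kronecker product, and Khatri-Rao product. With this notation, readers can grasp the essence of tensor products more clearly and better follow the use of matrix/tensor decompositions in signal processing and machine learning.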

Link: https://arxiv.org/abs/2411.16094
Authors: Tatsuya Yokota
Keywords: graphical notation, Tensor network diagram, network diagram, nodes and edges, graphically represents multiplications
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments:

Abstract:The tensor network diagram (graphical notation) is a useful tool that graphically represents multiplications between multiple tensors using nodes and edges. Using the graphical notation, complex multiplications between tensors can be described simply and intuitively, and it also helps to understand the essence of tensor products. In fact, most matrix/tensor products, including the inner product, outer product, Hadamard product, Kronecker product, and Khatri-Rao product, can be written in graphical notation. These matrix/tensor operations are essential building blocks for the use of matrix/tensor decompositions in signal processing and machine learning. The purpose of this lecture note is to learn the very basics of tensors and how to represent them in mathematical symbols and graphical notation. Many papers using tensors omit these detailed definitions and explanations, which can be difficult for the reader. I hope this note will be of help to such readers.
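
For readers who prefer indices to diagrams, the products named in the abstract can be written out for small matrices stored as nested lists. This is a minimal sketch using the common row-major block ordering for the Kronecker product and the column-wise definition of the Khatri-Rao product; these conventions are assumed here, and the lecture note's own notation may differ.

```python
def hadamard(A, B):
    """Element-wise product of two matrices of the same shape."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

def kronecker(A, B):
    """Kronecker product: block (i, j) of the result equals A[i][j] * B."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]

def khatri_rao(A, B):
    """Column-wise Kronecker product; A and B need the same column count."""
    return [
        [a * b for a, b in zip(row_a, row_b)]
        for row_a in A
        for row_b in B
    ]

def outer(u, v):
    """Outer product of two vectors, giving a len(u) x len(v) matrix."""
    return [[a * b for b in v] for a in u]

def inner(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
had = hadamard(A, B)    # 2 x 2
kron = kronecker(A, B)  # 4 x 4
kr = khatri_rao(A, B)   # 4 x 2
```

Each column of `kr` is the Kronecker product of the matching columns of `A` and `B`, which is why the Khatri-Rao result has the same number of columns as its inputs.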

[CV-75] AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

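[Quick read]: This paper addresses shortcomings in assessing the perception quality and text-to-image alignment quality of AI-generated images (AIGIs). Existing assessment methods rely too heavily on the initial prompts and use the same prompts to guide both perception and alignment quality evaluation, overlooking the distinctions between the two tasks. The key to the solution is TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, TSP-MGS first constructs task-specific prompts that describe degrees of perception and alignment quality separately, and introduces the initial prompt for detailed quality perception. It then computes the coarse-grained similarity between AIGIs and the task-specific prompts to support holistic quality awareness, and measures the fine-grained similarity between the image and the initial prompt to improve understanding of AIGI details. Finally, it integrates the multi-granularity similarities to predict quality.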

Link: https://arxiv.org/abs/2411.16087
Authors: Jili Xia, Lihuo He, Fei Gao, Kaifan Zhang, Leida Li, Xinbo Gao
Keywords: garnered widespread attention, quality, prompts, widespread attention, garnered widespread
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, AI-generated images (AIGIs) created by given prompts (initial prompts) have garnered widespread attention. Nevertheless, due to technical nonproficiency, they often suffer from poor perception quality and Text-to-Image misalignment. Therefore, assessing the perception quality and alignment quality of AIGIs is crucial to improving the generative model’s performance. Existing assessment methods overly rely on the initial prompts in the task prompt design and use the same prompts to guide both perceptual and alignment quality evaluation, overlooking the distinctions between the two tasks. To address this limitation, we propose a novel quality assessment method for AIGIs named TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, task-specific prompts are first constructed to describe perception and alignment quality degrees separately, and the initial prompt is introduced for detailed quality perception. Then, the coarse-grained similarity between AIGIs and task-specific prompts is calculated, which facilitates holistic quality awareness. In addition, to improve the understanding of AIGI details, the fine-grained similarity between the image and the initial prompt is measured. Finally, precise quality prediction is acquired by integrating the multi-granularity similarities. Experiments on the commonly used AGIQA-1K and AGIQA-3K benchmarks demonstrate the superiority of the proposed TSP-MGS.

[CV-76] Leverage Task Context for Object Affordance Ranking

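[Quick read]: This paper addresses how intelligent agents can select appropriate objects according to task context in complex environments. Current studies treat objects within the same affordance category as equivalent, ignoring that affordance priority varies with task context, which hinders accurate decision-making. The key to the solution is to leverage task context for object affordance ranking: the proposed Context-embed Group Ranking Framework uses a task relation mining module and a graph group update module to deeply integrate task context and transmit global relative relationships, revealing task-object relationships and clarifying the priority rank of detected objects. The authors also construct a large-scale task-oriented affordance ranking dataset (25 common tasks, over 50k images, and more than 661k objects) to validate the feasibility and superiority of the approach.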

Link: https://arxiv.org/abs/2411.16082
Authors: Haojie Huang, Hongchen Luo, Wei Zhai, Yang Cao, Zheng-Jun Zha
Keywords: Intelligent agents accomplish, task context, Intelligent agents, task, affordance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Intelligent agents accomplish different tasks by utilizing various objects based on their affordance, but how to select appropriate objects according to task context is not well-explored. Current studies treat objects within the affordance category as equivalent, ignoring that object affordances vary in priority with different task contexts, hindering accurate decision-making in complex environments. To enable agents to develop a deeper understanding of the objects required to perform tasks, we propose to leverage task context for object affordance ranking, i.e., given an image of a complex scene and the textual description of the affordance and task context, revealing task-object relationships and clarifying the priority rank of detected objects. To this end, we propose a novel Context-embed Group Ranking Framework with a task relation mining module and a graph group update module to deeply integrate task context and perform global relative relationship transmission. Due to the lack of such data, we construct the first large-scale task-oriented affordance ranking dataset with 25 common tasks, over 50k images and more than 661k objects. Experimental results demonstrate the feasibility of the task context based affordance learning paradigm and the superiority of our model over state-of-the-art models in the fields of saliency ranking and multimodal object detection. The source code and dataset will be made available to the public.

[CV-77] Boosting 3D Object Generation through PBR Materials SIGGRAPH

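[Quick read]: This paper addresses the challenges that existing generative 3D content creation methods face in producing high-quality, realistic 3D objects: they consider only textures rather than materials, and they suffer from severe misalignment between geometry and high-frequency texture details. The key to the solution is to improve generated objects from the perspective of Physics-Based Rendering (PBR) materials. Stable Diffusion models fine-tuned on synthetic data are used to extract 3D-consistent albedo and bump maps, while a semi-automatic process produces roughness and metalness maps, yielding natural relighting effects and substantially improved geometry.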

Link: https://arxiv.org/abs/2411.16080
Authors: Yitong Wang, Xudong Xu, Li Ma, Haoran Wang, Bo Dai
Keywords: increasing attention recently, gained increasing attention, film industry, content creation, attention recently
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Accepted to SIGGRAPH Asia 2024 Conference Papers

Abstract:Automatic 3D content creation has gained increasing attention recently, due to its potential in various applications such as video games, film industry, and AR/VR. Recent advancements in diffusion models and multimodal models have notably improved the quality and efficiency of 3D object generation given a single RGB image. However, 3D objects generated even by state-of-the-art methods are still unsatisfactory compared to human-created assets. Considering only textures instead of materials makes these methods encounter challenges in photo-realistic rendering, relighting, and flexible appearance editing. And they also suffer from severe misalignment between geometry and high-frequency texture details. In this work, we propose a novel approach to boost the quality of generated 3D objects from the perspective of Physics-Based Rendering (PBR) materials. By analyzing the components of PBR materials, we choose to consider albedo, roughness, metalness, and bump maps. For albedo and bump maps, we leverage Stable Diffusion fine-tuned on synthetic data to extract these values, with novel usages of these fine-tuned models to obtain 3D consistent albedo UV and bump UV for generated objects. In terms of roughness and metalness maps, we adopt a semi-automatic process to provide room for interactive adjustment, which we believe is more practical. Extensive experiments demonstrate that our model is generally beneficial for various state-of-the-art generation methods, significantly boosting the quality and realism of their generated 3D objects, with natural relighting effects and substantially improved geometry.

[CV-78] Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models

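[Quick read]: This paper addresses the problem that neural networks in image classification learn biases that hurt their generalization and performance. The key to the solution is DiffuBias, a novel text-to-image generation pipeline that improves classifier robustness by generating bias-conflict samples without requiring training during the generation phase. Using pretrained diffusion and image captioning models, DiffuBias uses the top-K losses from a biased classifier (f_B) to create more representative data samples, which debiases the classifier and improves its generalization. To the best of the authors' knowledge, DiffuBias is the first approach to use a stable diffusion model to generate bias-conflict samples in debiasing tasks.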

Link: https://arxiv.org/abs/2411.16079
Authors: Donggeun Ko, Dongjun Lee, Namjun Park, Wonkyeong Shim, Jaekwang Kim
Keywords: Neural networks struggle, Neural networks, Generative Adversarial Networks, misleads correlations, learned and misleads
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages + Appendix

Abstract:Neural networks struggle with image classification when they learn biases and misleading correlations, affecting their generalization and performance. Previous methods require attribute labels (e.g. background, color) or utilize Generative Adversarial Networks (GANs) to mitigate biases. We introduce DiffuBias, a novel pipeline for text-to-image generation that enhances classifier robustness by generating bias-conflict samples, without requiring training during the generation phase. Utilizing pretrained diffusion and image captioning models, DiffuBias generates images that challenge the biases of classifiers, using the top-K losses from a biased classifier (f_B) to create more representative data samples. This method not only debiases effectively but also boosts classifier generalization capabilities. To the best of our knowledge, DiffuBias is the first approach leveraging a stable diffusion model to generate bias-conflict samples in debiasing tasks. Our comprehensive experimental evaluations demonstrate that DiffuBias achieves state-of-the-art performance on benchmark datasets. We also conduct a comparative analysis of various generative models in terms of carbon emissions and energy consumption to highlight the significance of computational efficiency.
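
The top-K loss selection mentioned in the abstract can be sketched as follows: compute a per-sample loss under the biased classifier and keep the K samples it handles worst. Cross-entropy as the loss, the toy probabilities, and the `top_k_by_loss` helper are assumptions for illustration; the abstract does not describe how the selected samples are passed to the captioning and diffusion models.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability that the classifier assigns to the true label."""
    return -math.log(probs[label])

def top_k_by_loss(pred_probs, labels, k):
    """Return indices of the k samples with the highest loss, worst first."""
    losses = [cross_entropy(p, y) for p, y in zip(pred_probs, labels)]
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:k]

# Hypothetical softmax outputs of a biased classifier on four samples.
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.05, 0.95]]
labels = [0, 0, 1, 0]
hard_samples = top_k_by_loss(pred_probs, labels, k=2)
```

High-loss samples are the ones a biased classifier gets most wrong, so they are natural candidates for bias-conflict examples.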

[CV-79] Geometry Distributions

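[Quick read]: This paper addresses challenges that coordinate-based networks face when representing 3D data, such as handling thin structures and non-watertight geometries, which limit their flexibility and accuracy. The key to the solution is a new geometric data representation that models geometry as distributions and makes no assumptions about surface genus, connectivity, or boundary conditions. Concretely, the paper uses diffusion models with a novel network architecture to learn surface point distributions and capture fine-grained geometric details. The representation is evaluated qualitatively and quantitatively across various object types, demonstrating high geometric fidelity, and its potential is explored in applications such as textured mesh representation, neural surface compression, dynamic object modeling, and rendering.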

Link: https://arxiv.org/abs/2411.16076
Authors: Biao Zhang, Jing Ren, Peter Wonka
Keywords: recent work leveraging, work leveraging coordinate-based, leveraging coordinate-based networks, vector fields, widely adopted
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: For the project site, see this https URL

Abstract:Neural representations of 3D data have been widely adopted across various applications, particularly in recent work leveraging coordinate-based networks to model scalar or vector fields. However, these approaches face inherent challenges, such as handling thin structures and non-watertight geometries, which limit their flexibility and accuracy. In contrast, we propose a novel geometric data representation that models geometry as distributions, a powerful representation that makes no assumptions about surface genus, connectivity, or boundary conditions. Our approach uses diffusion models with a novel network architecture to learn surface point distributions, capturing fine-grained geometric details. We evaluate our representation qualitatively and quantitatively across various object types, demonstrating its effectiveness in achieving high geometric fidelity. Additionally, we explore applications using our representation, such as textured mesh representation, neural surface compression, dynamic object modeling, and rendering, highlighting its potential to advance 3D geometric learning.

[CV-80] Soft-TransFormers for Continual Learning

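[Quick read]: This paper addresses Catastrophic Forgetting (CF) in Continual Learning (CL), particularly in Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL) scenarios. The key to the solution is Soft-TransFormers (Soft-TF), a novel fully fine-tuned continual learning method. Soft-TF sequentially learns and selects an optimal soft-network or subnetwork for each task: during training it jointly optimizes the weights of sparse layers to obtain task-adaptive soft (real-valued) networks or subnetworks (binary masks) while keeping the pre-trained layer parameters frozen. At inference, the identified task-adaptive network masks the parameters of the pre-trained network, mapping to an optimal solution for each task, which minimizes catastrophic forgetting and preserves the knowledge of the pre-trained network. Experiments on Vision Transformer (ViT) and CLIP show that Soft-TF achieves state-of-the-art performance across various continual learning scenarios.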

Link: https://arxiv.org/abs/2411.16073
Authors: Haeyong Kang, Chang D. Yoo
Keywords: Lottery Ticket Hypothesis, Well-initialized Lottery Ticket, Inspired by Well-initialized, Ticket Hypothesis, Well-initialized Lottery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inspired by Well-initialized Lottery Ticket Hypothesis (WLTH), which provides suboptimal fine-tuning solutions, we propose a novel fully fine-tuned continual learning (CL) method referred to as Soft-TransFormers (Soft-TF). Soft-TF sequentially learns and selects an optimal soft-network or subnetwork for each task. During sequential training in CL, Soft-TF jointly optimizes the weights of sparse layers to obtain task-adaptive soft (real-valued) networks or subnetworks (binary masks), while keeping the well-pre-trained layer parameters frozen. In inference, the identified task-adaptive network of Soft-TF masks the parameters of the pre-trained network, mapping to an optimal solution for each task and minimizing Catastrophic Forgetting (CF) - the soft-masking preserves the knowledge of the pre-trained network. Extensive experiments on Vision Transformer (ViT) and CLIP demonstrate the effectiveness of Soft-TF, achieving state-of-the-art performance across various CL scenarios, including Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL), supported by convergence theory.

[CV-81] Language Driven Occupancy Prediction

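[Quick read]: This paper addresses inaccurate supervision in open-vocabulary occupancy (OVO) prediction. The key to the solution is a semantic transitive labeling pipeline that transfers text labels from images to LiDAR point clouds and ultimately to voxels, generating dense and fine-grained 3D language occupancy ground truth. This alleviates the coarse voxel-to-text correspondences obtained through image features, and the noisy and sparse correspondences from voxel-based model-view projections, used by previous approaches. In addition, the paper replaces the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, using the generated language ground truth to guide the learning of the 3D language volume. Across various architectures, the resulting models outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset.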

Link: https://arxiv.org/abs/2411.16072
Authors: Zhu Yu, Bowen Pang, Lizhe Liu, Runmin Zhang, Qihao Peng, Maochun Luo, Sheng Yang, Mingxia Chen, Si-Yuan Cao, Hui-Liang Shen
Keywords: effective and generalizable, generalizable framework, framework for open-vocabulary, OVO, semantic transitive labeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates or noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to dig into the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our semantic transitive labeling pipeline can produce more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotations. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset. Notably, even based on the simpler BEVDet model, with an input resolution of 256 × 704, Occ-BEVDet achieves an mIoU of 20.29, surpassing previous approaches that rely on temporal images, higher-resolution inputs, or larger backbone networks. The code for the proposed method is available at this https URL.
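
The image-to-point-to-voxel label transfer described in the abstract can be pictured with a small sketch: each LiDAR point inherits the label of the pixel it projects to, and each voxel takes the majority label of the points inside it. The precomputed point-to-pixel mapping, the majority vote, and the `label_voxels` helper are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict

def label_voxels(points, point_pixels, pixel_labels, voxel_size):
    """Transfer pixel labels to points, then to voxels by majority vote.

    points       : list of (x, y, z) LiDAR coordinates
    point_pixels : list of (row, col) pixel hits, or None if a point is not visible
    pixel_labels : dict mapping (row, col) to a text label
    voxel_size   : edge length of a cubic voxel
    """
    votes = defaultdict(Counter)
    for (x, y, z), pixel in zip(points, point_pixels):
        label = pixel_labels.get(pixel)
        if label is None:
            continue  # unlabeled or invisible points cast no vote
        voxel = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        votes[voxel][label] += 1
    return {voxel: counts.most_common(1)[0][0] for voxel, counts in votes.items()}

# Toy scene: three points share one voxel, a fourth sits in another.
points = [(0.2, 0.1, 0.0), (0.4, 0.3, 0.1), (0.3, 0.2, 0.2), (1.2, 0.1, 0.0)]
point_pixels = [(0, 0), (0, 1), (0, 2), (5, 5)]
pixel_labels = {(0, 0): "car", (0, 1): "car", (0, 2): "road", (5, 5): "tree"}
voxel_labels = label_voxels(points, point_pixels, pixel_labels, voxel_size=0.5)
```

Voxels that receive no labeled points are simply absent from the result, which mirrors why dense labeling needs enough point coverage.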

[CV-82] Multi-Granularity Class Prototype Topology Distillation for Class-Incremental Source-Free Unsupervised Domain Adaptation

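[Quick read]: This paper addresses the Class-Incremental Source-Free Unsupervised Domain Adaptation (CI-SFUDA) problem, in which unlabeled target data arrive incrementally and labeled source instances are not accessible. The key to the solution is the Multi-Granularity Class Prototype Topology Distillation (GROTO) algorithm, which transfers source knowledge to the unlabeled class-incremental target domain. Two modules address the problem's two challenges: (1) interference from similar source-class knowledge with target-class representation learning, and (2) interference from new target knowledge with old target knowledge. First, positive classes are mined by modeling two accumulation distributions, and multi-granularity class prototypes are introduced to generate reliable pseudo-labels that promote self-organization of positive-class target features. Second, the positive-class prototypes are used to construct the topological structures of the source and target feature spaces, and topology distillation continually mitigates the interference of new target knowledge with old knowledge. Experiments show state-of-the-art performance on three public datasets.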

Link: https://arxiv.org/abs/2411.16064
Authors: Peihua Deng, Jiehua Zhang, Xichun Sheng, Chenggang Yan, Yaoqi Sun, Ying Fu, Liang Li
Keywords: Unsupervised Domain Adaptation, Source-Free Unsupervised Domain, Class-Incremental Source-Free Unsupervised, Source-Free Unsupervised, labeled source instances
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:This paper explores the Class-Incremental Source-Free Unsupervised Domain Adaptation (CI-SFUDA) problem, where the unlabeled target data come incrementally without access to labeled source instances. This problem poses two challenges, the disturbances of similar source-class knowledge to target-class representation learning and the new target knowledge to old ones. To address them, we propose the Multi-Granularity Class Prototype Topology Distillation (GROTO) algorithm, which effectively transfers the source knowledge to the unlabeled class-incremental target domain. Concretely, we design the multi-granularity class prototype self-organization module and prototype topology distillation module. Firstly, the positive classes are mined by modeling two accumulation distributions. Then, we generate reliable pseudo-labels by introducing multi-granularity class prototypes, and use them to promote the positive-class target feature self-organization. Secondly, the positive-class prototypes are leveraged to construct the topological structures of source and target feature spaces. Then, we perform the topology distillation to continually mitigate the interferences of new target knowledge to old ones. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performances on three public datasets.

[CV-83] Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training

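[Quick read]: This paper addresses the gap between Spiking Neural Networks (SNNs) and traditional Artificial Neural Networks (ANNs) in performance and training cost. The key to the solution is a Spike Firing Approximation (SFA) method that uses integer training and spike-driven inference. SFA optimizes the spike firing pattern of spiking neurons, which makes training more efficient, reduces power consumption, improves performance, eases scaling, and makes better use of neuromorphic chips. The paper also develops an efficient spike-driven Transformer architecture and a spike-masked autoencoder to prevent performance degradation as SNNs scale. On ImageNet-1k, the models reach top-1 accuracy of 78.5% to 86.2% with 10M to 173M parameters, and for the 10M model training time acceleration and inference energy efficiency improve by 4.5× and 3.9×, respectively.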

Link: https://arxiv.org/abs/2411.16061
Authors: Man Yao, Xuerui Qiu, Tianxiang Hu, Jiakui Hu, Yuhong Chou, Keyu Tian, Jianxing Liao, Luziwei Leng, Bo Xu, Guoqi Li
Keywords: Artificial Neural Networks, traditional Artificial Neural, Spiking Neural Networks, Neural Networks, brain-inspired Spiking Neural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The ambition of brain-inspired Spiking Neural Networks (SNNs) is to become a low-power alternative to traditional Artificial Neural Networks (ANNs). This work addresses two major challenges in realizing this vision: the performance gap between SNNs and ANNs, and the high training costs of SNNs. We identify intrinsic flaws in spiking neurons caused by binary firing mechanisms and propose a Spike Firing Approximation (SFA) method using integer training and spike-driven inference. This optimizes the spike firing pattern of spiking neurons, enhancing efficient training, reducing power consumption, improving performance, enabling easier scaling, and better utilizing neuromorphic chips. We also develop an efficient spike-driven Transformer architecture and a spike-masked autoencoder to prevent performance degradation during SNN scaling. On ImageNet-1k, we achieve state-of-the-art top-1 accuracy of 78.5%, 79.8%, 84.0%, and 86.2% with models containing 10M, 19M, 83M, and 173M parameters, respectively. For instance, the 10M model outperforms the best existing SNN by 7.2% on ImageNet, with training time acceleration and inference energy efficiency improved by 4.5× and 3.9×, respectively. We validate the effectiveness and efficiency of the proposed method across various tasks, including object detection, semantic segmentation, and neuromorphic vision tasks. This work enables SNNs to match ANN performance while maintaining the low-power advantage, marking a significant step towards SNNs as a general visual backbone. Code is available at this https URL.

[CV-84] UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

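[Quick read]: This paper addresses navigation difficulties caused by visual occlusions or blind spots in Vision-and-Language Navigation in Continuous Environments (VLN-CE). The key to the solution is UnitedVLN, a novel, generalizable pre-training paradigm based on 3D Gaussian Splatting (3DGS) that lets agents better explore future environments by jointly rendering high-fidelity 360° visual images and semantic features. UnitedVLN employs two key schemes, search-then-query sampling and separate-then-united rendering, which make efficient use of neural primitives and integrate appearance and semantic information for more robust navigation. Experiments show that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.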

Link: https://arxiv.org/abs/2411.16053
Authors: Guangzhao Dai, Jian Zhao, Yuantao Chen, Yusen Qin, Hao Zhao, Guosen Xie, Yazhou Yao, Xiangbo Shu, Xuelong Li
Keywords: target destination, significant advancements, instructions to reach, reach a target, recently seen significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a target destination, has recently seen significant advancements. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) presents greater challenges, as the agent is free to navigate any unobstructed location and is more vulnerable to visual occlusions or blind spots. Recent approaches have attempted to address this by imagining future environments, either through predicted future visual images or semantic features, rather than relying solely on current observations. However, these RGB-based and feature-based methods lack intuitive appearance-level information or high-level semantic complexity crucial for effective navigation. To overcome these limitations, we introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN, which enables agents to better explore future environments by unitedly rendering high-fidelity 360° visual images and semantic features. UnitedVLN employs two key schemes: search-then-query sampling and separate-then-united rendering, which facilitate efficient exploitation of neural primitives, helping to integrate both appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.

[CV-85] ROADS: Robust Prompt-driven Multi-Class Anomaly Detection under Domain Shift WACV2025

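[Quick read]: This paper addresses inter-class interference and domain shift in Multi-class Unified Anomaly Detection (MUAD) methods. The key to the solution is ROADS, a novel robust prompt-driven MUAD framework. ROADS uses a hierarchical class-aware prompt integration mechanism that dynamically encodes class-specific information into the anomaly detector to mitigate inter-class interference, and it incorporates a domain adapter that learns domain-invariant representations to improve robustness against domain shifts. Experiments on the MVTec-AD and VISA datasets show that ROADS surpasses state-of-the-art methods in both anomaly detection and localization, with notable improvements in out-of-distribution settings.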

Link: https://arxiv.org/abs/2411.16049
Authors: Hossein Kashiani, Niloufar Alipour Talemi, Fatemeh Afghah
Keywords: Multi-class Unified Anomaly, Multi-class Unified, practical alternatives compared, Recent advancements, Unified Anomaly Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)

Abstract:Recent advancements in anomaly detection have shifted focus towards Multi-class Unified Anomaly Detection (MUAD), offering more scalable and practical alternatives compared to traditional one-class-one-model approaches. However, existing MUAD methods often suffer from inter-class interference and are highly susceptible to domain shifts, leading to substantial performance degradation in real-world applications. In this paper, we propose a novel robust prompt-driven MUAD framework, called ROADS, to address these challenges. ROADS employs a hierarchical class-aware prompt integration mechanism that dynamically encodes class-specific information into our anomaly detector to mitigate interference among anomaly classes. Additionally, ROADS incorporates a domain adapter to enhance robustness against domain shifts by learning domain-invariant representations. Extensive experiments on MVTec-AD and VISA datasets demonstrate that ROADS surpasses state-of-the-art methods in both anomaly detection and localization, with notable improvements in out-of-distribution settings.

[CV-86] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

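[Quick read]: This paper addresses the tendency of multimodal large language models (MLLMs) to overlook fine-grained objects in high-resolution images, which stems from the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image. The key to the solution is Zoom Eye, a tree search algorithm that treats an image as a tree: each child node represents a zoomed sub-patch of its parent node, and the root represents the whole image. Zoom Eye is model-agnostic and training-free, so any MLLM can simulate human zooming by searching along the image tree from the root toward leaf nodes, gathering pertinent information, and answering related queries accurately. Experiments show that Zoom Eye improves base MLLMs by a large margin (for example, LLaVA-v1.5-7B gains 34.57% on V* Bench and 17.88% on HR-Bench) and enables small 7B MLLMs to outperform strong large models such as GPT-4o.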

Link: https://arxiv.org/abs/2411.16044
Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin
Keywords: numerous visual elements, Zoom Eye, fine-grained detailed objects, dominant large objects, typically consists
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:An image, especially a high-resolution one, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects. When perceiving such images, multimodal large language models (MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, resulting in a focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, with each child node representing a zoomed sub-patch of its parent node and the root representing the overall image. Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLM to simulate human zooming actions by searching along the image tree from root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks and the results demonstrate that Zoom Eye not only consistently improves the performance of a series of base MLLMs by a large margin (e.g., LLaVA-v1.5-7B increases by 34.57% on V* Bench and 17.88% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at this https URL.
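
The image-as-tree idea can be sketched with a quadrant split and a greedy root-to-leaf descent. Everything below is illustrative: the four-way split, the greedy policy, the stopping rule, and the hand-written `score` function (a stand-in for whatever confidence signal an MLLM would supply) are assumptions, not the search procedure from the paper.

```python
def children(box):
    """Split (left, top, right, bottom) into four quadrant sub-patches."""
    left, top, right, bottom = box
    mid_x, mid_y = (left + right) // 2, (top + bottom) // 2
    return [
        (left, top, mid_x, mid_y), (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom), (mid_x, mid_y, right, bottom),
    ]

def zoom_search(root, score, threshold, min_side):
    """Greedy descent: zoom into the best-scoring child until the score
    reaches the threshold or the patch is no larger than min_side."""
    path = [root]
    box = root
    while score(box) < threshold and min(box[2] - box[0], box[3] - box[1]) > min_side:
        box = max(children(box), key=score)
        path.append(box)
    return path

def make_toy_score(target_xy, target_side):
    """Hypothetical confidence: share of the patch area covered by a square
    target, or 0 if the target point lies outside the patch."""
    tx, ty = target_xy
    def score(box):
        left, top, right, bottom = box
        if not (left <= tx < right and top <= ty < bottom):
            return 0.0
        area = (right - left) * (bottom - top)
        return min(1.0, target_side * target_side / area)
    return score

score = make_toy_score(target_xy=(900, 100), target_side=64)
path = zoom_search((0, 0, 1024, 1024), score, threshold=0.05, min_side=64)
```

In this toy run the search zooms twice, into the top-right quadrant each time, and stops once the small target fills enough of the patch.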

[CV-87] VisualLens: Personalization through Visual History

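[Quick read]: This paper addresses how to use a user's visual history effectively to extract signals about their interests and preferences for personalized recommendation. The key to the solution is VisualLens, a novel approach that extracts, filters, and refines image representations, drawing interest-relevant signals out of task-agnostic visual histories and using them for personalization. On two newly created benchmarks, the method improves over state-of-the-art recommendations by 5-10% on Hit@3 and over GPT-4o by 2-5%.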

Link: https://arxiv.org/abs/2411.16034
Authors: Wang Bill Zhu, Deqing Fu, Kai Sun, Yi Lu, Zhaojiang Lin, Seungwhan Moon, Kanika Narang, Mustafa Canim, Yue Liu, Anuj Kumar, Xin Luna Dong
Keywords: offers valuable insights, daily life, offers valuable, valuable insights, user visual history
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We hypothesize that a user’s visual history, with images reflecting their daily life, offers valuable insights into their interests and preferences, and can be leveraged for personalization. Among the many challenges to achieving this goal, the foremost is the diversity and noise in the visual history, which contains images not necessarily related to a recommendation task, not necessarily reflecting the user’s interest, or even not necessarily preference-relevant. Existing recommendation systems either rely on task-specific user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. We propose a novel approach, VisualLens, that extracts, filters, and refines image representations, and leverages these signals for personalization. We created two new benchmarks with task-agnostic visual histories, and show that our method improves over state-of-the-art recommendations by 5-10% on Hit@3, and improves over GPT-4o by 2-5%. Our approach paves the way for personalized recommendations in scenarios where traditional methods fail.
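
For reference, Hit@3 is commonly computed as the fraction of queries for which at least one relevant item appears among the top three recommendations. The sketch below follows that common definition; the paper's exact evaluation protocol may differ, and the item names are made up.

```python
def hit_at_k(ranked_lists, relevant_sets, k=3):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_lists)

# Three hypothetical users with ranked recommendations and ground truth.
ranked_lists = [["a", "b", "c", "d"], ["x", "y", "z", "w"], ["p", "q", "r", "s"]]
relevant_sets = [{"c"}, {"w"}, {"p", "s"}]
hit3 = hit_at_k(ranked_lists, relevant_sets, k=3)
```

Here the second user's only relevant item sits at rank four, so two of the three users count as hits.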

[CV-88] From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events

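[Quick read]: This paper addresses the challenge of automatically converting real-world driving videos into simulation scenarios for testing Automated Driving Systems (ADS). The key to the solution is a novel framework that uses prompt-engineered Video Language Models (VLMs) to transform dashcam footage into SCENIC scripts, which define the environment and driving behaviors in the CARLA simulator and thereby produce realistic simulation scenarios. Rather than aiming solely for one-to-one scenario reconstruction, the framework focuses on capturing the essential driving behaviors in the original video while leaving parameters such as weather or road conditions flexible to support search-based testing. It also introduces a similarity metric that compares key features of driving behavior between the real and simulated videos to refine the generated scenario iteratively. Preliminary results show that the real-to-sim conversion finishes in minutes, fully automated and without human intervention, while maintaining high fidelity to the original driving events.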

Link: https://arxiv.org/abs/2411.16027
Authors: Yan Miao, Georgios Fainekos, Bardh Hoxha, Hideki Okamoto, Danil Prokhorov, Sayan Mitra
Keywords: Automated Driving Systems, Testing Automated Driving, Testing Automated, Driving Systems, Automated Driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Testing Automated Driving Systems (ADS) in simulation with realistic driving scenarios is important for verifying their performance. However, converting real-world driving videos into simulation scenarios is a significant challenge due to the complexity of interpreting high-dimensional video data and the time-consuming nature of precise manual scenario reconstruction. In this work, we propose a novel framework that automates the conversion of real-world car crash videos into detailed simulation scenarios for ADS testing. Our approach leverages prompt-engineered Video Language Models (VLMs) to transform dashcam footage into SCENIC scripts, which define the environment and driving behaviors in the CARLA simulator, enabling the generation of realistic simulation scenarios. Importantly, rather than solely aiming for one-to-one scenario reconstruction, our framework focuses on capturing the essential driving behaviors from the original video while offering flexibility in parameters such as weather or road conditions to facilitate search-based testing. Additionally, we introduce a similarity metric that helps iteratively refine the generated scenario through feedback by comparing key features of driving behaviors between the real and simulated videos. Our preliminary results demonstrate substantial time efficiency, finishing the real-to-sim conversion in minutes with full automation and no human intervention, while maintaining high fidelity to the original driving events.

[CV-89] Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models WACV2025

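[Quick read]: This paper addresses the overfitting of pre-trained vision-language (VL) models to specific downstream data distributions during prompt learning, which limits their ability to generalize to new domains or unseen classes. The key to the solution is Style-Pro, a novel style-guided prompt learning framework. Style-Pro uses learnable style bases to synthesize diverse distribution shifts, guided by two specialized loss functions that ensure style diversity and content integrity. It then maps unseen styles into the known style representation space as a weighted combination of style bases, minimizing the discrepancy between unseen domains and the source domain. To keep the style-shifted prompted model consistent with the original frozen CLIP, Style-Pro adds consistency constraints that preserve alignment in the learned embeddings and minimize deviation during adaptation to downstream tasks. Experiments across 11 benchmark datasets show that Style-Pro consistently surpasses state-of-the-art methods.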

Link: https://arxiv.org/abs/2411.16018
Authors: Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah
Keywords: Pre-trained Vision-language, downstream tasks, shown significant generalization, significant generalization ability, minimal fine-tuning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)

Abstract:Pre-trained Vision-language (VL) models, such as CLIP, have shown significant generalization ability to downstream tasks, even with minimal fine-tuning. While prompt learning has emerged as an effective strategy to adapt pre-trained VL models for downstream tasks, current approaches frequently encounter severe overfitting to specific downstream data distributions. This overfitting constrains the original behavior of the VL models to generalize to new domains or unseen classes, posing a critical challenge in enhancing the adaptability and generalization of VL models. To address this limitation, we propose Style-Pro, a novel style-guided prompt learning framework that mitigates overfitting and preserves the zero-shot generalization capabilities of CLIP. Style-Pro employs learnable style bases to synthesize diverse distribution shifts, guided by two specialized loss functions that ensure style diversity and content integrity. Then, to minimize discrepancies between unseen domains and the source domain, Style-Pro maps the unseen styles into the known style representation space as a weighted combination of style bases. Moreover, to maintain consistency between the style-shifted prompted model and the original frozen CLIP, Style-Pro introduces consistency constraints to preserve alignment in the learned embeddings, minimizing deviation during adaptation to downstream tasks. Extensive experiments across 11 benchmark datasets demonstrate the effectiveness of Style-Pro, consistently surpassing state-of-the-art methods in various settings, including base-to-new generalization, cross-dataset transfer, and domain generalization.

[CV-90] DRIVE: Dual-Robustness via Information Variability and Entropic Consistency in Source-Free Unsupervised Domain Adaptation

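[Quick read]: This paper addresses Source-Free Unsupervised Domain Adaptation (SFUDA): adapting a pre-trained model when the source data is inaccessible and the target-domain data is unlabeled. The key to the solution is DRIVE (Dual-Robustness through Information Variability and Entropy), a novel SFUDA framework built on a dual-model architecture. Two models initialized with identical weights work in parallel to capture diverse target-domain characteristics. One model is perturbed via projected gradient descent (PGD) guided by mutual information, focusing on high-uncertainty regions. The paper also introduces an entropy-aware pseudo-labeling strategy that adjusts label weights according to prediction uncertainty, so that the model focuses on reliable data and avoids noisy regions. Adaptation proceeds in two stages: the first aligns the models with a mutual information consistency loss, and the second dynamically adjusts the perturbation level based on the first-stage loss, encouraging the model to explore a broader range of target-domain features while preserving existing performance, which improves generalization and robustness to interference.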

Link: https://arxiv.org/abs/2411.15976
Authors: Ruiqiang Xiao, Songning Lai, Yijun Yang, Jiemin Wu, Yutao Yue, Lei Zhu
Keywords: Adapting machine learning, machine learning models, autonomous driving, target domain, medical imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adapting machine learning models to new domains without labeled data, especially when source data is inaccessible, is a critical challenge in applications like medical imaging, autonomous driving, and remote sensing. This task, known as Source-Free Unsupervised Domain Adaptation (SFUDA), involves adapting a pre-trained model to a target domain using only unlabeled target data, which can lead to issues such as overfitting, underfitting, and poor generalization due to domain discrepancies and noise. Existing SFUDA methods often rely on single-model architectures, struggling with uncertainty and variability in the target domain. To address these challenges, we propose DRIVE (Dual-Robustness through Information Variability and Entropy), a novel SFUDA framework leveraging a dual-model architecture. The two models, initialized with identical weights, work in parallel to capture diverse target domain characteristics. One model is exposed to perturbations via projected gradient descent (PGD) guided by mutual information, focusing on high-uncertainty regions. We also introduce an entropy-aware pseudo-labeling strategy that adjusts label weights based on prediction uncertainty, ensuring the model focuses on reliable data while avoiding noisy regions. The adaptation process has two stages: the first aligns the models on stable features using a mutual information consistency loss, and the second dynamically adjusts the perturbation level based on the loss from the first stage, encouraging the model to explore a broader range of the target domain while preserving existing performance. This enhances generalization capabilities and robustness against interference. Evaluations on standard SFUDA benchmarks show that DRIVE consistently outperforms previous methods, delivering improved adaptation accuracy and stability across complex target domains.
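
The entropy-aware pseudo-labeling idea can be illustrated with a small sketch. The abstract does not give the weighting formula, so the form used here, `1 - H(p) / log(C)`, is an assumption chosen only because it maps confident predictions to weights near 1 and uniform predictions to 0; the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pseudo_label_weight(probs):
    """Weight in [0, 1]: 1 for a one-hot prediction, 0 for a uniform one.

    Assumed form: 1 - H(p) / log(C). The paper's exact weighting may differ.
    """
    num_classes = len(probs)
    if num_classes < 2:
        return 1.0
    return 1.0 - entropy(probs) / math.log(num_classes)


def weighted_pseudo_labels(batch_probs):
    """Return (argmax label, weight) for each prediction in the batch."""
    return [
        (max(range(len(p)), key=p.__getitem__), pseudo_label_weight(p))
        for p in batch_probs
    ]
```

A training loop could multiply each sample's loss by its weight so that uncertain predictions contribute less to the update.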

[CV-91] CNNs for Style Transfer of Digital to Film Photography

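[Quick read]: This paper addresses generating the Cinestill800T film look with deep learning. The key to the solution is using simple convolutional neural networks to model the film effect given a digital input, and testing experimentally how different loss functions, an input noise channel, and random patch scales during training affect the result. The findings show that combining mean squared error (MSE) and VGG losses gives the best colour results; some grain can be produced, but it is not of high quality, and no halation is produced. The paper also contributes a dataset of aligned image pairs taken with a film camera and a digital camera for further research.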

Link: https://arxiv.org/abs/2411.15967
Authors: Pierre Mackenzie, Mika Senghaas, Raphael Achddou
Keywords: stylistic effect generation, recent years, deep learning, learning in stylistic, stylistic effect
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning has seen increasing use in stylistic effect generation over recent years. In this work, we use simple convolutional neural networks to model Cinestill800T film given a digital input. We test the effect of different loss functions, the addition of an input noise channel and the use of random scales of patches during training. We find that a combination of MSE/VGG loss gives the best colour production and that some grain can be produced, but it is not of a high quality, and no halation is produced. We contribute our dataset of aligned paired images taken with a film and digital camera for further work.

[CV-92] Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors

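[Quick read]: This paper addresses pose-free reconstruction of 360° scenes from a limited number of uncalibrated 2D images. The key to the solution is an instruction-following RGBD diffusion model that inpaints missing details and removes artifacts in novel-view renders and depth maps. The paper also introduces a new confidence measure for Gaussian representations to better detect these artifacts. By progressively integrating the novel views in a process similar to Gaussian-SLAM, it obtains a multi-view-consistent Gaussian representation. Experiments show that the method surpasses existing pose-free reconstruction techniques in complex 360° scenes and performs comparably to state-of-the-art methods that use known poses.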

Link: https://arxiv.org/abs/2411.15966
Authors: Soumava Paul, Prakhar Kaushik, Alan Yuille
Keywords: number of uncalibrated, introduce a generative, generative approach, limited number, reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 6 figures, 3 tables

Abstract:In this work, we introduce a generative approach for pose-free reconstruction of 360° scenes from a limited number of uncalibrated 2D images. Pose-free scene reconstruction from incomplete, unposed observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of unbounded scenes with known camera poses using diffusion priors, these methods rely on explicit camera embeddings for extrapolating unobserved regions. This reliance limits their application in pose-free settings, where view-specific data is only implicitly available. To address this, we propose an instruction-following RGBD diffusion model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We also propose a novel confidence measure for Gaussian representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent Gaussian representation. Evaluations on the MipNeRF360 dataset demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed reconstruction methods in complex 360° scenes.

[CV-93] MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

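[Quick read]: This paper addresses the trade-off between efficiency and performance that lightweight models face when processing high-resolution images. The key to the solution is the MobileMamba framework, which uses a three-stage network design to significantly improve inference speed and introduces the Multi-Receptive Field Feature Interaction (MRFFI) module. This module comprises the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Convolution (MK-DeConv), and Eliminate Redundant Identity components, integrating multi-receptive-field information and enhancing high-frequency detail extraction. Dedicated training and testing strategies further improve performance and efficiency. Experiments show that MobileMamba outperforms existing state-of-the-art methods in both speed and accuracy.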

Link: https://arxiv.org/abs/2411.15941
Authors: Haoyang He, Jiangning Zhang, Yuxuan Cai, Hongxu Chen, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Yunsheng Wu, Lei Xie
Keywords: Previous research, primarily focused, Transformer-based designs, CNNs and Transformer-based, Previous
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract:Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to enhance inference speed significantly. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Convolution (MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods, and is up to 21× faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.

[CV-94] Segment to Recognize Robustly – Enhancing Recognition by Image Decomposition

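[Quick read]: This paper addresses over-reliance on background information in image recognition, which limits model robustness in real-world deployment. The key to the solution is a novel recognition approach called "Segment to Recognize Robustly" (S2R^2), which decouples the modelling of the foreground (FG) and background (BG) and combines the two during recognition, yielding simple, robust, and interpretable recognition. S2R^2 uses zero-shot segmentation to isolate the FG and BG before or during recognition. By combining FG, BG, and a standard full-image classifier, it achieves state-of-the-art results on in-domain data while remaining robust to background shifts.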

Link: https://arxiv.org/abs/2411.15933
Authors: Klara Janouskova, Cristian Gavrus, Jiri Matas
Keywords: real-world deployment settings, deep image recognition, standard deep image, limiting model robustness, deep image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In image recognition, both foreground (FG) and background (BG) play an important role; however, standard deep image recognition often leads to unintended over-reliance on the BG, limiting model robustness in real-world deployment settings. Current solutions mainly suppress the BG, sacrificing BG information for improved generalization. We propose “Segment to Recognize Robustly” (S2R^2), a novel recognition approach which decouples the FG and BG modelling and combines them in a simple, robust, and interpretable manner. S2R^2 leverages recent advances in zero-shot segmentation to isolate the FG and the BG before or during recognition. By combining FG and BG, potentially also with a standard full-image classifier, S2R^2 achieves state-of-the-art results on in-domain data while maintaining robustness to BG shifts. The results confirm that segmentation before recognition is now possible.
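
To make the idea of combining decoupled FG and BG predictions concrete, here is a minimal sketch. The abstract does not state how S2R^2 fuses the streams; a weighted average of class probabilities is assumed here purely for illustration, and the function names and default weights are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(fg_logits, bg_logits, full_logits=None, weights=(0.6, 0.2, 0.2)):
    """Weighted average of class probabilities from separate classifiers.

    Assumption: a fixed convex combination; the weights are illustrative only.
    The full-image stream is optional, and the weights are renormalized
    over the streams that are actually present.
    """
    streams = [softmax(fg_logits), softmax(bg_logits)]
    used = list(weights[:2])
    if full_logits is not None:
        streams.append(softmax(full_logits))
        used.append(weights[2])
    norm = sum(used)
    return [
        sum(w * s[c] for w, s in zip(used, streams)) / norm
        for c in range(len(fg_logits))
    ]
```

Because each stream's contribution is an explicit term in the sum, one can inspect how much the background alone pushed a prediction, which is one way a fusion of this kind stays interpretable.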

[CV-95] Improving Pre-Trained Self-Supervised Embeddings Through Effective Entropy Maximization

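[Quick read]: This paper addresses the inaccuracy of entropy estimates for embeddings in high-dimensional spaces in self-supervised learning (SSL). The key to the solution is an effective entropy maximization criterion (E2MC) defined in terms of easy-to-estimate, low-dimensional constraints. Continuing to train an already-trained SSL model with E2MC for a few epochs significantly improves downstream performance, whereas alternative criteria bring no notable improvement and in some cases even degrade performance.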

Link: https://arxiv.org/abs/2411.15931
Authors: Deep Chakraborty, Yann LeCun, Tim G. J. Rudner, Erik Learned-Miller
Keywords: supervised downstream tasks, lightly supervised downstream, self-supervised learning, lightly supervised, architectures and loss
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 19 pages including appendix, 5 figures

Abstract:A number of different architectures and loss functions have been applied to the problem of self-supervised learning (SSL), with the goal of developing embeddings that provide the best possible pre-training for as-yet-unknown, lightly supervised downstream tasks. One of these SSL criteria is to maximize the entropy of a set of embeddings in some compact space. But the goal of maximizing the embedding entropy often depends, whether explicitly or implicitly, upon high-dimensional entropy estimates, which typically perform poorly in more than a few dimensions. In this paper, we motivate an effective entropy maximization criterion (E2MC), defined in terms of easy-to-estimate, low-dimensional constraints. We demonstrate that using it to continue training an already-trained SSL model for only a handful of epochs leads to a consistent and, in some cases, significant improvement in downstream performance. We perform careful ablation studies to show that the improved performance is due to the proposed add-on criterion. We also show that continued pre-training with alternative criteria does not lead to notable improvements, and in some cases, even degrades performance.
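
As an illustration of why low-dimensional entropy quantities are easy to estimate, the sketch below computes a one-dimensional differential entropy per embedding coordinate with the classical m-spacing (Vasicek) estimator. The abstract does not say which estimator or which constraints E2MC uses, so the choice of estimator and the function names here are assumptions.

```python
import math


def m_spacing_entropy(samples, m=None):
    """Vasicek m-spacing estimate of differential entropy (nats) for 1-D data."""
    xs = sorted(samples)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples")
    if m is None:
        m = max(1, round(math.sqrt(n)))  # common rule of thumb
    total = 0.0
    for i in range(n):
        lo = xs[max(i - m, 0)]          # clamp indices at the boundaries
        hi = xs[min(i + m, n - 1)]
        spacing = max(hi - lo, 1e-12)   # guard against ties (log of zero)
        total += math.log(n * spacing / (2 * m))
    return total / n


def marginal_entropies(embeddings, m=None):
    """Entropy estimate for each coordinate of a list of embedding vectors."""
    dims = len(embeddings[0])
    return [m_spacing_entropy([e[d] for e in embeddings], m) for d in range(dims)]
```

Each estimate only needs a sort of scalar values, which is why marginal quantities of this kind remain reliable where a joint high-dimensional estimate would not.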

[CV-96] Making Images from Images: Interleaving Denoising and Transformation

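[Quick read]: This paper addresses generating new images by rearranging the regions of an image, in particular how to turn an existing image (e.g. the Mona Lisa) into an entirely new subject. The key to the solution is to learn the image content and the parameterized transformations simultaneously, solving the resulting constrained optimization problem by interleaving image diffusion steps with an energy minimization step. Unlike previous methods, increasing the number of regions does not make the problem harder; it actually improves the results. The method is demonstrated in both pixel and latent spaces, along with creative extensions that use infinite copies of the source image or multiple source images.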

Link: https://arxiv.org/abs/2411.15925
Authors: Shumeet Baluja, David Marwood, Ashwin Baluja
Keywords: Simply by rearranging, image, Simply, subject matter, regions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user-definable, ranging from regularly and irregularly shaped blocks to concentric rings or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g. the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it through interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.
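
The region-rearrangement transform itself is simple to state in code. The sketch below covers only the simplest case the abstract mentions, regular square blocks, and takes the permutation as given; how the paper learns that permutation jointly with the image content is not shown. All function names are hypothetical.

```python
def split_blocks(image, block):
    """Split a 2-D list (H x W) into a row-major list of block x block tiles."""
    h, w = len(image), len(image[0])
    if h % block or w % block:
        raise ValueError("image size must be a multiple of the block size")
    return [
        [row[x:x + block] for row in image[y:y + block]]
        for y in range(0, h, block)
        for x in range(0, w, block)
    ]


def assemble_blocks(blocks, blocks_per_row):
    """Inverse of split_blocks for a given number of tiles per row."""
    block = len(blocks[0])
    image = []
    for start in range(0, len(blocks), blocks_per_row):
        row_tiles = blocks[start:start + blocks_per_row]
        for r in range(block):
            image.append([v for tile in row_tiles for v in tile[r]])
    return image


def rearrange(image, block, permutation):
    """Build a new image whose tile i is source tile permutation[i]."""
    tiles = split_blocks(image, block)
    if sorted(permutation) != list(range(len(tiles))):
        raise ValueError("permutation must use every tile exactly once")
    return assemble_blocks([tiles[j] for j in permutation], len(image[0]) // block)
```

Since the transform is a pure permutation, the output contains exactly the same pixels as the source, only in different places.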

[CV-97] Deep Learning for automated multi-scale functional field boundaries extraction using multi-date Sentinel-2 and PlanetScope imagery: Case Study of Netherlands and Pakistan

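[Quick read]: This paper examines how effective multi-temporal satellite imagery is for functional field boundary delineation across different geographies and multi-scale farming systems. The key to the solution is a deep learning semantic segmentation architecture combined with multi-date imagery and Normalized Difference Vegetation Index (NDVI) stacks, which capture information from different points in the crop growing season. Through experiments in two distinct regions, the Netherlands and Pakistan, the study evaluates four deep learning models based on the UNET architecture and compares different combinations of multi-date images and NDVI stacks. The results show that multi-date NDVI stacks provide additional seasonal context and significantly improve the accuracy of field boundary delineation. The study also stresses the importance of multi-scale ground information from different geographical areas, and the critical role of fine spatial resolution for boundary extraction in regions with small fields. Through transfer learning and the combination of multi-source data, it demonstrates the potential for automatic field boundary delineation in heterogeneous agricultural environments.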

Link: https://arxiv.org/abs/2411.15923
Authors: Saba Zahid, Sajid Ghuffar, Obaid-ur-Rehman, Syed Roshaan Ali Shah
Keywords: multi-temporal satellite imagery, Pakistan, semantic segmentation architecture, learning semantic segmentation, Netherlands
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 09 pages, To be published

Abstract:This study explores the effectiveness of multi-temporal satellite imagery for better functional field boundary delineation using a deep learning semantic segmentation architecture on two distinct geographical and multi-scale farming systems of the Netherlands and Pakistan. Multi-date images from April, August and October 2022 were acquired from PlanetScope and Sentinel-2 for subregions of the Netherlands, and from November 2022, February and March 2023 for a selected area of Dunyapur in Pakistan. For the Netherlands, the Basic registration crop parcels (BRP) vector layer was used as labeled training data, while self-crafted field boundary vector data were utilized for Pakistan. Four deep learning models with UNET architecture were evaluated using different combinations of multi-date images and NDVI stacks in the Netherlands subregions. A comparative analysis of IoU scores assessed the effectiveness of the proposed multi-date NDVI stack approach. These findings were then applied for transfer learning, using pre-trained models from the Netherlands on the selected area in Pakistan. Additionally, separate models were trained using self-crafted field boundary data for Pakistan, and combined models were developed using data from both the Netherlands and Pakistan. Results indicate that multi-date NDVI stacks provide additional temporal context, reflecting crop growth over different times of the season. The study underscores the critical role of multi-scale ground information from diverse geographical areas in developing robust and universally applicable models for field boundary delineation. The results also highlight the importance of fine spatial resolution for extraction of field boundaries in regions with small-scale farming. The findings can be extended to multi-scale implementations for improved automatic field boundary delineation in heterogeneous agricultural environments.
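
A multi-date NDVI stack can be sketched in a few lines. NDVI is the standard index (NIR - Red) / (NIR + Red); the nested-list image layout, the function names, and returning 0.0 for pixels where both bands are zero are assumptions made to keep the example self-contained.

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); returns 0.0 where both bands are 0."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else 0.0


def ndvi_image(nir_band, red_band):
    """Per-pixel NDVI for two equally sized 2-D bands (nested lists)."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


def ndvi_stack(dated_bands):
    """Stack NDVI images over acquisition dates.

    dated_bands: list of (nir_band, red_band) pairs, one pair per date.
    Returns a (dates x H x W) nested list that could be fed to a
    segmentation network as a multi-channel input.
    """
    return [ndvi_image(nir, red) for nir, red in dated_bands]
```

Stacking one NDVI layer per date gives each pixel a short time series, which is the temporal context the abstract credits for the improved delineation.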

[CV-98] A Tunable Despeckling Neural Network Stabilized via Diffusion Equation

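[Quick read]: This paper addresses the removal of multiplicative Gamma noise in synthetic aperture radar (SAR) imaging, especially when real data deviates from theoretical models and neural networks become vulnerable to disturbances and adversarial attacks. The key to the solution is to exploit the dissipative nature of diffusion equations in a tunable, regularized neural network that combines a denoising unit and a regularization unit into a single network trained end to end. The denoising unit consists of a denoising network, while the regularization unit is based on the simplest linear diffusion equation; it enhances network stability and allows the time step to be adjusted after training to mitigate the adverse effects of adversarial attacks. The stability and convergence of the model are proven theoretically, and in experiments comparing it with several state-of-the-art denoising methods it performs better in both quantitative and visual evaluations.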

Link: https://arxiv.org/abs/2411.15921
Authors: Yi Ran, Zhichang Guo, Jia Li, Yao Li, Martin Burger, Boying Wu
Keywords: Multiplicative Gamma noise, Multiplicative Gamma, Gamma noise remove, synthetic aperture radar, critical research area
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Multiplicative Gamma noise removal is a critical research area in the application of synthetic aperture radar (SAR) imaging, where neural networks serve as a potent tool. However, real-world data often diverges from theoretical models, exhibiting various disturbances, which makes the neural network less effective. Adversarial attacks work by finding perturbations that significantly disrupt the functionality of neural networks, as the inherent instability of neural networks makes them highly susceptible. A network designed to withstand such extreme cases can more effectively mitigate general disturbances in real SAR data. In this work, the dissipative nature of diffusion equations is employed to underpin a novel approach for countering adversarial attacks and improving resistance to real noise disturbances. We propose a tunable, regularized neural network that unrolls a denoising unit and a regularization unit into a single network for end-to-end training. In the network, the denoising unit and the regularization unit are composed of the denoising network and the simplest linear diffusion equation respectively. The regularization unit enhances network stability, allowing post-training time step adjustments to effectively mitigate the adverse impacts of adversarial attacks. The stability and convergence of our model are theoretically proven, and in the experiments, we compare our model with several state-of-the-art denoising methods on simulated images, adversarial samples, and real SAR images, yielding superior results in both quantitative and visual evaluations.
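
The stabilizing effect of a linear diffusion equation with an adjustable time step can be seen in a minimal sketch. This shows only an explicit Euler discretization of the heat equation on a pixel grid; how the paper couples such a unit with its denoising network is not specified in the abstract, and the function names, the zero-flux boundary handling, and the discretization are assumptions.

```python
def diffusion_step(u, tau):
    """One explicit Euler step of u_t = Laplacian(u) on a 2-D grid.

    Borders are replicated (zero-flux boundary). With the 5-point stencil,
    tau <= 0.25 keeps the explicit scheme stable.
    """
    h, w = len(u), len(u[0])
    out = [row[:] for row in u]
    for y in range(h):
        for x in range(w):
            up = u[max(y - 1, 0)][x]
            down = u[min(y + 1, h - 1)][x]
            left = u[y][max(x - 1, 0)]
            right = u[y][min(x + 1, w - 1)]
            out[y][x] = u[y][x] + tau * (up + down + left + right - 4 * u[y][x])
    return out


def regularize(u, tau, steps):
    """Apply several diffusion steps; tau is the tunable time step."""
    for _ in range(steps):
        u = diffusion_step(u, tau)
    return u
```

Diffusion is dissipative: it preserves the total intensity while shrinking local extremes, so a larger `tau` or more steps damps isolated perturbations more strongly at the cost of more smoothing.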

[CV-99] Bimanual Grasp Synthesis for Dexterous Robot Hands

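[Quick read]: This paper addresses the synthesis of bimanual grasp poses for robot manipulation, specifically for dexterous hand manipulators. The key to the solution is the BimanGrasp algorithm, which generates grasp poses by optimizing an energy function that accounts for grasp stability and feasibility and verifies them with the Isaac Gym physics simulation engine. The paper also creates BimanGrasp-Dataset, the first large-scale synthesized dataset of bimanual dexterous hand grasp poses, containing over 150k verified grasps. Finally, it proposes a data-driven approach based on a diffusion model (BimanGrasp-DDPM), which significantly improves the success rate and computational speed of grasp synthesis.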

Link: https://arxiv.org/abs/2411.15903
Authors: Yanming Shao, Chenxi Xiao
Keywords: Humans naturally perform, Humans naturally, naturally perform bimanual, perform bimanual skills, naturally perform
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in RA-L 24’, 8 pages, 9 figures, 3 tables

Abstract:Humans naturally perform bimanual skills to handle large and heavy objects. To enhance robots’ object manipulation capabilities, generating effective bimanual grasp poses is essential. Nevertheless, bimanual grasp synthesis for dexterous hand manipulators remains underexplored. To bridge this gap, we propose the BimanGrasp algorithm for synthesizing bimanual grasps on 3D objects. The BimanGrasp algorithm generates grasp poses by optimizing an energy function that considers grasp stability and feasibility. Furthermore, the synthesized grasps are verified using the Isaac Gym physics simulation engine. These verified grasp poses form the BimanGrasp-Dataset, the first large-scale synthesized bimanual dexterous hand grasp pose dataset to our knowledge. The dataset comprises over 150k verified grasps on 900 objects, facilitating the synthesis of bimanual grasps through a data-driven approach. Last, we propose BimanGrasp-DDPM, a diffusion model trained on the BimanGrasp-Dataset. This model achieved a grasp synthesis success rate of 69.87% and significant acceleration in computational speed compared to the BimanGrasp algorithm.

[CV-100] Highly Efficient and Unsupervised Framework for Moving Object Detection in Satellite Videos

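[Quick read]: This paper addresses moving object detection in satellite videos (SVMOD), in particular the challenge posed by extremely small and dim targets. Current learning-based methods extract spatio-temporal information from dense multi-frame representations, but they require extensive manual annotation and involve redundant computation because of the imbalance between foreground and background regions. The key points of the proposed solution are: 1) a generic unsupervised framework in which pseudo labels generated by a traditional method evolve during training to improve detection performance; and 2) an efficient and effective sparse convolutional anchor-free detection network, obtained by sampling the dense multi-frame image form into a sparse spatio-temporal point cloud representation and skipping redundant computation on background regions. Together, these designs allow the method to reach state-of-the-art detection performance while remaining highly efficient.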

Link: https://arxiv.org/abs/2411.15895
Authors: C. Xiao, W. An, Y. Zhang, Z. Su, M. Li, W. Sheng, M. Pietikäinen, L. Liu
Keywords: small target characteristics, Moving object detection, challenging task due, Moving object, satellite videos
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures

Abstract:Moving object detection in satellite videos (SVMOD) is a challenging task due to the extremely dim and small target characteristics. Current learning-based methods extract spatio-temporal information from multi-frame dense representation with labor-intensive manual labels to tackle SVMOD, which incurs high annotation costs and involves tremendous computational redundancy due to the severe imbalance between foreground and background regions. In this paper, we propose a highly efficient unsupervised framework for SVMOD. Specifically, we propose a generic unsupervised framework in which pseudo labels generated by a traditional method can evolve with the training process to promote detection performance. Furthermore, we propose a highly efficient and effective sparse convolutional anchor-free detection network by sampling the dense multi-frame image form into a sparse spatio-temporal point cloud representation and skipping the redundant computation on background regions. Coupling these two designs, we can achieve both high efficiency (label and computation efficiency) and effectiveness. Extensive experiments demonstrate that our method can not only process 98.8 frames per second on 1024x1024 images but also achieve state-of-the-art performance. The relabeled dataset and code are available at this https URL.
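
The step from dense frames to a sparse spatio-temporal point cloud can be sketched as follows. The abstract does not describe the sampling rule; thresholding the absolute difference between consecutive frames is an assumption used only to show the data layout, and the function name is hypothetical.

```python
def frames_to_points(frames, threshold):
    """Convert dense frames (T x H x W nested lists) to sparse (t, y, x) points.

    A pixel is kept when its absolute difference from the previous frame
    exceeds `threshold` (an assumed sampling rule, for illustration only).
    Static background pixels produce no points and are never processed later.
    """
    points = []
    for t in range(1, len(frames)):
        prev, cur = frames[t - 1], frames[t]
        for y, (prev_row, cur_row) in enumerate(zip(prev, cur)):
            for x, (a, b) in enumerate(zip(prev_row, cur_row)):
                if abs(b - a) > threshold:
                    points.append((t, y, x))
    return points
```

For a scene where only a few pixels change per frame, the point list is far shorter than the dense T x H x W volume, which is where the computational saving on background regions comes from.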

[CV-101] Optimization-Driven Statistical Models of Anatomies using Radial Basis Function Shape Representation

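[Quick read]: This paper addresses the automatic quantification of shape variability in anatomical structures within particle-based shape modeling (PSM). The key to the solution is to adapt a recent deep learning method, which represents shapes with implicit radial basis functions, into a traditional optimization approach that uses an eigenshape loss and a correspondence loss for more precise control over model characteristics. The approach avoids a black-box model and gives particles more freedom to move over the surface, yielding more informative statistical models. Its effectiveness is shown through comparison with state-of-the-art methods on two real datasets, and the choice of losses is justified empirically.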

Link: https://arxiv.org/abs/2411.15882
Authors: Hong Xu, Shireen Y. Elhabian
Keywords: Particle-based shape modeling, quantify shape variability, automatically quantify shape, Particle-based shape, variability in populations
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Particle-based shape modeling (PSM) is a popular approach to automatically quantify shape variability in populations of anatomies. The PSM family of methods employs optimization to automatically populate a dense set of corresponding particles (as pseudo landmarks) on 3D surfaces to allow subsequent shape analysis. A recent deep learning approach leverages implicit radial basis function representations of shapes to better adapt to the underlying complex geometry of anatomies. Here, we propose an adaptation of this method using a traditional optimization approach that allows more precise control over the desired characteristics of models by leveraging both an eigenshape and a correspondence loss. Furthermore, the proposed approach avoids using a black-box model and allows more freedom for particles to navigate the underlying surfaces, yielding more informative statistical models. We demonstrate the efficacy of the proposed approach compared to state-of-the-art methods on two real datasets and justify our choice of losses empirically.

[CV-102] Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

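[Quick read]: This paper addresses the difficulty that pre-trained vision-language models such as CLIP have in capturing local details in open-vocabulary segmentation, a consequence of their image-level pre-training. The key to the solution is a training-free method, Self-Calibrated CLIP (SC-CLIP), which identifies and resolves the anomaly tokens that emerge during the forward pass, reducing their interference with normal tokens and thereby improving spatial awareness. It also exploits the semantic consistency of CLIP's intermediate features to improve feature discriminability and attention correlation, and uses multi-level feature fusion to enrich details. Without introducing new parameters or relying on additional backbones, SC-CLIP substantially improves the granularity and coherence of CLIP's feature representation, and experiments show that it achieves state-of-the-art performance on multiple semantic segmentation datasets.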

Link: https://arxiv.org/abs/2411.15869
Authors: Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang
Keywords: pre-trained vision-language models, Recent advancements, CLIP, advancements in pre-trained, pre-trained vision-language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in pre-trained vision-language models like CLIP have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to its image-level pre-training, CLIP struggles to capture local details, resulting in poor performance in segmentation tasks. Our analysis reveals that anomaly tokens emerge during the forward pass, drawing excessive attention from normal patch tokens, thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to produce finer-grained representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we first identify and resolve the anomaly tokens to mitigate their negative impact. Next, we enhance feature discriminability and attention correlation by leveraging the semantic consistency found in CLIP’s intermediate features. Furthermore, we employ multi-level feature fusion to enrich details. Collectively, these strategies enhance CLIP’s feature representation with greater granularity and coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across eight semantic segmentation datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at this https URL.

[CV-103] PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

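[Quick read]: This paper addresses the multilevel-coherence challenge and the implementation complexity of diffusion models in panoramic image generation. The key to the solution is the PanoLlama framework, which redefines panoramic image generation as a next-token prediction task. Built on the pre-trained LlamaGen architecture, it generates images autoregressively and handles size limitations with an expansion strategy. The method aligns with the image token structure in a crop-wise, training-free manner and produces high-quality panoramas with minimal seams and maximum scalability, overcoming problems that diffusion-based methods fail to address.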

Link: https://arxiv.org/abs/2411.15867
Authors: Teng Zhou, Xiaoyu Zhang, Yongchuan Tang
Keywords: Panoramic Image Generation, driven by growing, technical applications, Panoramic Image, Image Generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at this https URL.

[CV-104] Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching WACV2025

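[Quick read]: This paper addresses pose estimation for unseen objects using only a single RGB image. The key to the solution is to use a diffusion model to generate novel-view images and to determine the object pose by performing two-sided matching on these generated images. The method does not rely on instance-level object pose estimation or large amounts of training data, and needs neither 3D object models nor multiple views, which lets it generalize to unseen objects. Experiments show that it outperforms existing pose estimation techniques on both synthetic and real-world datasets and performs especially well under large viewpoint changes, demonstrating its robustness and broad applicability.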

Link: https://arxiv.org/abs/2411.15860
Authors: Yujing Sun, Caiyi Sun, Yuan Liu, Yuexin Ma, Siu Ming Yiu
Keywords: object pose estimation, generalizable object pose, object pose, pose estimation, RGB image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV 2025, not published yet

Abstract:In this paper, we present a novel generalizable object pose estimation method to determine the object pose using only one RGB image. Unlike traditional approaches that rely on instance-level object pose estimation and necessitate extensive training data, our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object. These characteristics are achieved by utilizing a diffusion model to generate novel-view images and conducting a two-sided matching on these generated images. Quantitative experiments demonstrate the superiority of our method over existing pose estimation techniques across both synthetic and real-world datasets. Remarkably, our approach maintains strong performance even in scenarios with significant viewpoint changes, highlighting its robustness and versatility in challenging conditions. The code will be released at this https URL.

[CV-105] SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

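Quick read: This paper addresses the lower accuracy of connectionist temporal classification (CTC)-based scene text recognition (STR) methods on complex and diverse text instances. The key is SVTRv2, which handles text irregularity with a multi-size resizing (MSR) strategy and a feature rearrangement module (FRM), and integrates linguistic context through a semantic guidance module (SGM), improving both accuracy and speed. SGM can be omitted at inference, so it adds no inference cost, and SVTRv2 keeps efficient inference while markedly improving recognition in challenging scenarios.

Link: https://arxiv.org/abs/2411.15858
Authors: Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang
Keywords: Connectionist temporal classification, CTC-aligned linear classifier, OCR applications, Connectionist temporal, employed in OCR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: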

Abstract:Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally have worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endows it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and would not increase the inference cost. We evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across the scenarios in terms of accuracy and speed. Code is available at this https URL.
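For readers unfamiliar with the CTC side of this abstract, the following is a minimal sketch of greedy CTC decoding, the step that turns per-frame classifier outputs into a label sequence. It is the textbook procedure, not code from SVTRv2; the function name `ctc_greedy_decode`, the blank index 0, and the example frame ids are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeated ids, then drop the blank symbol."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]


# Hypothetical per-frame argmax ids from a CTC-aligned linear classifier (0 = blank).
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0]
print(ctc_greedy_decode(frames))
```

The blank between the two runs of `3` is what lets CTC emit a repeated character; without it the two runs would collapse into one.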

[CV-106] ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

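Quick read: This paper addresses the weakness of vision-language models such as CLIP on dense prediction tasks, in particular the inability of the self-attention in the final block to capture spatial correspondence. The key is two modules: Residual Cross-correlation Self-attention (RCS) and Semantic Feedback Refinement (SFR). RCS uses the cross-correlation self-attention from intermediate layers to remold the attention in the final block, reorganizing spatial information and unlocking CLIP's localization potential for dense vision-language inference. SFR uses semantic segmentation maps to further adjust the attention scores, strengthening the focus on same-category regions and local consistency. Combining both, ResCLIP boosts existing approaches on dense vision-language inference and surpasses state-of-the-art training-free methods on multiple standard benchmarks.

Link: https://arxiv.org/abs/2411.15851
Authors: Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan
Keywords: shown remarkable success, open-vocabulary tasks, image-level tasks, Residual Cross-correlation Self-attention, shown remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: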

Abstract:While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention, (e.g., query-query and key-key attention). However, these methods overlook the cross-correlation attention (query-key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP’s non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at this https URL.

[CV-107] Unveiling the Superior Paradigm: A Comparative Study of Source-Free Domain Adaptation and Unsupervised Domain Adaptation

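Quick read: This paper compares the performance of Unsupervised Domain Adaptation (UDA) and Source-Free Domain Adaptation (SFDA) in practical applications. Using predictive coding theory and extensive experiments on multiple benchmark datasets, it shows that SFDA generally outperforms UDA in real-world scenarios, with advantages in time efficiency, storage requirements, targeted learning objectives, reduced risk of negative transfer, and robustness against overfitting. The paper also introduces a new data-model fusion scenario and a novel weight estimation method that integrates available source data into multi-SFDA (MSFDA) approaches to improve performance in that scenario.

Link: https://arxiv.org/abs/2411.15844
Authors: Fan Wang, Zhongyi Han, Xingbo Liu, Xin Gao, Yilong Yin
Keywords: Unsupervised Domain Adaptation, UDA versus SFDA, Unsupervised Domain, Source-Free Domain Adaptation, leverages pre-trained source
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review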

Abstract:In domain adaptation, there are two popular paradigms: Unsupervised Domain Adaptation (UDA), which aligns distributions using source data, and Source-Free Domain Adaptation (SFDA), which leverages pre-trained source models without accessing source data. Evaluating the superiority of UDA versus SFDA is an open and timely question with significant implications for deploying adaptive algorithms in practical applications. In this study, we demonstrate through predictive coding theory and extensive experiments on multiple benchmark datasets that SFDA generally outperforms UDA in real-world scenarios. Specifically, SFDA offers advantages in time efficiency, storage requirements, targeted learning objectives, reduced risk of negative transfer, and increased robustness against overfitting. Notably, SFDA is particularly effective in mitigating negative transfer when there are substantial distribution discrepancies between source and target domains. Additionally, we introduce a novel data-model fusion scenario, where data sharing among stakeholders varies (e.g., some provide raw data while others provide only models), and reveal that traditional UDA and SFDA methods do not fully exploit their potential in this context. To address this limitation and capitalize on the strengths of SFDA, we propose a novel weight estimation method that effectively integrates available source data into multi-SFDA (MSFDA) approaches, thereby enhancing model performance within this scenario. This work provides a thorough analysis of UDA versus SFDA and advances a practical approach to model adaptation across diverse real-world environments.

[CV-108] Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

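Quick read: This paper addresses two problems in tuning-free image editing with a flow transformer: existing diffusion inversion performs poorly on flow-based models, and invariance control mechanisms cannot reconcile diverse rigid and non-rigid editing tasks. The solution has two parts: 1) a two-stage inversion that first refines the velocity estimation and then compensates for the leftover error, staying close to the model prior and benefiting editing; 2) an invariance control mechanism that manipulates text features within the adaptive layer normalization, connecting changes in the text prompt to image semantics. This preserves non-target content while allowing both rigid and non-rigid edits, supporting editing types such as visual text, quantity, and facial expression.

Link: https://arxiv.org/abs/2411.15843
Authors: Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Charles Ling, Boyu Wang
Keywords: requires authentic inversion, Leveraging the large, editing requires authentic, large generative prior, invariance control
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: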

Abstract:Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model’s domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the inversion and invariance control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.

[CV-109] VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

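Quick read: This paper addresses hallucination in Large Vision-Language Models (LVLMs) during multimodal task reasoning, where responses appear plausible but do not accurately reflect the visual content. Its analysis identifies distortion in the visual encoding process as an important cause: visual information is progressively distorted as it propagates from earlier layers toward the output layer. The key is a new mitigation method, Visual Layer Fusion Contrastive Decoding (VaLiD), which uses uncertainty to guide the selection of visual hidden layers, correcting distortions in visual encoding and improving the reliability of generated text. Experiments show that VaLiD reduces hallucinations on multiple benchmarks and achieves state-of-the-art performance.

Link: https://arxiv.org/abs/2411.15839
Authors: Jiaqi Wang, Yifei Gao, Jitao Sang
Keywords: Large Vision-Language Models, Large Vision-Language, multimodal task reasoning, demonstrated outstanding performance, demonstrated outstanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages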

Abstract:Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in multimodal task reasoning. However, they often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods that mitigate hallucinations by adjusting the decoding strategy during inference stage, typically attributing hallucination to the language model itself. Our analysis, however, reveals that distortions in the visual encoding process significantly affect the model’s reasoning accuracy. Specifically, earlier visual layers may retain key features but gradually distort as the information propagates toward the output layer. Building on these findings, we propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD). This method utilizes uncertainty to guide the selection of visual hidden layers, correcting distortions in the visual encoding process and thereby improving the reliability of generated text. Experimental results show that VaLiD effectively reduces hallucinations across various benchmarks, achieving state-of-the-art performance compared to multiple baseline methods.
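The abstract does not spell out the decoding rule, so the sketch below shows only the generic contrastive decoding form used by several training-free hallucination-mitigation methods: next-token logits from one forward pass are pushed away from logits obtained in a second, contrasting pass. Which visual layers VaLiD fuses, how uncertainty selects them, and the exact formula are not given here; the names `primary`, `contrast`, and `alpha` are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(primary, contrast, alpha=1.0):
    """Generic contrastive decoding: (1 + alpha) * primary - alpha * contrast."""
    return [(1 + alpha) * p - alpha * c for p, c in zip(primary, contrast)]


# Hypothetical next-token logits over a 3-token vocabulary from two passes.
primary = [2.0, 1.0, 0.5]
contrast = [1.0, 1.0, 0.5]
print(softmax(contrastive_logits(primary, contrast)))
```

Tokens that both passes score equally keep their logit, while tokens favored only by the primary pass are amplified.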

[CV-110] Modality Alignment Meets Federated Broadcasting

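Quick read: This paper addresses degraded model convergence and increased computational cost caused by data heterogeneity in federated learning (FL). The key is a new FL framework that aligns cross-client learning through modality alignment: a text encoder resides on the server and image encoders run on local devices, following multi-modal learning paradigms such as CLIP and treating server-client communication as multi-modal broadcasting. Starting from a pre-trained model and updating select parameters with low-rank adaptation (LoRA) reduces overfitting risk while meeting computational and efficiency demands. Local models train independently and send updates to the server, which aggregates parameters with a query-based method, enabling cross-client knowledge sharing and better performance under extreme heterogeneity.

Link: https://arxiv.org/abs/2411.15837
Authors: Yuting Ma, Shengeng Tang, Xiaohua Xu, Lechao Cheng
Keywords: safeguard data privacy, distributed edge devices, Federated learning, centralizing local data, powerful approach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: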

Abstract:Federated learning (FL) has emerged as a powerful approach to safeguard data privacy by training models across distributed edge devices without centralizing local data. Despite advancements in homogeneous data scenarios, maintaining performance between the global and local clients in FL over heterogeneous data remains challenging due to data distribution variations that degrade model convergence and increase computational costs. This paper introduces a novel FL framework leveraging modality alignment, where a text encoder resides on the server, and image encoders operate on local devices. Inspired by multi-modal learning paradigms like CLIP, this design aligns cross-client learning by treating server-client communications akin to multi-modal broadcasting. We initialize with a pre-trained model to mitigate overfitting, updating select parameters through low-rank adaptation (LoRA) to meet computational demand and performance efficiency. Local models train independently and communicate updates to the server, which aggregates parameters via a query-based method, facilitating cross-client knowledge sharing and performance improvement under extreme heterogeneity. Extensive experiments on benchmark datasets demonstrate the efficacy in maintaining generalization and robustness, even in highly heterogeneous settings.
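As background for the LoRA step mentioned above, here is a minimal sketch of how a low-rank update is merged into a frozen weight matrix: W' = W + (alpha / r) * B A, with B of shape d x r and A of shape r x k. This is the standard LoRA formulation in plain Python, not this paper's implementation; the helper names and toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, b, a, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the number of rows of A."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 frozen weight with a rank-1 update (all values hypothetical).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]       # d x r
a = [[3.0, 4.0]]         # r x k
print(lora_merge(w, b, a, alpha=2.0))
```

Only `b` and `a` are trainable, which is d*r + r*k numbers instead of d*k; that small footprint is one plausible reason LoRA suits on-device training in a federated setting.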

[CV-111] FastTrackTr: Towards Fast Multi-Object Tracking with Transformers

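Quick read: This paper addresses the inference-speed bottleneck of Transformer-based multi-object tracking (MOT). The key is to revisit and improve the Joint Detection and Tracking (JDT) approach: an efficient mechanism for passing information between frames is built into the DETR framework, yielding a new JDT-type MOT framework called FastTrackTr. This mechanism reduces the number of queries needed during tracking and avoids introducing excessive network structure, keeping the model simple while speeding up inference. The method shows potential for real-time tracking and competitive accuracy on multiple datasets.

Link: https://arxiv.org/abs/2411.15811
Authors: Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Wenhui Zhao, Bo Liu
Keywords: Transformer-based multi-object tracking, Transformer-based multi-object, recent years, captured the attention, researchers in recent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: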

Abstract:Transformer-based multi-object tracking (MOT) methods have captured the attention of many researchers in recent years. However, these models often suffer from slow inference speeds due to their structure or other issues. To address this problem, we revisited the Joint Detection and Tracking (JDT) method by looking back at past approaches. By integrating the original JDT approach with some advanced theories, this paper employs an efficient method of information transfer between frames on the DETR, constructing a fast and novel JDT-type MOT framework: FastTrackTr. Thanks to the superiority of this information transfer method, our approach not only reduces the number of queries required during tracking but also avoids the excessive introduction of network structures, ensuring model simplicity. Experimental results indicate that our method has the potential to achieve real-time tracking and exhibits competitive tracking accuracy across multiple datasets.

[CV-112] LRSAA: Large-scale Remote Sensing Image Target Recognition and Automatic Annotation

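Quick read: This paper addresses object recognition and automatic annotation in large-area remote sensing images with a method called LRSAA. The key is to combine the YOLOv11 and MobileNetV3-SSD object detection algorithms through ensemble learning to improve model performance. It also uses Poisson disk sampling segmentation and the EIOU metric to optimize training and inference on segmented images, then integrates the results. The approach reduces computational resource demand and balances accuracy and speed.

Link: https://arxiv.org/abs/2411.15808
Authors: Yujuan Zhu, Wuzheng Dong
Keywords: images called LRSAA, large-area remote sensing, remote sensing images, sensing images called, called LRSAA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2411.07802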

Abstract:This paper presents a method for object recognition and automatic labeling in large-area remote sensing images called LRSAA. The method integrates YOLOv11 and MobileNetV3-SSD object detection algorithms through ensemble learning to enhance model performance. Furthermore, it employs Poisson disk sampling segmentation techniques and the EIOU metric to optimize the training and inference processes of segmented images, followed by the integration of results. This approach not only reduces the demand for computational resources but also achieves a good balance between accuracy and speed. The source code for this project has been made publicly available on this https URL.
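The "EIOU metric" in the abstract most likely refers to the Efficient IoU measure, which subtracts three penalties from plain IoU: normalized center distance, width difference, and height difference, each measured against the smallest box enclosing both inputs. The sketch below computes that score under this assumption; how LRSAA actually applies it to segmented tiles is not described in the abstract, and the function name `eiou` and the box format are illustrative.

```python
def eiou(box_a, box_b):
    """EIoU score for two boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Width and height of the smallest box enclosing both inputs.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)

    # Squared center distance and squared width/height differences.
    center_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    width_sq = ((ax2 - ax1) - (bx2 - bx1)) ** 2
    height_sq = ((ay2 - ay1) - (by2 - by1)) ** 2

    return iou - center_sq / (cw ** 2 + ch ** 2) - width_sq / cw ** 2 - height_sq / ch ** 2


print(eiou((0, 0, 2, 2), (1, 0, 3, 2)))
```

Identical boxes score exactly 1, and the score can go negative for boxes that are far apart or very differently shaped, which makes it stricter than plain IoU.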

[CV-113] PG-SLAM: Photo-realistic and Geometry-aware RGB-D SLAM in Dynamic Environments

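Quick read: This paper addresses simultaneous localization and mapping (SLAM) in dynamic environments, specifically how to achieve high-quality scene reconstruction and camera localization when dynamic objects are present. The key is an RGB-D SLAM method based on Gaussian splatting with three main modules: 1) mapping the dynamic foreground, including non-rigid humans and rigid items; 2) reconstructing the static background; 3) localizing the camera. Key techniques include modeling dynamic objects with geometric and appearance constraints, an optimization strategy between neighboring local maps that integrates appearance constraints, and using both static background and dynamic foreground to add observations for noise compensation. By associating 3D Gaussians with 2D optical flows and pixel patches, the method exploits geometric and appearance constraints and outperforms state-of-the-art approaches in camera localization and scene representation.

Link: https://arxiv.org/abs/2411.15800
Authors: Haoang Li, Xiangqi Meng, Xingxing Zuo, Zhe Liu, Hesheng Wang, Daniel Cremers
Keywords: achieved impressive performance, Simultaneous localization, achieved impressive, impressive performance, SLAM
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: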

Abstract:Simultaneous localization and mapping (SLAM) has achieved impressive performance in static environments. However, SLAM in dynamic environments remains an open question. Many methods directly filter out dynamic objects, resulting in incomplete scene reconstruction and limited accuracy of camera localization. The other works express dynamic objects by point clouds, sparse joints, or coarse meshes, which fails to provide a photo-realistic representation. To overcome the above limitations, we propose a photo-realistic and geometry-aware RGB-D SLAM method by extending Gaussian splatting. Our method is composed of three main modules to 1) map the dynamic foreground including non-rigid humans and rigid items, 2) reconstruct the static background, and 3) localize the camera. To map the foreground, we focus on modeling the deformations and/or motions. We consider the shape priors of humans and exploit geometric and appearance constraints of humans and items. For background mapping, we design an optimization strategy between neighboring local maps by integrating appearance constraint into geometric alignment. As to camera localization, we leverage both static background and dynamic foreground to increase the observations for noise compensation. We explore the geometric and appearance constraints by associating 3D Gaussians with 2D optical flows and pixel patches. Experiments on various real-world datasets demonstrate that our method outperforms state-of-the-art approaches in terms of camera localization and scene representation. Source codes will be publicly available upon paper acceptance.

[CV-114] Symmetric Perception and Ordinal Regression for Detecting Scoliosis Natural Image

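Quick read: This paper addresses wide-range scoliosis screening in adolescents. Traditional methods rely on radiographic examination, which requires medical instruments and certified experts and carries radiation risk. The key is to use natural images of the human back with a dual-path scoliosis detection network made of two main modules: a Symmetric Feature Matching Module (SFMM) and an Ordinal Regression Head (ORH). SFMM captures the symmetric relationship between the input image and its horizontally flipped copy, and ORH turns the ordinal regression problem into a series of binary classification sub-problems. Experiments show that the method outperforms existing methods and human performance, reaching 95.11% accuracy on the general severity level and 81.46% on the fine-grained severity level, which offers a promising and economical route to wide-range screening.

Link: https://arxiv.org/abs/2411.15799
Authors: Xiaojia Zhu, Rui Chen, Xiaoqi Guo, Zhiwen Shao, Yuhu Dai, Ming Zhang, Chuandong Lang
Keywords: diseases in adolescents, common diseases, Scoliosis, wide-range scoliosis screening, scoliosis screening
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by Applied Intelligence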

Abstract:Scoliosis is one of the most common diseases in adolescents. Traditional screening methods for the scoliosis usually use radiographic examination, which requires certified experts with medical instruments and brings the radiation risk. Considering such requirement and inconvenience, we propose to use natural images of the human back for wide-range scoliosis screening, which is a challenging problem. In this paper, we notice that the human back has a certain degree of symmetry, and asymmetrical human backs are usually caused by spinal lesions. Besides, scoliosis severity levels have ordinal relationships. Taking inspiration from this, we propose a dual-path scoliosis detection network with two main modules: symmetric feature matching module (SFMM) and ordinal regression head (ORH). Specifically, we first adopt a backbone to extract features from both the input image and its horizontally flipped image. Then, we feed the two extracted features into the SFMM to capture symmetric relationships. Finally, we use the ORH to transform the ordinal regression problem into a series of binary classification sub-problems. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods as well as human performance, which provides a promising and economic solution to wide-range scoliosis screening. In particular, our method achieves accuracies of 95.11% and 81.46% in estimation of general severity level and fine-grained severity level of the scoliosis, respectively.
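The reduction of ordinal regression to binary sub-problems mentioned above is commonly done as follows: a severity level in {0, ..., K-1} becomes K-1 yes/no targets of the form "is the level greater than k?", and the predicted level is the number of sub-classifiers that answer yes. The sketch below illustrates this common encoding; whether the paper's ORH uses exactly this scheme is an assumption, and the function names and the 4-level example are hypothetical.

```python
def ordinal_targets(level, num_levels):
    """Encode a level in {0..K-1} as K-1 binary answers to 'is level > k?'."""
    return [1 if level > k else 0 for k in range(num_levels - 1)]


def decode_level(probs, threshold=0.5):
    """Predicted level = number of binary sub-classifiers that answer 'yes'."""
    return sum(1 for p in probs if p > threshold)


# Hypothetical 4-level severity scale: level 2 maps to three binary targets.
print(ordinal_targets(2, 4))
# Hypothetical sigmoid outputs of the three binary sub-classifiers.
print(decode_level([0.9, 0.7, 0.2]))
```

Unlike a plain K-way softmax, this encoding makes confusing level 0 with level 3 cost more than confusing level 0 with level 1, because more binary targets are wrong.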

[CV-115] Multi-Token Enhancing for Vision Representation Learning

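Quick read: This paper addresses the impracticality of traditional ensemble strategies for vision representation learning, especially self-supervised learning, since an ensemble of k models costs k times the training and inference computation. The proposed Multi-Token Enhancing (MTE) extracts multiple auxiliary tokens (auxiliary CLS tokens and adaptively pooled tokens) simultaneously from a single model to enhance representation learning, with minimal extra training cost and no extra inference cost. Because the auxiliary tokens differ, they capture complementary information. To avoid higher inference cost, the knowledge acquired by the auxiliary tokens is distilled into a global token during pre-training, so the auxiliary tokens can be discarded at inference. MTE is compatible with various self-supervised loss functions and architectures and consistently improves performance across downstream tasks.

Link: https://arxiv.org/abs/2411.15787
Authors: Zhong-Yu Li, Yu-Song Hu, Bo-Wen Yin, Ming-Ming Cheng
Keywords: vision applications, representation learning, learning, Vision representation learning, Vision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: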

Abstract:Vision representation learning, especially self-supervised learning, is pivotal for various vision applications. Ensemble learning has also succeeded in enhancing the performance and robustness of the vision models. However, traditional ensemble strategies are impractical for representation learning, especially self-supervised representation learning that requires large-scale datasets and long schedules. This is because they require k times more training and inference computation costs for an ensemble of k models. Differently, we introduce Multi-Token Enhancing (MTE) that extracts multiple auxiliary tokens simultaneously from a single model to enhance representation learning, while incurring minimal additional training costs and no additional inference costs. These auxiliary tokens, including auxiliary CLS tokens and adaptively pooled tokens, capture complementary information due to their differences. Meanwhile, to address the increase in inference costs, we distill the knowledge acquired by the auxiliary tokens into a global token during pre-training. Consequently, we can discard the auxiliary tokens during inference without incurring additional costs. Our MTE is compatible with various self-supervised loss functions and architectures, consistently improving performances across different downstream tasks. Our source code will be made publicly available.

[CV-116] ZeroGS: Training 3D Gaussian Splatting from Unposed Images

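Quick read: This paper addresses training 3D Gaussian Splatting (3DGS) from a large number of unordered, unposed images. The key is to use a pretrained foundation model as the neural scene representation, initialize and finetune it from a seed image, and progressively register new images. Camera poses and pointmaps are further refined by minimizing a point-to-camera ray consistency loss across multiple views, improving registration accuracy and rendering quality. Experiments show that the method recovers camera poses and renders images better than existing pose-free NeRF/3DGS methods.

Link: https://arxiv.org/abs/2411.15779
Authors: Yu Chen, Rolandos Alexandros Potamias, Evangelos Ververas, Jifei Song, Jiankang Deng, Gim Hee Lee
Keywords: Gaussian Splatting, Neural radiance fields, radiance fields, popular techniques, Gaussian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 12 figures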

Abstract:Neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS) are popular techniques to reconstruct and render photo-realistic images. However, the pre-requisite of running Structure-from-Motion (SfM) to get camera poses limits their completeness. While previous methods can reconstruct from a few unposed images, they are not applicable when images are unordered or densely captured. In this work, we propose ZeroGS to train 3DGS from hundreds of unposed and unordered images. Our method leverages a pretrained foundation model as the neural scene representation. Since the accuracy of the predicted pointmaps does not suffice for accurate image registration and high-fidelity image rendering, we propose to mitigate the issue by initializing and finetuning the pretrained model from a seed image. Images are then progressively registered and added to the training buffer, which is further used to train the model. We also propose to refine the camera poses and pointmaps by minimizing a point-to-camera ray consistency loss across multiple views. Experiments on the LLFF dataset, the MipNeRF360 dataset, and the Tanks-and-Temples dataset show that our method recovers more accurate camera poses than state-of-the-art pose-free NeRF/3DGS methods, and even renders higher quality images than 3DGS with COLMAP poses. Our project page is available at this https URL.

[CV-117] Context-Aware Detection of Mixed Critical Events using Video Classification

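Quick read: This paper addresses the challenge of detecting mixed-critical events with computer vision, where contextual understanding is needed to assess event criticality accurately. The key is a versatile detection system for smart city applications, tested in traffic and fire detection scenarios and adaptable to different application needs. The main contributions are an analysis of detection requirements and a system adaptable to diverse applications, advancing automated surveillance for smart cities.

Link: https://arxiv.org/abs/2411.15773
Authors: Filza Akhlaq, Alina Arshad, Muhammad Yehya Hayati, Jawwad A. Shamsi, Muhammad Burhan Khan
Keywords: Detecting mixed-critical events, Detecting mixed-critical, event criticality accurately, assess event criticality, criticality accurately
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: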

Abstract:Detecting mixed-critical events through computer vision is challenging due to the need for contextual understanding to assess event criticality accurately. Mixed critical events, such as fires of varying severity or traffic incidents, demand adaptable systems that can interpret context to trigger appropriate responses. This paper addresses these challenges by proposing a versatile detection system for smart city applications, offering a solution tested across traffic and fire detection scenarios. Our contributions include an analysis of detection requirements and the development of a system adaptable to diverse applications, advancing automated surveillance for smart cities.

[CV-118] Corner2Net: Detecting Objects as Cascade Corners ECAI2024

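Quick read: This paper addresses three main problems of the corner-based detection paradigm: 1) corners are hard to match, and heuristic corner matching can produce incorrect boxes, especially when similar-looking objects co-occur; 2) instance context is poor, since two separate corners preserve few instance semantics and it is hard to guarantee that both class-specific corners land on the same heatmap channel; 3) the backbone is unfriendly, since the hourglass network is costly to train. The key is a new corner-based framework, Corner2Net, whose cascade corner pipeline predicts the associated corner pair progressively in two steps instead of searching for two independent corners synchronously with parallel heads. Corner2Net decouples corner localization from object classification: both corners are class-agnostic, and the instance-specific bottom-right corner further narrows the search space. RoI features with rich semantics are extracted for classification, and popular backbones such as ResNeXt connect easily. On COCO, Corner2Net surpasses existing corner-based detectors by a large margin in accuracy and speed.

Link: https://arxiv.org/abs/2411.15772
Authors: Chenglong Liu, Jintao Liu, Haorao Wei, Jinze Yang, Liangyu Xu, Yuchen Guo, Lu Fang
Keywords: detection paradigm enjoys, produce high-quality boxes, corner-based detection paradigm, detection paradigm, paradigm enjoys
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by 27th EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2024)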

Abstract:The corner-based detection paradigm enjoys the potential to produce high-quality boxes. But the development is constrained by three factors: 1) Hard to match corners. Heuristic corner matching algorithms can lead to incorrect boxes, especially when similar-looking objects co-occur. 2) Poor instance context. Two separate corners preserve few instance semantics, so it is difficult to guarantee getting both two class-specific corners on the same heatmap channel. 3) Unfriendly backbone. The training cost of the hourglass network is high. Accordingly, we build a novel corner-based framework, named Corner2Net. To achieve the corner-matching-free manner, we devise the cascade corner pipeline which progressively predicts the associated corner pair in two steps instead of synchronously searching two independent corners via parallel heads. Corner2Net decouples corner localization and object classification. Both two corners are class-agnostic and the instance-specific bottom-right corner further simplifies its search space. Meanwhile, RoI features with rich semantics are extracted for classification. Popular backbones (e.g., ResNeXt) can be easily connected to Corner2Net. Experimental results on COCO show Corner2Net surpasses all existing corner-based detectors by a large margin in accuracy and speed.

[CV-119] Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering

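Quick read: This paper addresses the drop in Remote Sensing Visual Question Answering (RSVQA) performance under conditions that limit optical imaging, such as cloud cover and low light. The key is a Text-guided Coarse-to-Fine Fusion Network (TGFNet) that uses the semantic relationships between question text and multi-source images to guide complementary fusion at the feature level. A Text-guided Coarse-to-Fine Attention Refinement (CFAR) module moves attention progressively from broad areas to finer details through key region routing, strengthening the focus on question-relevant regions. An Adaptive Multi-Expert Fusion (AMEF) module dynamically integrates different experts for adaptive fusion of optical and SAR features. The paper also builds the first large-scale benchmark for optical-SAR RSVQA, with 6,008 well-aligned optical-SAR image pairs and 1,036,694 labeled question-answer pairs across 16 question types. Experiments show that TGFNet significantly improves performance in challenging scenarios.

Link: https://arxiv.org/abs/2411.15770
Authors: Zhicheng Zhao, Changfu Zhou, Yu Zhang, Chenglong Li, Xiaoliang Ma, Jin Tang
Keywords: significant research interest, gained significant research, Remote Sensing Visual, Sensing Visual Question, Remote Sensing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: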

Abstract:Remote Sensing Visual Question Answering (RSVQA) has gained significant research interest. However, current RSVQA methods are limited by the imaging mechanisms of optical sensors, particularly under challenging conditions such as cloud-covered and low-light scenarios. Given the all-time and all-weather imaging capabilities of Synthetic Aperture Radar (SAR), it is crucial to investigate the integration of optical-SAR images to improve RSVQA performance. In this work, we propose a Text-guided Coarse-to-Fine Fusion Network (TGFNet), which leverages the semantic relationships between question text and multi-source images to guide the network toward complementary fusion at the feature level. Specifically, we develop a Text-guided Coarse-to-Fine Attention Refinement (CFAR) module to focus on key areas related to the question in complex remote sensing images. This module progressively directs attention from broad areas to finer details through key region routing, enhancing the model’s ability to focus on relevant regions. Furthermore, we propose an Adaptive Multi-Expert Fusion (AMEF) module that dynamically integrates different experts, enabling the adaptive fusion of optical and SAR features. In addition, we create the first large-scale benchmark dataset for evaluating optical-SAR RSVQA methods, comprising 6,008 well-aligned optical-SAR image pairs and 1,036,694 well-labeled question-answer pairs across 16 diverse question types, including complex relational reasoning questions. Extensive experiments on the proposed dataset demonstrate that our TGFNet effectively integrates complementary information between optical and SAR images, significantly improving the model’s performance in challenging scenarios. The dataset is available at: this https URL. Index Terms: Remote Sensing Visual Question Answering, Multi-source Data Fusion, Multimodal, Remote Sensing, OPT-SAR. 

[CV-120] Integrating Deep Metric Learning with Coreset for Active Learning in 3D Segmentation NEURIPS2024

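Quick read: This paper addresses the large and costly annotation requirement of 3D medical image segmentation. The key is a new metric learning method for slice-based active learning (AL) in 3D medical segmentation. By merging contrastive learning with the inherent data groupings in medical imaging, the method learns a metric that emphasizes the sample differences relevant to training 3D medical segmentation models. It outperforms existing active learning techniques with both weak and full annotations and achieves superior performance under low annotation budgets, which matters in medical imaging.

Link: https://arxiv.org/abs/2411.15763
Authors: Arvind Murari Vepa, Zukang Yang, Andrew Choi, Jungseock Joo, Fabien Scalzo, Yizhou Sun
Keywords: demands extensive annotated, Deep learning, extensive annotated data, remarkable advancements, advancements in machine
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To be published in NeurIPS 2024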

点击查看摘要

Abstract:Deep learning has seen remarkable advancements in machine learning, yet it often demands extensive annotated data. Tasks like 3D semantic segmentation impose a substantial annotation burden, especially in domains like medicine, where expert annotations drive up the cost. Active learning (AL) holds great potential to alleviate this annotation burden in 3D medical segmentation. The majority of existing AL methods, however, are not tailored to the medical domain. While weakly-supervised methods have been explored to reduce annotation burden, the fusion of AL with weak supervision remains unexplored, despite its potential to significantly reduce annotation costs. Additionally, there is little focus on slice-based AL for 3D segmentation, which can also significantly reduce costs in comparison to conventional volume-based AL. This paper introduces a novel metric learning method for Coreset to perform slice-based active learning in 3D medical segmentation. By merging contrastive learning with inherent data groupings in medical imaging, we learn a metric that emphasizes the relevant differences in samples for training 3D medical segmentation models. We perform comprehensive evaluations using both weak and full annotations across four datasets (medical and non-medical). Our findings demonstrate that our approach surpasses existing active learning techniques on both weak and full annotations and obtains superior performance with low-annotation budgets which is crucial in medical imaging. Source code for this project is available in the supplementary materials and on GitHub: this https URL.
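
Coreset selection, which the method above builds on, is commonly implemented as a greedy k-center search: repeatedly label the sample farthest from everything already labeled. The following is a minimal sketch of that generic procedure, assuming plain Euclidean distance in place of the paper's learned metric (which is not reproduced here). The function name `k_center_greedy` and the toy 2D slice features are hypothetical.

```python
import math

def k_center_greedy(features, labeled, budget):
    """Pick `budget` new indices; each pick is the point farthest
    from its nearest already-selected point (labeled must be non-empty)."""
    selected = list(labeled)
    # Distance from every point to its nearest selected point.
    min_dist = [min(math.dist(f, features[s]) for s in selected) for f in features]
    picks = []
    for _ in range(budget):
        idx = max(range(len(features)), key=lambda i: min_dist[i])
        picks.append(idx)
        # The new pick may now be the nearest selected point for some samples.
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[idx]))
    return picks

# Hypothetical 2D embeddings of five slices; slice 0 is already annotated.
slices = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
picks = k_center_greedy(slices, labeled=[0], budget=2)
```

With a learned metric, only the distance call would change; the greedy loop stays the same.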

[CV-121] MambaTrack: Exploiting Dual-Enhancement for Night UAV Tracking

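[Quick Read]: This paper addresses the performance drop in night-time unmanned aerial vehicle (UAV) tracking caused by poor illumination. The key to the solution is an efficient Mamba-based tracker that uses dual enhancement to improve night UAV tracking. Specifically, a Mamba-based low-light enhancer, equipped with an illumination estimator and a damage restorer, enhances the image globally while preserving the details and structure of low-light images. A cross-modal Mamba network additionally enables efficient interactive learning between the vision and language modalities. Experiments show that the method achieves advanced performance and is significantly more efficient in computation and memory.

Link: https://arxiv.org/abs/2411.15761
Authors: Chunhui Zhang, Li Liu, Hao Wen, Xi Zhou, Yanfeng Wang
Keywords: unmanned aerial vehicle, Night unmanned aerial, night UAV tracking, boost night UAV, demonstrating suboptimal performance
Subjects: Computer Vision and Pattern Recognition (cs.CV)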

Abstract: Night unmanned aerial vehicle (UAV) tracking is impeded by the challenges of poor illumination, with previous daylight-optimized methods demonstrating suboptimal performance in low-light conditions, limiting the utility of UAV applications. To this end, we propose an efficient mamba-based tracker, leveraging dual enhancement techniques to boost night UAV tracking. The mamba-based low-light enhancer, equipped with an illumination estimator and a damage restorer, achieves global image enhancement while preserving the details and structure of low-light images. Additionally, we advance a cross-modal mamba network to achieve efficient interactive learning between vision and language modalities. Extensive experiments showcase that our method achieves advanced performance and exhibits significantly improved computation and memory efficiency. For instance, our method is 2.8× faster than CiteTracker and reduces GPU memory usage by 50.2%. Codes will be made publicly available.

[CV-122] Advanced Learning-Based Inter Prediction for Future Video Coding

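[Quick Read]: This paper addresses the complexity and efficiency of the traditional Inter Prediction Filter (INTERPF) in the fourth-generation Audio Video coding Standard (AVS4), which reduces discontinuities between the prediction and adjacent reconstructed pixels. The key to the solution is a low complexity learning-based inter prediction (LLIP) method that replaces the traditional INTERPF with a lightweight neural network. Specifically, LLIP builds its training dataset from the pixels and coordinates used by the traditional INTERPF, exports the weights and biases of the trained network, and implements inference without any third-party dependency, so it integrates into the video codec without relying on Libtorch and runs inference faster. The learned filtering parameters then replace the handcrafted ones in INTERPF, which makes combining deep learning coding tools with traditional video coding schemes more efficient. Under the random access (RA) configuration, the method achieves average coding gains of 0.01%, 0.31%, and 0.25% on the Y, U, and V components, respectively.

Link: https://arxiv.org/abs/2411.15759
Authors: Yanchen Zhao, Wenhong Duan, Chuanmin Jia, Shanshe Wang, Siwei Ma
Keywords: Inter Prediction Filter, fourth generation Audio, generation Audio Video, Inter Prediction, Prediction Filter
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)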

Abstract:In the fourth generation Audio Video coding Standard (AVS4), the Inter Prediction Filter (INTERPF) reduces discontinuities between prediction and adjacent reconstructed pixels in inter prediction. The paper proposes a low complexity learning-based inter prediction (LLIP) method to replace the traditional INTERPF. LLIP enhances the filtering process by leveraging a lightweight neural network model, where parameters can be exported for efficient inference. Specifically, we extract pixels and coordinates utilized by the traditional INTERPF to form the training dataset. Subsequently, we export the weights and biases of the trained neural network model and implement the inference process without any third-party dependency, enabling seamless integration into video codec without relying on Libtorch, thus achieving faster inference speed. Ultimately, we replace the traditional handcraft filtering parameters in INTERPF with the learned optimal filtering parameters. This practical solution makes the combination of deep learning encoding tools with traditional video encoding schemes more efficient. Experimental results show that our approach achieves 0.01%, 0.31%, and 0.25% coding gain for the Y, U, and V components under the random access (RA) configuration on average.
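
To illustrate what "exporting weights and biases and running inference without third-party dependencies" can look like, here is a minimal sketch of a tiny fully connected network evaluated with plain lists. The layer sizes, the ReLU activation, and every parameter value are hypothetical; the paper's actual network and its codec integration are not shown here.

```python
# Hypothetical exported parameters: 3 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -0.25, 0.0], [0.0, 0.5, 0.5]]
B1 = [0.0, -1.0]
W2 = [[1.0, 0.5]]
B2 = [0.25]

def dense(weights, bias, x):
    """One fully connected layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def predict(x):
    """Forward pass using only the exported constants above."""
    return dense(W2, B2, relu(dense(W1, B1, x)))[0]

# Hypothetical input, e.g. neighbouring pixel values and a coordinate.
out = predict([2.0, 4.0, 2.0])
```

Because the forward pass is just multiply-accumulate loops over constants, the same logic ports directly to the language of a codec without linking a deep learning runtime.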

[CV-123] PR-MIM: Delving Deeper into Partial Reconstruction in Masked Image Modeling

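[Quick Read]: This paper addresses the trade-off in masked image modeling whereby partial reconstruction lowers computational cost but degrades representation quality. The key to the solution is a progressive reconstruction strategy and a furthest sampling strategy that reconstruct the discarded masked tokens in an extremely lightweight way instead of abandoning them entirely. All masked tokens therefore take part in supervision during pre-training, while the cost savings of partial reconstruction are kept. When 50% of the patches are discarded, the method matches the performance of ViT-B/16 without loss while saving 28% of FLOPs and 36% of memory usage compared to standard MAE.

Link: https://arxiv.org/abs/2411.15746
Authors: Zhong-Yu Li, Yunheng Li, Deng-Ping Fan, Ming-Ming Cheng
Keywords: achieved great success, huge computational costs, Masked image modeling, modeling has achieved, achieved great
Subjects: Computer Vision and Pattern Recognition (cs.CV)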

Abstract: Masked image modeling has achieved great success in learning representations but is limited by the huge computational costs. One cost-saving strategy makes the decoder reconstruct only a subset of masked tokens and throw the others, and we refer to this method as partial reconstruction. However, it also degrades the representation quality. Previous methods mitigate this issue by throwing tokens with minimal information using temporal redundancy inaccessible for static images or attention maps that incur extra costs and complexity. To address these limitations, we propose a progressive reconstruction strategy and a furthest sampling strategy to reconstruct those thrown tokens in an extremely lightweight way instead of completely abandoning them. This approach involves all masked tokens in supervision to ensure adequate pre-training, while maintaining the cost-reduction benefits of partial reconstruction. We validate the effectiveness of the proposed method across various existing frameworks. For example, when throwing 50% patches, we can achieve lossless performance of the ViT-B/16 while saving 28% FLOPs and 36% memory usage compared to standard MAE. Our source code will be made publicly available.

[CV-124] PEnG: Pose-Enhanced Geo-Localisation

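[Quick Read]: This paper addresses the accuracy ceiling in cross-view geo-localisation: because densely sampled satellite patches overlap heavily, prior work samples patches sparsely, which bounds the achievable localisation accuracy. The key to the solution is to combine cross-view geo-localisation with relative pose estimation. The resulting system, PEnG, first predicts the most likely edges of a city-scale graph representation on which the query image lies, then performs relative pose estimation within those edges to determine a precise position. PEnG is the first technique to use both viewpoints available in cross-view geo-localisation datasets to raise precision to sub-metre level, with some examples reaching centimetre-level accuracy. Relative to previous work, Top-5m retrieval improves by 213%, and the median Euclidean distance error drops from 734 m to 22.77 m.

Link: https://arxiv.org/abs/2411.15742
Authors: Tavis Shore, Oscar Mendez, Simon Hadfield
Keywords: densely sampled satellite, patches overlap heavily, coarse granularity, typically performed, overlap heavily
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 6 figures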

Abstract:Cross-view Geo-localisation is typically performed at a coarse granularity, because densely sampled satellite image patches overlap heavily. This heavy overlap would make disambiguating patches very challenging. However, by opting for sparsely sampled patches, prior work has placed an artificial upper bound on the localisation accuracy that is possible. Even a perfect oracle system cannot achieve accuracy greater than the average separation of the tiles. To solve this limitation, we propose combining cross-view geo-localisation and relative pose estimation to increase precision to a level practical for real-world application. We develop PEnG, a 2-stage system which first predicts the most likely edges from a city-scale graph representation upon which a query image lies. It then performs relative pose estimation within these edges to determine a precise position. PEnG presents the first technique to utilise both viewpoints available within cross-view geo-localisation datasets to enhance precision to a sub-metre level, with some examples achieving centimetre level accuracy. Our proposed ensemble achieves state-of-the-art precision - with relative Top-5m retrieval improvements on previous works of 213%. Decreasing the median euclidean distance error by 96.90% from the previous best of 734m down to 22.77m, when evaluating with 90 degree horizontal FOV images. Code will be made available: this http URL
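
As a quick check of the reported figures, the 96.90% reduction follows directly from the two median errors quoted in the abstract. The variable names below are illustrative only.

```python
previous_error_m = 734.0   # previous best median error, from the abstract
peng_error_m = 22.77       # PEnG median error, from the abstract

# Relative reduction in percent: (old - new) / old * 100.
reduction = (previous_error_m - peng_error_m) / previous_error_m * 100
```

Rounded to two decimals, `reduction` agrees with the 96.90% stated in the abstract.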

[CV-125] Proceedings of the 6th International Workshop on Reading Music Systems

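[Quick Read]: This entry is the proceedings of the 6th International Workshop on Reading Music Systems (WoRMS), which aims to connect researchers who develop systems for reading music, such as Optical Music Recognition, with other researchers and practitioners who could benefit from such systems, such as librarians or musicologists. Topics of interest include, but are not limited to: music reading systems; optical music recognition; datasets and performance evaluation; image processing on music scores; writer identification; authoring, editing, storing and presentation systems for music scores; multi-modal systems; novel input methods for producing written music; web-based Music Information Retrieval services; applications and projects; and use cases related to written music. The workshop's emphasis is on cross-disciplinary collaboration that advances the development and application of music reading systems across fields.

Link: https://arxiv.org/abs/2411.15741
Authors: Jorge Calvo-Zaragoza, Alexander Pacha, Elona Shatri (Eds.)
Keywords: Optical Music Recognition, Reading Music Systems, Music reading systems, Web-based Music Information, Reading Music
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Proceedings edited by Jorge Calvo-Zaragoza, Alexander Pacha and Elona Shatri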

Abstract: The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 6th International Workshop on Reading Music Systems, held Online on November 22nd 2024.

[CV-126] LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration

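[Quick Read]: This paper addresses low-light image enhancement. The key to the solution is a novel network architecture, LTCF-Net, which uses two color spaces, LAB and YUV, to separate and process luminance and chrominance information efficiently, and incorporates a Transformer architecture to understand image content comprehensively while staying computationally efficient. A Fourier transform module additionally adjusts the luminance channel in the frequency domain, balancing brightness uniformly across regions and suppressing background noise to improve visual quality. With these components, LTCF-Net improves low-light image quality while keeping the model lightweight, and it outperforms current state-of-the-art methods across multiple evaluation metrics and datasets.

Link: https://arxiv.org/abs/2411.15740
Authors: Gaojing Zhang, Jinglun Feng
Keywords: network architecture designed, LAB and YUV, Unlike Retinex-based methods, network architecture, architecture designed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)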

Abstract:We introduce LTCF-Net, a novel network architecture designed for enhancing low-light images. Unlike Retinex-based methods, our approach utilizes two color spaces - LAB and YUV - to efficiently separate and process color information, by leveraging the separation of luminance from chromatic components in color images. In addition, our model incorporates the Transformer architecture to comprehensively understand image content while maintaining computational efficiency. To dynamically balance the brightness in output images, we also introduce a Fourier transform module that adjusts the luminance channel in the frequency domain. This mechanism could uniformly balance brightness across different regions while eliminating background noises, and thereby enhancing visual quality. By combining these innovative components, LTCF-Net effectively improves low-light image quality while keeping the model lightweight. Experimental results demonstrate that our method outperforms current state-of-the-art approaches across multiple evaluation metrics and datasets, achieving more natural color restoration and a balanced brightness distribution.
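
The idea of separating luminance from chrominance can be illustrated with a plain color-space conversion. The sketch below uses the classic BT.601 analog YUV coefficients and brightens a pixel by scaling only Y; which YUV variant the paper uses is an assumption, the helper names are hypothetical, and the Transformer and Fourier modules are not represented here.

```python
def rgb_to_yuv(r, g, b):
    """BT.601 analog YUV: Y is luminance, U and V carry chrominance."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

def yuv_to_rgb(y, u, v):
    """Inverse of rgb_to_yuv."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b

def brighten(rgb, gain):
    """Scale only the luminance channel; U and V are passed through unchanged."""
    y, u, v = rgb_to_yuv(*rgb)
    return yuv_to_rgb(min(1.0, y * gain), u, v)

# Hypothetical dark pixel with RGB values in [0, 1].
dark_pixel = (0.10, 0.05, 0.02)
lit_pixel = brighten(dark_pixel, gain=3.0)
```

Converting `lit_pixel` back to YUV returns the original U and V with a tripled Y, which is the property that makes a dedicated luminance branch possible.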

[CV-127] AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

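[Quick Read]: This paper addresses the poor performance of existing instruction-based image editing models on complex user instructions, which stems mainly from training on low-quality data with limited editing types. The proposed solution is AnyEdit, a comprehensive multi-modal instruction editing dataset of 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. The key is to ensure the diversity and quality of the dataset through initial data diversity, an adaptive editing process, and automated selection of editing results. On this dataset, the authors further train a novel AnyEdit Stable Diffusion model with task-aware routing and learnable task embeddings for unified image editing. Experiments show that AnyEdit consistently improves the performance of diffusion-based editing models, which opens prospects for instruction-driven image editing models that support human creativity.

Link: https://arxiv.org/abs/2411.15738
Authors: Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, Yueting Zhuang
Keywords: natural language instructions, Instruction-based image editing, specific image elements, modify specific image, Instruction-based image
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 41 pages, 24 figures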

Abstract:Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity.

[CV-128] Enhancing Few-Shot Out-of-Distribution Detection with Gradient Aligned Context Optimization

[Quick Read]: This paper addresses the gradient conflict in few-shot out-of-distribution (OOD) detection, that is, the conflict between in-distribution (ID) classification optimization and OOD regularization caused by biased recognition. The key to the solution is Gradient Aligned Context Optimization (GaCoOp), which decomposes the optimization gradient to identify when the conflict occurs, alleviates the conflict within ID samples, and optimizes the prompts via gradient projection, thereby mitigating the conflict and improving performance.

Link: https://arxiv.org/abs/2411.15736
Authors: Baoshun Tong, Kaiyu Song, Hanjiang Lai
Keywords: detect OOD images, detect OOD, OOD images, OOD, OOD regularization
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Few-shot out-of-distribution (OOD) detection aims to detect OOD images from unseen classes with only a few labeled in-distribution (ID) images. To detect OOD images and classify ID samples, prior methods have been proposed by regarding the background regions of ID samples as the OOD knowledge and performing OOD regularization and ID classification optimization. However, the gradient conflict still exists between ID classification optimization and OOD regularization caused by biased recognition. To address this issue, we present Gradient Aligned Context Optimization (GaCoOp) to mitigate this gradient conflict. Specifically, we decompose the optimization gradient to identify the scenario when the conflict occurs. Then we alleviate the conflict in inner ID samples and optimize the prompts via leveraging gradient projection. Extensive experiments over the large-scale ImageNet OOD detection benchmark demonstrate that our GaCoOp can effectively mitigate the conflict and achieve great performance. Code will be available at this https URL.
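
Gradient projection is a general technique for reconciling two conflicting objectives: when two gradients point in opposing directions (negative inner product), the component of one that opposes the other is removed. The sketch below shows this generic operation only; it is not the paper's exact decomposition, and the names `project_conflict`, `g_id`, and `g_ood` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_conflict(g_main, g_aux):
    """Return g_aux with the component that opposes g_main removed."""
    inner = dot(g_main, g_aux)
    if inner >= 0:
        # No conflict: leave the auxiliary gradient untouched.
        return list(g_aux)
    # Subtract the projection of g_aux onto g_main.
    scale = inner / dot(g_main, g_main)
    return [a - scale * m for a, m in zip(g_aux, g_main)]

# Hypothetical gradients of the ID classification loss and the OOD regularizer.
g_id = [1.0, 0.0]
g_ood = [-1.0, 1.0]
g_ood_aligned = project_conflict(g_id, g_ood)
```

After projection the two gradients are orthogonal, so a step along their sum no longer undoes progress on the main objective.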

[CV-129] Test-time Alignment-Enhanced Adapter for Vision-Language Models

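[Quick Read]: This paper addresses the distribution shift that pre-trained vision-language models (VLMs) face in the test phase. The key to the solution is a new Test-time Alignment-Enhanced Adapter (TAEA), which trains an adapter on test samples during the test phase to adjust text features and thereby improve text-to-image alignment prediction. The paper also adopts the negative cache from TDA as an enhancement module, which further improves TAEA. The method outperforms the state-of-the-art test-time adaptation method by an average of 0.75% on the out-of-distribution benchmark and 2.5% on the cross-domain benchmark, with an acceptable training time.

Link: https://arxiv.org/abs/2411.15735
Authors: Baoshun Tong, Kaiyu Song, Hanjiang Lai
Keywords: attracted increasing attention, pre-trained vision-language models, vision-language models, addressing distribution shift, distribution shift
Subjects: Computer Vision and Pattern Recognition (cs.CV)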

Abstract:Test-time adaptation with pre-trained vision-language models (VLMs) has attracted increasing attention for tackling the issue of distribution shift during the test phase. While prior methods have shown effectiveness in addressing distribution shift by adjusting classification logits, they are not optimal due to keeping text features unchanged. To address this issue, we introduce a new approach called Test-time Alignment-Enhanced Adapter (TAEA), which trains an adapter with test samples to adjust text features during the test phase. We can enhance the text-to-image alignment prediction by utilizing an adapter to adapt text features. Furthermore, we also propose to adopt the negative cache from TDA as enhancement module, which further improves the performance of TAEA. Our approach outperforms the state-of-the-art TTA method of pre-trained VLMs by an average of 0.75% on the out-of-distribution benchmark and 2.5% on the cross-domain benchmark, with an acceptable training time. Code will be available at this https URL.

[CV-130] DynamicAvatars: Accurate Dynamic Facial Avatars Reconstruction and Precise Editing with Diffusion Models

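[Quick Read]: This paper addresses facial distortions, inaccurate head movements, and limited fine-grained editing capabilities when generating and editing dynamic 3D head avatars for virtual reality and film production. The key to the solution is DynamicAvatars, a model that generates photorealistic, moving 3D head avatars from video clips and parameters associated with facial positions and expressions. Its core contributions are: 1) a prompt-based editing model that combines user-provided prompts with guiding parameters derived from large language models (LLMs) for precise editing; 2) a dual-tracking framework based on Gaussian Splatting; 3) a prompt preprocessing module that improves editing stability; 4) a specialized GAN algorithm connected to a control module that generates precise guiding parameters; 5) a dynamic editing strategy that selectively uses specific training datasets to improve efficiency and adaptability in dynamic editing tasks.

Link: https://arxiv.org/abs/2411.15732
Authors: Yangyang Qian, Yuan Sun, Yu Guo
Keywords: film production, virtual reality, reality and film, head avatars, Generating
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)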

Abstract:Generating and editing dynamic 3D head avatars are crucial tasks in virtual reality and film production. However, existing methods often suffer from facial distortions, inaccurate head movements, and limited fine-grained editing capabilities. To address these challenges, we present DynamicAvatars, a dynamic model that generates photorealistic, moving 3D head avatars from video clips and parameters associated with facial positions and expressions. Our approach enables precise editing through a novel prompt-based editing model, which integrates user-provided prompts with guiding parameters derived from large language models (LLMs). To achieve this, we propose a dual-tracking framework based on Gaussian Splatting and introduce a prompt preprocessing module to enhance editing stability. By incorporating a specialized GAN algorithm and connecting it to our control module, which generates precise guiding parameters from LLMs, we successfully address the limitations of existing methods. Additionally, we develop a dynamic editing strategy that selectively utilizes specific training datasets to improve the efficiency and adaptability of the model for dynamic editing tasks.

[CV-131] OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions

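[Quick Read]: This paper addresses the lack of occlusion data in existing action recognition video datasets, which limits model robustness and impedes sustained performance improvements. The key to the solution is OccludeNet, a large-scale occluded video dataset containing both real-world and synthetic occlusion scene videos and covering dynamic tracking occlusion, static scene occlusion, and multi-view interactive occlusion. The paper further proposes the Causal Action Recognition (CAR) framework, which uses a structural causal model, backdoor adjustment, and counterfactual reasoning to enhance key actor information and improve robustness to occlusion. The authors expect the dataset to stimulate further exploration of causal relations in occlusion scenarios and a reevaluation of class correlations.

Link: https://arxiv.org/abs/2411.15729
Authors: Guanyu Zhou, Wenxuan Liu, Wenxin Huang, Xuemei Jia, Xian Zhong, Chia-Wen Lin
Keywords: video datasets limits, impedes sustained performance, recognition video datasets, impedes sustained, action recognition video
Subjects: Computer Vision and Pattern Recognition (cs.CV)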

Abstract:The lack of occlusion data in commonly used action recognition video datasets limits model robustness and impedes sustained performance improvements. We construct OccludeNet, a large-scale occluded video dataset that includes both real-world and synthetic occlusion scene videos under various natural environments. OccludeNet features dynamic tracking occlusion, static scene occlusion, and multi-view interactive occlusion, addressing existing gaps in data. Our analysis reveals that occlusion impacts action classes differently, with actions involving low scene relevance and partial body visibility experiencing greater accuracy degradation. To overcome the limitations of current occlusion-focused approaches, we propose a structural causal model for occluded scenes and introduce the Causal Action Recognition (CAR) framework, which employs backdoor adjustment and counterfactual reasoning. This framework enhances key actor information, improving model robustness to occlusion. We anticipate that the challenges posed by OccludeNet will stimulate further exploration of causal relations in occlusion scenarios and encourage a reevaluation of class correlations, ultimately promoting sustainable performance improvements. The code and full dataset will be released soon.
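
The abstract names backdoor adjustment as one ingredient of CAR. As a reminder of what that operation computes, the sketch below evaluates the standard backdoor formula, P(y | do(x)) = sum over z of P(y | x, z) P(z), on a toy discrete example. Treating the confounder z as an occlusion state, and all probability values, are assumptions made only for illustration; the paper's structural causal model is not reproduced here.

```python
def backdoor_adjust(p_z, p_y_given_x_z):
    """P(y | do(x)) = sum_z P(y | x, z) * P(z) for a discrete confounder z."""
    return sum(p_y_given_x_z[z] * p for z, p in p_z.items())

# Hypothetical confounder distribution: is the actor occluded?
p_z = {"clear": 0.7, "occluded": 0.3}
# Hypothetical P(correct action | video x, z) for one fixed video x.
p_correct = {"clear": 0.9, "occluded": 0.5}

p_do = backdoor_adjust(p_z, p_correct)
```

The adjustment averages the per-stratum predictions with the marginal P(z), not with P(z | x), which is what removes the confounder's influence on the input.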

[CV-132] GSurf: 3D Reconstruction via Signed Distance Fields with Direct Gaussian Supervision

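[Quick Read]: This paper addresses two problems in surface reconstruction from multi-view images: the slow training and rendering of methods that use signed distance fields (SDF) within Neural Radiance Fields (NeRF), and the incomplete, fragmented surfaces that 3D Gaussian splatting (3DGS) methods produce when depth data is noisy or missing. The key to the solution is GSurf, an end-to-end method that learns a signed distance field directly from Gaussian primitives. GSurf renders with Gaussian splatting and so avoids the redundant volume rendering required by other approaches, which yields faster training and rendering while keeping 3D reconstruction quality comparable to neural implicit surface methods such as VolSDF and NeuS.

Link: https://arxiv.org/abs/2411.15723
Authors: Xu Baixin, Hu Jiangbei, Li Jiaze, He Ying
Keywords: multi-view images, core challenge, Neural Radiance Fields, Radiance Fields, Gaussian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: see this https URL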

Abstract:Surface reconstruction from multi-view images is a core challenge in 3D vision. Recent studies have explored signed distance fields (SDF) within Neural Radiance Fields (NeRF) to achieve high-fidelity surface reconstructions. However, these approaches often suffer from slow training and rendering speeds compared to 3D Gaussian splatting (3DGS). Current state-of-the-art techniques attempt to fuse depth information to extract geometry from 3DGS, but frequently result in incomplete reconstructions and fragmented surfaces. In this paper, we introduce GSurf, a novel end-to-end method for learning a signed distance field directly from Gaussian primitives. The continuous and smooth nature of SDF addresses common issues in the 3DGS family, such as holes resulting from noisy or missing depth data. By using Gaussian splatting for rendering, GSurf avoids the redundant volume rendering typically required in other GS and SDF integrations. Consequently, GSurf achieves faster training and rendering speeds while delivering 3D reconstruction quality comparable to neural implicit surface methods, such as VolSDF and NeuS. Experimental results across various benchmark datasets demonstrate the effectiveness of our method in producing high-fidelity 3D reconstructions.

[CV-133] Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

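[Quick Read]: This paper addresses the robustness of pre-trained vision-language models (VLMs) against adversarial attacks. The key to the solution is a new strategy called Chain of Attack (CoA), which iteratively enhances adversarial example generation through a series of intermediate attacking steps based on multi-modal semantic updates, achieving better adversarial transferability and efficiency. CoA emphasizes the semantic correlations between the vision and text modalities to improve adversarial example generation and attack performance. The paper also proposes a unified method for computing attack success rate for automatic evasion evaluation. Experiments show that CoA can mislead models into generating targeted responses using only black-box attacks, without any knowledge of the victim models, which exposes vulnerabilities of VLMs and offers a reference for safety considerations in future model development.

Link: https://arxiv.org/abs/2411.15720
Authors: Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen
Keywords: natural language understanding, Pre-trained vision-language models, Pre-trained vision-language, showcased remarkable performance, image captioning
Subjects: Computer Vision and Pattern Recognition (cs.CV)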

Abstract:Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario, demonstrate that our attacking strategy can effectively mislead models to generate targeted responses using only black-box attacks without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments.

[CV-134] ROOT: VLM based System for Indoor Scene Understanding and Beyond

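[Quick Read]: This paper addresses the weakness of Vision Language Models (VLMs) in spatial hierarchical reasoning within indoor scenes. The key to the solution is ROOT, a VLM-based system that uses an iterative object perception algorithm built on GPT-4V to detect object entities in indoor scenes and then uses vision foundation models to obtain meta-information about the scene, such as bounding boxes. On top of this, the paper proposes SceneVLM, a VLM specialized for indoor scenes that generates spatial hierarchical scene graphs and provides distance information between objects. To train SceneVLM, the authors collect over 610,000 images from public indoor datasets and build a semi-automated scene data generation pipeline that establishes relationships and estimates distances among objects. Experiments show that ROOT facilitates indoor scene understanding and is effective in downstream applications such as 3D scene generation and embodied AI.

Link: https://arxiv.org/abs/2411.15714
Authors: Yonghui Wang, Shi-Yong Chen, Zhenxing Zhou, Siyi Li, Haoran Li, Wengang Zhou, Houqiang Li
Keywords: Vision Language Models, experienced significant advancements, Vision Language, Language Models, significant advancements
Subjects: Computer Vision and Pattern Recognition (cs.CV)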

Abstract: Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. By utilizing this enriched data, we conduct various training recipes and finish SceneVLM. Our experiments demonstrate that ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at this https URL.

[CV-135] Fixing the Perspective: A Critical Examination of Zero-1-to-3

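[Quick Read]: This paper addresses novel view synthesis in image-to-3D generation, in particular the difficulty that methods such as Zero-1-to-3 have in generating consistent and accurate novel views when handling multiple conditioning images. The key to the solution is a thorough analysis of Zero-1-to-3's cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet, which reveals a critical discrepancy between the theoretical framework and its implementation. The paper proposes two improvements: a corrected implementation that makes effective use of the cross-attention mechanism for the image-conditional context, and an enhanced architecture that can leverage multiple conditioning views simultaneously. Theoretical analysis and preliminary results suggest potential gains in the consistency and accuracy of novel view synthesis.

Link: https://arxiv.org/abs/2411.15706
Authors: Jack Yu, Xueying Jia, Charlie Sun, Prince Wang
Keywords: target view images, relative poses, conditioning images, fundamental challenge, multiple conditioning images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)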

Abstract:Novel view synthesis is a fundamental challenge in image-to-3D generation, requiring the generation of target view images from a set of conditioning images and their relative poses. While recent approaches like Zero-1-to-3 have demonstrated promising results using conditional latent diffusion models, they face significant challenges in generating consistent and accurate novel views, particularly when handling multiple conditioning images. In this work, we conduct a thorough investigation of Zero-1-to-3’s cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet. Our analysis reveals a critical discrepancy between Zero-1-to-3’s theoretical framework and its implementation, specifically in the processing of image-conditional context. We propose two significant improvements: (1) a corrected implementation that enables effective utilization of the cross-attention mechanism, and (2) an enhanced architecture that can leverage multiple conditional views simultaneously. Our theoretical analysis and preliminary results suggest potential improvements in novel view synthesis consistency and accuracy.
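
For readers unfamiliar with the mechanism under discussion, the sketch below implements generic scaled dot-product cross-attention in plain Python, where queries come from image latents and keys and values come from the conditioning context. Concatenating the tokens of two conditioning views is shown as one simple way to attend over several views; whether the paper uses this design is an assumption, and all names and values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """For each query, return the softmax-weighted average of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Two conditioning views, one context token each, concatenated along the token axis.
keys = [[1.0, 0.0]] + [[1.0, 0.0]]
values = [[2.0, 0.0]] + [[4.0, 0.0]]
mixed = cross_attention([[1.0, 0.0]], keys, values)

# With a single context token the softmax weight is always 1,
# so the output equals that token's value regardless of the query.
single = cross_attention([[5.0, -3.0]], [[1.0, 0.0]], [[2.0, 7.0]])
```

The second call shows a general property of attention: over a one-token context it reduces to a constant, so the query has no influence on the result. Whether this is the specific discrepancy the paper reports is not stated in the abstract.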

[CV-136] Editable-DeepSC: Reliable Cross-Modal Semantic Communications for Facial Editing

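[Quick Read]: This paper addresses the mismatch between conventional communications and the needs of real-time computer vision (CV) tasks, specifically Semantic Facial Editing on social media. The key to the solution is Editable-DeepSC, a novel cross-modal semantic communication approach. It uses Joint Editing-Channel Coding (JECC) to integrate editing into the communication chain and preserve more semantic mutual information. It also uses inversion with pre-trained StyleGAN priors for semantic coding, and signal-to-noise ratio (SNR) aware channel coding via model fine-tuning to cope with dynamic channel noise. As a result, Editable-DeepSC achieves superior editing while significantly saving transmission bandwidth, even under high-resolution and out-of-distribution (OOD) settings.

Link: https://arxiv.org/abs/2411.15702
Authors: Bin Chen, Wenbo Yu, Qinshan Zhang, Shu-Tao Xia
Keywords: Real-time computer vision, computer vision, plays a crucial, crucial role, performance is highly
Subjects: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)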

Abstract:Real-time computer vision (CV) plays a crucial role in various real-world applications, whose performance is highly dependent on communication networks. Nonetheless, the data-oriented characteristics of conventional communications often do not align with the special needs of real-time CV tasks. To alleviate this issue, the recently emerged semantic communications only transmit task-related semantic information and exhibit a promising landscape to address this problem. However, the communication challenges associated with Semantic Facial Editing, one of the most important real-time CV applications on social media, still remain largely unexplored. In this paper, we fill this gap by proposing Editable-DeepSC, a novel cross-modal semantic communication approach for facial editing. Firstly, we theoretically discuss different transmission schemes that separately handle communications and editings, and emphasize the necessity of Joint Editing-Channel Coding (JECC) via iterative attributes matching, which integrates editings into the communication chain to preserve more semantic mutual information. To compactly represent the high-dimensional data, we leverage inversion methods via pre-trained StyleGAN priors for semantic coding. To tackle the dynamic channel noise conditions, we propose SNR-aware channel coding via model fine-tuning. Extensive experiments indicate that Editable-DeepSC can achieve superior editings while significantly saving the transmission bandwidth, even under high-resolution and out-of-distribution (OOD) settings.
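
The SNR-aware channel coding described above is trained against varying channel noise. As a rough illustration only, the sketch below simulates an additive white Gaussian noise (AWGN) channel at a target SNR in dB and measures the SNR that results. The AWGN model, the function names, and the sample values are assumptions made for this example; they are not taken from the paper or its code.

```python
import math
import random

def awgn_channel(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy copy."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)

rng = random.Random(0)
# Hypothetical stand-in for a transmitted semantic code vector.
latent = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
received = awgn_channel(latent, snr_db=10.0, rng=rng)
print(round(measured_snr_db(latent, received), 1))
```

Sweeping `snr_db` over a range of values is one simple way to expose a model to different noise conditions during fine-tuning.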

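[CV-137] Towards RAW Object Detection in Diverse Conditions

[Quick Read]: This paper addresses the potential loss of crucial information when object detectors, under complex light and weather conditions, use sRGB images compressed from RAW data by an image signal processor (ISP). The key to the solution is the AODRaw dataset, which contains 7,785 high-resolution real RAW images with 135,601 annotated instances across 62 categories, covering indoor and outdoor scenes under 9 distinct light and weather conditions. By pre-training directly on the RAW domain and distilling knowledge from a model pre-trained on the sRGB domain, the paper substantially improves detection under diverse and adverse conditions without relying on extra pre-processing modules.

Link: https://arxiv.org/abs/2411.15678
Authors: Zhong-Yu Li,Xin Jin,Boyuan Sun,Chun-Le Guo,Ming-Ming Cheng
Keywords: ISP originally designed, Existing object detection, data using ISP, ISP originally, Existing object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: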

Abstract:Existing object detection methods often consider sRGB input, which was compressed from RAW data using ISP originally designed for visualization. However, such compression might lose crucial information for detection, especially under complex light and weather conditions. We introduce the AODRaw dataset, which offers 7,785 high-resolution real RAW images with 135,601 annotated instances spanning 62 categories, capturing a broad range of indoor and outdoor scenes under 9 distinct light and weather conditions. Based on AODRaw that supports RAW and sRGB object detection, we provide a comprehensive benchmark for evaluating current detection methods. We find that sRGB pre-training constrains the potential of RAW object detection due to the domain gap between sRGB and RAW, prompting us to directly pre-train on the RAW domain. However, it is harder for RAW pre-training to learn rich representations than sRGB pre-training due to the camera noise. To assist RAW pre-training, we distill the knowledge from an off-the-shelf model pre-trained on the sRGB domain. As a result, we achieve substantial improvements under diverse and adverse conditions without relying on extra pre-processing modules. Code and dataset are available at this https URL.

[CV-138] Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment CVPR2024

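[Quick Read]: This paper addresses the security threats, such as backdooring and poisoning attacks, that arise when self-supervised vision-language models are trained on large-scale datasets scraped from the web. The key to the solution is to use external knowledge extracted from a language model to prevent the model from learning correlations between image regions that lack strong alignment with that knowledge. Specifically, constraints enforce that the attention the model pays to visual regions is proportional to the alignment of those regions with external knowledge, which defends against such attacks while maintaining model utility and requiring no changes at inference time.

Link: https://arxiv.org/abs/2411.15673
Authors: Alvi Md Ishmam,Christopher Thomas
Keywords: self-supervised objectives, enormous interest, external knowledge, attacks, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024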

Abstract:In recent years there has been enormous interest in vision-language models trained using self-supervised objectives. However, the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats, such as backdooring and poisoning attacks. In this paper, we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions which lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge. We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings, while maintaining model utility and without requiring any changes at inference time.

[CV-139] SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution

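[Quick Read]: This paper addresses the acceleration of convolutions on CPU-based architectures. Existing methods mostly pack image data into the columns of a matrix (im2col) and compute with general matrix multiplication (GEMM), which has two main drawbacks: im2col requires a large memory buffer and can cause inefficient memory access, and GEMM, although highly optimized for scientific matrix multiplication, is not well suited to convolutions. The key to the proposed solution is to use scalar-matrix multiplication and reduce memory overhead, thereby significantly speeding up convolution. Experiments show a significant speedup over existing indirect methods on commonly used network architectures.

Link: https://arxiv.org/abs/2411.15659
Authors: Amir Ofir,Gil Ben-Artzi
Keywords: inference for CPU-based, accelerating convolutions, CPU-based architectures, performing general matrix, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: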

Abstract:We present a novel approach for accelerating convolutions during inference for CPU-based architectures. The most common method of computation involves packing the image into the columns of a matrix (im2col) and performing general matrix multiplication (GEMM) with a matrix of weights. This results in two main drawbacks: (a) im2col requires a large memory buffer and can experience inefficient memory access, and (b) while GEMM is highly optimized for scientific matrices multiplications, it is not well suited for convolutions. We propose an approach that takes advantage of scalar-matrix multiplication and reduces memory overhead. Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
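
To make the contrast with im2col concrete, the sketch below computes a stride-1, unpadded 2D convolution (cross-correlation) by accumulating one scalar-times-shifted-window product per kernel weight, so no im2col buffer is built. It is a plain-Python illustration of the general scalar-matrix idea only: it does not include the paper's zero packing or any vectorization, and the function names are hypothetical.

```python
def conv2d_direct(image, kernel):
    """Reference cross-correlation (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)] for r in range(oh)]

def conv2d_scalar_matrix(image, kernel):
    """Accumulate scalar * shifted image window, one term per kernel weight."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(kh):
        for j in range(kw):
            w = kernel[i][j]  # the scalar for this step
            for r in range(oh):
                src = image[r + i]
                dst = out[r]
                for c in range(ow):
                    dst[c] += w * src[c + j]  # window shifted by (i, j)
    return out

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0], [0, -1]]
print(conv2d_scalar_matrix(image, kernel) == conv2d_direct(image, kernel))
```

The only extra memory here is the output buffer itself, whereas im2col would first materialise a matrix with one column per output position.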

[CV-140] Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data NEURIPS2024

[Quick Read]: This paper addresses the high deployment cost of open-vocabulary 3D object detection that depends on point cloud data. The key to the solution is OVM3D-Det, a novel open-vocabulary monocular 3D object detection framework trained using only RGB images, which lowers cost and improves scalability. OVM3D-Det uses open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, and introduces adaptive pseudo-LiDAR erosion and bounding box refinement with prior knowledge from large language models to calibrate the 3D labels and enable RGB-only training of 3D detectors.

Link: https://arxiv.org/abs/2411.15657
Authors: Rui Huang,Henry Zheng,Yan Wang,Zhuofan Xia,Marco Pavone,Gao Huang
Keywords: previously unseen domains, recently attracted considerable, attracted considerable attention, considerable attention due, driving and robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024

Abstract:Open-vocabulary 3D object detection has recently attracted considerable attention due to its broad applications in autonomous driving and robotics, which aims to effectively recognize novel classes in previously unseen domains. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However, training 3D models with labels directly derived from pseudo-LiDAR is inadequate due to imprecise boxes estimated from noisy point clouds and severely occluded objects. To address these issues, we introduce two innovative designs: adaptive pseudo-LiDAR erosion and bounding box refinement with prior knowledge from large language models. These techniques effectively calibrate the 3D labels and enable RGB-only training for 3D detectors. Extensive experiments demonstrate the superiority of OVM3D-Det over baselines in both indoor and outdoor scenarios. The code will be released.

[CV-141] Machine Learning-based sEMG Signal Classification for Hand Gesture Recognition

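[Quick Read]: This paper benchmarks EMG-based hand gesture recognition using novel feature extraction methods, namely fused time-domain descriptors, temporal-spatial descriptors, and wavelet transform-based features, combined with state-of-the-art machine learning and deep learning models. The key results are that a 1D Dilated CNN reaches 97% accuracy on the Grabmyo dataset using fused time-domain descriptors (such as power spectral moments, sparsity, irregularity factor, and waveform length ratio), and a random forest reaches 94.95% accuracy on the FORS-EMG dataset using temporal-spatial descriptors (time-domain features plus the coefficient of variation (COV) and the Teager-Kaiser energy operator (TKEO)).

Link: https://arxiv.org/abs/2411.15655
Authors: Parshuram N. Aarotale,Ajita Rattani
Keywords: analyzing electrical activity, electrical activity generated, classify hand movements, movements by analyzing, analyzing electrical
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE BIBM 2024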

Abstract:EMG-based hand gesture recognition uses electromyographic (EMG) signals to interpret and classify hand movements by analyzing electrical activity generated by muscle contractions. It has wide applications in prosthesis control, rehabilitation training, and human-computer interaction. Using electrodes placed on the skin, the EMG sensor captures muscle signals, which are processed and filtered to reduce noise. Numerous feature extraction and machine learning algorithms have been proposed to extract and classify muscle signals to distinguish between various hand gestures. This paper aims to benchmark the performance of EMG-based hand gesture recognition using novel feature extraction methods, namely, fused time-domain descriptors, temporal-spatial descriptors, and wavelet transform-based features, combined with the state-of-the-art machine and deep learning models. Experimental investigations on the Grabmyo dataset demonstrate that the 1D Dilated CNN performed the best with an accuracy of 97% using fused time-domain descriptors such as power spectral moments, sparsity, irregularity factor and waveform length ratio. Similarly, on the FORS-EMG dataset, random forest performed the best with an accuracy of 94.95% using temporal-spatial descriptors (which include time domain features along with additional features such as coefficient of variation (COV), and Teager-Kaiser energy operator (TKEO)).
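
Two of the descriptors named above have short textbook definitions, sketched here for one window of samples. The discrete TKEO is psi[n] = x[n]^2 - x[n-1] * x[n+1], and the COV is the standard deviation divided by the mean. The use of the population standard deviation, the function names, and the sample values are assumptions for illustration; the paper's exact windowing and normalisation are not given in the abstract.

```python
import statistics

def tkeo(x):
    """Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def coefficient_of_variation(x):
    """Population standard deviation over the mean (assumes a nonzero mean)."""
    return statistics.pstdev(x) / statistics.fmean(x)

# Hypothetical window of rectified sEMG samples.
window = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(tkeo(window))
print(coefficient_of_variation(window))
```

The TKEO output is two samples shorter than its input because the first and last samples lack a neighbour on one side.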

[CV-142] OCDet: Object Center Detection via Bounding Box-Aware Heatmap Prediction on Edge Devices with NPUs

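[Quick Read]: This paper addresses real-time object localization on resource-constrained edge devices. Traditional frameworks such as object detection, segmentation, and keypoint detection struggle in resource-constrained environments, often resulting in substantial target omissions. The key to the solution is OCDet, a lightweight object center detection framework optimized for edge devices with NPUs. OCDet predicts heatmaps representing object center probabilities and extracts center points through peak identification. Unlike prior methods that use a fixed Gaussian distribution, OCDet introduces Generalized Centerness (GC) to generate ground-truth heatmaps from bounding box annotations, providing finer spatial detail without additional manual labeling. OCDet is built on an NPU-friendly Semantic FPN with MobileNetV4 backbones and is trained with Balanced Continuous Focal Loss (BCFL), which alleviates data imbalance and focuses training on hard negative examples in probability regression tasks. Evaluated with the Center Alignment Score (CAS) and Hungarian matching, OCDet significantly outperforms YOLO11 in object center detection while reducing parameter count, computation, and NPU latency.

Link: https://arxiv.org/abs/2411.15653
Authors: Chen Xin,Thomas Motz,Andreas Hartel,Enkelejda Kasneci
Keywords: Real-time object localization, Real-time object, numerous applications, ranging from surveillance, industrial automation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: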

Abstract:Real-time object localization on edge devices is fundamental for numerous applications, ranging from surveillance to industrial automation. Traditional frameworks, such as object detection, segmentation, and keypoint detection, struggle in resource-constrained environments, often resulting in substantial target omissions. To address these challenges, we introduce OCDet, a lightweight Object Center Detection framework optimized for edge devices with NPUs. OCDet predicts heatmaps representing object center probabilities and extracts center points through peak identification. Unlike prior methods using fixed Gaussian distribution, we introduce Generalized Centerness (GC) to generate ground truth heatmaps from bounding box annotations, providing finer spatial details without additional manual labeling. Built on NPU-friendly Semantic FPN with MobileNetV4 backbones, OCDet models are trained by our Balanced Continuous Focal Loss (BCFL), which alleviates data imbalance and focuses training on hard negative examples for probability regression tasks. Leveraging the novel Center Alignment Score (CAS) with Hungarian matching, we demonstrate that OCDet consistently outperforms YOLO11 in object center detection, achieving up to 23% higher CAS while requiring 42% fewer parameters, 34% less computation, and 64% lower NPU latency. When compared to keypoint detection frameworks, OCDet achieves substantial CAS improvements up to 186% using identical models. By integrating GC, BCFL, and CAS, OCDet establishes a new paradigm for efficient and robust object center detection on edge devices with NPUs. The code is released at this https URL.
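
The step "extracts center points through peak identification" can be sketched as a local-maximum search over the predicted heatmap. The version below keeps a cell when it is at least as large as its 8 neighbours and above a score threshold. The 3x3 neighbourhood, the threshold value, and the function name are assumptions for illustration; the abstract does not specify OCDet's exact peak procedure.

```python
def extract_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for cells >= all 8 neighbours and >= threshold."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(v >= n for n in neighbours):
                peaks.append((r, c, v))
    return peaks

# Hypothetical 4x4 center-probability heatmap with two objects.
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.2],
]
print(extract_peaks(heatmap))
```

Because ties are accepted, a flat plateau above the threshold yields several peaks; a real pipeline would likely deduplicate them.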

[CV-143] Sample- and Parameter-Efficient Auto-Regressive Image Models

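[Quick Read]: This paper addresses the limited sample and parameter efficiency of existing auto-regressive image models, and notes that contrastive and masked image modeling methods have not shown consistent scaling behavior on unbalanced internet data. The key to the solution is XTRA, a model trained with a new auto-regressive objective that uses a Block Causal Mask to reconstruct pixel values block by block rather than token by token. Block-level reconstruction lets the model capture higher-level structural patterns over larger image regions and learn relationships across broader areas of pixels, yielding more abstract and semantically meaningful representations. This simple but effective change significantly improves sample and parameter efficiency: trained on far fewer samples (13.1M vs. 2B), XTRA still surpasses previous auto-regressive models across 15 diverse image recognition benchmarks, and it is also more parameter-efficient (85M vs. 1.36B/0.63B).

Link: https://arxiv.org/abs/2411.15648
Authors: Elad Amrani,Leonid Karlinsky,Alex Bronstein
Keywords: XTRA, vision model pre-trained, Causal Mask, objective that significantly, significantly enhances
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: for code, see this https URL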

Abstract:We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been demonstrated as having consistent scaling behavior on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents k × k tokens rather than relying on a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically meaningful representations than traditional next-token prediction. This simple modification yields two key results. First, XTRA is sample-efficient. Despite being trained on 152× fewer samples (13.1M vs. 2B), XTRA ViT-H/14 surpasses the top-1 average accuracy of the previous state-of-the-art auto-regressive model across 15 diverse image recognition benchmarks. Second, XTRA is parameter-efficient. Compared to auto-regressive models trained on ImageNet-1k, XTRA ViT-B/16 outperforms in linear and attentive probing tasks, using 7-16× fewer parameters (85M vs. 1.36B/0.63B).
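
One plausible reading of a Block Causal Mask over k × k token blocks is sketched below: tokens are listed in raster order, blocks are also ordered in raster order, and a token may attend to every token in its own block and in all earlier blocks. The ordering, the full visibility inside a block, and the function name are assumptions for illustration; the abstract does not specify these details.

```python
def block_causal_mask(grid_h, grid_w, k):
    """mask[i][j] is True when token i may attend to token j."""
    assert grid_h % k == 0 and grid_w % k == 0
    blocks_per_row = grid_w // k
    # Block index of every token, with tokens listed in raster order.
    block_of = [(r // k) * blocks_per_row + (c // k)
                for r in range(grid_h) for c in range(grid_w)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]

# 4x4 token grid with 2x2 blocks: four blocks of four tokens each.
mask = block_causal_mask(4, 4, 2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With k = 1 every block holds a single token, so the mask reduces to the standard causal mask.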

[CV-144] Effort: Efficient Orthogonal Modeling for Generalizable AI-Generated Image Detection

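[Quick Read]: This paper addresses the limited generalization of existing AI-generated image (AIGI) detection methods. The key to the solution is identifying and exploiting an asymmetry in AIGI detection: during training, models quickly overfit to specific fake patterns in the training set while failing to capture other information, which leads to poor generalization to new forgery methods. The paper incorporates the rich semantic knowledge embedded in large-scale vision foundation models (VFMs) to expand the previous discriminative space, which was based on forgery patterns only, so that discrimination depends on both forgery and semantic cues and overfitting to specific forgery patterns is reduced. Concretely, the proposed method, Effort, uses Singular Value Decomposition (SVD) to construct orthogonal semantic and forgery subspaces, freezes the principal components, and adapts the residual components (about 0.19M parameters), preserving the original semantic subspace and learning forgery features in its orthogonal subspace. Experiments on AIGI detection benchmarks show superior effectiveness.

Link: https://arxiv.org/abs/2411.15633
Authors: Zhiyuan Yan,Jiangming Wang,Zhendong Wang,Peng Jin,Ke-Yue Zhang,Shen Chen,Taiping Yao,Shouhong Ding,Baoyuan Wu,Li Yuan
Keywords: Existing AI-generated image, limited generalization performance, Existing AI-generated, AIGI detection, AI-generated image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: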

Abstract:Existing AI-generated image (AIGI) detection methods often suffer from limited generalization performance. In this paper, we identify a crucial yet previously overlooked asymmetry phenomenon in AIGI detection: during training, models tend to quickly overfit to specific fake patterns in the training set, while other information is not adequately captured, leading to poor generalization when faced with new fake methods. A key insight is to incorporate the rich semantic knowledge embedded within large-scale vision foundation models (VFMs) to expand the previous discriminative space (based on forgery patterns only), such that the discrimination is decided by both forgery and semantic cues, thereby reducing the overfitting to specific forgery patterns. A straightforward solution is to fully fine-tune VFMs, but it risks distorting the well-learned semantic knowledge, pushing the model back toward overfitting. To this end, we design a novel approach called Effort: Efficient orthogonal modeling for generalizable AIGI detection. Specifically, we employ Singular Value Decomposition (SVD) to construct the orthogonal semantic and forgery subspaces. By freezing the principal components and adapting the residual components (~0.19M parameters), we preserve the original semantic subspace and use its orthogonal subspace for learning forgeries. Extensive experiments on AIGI detection benchmarks demonstrate the superior effectiveness of our approach.

[CV-145] ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos WACV2025

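[Quick Read]: This paper addresses the lack of intrinsic understanding of action concepts in existing vision-language models (VLMs) for procedural action classification, which causes overfitting to fixed labels and a lack of invariance to unseen action synonyms. The key to the solution is a simple fine-tuning technique called Action Concept Enhancement (ACE). ACE stochastically replaces fixed labels during training and continually incorporates augmented action synonyms and negatives in an auxiliary classification loss, creating new combinations of action labels that prevent overfitting to fixed action representations and strengthen the model's understanding of action concepts. Experiments show that ACE significantly improves zero-shot action classification while maintaining competitive performance on seen actions.

Link: https://arxiv.org/abs/2411.15628
Authors: Reza Ghoddoosian,Nakul Agarwal,Isht Dwivedi,Behzad Darisuh
Keywords: Vision-language models, action, recognizing unseen actions, capable of recognizing, unseen action synonyms
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025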

Abstract:Vision-language models (VLMs) are capable of recognizing unseen actions. However, existing VLMs lack intrinsic understanding of procedural action concepts. Hence, they overfit to fixed labels and are not invariant to unseen action synonyms. To address this, we propose a simple fine-tuning technique, Action Concept Enhancement (ACE), to improve the robustness and concept understanding of VLMs in procedural action classification. ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss by stochastically replacing fixed labels during training. This creates new combinations of action labels over the course of fine-tuning and prevents overfitting to fixed action representations. We show the enhanced concept understanding of our VLM, by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space. Our experiments on the ATA, IKEA and GTEA datasets demonstrate the efficacy of ACE in domains of cooking and assembly leading to significant improvements in zero-shot action classification while maintaining competitive performance on seen actions.
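
The idea of "stochastically replacing fixed labels during training" can be illustrated with a small sampler that, with some probability, swaps a fixed action label for one of its synonyms. The synonym table, the replacement probability, and the function name are hypothetical, and the sketch covers only the label-sampling step, not the negatives or the auxiliary classification loss that ACE also uses.

```python
import random

def sample_label(label, synonyms, replace_prob, rng):
    """With probability replace_prob, swap the fixed label for one of its synonyms."""
    candidates = synonyms.get(label, [])
    if candidates and rng.random() < replace_prob:
        return rng.choice(candidates)
    return label

# Hypothetical synonym table for two procedural actions.
synonyms = {
    "pour water": ["add water", "fill with water"],
    "tighten screw": ["fasten screw"],
}
rng = random.Random(0)
batch = ["pour water", "tighten screw", "stir"]
print([sample_label(lbl, synonyms, replace_prob=0.5, rng=rng) for lbl in batch])
```

Labels without an entry in the table, such as "stir" here, are always returned unchanged.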

[CV-146] On the importance of local and global feature learning for automated measurable residual disease detection in flow cytometry data ICPR2024

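[Quick Read]: This paper addresses measurable residual disease (MRD) detection in flow cytometry (FCM) data, focusing on the benefits of deep learning methods that model long-range dependencies, obtain global information, and learn local features. The key to the solution is two adaptations to the current state-of-the-art (SOTA) model: one strengthens the modeling of long-range dependencies, and the other improves how global information is obtained and local features are learned. The enhanced model shows superior performance on publicly available datasets and better generalization across laboratories, and the findings offer guidance to the FCM community for future deep learning architecture designs for FCM data analysis.

Link: https://arxiv.org/abs/2411.15621
Authors: Lisa Weijler,Michael Reiter,Pedro Hermosilla,Margarita Maurer-Granofszky,Michael Dworzak
Keywords: learning local features, measurable residual disease, modeling long-range dependencies, obtaining global information, deep learning methods
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICPR 2024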

Abstract:This paper evaluates various deep learning methods for measurable residual disease (MRD) detection in flow cytometry (FCM) data, addressing questions regarding the benefits of modeling long-range dependencies, methods of obtaining global information, and the importance of learning local features. Based on our findings, we propose two adaptations to the current state-of-the-art (SOTA) model. Our contributions include an enhanced SOTA model, demonstrating superior performance on publicly available datasets and improved generalization across laboratories, as well as valuable insights for the FCM community, guiding future DL architecture designs for FCM data analysis. The code is available at this https URL.

[CV-147] Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

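[Quick Read]: This paper addresses the difficulty that existing object detectors based on vision foundation models have in capturing small components of holistic objects and in taking user intention into account. The key to the solution is a new method called FOCUS (Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation). FOCUS combines the capabilities of vision foundation models to perform open-vocabulary object detection at flexible granularity and lets users guide the detection process directly through natural language. It excels at identifying and locating fine-grained constituent elements, and it gives users significant control while minimizing unnecessary intervention. With FOCUS, users can make explainable requests that actively steer detection in the intended direction, which effectively enhances the detection capabilities of baseline models and yields consistent performance across object types.

Link: https://arxiv.org/abs/2411.15620
Authors: Jinwoo Ahn,Hyeokjoon Kwon,Hwiyeon Yoo
Keywords: Recent advent, high-quality object detection, advent of vision-based, enabled efficient, efficient and high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: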

Abstract:Recent advent of vision-based foundation models has enabled efficient and high-quality object detection at ease. Despite the success of previous studies, object detection models face limitations on capturing small components from holistic objects and taking user intention into account. To address these challenges, we propose a novel foundation model-based detection method called FOCUS: Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation. FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and allow users to directly guide the detection process via natural language. It not only excels at identifying and locating granular constituent elements but also minimizes unnecessary user intervention yet grants them significant control. With FOCUS, users can make explainable requests to actively guide the detection process in the intended direction. Our results show that FOCUS effectively enhances the detection capabilities of baseline models and shows consistent performance across varying object types.

[CV-148] Knowledge Transfer Across Modalities with Natural Language Supervision

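[Quick Read]: This paper addresses learning novel concepts from their textual descriptions alone, and proposes a method called Knowledge Transfer. The key to the solution is to use low-level features already learned by a pre-trained visual encoder (e.g. shape, appearance, color) and, through cross-modal interaction, align them with the high-level textual description of the novel concept. The method introduces a new concept efficiently from a single textual description, and it is compatible both with separate textual and visual encoders (e.g. CLIP) and with models that share parameters across modalities. Knowledge Transfer can also improve concepts the model already knows, and it improves zero-shot performance on tasks such as classification, segmentation, image-text retrieval, and captioning.

Link: https://arxiv.org/abs/2411.15611
Authors: Carlo Alberto Barbano,Luca Molinaro,Emanuele Aiello,Marco Grangetto
Keywords: Knowledge Transfer, Transfer, method Knowledge Transfer, Knowledge, Leveraging Knowledge Transfer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures, 17 tables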

Abstract:We present a way to learn novel concepts by only using their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We hypothesize that in a pre-trained visual encoder there are enough low-level features already learned (e.g. shape, appearance, color) that can be used to describe previously unknown high-level concepts. Provided with a textual description of the novel concept, our method works by aligning the known low-level features of the visual encoder to its high-level textual description. We show that Knowledge Transfer can successfully introduce novel concepts in multimodal models, in a very efficient manner, by only requiring a single description of the target concept. Our approach is compatible with both separate textual and visual encoders (e.g. CLIP) and shared parameters across modalities. We also show that, following the same principle, Knowledge Transfer can improve concepts already known by the model. Leveraging Knowledge Transfer we improve zero-shot performance across different tasks such as classification, segmentation, image-text retrieval, and captioning.

[CV-149] GIFT: A Framework for Global Interpretable Faithful Textual Explanations of Vision Classifiers

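[Quick Read]: This paper addresses how to generate post-hoc, global, interpretable, and faithful textual explanations when deploying deep models in safety-critical applications. The key to the solution is GIFT, a framework that starts from locally faithful visual counterfactual explanations and uses (vision) language models to translate them into global textual explanations. GIFT also includes a verification stage that measures the causal effect of the proposed explanations on the classifier's decision, which ensures that the explanations are faithful. Experiments on several datasets, including CLEVR, CelebA, and BDD, demonstrate its effectiveness and reveal the tasks, concepts, and biases used by deep vision classifiers.

Link: https://arxiv.org/abs/2411.15605
Authors: Éloi Zablocki,Valentin Gerard,Amaia Cardiel,Eric Gaussier,Matthieu Cord,Eduardo Valle
Keywords: Understanding deep models, Understanding deep, safety-critical applications, crucial for deploying, Understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: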

Abstract:Understanding deep models is crucial for deploying them in safety-critical applications. We introduce GIFT, a framework for deriving post-hoc, global, interpretable, and faithful textual explanations for vision classifiers. GIFT starts from local faithful visual counterfactual explanations and employs (vision) language models to translate those into global textual explanations. Crucially, GIFT provides a verification stage measuring the causal effect of the proposed explanations on the classifier decision. Through experiments across diverse datasets, including CLEVR, CelebA, and BDD, we demonstrate that GIFT effectively reveals meaningful insights, uncovering tasks, concepts, and biases used by deep vision classifiers. Our code, data, and models are released at this https URL.

[CV-150] FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video

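[Quick Read]: This paper addresses the reconstruction of high-fidelity, animatable 3D head avatars from monocular video, in particular the challenges of incomplete reconstruction and inefficient Gaussian representation. The key to the solution is a new method called FATE, which relies on three techniques: 1) a sampling-based densification strategy that ensures an optimal positional distribution of points and improves rendering efficiency; 2) a neural baking technique that converts discrete Gaussian representations into continuous attribute maps for intuitive appearance editing; and 3) a universal completion framework that recovers non-frontal appearance, resulting in a 360°-renderable 3D head avatar. FATE outperforms previous methods in both qualitative and quantitative evaluations and achieves state-of-the-art performance.

Link: https://arxiv.org/abs/2411.15604
Authors: Jiawei Zhang,Zijian Wu,Zhiyang Liang,Yicheng Gong,Dongfang Hu,Yao Yao,Xun Cao,Hao Zhu
Keywords: effortlessly captured monocular, captured monocular videos, effortlessly captured, pivotal yet formidable, Reconstructing high-fidelity
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL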

Abstract:Reconstructing high-fidelity, animatable 3D head avatars from effortlessly captured monocular videos is a pivotal yet formidable challenge. Although significant progress has been made in rendering performance and manipulation capabilities, notable challenges remain, including incomplete reconstruction and inefficient Gaussian representation. To address these challenges, we introduce FATE, a novel method for reconstructing an editable full-head avatar from a single monocular video. FATE integrates a sampling-based densification strategy to ensure optimal positional distribution of points, improving rendering efficiency. A neural baking technique is introduced to convert discrete Gaussian representations into continuous attribute maps, facilitating intuitive appearance editing. Furthermore, we propose a universal completion framework to recover non-frontal appearance, culminating in a 360°-renderable 3D head avatar. FATE outperforms previous approaches in both qualitative and quantitative evaluations, achieving state-of-the-art performance. To the best of our knowledge, FATE is the first animatable and 360° full-head monocular reconstruction method for a 3D head avatar. The code will be publicly released upon publication.

[CV-151] Enhancing Object Detection Accuracy in Autonomous Vehicles Using Synthetic Data

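[Quick Read]: This paper addresses the limited real-world performance of machine learning models caused by scarce, noisy, and imbalanced training data. The key to the solution is to use synthetic data to improve the quality and diversity of the training set and thereby improve prediction accuracy. The paper creates a synthetic dataset and applies it to object detection in autonomous driving scenarios to validate the effect of synthetic data on model performance. The experiments show that the model trained on combined real and synthetic data (System-2) outperforms the model trained on real data only (System-1) on key metrics including accuracy, precision, recall, and mean average precision, with a 3% improvement in accuracy.

Link: https://arxiv.org/abs/2411.15602
Authors: Sergei Voronin,Abubakar Siddique,Muhammad Iqbal
Keywords: machine learning models, machine learning, disease diagnoses, learning models, learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 Pages, 7 figures, 1 table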

Abstract:The rapid progress in machine learning models has significantly boosted the potential for real-world applications such as autonomous vehicles, disease diagnoses, and recognition of emergencies. The performance of many machine learning models depends on the nature and size of the training data sets. These models often face challenges due to the scarcity, noise, and imbalance in real-world data, limiting their performance. Nonetheless, high-quality, diverse, relevant and representative training data is essential to build accurate and reliable machine learning models that adapt well to real-world scenarios. It is hypothesised that well-designed synthetic data can improve the performance of a machine learning algorithm. This work aims to create a synthetic dataset and evaluate its effectiveness to improve the prediction accuracy of object detection systems. This work considers autonomous vehicle scenarios as an illustrative example to show the efficacy of synthetic data. The effectiveness of these synthetic datasets in improving the performance of state-of-the-art object detection models is explored. The findings demonstrate that incorporating synthetic data improves model performance across all performance metrics. Two deep learning systems, System-1 (trained on real-world data) and System-2 (trained on a combination of real and synthetic data), are evaluated using the state-of-the-art YOLO model across multiple metrics, including accuracy, precision, recall, and mean average precision. Experimental results revealed that System-2 outperformed System-1, showing a 3% improvement in accuracy, along with superior performance in all other metrics.

[CV-152] How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking

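[Quick Read]: This paper addresses the problem that, in vision-language tracking (VLT), semantic information can become a "distraction" in complex scenarios, so that existing VLT trackers underperform single-modality methods on multiple benchmarks. The key to the solution is VLTVerse, a fine-grained evaluation framework that introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information to create a flexible, multi-dimensional evaluation space. Using 60 subspaces, the framework systematically evaluates three mainstream SOTA VLT trackers at fine granularity, uncovers their performance bottlenecks in complex scenarios, and offers a new perspective on VLT evaluation. Through decoupled analysis of the experimental results, the paper also examines how different semantic types affect specific challenge factors, providing guidance for improving VLT across the data, evaluation, and algorithm dimensions.

Link: https://arxiv.org/abs/2411.15600
Authors: Xuchen Li,Shiyu Hu,Xiaokun Feng,Dailing Zhang,Meiqi Wu,Jing Zhang,Kaiqi Huang
Keywords: extends traditional single, single object tracking, traditional single object, Vision-language tracking, incorporating textual information
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint, Under Review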

Abstract:Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information, providing semantic guidance to enhance tracking performance under challenging conditions like fast motion and deformations. However, current VLT trackers often underperform compared to single-modality methods on multiple benchmarks, with semantic information sometimes becoming a “distraction.” To address this, we propose VLTVerse, the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, hoping to reveal the role of language in VLT. Our contributions include: (1) VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT; (2) leveraging 60 subspaces formed by combinations of challenge factors and semantic types, we conduct systematic fine-grained evaluations of three mainstream SOTA VLT trackers, uncovering their performance bottlenecks across complex scenarios and offering a novel perspective on VLT evaluation; (3) through decoupled analysis of experimental results, we examine the impact of various semantic types on specific challenge factors in relation to different algorithms, providing essential guidance for enhancing VLT across data, evaluation, and algorithmic dimensions. The VLTVerse, toolkit, and results will be available at this http URL.
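
The figure of 60 subspaces is consistent with pairing each of the 10 challenge labels with each of the 6 semantic types (10 × 6 = 60). The sketch below enumerates such a grid as a Cartesian product. Reading the subspaces this way is an assumption, and the label names are placeholders, since the abstract does not list the actual challenge labels or semantic types.

```python
from itertools import product

# Placeholder names: the real challenge labels and semantic types are not given here.
challenges = [f"challenge_{i}" for i in range(1, 11)]  # 10 sequence-level labels
semantics = [f"semantic_{j}" for j in range(1, 7)]     # 6 semantic types

# One evaluation subspace per (challenge, semantic type) pair.
subspaces = list(product(challenges, semantics))
print(len(subspaces))  # 60
```

Each pair could then index one cell of a results table, one table per tracker.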

[CV-153] An adversarial feature learning based semantic communication method for Human 3D Reconstruction

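[Quick read]: This paper addresses the efficiency of data transmission and processing in human body 3D reconstruction when network bandwidth is limited and low latency is required. The key to the solution is an Adversarial Feature Learning-based Semantic Communication method (AFLSC), which extracts and transmits only the semantic information that is crucial for the 3D reconstruction task, streamlining the data flow and easing bandwidth pressure. Specifically, the sender uses a multitask learning-based feature extraction method to capture the spatial layout, keypoints, posture, and depth information of 2D human images, and a semantic encoding technique based on adversarial feature learning, together with dynamic compression, for efficient encoding and transmission. The receiver uses a multi-level semantic feature decoding method to convert the semantic data back into key image features, and an improved ViT-diffusion model then generates the human body 3D mesh. Experimental results indicate that the method has advantages in data transmission efficiency and reconstruction quality and is suited to bandwidth-limited environments.

Link: https://arxiv.org/abs/2411.15595
Authors: Shaojiang Liu,Jiajun Zou,Zhendan Liu,Meixia Dong,Zhiping Wan
Keywords: processing efficiency continue, human body, continue to rise, Learning-based Semantic Communication, scenarios where network
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: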

Abstract:With the widespread application of human body 3D reconstruction technology across various fields, the demands for data transmission and processing efficiency continue to rise, particularly in scenarios where network bandwidth is limited and low latency is required. This paper introduces an Adversarial Feature Learning-based Semantic Communication method (AFLSC) for human body 3D reconstruction, which focuses on extracting and transmitting semantic information crucial for the 3D reconstruction task, thereby significantly optimizing data flow and alleviating bandwidth pressure. At the sender’s end, we propose a multitask learning-based feature extraction method to capture the spatial layout, keypoints, posture, and depth information from 2D human images, and design a semantic encoding technique based on adversarial feature learning to encode these feature information into semantic data. We also develop a dynamic compression technique to efficiently transmit this semantic data, greatly enhancing transmission efficiency and reducing latency. At the receiver’s end, we design an efficient multi-level semantic feature decoding method to convert semantic data back into key image features. Finally, an improved ViT-diffusion model is employed for 3D reconstruction, producing human body 3D mesh models. Experimental results validate the advantages of our method in terms of data transmission efficiency and reconstruction quality, demonstrating its excellent potential for application in bandwidth-limited environments.

[CV-154] Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

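[Quick read]: This paper addresses the poor performance of existing scene text recognition (STR) methods on artistic and severely distorted characters. Its core idea is to improve recognition by strengthening the model's understanding of character morphologies. The key to the solution is twofold: 1) an Online Generation Strategy that generates background-free samples with diverse character styles, compensating for the simplicity of synthetic data and helping the model focus on, and generalize over, character morphology; 2) a new Character Unidirectional Alignment Loss that corrects a derivation error in the previous character contrastive loss and unifies the representation of the same character across samples, reducing sparsity in the intra-class distribution and ambiguity on challenging samples. With these improvements the model reaches state-of-the-art performance on common benchmarks and Union14M-Benchmark (94.7% and 70.9% average accuracy, respectively).

Link: https://arxiv.org/abs/2411.15585
Authors: Yadong Qu,Yuxin Wang,Bangbang Zhou,Zixiao Wang,Hongtao Xie,Yongdong Zhang
Keywords: Existing scene text, scene text recognition, severely distorted characters, Existing scene, text recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: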

Abstract:Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters. The limitation lies in the insufficient exploration of character morphologies, including the monotonousness of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and generalize the ability to recognize complex samples when trained with only simple synthetic data. To boost the summarizing process, we theoretically demonstrate the derivation error in the previous character contrastive loss, which mistakenly causes the sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experiment results show that our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark). Code will be available at this https URL.

[CV-155] FLD+: Data-efficient Evaluation Metric for Generative Models

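[Quick read]: This paper addresses the shortcomings of existing metrics for generated-image quality, such as Fréchet Inception Distance (FID), in reliability, data efficiency, compute efficiency, and adaptability to new domains. The key to the solution is a new metric based on normalizing flows, Flow-based Likelihood Distance Plus (FLD+). FLD+ evaluates image quality by computing the density (exact log-likelihood) of images and has the following advantages: 1) it behaves strongly monotonically with respect to different types of image degradation (noise, occlusion, diffusion steps, and generative model size); 2) it trains stably and efficiently and needs two orders of magnitude fewer images than FID; 3) it is more compute-efficient because the normalizing flow is applied in a lower-dimensional latent space; 4) it is easy to retrain on new domains such as medical images, without relying on a pre-trained network such as InceptionNetV3.

Link: https://arxiv.org/abs/2411.15584
Authors: Pranav Jeevan,Neeraj Nixon,Amit Sethi
Keywords: Fréchet Inception Distance, Fréchet Inception, Inception Distance, Flow-based Likelihood Distance, assess the quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 13 pages, 10 figures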

Abstract:We introduce a new metric to assess the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than the previous metrics, such as Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allows for the computation of density (exact log-likelihood) of images from any domain. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradations, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flow can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID (which requires more images to reliably compute Fréchet distance between features of large samples of real and generated images). We made FLD+ computationally even more efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics – such as InceptionNetV3 pre-trained on ImageNet.
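
The abstract's central property is that a normalizing flow gives an exact log-likelihood through the change-of-variables formula. The sketch below illustrates only that property, using a one-dimensional affine flow chosen for simplicity. It does not reproduce how FLD+ turns likelihoods into a distance, which the abstract does not specify; the function names are hypothetical.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of the standard normal base distribution."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_loglik(x, shift, scale):
    """Exact log-likelihood of x under the flow z = (x - shift) / scale.

    Change of variables: log p(x) = log N(z; 0, 1) + log |dz/dx|,
    and dz/dx = 1 / scale for this affine map.
    """
    z = (x - shift) / scale
    log_abs_det = -math.log(abs(scale))
    return standard_normal_logpdf(z) + log_abs_det


def mean_loglik(samples, shift, scale):
    """Average exact log-likelihood over a batch of scalar samples."""
    return sum(affine_flow_loglik(x, shift, scale) for x in samples) / len(samples)


clean = [-0.5, 0.0, 0.5]
degraded = [x + 3.0 for x in clean]  # a shift standing in for image degradation
```

With `shift=0.0` and `scale=1.0`, `mean_loglik(clean, ...)` is higher than `mean_loglik(degraded, ...)`, which is the kind of monotonic response to degradation the abstract describes.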

[CV-156] EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting

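[Quick read]: This paper addresses the inadequate modeling of dynamic-object motion by 3D/4D Gaussian Splatting (GS) methods in complex street scenes for autonomous driving. Existing methods typically decompose a street scene into static and dynamic objects and learn the Gaussians either in a supervised manner (e.g., with 3D bounding boxes) or in a self-supervised manner (e.g., without 3D bounding boxes), but they do not effectively model the motion of dynamic objects (for example, pedestrians and vehicles move at clearly different speeds), which leads to suboptimal scene decomposition. The proposed solution, Explicit Motion Decomposition (EMD), adds learnable motion embeddings to the Gaussians to model the motion of dynamic objects and thereby improve scene decomposition. EMD is a plug-and-play approach applicable to various baselines, and the paper proposes tailored training strategies for applying it to both supervised and self-supervised baselines.

Link: https://arxiv.org/abs/2411.15582
Authors: Xiaobao Wei,Qingpo Wuwu,Zhongyu Zhao,Zhuangzhe Wu,Nan Huang,Ming Lu,Ningning MA,Shanghang Zhang
Keywords: developing real-world simulators, Photorealistic reconstruction, street scenes, dynamic objects, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: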

Abstract:Photorealistic reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. While recent methods based on 3D/4D Gaussian Splatting (GS) have demonstrated promising results, they still encounter challenges in complex street scenes due to the unpredictable motion of dynamic objects. Current methods typically decompose street scenes into static and dynamic objects, learning the Gaussians in either a supervised manner (e.g., w/ 3D bounding-box) or a self-supervised manner (e.g., w/o 3D bounding-box). However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. The proposed EMD is a plug-and-play approach applicable to various baseline methods. We also propose tailored training strategies to apply EMD to both supervised and self-supervised baselines. Through comprehensive experimentation, we illustrate the effectiveness of our approach with various established baselines. The code will be released at: this https URL.

[CV-157] TKG-DM: Training-free Chroma Key Content Generation Diffusion Model

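[Quick read]: This paper addresses the difficulty that large-scale text-to-image models such as Stable Diffusion have in generating images with foreground objects on a chroma key background, which limits their ability to separate foreground and background elements. The key to the solution is a Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a background of a specified color. The method is the first to explore manipulating the color aspects of the initial noise for controlled background generation, enabling precise foreground/background separation without fine-tuning. Experiments show that this training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models, and that it extends to other tasks such as consistency models and text-to-video generation.

Link: https://arxiv.org/abs/2411.15580
Authors: Ryugo Morita,Stanislav Frolov,Brian Bernhard Moser,Takahiro Shirakawa,Ko Watanabe,Andreas Dengel,Jinjia Zhou
Keywords: Content Generation Diffusion, Chroma Key Content, textual fidelity, Generation Diffusion Model, Stable Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: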

Abstract:Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.

[CV-158] Reassessing Layer Pruning in LLMs: New Insights and Methods

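[Quick read]: This paper addresses how to perform layer pruning effectively in large language models (LLMs), in particular how to reduce compute requirements for resource-constrained environments. The key finding is a simple yet effective layer pruning strategy: prune the final 25% of the model's layers, then fine-tune lm_head and the last three remaining layers. This strategy significantly reduces computational overhead, and the resulting pruned model outperforms many popular LLMs of similar size, such as ChatGLM2-6B, Vicuna-7B-v1.5, Qwen1.5-7B, and Baichuan2-7B.

Link: https://arxiv.org/abs/2411.15558
Authors: Yao Lu,Hao Cheng,Yujie Fang,Zeyu Wang,Jiaheng Wei,Dongwei Xu,Qi Xuan,Xiaoniu Yang,Zhaowei Zhu
Keywords: posing significant challenges, achieved remarkable success, considerable scale necessitates, scale necessitates substantial, substantial computational resources
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: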

Abstract:Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Approximation) family, widely regarded as a leading method for pruned model fine-tuning, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the lm_head and the remaining last three layers, yields remarkably strong performance. Following this guide, we prune Llama-3.1-8B-It and obtain a model that outperforms many popular LLMs of similar size, such as ChatGLM2-6B, Vicuna-7B-v1.5, Qwen1.5-7B and Baichuan2-7B. We release the optimal model weights on Huggingface, and the code is available on GitHub.
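
The recipe in the abstract (drop the final 25% of layers, then fine-tune `lm_head` and the last three surviving layers) can be written down as index bookkeeping. The sketch below is framework-free and is not the authors' released code. The 32-layer depth, the `layers.{i}` naming, and rounding the pruned count down are assumptions made for illustration.

```python
def plan_layer_pruning(num_layers, prune_fraction=0.25, num_tuned=3):
    """Return (kept, tuned) layer indices for the recipe in the abstract.

    Drops the final `prune_fraction` of layers and marks the last
    `num_tuned` surviving layers as trainable. Rounding the number of
    pruned layers down is an assumption of this sketch.
    """
    num_pruned = int(num_layers * prune_fraction)
    kept = list(range(num_layers - num_pruned))
    tuned = kept[-num_tuned:] if num_tuned > 0 else []
    return kept, tuned


# Example with an assumed 32-layer decoder stack.
kept, tuned = plan_layer_pruning(32)

# Hypothetical parameter-group names; a real model would use its own naming.
trainable = [f"layers.{i}" for i in tuned] + ["lm_head"]
```

In a real framework, every parameter outside `trainable` would be frozen before fine-tuning, so only the output head and the three layers nearest the cut are updated.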

[CV-159] LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces

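[Quick read]: This paper addresses a key problem in unsupervised domain adaptation: how to transfer a model's knowledge across unseen domains while preserving domain-specific features. Existing methods struggle to balance domain-invariant representations against the preservation of domain-specific features, mainly because their alignment approaches project samples with similar semantics close together in the latent space despite drastic domain differences. The proposed solution, LAGUNA (LAnguage Guided UNsupervised Adaptation with structured spaces), shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. LAGUNA defines a domain-agnostic structure from the semantic/geometric relationships between class labels in language space and uses it to guide adaptation, so that the organization of samples in visual space reflects the reference inter-class relationships while domain-specific characteristics are preserved.

Link: https://arxiv.org/abs/2411.15557
Authors: Anxhelo Diko,Antonino Furnari,Luigi Cinque,Giovanni Maria Farinella
Keywords: Unsupervised domain adaptation, Unsupervised domain, domain adaptation remains, remains a critical, critical challenge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: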

Abstract:Unsupervised domain adaptation remains a critical challenge in enabling the knowledge transfer of models across unseen domains. Existing methods struggle to balance the need for domain-invariant representations with preserving domain-specific features, which is often due to alignment approaches that impose the projection of samples with similar semantics close in the latent space despite their drastic domain differences. We introduce LAGUNA (LAnguage Guided UNsupervised Adaptation with structured spaces), a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. LAGUNA defines a domain-agnostic structure upon the semantic/geometric relationships between class labels in language space and guides adaptation, ensuring that the organization of samples in visual space reflects reference inter-class relationships while preserving domain-specific characteristics. Remarkably, LAGUNA surpasses previous works in 18 different adaptation scenarios across four diverse image and video datasets with average accuracy improvements of +3.32% on DomainNet, +5.75% in GeoPlaces, +4.77% on GeoImnet, and +1.94% mean class accuracy improvement on EgoExo4D.

[CV-160] ReWind: Understanding Long Videos with Instructed Learnable Memory

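[Quick read]: This paper addresses the computational inefficiency, memory limitations, and difficulty in maintaining temporally coherent understanding that existing Vision-Language Models (VLMs) face on long videos. The key to the solution is ReWind, a novel memory-based VLM whose core innovations are: 1) a dynamic learnable memory module with a distinctive read-perceive-write cycle that stores and updates instruction-relevant visual information, using learnable queries and cross-attention so that memory requirements scale linearly with the number of tokens; 2) an adaptive frame selection mechanism that uses the memory content to identify instruction-relevant key moments and selects a few high-resolution frames to enrich the memory representations, which are then combined with the memory contents and fed to a Large Language Model (LLM) to generate the final answer. With these innovations ReWind outperforms previous methods on visual question answering (VQA) and temporal grounding tasks.

Link: https://arxiv.org/abs/2411.15556
Authors: Anxhelo Diko,Tinghuai Wang,Wassim Swaileh,Shiyan Sun,Ioannis Patras
Keywords: applications requiring integrated, requiring integrated understanding, integrated understanding textual, crucial for applications, applications requiring
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: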

Abstract:Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of tokens. In the second stage, we propose an adaptive frame selection mechanism guided by the memory content to identify instruction-relevant key moments. It enriches the memory representations with detailed spatial information by selecting a few high-resolution frames, which are then combined with the memory contents and fed into a Large Language Model (LLM) to generate the final answer. We empirically demonstrate ReWind’s superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks. Notably, ReWind achieves a +13% score gain and a +12% accuracy improvement on the MovieChat-1K VQA dataset and an +8% mIoU increase on Charades-STA for temporal grounding.
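
The linear-scaling claim follows from letting a fixed number of memory queries attend over the incoming tokens. The sketch below is plain single-head cross-attention in pure Python, under simplifying assumptions: no learned projections, and the tokens serve as both keys and values. It is not ReWind's read-perceive-write module, whose details the abstract does not give.

```python
import math


def cross_attend(queries, tokens):
    """Single-head cross-attention: each memory query attends over all tokens.

    `queries` holds M vectors and `tokens` holds N vectors of equal dimension.
    The work is M * N dot products, so it grows linearly in N for a fixed M,
    and the output always has M vectors regardless of N.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, t)) for t in tokens]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        output.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return output


# Two memory slots summarizing three input tokens (toy values).
memory = cross_attend(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    tokens=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
```

Because the memory size is fixed by the number of queries, appending more video tokens increases compute proportionally but leaves the stored representation the same size.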

[CV-161] Enhancing the Transferability of Adversarial Attacks on Face Recognition with Diverse Parameters Augmentation

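[Quick read]: This paper addresses the vulnerability of Face Recognition (FR) models to adversarial examples, and in particular aims to improve the transferability of adversarial attacks in order to expose the blind spots of these systems. The key to the solution is a new method called Diverse Parameters Augmentation (DPA), which augments surrogate models with diverse parameter initializations to generate more transferable adversarial examples. DPA has two core stages: Diverse Parameters Optimization (DPO) and Hard Model Aggregation (HMA). In the DPO stage, the surrogate model's parameters are initialized with both pre-trained and random parameters, and intermediate models are saved during training to obtain a diverse set of surrogate models. In the HMA stage, beneficial perturbations are added to the feature maps of the diversified surrogate models to further improve transferability. Experimental results indicate that the method effectively improves the transferability of the crafted adversarial face examples.

Link: https://arxiv.org/abs/2411.15555
Authors: Fengfan Zhou,Bangjie Yin,Hefei Ling,Qianyu Zhou,Wenxuan Wang
Keywords: subtly manipulate benign, benign face images, manipulate benign face, surrogate models, Diverse Parameters Augmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: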

Abstract:Face Recognition (FR) models are vulnerable to adversarial examples that subtly manipulate benign face images, underscoring the urgent need to improve the transferability of adversarial attacks in order to expose the blind spots of these systems. Existing adversarial attack methods often overlook the potential benefits of augmenting the surrogate model with diverse initializations, which limits the transferability of the generated adversarial examples. To address this gap, we propose a novel method called Diverse Parameters Augmentation (DPA) attack method, which enhances surrogate models by incorporating diverse parameter initializations, resulting in a broader and more diverse set of surrogate models. Specifically, DPA consists of two key stages: Diverse Parameters Optimization (DPO) and Hard Model Aggregation (HMA). In the DPO stage, we initialize the parameters of the surrogate model using both pre-trained and random parameters. Subsequently, we save the models in the intermediate training process to obtain a diverse set of surrogate models. During the HMA stage, we enhance the feature maps of the diversified surrogate models by incorporating beneficial perturbations, thereby further improving the transferability. Experimental results demonstrate that our proposed attack method can effectively enhance the transferability of the crafted adversarial face examples.

[CV-162] Improving Transferable Targeted Attacks with Feature Tuning Mixup

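[Quick read]: This paper addresses the transferability of adversarial attacks on deep neural networks, specifically of targeted attacks. The key to the solution is a new method called Feature Tuning Mixup (FTM), which improves targeted attack transferability by combining random and optimized noise in the feature space. FTM introduces learnable feature perturbations and optimizes them with an efficient stochastic update strategy, producing more robust and more transferable adversarial examples. Attack performance is further improved by ensembling multiple FTM-perturbed surrogate models. Experimental results show that the method significantly outperforms state-of-the-art methods while keeping computational cost low.

Link: https://arxiv.org/abs/2411.15553
Authors: Kaisheng Liang,Xuelong Dai,Yanjie Li,Dong Wang,Bin Xiao
Keywords: Deep neural networks, neural networks exhibit, networks exhibit vulnerability, Deep neural, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: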

Abstract:Deep neural networks exhibit vulnerability to adversarial examples that can transfer across different models. A particularly challenging problem is developing transferable targeted attacks that can mislead models into predicting specific target classes. While various methods have been proposed to enhance attack transferability, they often incur substantial computational costs while yielding limited improvements. Recent clean feature mixup methods use random clean features to perturb the feature space but lack optimization for disrupting adversarial examples, overlooking the advantages of attack-specific perturbations. In this paper, we propose Feature Tuning Mixup (FTM), a novel method that enhances targeted attack transferability by combining both random and optimized noises in the feature space. FTM introduces learnable feature perturbations and employs an efficient stochastic update strategy for optimization. These learnable perturbations facilitate the generation of more robust adversarial examples with improved transferability. We further demonstrate that attack performance can be enhanced through an ensemble of multiple FTM-perturbed surrogate models. Extensive experiments on the ImageNet-compatible dataset across various models demonstrate that our method achieves significant improvements over state-of-the-art methods while maintaining low computational cost.

[CV-163] NeRF Inpainting with Geometric Diffusion Prior and Balanced Score Distillation

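[Quick read]: This paper addresses the suboptimal results of existing NeRF (Neural Radiance Fields) inpainting methods that rely on pretrained diffusion models, which stem from two issues: pretrained diffusion models capture geometric information inadequately, and existing Score Distillation Sampling (SDS) methods provide suboptimal guidance. The key to the solution is GB-NeRF, a new framework that improves NeRF inpainting through better use of 2D diffusion priors. Its innovations are a fine-tuning strategy that learns appearance and geometric priors simultaneously, and a specialized normal distillation loss that integrates these geometric priors into NeRF inpainting. The paper also proposes Balanced Score Distillation (BSD), which yields better inpainting quality in both appearance and geometry than existing SDS and Conditional Score Distillation (CSD) methods.

Link: https://arxiv.org/abs/2411.15551
Authors: Menglin Zhang,Xin Luo,Yunwei Lan,Chang Liu,Rui Li,Kaidong Zhang,Ganlin Yang,Dong Liu
Keywords: Recent advances, Score Distillation, Score Distillation Sampling, pretrained diffusion models, leveraged pretrained diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: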

Abstract:Recent advances in NeRF inpainting have leveraged pretrained diffusion models to enhance performance. However, these methods often yield suboptimal results due to their ineffective utilization of 2D diffusion priors. The limitations manifest in two critical aspects: the inadequate capture of geometric information by pretrained diffusion models and the suboptimal guidance provided by existing Score Distillation Sampling (SDS) methods. To address these problems, we introduce GB-NeRF, a novel framework that enhances NeRF inpainting through improved utilization of 2D diffusion priors. Our approach incorporates two key innovations: a fine-tuning strategy that simultaneously learns appearance and geometric priors and a specialized normal distillation loss that integrates these geometric priors into NeRF inpainting. We propose a technique called Balanced Score Distillation (BSD) that surpasses existing methods such as Score Distillation (SDS) and the improved version, Conditional Score Distillation (CSD). BSD offers improved inpainting quality in appearance and geometric aspects. Extensive experiments show that our method provides superior appearance fidelity and geometric consistency compared to existing approaches.

[CV-164] Hierarchical Cross-Attention Network for Virtual Try-On

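[Quick read]: This paper addresses the challenges of the virtual try-on task and proposes the Hierarchical Cross-Attention Network (HCANet). The solution has two main stages, geometric matching and try-on, and its key feature is a novel Hierarchical Cross-Attention (HCA) block used in both stages. The HCA block captures long-range correlations between the person and clothing modalities and enhances the depth and robustness of the network; its hierarchical design gives a nuanced representation of the interaction between person and clothing, producing highly realistic try-on results. Experiments show that HCANet performs well on both quantitative metrics and assessments of visual realism.

Link: https://arxiv.org/abs/2411.15542
Authors: Hao Tang,Bin Ren,Pingping Wu,Nicu Sebe
Keywords: Hierarchical Cross-Attention Network, virtual try-on task, virtual try-on, present an innovative, Hierarchical Cross-Attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: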

Abstract:In this paper, we present an innovative solution for the challenges of the virtual try-on task: our novel Hierarchical Cross-Attention Network (HCANet). HCANet is crafted with two primary stages: geometric matching and try-on, each playing a crucial role in delivering realistic virtual try-on outcomes. A key feature of HCANet is the incorporation of a novel Hierarchical Cross-Attention (HCA) block into both stages, enabling the effective capture of long-range correlations between individual and clothing modalities. The HCA block enhances the depth and robustness of the network. By adopting a hierarchical approach, it facilitates a nuanced representation of the interaction between the person and clothing, capturing intricate details essential for an authentic virtual try-on experience. Our experiments establish the prowess of HCANet. The results showcase its performance across both quantitative metrics and subjective evaluations of visual realism. HCANet stands out as a state-of-the-art solution, demonstrating its capability to generate virtual try-on results that excel in accuracy and realism. This marks a significant step in advancing virtual try-on technologies.

[CV-165] Optical-Flow Guided Prompt Optimization for Coherent Video Generation

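[Quick read]: This paper addresses the temporal consistency problem that text-to-video diffusion models face during generation. The key to the solution is a new framework called MotionPrompt, which guides the video generation process with optical flow. Specifically, the paper trains a discriminator to distinguish the optical flow between random pairs of frames in real videos from that in generated videos, and during the reverse sampling steps it optimizes learnable token embeddings using gradients from the trained discriminator applied to random frame pairs. This produces visually coherent video sequences that reflect natural motion dynamics without compromising the fidelity of the generated content.

Link: https://arxiv.org/abs/2411.15540
Authors: Hyelin Nam,Jaemin Kim,Dohun Lee,Jong Chul Ye
Keywords: made significant strides, significant strides, temporal consistency, made significant, face challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: project page: this https URL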

Abstract:While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.

[CV-166] Large Language Model with Region-guided Referring and Grounding for CT Report Generation

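[Quick read]: This paper addresses the problem that existing CT report generation methods consider only global features and overlook region-specific details, so they may miss abnormalities. The key to the solution is Reg2RG, a region-guided referring and grounding framework that improves diagnostic performance by focusing on anatomical regions within the CT volume. Specifically, Reg2RG uses masks from a universal segmentation module to capture local features for each referring region, and a local feature decoupling (LFD) strategy to preserve local high-resolution details. The local features are then integrated with global features to capture inter-regional relationships. The paper also proposes a region-report alignment (RRA) training strategy that uses the recognition of referring regions to guide the generation of region-specific reports, strengthening the model's referring and grounding capabilities and improving report interpretability. Finally, a large language model (LLM) serves as the language decoder and generates reports from the integrated visual features, supporting region-level comprehension.

Link: https://arxiv.org/abs/2411.15539
Authors: Zhixuan Chen,Yequan Bie,Haibo Jin,Hao Chen
Keywords: Computed tomography, time-consuming and labor-intensive, crucial to assist, assist radiologists, radiologists in interpreting
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages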

Abstract:Computed tomography (CT) report generation is crucial to assist radiologists in interpreting CT volumes, which can be time-consuming and labor-intensive. Existing methods primarily only consider the global features of the entire volume, making it struggle to focus on specific regions and potentially missing abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve the local high-resolution details with little computational overhead. Then the local features are integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model’s referring and grounding capabilities while also improving the report’s interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code will be made publicly available.

[CV-167] MUNBa: Machine Unlearning via Nash Bargaining

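[Quick read]: This paper addresses the conflict in Machine Unlearning (MU) between forgetting specific concepts/data and preserving the model's overall utility. The key to the solution is to reformulate MU as a two-player cooperative game in which the forgetting player and the preservation player contribute gradient proposals to maximize their overall gain. Drawing on Nash bargaining theory, the paper derives a closed-form solution that guides the model toward the Pareto front and avoids gradient conflicts. The formulation guarantees an equilibrium solution: any deviation from the final state would reduce the overall objectives of both players, which ensures optimality in each objective.

Link: https://arxiv.org/abs/2411.15537
Authors: Jing Wu,Mehrtash Harandi
Keywords: Machine Unlearning, selectively erase harmful, erase harmful behaviors, aims to selectively, selectively erase
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: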

Abstract:Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto front, effectively avoiding the gradient conflicts. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm’s effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving superior performance on several benchmarks. For example, in the challenging scenario of sample-wise forgetting, our algorithm approaches the gold standard retrain baseline. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.
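
To make the gradient-conflict idea concrete, the sketch below combines a forgetting gradient and a preservation gradient into one update direction that has a positive inner product with both. It assumes the bargaining condition used in Nash-MTL-style multi-task learning, (GᵀG)a = 1/a, which for two players reduces to a_i = 1 / (‖g_i‖ · sqrt(1 + cos θ)). The abstract does not state the paper's own closed form, so this is a stand-in and the paper's solution may differ.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nash_bargaining_direction(g_forget, g_preserve):
    """Two-player bargaining direction d = a1 * g_forget + a2 * g_preserve.

    Assumes the condition (G^T G) a = 1 / a, which for two players gives
    a_i = 1 / (||g_i|| * sqrt(1 + cos)), where cos is the cosine of the
    angle between the gradients. Undefined for exactly opposed gradients.
    """
    n1 = math.sqrt(dot(g_forget, g_forget))
    n2 = math.sqrt(dot(g_preserve, g_preserve))
    cos = dot(g_forget, g_preserve) / (n1 * n2)
    if cos <= -1.0 + 1e-12:
        raise ValueError("gradients are exactly opposed; no bargaining solution")
    c = 1.0 / math.sqrt(1.0 + cos)
    a1, a2 = c / n1, c / n2
    return [a1 * x + a2 * y for x, y in zip(g_forget, g_preserve)]


# Toy conflicting gradients: their inner product is negative.
g_forget = [1.0, 0.0]
g_preserve = [-0.5, 1.0]
direction = nash_bargaining_direction(g_forget, g_preserve)
```

A plain sum of the two gradients can favor whichever has the larger norm. Under the assumed condition each player's gain, `dot(g_i, direction)`, equals `1 / a_i`, which is positive, so neither objective is pushed backward by the update.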

[CV-168] CellPilot

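[Quick Read]: This paper addresses the trade-off between automation and interactivity in cell and gland segmentation for digital pathology. The key is CellPilot, a framework that combines the strengths of automatic segmentation and interactive refinement. CellPilot produces an initial automatic segmentation and then lets users make guided interactive corrections in a graphical user interface (GUI), improving segmentation accuracy and efficiency. The model was trained on more than 675,000 masks from nine diverse cell and gland segmentation datasets spanning 16 organs, and it outperforms other interactive tools on three held-out histopathology datasets. The open-source release of CellPilot is intended to support the development of more robust and generalizable diagnostic models.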

Link: https://arxiv.org/abs/2411.15514
Authors: Philipp Endres, Valentin Koch, Julia A. Schnabel, Carsten Marr
Keywords: enabling improved visualization, increasingly digitized, streamlined workflows, microscopic study, study of diseased
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Histopathology, the microscopic study of diseased tissue, is increasingly digitized, enabling improved visualization and streamlined workflows. An important task in histopathology is the segmentation of cells and glands, essential for determining shape and frequencies that can serve as indicators of disease. Deep learning tools are widely used in histopathology. However, variability in tissue appearance and cell morphology presents challenges for achieving reliable segmentation, often requiring manual correction to improve accuracy. This work introduces CellPilot, a framework that bridges the gap between automatic and interactive segmentation by providing initial automatic segmentation as well as guided interactive refinement. Our model was trained on over 675,000 masks of nine diverse cell and gland segmentation datasets, spanning 16 organs. CellPilot demonstrates superior performance compared to other interactive tools on three held-out histopathological datasets while enabling automatic segmentation. We make the model and a graphical user interface designed to assist practitioners in creating large-scale annotated datasets available as open-source, fostering the development of more robust and generalized diagnostic models.

[CV-169] Interactive Visual Assessment for Text-to-Image Generation Models

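[Quick Read]: This paper addresses the challenges that current evaluation methods for visual generation models face in real-world deployment, in particular fixed coverage, evolving difficulty, and data leakage risks. The key is DyEval, a dynamic, interactive visual assessment framework powered by large language models (LLMs). Through an intuitive visual interface, DyEval lets users evaluate generative models collaboratively and, based on model feedback, adaptively generates hierarchical, fine-grained, and diverse text inputs that continuously probe the boundaries of model capability. DyEval also includes a contextual reflection module that mines the failure triggers in test inputs and uses the logical reasoning ability of LLMs to surface the model's potential failure patterns, supporting in-depth analysis. Experiments show that DyEval helps users identify up to 2.56 times as many generation failures as conventional methods and uncovers complex and rare failure patterns, such as problems with pronoun generation and culture-specific context generation.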

Link: https://arxiv.org/abs/2411.15509
Authors: Xiaoyue Mi, Fan Tang, Juan Cao, Qiang Sheng, Ziyao Huang, Peng Li, Yang Liu, Tong-Yee Lee
Keywords: achieved remarkable progress, computer graphics applications, face significant challenges, real-world deployment, achieved remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These approaches suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis for users to further improve tested models, we develop a contextual reflection module that mines failure triggers of test inputs and, using the logical reasoning ability of LLMs, reflects the model's potential failure patterns to support in-depth analysis. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify up to 2.56 times as many generation failures as conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.

[CV-170] AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

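[Quick Read]: This paper addresses the scarcity of labeled data in Remote Sensing Image Object Detection (RSIOD). The key is a layout-controllable diffusion generative model, AeroGen, which supports conditional generation from both horizontal and rotated bounding boxes and can therefore produce high-quality synthetic images that match a specified layout and set of object categories. The paper also proposes an end-to-end data augmentation framework that combines a diversity-conditioned generator with a filtering mechanism to improve both the diversity and the quality of the generated data. Experiments show that the synthetic data are of high quality and diversity and significantly improve the detection performance of existing RSIOD models.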

Link: https://arxiv.org/abs/2411.15497
Authors: Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Dongsheng Jiang, Yin Li, Deyu Meng
Keywords: Remote sensing image, Remote sensing, aims to identify, aerial imagery, identify and locate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e. AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7%, 4.3%, and 2.43%, respectively. The code is available at this https URL.
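
As a small illustration of the two layout conditions mentioned above, the sketch below converts a box given as (center, width, height, angle) into its four corner points; a horizontal box is simply the angle-zero case. The parameterization and the function name are assumptions for illustration, not AeroGen's actual input format.

```python
import numpy as np

def rotated_box_corners(cx, cy, w, h, angle_rad=0.0):
    """Return the 4 corners (shape (4, 2)) of a box rotated about its center.

    angle_rad = 0 gives an ordinary horizontal (axis-aligned) box.
    """
    # Corner offsets relative to the center, before rotation.
    half = np.array([[-w / 2, -h / 2],
                     [ w / 2, -h / 2],
                     [ w / 2,  h / 2],
                     [-w / 2,  h / 2]])
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s],
                    [s,  c]])  # counter-clockwise rotation matrix
    return half @ rot.T + np.array([cx, cy])

horizontal = rotated_box_corners(10, 20, 4, 2)
rotated = rotated_box_corners(10, 20, 4, 2, np.pi / 2)
print(horizontal)
print(rotated)
```

Rotating the 4x2 box by 90 degrees swaps its horizontal and vertical extents, which is why a rotated box can describe elongated objects such as ships more tightly than a horizontal one.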

[CV-171] Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation

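[Quick Read]: This paper addresses automated generation of diagnostic reports for Acute Ischemic Stroke (AIS), and in particular the difficulty of mapping across modalities between Diffusion Weighted Imaging (DWI) images and radiology reports. The key is Paired Image-domain Retrieval and Text-domain Augmentation (PIRTA), a cross-modal Retrieval-Augmented Generation (RAG) framework. PIRTA recasts the cross-modal mapping problem as retrieving similar images from a database of paired DWI images and radiology reports, which avoids having to learn the cross-modal mapping directly. The retrieved reports are then used to augment report generation for the query image. Experiments show that PIRTA accurately retrieves relevant reports from 3D DWI images and yields significantly more accurate reports than direct image-to-text generation with state-of-the-art multimodal language models.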

Link: https://arxiv.org/abs/2411.15490
Authors: Junhyeok Lee, Yujin Oh, Dahyoun Lee, Hyon Keun Joh, Chul-Ho Sohn, Sung Hyun Baik, Cheol Kyu Jung, Jung Hyun Park, Kyu Sung Choi, Byung-Hoon Kim, Jong Chul Ye
Keywords: Acute ischemic stroke, requires time-critical management, delayed intervention leading, Acute ischemic, AIS radiology reports
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretative AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. By exploiting the retrieved radiology reports to augment the report generation process of the query image, we show by experiments with extensive in-house and public datasets that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.
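
The in-domain retrieval step can be sketched roughly as follows, assuming each image has already been encoded into a fixed-length embedding. The cosine-similarity ranking, the function name, and the placeholder report strings are all assumptions for illustration; the abstract does not specify PIRTA's encoder or similarity measure.

```python
import numpy as np

def retrieve_reports(query_emb, db_embs, db_reports, k=1):
    """Return the k reports whose paired image embeddings are closest to the query.

    Similarity is cosine similarity between L2-normalized embeddings.
    """
    q = np.asarray(query_emb, dtype=float)
    db = np.asarray(db_embs, dtype=float)
    q = q / np.linalg.norm(q)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = db @ q                      # one similarity per database image
    top = np.argsort(-sims)[:k]        # indices of the k most similar images
    return [(db_reports[i], float(sims[i])) for i in top]

# Toy database: three image embeddings with their paired (placeholder) reports.
db_embs = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.1, 0.0]])
db_reports = ["report for case 0", "report for case 1", "report for case 2"]

results = retrieve_reports([1.0, 0.0, 0.0], db_embs, db_reports, k=2)
print(results)
```

The retrieved report texts would then be passed to a language model as context when writing the report for the query image, so the image-to-text mapping itself never has to be learned.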

[CV-172] SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving

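[Quick Read]: This paper addresses the fact that existing Dynamic Gaussian Splatting methods for complex dynamic urban scenes depend on expensive manual labeling for object-level supervision, which limits their scalability in real-world applications. The key is SplatFlow, a self-supervised Dynamic Gaussian Splatting method built within Neural Motion Flow Fields (NMFF). SplatFlow designs a unified framework that integrates a time-dependent 4D Gaussian representation into NMFF, where NMFF is a set of implicit functions that model the continuous motion flow fields of both LiDAR points and Gaussians. This lets the method decompose the scene into static background and dynamic objects, represented with 3D and 4D Gaussian primitives respectively, and model the state correspondences of each 4D Gaussian across time, aggregating temporal features to improve the cross-view consistency of dynamic components. SplatFlow further improves dynamic scene identification by distilling features from 2D foundation models into the 4D space-time representation. Experiments show state-of-the-art image reconstruction and novel view synthesis on the Waymo Open Dataset and the KITTI Dataset.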

Link: https://arxiv.org/abs/2411.15482
Authors: Su Sun, Cheng Zhao, Zhuoyang Sun, Yingjie Victor Chen, Mei Chen
Keywords: Gaussian Splatting methods, Dynamic Gaussian Splatting, expensive manual labeling, Motion Flow Fields, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Most existing Dynamic Gaussian Splatting methods for complex dynamic urban scenarios rely on accurate object-level supervision from expensive manual labeling, limiting their scalability in real-world applications. In this paper, we introduce SplatFlow, a Self-Supervised Dynamic Gaussian Splatting within Neural Motion Flow Fields (NMFF) to learn 4D space-time representations without requiring tracked 3D bounding boxes, enabling accurate dynamic scene reconstruction and novel view RGB, depth and flow synthesis. SplatFlow designs a unified framework to seamlessly integrate time-dependent 4D Gaussian representation within NMFF, where NMFF is a set of implicit functions to model temporal motions of both LiDAR points and Gaussians as continuous motion flow fields. Leveraging NMFF, SplatFlow effectively decomposes static background and dynamic objects, representing them with 3D and 4D Gaussian primitives, respectively. NMFF also models the status correspondences of each 4D Gaussian across time, which aggregates temporal features to enhance cross-view consistency of dynamic components. SplatFlow further improves dynamic scene identification by distilling features from 2D foundational models into 4D space-time representation. Comprehensive evaluations conducted on the Waymo Open Dataset and KITTI Dataset validate SplatFlow’s state-of-the-art (SOTA) performance for both image reconstruction and novel view synthesis in dynamic urban scenarios.

[CV-173] KinMo: Kinematic-aware Human Motion Understanding and Generation

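[Quick Read]: This paper addresses the difficulty of capturing and manipulating subtle movements of local body parts when human motion is controlled by text. The key is a new motion representation that, from a kinematic perspective, decomposes motion into the movements of distinct body joint groups and their interactions. The paper designs an automatic dataset collection pipeline that enriches the existing text-motion benchmark with fine-grained descriptions of local joint-group motion and interaction. It also introduces a hierarchical motion semantics approach that progressively fuses joint-level interaction information into global action-level semantics for modality alignment. On top of this hierarchy, the paper proposes a coarse-to-fine motion synthesis procedure suited to a range of downstream generation and editing applications. Experiments show that the method improves joint-spatial understanding in text-motion retrieval and enables more precise joint-motion generation and control.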

Link: https://arxiv.org/abs/2411.15472
Authors: Pengfei Zhang, Pinxin Liu, Hyeongwoo Kim, Pablo Garrido, Bindita Chaudhuri
Keywords: Controlling human motion, Controlling human, human motion based, computer vision, presents an important
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Controlling human motion based on text presents an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control. Project Page: this https URL

[CV-174] Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning

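[Quick Read]: This paper addresses catastrophic forgetting in Continual Learning (CL), that is, how a model can learn new tasks without forgetting what it learned on earlier ones. The key is the Mamba-CL framework, which fine-tunes the core State Space Models (SSMs) of a large-scale Mamba foundation model by updating parameters in directions orthogonal to the feature subspace of previous tasks, so that each SSM module produces consistent outputs on both current and previous tasks. Concretely, the paper derives overall consistency constraints on four key time-invariant parameters of the Mamba model and uses null-space projection to implement the orthogonality efficiently, which overcomes catastrophic forgetting both in theory and in practice.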

Link: https://arxiv.org/abs/2411.15469
Authors: De Cheng, Yue Lu, Lingfeng He, Shizhou Zhang, Xi Yang, Nannan Wang, Xinbo Gao
Keywords: Continual Learning, previously learned knowledge, State Space Models, Mamba model, forgetting previously learned
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Continual Learning (CL) aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. Recently, State Space Models (SSMs), particularly the Mamba model, have achieved notable success in computer vision. Building on the strengths of SSMs, this study explores leveraging the Mamba model for CL. Therefore, we introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model by updating parameters orthogonal to the feature subspace of previous tasks. This approach theoretically guarantees the consistency objective of preserving consistent output for each SSM module across both previous and current tasks, so as to overcome the catastrophic forgetting issue. Specifically, we achieve this goal by deducing the overall consistency constraints on four key time-invariant parameters in the Mamba model, streamlining its recurrent state-space structure and non-linear discretization process in SSM. In practice, we apply the null-space projection to efficiently implement the orthogonality within the Mamba model. Extensive experiments on four class-incremental benchmarks demonstrate the effectiveness of Mamba-CL for anti-forgetting, achieving superior performances to state-of-the-art methods. Code is available in the supplementary materials.
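
The null-space projection idea can be sketched on a plain linear layer, which is a simplification: Mamba-CL derives its constraints for SSM parameters, and that derivation is not reproduced here. In this sketch, updates to a weight matrix are projected onto the null space of the previous task's input features, so the layer's outputs on those features do not change. Function and variable names are hypothetical.

```python
import numpy as np

def null_space_projector(features, rel_tol=1e-10):
    """Build the projector onto the null space of a feature matrix.

    features: array of shape (n_samples, d) holding previous-task inputs.
    Returns P of shape (d, d) with features @ P == 0 (up to round-off).
    """
    _, s, vt = np.linalg.svd(features, full_matrices=True)
    # Numerical rank: singular values above a relative threshold.
    rank = int(np.sum(s > rel_tol * s.max())) if s.size else 0
    null_basis = vt[rank:].T           # columns span the null space, (d, d - rank)
    return null_basis @ null_basis.T

rng = np.random.default_rng(0)
X_prev = rng.normal(size=(3, 8))       # 3 previous-task feature vectors, d = 8
delta_W = rng.normal(size=(8, 4))      # raw update for a (d, out) weight matrix

P = null_space_projector(X_prev)
delta_W_proj = P @ delta_W             # update restricted to the null space

# Outputs on previous-task features are unchanged by the projected update.
print(np.abs(X_prev @ delta_W_proj).max())
```

With only 3 stored feature vectors in 8 dimensions, 5 directions remain free for learning the new task; in practice the feature matrix is usually approximated (for example by truncating small singular values) so that the null space does not collapse.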

[CV-175] SplatSDF: Boosting Neural Implicit SDF via Gaussian Splatting Fusion

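[Quick Read]: This paper addresses the accuracy and convergence speed of scene-level Signed Distance Function (SDF) reconstruction. The key is a new neural implicit SDF called "SplatSDF", which fuses 3D Gaussian Splatting (3DGS) and SDF-NeRF at the architecture level and significantly improves geometric and photometric accuracy as well as convergence speed. SplatSDF relies on 3DGS as input only during training and keeps the same complexity and efficiency as the original SDF-NeRF during inference.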

Link: https://arxiv.org/abs/2411.15468
Authors: Runfa Blark Li, Keito Suzuki, Bang Du, Ki Myung Brian Le, Nikolay Atanasov, Truong Nguyen
Keywords: signed distance function, SDF, collision checking, neural implicit SDF, distance function
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments:

Abstract:A signed distance function (SDF) is a useful representation for continuous-space geometry and many related operations, including rendering, collision checking, and mesh generation. Hence, reconstructing SDF from image observations accurately and efficiently is a fundamental problem. Recently, neural implicit SDF (SDF-NeRF) techniques, trained using volumetric rendering, have gained a lot of attention. Compared to earlier truncated SDF (TSDF) fusion algorithms that rely on depth maps and voxelize continuous space, SDF-NeRF enables continuous-space SDF reconstruction with better geometric and photometric accuracy. However, the accuracy and convergence speed of scene-level SDF reconstruction require further improvements for many applications. With the advent of 3D Gaussian Splatting (3DGS) as an explicit representation with excellent rendering quality and speed, several works have focused on improving SDF-NeRF by introducing consistency losses on depth and surface normals between 3DGS and SDF-NeRF. However, loss-level connections alone lead to incremental improvements. We propose a novel neural implicit SDF called “SplatSDF” to fuse 3DGS and SDF-NeRF at an architecture level with significant boosts to geometric and photometric accuracy and convergence speed. Our SplatSDF relies on 3DGS as input only during training, and keeps the same complexity and efficiency as the original SDF-NeRF during inference. Our method outperforms state-of-the-art SDF-NeRF models on geometric and photometric evaluation by the time of submission.
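
For readers unfamiliar with the representation, here is a minimal analytic example of a signed distance function, using a sphere and the common convention that values are negative inside the surface. It is unrelated to the SplatSDF architecture itself; a neural implicit SDF replaces the closed-form expression with a learned network queried at the same kind of 3D points.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance from each 3D point to a sphere's surface.

    Negative inside, zero on the surface, positive outside.
    """
    points = np.asarray(points, dtype=float)
    center = np.asarray(center, dtype=float)
    return np.linalg.norm(points - center, axis=-1) - radius

queries = [[0.0, 0.0, 0.0],   # sphere center (inside)
           [1.0, 0.0, 0.0],   # on the surface
           [2.0, 0.0, 0.0]]   # outside
distances = sphere_sdf(queries, center=[0.0, 0.0, 0.0], radius=1.0)
print(distances)
```

The sign makes collision checking a single comparison (a point collides when its distance is at most zero), and the zero level set is the surface that mesh extraction recovers.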

[CV-176] Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

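[Quick Read]: This paper addresses subject-driven image generation in the zero-shot setting, where traditional methods need time- and resource-intensive fine-tuning while existing zero-shot methods achieve poor subject alignment. The key is a new zero-shot approach called "Diptych Prompting", which reinterprets the task as inpainting and exploits the emergent diptych generation ability of large-scale text-to-image models to achieve precise subject alignment. Specifically, Diptych Prompting places the reference image in the left panel and performs text-conditioned inpainting on the right panel; removing the background of the reference image prevents content leakage, and enhancing attention weights between the panels improves fine-grained detail in the generated subject. Experiments show that the method is visually preferred over existing zero-shot image prompting methods and supports several applications, including subject-driven generation, stylized image generation, and subject-driven image editing.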

Link: https://arxiv.org/abs/2411.15466
Authors: Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon
Keywords: text prompt, aims to produce, desired context, context by accurately, accurately capturing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets the task as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: this https URL
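
The "incomplete diptych" input can be sketched as a canvas plus an inpainting mask: the reference image fills the left panel, the right panel is left blank, and the mask marks only the right panel for generation. This is an illustrative data-preparation step under assumed array conventions (height x width x channels); the function name is hypothetical, and background removal and the attention re-weighting described in the abstract are not shown.

```python
import numpy as np

def make_incomplete_diptych(reference):
    """Build a two-panel canvas and an inpainting mask from a reference image.

    reference: array of shape (H, W, C).
    Returns (canvas, mask): canvas is (H, 2W, C) with the reference on the left
    and zeros on the right; mask is (H, 2W) and True where inpainting happens.
    """
    reference = np.asarray(reference)
    h, w, c = reference.shape
    canvas = np.zeros((h, 2 * w, c), dtype=reference.dtype)
    canvas[:, :w] = reference          # left panel: the subject reference
    mask = np.zeros((h, 2 * w), dtype=bool)
    mask[:, w:] = True                 # right panel: region to be generated
    return canvas, mask

ref = np.full((4, 3, 3), 7, dtype=np.uint8)   # tiny stand-in for a real image
canvas, mask = make_incomplete_diptych(ref)
print(canvas.shape, mask.sum())
```

The canvas and mask would then go to a text-conditioned inpainting model together with a prompt describing a diptych of the same subject, so the model fills the right panel while the left panel stays fixed.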

[CV-177] MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking

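[Quick Read]: This paper addresses the shortcomings of existing Transformer-based methods for vision-language tracking in exploiting temporal information and dynamically updating reference features. The key is MambaVLT, a tracker built on the Mamba State Space Model (SSM). Its time-evolving hybrid state space block and selective locality enhancement block capture multimodal contextual information and adaptively update reference features. In addition, a modality-selection module dynamically adjusts the weighting between visual and language references to reduce the ambiguity that either reference type may introduce.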

Link: https://arxiv.org/abs/2411.15459
Authors: Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, Zhenyu He
Keywords: tracking task aims, object tracking based, Existing Transformer-based vision-language, State Space, Transformer-based vision-language tracking
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block, to capture contextual information for multimodal modeling and adaptive reference feature update. Besides, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.

[CV-178] Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

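[Quick Read]: This paper addresses the significant gap in instruction-following capability between Multimodal Large Language Models (MLLMs) and text-only Large Language Models (LLMs). The key is a pair of strategies: Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI). VMTC reduces redundancy in the visual modality by retaining primary tokens and condensing redundant ones. CMAI aggregates text-to-image attention into a text-to-image focus score and inhibits attention on text-image token pairs with low scores. Together they strengthen the instruction-following capability of MLLMs while preserving their multimodal understanding and processing ability.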

Link: https://arxiv.org/abs/2411.15453
Authors: Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei
Keywords: Large Language Models, Language Models, Large Language, Multimodal Large Language, strong instruction-following capability
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. However, Multimodal Large Language Models (MLLMs) show a significant gap in instruction-following capability relative to LLMs. In this study, we conduct a pilot experiment, which demonstrates that spatially down-sampling visual tokens significantly enhances the instruction-following capability of MLLMs. This is attributed to the substantial redundancy in visual modality. However, this intuitive method severely impairs the MLLM’s multimodal understanding capability. In this paper, we propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap between MLLMs and LLMs by inhibiting the influence of irrelevant visual tokens during content generation, increasing the instruction-following ability of the MLLMs while retaining their multimodal understanding capacity. In the VMTC module, the primary tokens are retained and the redundant tokens are condensed by token clustering and merging. In the CMAI process, we aggregate text-to-image attentions by text-to-text attentions to obtain a text-to-image focus score. Attention inhibition is performed on the text-image token pairs with low scores. Our comprehensive experiments on instruction-following capabilities and on five benchmarks (VQA-V2, GQA, TextVQA, MME, and MMBench) demonstrate that the proposed strategy significantly enhances the instruction-following capability of MLLMs while preserving the ability to understand and process multimodal inputs.
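
The pilot experiment's spatial down-sampling of visual tokens can be sketched as average pooling over the token grid. The square grid, the raster (row-major) token order, and the use of mean pooling are assumptions for illustration; the abstract does not state how the down-sampling was implemented, and this is not the VMTC clustering-and-merging step.

```python
import numpy as np

def downsample_visual_tokens(tokens, grid_size, factor):
    """Average-pool a square grid of visual tokens by `factor` along each axis.

    tokens: array of shape (grid_size * grid_size, D), in row-major grid order.
    Returns an array of shape ((grid_size // factor) ** 2, D).
    """
    tokens = np.asarray(tokens, dtype=float)
    n, d = tokens.shape
    if n != grid_size * grid_size or grid_size % factor != 0:
        raise ValueError("token count must be grid_size**2 and divisible by factor")
    g = grid_size // factor
    # Split rows and columns into (block index, offset within block) pairs.
    grid = tokens.reshape(g, factor, g, factor, d)
    return grid.mean(axis=(1, 3)).reshape(g * g, d)

tokens = np.arange(16, dtype=float).reshape(16, 1)   # 4x4 grid, 1-dim features
pooled = downsample_visual_tokens(tokens, grid_size=4, factor=2)
print(pooled.ravel())
```

Pooling by a factor of 2 cuts the number of visual tokens by 4, which is the kind of blunt reduction the abstract reports as helping instruction following while hurting multimodal understanding.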

[CV-179] Gotta Hear Them All: Sound Source Aware Vision to Audio Generation

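[Quick Read]: This paper addresses the limited immersiveness and expressiveness of vision-to-audio (V2A) synthesis, which it attributes mainly to existing methods relying only on global scene information and overlooking the details of local sound sources. The key is a Sound Source-Aware V2A (SSV2A) generator. SSV2A locally perceives multimodal sound sources in a scene through visual detection and cross-modality translation, and contrastively learns a Cross-Modal Sound Source (CMSS) manifold to semantically disambiguate each source. An attention mechanism then mixes these CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. The paper also builds a new single-sound-source visual-audio dataset, VGGS3, and designs a Sound Source Matching Score to measure localized audio relevance. According to the authors, this is the first work to tackle V2A generation at the sound-source level, and experiments show that SSV2A surpasses state-of-the-art methods in both generation fidelity and relevance.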

Link: https://arxiv.org/abs/2411.15447
Authors: Wei Guo, Heng Wang, Weidong Cai, Jianbo Ma
Keywords: synthesis has broad, applications in multimedia, broad applications, sound, sound sources
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 9 figures, source code released at this https URL

Abstract:Vision-to-audio (V2A) synthesis has broad applications in multimedia. Recent advancements of V2A methods have made it possible to generate relevant audios from inputs of videos or still images. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods solely rely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware V2A (SSV2A) generator. SSV2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate a novel single-sound-source visual-audio dataset VGGS3 from VGGSound. We also design a Sound Source Matching Score to measure localized audio relevance. This is to our knowledge the first work to address V2A generation at the sound-source level. Extensive experiments show that SSV2A surpasses state-of-the-art methods in both generation fidelity and relevance. We further demonstrate SSV2A’s ability to achieve intuitive V2A control by compositing vision, text, and audio conditions. Our SSV2A generation can be tried and heard at this https URL .

[CV-180] freePruner: A Training-free Approach for Large Multimodal Model Acceleration

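[Quick Read]: This paper addresses the deployment challenges that Large Multimodal Models (LMMs) face in vision-language tasks because of their high computational demands. The key is a training-free token reduction method called freePruner. freePruner accelerates inference with a two-stage token selection strategy: it first uses a purpose-designed contribution degree metric to identify pivotal tokens that capture high-level semantic information, then uses attention pattern analysis to select complementary tokens that preserve low-level visual details. The method needs no retraining or fine-tuning, can be applied directly to any open-source LMM, and achieves a 2x speedup with comparable performance on mainstream visual question-answering benchmarks. freePruner is also orthogonal to other post-training acceleration techniques, such as post-training quantization, and can be combined with them, offering a practical route to efficient LMM deployment.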

Link: https://arxiv.org/abs/2411.15446
Authors: Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan
Keywords: Large Multimodal Models, Large Multimodal, high computational demands, demonstrated impressive capabilities, Multimodal Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Multimodal Models (LMMs) have demonstrated impressive capabilities in visual-language tasks but face significant deployment challenges due to their high computational demands. While recent token reduction methods show promise for accelerating LMMs, they typically require extensive retraining or fine-tuning, making them impractical for many state-of-the-art models, especially those with proprietary training data. We propose freePruner, a training-free token reduction approach that can be directly applied to any open-source LMM without additional training. Unlike existing methods that rely heavily on token merging operations, freePruner employs a two-stage token selection strategy: (1) identifying pivotal tokens that capture high-level semantic information using our designed contribution degree metric, and (2) selecting complementary tokens that preserve essential low-level visual details through attention pattern analysis. Extensive experiments demonstrate that freePruner achieves 2x acceleration while maintaining comparable performance across mainstream visual question-answering benchmarks in the training-free setting. Moreover, freePruner is orthogonal to and can be combined with other post-training acceleration techniques, such as post-training quantization, providing a practical solution for efficient LMM deployment.
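
Token selection without training can be sketched generically as keeping the highest-scoring tokens and dropping the rest. The scores below are made-up numbers standing in for an importance measure; freePruner's actual contribution degree metric and its attention pattern analysis are not described in enough detail in the abstract to reproduce, so this shows only the selection mechanics. The function name is hypothetical.

```python
import numpy as np

def select_top_tokens(tokens, scores, keep):
    """Keep the `keep` tokens with the highest scores, preserving original order.

    tokens: array of shape (N, D); scores: array of shape (N,).
    Returns (kept_tokens, kept_indices).
    """
    tokens = np.asarray(tokens)
    scores = np.asarray(scores, dtype=float)
    top = np.argsort(-scores)[:keep]   # indices of the highest-scoring tokens
    kept_indices = np.sort(top)        # restore the original token order
    return tokens[kept_indices], kept_indices

tokens = np.arange(10).reshape(5, 2)          # 5 toy tokens, 2-dim features
scores = [0.1, 0.9, 0.3, 0.8, 0.2]            # hypothetical importance scores
kept, idx = select_top_tokens(tokens, scores, keep=2)
print(idx, kept)
```

Keeping the surviving tokens in their original order matters because positional information in the language model depends on sequence position; a two-stage scheme would run a selection like this twice with different scores and take the union of the kept indices.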

[CV-181] Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

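[Quick Read]: This paper studies the vulnerability of object detection models to backdoor attacks during training and inference. The key is a novel pair of twin trigger generative networks in the frequency domain, which produce invisible triggers and visible triggers. Invisible triggers are implanted into the model during training, with a Gaussian smoothing layer and a high-frequency artifact classifier making the implantation stealthier. Visible triggers activate the backdoor during inference; they are optimized with a newly designed alignment loss so that they differ from the original patterns yet still align with the malicious activation behavior of the invisible triggers. This twin-trigger mechanism makes the attack process hard to trace and significantly reduces the mAP_0.5 of object detectors, by 70.0% and 84.5% respectively.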

Link: https://arxiv.org/abs/2411.15439
Authors: Zhiying Li, Zhi Liu, Guanggang Geng, Shreyank N Gowda, Shuyuan Lin, Jian Weng, Xiaobo Jin
Keywords: invisible trigger generative, trigger generative, trigger generative networks, real-world applications, backdoor attacks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures

Abstract:Object detectors, which are widely used in real-world applications, are vulnerable to backdoor attacks. This vulnerability arises because many users rely on datasets or pre-trained models provided by third parties due to constraints on data and resources. However, most research on backdoor attacks has focused on image classification, with limited investigation into object detection. Furthermore, the triggers for most existing backdoor attacks on object detection are manually generated, requiring prior knowledge and consistent patterns between the training and inference stages. This approach makes the attacks either easy to detect or difficult to adapt to various scenarios. To address these limitations, we propose novel twin trigger generative networks in the frequency domain to generate invisible triggers for implanting stealthy backdoors into models during training, and visible triggers for steady activation during inference, making the attack process difficult to trace. Specifically, for the invisible trigger generative network, we deploy a Gaussian smoothing layer and a high-frequency artifact classifier to enhance the stealthiness of backdoor implantation in object detectors. For the visible trigger generative network, we design a novel alignment loss to optimize the visible triggers so that they differ from the original patterns but still align with the malicious activation behavior of the invisible triggers. Extensive experimental results and analyses prove the possibility of using different triggers in the training stage and the inference stage, and demonstrate the attack effectiveness of our proposed visible trigger and invisible trigger generative networks, significantly reducing the mAP_0.5 of the object detectors by 70.0% and 84.5%, including YOLOv5 and YOLOv7 with different settings, respectively.

[CV-182] ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

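[Quick Read]: This paper addresses the temporal, 3D, and expression inconsistencies that existing diffusion models show when generating talking head avatars. The key is a new framework, ConsistentAvatar, which introduces a Temporally-Sensitive Detail (TSD) map to capture high-frequency features and contours that change between adjacent frames, and uses a temporally consistent diffusion module to align the TSD of the initial result with that of the ground-truth video frame. The final avatar is generated conditioned on the aligned TSD, a rough head normal, and an emotion prompt embedding, which constrains the diffusion process to produce temporally stable talking heads while suppressing accumulated error and improving consistency in several respects.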

Link: https://arxiv.org/abs/2411.15436
Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Keywords: shown impressive potential, shown impressive, impressive potential, consistent diffusion module, talking head generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have shown impressive potential on talking head generation. While plausible appearance and talking effect are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to the error accumulation and inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly employing multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency feature and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking head. Further, its reliable guidance complements the inaccuracy of other conditions, suppressing the accumulated error while improving the consistency on various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency. Project page: this https URL

[CV-183] What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation

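[Quick Read]: This paper addresses the challenges of generating images from scene graphs, in particular accurately modeling spatial relationships and object interactions. The key to the solution is Scene-Bench, a comprehensive benchmark that includes MegaSG, a large-scale dataset of one million images annotated with scene graphs, for training and fairly comparing models across diverse and complex scenes. The paper also proposes SGScore, a new evaluation metric that uses the chain-of-thought reasoning of multimodal large language models (LLMs) to assess object presence and relationship accuracy, giving a more effective measure of factual consistency than traditional metrics such as FID and CLIPScore. Building on this evaluation framework, the paper develops a scene graph feedback pipeline that refines generated images by iteratively identifying and correcting discrepancies between the scene graph and the image. Experiments show that Scene-Bench provides a more comprehensive and effective evaluation framework than existing benchmarks, particularly for complex scene generation, and that the feedback strategy significantly improves the factual consistency of image generation models.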

Link: https://arxiv.org/abs/2411.15435
Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen
Keywords: accurately modeling spatial, modeling spatial relationships, scene graphs remains, extensively studied, remains relatively underexplored
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.
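
The abstract describes SGScore as checking both object presence and relationship accuracy against the scene graph. The sketch below shows only that bookkeeping, under stated assumptions: the two callables are hypothetical stand-ins for the yes/no judgments a multimodal LLM would return, and the paper's actual prompting and score aggregation are not given in the abstract and are not reproduced here.

```python
def scene_graph_accuracy(objects, relations, object_present, relation_holds):
    """Return (object accuracy, relation accuracy) for one generated image.

    objects:   list of object names in the scene graph
    relations: list of (subject, predicate, object) triples
    object_present / relation_holds: hypothetical callables standing in for
    the yes/no answers a multimodal LLM would give about the image.
    """
    obj_hits = sum(1 for o in objects if object_present(o))
    rel_hits = sum(1 for r in relations if relation_holds(r))
    obj_acc = obj_hits / len(objects) if objects else 1.0
    rel_acc = rel_hits / len(relations) if relations else 1.0
    return obj_acc, rel_acc


# Toy stand-in for the model's judgments on one generated image.
seen_objects = {"dog", "frisbee"}
seen_relations = {("dog", "catching", "frisbee")}
graph_objects = ["dog", "frisbee", "tree"]
graph_relations = [("dog", "catching", "frisbee"), ("dog", "under", "tree")]

scores = scene_graph_accuracy(
    graph_objects,
    graph_relations,
    object_present=lambda o: o in seen_objects,
    relation_holds=lambda r: r in seen_relations,
)
```

In this toy case the missing "tree" and the failed "under" relation are the kind of discrepancies a feedback loop could hand back to the generator for another refinement pass.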

[CV-184] LDM-Morph: Latent diffusion model guided deformable image registration

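[Quick Read]: This paper addresses two shortcomings of existing deep learning-based deformable image registration methods: the learned features lack semantic information, and the similarity metric is evaluated only in pixel space. The key to the solution is LDM-Morph, which enriches semantic information by integrating features extracted from a latent diffusion model (LDM) and uses a latent and global feature-based cross-attention module (LGCA) to strengthen the interaction between semantic information from the LDM and global information from multi-head self-attention. The paper also proposes a hierarchical metric that evaluates image-pair similarity in both the original pixel space and the latent feature space, improving registration accuracy while better preserving topology.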

Link: https://arxiv.org/abs/2411.15426
Authors: Jiong Wu, Kuang Gong
Keywords: plays an essential, essential role, image registration plays, medical image tasks, deformable registration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deformable image registration plays an essential role in various medical image tasks. Existing deep learning-based deformable registration frameworks primarily utilize convolutional neural networks (CNNs) or Transformers to learn features to predict the deformations. However, the lack of semantic information in the learned features limits the registration performance. Furthermore, the similarity metric of the loss function is often evaluated only in the pixel space, which ignores the matching of high-level anatomical features and can lead to deformation folding. To address these issues, in this work, we proposed LDM-Morph, an unsupervised deformable registration algorithm for medical image registration. LDM-Morph integrated features extracted from the latent diffusion model (LDM) to enrich the semantic information. Additionally, a latent and global feature-based cross-attention module (LGCA) was designed to enhance the interaction of semantic information from LDM and global information from multi-head self-attention operations. Finally, a hierarchical metric was proposed to evaluate the similarity of image pairs in both the original pixel space and latent-feature space, enhancing topology preservation while improving registration accuracy. Extensive experiments on four public 2D cardiac image datasets show that the proposed LDM-Morph framework outperformed existing state-of-the-art CNNs- and Transformers-based registration methods regarding accuracy and topology preservation with comparable computational efficiency. Our code is publicly available at this https URL.

[CV-185] OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

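[Quick Read]: This paper addresses the complexity and scarcity of annotated data in vision-language pretraining (VLP) for ophthalmic surgery. The key to the solution is OphCLIP, a hierarchical retrieval-augmented VLP framework designed for understanding ophthalmic surgical workflows. OphCLIP draws on the OphVL dataset, which contains over 375K hierarchically structured video-text pairs with tens of thousands of attribute combinations covering surgeries, phases/operations/actions, instruments, medications, and more advanced aspects such as causes of eye diseases, surgical objectives, and postoperative recovery recommendations. By aligning short video clips with detailed narrative descriptions and full videos with structured titles, OphCLIP learns both fine-grained and long-term visual representations, capturing intricate surgical details and high-level procedural insights. OphCLIP also includes a retrieval-augmented pretraining framework that exploits underexplored large-scale silent surgical videos, automatically retrieving semantically relevant content to enhance representation learning for narrated videos. Experiments show that OphCLIP generalizes robustly and achieves superior performance on phase recognition and multi-instrument identification.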

Link: https://arxiv.org/abs/2411.15421
Authors: Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He
Keywords: practice involves complex, advanced medical knowledge, complex visual interpretation, Surgical practice involves, involves complex visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address the gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages the OphVL dataset we constructed, a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs with tens of thousands of different combinations of attributes (surgeries, phases/operations/actions, instruments, medications, as well as more advanced aspects like the causes of eye diseases, surgical objectives, and postoperative recovery recommendations, etc). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP’s robust generalization and superior performance.

[CV-186] Semi-supervised Single-view 3D Reconstruction via Multi Shape Prior Fusion Strategy and Self-Attention

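[Quick Read]: This paper addresses the dependence of single-view 3D reconstruction on large amounts of annotated data. The key to the solution is a semi-supervised learning framework that uses a multi shape prior fusion strategy to guide the generation of more realistic object structures, combined with a self-attention module that improves the decoder's shape generation quality. Experiments show that on ShapeNet the method substantially outperforms existing supervised learning methods at labeled ratios of 1%, 10%, and 20%, and that it also performs well on the real-world Pix3D dataset.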

Link: https://arxiv.org/abs/2411.15420
Authors: Wei Zhou, Xinzhe Shi, Yunfeng She, Kunlong Liu, Yongqin Zhang
Keywords: domain of single-view, expensive and time-intensive, techniques have frequently, frequently relied, relied on expensive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the domain of single-view 3D reconstruction, traditional techniques have frequently relied on expensive and time-intensive 3D annotation data. Facing the challenge of annotation acquisition, semi-supervised learning strategies offer an innovative approach to reduce the dependence on labeled data. Despite these developments, the utilization of this learning paradigm in 3D reconstruction tasks remains relatively constrained. In this research, we created an innovative semi-supervised framework for 3D reconstruction that introduces a multi shape prior fusion strategy, intended to guide the creation of more realistic object structures. Additionally, to improve the quality of shape generation, we integrated a self-attention module into the traditional decoder. In benchmark tests on the ShapeNet dataset, our method substantially outperformed existing supervised learning methods at diverse labeled ratios of 1%, 10%, and 20%. Moreover, it showcased excellent performance on the real-world Pix3D dataset. Through comprehensive experiments on ShapeNet, our framework demonstrated a 3.3% performance improvement over the baseline. Moreover, stringent ablation studies further confirmed the notable effectiveness of our approach. Our code has been released on this https URL

[CV-187] FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation ACCV2024

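[Quick Read]: This paper addresses the interpretability of chest X-ray (CXR) report generation models in computer-aided diagnosis (CAD) systems. Reports generated by existing methods differ markedly from the interpretations of real radiologists, mainly because these models fail to reflect the attention patterns and detailed information radiologists use during diagnosis. The key to the solution is the Fine-Grained CXR (FG-CXR) dataset, which provides fine-grained pairing between radiologist-generated captions and the gaze attention heatmaps for the corresponding anatomy. The paper further proposes an explainable radiologist's attention generator network (Gen-XAI) that mimics the radiologist's diagnostic process and explicitly constrains its output to align closely with the radiologist's gaze attention and transcript, improving the accuracy and interpretability of report generation.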

Link: https://arxiv.org/abs/2411.15413
Authors: Trong Thang Pham, Ngoc-Vuong Ho, Nhat-Tan Bui, Thinh Phan, Patel Brijesh, Donald Adjeroh, Gianfranco Doretto, Anh Nguyen, Carol C. Wu, Hien Nguyen, Ngan Le
Keywords: Developing an interpretable, chest X-ray, crucial in Computer-aided, Computer-aided Diagnosis, interpretable system
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ACCV 2024

Abstract:Developing an interpretable system for generating reports in chest X-ray (CXR) analysis is becoming increasingly crucial in Computer-aided Diagnosis (CAD) systems, enabling radiologists to comprehend the decisions made by these systems. Despite the growth of diverse datasets and methods focusing on report generation, there remains a notable gap in how closely these models’ generated reports align with the interpretations of real radiologists. In this study, we tackle this challenge by first introducing the Fine-Grained CXR (FG-CXR) dataset, which provides fine-grained paired information between the captions generated by radiologists and the corresponding gaze attention heatmaps for each anatomy. Unlike existing datasets that include a raw sequence of gaze alongside a report, with significant misalignment between gaze location and report content, our FG-CXR dataset offers a finer-grained alignment between gaze attention and diagnosis transcript. Furthermore, our analysis reveals that simply applying black-box image captioning methods to generate reports cannot adequately explain which information in the CXR is utilized and how long it needs to be attended to in order to generate reports accurately. Consequently, we propose a novel explainable radiologist’s attention generator network (Gen-XAI) that mimics the diagnosis process of radiologists, explicitly constraining its output to closely align with both the radiologist’s gaze attention and transcript. Finally, we perform extensive experiments to illustrate the effectiveness of our method. Our datasets and checkpoint are available at this https URL.

[CV-188] FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

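[Quick Read]: This paper addresses the weakness of large Vision-Language Models (VLMs) in perceiving fine-grained regional composition information in images, in particular their difficulty in accurately aligning segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. The key to the solution is FINECAPTION, a new VLM that accepts arbitrary masks as referential inputs and processes high-resolution images for compositional image captioning at different levels of granularity. The paper also introduces COMPOSITIONCAP, a dataset for multi-grained region compositional image captioning, which defines the task of compositional attribute-aware regional image captioning and supports the development and evaluation of FINECAPTION. Experiments demonstrate the model's effectiveness relative to other state-of-the-art VLMs, and an analysis of how well current VLMs recognize various visual prompts for compositional region captioning highlights areas for improvement in VLM design and training.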

Link: https://arxiv.org/abs/2411.15411
Authors: Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Zhifei Zhang, Yilin Wang, Jianming Zhang, Jiebo Luo
Keywords: significantly advanced multimodal, large Vision-Language Models, visual question answering, advanced multimodal tasks, enabling more sophisticated
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract:The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs struggle with fine-grained image regional composition information perception. Specifically, they have difficulty accurately aligning the segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. However, compositionality - the ability to understand and generate novel combinations of known visual and textual components - is critical for facilitating coherent reasoning and understanding across modalities by VLMs. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different granularity levels. To support this endeavor, we introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning. Empirical results demonstrate the effectiveness of our proposed model compared to other state-of-the-art VLMs. Additionally, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.

[CV-189] Efficient Online Inference of Vision Transformers by Training-Free Tokenization

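[Quick Read]: This paper addresses the high cost of deploying vision transformers, in particular how to reduce energy consumption without significantly affecting performance or runtime. The key to the solution is the Visual Word Tokenizer (VWT), a training-free method. The VWT groups frequently used patches (visual subwords) into visual words while leaving infrequent patches intact, using intra-image or inter-image statistics to identify similar visual concepts for compression. Experiments show that the VWT reduces wattage by up to 19% with at most a 20% increase in runtime; comparable approaches such as 8-bit quantization and token merging achieve lower or similar energy efficiency but incur a much larger runtime penalty.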

Link: https://arxiv.org/abs/2411.15397
Authors: Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto
Keywords: wider industrial adoption, deploying vision transformers, vision transformers increasingly, transformers increasingly represents, industrial adoption
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression requires additional end-to-end fine-tuning or incurs a significant drawback to runtime, thus making them ill-suited for online inference. We introduce the Visual Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups patches (visual subwords) that are frequently used into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for compression. Experimentally, we demonstrate a reduction in wattage of up to 19% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to 2× or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.

[CV-190] Gradient-Free Classifier Guidance for Diffusion Model Sampling

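[Quick Read]: This paper addresses how diffusion models can improve class label alignment during image generation while maintaining high image fidelity. The key to the solution is Gradient-free Classifier Guidance (GFCG), which uses a pre-trained classifier purely in inference mode, without gradient descent. By determining a time-adaptive reference class label and a corresponding guidance scale at each time step, GFCG improves class prediction accuracy and complements existing guided sampling methods such as CFG; when combined with the state-of-the-art Autoguidance (ATG), it improves image fidelity while preserving diversity at no additional computational cost. On ImageNet 512×512, GFCG achieves an FD_DINOv2 of 23.09 together with a classification precision of 94.3%, compared with 90.2% for ATG.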

Link: https://arxiv.org/abs/2411.15393
Authors: Rahul Shenoy, Zhihong Pan, Kaushik Balakrishnan, Qisen Cheng, Yongmoon Jeon, Heejune Yang, Jaewon Kim
Keywords: outstanding learning capabilities, demonstrated outstanding learning, learning capabilities, effectively capturing, training dataset
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Image generation using diffusion models has demonstrated outstanding learning capabilities, effectively capturing the full distribution of the training dataset. Diffusion models are known to generate wide variations in sampled images, albeit with a trade-off in image fidelity. Guided sampling methods, such as classifier guidance (CG) and classifier-free guidance (CFG), focus sampling in well-learned high-probability regions to generate images of high fidelity, but each has its limitations. CG is computationally expensive due to the use of back-propagation for classifier gradient descent, while CFG, being gradient-free, is more efficient but compromises class label alignment compared to CG. In this work, we propose an efficient guidance method that fully utilizes a pre-trained classifier without using gradient descent. By using the classifier solely in inference mode, a time-adaptive reference class label and corresponding guidance scale are determined at each time step for guided sampling. Experiments on both class-conditioned and text-to-image generation diffusion models demonstrate that the proposed Gradient-free Classifier Guidance (GFCG) method consistently improves class prediction accuracy. We also show GFCG to be complementary to other guided sampling methods like CFG. When combined with the state-of-the-art Autoguidance (ATG), without additional computational overhead, it enhances image fidelity while preserving diversity. For ImageNet 512×512, we achieve a record FD_DINOv2 of 23.09, while simultaneously attaining a higher classification Precision (94.3%) compared to ATG (90.2%).
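
For readers unfamiliar with the CFG baseline mentioned above, the sketch below shows the standard classifier-free guidance combination of two noise predictions. It is not the GFCG rule: the abstract says GFCG chooses a reference class label and a guidance scale at each time step from the classifier's inference output, but it does not give that rule, so it is not reproduced here. The function name and the plain-list representation of noise predictions are assumptions for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move the unconditional noise prediction
    toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With `scale = 1` this returns the conditional prediction unchanged, and larger values extrapolate past it. Some papers parameterize the same formula with a weight `w` where `scale = 1 + w`, so scales are not always directly comparable across papers.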

[CV-191] Hatching-Box: Monitoring the Rearing Process of Drosophila Using an Embedded Imaging and in-vial Detection System

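[Quick Read]: This paper addresses the automatic monitoring and quantification of developmental behavior in Drosophila, specifically in standard rearing vials and during regular rearing routines, so that explicit experiments are no longer needed. The key to the solution is the Hatching-Box, a new imaging and analysis system that combines custom imaging hardware with dedicated detection and tracking algorithms. The system quantifies larvae, filled/empty pupae, and adult flies over multiple days, and its generic client/server software architecture allows an arbitrary number of rearing vials to be monitored simultaneously. Evaluated on a dataset of nearly 470,000 annotated objects and validated in real-world experiments, the Hatching-Box shows potential for long-term experiments and for automated monitoring in general cultivation.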

Link: https://arxiv.org/abs/2411.15390
Authors: Julian Bigge, Maite Ogueta, Luis Garcia, Benjamin Risse
Keywords: regular rearing routines, rendering explicit experiments, explicit experiments obsolete, Drosophila in standard, standard rearing vials
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 6 figures

Abstract:In this paper we propose the Hatching-Box, a novel imaging and analysis system to automatically monitor and quantify the developmental behavior of Drosophila in standard rearing vials and during regular rearing routines, rendering explicit experiments obsolete. This is achieved by combining custom tailored imaging hardware with dedicated detection and tracking algorithms, enabling the quantification of larvae, filled/empty pupae and flies over multiple days. Given the affordable and reproducible design of the Hatching-Box in combination with our generic client/server-based software, the system can easily be scaled to monitor an arbitrary amount of rearing vials simultaneously. We evaluated our system on a curated image dataset comprising nearly 470,000 annotated objects and performed several studies on real world experiments. We successfully reproduced results from well-established circadian experiments by comparing the eclosion periods of wild type flies to the clock mutants per^short, per^long and per^0 without involvement of any manual labor. Furthermore, we show that the Hatching-Box is able to extract additional information about group behavior as well as to reconstruct the whole life-cycle of the individual specimens. These results not only demonstrate the applicability of our system for long-term experiments but also indicate its benefits for automated monitoring in the general cultivation process.

[CV-192] A Constrast-Agnostic Method for Ultra-High Resolution Claustrum Segmentation

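[Quick Read]: This paper addresses the automatic segmentation of the claustrum, whose thin, sheet-like structure makes it hard to segment in magnetic resonance imaging (MRI) at typical resolutions (e.g., 1 mm isotropic). The key to the solution is a contrast- and resolution-agnostic method built on the SynthSeg segmentation framework, which uses synthetic training intensity images to achieve excellent generalization. The method needs only label maps for training, because the corresponding intensity images are synthesized on the fly with random contrast and resolution. Using manual claustrum labels from 18 ultra-high resolution MRI scans (mostly ex vivo), the authors trained a deep learning network for automatic segmentation and demonstrated it on these high-resolution cases (Dice score = 0.632, mean surface distance = 0.458 mm, volumetric similarity = 0.867, with 6-fold cross-validation). The method also works on in vivo T1-weighted MRI scans at typical resolutions and is robust on multimodal imaging (T2-weighted, proton density, and quantitative T1 scans). According to the authors, it is the first accurate method for automatic ultra-high resolution claustrum segmentation that is robust to changes in contrast and resolution.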

Link: https://arxiv.org/abs/2411.15388
Authors: Chiara Mauri, Ryan Fritz, Jocelyn Mora, Benjamin Billot, Juan Eugenio Iglesias, Koen Van Leemput, Jean Augustinack, Douglas N Greve
Keywords: band-like gray matter, gray matter structure, matter structure located, Magnetic Resonance Imaging, vivo Magnetic Resonance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 14 pages, 10 figures, 3 tables

Abstract:The claustrum is a band-like gray matter structure located between putamen and insula whose exact functions are still actively researched. Its sheet-like structure makes it barely visible in in vivo Magnetic Resonance Imaging (MRI) scans at typical resolutions and neuroimaging tools for its study, including methods for automatic segmentation, are currently very limited. In this paper, we propose a contrast- and resolution-agnostic method for claustrum segmentation at ultra-high resolution (0.35 mm isotropic); the method is based on the SynthSeg segmentation framework (Billot et al., 2023), which leverages the use of synthetic training intensity images to achieve excellent generalization. In particular, SynthSeg requires only label maps to be trained, since corresponding intensity images are synthesized on the fly with random contrast and resolution. We trained a deep learning network for automatic claustrum segmentation, using claustrum manual labels obtained from 18 ultra-high resolution MRI scans (mostly ex vivo). We demonstrated the method to work on these 18 high resolution cases (Dice score = 0.632, mean surface distance = 0.458 mm, and volumetric similarity = 0.867 using 6-fold Cross Validation (CV)), and also on in vivo T1-weighted MRI scans at typical resolutions (~1 mm isotropic). We also demonstrated that the method is robust in a test-retest setting and when applied to multimodal imaging (T2-weighted, Proton Density and quantitative T1 scans). To the best of our knowledge this is the first accurate method for automatic ultra-high resolution claustrum segmentation, which is robust against changes in contrast and resolution. The method is released at this https URL and as part of the neuroimaging package Freesurfer (Fischl, 2012).
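
Two of the metrics reported above, the Dice score and volumetric similarity, can be computed directly from a pair of binary masks. The sketch below uses the common definitions Dice = 2|A∩B| / (|A| + |B|) and VS = 1 − ||A| − |B|| / (|A| + |B|); it assumes masks flattened to 0/1 sequences, and the paper may use a different implementation or variant.

```python
def dice_score(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 2.0 * overlap / total if total else 1.0


def volumetric_similarity(pred, truth):
    """VS = 1 - ||A| - |B|| / (|A| + |B|); compares volumes only, not overlap."""
    vol_pred = sum(1 for p in pred if p)
    vol_truth = sum(1 for t in truth if t)
    total = vol_pred + vol_truth
    return 1.0 - abs(vol_pred - vol_truth) / total if total else 1.0
```

Under these definitions volumetric similarity is never lower than Dice, because it ignores where the voxels lie. For a thin structure such as the claustrum, a prediction displaced by a voxel can keep the right volume while losing much of its overlap, which is consistent with a volumetric similarity (0.867) well above the Dice score (0.632).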

[CV-193] Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

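[Quick Read]: This paper examines watermark-based protections against the use of unauthorized data to train text-to-image diffusion models such as Stable Diffusion, a practice that may lead to intellectual property infringement or privacy violations. The key contribution is a method called RATTAN, which uses the diffusion process to perform controlled image generation on the protected input, preserving the input's high-level features while ignoring the low-level details that watermarks rely on. A small number of generated images are then used to fine-tune protected models. RATTAN effectively bypasses existing state-of-the-art watermark protections, showing that they are not robust.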

Link: https://arxiv.org/abs/2411.15367
Authors: Soumil Datta, Shih-Chieh Dai, Leo Yu, Guanhong Tao
Keywords: shown exceptional potential, generating high-quality images, Stable Diffusion, shown exceptional, exceptional potential
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-image diffusion models, such as Stable Diffusion, have shown exceptional potential in generating high-quality images. However, recent studies highlight concerns over the use of unauthorized data in training these models, which may lead to intellectual property infringement or privacy violations. A promising approach to mitigate these issues is to apply a watermark to images and subsequently check if generative models reproduce similar watermark features. In this paper, we examine the robustness of various watermark-based protection methods applied to text-to-image models. We observe that common image transformations are ineffective at removing the watermark effect. Therefore, we propose RATTAN, which leverages the diffusion process to conduct controlled image generation on the protected input, preserving the high-level features of the input while ignoring the low-level details utilized by watermarks. A small number of generated images are then used to fine-tune protected models. Our experiments on three datasets and 140 text-to-image diffusion models reveal that existing state-of-the-art protections are not robust against RATTAN.

[CV-194] Personalization of Wearable Sensor-Based Joint Kinematic Estimation Using Computer Vision for Hip Exoskeleton Applications

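[Quick Read]: This paper addresses accurate lower-limb joint kinematic estimation for applications such as patient monitoring, rehabilitation, and exoskeleton control. Existing wearable sensor-based deep learning (DL) methods often need extensive new data to adapt to unseen gait patterns, while human pose estimation models from computer vision, although easy to deploy and capable of real-time inference, are infeasible where cameras cannot be used. The proposed solution is a computer vision-based DL adaptation framework for real-time joint kinematic estimation. Its key feature is that it needs only a small amount of data (1-2 gait cycles) and no professional motion capture setup: through transfer learning, a temporal convolutional network (TCN) is adapted to stiff knee gait data, reducing root mean square error by 9.7% and 19.9% relative to TCNs trained only on able-bodied and only on stiff knee data, respectively. The framework shows the potential of DL models trained with smartphone cameras to estimate joint kinematics in real time in clinical populations, particularly for wearable robot applications.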

Link: https://arxiv.org/abs/2411.15366
Authors: Changseob Song, Bogdan Ivanyuk-Skulskyi, Adrian Krieger, Kaitao Luo, Inseung Kang
Keywords: Accurate lower-limb joint, Accurate lower-limb, lower-limb joint kinematic, joint kinematic estimation, patient monitoring
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate lower-limb joint kinematic estimation is critical for applications such as patient monitoring, rehabilitation, and exoskeleton control. While previous studies have employed wearable sensor-based deep learning (DL) models for estimating joint kinematics, these methods often require extensive new datasets to adapt to unseen gait patterns. Meanwhile, researchers in computer vision have advanced human pose estimation models, which are easy to deploy and capable of real-time inference. However, such models are infeasible in scenarios where cameras cannot be used. To address these limitations, we propose a computer vision-based DL adaptation framework for real-time joint kinematic estimation. This framework requires only a small dataset (i.e., 1-2 gait cycles) and does not depend on professional motion capture setups. Using transfer learning, we adapted our temporal convolutional network (TCN) to stiff knee gait data, allowing the model to further reduce root mean square error by 9.7% and 19.9% compared to a TCN trained on only able-bodied and stiff knee datasets, respectively. Our framework demonstrates a potential for smartphone camera-trained DL models to estimate real-time joint kinematics across novel users in clinical populations with applications in wearable robots.
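
The 9.7% and 19.9% figures above are relative reductions in root mean square error (RMSE) between estimated and reference joint angles. A minimal sketch of both quantities follows; the function names and the use of plain lists of angles in degrees are assumptions for illustration, and the paper's exact evaluation protocol is not described in the abstract.

```python
import math


def rmse(predicted, measured):
    """Root mean square error between two equal-length sequences of joint angles."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(squared / len(predicted))


def percent_reduction(baseline_rmse, adapted_rmse):
    """Relative RMSE improvement of the adapted model over a baseline, in percent."""
    return 100.0 * (baseline_rmse - adapted_rmse) / baseline_rmse
```

Because the reduction is relative, the same percentage can correspond to very different absolute errors depending on how accurate the baseline model already was.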

[CV-195] UniGaussian: Driving Scene Reconstruction from Multiple Camera Models via Unified Gaussian Representations

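[Quick Read]: This paper addresses the fact that existing scene reconstruction methods for autonomous driving focus mainly on pinhole cameras and neglect fisheye cameras. The key to the solution is UniGaussian, which learns a unified 3D Gaussian representation from multiple camera models. Specifically, the paper proposes a new differentiable rendering method that distorts 3D Gaussians through a series of affine transformations tailored to fisheye camera models, resolving the incompatibility between 3D Gaussian splatting and fisheye cameras while retaining real-time rendering and differentiability. Building on this rendering method, a new framework learns the unified 3D Gaussian representation by applying affine transformations to adapt to different camera models and by regularizing the shared Gaussians with multimodal supervision, enabling holistic understanding across multiple sensors (pinhole and fisheye cameras) and modalities (depth, semantic, normal, and LiDAR point clouds).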

Link: https://arxiv.org/abs/2411.15355
Authors: Yuan Ren, Guile Wu, Runhao Li, Zheyuan Yang, Yibo Liu, Xingxin Chen, Tongtong Cao, Bingbing Liu
Keywords: Urban scene reconstruction, crucial for real-world, fisheye cameras, autonomous driving simulators, Gaussian representation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report

Abstract:Urban scene reconstruction is crucial for real-world autonomous driving simulators. Although existing methods have achieved photorealistic reconstruction, they mostly focus on pinhole cameras and neglect fisheye cameras. In fact, how to effectively simulate fisheye cameras in driving scenes remains an unsolved problem. In this work, we propose UniGaussian, a novel approach that learns a unified 3D Gaussian representation from multiple camera models for urban scene reconstruction in autonomous driving. Our contributions are two-fold. First, we propose a new differentiable rendering method that distorts 3D Gaussians using a series of affine transformations tailored to fisheye camera models. This addresses the compatibility issue of 3D Gaussian splatting with fisheye cameras, which is hindered by light ray distortion caused by lenses or mirrors. In addition, our method maintains real-time rendering while ensuring differentiability. Second, built on the differentiable rendering method, we design a new framework that learns a unified Gaussian representation from multiple camera models. By applying affine transformations to adapt different camera models and regularizing the shared Gaussians with supervision from different modalities, our framework learns a unified 3D Gaussian representation with input data from multiple sources and achieves holistic driving scene understanding. As a result, our approach models multiple sensors (pinhole and fisheye cameras) and modalities (depth, semantic, normal and LiDAR point clouds). Our experiments show that our method achieves superior rendering quality and fast rendering speed for driving scene simulation.

[CV-196] Zero-Shot Coreset Selection: Efficient Pruning for Unlabeled Data

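[Quick Read]: This paper addresses the high cost of storing, annotating, and training on large-scale data in deep learning, and in particular how to select a representative subset of data without relying on labels (unlabeled coreset selection). The key to the solution is Zero-Shot Coreset Selection (ZCore), which uses existing foundation models to generate a zero-shot embedding space for unlabeled data and then scores the relative importance of each example by quantifying coverage and redundancy within the embedding distribution, allowing coresets to be selected efficiently. ZCore requires neither labels nor training on the candidate data, which substantially reduces annotation cost, and it outperforms several existing label-based methods on multiple datasets.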

链接: https://arxiv.org/abs/2411.15349
作者: Brent A. Griffin,Jacob Marks,Jason J. Corso
关键词-EN: learning increasingly relies, Deep learning increasingly, coreset selection, increasingly relies, relies on massive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning increasingly relies on massive data with substantial costs for storage, annotation, and model training. To reduce these costs, coreset selection aims to find a representative subset of data to train models while ideally performing on par with the full data training. State-of-the-art coreset methods use carefully-designed criteria to quantify the importance of each data example via ground truth labels and dataset-specific training, then select examples whose scores lie in a certain range to construct a coreset. These methods work well in their respective settings, however, they cannot select data that are unlabeled, which is the majority of real-world data. To that end, this paper motivates and formalizes the problem of unlabeled coreset selection to enable greater scale and reduce annotation costs for deep learning. As a solution, we develop Zero-Shot Coreset Selection (ZCore), a method that efficiently selects coresets without ground truth labels or training on candidate data. Instead, ZCore uses existing foundation models to generate a zero-shot embedding space for unlabeled data, then quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, leading to a strong baseline for future research in unlabeled coreset selection. On ImageNet, ZCore selections achieve a downstream model accuracy of 53.99% with only 10% training data, which outperforms label-based methods while removing annotation requirements for 1.15 million images. Our code is publicly available at this https URL.
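
To make the idea of trading coverage against redundancy in an embedding space concrete, here is a minimal sketch. It swaps in plain greedy k-center (farthest-point) selection, which is not ZCore's scoring rule; the function name `k_center_greedy` and the toy 2-D "embeddings" are assumptions for illustration only.

```python
import math

def k_center_greedy(embeddings, k):
    """Pick k indices by farthest-point sampling (a stand-in, not ZCore's score)."""
    k = min(k, len(embeddings))
    selected = [0]  # arbitrary starting example
    # Distance from every point to its nearest already-selected point.
    nearest = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(selected) < k:
        # The least-covered point adds the most coverage and the least redundancy.
        idx = max(range(len(embeddings)), key=lambda i: nearest[i])
        selected.append(idx)
        nearest = [min(d, math.dist(e, embeddings[idx]))
                   for d, e in zip(nearest, embeddings)]
    return selected

# Toy embeddings: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
coreset = k_center_greedy(points, 3)
```

With these toy points the selection takes one example from each cluster and skips the near-duplicates, which is the behaviour a coverage-and-redundancy criterion is meant to encourage.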

[CV-197] There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks

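Quick read: This paper asks whether the Segment Anything Model (SAM), which was designed for label-agnostic mask generation, has inherent semantic understanding that would make it useful for broader vision tasks. It explores the question in stages. First, it compares SAM's base image encoder with established models (CLIP and DINOv2) on classification tasks and finds that SAM's feature representations lack semantic discriminability, which limits their use in tasks that require class differentiation. It then tries to inject semantic information through in-context learning with lightweight fine-tuning, but generalisation to unseen classes remains limited. Finally, it proposes a training-free approach that uses DINOv2 features to add semantic understanding to SAM and achieves instance-level class differentiation through feature similarity. The study suggests that incorporating external semantic sources is a promising way to make SAM more useful for complex vision tasks.

Link: https://arxiv.org/abs/2411.15288
Authors: Miguel Espinosa,Chenhongyi Yang,Linus Ericsson,Steven McDonagh,Elliot J. Crowley
Keywords: label-agnostic mask generation, mask generation, originally designed, designed for label-agnostic, label-agnostic mask
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Work in progress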

Abstract:The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding, of value to broader visual tasks? In this work we follow a multi-staged approach towards exploring this question. We firstly quantify SAM’s semantic capabilities by comparing base image encoder efficacy under classification tasks, in comparison with established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM feature representations, limiting potential for tasks that require class differentiation. This initial result motivates our exploratory study that attempts to enable semantic information via in-context learning with lightweight fine-tuning where we observe that generalisability to unseen classes remains limited. Our observations culminate in the proposal of a training-free approach that leverages DINOv2 features, towards better endowing SAM with semantic understanding and achieving instance-level class differentiation through feature-based similarity. Our study suggests that incorporation of external semantic sources provides a promising direction for the enhancement of SAM’s utility with respect to complex visual tasks that require semantic understanding.
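
As a rough illustration of class differentiation through feature-based similarity, the sketch below assigns a region to the class whose prototype feature is closest in cosine similarity. The prototype dictionary, the three-dimensional vectors, and the function names are hypothetical; the abstract does not say how the authors pool DINOv2 features over SAM masks or build class references.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify_region(region_feature, prototypes):
    """Return the class whose prototype is most similar to the region feature."""
    return max(prototypes, key=lambda name: cosine(region_feature, prototypes[name]))

# Hypothetical per-class prototype features from a semantic encoder.
prototypes = {"cat": [0.9, 0.1, 0.0], "car": [0.0, 0.2, 0.9]}
# Hypothetical feature pooled over one mask region.
label = classify_region([0.8, 0.3, 0.1], prototypes)
```

No training is involved here: the only inputs are features from an external encoder and a similarity measure, which is the sense in which such an approach can be training-free.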

[CV-198] When Spatial meets Temporal in Action Recognition

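Quick read: This paper addresses the effective integration of spatial and temporal information in video action recognition. Existing methods usually focus on either spatial features (e.g., object appearance) or temporal dynamics (e.g., motion) and rarely combine the two. The key to the solution is a preprocessing technique called the Temporal Integration and Motion Enhancement (TIME) layer. It rearranges the original video sequence, embedding N^2 temporally evolving frames into a single N \times N spatial grid to form each new frame while preserving temporal order. The new frames balance spatial and temporal information and remain compatible with existing video models, which improves recognition accuracy.

Link: https://arxiv.org/abs/2411.15284
Authors: Huilin Chen,Lei Wang,Yifan Chen,Tom Gedeon,Piotr Koniusz
Keywords: made significant strides, temporal, significant strides, TIME layer, made significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Research report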

Abstract:Video action recognition has made significant strides, but challenges remain in effectively using both spatial and temporal information. While existing methods often focus on either spatial features (e.g., object appearance) or temporal dynamics (e.g., motion), they rarely address the need for a comprehensive integration of both. Capturing the rich temporal evolution of video frames, while preserving their spatial details, is crucial for improving accuracy. In this paper, we introduce the Temporal Integration and Motion Enhancement (TIME) layer, a novel preprocessing technique designed to incorporate temporal information. The TIME layer generates new video frames by rearranging the original sequence, preserving temporal order while embedding N^2 temporally evolving frames into a single spatial grid of size N \times N . This transformation creates new frames that balance both spatial and temporal information, making them compatible with existing video models. When N=1 , the layer captures rich spatial details, similar to existing methods. As N increases ( N\geq2 ), temporal information becomes more prominent, while the spatial information decreases to ensure compatibility with model inputs. We demonstrate the effectiveness of the TIME layer by integrating it into popular action recognition models, such as ResNet-50, Vision Transformer, and Video Masked Autoencoders, for both RGB and depth video data. Our experiments show that the TIME layer enhances recognition accuracy, offering valuable insights for video processing tasks.
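
The rearrangement described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: the row-major tile order, the strided subsampling used in place of a proper resize, and the name `time_grid` are all assumptions.

```python
def time_grid(frames, n):
    """Tile each run of n*n consecutive frames into one n-by-n grid frame."""
    group = n * n
    out = []
    for start in range(0, len(frames) - group + 1, group):
        # Strided subsampling stands in for a proper resize (assumption), so the
        # grid frame keeps the spatial size of one original frame.
        small = [[row[::n] for row in f[::n]] for f in frames[start:start + group]]
        grid = []
        for r in range(n):                      # one row of the grid
            tiles = small[r * n:(r + 1) * n]    # n frames, kept in temporal order
            for y in range(len(tiles[0])):
                grid.append([px for t in tiles for px in t[y]])
        out.append(grid)
    return out

# Four 2x2 grayscale frames, each filled with its own time index.
frames = [[[t, t], [t, t]] for t in range(4)]
result = time_grid(frames, 2)
```

With N=1 the function returns the frames unchanged, matching the abstract's remark that this case keeps full spatial detail. With N=2 each output frame holds four time steps at half resolution per side.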

[CV-199] Foundation Cures Personalization: Recovering Facial Personalized Models Prompt Consistency

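Quick read: This paper addresses the loss of prompt consistency caused by identity embedding mechanisms in facial personalization for text-to-image generation. When a prompt contains multiple facial attributes, the identity embedding weakens the effect of the other tokens, so the generated image drifts from the prompt. The proposed solution, FreeCure, is a training-free framework that uses the intrinsic knowledge of foundation models to improve the prompt consistency of personalization models. The key steps are to extract cross-attention and semantic maps from the foundation model's denoising process to identify easily localized attributes (e.g., hair and accessories), and then to enhance multiple attributes in the personalization model's output through a noise-blending strategy combined with an inversion-based process. The method needs no additional training, enhances a wide range of facial attributes non-intrusively, and integrates into existing popular personalization models.

Link: https://arxiv.org/abs/2411.15277
Authors: Yiyang Cai,Zhengkai Jiang,Yulong Liu,Chunyang Jiang,Wei Xue,Wenhan Luo,Yike Guo
Keywords: crucial downstream task, Facial personalization represents, Facial personalization, represents a crucial, crucial downstream
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: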

Abstract:Facial personalization represents a crucial downstream task in the domain of text-to-image generation. To preserve identity fidelity while ensuring alignment with user-defined prompts, current mainstream frameworks for facial personalization predominantly employ identity embedding mechanisms to associate identity information with textual embeddings. However, our experiments show that identity embeddings compromise the effectiveness of other tokens within the prompt, thereby hindering high prompt consistency, particularly when prompts involve multiple facial attributes. Moreover, previous works overlook the fact that their corresponding foundation models hold great potential to generate faces aligning to prompts well and can be easily leveraged to cure these ill-aligned attributes in personalized models. Building upon these insights, we propose FreeCure, a training-free framework that harnesses the intrinsic knowledge from the foundation models themselves to improve the prompt consistency of personalization models. First, by extracting cross-attention and semantic maps from the denoising process of foundation models, we identify easily localized attributes (e.g., hair, accessories, etc). Second, we enhance multiple attributes in the outputs of personalization models through a novel noise-blending strategy coupled with an inversion-based process. Our approach offers several advantages: it eliminates the need for training; it effectively facilitates the enhancement for a wide array of facial attributes in a non-intrusive manner; and it can be seamlessly integrated into existing popular personalization models. FreeCure has demonstrated significant improvements in prompt consistency across a diverse set of state-of-the-art facial personalization models while maintaining the integrity of original identity fidelity.

[CV-200] Event USKT: U-State Space Model in Knowledge Transfer for Event Cameras

Quick read: This paper addresses the limited quantity of available event camera data by introducing a tailored U-shaped State Space Model Knowledge Transfer (USKT) framework for Event-to-RGB knowledge transfer. The key is that USKT generates inputs compatible with RGB frames, so event data can reuse pre-trained RGB models and reach competitive performance with minimal parameter tuning. Within USKT, the proposed Bidirectional Reverse State Space Model (BiR-SSM) uses a shared-weight strategy that keeps modeling efficient while saving computational resources.

Link: https://arxiv.org/abs/2411.15276
Authors: Yuhui Lin,Jiahao Zhang,Siyuan Li,Jimin Xiao,Ding Xu,Wenjun Wu,Jiaxuan Lu
Keywords: emerging imaging technology, offer distinct advantages, including reduced energy, traditional RGB cameras, reduced energy consumption
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Event cameras, as an emerging imaging technology, offer distinct advantages over traditional RGB cameras, including reduced energy consumption and higher frame rates. However, the limited quantity of available event data presents a significant challenge, hindering their broader development. To alleviate this issue, we introduce a tailored U-shaped State Space Model Knowledge Transfer (USKT) framework for Event-to-RGB knowledge transfer. This framework generates inputs compatible with RGB frames, enabling event data to effectively reuse pre-trained RGB models and achieve competitive performance with minimal parameter tuning. Within the USKT architecture, we also propose a bidirectional reverse state space model. Unlike conventional bidirectional scanning mechanisms, the proposed Bidirectional Reverse State Space Model (BiR-SSM) leverages a shared weight strategy, which facilitates efficient modeling while conserving computational resources. In terms of effectiveness, integrating USKT with ResNet50 as the backbone improves model performance by 0.95%, 3.57%, and 2.9% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively, underscoring USKT’s adaptability and effectiveness. The code will be made available upon acceptance.

[CV-201] EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration

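Quick read: This paper addresses outdoor LiDAR point cloud registration (PCR), where the sparsity, irregularity, and large scale of the points make it hard to establish dense global point-to-point correspondences. The key to the solution is EADReg, a framework based on autoregressive diffusion models that follows a coarse-to-fine registration paradigm. In the coarse stage, a Bi-directional Gaussian Mixture Model (BGMM) rejects outlier points to obtain purified point cloud pairs and establishes correspondences between the Gaussian Mixture Models (GMMs) of the source and target frames, giving a reliable coarse registration from filtered features and geometric information. In the fine stage, diffusion-based PCR is treated as an autoregressive process that generates robust point correspondences, which are then iteratively refined on upper layers. Although diffusion models are often criticized for slow inference, EADReg achieves runtime comparable to convolution-based methods.

Link: https://arxiv.org/abs/2411.15271
Authors: Linrui Gong,Jiuming Liu,Junyi Ma,Lihao Liu,Yaonan Wang,Hesheng Wang
Keywords: Gaussian Mixture Model, challenging cases, Diffusion models, Bi-directional Gaussian Mixture, shown the great
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: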

Abstract:Diffusion models have shown the great potential in the point cloud registration (PCR) task, especially for enhancing the robustness to challenging cases. However, existing diffusion-based PCR methods primarily focus on instance-level scenarios and struggle with outdoor LiDAR points, where the sparsity, irregularity, and huge point scale inherent in LiDAR points pose challenges to establishing dense global point-to-point correspondences. To address this issue, we propose a novel framework named EADReg for efficient and robust registration of LiDAR point clouds based on autoregressive diffusion models. EADReg follows a coarse-to-fine registration paradigm. In the coarse stage, we employ a Bi-directional Gaussian Mixture Model (BGMM) to reject outlier points and obtain purified point cloud pairs. BGMM establishes correspondences between the Gaussian Mixture Models (GMMs) from the source and target frames, enabling reliable coarse registration based on filtered features and geometric information. In the fine stage, we treat diffusion-based PCR as an autoregressive process to generate robust point correspondences, which are then iteratively refined on upper layers. Despite common criticisms of diffusion-based methods regarding inference speed, EADReg achieves runtime comparable to convolutional-based methods. Extensive experiments on the KITTI and NuScenes benchmark datasets highlight the state-of-the-art performance of our proposed method. Codes will be released upon publication.

[CV-202] Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI

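Quick read: This paper addresses three main shortcomings of gradient-based explainability techniques for image models: (1) they require white-box access to the model; (2) they are vulnerable to adversarial attacks; and (3) they produce attributions that lie off the image manifold, so the explanations are not faithful to the model and do not match human perception. The key to the solution is Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG), which uses ensemble Kalman filters and diffusion models to obtain a derivative-free approximation of the model's gradient projected onto the data manifold, using only the model's outputs. The method is shown to be effective on both counterfactual generation and feature attribution, achieving state-of-the-art results while preserving the essential properties expected of XAI tools.

Link: https://arxiv.org/abs/2411.15265
Authors: Won Jun Kim,Hyungjin Chung,Jaemin Kim,Sangmin Lee,Byeongsu Sim,Jong Chul Ye
Keywords: Gradient-based methods, prototypical family, explainability techniques, image-based models, Gradient-based
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 5 figures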

Abstract:Gradient-based methods are a prototypical family of explainability techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG), a novel method that serves as a better basis for explaining a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model’s gradient projected onto the data manifold, requiring access only to the model’s outputs. We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks, counterfactual explanation and feature attribution, we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.
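
The derivative-free ingredient can be illustrated with the standard ensemble Kalman identity: the sample cross-covariance between a set of particles and the model's outputs approximates the covariance-preconditioned gradient, C_xx ∇f, using forward evaluations only. The sketch below checks this on a toy linear model, where the identity is exact. It omits the diffusion-based manifold projection entirely and is not the authors' algorithm; `toy_model`, the particle count, and the helper names are assumptions.

```python
import random

def ensemble_cross_cov(f, particles):
    """Sample cross-covariance between particles and black-box outputs f(x)."""
    n, d = len(particles), len(particles[0])
    outputs = [f(x) for x in particles]          # only model outputs are used
    x_mean = [sum(x[j] for x in particles) / n for j in range(d)]
    f_mean = sum(outputs) / n
    return [
        sum((x[j] - x_mean[j]) * (y - f_mean) for x, y in zip(particles, outputs)) / n
        for j in range(d)
    ]

def sample_cov(particles):
    """Sample covariance matrix of the particles."""
    n, d = len(particles), len(particles[0])
    mean = [sum(x[j] for x in particles) / n for j in range(d)]
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in particles) / n
             for j in range(d)] for i in range(d)]

random.seed(0)
a = [2.0, -1.0, 0.5]                             # true gradient of the toy model
toy_model = lambda x: sum(ai * xi for ai, xi in zip(a, x))
particles = [[random.gauss(0.0, 1.0) for _ in a] for _ in range(50)]
c_xf = ensemble_cross_cov(toy_model, particles)
c_xx = sample_cov(particles)
# For a linear model, c_xf equals c_xx applied to the true gradient.
preconditioned = [sum(c_xx[i][j] * a[j] for j in range(3)) for i in range(3)]
```

For a nonlinear model the same quantity is only an approximation whose quality depends on how tightly the particles are spread around the point of interest.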

[CV-203] AI-Driven Real-Time Monitoring of Ground-Nesting Birds: A Case Study on Curlew Detection Using YOLOv10

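Quick read: This paper addresses real-time monitoring of wild birds, in particular ground-nesting species such as the curlew (Numenius arquata), whose populations are declining significantly. The key to the solution is an AI-driven real-time species detection system: a custom-trained YOLOv10 model, fed by 3/4G-enabled cameras linked to the Conservation AI platform, detects and classifies curlews and their chicks. Across 11 nesting sites in Wales, the system processes camera trap data in real time with high accuracy, improving monitoring efficiency and supplying timely, accurate data for biodiversity assessment and early conservation intervention.

Link: https://arxiv.org/abs/2411.15263
Authors: Carl Chalmers,Paul Fergus,Serge Wich,Steven N Longmore,Naomi Davies Walsh,Lee Oliver,James Warrington,Julieanne Quinlan,Katie Appleby
Keywords: signal significant environmental, Effective monitoring, ecosystem health, wildlife is critical, critical for assessing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: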

Abstract:Effective monitoring of wildlife is critical for assessing biodiversity and ecosystem health, as declines in key species often signal significant environmental changes. Birds, particularly ground-nesting species, serve as important ecological indicators due to their sensitivity to environmental pressures. Camera traps have become indispensable tools for monitoring nesting bird populations, enabling data collection across diverse habitats. However, the manual processing and analysis of such data are resource-intensive, often delaying the delivery of actionable conservation insights. This study presents an AI-driven approach for real-time species detection, focusing on the curlew (Numenius arquata), a ground-nesting bird experiencing significant population declines. A custom-trained YOLOv10 model was developed to detect and classify curlews and their chicks using 3/4G-enabled cameras linked to the Conservation AI platform. The system processes camera trap data in real-time, significantly enhancing monitoring efficiency. Across 11 nesting sites in Wales, the model achieved high performance, with a sensitivity of 90.56%, specificity of 100%, and F1-score of 95.05% for curlew detections, and a sensitivity of 92.35%, specificity of 100%, and F1-score of 96.03% for curlew chick detections. These results demonstrate the capability of AI-driven monitoring systems to deliver accurate, timely data for biodiversity assessments, facilitating early conservation interventions and advancing the use of technology in ecological research.
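
The three reported metrics are linked by simple arithmetic, which the short check below works through. It assumes all three figures are computed over the same set of detections: a specificity of 100% then means there were no false positives, so precision is 1.0 and F1 reduces to 2·sensitivity / (1 + sensitivity). The function name `f1_from_rates` is an illustrative choice.

```python
def f1_from_rates(sensitivity, precision):
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)

# Specificity of 100% implies zero false positives, hence precision = 1.0.
adult_f1 = f1_from_rates(0.9056, 1.0)   # curlew detections
chick_f1 = f1_from_rates(0.9235, 1.0)   # curlew chick detections
```

Under that assumption the adult figure comes out at about 95.05%, matching the abstract, and the chick figure at about 96.02%, within rounding of the reported 96.03%.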

[CV-204] MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

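Quick read: This paper addresses the difficulty existing video generation models have in producing long videos with multiple scenes, coherent narratives, and consistent characters. The key to the solution is MovieBench, a hierarchical movie-level dataset for the analysis, evaluation, and training of long video generation. Its main contributions are: (1) movie-length videos with rich, coherent storylines and multi-scene narratives; (2) consistent character appearance and audio across scenes; and (3) a hierarchical data structure containing high-level movie information and detailed shot-level descriptions. MovieBench aims to bring new insights and challenges to long video generation, such as maintaining character identity consistency across multiple scenes.

Link: https://arxiv.org/abs/2411.15262
Authors: Weijia Wu,Mingyu Liu,Zeyu Zhu,Xi Xia,Haoen Feng,Wen Wang,Kevin Qinghong Lin,Chunhua Shen,Mike Zheng Shou
Keywords: Stable Video Diffusion, show promising results, long video generation, Recent advancements, video generation models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project website is at: this https URL. Code: this https URL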

Abstract:Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives, (2) consistency of character appearance and audio across scenes, and (3) hierarchical data structure contains high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench brings some new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be public and continuously maintained, aiming to advance the field of long video generation. Data can be found at: this https URL.

[CV-205] VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing

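Quick read: This paper addresses obstacles to high-quality video editing: the lack of large-scale open-source video editing datasets based on real-world data, the high training cost caused by the large number of tokens needed to represent video, and the limited interactivity of existing video editing models. The key to the solution is VIVID-10M, a large-scale hybrid image-video local editing dataset, together with VIVID, a versatile and interactive video local editing model trained on it. VIVID-10M contains 9.7M samples covering a wide range of editing tasks and is designed to reduce data construction and model training costs. VIVID supports entity addition, modification, and deletion, and its keyframe-guided interactive editing mechanism lets users iteratively edit keyframes and propagate the edits to other frames, reducing the latency of reaching the desired result. Experiments show state-of-the-art performance in video local editing.

Link: https://arxiv.org/abs/2411.15260
Authors: Jiahao Hu,Tianxiong Zhong,Xuebo Wang,Boyuan Jiang,Xingye Tian,Fei Yang,Pengfei Wan,Di Zhang
Keywords: Diffusion-based image editing, made remarkable progress, Diffusion-based image, video editing, editing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 14 figures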

Abstract:Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing datasets based on real-world data, as constructing such datasets is both time-consuming and costly. Moreover, video data requires a significantly larger number of tokens for representation, which substantially increases the training costs for video editing models. Lastly, current video editing models offer limited interactivity, often making it difficult for users to express their editing requirements effectively in a single attempt. To address these challenges, this paper introduces a dataset VIVID-10M and a baseline model VIVID. VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs, which comprises 9.7M samples that encompass a wide range of video editing tasks. VIVID is a Versatile and Interactive VIdeo local eDiting model trained on VIVID-10M, which supports entity addition, modification, and deletion. At its core, a keyframe-guided interactive video editing mechanism is proposed, enabling users to iteratively edit keyframes and propagate the edits to other frames, thereby reducing latency in achieving desired outcomes. Extensive experimental evaluations show that our approach achieves state-of-the-art performance in video local editing, surpassing baseline methods in both automated metrics and user studies. The VIVID-10M dataset and the VIVID editing model will be available at this https URL.

[CV-206] LocRef-Diffusion: Tuning-Free Layout and Appearance-Guided Generation

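Quick read: This paper addresses personalized and controllable generation of instances in diffusion-based text-to-image generation. The key to the solution is LocRef-Diffusion, which introduces a Layout-net and an appearance-net for precise control of instance position and appearance. The Layout-net controls where instances are generated using explicit instance layout information and an instance region cross-attention module. The appearance-net extracts instance appearance features and integrates them into the diffusion model through cross-attention, improving appearance fidelity to the reference images. Experiments show state-of-the-art performance in layout- and appearance-guided generation.

Link: https://arxiv.org/abs/2411.15252
Authors: Fan Deng,Yaguang Wu,Xinyang Yu,Xiangjun Huang,Jian Yang,Guangyu Yan,Qiang Xu
Keywords: achieved remarkable success, generating high-quality images, achieved remarkable, remarkable success, success in generating
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: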

Abstract:Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances’ appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation.

[CV-207] AnyText2: Visual Text Generation and Editing With Customizable Attributes

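Quick read: This paper addresses the difficulty of precisely controlling the font and color of multilingual text in text-to-image (T2I) generation. The key to the solution is AnyText2, which has two main components. First, a WriteNet+AttnX architecture injects text rendering capabilities into a pre-trained T2I model; compared with its predecessor AnyText, it improves image realism and increases inference speed by 19.8%. Second, a Text Embedding Module encodes fonts and colors extracted from scene images separately as conditions, which allows the attributes of each line of text to be customized and improves text accuracy by 3.3% for Chinese and 9.3% for English.

Link: https://arxiv.org/abs/2411.15245
Authors: Yuxiang Tuo,Yifeng Geng,Liefeng Bo
Keywords: garnered significant attention, domain progresses, significant attention, seamlessly integrates, integrates with visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: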

Abstract:As the text-to-image (T2I) domain progresses, generating text that seamlessly integrates with visual content has garnered significant attention. However, even with accurate text generation, the inability to control font and color can greatly limit certain applications, and this issue remains insufficiently addressed. This paper introduces AnyText2, a novel method that enables precise control over multilingual text attributes in natural scene image generation and editing. Our approach consists of two main components. First, we propose a WriteNet+AttnX architecture that injects text rendering capabilities into a pre-trained T2I model. Compared to its predecessor, AnyText, our new approach not only enhances image realism but also achieves a 19.8% increase in inference speed. Second, we explore techniques for extracting fonts and colors from scene images and develop a Text Embedding Module that encodes these text attributes separately as conditions. As an extension of AnyText, this method allows for customization of attributes for each line of text, leading to improvements of 3.3% and 9.3% in text accuracy for Chinese and English, respectively. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method. The code and model will be made open-source in this https URL.

[CV-208] Adversarial Prompt Distillation for Vision-Language Models

Quick read: This paper addresses the vulnerability of pre-trained Vision-Language Models (VLMs) such as Contrastive Language-Image Pre-Training (CLIP) to adversarial attacks. The key to the solution is Adversarial Prompt Distillation (APD), which combines Adversarial Prompt Tuning (APT) with knowledge distillation. APD adds prompts for both the visual and textual modalities and uses a cleanly pre-trained teacher CLIP model to improve the student CLIP model on downstream tasks, strengthening both adversarial robustness and natural performance. Experiments show that APD outperforms current state-of-the-art APT methods on both natural and adversarial performance.

Link: https://arxiv.org/abs/2411.15244
Authors: Lin Luo,Xin Wang,Bojia Zi,Shihao Zhao,Xingjun Ma
Keywords: Contrastive Language-Image Pre-Training, Large pre-trained Vision-Language, Contrastive Language-Image, Adversarial Prompt Tuning, Large pre-trained
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large pre-trained Vision-Language Models (VLMs) such as Contrastive Language-Image Pre-Training (CLIP) have been shown to be susceptible to adversarial attacks, raising concerns about their deployment in safety-critical scenarios like autonomous driving and medical diagnosis. One promising approach for improving the robustness of pre-trained VLMs is Adversarial Prompt Tuning (APT), which combines adversarial training with prompt tuning. However, existing APT methods are mostly single-modal methods that design prompt(s) for only the visual or textual modality, limiting their effectiveness in either robustness or clean accuracy. In this work, we propose a novel method called Adversarial Prompt Distillation (APD) that combines APT with knowledge distillation to boost the adversarial robustness of CLIP. Specifically, APD is a bimodal method that adds prompts for both the visual and textual modalities while leveraging a cleanly pre-trained teacher CLIP model to distill and boost the performance of the student CLIP model on downstream tasks. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our APD over the current state-of-the-art APT methods in terms of both natural and adversarial performances. The effectiveness of our APD method validates the possibility of using a non-robust teacher to improve the generalization and robustness of VLMs.

[CV-209] EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

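Quick read: This paper addresses how to capture global dependencies efficiently when deploying neural networks in resource-constrained environments. The key to the solution is the hidden state mixer-based state space duality (HSM-SSD), which redesigns the SSD layer to perform channel mixing within hidden states and adds multi-stage hidden state fusion to strengthen their representation power, reducing computational cost while improving performance. The resulting Efficient Vision Mamba (EfficientViM) architecture achieves a new speed-accuracy trade-off on ImageNet-1k, with up to a 0.7% performance improvement over SHViT at faster speed.

Link: https://arxiv.org/abs/2411.15241
Authors: Sanghyeok Lee,Joonmyung Choi,Hyunwoo J. Kim
Keywords: resource-constrained environments, built lightweight architectures, deployment of neural, neural networks, networks in resource-constrained
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint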

Abstract:For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model has emerged as an effective global token interaction with its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSM have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide the design alleviating the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training. Code is available at this https URL.

[CV-210] Faithful Label-free Knowledge Distillation

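Quick read: This paper addresses how to use knowledge distillation to train high-performing, lightweight student models from large computer vision foundation models. The key to the solution is a label-free knowledge distillation approach called Teacher in the Middle (TinTeM), which learns an approximately orthogonal mapping from the teacher's latent space to the student network and so produces a student that replicates the teacher's behavior more faithfully. The approach performs well on benchmarks of model robustness, generalisability, and out-of-distribution (OOD) detection. Distillation with TinTeM on task-specific datasets also yields more accurate models with better generalisability and OOD detection, offering a competitive route to training high-performing lightweight models on small datasets.

Link: https://arxiv.org/abs/2411.15239
Authors: Evelyn J. Mannix,Liam Hodgkinson,Howard Bondell
Keywords: inductive bias, Knowledge distillation approaches, Knowledge distillation, model compression techniques, teacher network
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: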

Abstract:Knowledge distillation approaches are model compression techniques, with the goal of training a highly performant student model by using a teacher network that is larger or contains a different inductive bias. These approaches are particularly useful when applied to large computer vision foundation models, which can be compressed into smaller variants that retain desirable properties such as improved robustness. This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM), which improves on previous methods by learning an approximately orthogonal mapping from the latent space of the teacher to the student network. This produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection. It is further shown that knowledge distillation with TinTeM on task specific datasets leads to more accurate models with greater generalisability and OOD detection performance, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets.

[CV-211] Stain-Invariant Representation for Tissue Classification in Histology Images

【速读】：该论文试图解决在计算病理学 (CPath) 中，由于数字化病理切片 (WSI) 的多样性（包括染色协议、扫描仪和组织类型等因素）导致的领域偏移问题，这使得深度学习 (DL) 算法在多队列设置中的训练和测试面临显著挑战。解决方案的关键在于提出了一种框架，通过染色矩阵扰动生成训练图像的染色增强版本，并采用染色正则化损失来强制源图像和增强图像的特征表示之间的一致性。这种方法促使模型学习染色不变性，从而实现领域不变性的特征表示，最终在跨领域的结直肠癌图像的多类组织类型分类任务中取得了优于现有最先进方法的性能。

Link: https://arxiv.org/abs/2411.15237
Authors: Manahil Raza, Saad Bashir, Talha Qaiser, Nasir Rajpoot
Keywords: digitising histology slides, histology slides involves, slides involves multiple, involves multiple factors, final appearance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The process of digitising histology slides involves multiple factors that can affect a whole slide image’s (WSI) final appearance, including the staining protocol, scanner, and tissue type. This variability constitutes a domain shift and results in significant problems when training and testing deep learning (DL) algorithms in multi-cohort settings. As such, developing robust and generalisable DL models in computational pathology (CPath) remains an open challenge. In this regard, we propose a framework that generates stain-augmented versions of the training images using stain matrix perturbation. Thereafter, we employed a stain regularisation loss to enforce consistency between the feature representations of the source and augmented images. Doing so encourages the model to learn stain-invariant and, consequently, domain-invariant feature representations. We evaluate the performance of the proposed model on cross-domain multi-class tissue type classification of colorectal cancer images and have achieved improved performance compared to other state-of-the-art methods.

[CV-212] Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

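[Quick Read]: This paper addresses generation flaws in text-to-image diffusion models, such as missing objects and incorrect attribute binding, that arise when cross-attention maps overlap incorrectly. It observes that similar text embeddings for different tokens can cause their cross-attention maps to focus on the same image regions, and that text embeddings often fail to capture the syntactic relations already present in text attention maps. The key to the solution is a method that transfers syntactic relations from the text attention maps directly to the cross-attention module via test-time optimization, improving image-text semantic alignment without relying on external guidance.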

Link: https://arxiv.org/abs/2411.15236
Authors: Jeeyung Kim, Erfan Esmaeili, Qiang Qiu
Keywords: text attention maps, image regions attended, maps, specific image regions, attention maps
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions attended. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt, “a black car and a white clock”, the cross-attention maps for “black” and “car” should focus on overlapping regions to depict a black car, while “car” and “clock” should not. Incorrect overlapping in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes the key observations investigating this issue in the existing text-to-image models:(1) the similarity in text embeddings between different tokens – used as conditioning inputs – can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture syntactic relations already within text attention maps. As a result, such syntactic relationships can be overlooked in cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.

[CV-213] CODE-CL: COnceptor-Based Gradient Projection for DEep Continual Learning

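[Quick Read]: This paper addresses catastrophic forgetting in deep neural networks during continual learning. Earlier gradient projection approaches preserve essential weight subspaces for previous tasks by limiting updates to orthogonal subspaces, which can be suboptimal when tasks are highly correlated. The key to the solution is COnceptor-based gradient projection for DEep Continual Learning (CODE-CL), which uses conceptor matrix representations to handle highly correlated tasks more flexibly. CODE-CL encodes directional importance within the input space of past tasks and modulates how much new knowledge is integrated along each direction according to that direction's relevance for prior tasks. It also analyses task overlap to enable forward knowledge transfer within intersecting subspaces, so that correlated tasks can be learned without significantly disrupting previous knowledge. Experiments on continual learning image classification benchmarks show minimal forgetting and performance above most state-of-the-art methods.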

Link: https://arxiv.org/abs/2411.15235
Authors: Marco Paul E. Apolinario, Kaushik Roy
Keywords: integrate new concepts, enabling adaptability, dynamic environments, ability to progressively, progressively integrate
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 10 pages, 2 figures

Abstract: Continual learning, or the ability to progressively integrate new concepts, is fundamental to intelligent beings, enabling adaptability in dynamic environments. In contrast, artificial deep neural networks face the challenge of catastrophic forgetting when learning new tasks sequentially. To alleviate the problem of forgetting, recent approaches aim to preserve essential weight subspaces for previous tasks by limiting updates to orthogonal subspaces via gradient projection. While effective, this approach can lead to suboptimal performance, particularly when tasks are highly correlated. In this work, we introduce COnceptor-based gradient projection for DEep Continual Learning (CODE-CL), a novel method that leverages conceptor matrix representations, a computational model inspired by neuroscience, to more flexibly handle highly correlated tasks. CODE-CL encodes directional importance within the input space of past tasks, allowing new knowledge integration in directions modulated by 1-S, where S represents the direction’s relevance for prior tasks. Additionally, we analyze task overlap using conceptor-based representations to identify highly correlated tasks, facilitating efficient forward knowledge transfer through scaled projection within their intersecting subspace. This strategy enhances flexibility, allowing learning in correlated tasks without significantly disrupting previous knowledge. Extensive experiments on continual learning image classification benchmarks validate CODE-CL’s efficacy, showcasing superior performance with minimal forgetting, outperforming most state-of-the-art methods.
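
The "modulated by 1-S" idea can be illustrated with a minimal sketch. It assumes the past-task input directions are already available as a complete orthonormal basis with per-direction variances, and it uses the standard conceptor singular value s = λ / (λ + α⁻²), where the aperture α is a hyperparameter. The function names, the toy basis, and the way the gradient is scaled are illustrative assumptions, not the authors' implementation.

```python
def conceptor_strengths(variances, aperture):
    """Conceptor singular values s_i = lam_i / (lam_i + aperture**-2).

    `variances` are the input variances lam_i of past tasks along each
    direction; values near 1 mark directions that matter for prior tasks.
    """
    return [lam / (lam + aperture ** -2) for lam in variances]


def project_gradient(grad, directions, strengths):
    """Scale the gradient component along each direction by (1 - s_i).

    Assumes `directions` is a complete orthonormal basis (hypothetical input),
    so no part of the gradient is silently dropped.
    """
    out = [0.0] * len(grad)
    for u, s in zip(directions, strengths):
        coeff = sum(g * ui for g, ui in zip(grad, u))  # component along u
        for k, ui in enumerate(u):
            out[k] += (1.0 - s) * coeff * ui
    return out


# Toy example: the first axis was heavily used by past tasks, the second was not.
directions = [[1.0, 0.0], [0.0, 1.0]]
strengths = conceptor_strengths([9.0, 0.0], aperture=1.0)
projected = project_gradient([1.0, 1.0], directions, strengths)
```

In this toy case the update along the heavily used axis is damped to roughly a tenth of its size, while the unused axis is updated freely. A hard orthogonal projection would instead remove the first component entirely.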

[CV-214] LPLgrad: Optimizing Active Learning Through Gradient Norm Sample Selection and Auxiliary Model Training

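[Quick Read]: This paper addresses the dependence of machine learning models on large volumes of annotated data, which are costly to produce. The key to the solution is a new active learning method, Loss Prediction Loss with Gradient Norm (LPLgrad), which improves image classification accuracy in two phases: (i) a training phase, in which a main model and an auxiliary model are trained jointly on the labeled data to predict the loss for input features, helping to extract complex features and learn intrinsic patterns in the data; and (ii) a querying phase, in which the main model's uncertainty is quantified by computing the gradient norm of the entropy values for samples in the unlabeled dataset. Samples with the highest gradient norms are prioritized for labeling, improving model performance with minimal labeling effort. Evaluations on real-world datasets show that the method outperforms state-of-the-art methods in accuracy when only a small number of images are labeled.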

Link: https://arxiv.org/abs/2411.15217
Authors: Shreen Gul, Mohamed Elmahallawy, Sanjay Madria, Ardhendu Tripathy
Keywords: strong generalization capabilities, Machine learning models, Machine learning, Loss Prediction Loss, generalization capabilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Machine learning models are increasingly being utilized across various fields and tasks due to their outstanding performance and strong generalization capabilities. Nonetheless, their success hinges on the availability of large volumes of annotated data, the creation of which is often labor-intensive, time-consuming, and expensive. Many active learning (AL) approaches have been proposed to address these challenges, but they often fail to fully leverage the information from the core phases of AL, such as training on the labeled set and querying new unlabeled samples. To bridge this gap, we propose a novel AL approach, Loss Prediction Loss with Gradient Norm (LPLgrad), designed to quantify model uncertainty effectively and improve the accuracy of image classification tasks. LPLgrad operates in two distinct phases: (i) Training Phase aims to predict the loss for input features by jointly training a main model and an auxiliary model. Both models are trained on the labeled data to maximize the efficiency of the learning process, an aspect often overlooked in previous AL methods. This dual-model approach enhances the ability to extract complex input features and learn intrinsic patterns from the data effectively; (ii) Querying Phase that quantifies the uncertainty of the main model to guide sample selection. This is achieved by calculating the gradient norm of the entropy values for samples in the unlabeled dataset. Samples with the highest gradient norms are prioritized for labeling and subsequently added to the labeled set, improving the model’s performance with minimal labeling effort. Extensive evaluations on real-world datasets demonstrate that the LPLgrad approach outperforms state-of-the-art methods by order of magnitude in terms of accuracy on a small number of labeled images, yet achieving comparable training and querying times in multiple image classification tasks.
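
The querying phase can be sketched roughly as follows. The abstract describes taking the gradient norm of the prediction entropy through the trained main model; this sketch swaps in a hypothetical linear classifier (logits = W x) so that the gradient has a closed form. All names are illustrative, and the analytic gradient is a property of this toy model, not of the authors' network.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_grad_norm(weights, x):
    """Frobenius norm of dH/dW, where H is the entropy of softmax(W @ x)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    p = softmax(logits)
    h = -sum(pk * math.log(pk) for pk in p if pk > 0.0)
    # For softmax outputs, dH/dz_k = -p_k * (log p_k + H).
    dh_dz = [-pk * (math.log(pk) + h) if pk > 0.0 else 0.0 for pk in p]
    # dH/dW is the outer product of dh_dz and x, so its norm factorises.
    norm_z = math.sqrt(sum(g * g for g in dh_dz))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return norm_z * norm_x


def select_for_labeling(weights, pool, budget):
    """Return indices of the `budget` unlabeled samples with the largest score."""
    scores = [entropy_grad_norm(weights, x) for x in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]
```

For a toy pool, `select_for_labeling(W, pool, budget)` returns the indices to send for annotation. Note that in this linear setting the score vanishes for an all-zero input and shrinks for very confident predictions, so it is not the same ranking as plain entropy.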

[CV-215] Image Harmonization using Robust Restricted CDF Matching

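[Quick Read]: This paper addresses the variability of input data that machine learning algorithms face in real-world deployment, where data can differ significantly among users, institutions, and scanners. The key to the solution is an image harmonization method based on Cumulative Distribution Function (CDF) matching, which uses curve fitting to apply a non-linear transformation to image intensities. The transformation is non-linear yet "smooth and elastic": it matches the template well while preserving local variability in the inputs, which may encode features that matter for subsequent machine learning processing. Compared with other histogram matching algorithms, and especially with ML-based methods, the pre-defined template CDF offers better and more intuitive control. The method is demonstrated on MRI images but is generic enough to apply to other types of imaging data.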

Link: https://arxiv.org/abs/2411.15213
Authors: Roman Stoklasa
Keywords: Deployment of machine, machine learning algorithms, difficult task, Cumulative Distribution Function, machine learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI 2025)

Abstract: Deployment of machine learning algorithms into real-world practice is still a difficult task. One of the challenges lies in the unpredictable variability of input data, which may differ significantly among individual users, institutions, scanners, etc. The input data variability can be decreased by using suitable data preprocessing with robust data harmonization. In this paper, we present a method of image harmonization using Cumulative Distribution Function (CDF) matching based on curve fitting. This approach does not ruin local variability and individual important features. The transformation of image intensities is non-linear but still “smooth and elastic”, as compared to other known histogram matching algorithms. Non-linear transformation allows for a very good match to the template. At the same time, elasticity constraints help to preserve local variability among individual inputs, which may encode important features for subsequent machine-learning processing. The pre-defined template CDF offers a better and more intuitive control for the input data transformation compared to other methods, especially ML-based ones. Even though we demonstrate our method for MRI images, the method is generic enough to apply to other types of imaging data.
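
For orientation, the baseline that the abstract contrasts against, plain empirical CDF (histogram) matching, can be sketched in a few lines. This is not the paper's curve-fitted, elasticity-constrained variant; it only shows what "matching a template CDF" means. The function name and the flat intensity lists are assumptions.

```python
import bisect


def match_cdf(values, template):
    """Map each intensity to the template quantile at the same empirical CDF level."""
    src_sorted = sorted(values)
    tpl_sorted = sorted(template)
    n, m = len(src_sorted), len(tpl_sorted)
    out = []
    for v in values:
        # Empirical CDF of v within the source intensities, in (0, 1].
        level = bisect.bisect_right(src_sorted, v) / n
        # Position of the template quantile at that CDF level.
        idx = min(m - 1, max(0, int(round(level * m)) - 1))
        out.append(tpl_sorted[idx])
    return out


# Toy example: intensities on a 10-40 scale mapped onto a 1-4 template.
harmonized = match_cdf([30, 10, 40, 20], [1, 2, 3, 4])
```

Because this mapping is an unconstrained step function, it can flatten or exaggerate local intensity differences. That is the behaviour the smooth, restricted transformation described in the abstract is meant to avoid.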

[CV-216] LightLLM: A Versatile Large Language Model for Predictive Light Sensing

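[Quick Read]: This paper addresses how to adapt pre-trained large language models (LLMs) to specialized light-based sensing tasks. The key to the solution is LightLLM, which combines a sensor data encoder, a contextual prompt, and a fusion layer to merge sensor data and textual prompts into a unified representation that the LLM then processes. The main idea is to keep the pre-trained LLM's parameters frozen and fine-tune only lightweight trainable components added around it, allowing flexible adaptation to new tasks with minimal computational overhead and retraining effort. Experiments show that LightLLM significantly outperforms state-of-the-art methods on light-based localization, outdoor solar forecasting, and indoor solar estimation, including in previously unseen environments.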

Link: https://arxiv.org/abs/2411.15211
Authors: Jiawei Hu, Hong Jia, Mahbub Hassan, Lina Yao, Brano Kusy, Wen Hu
Keywords: fine tunes pre-trained, tunes pre-trained large, pre-trained large language, large language models, fine tunes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 15 pages, 14 figures, 5 tables

Abstract:We propose LightLLM, a model that fine tunes pre-trained large language models (LLMs) for light-based sensing tasks. It integrates a sensor data encoder to extract key features, a contextual prompt to provide environmental information, and a fusion layer to combine these inputs into a unified representation. This combined input is then processed by the pre-trained LLM, which remains frozen while being fine-tuned through the addition of lightweight, trainable components, allowing the model to adapt to new tasks without altering its original parameters. This approach enables flexible adaptation of LLM to specialized light sensing tasks with minimal computational overhead and retraining effort. We have implemented LightLLM for three light sensing tasks: light-based localization, outdoor solar forecasting, and indoor solar estimation. Using real-world experimental datasets, we demonstrate that LightLLM significantly outperforms state-of-the-art methods, achieving 4.4x improvement in localization accuracy and 3.4x improvement in indoor solar estimation when tested in previously unseen environments. We further demonstrate that LightLLM outperforms ChatGPT-4 with direct prompting, highlighting the advantages of LightLLM’s specialized architecture for sensor data fusion with textual prompts.

[CV-217] Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

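[Quick Read]: This paper addresses the evaluation of deep learning models' robustness to adversarial perturbations in safety-critical applications, in particular the challenges of large-scale testing and of making evaluations reflect real-world adversarial risks. The key to the solution is a new individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. PMA can outperform current state-of-the-art individual attacks and serves as the basis for two types of ensemble attacks that balance effectiveness and efficiency. The paper also builds a million-scale dataset, CC1M, and uses it to run the first million-scale white-box adversarial robustness evaluation of adversarially trained ImageNet models, revealing robustness gaps between individual and ensemble attacks and between small-scale and million-scale evaluations.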

Link: https://arxiv.org/abs/2411.15210
Authors: Yong Xie, Weijie Zheng, Hanxun Huang, Guangnan Ye, Xingjun Ma
Keywords: deep learning models, safety-critical applications, evaluating their vulnerabilities, reliability and trustworthiness, deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerabilities to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness evaluation methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. We analyze the relationship between PMA and existing cross-entropy or logits-margin-based attacks, and show that PMA can outperform the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we create a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially-trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual versus ensemble attacks and small-scale versus million-scale evaluations.
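
The difference between a margin in logits space and one in probability space can be shown with a small sketch. The abstract does not give the exact PMA loss, so the probability margin below (true-class probability minus the largest other-class probability) is an assumed form chosen by analogy with the usual logits margin. The function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_margin(logits, label):
    """True-class logit minus the largest other-class logit."""
    other = max(z for j, z in enumerate(logits) if j != label)
    return logits[label] - other


def probability_margin(logits, label):
    """True-class probability minus the largest other-class probability (assumed form)."""
    p = softmax(logits)
    other = max(pj for j, pj in enumerate(p) if j != label)
    return p[label] - other


# Both margins are negative exactly when the sample is misclassified.
example_logits = [2.0, 1.0, 0.0]
lm = logit_margin(example_logits, label=0)
pm = probability_margin(example_logits, label=0)
```

An attack would perturb the input to push the chosen margin below zero. The two margins always share a sign, but the probability margin is bounded in [-1, 1] and depends on all logits through the softmax normalisation, so the optimisation landscape differs.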

[CV-218] DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh

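[Quick Read]: This paper addresses a limitation of existing text-driven avatar generation methods, which model the human body and all garments as a single 3D model, making clothing replacement difficult and reducing user control over the generation process. The key to the solution is DAGSM (Disentangled Avatar Generation with GS-enhanced Mesh), a pipeline that models each part of the clothed human (for example the body and the upper and lower clothes) as a separate GS-enhanced mesh (GSM) and introduces a semantic-based algorithm to better separate body from clothing and garments from each other. A view-consistent texture refinement module, comprising a cross-view attention mechanism and an incident-angle-weighted denoising (IAW-DE) strategy, improves texture quality. The result is high-quality disentangled avatars that support clothing replacement and realistic animation.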

Link: https://arxiv.org/abs/2411.15205
Authors: Jingyu Zhuang, Di Kang, Linchao Bao, Liang Lin, Guanbin Li
Keywords: Text-driven avatar generation, gained significant attention, significant attention owing, Text-driven avatar, gained significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Text-driven avatar generation has gained significant attention owing to its convenience. However, existing methods typically model the human body with all garments as a single 3D model, limiting its usability, such as clothing replacement, and reducing user control over the generation process. To overcome the limitations above, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. Specifically, we model each part (e.g., body, upper/lower clothes) of the clothed human as one GS-enhanced mesh (GSM), which is a traditional mesh attached with 2D Gaussians to better handle complicated textures (e.g., woolen, translucent clothes) and produce realistic cloth animations. During the generation, we first create the unclothed body, followed by a sequence of individual cloth generation based on the body, where we introduce a semantic-based algorithm to achieve better human-cloth and garment-garment separation. To improve texture quality, we propose a view-consistent texture refinement module, including a cross-view attention mechanism for texture style consistency and an incident-angle-weighted denoising (IAW-DE) strategy to update the appearance. Extensive experiments have demonstrated that DAGSM generates high-quality disentangled avatars, supports clothing replacement and realistic animation, and outperforms the baselines in visual quality.

[CV-219] Label Distribution Shift-Aware Prediction Refinement for Test-Time Adaptation

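[Quick Read]: This paper addresses the significant performance drop that test-time adaptation (TTA) methods suffer under label distribution shifts. The key to the solution is a new method, label Distribution shift-Aware prediction Refinement for Test-time adaptation (DART). DART trains a prediction refinement module by exposing it to batches with diverse class distributions drawn from the training dataset. At test time the module detects and corrects class distribution shifts, significantly improving pseudo-label accuracy on test data. The method yields 5-18% accuracy gains under label distribution shifts on CIFAR-10C with no degradation when no label distribution shift is present, making it a useful plug-in for existing TTA methods.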

Link: https://arxiv.org/abs/2411.15204
Authors: Minguk Jang, Hye Won Chung
Keywords: encountering input distribution, distribution shifts, TTA methods, TTA, label distribution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Test-time adaptation (TTA) is an effective approach to mitigate performance degradation of trained models when encountering input distribution shifts at test time. However, existing TTA methods often suffer significant performance drops when facing additional class distribution shifts. We first analyze TTA methods under label distribution shifts and identify the presence of class-wise confusion patterns commonly observed across different covariate shifts. Based on this observation, we introduce label Distribution shift-Aware prediction Refinement for Test-time adaptation (DART), a novel TTA method that refines the predictions by focusing on class-wise confusion patterns. DART trains a prediction refinement module during an intermediate time by exposing it to several batches with diverse class distributions using the training dataset. This module is then used during test time to detect and correct class distribution shifts, significantly improving pseudo-label accuracy for test data. Our method exhibits 5-18% gains in accuracy under label distribution shifts on CIFAR-10C, without any performance degradation when there is no label distribution shift. Extensive experiments on CIFAR, PACS, OfficeHome, and ImageNet benchmarks demonstrate DART’s ability to correct inaccurate predictions caused by test-time distribution shifts. This improvement leads to enhanced performance in existing TTA methods, making DART a valuable plug-in tool.

[CV-220] Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking COLING2025

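[Quick Read]: This paper addresses the shortcomings of current benchmarks for Vision Language Models (VLMs) in testing how well models understand and process complex visual and textual content. Existing benchmarks typically focus on simple tasks and do not fully assess deep reasoning or the integration of multiple data modalities. The proposed solution is the PARROT-360V benchmark, a comprehensive set of 2487 challenging visual puzzles designed to test VLMs on complex visual reasoning tasks. Leading models (GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro) are evaluated on how well they combine visual clues with language skills to solve these tasks. The results expose the limitations of current VLMs on complex, multi-step reasoning tasks and underline the need for more robust evaluation frameworks.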

Link: https://arxiv.org/abs/2411.15201
Authors: Harsha Vardhan Khurdula, Basem Rizk, Indus Khaitan, Janit Anjaria, Aviral Srivastava, Rajvardhan Khaitan
Keywords: evaluating Vision Language, Vision Language Models, evaluating Vision, assessing model abilities, Vision Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, Accepted at COLING 2025

Abstract:Current benchmarks for evaluating Vision Language Models (VLMs) often fall short in thoroughly assessing model abilities to understand and process complex visual and textual content. They typically focus on simple tasks that do not require deep reasoning or the integration of multiple data modalities to solve an original problem. To address this gap, we introduce the PARROT-360V Benchmark, a novel and comprehensive benchmark featuring 2487 challenging visual puzzles designed to test VLMs on complex visual reasoning tasks. We evaluated leading models: GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro, using PARROT-360V to assess their capabilities in combining visual clues with language skills to solve tasks in a manner akin to human problem-solving. Our findings reveal a notable performance gap: state-of-the-art models scored between 28 to 56 percentage on our benchmark, significantly lower than their performance on popular benchmarks. This underscores the limitations of current VLMs in handling complex, multi-step reasoning tasks and highlights the need for more robust evaluation frameworks to advance the field.

[CV-221] Deep Learning-Based Classification of Hyperkinetic Movement Disorders in Children

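[Quick Read]: This paper addresses the difficulty of diagnosing hyperkinetic movement disorders (HMDs) in children, specifically distinguishing dystonia from chorea. The key to the solution is a neural network model that differentiates the two conditions from video recordings of children performing motor tasks. The model combines a Graph Convolutional Network (GCN), which captures spatial relationships, with Long Short-Term Memory (LSTM) networks, which capture temporal dynamics, and adds attention mechanisms to improve interpretability. Trained and validated on a dataset of 50 videos, the model achieved 85% accuracy, 81% sensitivity, and 88% specificity, showing the potential of deep learning to improve the accuracy and efficiency of HMD diagnosis.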

Link: https://arxiv.org/abs/2411.15200
Authors: Nandika Ramamurthy, Dr Daniel Lumsden, Dr Rachel Sparks
Keywords: Hyperkinetic movement disorders, pose significant diagnostic, overlapping clinical features, significant diagnostic challenges, abnormal twisting
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 59 pages, 20 figures

Abstract:Hyperkinetic movement disorders (HMDs) in children, including dystonia (abnormal twisting) and chorea (irregular, random movements), pose significant diagnostic challenges due to overlapping clinical features. The prevalence of dystonia ranges from 2 to 50 per million, and chorea from 5 to 10 per 100,000. These conditions are often diagnosed with delays averaging 4.75 to 7.83 years. Traditional diagnostic methods depend on clinical history and expert physical examinations, but specialized tests are ineffective due to the complex pathophysiology of these disorders. This study develops a neural network model to differentiate between dystonia and chorea from video recordings of paediatric patients performing motor tasks. The model integrates a Graph Convolutional Network (GCN) to capture spatial relationships and Long Short-Term Memory (LSTM) networks to account for temporal dynamics. Attention mechanisms were incorporated to improve model interpretability. The model was trained and validated on a dataset of 50 videos (31 chorea-predominant, 19 dystonia-predominant) collected under regulatory approval from Guy’s and St Thomas’ NHS Foundation Trust. The model achieved 85% accuracy, 81% sensitivity, and 88% specificity at 15 frames per second. Attention maps highlighted the model’s ability to correctly identify involuntary movement patterns, with misclassifications often due to occluded body parts or subtle movement variations. This work demonstrates the potential of deep learning to improve the accuracy and efficiency of HMD diagnosis and could contribute to more reliable, interpretable clinical tools.

[CV-222] Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation

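[Quick Read]: This paper addresses the controllability of diffusion models, that is, how to control not only the type of generation result but also the length and parameters of the generation process. The key to the solution is a new framework, the Adaptively Controllable Diffusion (AC-Diff) Model. A Conditional Time-Step (CTS) Module determines the number of steps needed for a generation, and an Adaptive Hybrid Noise Schedule (AHNS) Module then estimates the diffusion rate parameters according to that length. The network is trained with a corresponding adaptive sampling mechanism so that it learns to adjust itself to the conditions. The stated goal is to largely reduce the average number of generation steps and the execution time while maintaining the performance of diffusion models in the literature.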

Link: https://arxiv.org/abs/2411.15199
Authors: Yucheng Xing, Xiaodong Liu, Xin Wang
Keywords: artificial intelligence, represent the creativity, development of artificial, put onto generative, important aspect
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: With the development of artificial intelligence, more and more attention has been put onto generative models, which represent the creativity, a very important aspect of intelligence. In recent years, diffusion models have been studied and proven to be more reasonable and effective than previous methods. However, common diffusion frameworks suffer from controllability problems. Although extra conditions have been considered by some work to guide the diffusion process for a specific target generation, it only controls the generation result but not its process. In this work, we propose a new adaptive framework, Adaptively Controllable Diffusion (AC-Diff) Model, to automatically and fully control the generation process, including not only the type of generation result but also the length and parameters of the generation process. Both inputs and conditions will be first fed into a Conditional Time-Step (CTS) Module to determine the number of steps needed for a generation. Then according to the length of the process, the diffusion rate parameters will be estimated through our Adaptive Hybrid Noise Schedule (AHNS) Module. We further train the network with the corresponding adaptive sampling mechanism to learn how to adjust itself according to the conditions for the overall performance improvement. To enable its practical applications, AC-Diff is expected to largely reduce the average number of generation steps and execution time while maintaining the same performance as done in the literature diffusion models.

[CV-223] Gradient-Weighted Feature Back-Projection: A Fast Alternative to Feature Distillation in 3D Gaussian Splatting

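[Quick Read]: This paper addresses feature field rendering in Gaussian splatting, in particular how to obtain high-quality 2D and 3D segmentation without training. The key to the solution is a training-free method that back-projects 2D features into pre-trained 3D Gaussians using a weighted sum based on each Gaussian's influence in the final rendering. Whereas most training-based feature field rendering methods do well at 2D segmentation but poorly at 3D segmentation without post-processing, this method achieves high-quality results in both. Experiments show that it is fast and scalable, with performance comparable to training-based methods.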

Link: https://arxiv.org/abs/2411.15193
Authors: Joji Joseph, Bharadwaj Amrutur, Shalabh Bhatnagar
Keywords: Gaussian splatting, feature field rendering, introduce a training-free, feature field, field rendering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce a training-free method for feature field rendering in Gaussian splatting. Our approach back-projects 2D features into pre-trained 3D Gaussians, using a weighted sum based on each Gaussian’s influence in the final rendering. While most training-based feature field rendering methods excel at 2D segmentation but perform poorly at 3D segmentation without post-processing, our method achieves high-quality results in both 2D and 3D segmentation. Experimental results demonstrate that our approach is fast, scalable, and offers performance comparable to training-based methods.
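
The weighted back-projection can be sketched as a simple accumulation. The sketch assumes that the per-pixel influence of each Gaussian (for instance its blending weight in the rendered pixel) has already been extracted from the renderer and is passed in as a dense table. The table layout, the function name, and the normalisation by total weight are illustrative assumptions.

```python
def back_project(features, weights):
    """Aggregate 2D per-pixel features into per-Gaussian features.

    features[p]   -- feature vector of pixel p
    weights[p][g] -- influence of Gaussian g on pixel p (hypothetical input)
    Returns one feature vector per Gaussian: the influence-weighted average
    of the pixel features it contributed to.
    """
    num_gaussians = len(weights[0])
    dim = len(features[0])
    acc = [[0.0] * dim for _ in range(num_gaussians)]
    total = [0.0] * num_gaussians
    for feat, w_row in zip(features, weights):
        for g, w in enumerate(w_row):
            total[g] += w
            for d in range(dim):
                acc[g][d] += w * feat[d]
    # Normalise so Gaussians seen by many pixels are not over-weighted.
    return [[a / t if t > 0.0 else 0.0 for a in row] for row, t in zip(acc, total)]


# Two pixels, two Gaussians: the second Gaussian contributes equally to both pixels.
gaussian_features = back_project(
    features=[[1.0, 0.0], [0.0, 1.0]],
    weights=[[1.0, 0.5], [0.0, 0.5]],
)
```

Because this is a single pass over the rendered views with no gradient steps, the cost scales with the number of pixel-Gaussian contributions, which is consistent with the speed claim in the abstract.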

[CV-224] LegoPET: Hierarchical Feature Guided Conditional Diffusion for PET Image Reconstruction

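[Quick Read]: This paper addresses limitations in positron emission tomography (PET) image reconstruction, where deep learning (DL) methods that reconstruct PET images directly from raw data (sinograms) can produce overly smoothed images or introduce artifacts. The key to the solution is LegoPET, a hierarchical feature guided conditional diffusion probabilistic model (cDPM). The hierarchical feature guidance helps maintain correspondence and consistency between input and output images while improving perceptual quality and pixel-level PSNR/SSIM metrics. Experiments show that LegoPET not only improves on cDPMs but also surpasses recent DL-based PET image reconstruction techniques in visual quality and quantitative metrics.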

Link: https://arxiv.org/abs/2411.16629
Authors: Yiran Sun, Osama Mawlawi
Keywords: Positron emission tomography, cancer detection due, Positron emission, emission tomography, processes in vivo
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures

Abstract:Positron emission tomography (PET) is widely utilized for cancer detection due to its ability to visualize functional and biological processes in vivo. PET images are usually reconstructed from histogrammed raw data (sinograms) using traditional iterative techniques (e.g., OSEM, MLEM). Recently, deep learning (DL) methods have shown promise by directly mapping raw sinogram data to PET images. However, DL approaches that are regression-based or GAN-based often produce overly smoothed images or introduce various artifacts respectively. Image-conditioned diffusion probabilistic models (cDPMs) are another class of likelihood-based DL techniques capable of generating highly realistic and controllable images. While cDPMs have notable strengths, they still face challenges such as maintain correspondence and consistency between input and output images when they are from different domains (e.g., sinogram vs. image domain) as well as slow convergence rates. To address these limitations, we introduce LegoPET, a hierarchical feature guided conditional diffusion model for high-perceptual quality PET image reconstruction from sinograms. We conducted several experiments demonstrating that LegoPET not only improves the performance of cDPMs but also surpasses recent DL-based PET image reconstruction techniques in terms of visual quality and pixel-level PSNR/SSIM metrics. Our code is available at this https URL.

[CV-225] PriorPath: Coarse-To-Fine Approach for Controlled De-Novo Pathology Semantic Masks Generation

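[Quick Read]: This paper addresses biased datasets in digital pathology, which result from the diversity of tissue samples and the effort required for image labeling and which limit the applicability of algorithms trained on them. The key to the solution is a pipeline called PriorPath, which generates detailed, realistic semantic masks from coarse-grained images that delineate tissue regions. This gives control over the spatial arrangement of the generated masks and, consequently, over the resulting synthetic images. PriorPath covers the semantic mask space and achieves better similarity to real masks than previous methods, producing both realistic masks and images within a single platform and offering a controllable way to generate histopathological images for AI in computational pathology.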

Link: https://arxiv.org/abs/2411.16515
Authors: Nati Daniel, May Nathan, Eden Azeroual, Yael Fisher, Yonatan Savir
Keywords: Incorporating artificial intelligence, offers promising prospects, Incorporating artificial, digital pathology offers, pathology offers promising
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Incorporating artificial intelligence (AI) into digital pathology offers promising prospects for automating and enhancing tasks such as image analysis and diagnostic processes. However, the diversity of tissue samples and the necessity for meticulous image labeling often result in biased datasets, constraining the applicability of algorithms trained on them. To harness synthetic histopathological images to cope with this challenge, it is essential not only to produce photorealistic images but also to be able to exert control over the cellular characteristics they depict. Previous studies used methods to generate, from random noise, semantic masks that captured the spatial distribution of the tissue. These masks were then used as a prior for conditional generative approaches to produce photorealistic histopathological images. However, as with many other generative models, this solution exhibits mode collapse as the model fails to capture the full diversity of the underlying data distribution. In this work, we present a pipeline, coined PriorPath, that generates detailed, realistic, semantic masks derived from coarse-grained images delineating tissue regions. This approach enables control over the spatial arrangement of the generated masks and, consequently, the resulting synthetic images. We demonstrated the efficacy of our method across three cancer types, skin, prostate, and lung, showcasing PriorPath’s capability to cover the semantic mask space and to provide better similarity to real masks compared to previous methods. Our approach allows for specifying desired tissue distributions and obtaining both photorealistic masks and images within a single platform, thus providing a state-of-the-art, controllable solution for generating histopathological images to facilitate AI for computational pathology.

[CV-226] Comparison of Generative Learning Methods for Turbulence Modeling

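【Quick read】: This paper addresses the high computational cost of numerical simulations of turbulent flows, in particular the fact that high-resolution techniques such as Direct Numerical Simulation (DNS) and Large Eddy Simulation (LES) are generally unaffordable for technologically relevant problems. The key to the solution is to use recent advances in generative probabilistic models, specifically Variational Autoencoders (VAE), Deep Convolutional Generative Adversarial Networks (DCGAN), and Denoising Diffusion Probabilistic Models (DDPM), to simulate a 2D Kármán vortex street. With training data obtained from LES, the paper evaluates each model's ability to capture the statistical properties and spatial structures of the turbulent flow. The results show that DDPM and DCGAN effectively replicate the flow distribution, indicating their potential as efficient and accurate turbulence modeling tools. DCGAN in particular, although harder to train (for example because of mode collapse), has the shortest inference and training times, needs less data, and produces results most closely aligned with the input stream.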

Link: https://arxiv.org/abs/2411.16417
Authors: Claudia Drygala,Edmund Ross,Francesca di Mare,Hanno Gottschalk
Keywords (EN): Direct Numerical Simulation, Large Eddy Simulation, present significant challenges, high computational cost, Numerical simulations
Categories: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Numerical simulations of turbulent flows present significant challenges in fluid dynamics due to their complexity and high computational cost. High resolution techniques such as Direct Numerical Simulation (DNS) and Large Eddy Simulation (LES) are generally not computationally affordable, particularly for technologically relevant problems. Recent advances in machine learning, specifically in generative probabilistic models, offer promising alternatives for turbulence modeling. This paper investigates the application of three generative models - Variational Autoencoders (VAE), Deep Convolutional Generative Adversarial Networks (DCGAN), and Denoising Diffusion Probabilistic Models (DDPM) - in simulating a 2D Kármán vortex street around a fixed cylinder. Training data was obtained by means of LES. We evaluate each model’s ability to capture the statistical properties and spatial structures of the turbulent flow. Our results demonstrate that DDPM and DCGAN effectively replicate the flow distribution, highlighting their potential as efficient and accurate tools for turbulence modeling. We find a strong argument for DCGAN, as although they are more difficult to train (due to problems such as mode collapse), they gave the fastest inference and training time, require less data to train compared to VAE and DDPM, and provide the results most closely aligned with the input stream. In contrast, VAE train quickly (and can generate samples quickly) but do not produce adequate results, and DDPM, whilst effective, is significantly slower at both inference and training time.

[CV-227] Privacy-Preserving Federated Foundation Model for Generalist Ultrasound Artificial Intelligence

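【Quick read】: This paper addresses limitations of conventional ultrasound diagnostics: heavy dependence on physician expertise, suboptimal image quality, and a high rate of diagnostic errors. The key to the solution is UltraFedFM, a privacy-preserving ultrasound foundation model. UltraFedFM is collaboratively pre-trained with federated learning across 16 medical institutions in 9 countries, using more than 1 million images covering 19 organs and 10 ultrasound modalities. This approach addresses the privacy concerns raised by large-scale labeled data and, through diverse data and a secure training framework, gives the model strong generalization and diagnostic capabilities, improving both diagnostic accuracy and privacy protection in clinical practice.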

Link: https://arxiv.org/abs/2411.16380
Authors: Yuncheng Jiang,Chun-Mei Feng,Jinke Ren,Jun Wei,Zixun Zhang,Yiwen Hu,Yunbi Liu,Rui Sun,Xuemei Tang,Juan Du,Xiang Wan,Yong Xu,Bo Du,Xin Gao,Guangyu Wang,Shaohua Zhou,Shuguang Cui,Rick Siow Mong Goh,Yong Liu,Zhen Li
Keywords (EN): non-invasive nature, nature and real-time, Ultrasound, Ultrasound imaging, clinical diagnosis due
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Ultrasound imaging is widely used in clinical diagnosis due to its non-invasive nature and real-time capabilities. However, conventional ultrasound diagnostics face several limitations, including high dependence on physician expertise and suboptimal image quality, which complicates interpretation and increases the likelihood of diagnostic errors. Artificial intelligence (AI) has emerged as a promising solution to enhance clinical diagnosis, particularly in detecting abnormalities across various biomedical imaging modalities. Nonetheless, current AI models for ultrasound imaging face critical challenges. First, these models often require large volumes of labeled medical data, raising concerns over patient privacy breaches. Second, most existing models are task-specific, which restricts their broader clinical utility. To overcome these challenges, we present UltraFedFM, an innovative privacy-preserving ultrasound foundation model. UltraFedFM is collaboratively pre-trained using federated learning across 16 distributed medical institutions in 9 countries, leveraging a dataset of over 1 million ultrasound images covering 19 organs and 10 ultrasound modalities. This extensive and diverse data, combined with a secure training framework, enables UltraFedFM to exhibit strong generalization and diagnostic capabilities. It achieves an average area under the receiver operating characteristic curve of 0.927 for disease diagnosis and a dice similarity coefficient of 0.878 for lesion segmentation. Notably, UltraFedFM surpasses the diagnostic accuracy of mid-level ultrasonographers and matches the performance of expert-level sonographers in the joint diagnosis of 8 common systemic diseases. These findings indicate that UltraFedFM can significantly enhance clinical diagnostics while safeguarding patient privacy, marking an advancement in AI-driven ultrasound imaging for future clinical applications.
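
The abstract says UltraFedFM is pre-trained with federated learning but does not name the aggregation rule. As a rough illustration only, the sketch below shows federated averaging (FedAvg), one common rule, in which each institution shares model weights and a sample count rather than images. The function name `fed_avg` and the toy numbers are hypothetical and are not taken from the paper.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += w * size / total
    return global_weights


# Three hypothetical institutions; raw images never leave a site,
# only the locally trained parameters and the sample counts do.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
aggregated = fed_avg(clients, sizes)
```

The third client contributes half of the data, so its parameters carry half of the weight in the aggregate.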

[CV-228] WTDUN: Wavelet Tree-Structured Sampling and Deep Unfolding Network for Image Compressed Sensing

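【Quick read】: This paper addresses two main problems of existing deep unfolding networks for compressed sensing (CS): 1) they learn directly from single-channel images, which yields a simple feature representation that does not fully capture complex features; and 2) they treat all image components uniformly, ignoring the characteristics of each component. The key to the solution is a new wavelet-domain deep unfolding framework named WTDUN. The framework operates directly on multi-scale wavelet subbands and exploits the intrinsic sparsity and multi-scale structure of wavelet coefficients to perform tree-structured sampling and reconstruction, capturing and highlighting the most important image features. Specifically, the tree-structured reconstruction captures the inter-dependencies among multi-scale subbands and identifies both fine and coarse features, which markedly improves reconstruction quality. In addition, a wavelet-domain adaptive sampling method assigns measurements to each wavelet subband according to its importance, which greatly improves sampling capability. By focusing on important subbands and considering their energy and sparsity, the method captures key information more efficiently while discarding less important information, yielding a more effective and detailed reconstruction.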

Link: https://arxiv.org/abs/2411.16336
Authors: Kai Han,Jin Wang,Yunhui Shi,Hanqin Cai,Nam Ling,Baocai Yin
Keywords (EN): gained increasing attention, Deep unfolding networks, Deep unfolding, compressed sensing, networks have gained
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, accepted by ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

Abstract:Deep unfolding networks have gained increasing attention in the field of compressed sensing (CS) owing to their theoretical interpretability and superior reconstruction performance. However, most existing deep unfolding methods often face the following issues: 1) they learn directly from single-channel images, leading to a simple feature representation that does not fully capture complex features; and 2) they treat various image components uniformly, ignoring the characteristics of different components. To address these issues, we propose a novel wavelet-domain deep unfolding framework named WTDUN, which operates directly on the multi-scale wavelet subbands. Our method utilizes the intrinsic sparsity and multi-scale structure of wavelet coefficients to achieve a tree-structured sampling and reconstruction, effectively capturing and highlighting the most important features within images. Specifically, the design of tree-structured reconstruction aims to capture the inter-dependencies among the multi-scale subbands, enabling the identification of both fine and coarse features, which can lead to a marked improvement in reconstruction quality. Furthermore, a wavelet domain adaptive sampling method is proposed to greatly improve the sampling capability, which is realized by assigning measurements to each wavelet subband based on its importance. Unlike pure deep learning methods that treat all components uniformly, our method introduces a targeted focus on important subbands, considering their energy and sparsity. This targeted strategy lets us capture key information more efficiently while discarding less important information, resulting in a more effective and detailed reconstruction. Extensive experimental results on various datasets validate the superior performance of our proposed method.
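
To make the idea of assigning measurements per subband concrete, here is a minimal sketch that splits a measurement budget across the four subbands of a one-level Haar transform in proportion to their energy. It is an assumption-laden toy: the paper weighs both energy and sparsity and works on multi-scale subbands inside a learned network, none of which is reproduced here. The helper names, the LH/HL naming convention, and the test image are hypothetical.

```python
def haar_subbands(image):
    """One-level orthonormal 2D Haar transform (even height and width assumed)."""
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for r in range(0, len(image), 2):
        rows = {name: [] for name in bands}
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows["LL"].append((a + b + d + e) / 2)  # local average
            rows["LH"].append((a - b + d - e) / 2)  # left-right differences
            rows["HL"].append((a + b - d - e) / 2)  # top-bottom differences
            rows["HH"].append((a - b - d + e) / 2)  # diagonal differences
        for name in bands:
            bands[name].append(rows[name])
    return bands


def subband_energy(band):
    """Sum of squared coefficients in one subband."""
    return sum(v * v for row in band for v in row)


def allocate_measurements(energies, budget):
    """Split `budget` measurements across subbands in proportion to energy."""
    total = sum(energies.values())  # assumed to be non-zero
    shares = {name: budget * e / total for name, e in energies.items()}
    allocation = {name: int(share) for name, share in shares.items()}
    leftover = budget - sum(allocation.values())
    # Hand the remaining measurements to the largest fractional shares.
    by_remainder = sorted(shares, key=lambda n: shares[n] - allocation[n], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


image = [
    [10, 0, 8, 8],
    [10, 0, 8, 8],
    [2, 2, 1, 0],
    [2, 2, 0, 1],
]
bands = haar_subbands(image)
energies = {name: subband_energy(band) for name, band in bands.items()}
allocation = allocate_measurements(energies, budget=16)
```

In this toy image most energy sits in the low-pass band and in the band that responds to the vertical edge, so those two bands receive nearly all of the budget.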

[CV-229] Oriented histogram-based vector field embedding for characterizing 4D CT data sets in radiotherapy

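【Quick read】: This paper addresses a challenge in lung radiotherapy: breathing-induced motion of lung tissue, which affects the precision of treatment. The key to the solution is a prospective analysis that relies on pre-treatment information (such as planning CT scans and vector fields obtained from deformable image registration), together with dimensionality reduction that turns this information into a low-dimensional 2D oriented histogram representation. Specifically, the paper proposes a voxel-wise dimensionality reduction based on a spherical coordinate transformation, combined with an unsupervised UMAP embedding, to encode patient-specific motion information into comparable low-dimensional representations. The method makes it possible to compare breathing patterns across patients and can inform the design of personalized treatment strategies.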

Link: https://arxiv.org/abs/2411.16314
Authors: Frederic Madesta,Lukas Wimmert,Tobias Gauer,René Werner,Thilo Sentker
Keywords (EN): optimize treatment outcomes, target volume, primary objective, outcomes by minimizing, minimizing exposure
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In lung radiotherapy, the primary objective is to optimize treatment outcomes by minimizing exposure to healthy tissues while delivering the prescribed dose to the target volume. The challenge lies in accounting for lung tissue motion due to breathing, which impacts precise treatment alignment. To address this, the paper proposes a prospective approach that relies solely on pre-treatment information, such as planning CT scans and derived data like vector fields from deformable image registration. This data is compared to analogous patient data to tailor treatment strategies, i.e., to be able to review treatment parameters and success for similar patients. To allow for such a comparison, an embedding and clustering strategy of prospective patient data is needed. Therefore, the main focus of this study lies on reducing the dimensionality of deformable registration-based vector fields by employing a voxel-wise spherical coordinate transformation and a low-dimensional 2D oriented histogram representation. Afterwards, a fully unsupervised UMAP embedding of the encoded vector fields (i.e., patient-specific motion information) becomes applicable. The functionality of the proposed method is demonstrated with 71 in-house acquired 4D CT data sets and 33 external 4D CT data sets. A comprehensive analysis of the patient clusters is conducted, focusing on the similarity of breathing patterns of clustered patients. The proposed general approach of reducing the dimensionality of registration vector fields by encoding the inherent information into oriented histograms is, however, applicable to other tasks.
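
The encoding step described in the abstract (a voxel-wise spherical coordinate transformation followed by a 2D oriented histogram) can be sketched roughly as follows. The bin counts, the magnitude weighting, and the normalization are assumptions made for illustration; the abstract does not specify them, and `oriented_histogram` is a hypothetical name.

```python
import math


def oriented_histogram(vectors, n_theta=4, n_phi=8):
    """Bin displacement vectors by direction (polar, azimuth), weighted by length."""
    hist = [[0.0] * n_phi for _ in range(n_theta)]
    for dx, dy, dz in vectors:
        r = math.sqrt(dx * dx + dy * dy + dz * dz)
        if r == 0.0:
            continue  # a zero displacement has no direction
        theta = math.acos(max(-1.0, min(1.0, dz / r)))  # polar angle in [0, pi]
        phi = math.atan2(dy, dx)                        # azimuth in [-pi, pi]
        i = min(int(theta / math.pi * n_theta), n_theta - 1)
        j = min(int((phi + math.pi) / (2 * math.pi) * n_phi), n_phi - 1)
        hist[i][j] += r
    total = sum(sum(row) for row in hist)
    if total > 0:
        hist = [[v / total for v in row] for row in hist]
    return hist


# Toy "vector field": two displacements along the z axis and one along x.
field = [(0, 0, 2), (0, 0, -1), (1, 0, 0)]
hist = oriented_histogram(field)
```

Every patient's field maps to a histogram of the same fixed size regardless of image dimensions, which is what makes a subsequent embedding such as UMAP applicable.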

[CV-230] EigenHearts: Cardiac Diseases Classification Using EigenFaces Approach

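【Quick read】: This paper addresses the shortage of image data that cardiovascular medicine faces when applying data science techniques to classify cardiac diseases. The key to the solution is to combine the classical EigenFaces approach with convolutional neural networks (CNNs) to classify echocardiography images of mice. Specifically, the echocardiography data are preprocessed with principal component analysis (PCA) to produce a set of modes called "eigenhearts"; the original images are then projected onto these modes and fed to the CNN for classification. The experiments show that preprocessing with singular value decomposition (SVD) increases classification accuracy by approximately 50%.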

Link: https://arxiv.org/abs/2411.16227
Authors: Nourelhouda Groun,Maria Villalba-Orero,Lucia Casado-Martin,Enrique Lara-Pezzi,Eusebio Valero,Soledad Le Clainche,Jesus Garicano-Mena
Keywords (EN): medical imaging plays, making precise diagnoses, accurately classifying cardiac, cardiovascular medicine, medical imaging
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures, 3 tables

Abstract:In the realm of cardiovascular medicine, medical imaging plays a crucial role in accurately classifying cardiac diseases and making precise diagnoses. However, the field faces significant challenges when integrating data science techniques, as a significant volume of images is required for these techniques. As a consequence, it is necessary to investigate different avenues to overcome this challenge. In this contribution, we offer an innovative tool to conquer this limitation. In particular, we delve into the application of a well recognized method known as the EigenFaces approach to classify cardiac diseases. This approach was originally motivated for efficiently representing pictures of faces using principal component analysis, which provides a set of eigenvectors (aka eigenfaces), explaining the variation between face images. As this approach has proven to be efficient for face recognition, it motivated us to explore its efficiency on more complicated databases. In particular, we integrate this approach with convolutional neural networks (CNNs) to classify echocardiography images taken from mice in five distinct cardiac conditions (healthy, diabetic cardiomyopathy, myocardial infarction, obesity and TAC hypertension). Performing a preprocessing step inspired by the eigenfaces approach on the echocardiography datasets yields sets of POD modes, which we will call eigenhearts. To demonstrate the proposed approach, we compare two test cases: (i) supplying the CNN with the original images directly, (ii) supplying the CNN with images projected into the obtained POD modes. The results show a substantial and noteworthy enhancement when employing SVD for pre-processing, with classification accuracy increasing by approximately 50%.
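
As a rough picture of what projecting an image onto an eigenface-style mode means, the sketch below finds the leading principal direction of a tiny dataset and computes one sample's coefficient along it. Power iteration is swapped in here for the full PCA/SVD so that the example needs no external library, and it recovers only the first mode. The two-pixel "images" and all names are hypothetical.

```python
import math


def leading_mode(samples, iterations=50):
    """Return (mean, unit vector) of the dominant principal direction."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    centered = [[s[i] - mean[i] for i in range(dim)] for s in samples]
    # Non-uniform start vector; power iteration fails if the start is
    # exactly orthogonal to the dominant direction.
    v = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        # w = X^T (X v): applies the scatter matrix without forming it.
        proj = [sum(c[i] * v[i] for i in range(dim)) for c in centered]
        w = [sum(proj[k] * centered[k][i] for k in range(n)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def project(sample, mean, mode):
    """Coefficient of a sample along one mode after mean removal."""
    return sum((s - m) * u for s, m, u in zip(sample, mean, mode))


# Four toy "images" with two pixels each, all lying on one line.
samples = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [9.0, 12.0]]
mean, mode = leading_mode(samples)
coefficient = project(samples[3], mean, mode)
```

In an eigenfaces-style pipeline, a handful of such coefficients (or the image rebuilt from them) would stand in for the raw pixels.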

[CV-231] UltraSam: A Foundation Model for Ultrasound using Large Open-Access Segmentation Datasets

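【Quick read】: This paper addresses the difficulty of automated ultrasound image analysis, which stems mainly from anatomical complexity and limited annotated data. The key to the solution is a data-centric approach: the authors assemble US-43d, which they describe as the largest public ultrasound segmentation dataset, with over 280,000 images and segmentation masks for more than 50 anatomical structures. They then introduce UltraSam, an adaptation of the Segment Anything Model (SAM) trained specifically on ultrasound images, which supports both point and box prompts. UltraSam clearly outperforms existing SAM-style models on prompt-based segmentation and also serves as a foundation model for downstream analysis, performing well on a range of segmentation and classification tasks after fine-tuning.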

Link: https://arxiv.org/abs/2411.16222
Authors: Adrien Meyer,Aditya Murali,Didier Mutter,Nicolas Padoy
Keywords (EN): Automated ultrasound image, limited annotated data, Automated ultrasound, annotated data, challenging due
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures, 3 tables

Abstract:Purpose: Automated ultrasound image analysis is challenging due to anatomical complexity and limited annotated data. To tackle this, we take a data-centric approach, assembling the largest public ultrasound segmentation dataset and training a versatile visual foundation model tailored for ultrasound. Methods: We compile US-43d, a large-scale collection of 43 open-access ultrasound datasets with over 280,000 images and segmentation masks for more than 50 anatomical structures. We then introduce UltraSam, an adaptation of the Segment Anything Model (SAM) that is trained on US-43d and supports both point- and box-prompts. Finally, we introduce a new use case for SAM-style models by using UltraSam as a model initialization that can be fine-tuned for various downstream analysis tasks, demonstrating UltraSam’s foundational capabilities. Results: UltraSam achieves vastly improved performance over existing SAM-style models for prompt-based segmentation on three diverse public datasets. Moreover, an UltraSam-initialized Vision Transformer surpasses ImageNet-, SAM-, and MedSAM-initialized models in various downstream segmentation and classification tasks, highlighting UltraSam’s effectiveness as a foundation model. Conclusion: We compile US-43d, a large-scale unified ultrasound dataset, and introduce UltraSam, a powerful multi-purpose SAM-style model for ultrasound images. We release our code and pretrained models at this https URL and invite the community to further this effort by contributing high-quality datasets.

[CV-232] High-Resolution Be Aware! Improving the Self-Supervised Real-World Super-Resolution

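【Quick read】: This paper addresses the unnatural results that self-supervised super-resolution (SR) produces because no real-world high-resolution images are available as references. The key to the solution is to strengthen awareness of the high-resolution image through two components: 1) a controller that dynamically adjusts the parameters of the degradation model according to the quality of the super-resolution results, so as to better model real-world degradations; and 2) a new feature-alignment regularizer that directly constrains the distribution of super-resolved images, bringing it closer to the feature distribution of natural images. With these improvements, the method fine-tunes off-the-shelf SR models to produce natural super-resolved images with state-of-the-art perceptual performance in the target real-world domain.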

Link: https://arxiv.org/abs/2411.16175
Authors: Yuehan Zhang,Angela Yao
Keywords (EN): learning is crucial, Self-supervised learning, real-world settings, ground-truth images, images
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures

Abstract:Self-supervised learning is crucial for super-resolution because ground-truth images are usually unavailable for real-world settings. Existing methods derive self-supervision from low-resolution images by creating pseudo-pairs or by enforcing a low-resolution reconstruction objective. These methods struggle with insufficient modeling of real-world degradations and the lack of knowledge about high-resolution imagery, resulting in unnatural super-resolved results. This paper strengthens awareness of the high-resolution image to improve the self-supervised real-world super-resolution. We propose a controller to adjust the degradation modeling based on the quality of super-resolution results. We also introduce a novel feature-alignment regularizer that directly constrains the distribution of super-resolved images. Our method finetunes the off-the-shelf SR models for a target real-world domain. Experiments show that it produces natural super-resolved images with state-of-the-art perceptual performance.

[CV-233] Learning Optimal Lattice Vector Quantizers for End-to-end Neural Image Compression NEURIPS2024

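【Quick read】: This paper addresses a weakness of traditional lattice vector quantization (LVQ) in neural image compression: because LVQ structures are designed for uniform source distributions, they are nonadaptive and suboptimal for the real, non-uniform distributions of latent codes. The key to the solution is a new learning method that designs rate-distortion optimal lattice vector quantization (OLVQ) codebooks with respect to the sample statistics of the latent features. By fitting the LVQ structure to any given latent sample distribution, OLVQ significantly improves the rate-distortion performance of existing quantization schemes in neural image compression while retaining the amenability of uniform scalar quantization.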

Link: https://arxiv.org/abs/2411.16119
Authors: Xi Zhang,Xiaolin Wu
Keywords (EN): Neural image compression, powerful vector quantization, uniform scalar quantization, deploy uniform scalar, Lattice vector quantization
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024

Abstract:It is customary to deploy uniform scalar quantization in the end-to-end optimized Neural image compression methods, instead of more powerful vector quantization, due to the high complexity of the latter. Lattice vector quantization (LVQ), on the other hand, presents a compelling alternative, which can exploit inter-feature dependencies more effectively while keeping computational efficiency almost the same as scalar quantization. However, traditional LVQ structures are designed/optimized for uniform source distributions, hence nonadaptive and suboptimal for real source distributions of latent code space for Neural image compression tasks. In this paper, we propose a novel learning method to overcome this weakness by designing the rate-distortion optimal lattice vector quantization (OLVQ) codebooks with respect to the sample statistics of the latent features to be compressed. By being able to better fit the LVQ structures to any given latent sample distribution, the proposed OLVQ method improves the rate-distortion performances of the existing quantization schemes in neural image compression significantly, while retaining the amenability of uniform scalar quantization.
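
For readers unfamiliar with lattice vector quantization, the sketch below quantizes a 2D point to a lattice defined by a basis matrix, using Babai rounding (express the point in lattice coordinates, round, map back). This is a simplification: Babai rounding is not guaranteed to return the nearest lattice point for skewed bases, and the basis here is fixed by hand, whereas the paper learns it from latent statistics. All names and numbers are hypothetical.

```python
def lattice_quantize(x, basis):
    """Quantize a 2D point to the lattice spanned by the columns of `basis`."""
    (a, b), (c, d) = basis  # rows of the 2x2 basis matrix B
    det = a * d - b * c     # assumed non-zero (basis must be invertible)
    # Lattice coordinates of x, i.e. B^{-1} x.
    u = (d * x[0] - b * x[1]) / det
    v = (-c * x[0] + a * x[1]) / det
    ku, kv = round(u), round(v)  # integer index of the chosen lattice point
    point = (a * ku + b * kv, c * ku + d * kv)
    return point, (ku, kv)


# Generators (2, 0) and (1, 2). A diagonal basis would reduce to
# uniform scalar quantization applied to each coordinate separately.
basis = [[2.0, 1.0], [0.0, 2.0]]
point, index = lattice_quantize((3.2, 1.9), basis)
```

The cost is one small matrix-vector product plus rounding, which is why the abstract describes LVQ as almost as cheap as scalar quantization.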

[CV-234] Global spatio-temporal downscaling of ERA5 precipitation through generative AI

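【Quick read】: This paper addresses the lack of high-resolution precipitation data at global scale, in particular the inability of the ERA5 reanalysis dataset to capture intense local rainfall events with high spatio-temporal variability. The key to the solution is spateGAN-ERA5, a deep learning model based on a conditional generative adversarial network (cGAN) that enhances the resolution of ERA5 precipitation data from 24 km and 1 hour to 2 km and 10 minutes, producing high-resolution rainfall fields with realistic spatio-temporal patterns and an accurate rain rate distribution, including extremes. Trained on data from Germany and validated in the US and Australia, the model shows strong generalization and potential for global application, providing high-resolution precipitation data for hydrological and meteorological research.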

Link: https://arxiv.org/abs/2411.16098
Authors: Luca Glawion,Julius Polz,Harald Kunstmann,Benjamin Fersch,Christian Chwala
Keywords (EN): determining freshwater resources, agricultural yield, spatial and temporal, human lives, lives by determining
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments:

Abstract:The spatial and temporal distribution of precipitation has a significant impact on human lives by determining freshwater resources and agricultural yield, but also rainfall-driven hazards like flooding or landslides. While the ERA5 reanalysis dataset provides consistent long-term global precipitation information that allows investigations of these impacts, it lacks the resolution to capture the high spatio-temporal variability of precipitation. ERA5 misses intense local rainfall events that are crucial drivers of devastating flooding - a critical limitation since extreme weather events become increasingly frequent. Here, we introduce spateGAN-ERA5, the first deep learning based spatio-temporal downscaling of precipitation data on a global scale. SpateGAN-ERA5 uses a conditional generative adversarial neural network (cGAN) that enhances the resolution of ERA5 precipitation data from 24 km and 1 hour to 2 km and 10 minutes, delivering high-resolution rainfall fields with realistic spatio-temporal patterns and accurate rain rate distribution including extremes. Its computational efficiency enables the generation of a large ensemble of solutions, addressing uncertainties inherent to the challenges of downscaling. Trained solely on data from Germany and validated in the US and Australia considering diverse climate zones, spateGAN-ERA5 demonstrates strong generalization indicating a robust global applicability. SpateGAN-ERA5 fulfils a critical need for high-resolution precipitation data in hydrological and meteorological research, offering new capabilities for flood risk assessment, AI-enhanced weather forecasting, and impact modelling to address climate-driven challenges worldwide.

[CV-235] Peritumoral Expansion Radiomics for Improved Lung Cancer Classification

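【Quick read】: This paper investigates how lung nodule segmentation and the surrounding peritumoral regions affect radiomics-based lung cancer classification. The approach has three parts: first, nodules in 3D CT scans are segmented with four techniques (Otsu, Fuzzy C-Means (FCM), Gaussian Mixture Model (GMM), and K-Nearest Neighbors (KNN)); second, the initial nodule segmentation is expanded into the peritumoral region (2, 4, 6, 8, 10, and 12 mm) to analyze its influence on classification; finally, several machine learning classifiers (such as Random Forest, Logistic Regression, and KNN) are used for classification and compared with deep learning models (such as the Foundation Model for Cancer Biomarkers (FMCB) and ResNet50-SWS++). The results show that including the peritumoral region significantly improves classification performance, with the best result at 8 mm expansion (AUC = 0.78), and that the radiomics-based approach outperforms the deep learning baselines.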

Link: https://arxiv.org/abs/2411.16008
Authors: Fakrul Islam Tushar
Keywords (EN): Gaussian Mixture Model, Gaussian Mixture, regions influence radiomics-based, radiomics-based lung cancer, including Random Forest
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 tables, 5 figures

Abstract:Purpose: This study investigated how nodule segmentation and surrounding peritumoral regions influence radiomics-based lung cancer classification. Methods: Using 3D CT scans with bounding-box-annotated nodules, we generated 3D segmentations using four techniques: Otsu, Fuzzy C-Means (FCM), Gaussian Mixture Model (GMM), and K-Nearest Neighbors (KNN). Radiomics features were extracted using the PyRadiomics library, and multiple machine-learning-based classifiers, including Random Forest, Logistic Regression, and KNN, were employed to classify nodules as cancerous or non-cancerous. The best-performing segmentation and model were further analyzed by expanding the initial nodule segmentation into the peritumoral region (2, 4, 6, 8, 10, and 12 mm) to understand the influence of the surrounding area on classification. Additionally, we compared our results to deep learning-based feature extractors Foundation Model for Cancer Biomarkers (FMCB) and other state-of-the-art baseline models. Results: Incorporating peritumoral regions significantly enhanced performance, with the best result obtained at 8 mm expansion (AUC = 0.78). Compared to image-based deep learning models, such as FMCB (AUC = 0.71) and ResNet50-SWS++ (AUC = 0.71), our radiomics-based approach demonstrated superior classification accuracy. Conclusion: The study highlights the importance of peritumoral expansion in improving lung cancer classification using radiomics. These findings can inform the development of more robust AI-driven diagnostic tools.
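
The peritumoral expansion step can be pictured as a morphological dilation of the nodule mask by a physical distance. The sketch below does this on a 2D grid with isotropic pixel spacing; the study itself works on 3D CT segmentations, and the abstract does not say how the expansion was implemented, so the function `expand_mask` and its parameters are hypothetical.

```python
def expand_mask(mask, radius_mm, spacing_mm=1.0):
    """Dilate a binary 2D mask by `radius_mm`, using Euclidean distance."""
    rows, cols = len(mask), len(mask[0])
    reach = int(radius_mm / spacing_mm)  # largest pixel offset worth checking
    expanded = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr in range(-reach, reach + 1):
                for dc in range(-reach, reach + 1):
                    dist_sq = (dr * spacing_mm) ** 2 + (dc * spacing_mm) ** 2
                    rr, cc = r + dr, c + dc
                    if dist_sq <= radius_mm ** 2 and 0 <= rr < rows and 0 <= cc < cols:
                        expanded[rr][cc] = 1
    return expanded


# Single-pixel "nodule" in the middle of a 7x7 slice with 1 mm pixels.
mask = [[0] * 7 for _ in range(7)]
mask[3][3] = 1
expanded = expand_mask(mask, radius_mm=2.0)
# The peritumoral ring is the expanded region minus the nodule itself.
ring_pixels = sum(map(sum, expanded)) - sum(map(sum, mask))
```

Radiomics features would then be extracted from the expanded region, or from the ring alone, instead of from the nodule mask only.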

[CV-236] Cross-organ Deployment of EOS Detection AI without Retraining: Feasibility and Limitation

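【Quick read】: This paper addresses the automation of eosinophil (Eos) counting in chronic rhinosinusitis (CRS). Because manually counting Eos in histological samples is laborious and time-intensive, AI-driven automated evaluation is highly desirable. The key idea is to take a CircleSnake model originally trained on upper gastrointestinal (GI) data and use it to segment Eos cells in whole slide images (WSIs) of nasal tissue, in order to assess whether Eos segmentation models developed for the GI tract can be applied to nasal tissue without retraining. The experiments show promising accuracy on some WSIs, although performance varies across cases.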

Link: https://arxiv.org/abs/2411.15942
Authors: Yifei Wu,Juming Xiong,Tianyuan Yao,Ruining Deng,Junlin Guo,Jialin Yue,Naweed Chowdhury,Yuankai Huo
Keywords (EN): Chronic rhinosinusitis, discolored nasal drainage, facial pressure, olfactory dysfunction, paranasal sinuses
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures. Accepted by SPIE Medical Imaging 2025 on October 28, 2024

Abstract:Chronic rhinosinusitis (CRS) is characterized by persistent inflammation in the paranasal sinuses, leading to typical symptoms of nasal congestion, facial pressure, olfactory dysfunction, and discolored nasal drainage, which can significantly impact quality-of-life. Eosinophils (Eos), a crucial component in the mucosal immune response, have been linked to disease severity in CRS. The diagnosis of eosinophilic CRS typically uses a threshold of 10-20 eos per high-power field (HPF). However, manually counting Eos in histological samples is laborious and time-intensive, making the use of AI-driven methods for automated evaluations highly desirable. Interestingly, eosinophils are predominantly located in the gastrointestinal (GI) tract, which has prompted the release of numerous deep learning models trained on GI data. This study leverages a CircleSnake model initially trained on upper-GI data to segment Eos cells in whole slide images (WSIs) of nasal tissues. It aims to determine the extent to which Eos segmentation models developed for the GI tract can be adapted to nasal applications without retraining. The experimental results show promising accuracy in some WSIs, although, unsurprisingly, the performance varies across cases. This paper details these performance outcomes, delves into the reasons for such variations, and aims to provide insights that could guide future development of deep learning models for eosinophilic CRS.

[CV-237] PromptHSI: Universal Hyperspectral Image Restoration Framework for Composite Degradation

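【Quick read】: This paper addresses the challenges of transferring All-in-One (AiO) RGB image restoration and prompt learning to hyperspectral image (HSI) restoration: the domain gap between RGB and HSI features, the loss of information in visual prompts under severe composite degradations, and the difficulty of capturing HSI-specific degradation representations through text prompts. The key to the solution is PromptHSI, which the authors present as the first universal AiO HSI restoration framework. Using frequency-aware feature modulation based on the characteristics of HSI degradations, it decomposes text prompts into intensity and bias controllers that guide the restoration process while avoiding domain gaps. The unified architecture performs well on both fine-grained recovery and global information restoration, and the experiments show superior performance under various degradation combinations, indicating strong potential for practical remote sensing applications.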

Link: https://arxiv.org/abs/2411.15922
Authors: Chia-Ming Lee,Ching-Heng Cheng,Yu-Fan Lin,Yi-Ching Cheng,Wo-Ting Liao,Chih-Chung Hsu,Fu-En Yang,Yu-Chiang Frank Wang
Keywords (EN): allowing degraded images, Recent developments, single restoration model, RGB image restoration, allowing degraded
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures

Abstract:Recent developments in All-in-One (AiO) RGB image restoration and prompt learning have enabled the representation of distinct degradations through prompts, allowing degraded images to be effectively addressed by a single restoration model. However, this paradigm faces significant challenges when transferring to hyperspectral image (HSI) restoration tasks due to: 1) the domain gap between RGB and HSI features and difference on their structures, 2) information loss in visual prompts under severe composite degradations, and 3) difficulties in capturing HSI-specific degradation representations through text prompts. To address these challenges, we propose PromptHSI, the first universal AiO HSI restoration framework. By leveraging the frequency-aware feature modulation based on characteristics of HSI degradations, we decompose text prompts into intensity and bias controllers to effectively guide the restoration process while avoiding domain gaps. Our unified architecture excels at both fine-grained recovery and global information restoration tasks. Experimental results demonstrate superior performance under various degradation combinations, indicating great potential for practical remote sensing applications. The source code and dataset will be publicly released.

[CV-238] Optimizing Brain Tumor Segmentation with MedNeXt: BraTS 2024 SSA and Pediatrics

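【Quick read】: This paper addresses automatic tumor segmentation in brain MRI scans, in particular the drop in model performance caused by distribution shifts across populations (for example, sub-Saharan Africa and children). The key to the solution is to use the MedNeXt model together with comprehensive model ensembling and thorough postprocessing to improve generalization to unseen data. On the unseen validation sets, the method achieves an average Dice Similarity Coefficient (DSC) of 0.896 and an average Hausdorff Distance (HD95) of 14.682 on the BraTS-2024 SSA dataset, and an average DSC of 0.830 and an average HD95 of 37.508 on the BraTS Pediatric Tumor dataset.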

Link: https://arxiv.org/abs/2411.15872
Authors: Sarim Hashmi,Juan Lugo,Abdelrahman Elsayed,Dinesh Saggurthi,Mohammed Elseiagy,Alikhan Nurkamal,Jaskaran Walia,Fadillah Adamsyah Maani,Mohammad Yaqub
Keywords (EN): Identifying key pathological, key pathological features, Identifying key, key pathological, pathological features
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Identifying key pathological features in brain MRIs is crucial for the long-term survival of glioma patients. However, manual segmentation is time-consuming, requiring expert intervention and is susceptible to human error. Therefore, significant research has been devoted to developing machine learning methods that can accurately segment tumors in 3D multimodal brain MRI scans. Despite their progress, state-of-the-art models are often limited by the data they are trained on, raising concerns about their reliability when applied to diverse populations that may introduce distribution shifts. Such shifts can stem from lower quality MRI technology (e.g., in sub-Saharan Africa) or variations in patient demographics (e.g., children). The BraTS-2024 challenge provides a platform to address these issues. This study presents our methodology for segmenting tumors in the BraTS-2024 SSA and Pediatric Tumors tasks using MedNeXt, comprehensive model ensembling, and thorough postprocessing. Our approach demonstrated strong performance on the unseen validation set, achieving an average Dice Similarity Coefficient (DSC) of 0.896 on the BraTS-2024 SSA dataset and an average DSC of 0.830 on the BraTS Pediatric Tumor dataset. Additionally, our method achieved an average Hausdorff Distance (HD95) of 14.682 on the BraTS-2024 SSA dataset and an average HD95 of 37.508 on the BraTS Pediatric Tumor dataset. The GitHub repository can be accessed here: this https URL
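
For reference, the Dice Similarity Coefficient reported above is twice the overlap between prediction and ground truth divided by the sum of their sizes. A minimal sketch on flattened binary masks follows; the function name and the toy masks are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption (challenge evaluations handle that case in their own way).

```python
def dice_coefficient(pred, truth):
    """Dice score for two flattened binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    size = sum(pred) + sum(truth)
    if size == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * intersection / size


pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
score = dice_coefficient(pred, truth)
```

Here two voxels overlap out of three predicted and three true voxels, so the score is 2·2 / (3 + 3) = 2/3.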

[CV-239] Variable-size Symmetry-based Graph Fourier Transforms for image compression

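【Quick read】: This paper addresses the difficulty that data-dependent transforms for image and video coding face when adapting to diverse statistical characteristics: assembling large datasets to learn each transform is challenging, and the learned transforms typically lack fast implementations. The key to the solution is a new family of Symmetry-based Graph Fourier Transforms (SBGFTs), extended to the general case of NxN grids. SBGFTs are non-separable transforms whose symmetry properties allow sparse signal representation at low computational complexity. The proposed algorithm generates symmetric graphs by adding specific symmetrical connections between nodes on the grid, so no data-dependent adaptation is required. For video intra-frame coding, the paper also exploits the correlation between optimal graphs and prediction modes to reduce the cardinality of the transform sets, yielding a low-complexity coding framework. The experiments show that SBGFTs outperform the primary transforms of the explicit Multiple Transform Selection (MTS) used in the latest VVC intra-coding, with a bit rate saving of 6.23% and only a marginal increase in average complexity.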

Link: https://arxiv.org/abs/2411.15824
Authors: Alessandro Gnutti,Fabrizio Guerrini,Riccardo Leonardi,Antonio Ortega
Keywords (EN): Modern compression systems, Modern compression, decoding processes, systems use linear, linear transformations
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern compression systems use linear transformations in their encoding and decoding processes, with transforms providing compact signal representations. While multiple data-dependent transforms for image/video coding can adapt to diverse statistical characteristics, assembling large datasets to learn each transform is challenging. Also, the resulting transforms typically lack fast implementation, leading to significant computational costs. Thus, despite many papers proposing new transform families, the most recent compression standards predominantly use traditional separable sinusoidal transforms. This paper proposes integrating a new family of Symmetry-based Graph Fourier Transforms (SBGFTs) of variable sizes into a coding framework, focusing on the extension from our previously introduced 8x8 SBGFTs to the general case of NxN grids. SBGFTs are non-separable transforms that achieve sparse signal representation while maintaining low computational complexity thanks to their symmetry properties. Their design is based on our proposed algorithm, which generates symmetric graphs on the grid by adding specific symmetrical connections between nodes and does not require any data-dependent adaptation. Furthermore, for video intra-frame coding, we exploit the correlations between optimal graphs and prediction modes to reduce the cardinality of the transform sets, thus proposing a low-complexity framework. Experiments show that SBGFTs outperform the primary transforms integrated in the explicit Multiple Transform Selection (MTS) used in the latest VVC intra-coding, providing a bit rate saving percentage of 6.23%, with only a marginal increase in average complexity. A MATLAB implementation of the proposed algorithm is available online at [1].
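
As background on how graph Fourier transforms relate to the sinusoidal transforms mentioned in the abstract: a GFT uses the eigenvectors of a graph Laplacian as its basis, and for a simple path graph those eigenvectors are the DCT-II basis vectors. The sketch below checks this numerically. It illustrates only that well-known special case; the symmetric graphs that SBGFTs are built from are not reproduced here, and the helper names are hypothetical.

```python
import math


def path_laplacian(n):
    """Combinatorial Laplacian of a path graph with n nodes and unit weights."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap


def dct_basis_vector(k, n):
    """k-th (unnormalized) DCT-II basis vector of length n."""
    return [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]


def eigen_residual(lap, u, lam):
    """Largest entry of |L u - lam * u|; near zero if (lam, u) is an eigenpair."""
    n = len(u)
    return max(
        abs(sum(lap[i][j] * u[j] for j in range(n)) - lam * u[i]) for i in range(n)
    )


n = 8
lap = path_laplacian(n)
residuals = [
    eigen_residual(lap, dct_basis_vector(k, n), 2.0 - 2.0 * math.cos(math.pi * k / n))
    for k in range(n)
]
```

All residuals come out at floating-point noise level, confirming that the 1D DCT is the GFT of a path graph; changing the graph's connections changes the Laplacian and therefore the transform basis.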

[CV-240] A Novel Data Augmentation Tool for Enhancing Machine Learning Classification: A New Application of the Higher Order Dynamic Mode Decomposition for Improved Cardiac Disease Identification

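[Quick Read]: This paper addresses the limited accuracy of classifying cardiac diseases from echocardiography images. The key to the solution is combining higher order dynamic mode decomposition (HODMD) with a convolutional neural network (CNN) for data augmentation. Specifically, HODMD is first used as a feature extraction technique on echocardiography datasets from healthy mice and mice with different cardiac diseases, identifying the dominant features of each disease and representing them as a set of DMD modes. These DMD modes are then fed to the CNN as input, which augments the dimension of the database. Comparing a CNN trained only on the original echocardiography images with one trained on the original images combined with the DMD modes shows that the latter improves classification accuracy by up to 22%, indicating that HODMD has great potential as a data augmentation technique.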

Link: https://arxiv.org/abs/2411.15809
Authors: Nourelhouda Groun, Maria Villalba-Orero, Lucia Casado-Martin, Enrique Lara-Pezzi, Eusebio Valero, Jesus Garicano-Mena, Soledad Le Clainche
Keywords: modal decomposition method, convolutional neural network, dynamic mode decomposition, higher order dynamic, HODMD algorithm
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 2 tables


Abstract:In this work, a data-driven modal decomposition method, the higher order dynamic mode decomposition (HODMD), is combined with a convolutional neural network (CNN) in order to improve the classification accuracy of several cardiac diseases using echocardiography images. The HODMD algorithm is first used as a feature extraction technique for the echocardiography datasets, taken from both healthy mice and mice afflicted by different cardiac diseases (Diabetic Cardiomyopathy, Obesity, TAC Hypertrophy and Myocardial Infarction). A total of 130 echocardiography datasets are used in this work. The dominant features related to each cardiac disease were identified and represented by the HODMD algorithm as a set of DMD modes, which are then used as input to the CNN. In this way, the database dimension was augmented; hence HODMD has been used, for the first time to the authors' knowledge, for data augmentation in a machine learning framework. Six sets of the original echocardiography databases were held out as unseen data to test the performance of the CNN. To demonstrate the efficiency of the HODMD technique, two test cases are studied: in the first, the CNN is trained using the original echocardiography images only; in the second, it is trained using a combination of the original images and the DMD modes. The classification performance of the trained CNN shows that combining the original images with the DMD modes improves the results in all test cases, with accuracy gains of up to 22%. These results show the great potential of the HODMD algorithm as a data augmentation technique.

[CV-241] Medical Slice Transformer: Improved Diagnosis and Explainability on 3D Medical Images with DINOv2

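[Quick Read]: This paper addresses the scarcity of annotated 3D medical imaging data and aims to improve both the diagnostic accuracy and the explainability of deep learning models in medical image analysis. The key to the solution is extending the 2D self-supervised model DINOv2 to 3D medical imaging through the Medical Slice Transformer (MST) framework, which combines a Transformer architecture with DINOv2 as a 2D feature extractor. Across three clinical datasets, MST achieved higher AUC values than a conventional 3D convolutional neural network (3D ResNet), and its saliency maps were more precise and anatomically correct.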

Link: https://arxiv.org/abs/2411.15802
Authors: Gustav Müller-Franzes, Firas Khader, Robert Siepmann, Tianyu Han, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Keywords: diagnosing complex conditions, cross-sectional imaging techniques, essential clinical cross-sectional, complex conditions, MST
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:MRI and CT are essential clinical cross-sectional imaging techniques for diagnosing complex conditions. However, large 3D datasets with annotations for deep learning are scarce. While methods like DINOv2 are encouraging for 2D image analysis, these methods have not been applied to 3D medical images. Furthermore, deep learning models often lack explainability due to their “black-box” nature. This study aims to extend 2D self-supervised models, specifically DINOv2, to 3D medical imaging while evaluating their potential for explainable outcomes. We introduce the Medical Slice Transformer (MST) framework to adapt 2D self-supervised models for 3D medical image analysis. MST combines a Transformer architecture with a 2D feature extractor, i.e., DINOv2. We evaluate its diagnostic performance against a 3D convolutional neural network (3D ResNet) across three clinical datasets: breast MRI (651 patients), chest CT (722 patients), and knee MRI (1199 patients). Both methods were tested for diagnosing breast cancer, predicting lung nodule dignity, and detecting meniscus tears. Diagnostic performance was assessed by calculating the Area Under the Receiver Operating Characteristic Curve (AUC). Explainability was evaluated through a radiologist’s qualitative comparison of saliency maps based on slice and lesion correctness. P-values were calculated using DeLong’s test. MST achieved higher AUC values compared to ResNet across all three datasets: breast (0.94 ± 0.01 vs. 0.91 ± 0.02, P=0.02), chest (0.95 ± 0.01 vs. 0.92 ± 0.02, P=0.13), and knee (0.85 ± 0.04 vs. 0.69 ± 0.05, P=0.001). Saliency maps were consistently more precise and anatomically correct for MST than for ResNet. Self-supervised 2D models like DINOv2 can be effectively adapted for 3D medical imaging using MST, offering enhanced diagnostic accuracy and explainability compared to convolutional neural networks.
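
The AUC used above as the diagnostic metric can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way on toy data; the function name `roc_auc` and the example labels and scores are made up for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = disease present, 0 = absent.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of cases, which is fine for a small illustration; a rank-based formulation gives the same value more efficiently on large test sets.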

[CV-242] M3-CVC: Controllable Video Compression with Multimodal Generative Models ICASSP2025

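[Quick Read]: This paper addresses the limitations in controllability and generality that traditional and neural video codecs encounter in ultra-low-bitrate coding scenarios. The key to the solution is M3-CVC, a controllable video compression framework that incorporates multimodal generative models. The framework selects keyframes with a semantic-motion composite strategy to retain critical information, and uses a dialogue-based large multimodal model (LMM) to extract hierarchical spatiotemporal details, providing both inter-frame and intra-frame representations that improve video fidelity and encoding interpretability. M3-CVC also applies a conditional diffusion-based, text-guided keyframe compression method for high-fidelity frame reconstruction. During decoding, textual descriptions generated by the LMM guide the diffusion process to restore the original video content. Experiments show that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low-bitrate scenarios, particularly in preserving semantic and perceptual fidelity.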

Link: https://arxiv.org/abs/2411.15798
Authors: Rui Wan, Qi Zheng, Yibo Fan
Keywords: codecs commonly encounter, commonly encounter limitations, Traditional and neural, neural video codecs, video codecs commonly
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICASSP 2025


Abstract:Traditional and neural video codecs commonly encounter limitations in controllability and generality under ultra-low-bitrate coding scenarios. To overcome these challenges, we propose M3-CVC, a controllable video compression framework incorporating multimodal generative models. The framework utilizes a semantic-motion composite strategy for keyframe selection to retain critical information. For each keyframe and its corresponding video clip, a dialogue-based large multimodal model (LMM) approach extracts hierarchical spatiotemporal details, enabling both inter-frame and intra-frame representations for improved video fidelity while enhancing encoding interpretability. M3-CVC further employs a conditional diffusion-based, text-guided keyframe compression method, achieving high fidelity in frame reconstruction. During decoding, textual descriptions derived from LMMs guide the diffusion process to restore the original video’s content accurately. Experimental results demonstrate that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low bitrate scenarios, particularly in preserving semantic and perceptual fidelity.

[CV-243] Enhancing the automatic segmentation and analysis of 3D liver vasculature models

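[Quick Read]: This paper addresses the automatic segmentation and analysis of vessel trees, in particular the portal and hepatic venous trees, for the surgical assessment of liver cancer patients. The key to the solution is an automatic pipeline based on deep learning and image processing that improves the 3D segmentation, skeletonization, and subsequent analysis of vessel trees. The first part examines how differentiable skeletonization methods (such as ClDice and a morphological skeletonization loss) affect overall vessel segmentation performance and how vessel tree connectivity can be improved. The second part converts a single-class vessel segmentation into a multi-class one that separates the portal and hepatic trees, building on the previous two-class vessel segmentation model and on connected-component and skeleton analyses of the trees; it then sub-labels the specific anatomical branches of each tree, which enables a morphometric analysis. The method improves on current skeletonization methods and yields a clean multi-class vessel segmentation that surgeons validated as having low error. The work also produces a high-quality liver vessel dataset of 77 cases and a method for annotating vessel trees according to anatomy, enabling a liver vessel morphometry analysis.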

Link: https://arxiv.org/abs/2411.15778
Authors: Yassine Machta, Omar Ali, Kevin Hakkakian, Ana Vlascenau, Amaury Facque, Nicolas Golse, Irene Vignon-Clementel
Keywords: Surgical assessment, cancer patients requires, patients requires identification, vessel
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Internship at Simbiotx


Abstract:Surgical assessment of liver cancer patients requires identification of the vessel trees from medical images. Specifically, the venous trees, namely the portal (perfusing) and the hepatic (draining) trees, are important for understanding the liver anatomy and disease state and for planning surgery. This research aims to improve the 3D segmentation, skeletonization, and subsequent analysis of vessel trees by creating an automatic pipeline based on deep learning and image processing techniques. The first part of this work explores the impact of differentiable skeletonization methods, such as ClDice and morphological skeletonization loss, on the overall liver vessel segmentation performance. To this end, it studies how to improve vessel tree connectivity. The second part of this study converts a single-class vessel segmentation into a multi-class one, separating the two venous trees. It builds on the previous two-class vessel segmentation model, whose vessel tree outputs might be entangled, and on connected components and skeleton analyses of the trees. After providing sub-labeling of the specific anatomical branches of each venous tree, these algorithms also enable a morphometric analysis of the vessel trees by extracting various geometrical markers. In conclusion, we propose a method that successfully improves current skeletonization methods for extensive vascular trees that contain vessels of different calibers. The separation algorithm creates a clean multi-class segmentation of the vessels, validated by surgeons to provide low error. A new, publicly shared high-quality liver vessel dataset of 77 cases is thus created. Finally, a method to annotate vessel trees according to anatomy is provided, enabling a unique liver vessel morphometry analysis.
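
Connected-component analysis, which the abstract mentions as one ingredient of the tree separation, can be illustrated with a small flood fill over a binary voxel volume. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `component_sizes`, the choice of 6-connectivity, and the toy volume are all hypothetical.

```python
from collections import deque

# Face-adjacent neighbours in (z, y, x) order: 6-connectivity (an assumption).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def component_sizes(volume):
    """Return the voxel counts of all 6-connected foreground components,
    largest first, for a binary volume given as nested lists [z][y][x]."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    sizes = []
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if not volume[z][y][x] or (z, y, x) in seen:
                    continue
                # Breadth-first flood fill from this unvisited foreground voxel.
                queue = deque([(z, y, x)])
                seen.add((z, y, x))
                size = 0
                while queue:
                    cz, cy, cx = queue.popleft()
                    size += 1
                    for dz, dy, dx in NEIGHBOURS:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                        if inside and volume[nz][ny][nx] and (nz, ny, nx) not in seen:
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
                sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical toy mask: two small "branches" plus one isolated voxel.
toy_volume = [
    [[1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
    [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]],
]
toy_sizes = component_sizes(toy_volume)
```

In a segmentation setting, a vessel mask that breaks into many small components is a sign of poor connectivity, so the number and size of components is one simple quantity to track alongside overlap metrics.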

[CV-244] Comparative Analysis of Diffusion Generative Models in Computational Pathology

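[Quick Read]: This paper examines how Diffusion Generative Models (DGMs) can be used in computational pathology to generate high-quality synthetic pathology data and thereby improve the accuracy of deep learning models in histopathology. The key to the solution is an in-depth comparative analysis of diffusion methods applied to pathology data, extended to datasets with varying Fields of View (FOV), which finds that DGMs are highly effective at producing high-quality synthetic data. The study also observes that adjusting the image size during generation can simulate varying fields of view, underscoring the potential of DGMs to enhance the quality and diversity of synthetic pathology data.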

Link: https://arxiv.org/abs/2411.15719
Authors: Denisha Thakkar, Vincent Quoc-Huy Trinh, Sonal Varma, Samira Ebrahimi Kahou, Hassan Rivaz, Mahdi S. Hosseini
Keywords: Diffusion Generative Models, garnering significant interest, Diffusion Generative, Generative Models, computer vision
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted paper under review


Abstract:Diffusion Generative Models (DGM) have rapidly surfaced as emerging topics in the field of computer vision, garnering significant interest across a wide array of deep learning applications. Despite their high computational demand, these models are extensively utilized for their superior sample quality and robust mode coverage. While research in diffusion generative models is advancing, exploration within the domain of computational pathology and its large-scale datasets has been comparatively gradual. Bridging the gap between the high-quality generation capabilities of Diffusion Generative Models and the intricate nature of pathology data, this paper presents an in-depth comparative analysis of diffusion methods applied to a pathology dataset. Our analysis extends to datasets with varying Fields of View (FOV), revealing that DGMs are highly effective in producing high-quality synthetic data. An ablative study is also conducted, followed by a detailed discussion on the impact of various methods on the synthesized histopathology images. One striking observation from our experiments is how the adjustment of image size during data generation can simulate varying fields of view. These findings underscore the potential of DGMs to enhance the quality and diversity of synthetic pathology data, especially when used with real data, ultimately increasing accuracy of deep learning models in histopathology. Code is available from this https URL

[CV-245] Machine-agnostic Automated Lumbar MRI Segmentation using a Cascaded Model Based on Generative Neurons

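[Quick Read]: This paper addresses the automatic segmentation of lumbar vertebrae and intervertebral discs (IVDs) in lumbar MRI images. The key to the solution is a cascaded model that combines region-of-interest (ROI) detection with an encoder-decoder segmentation network based on a Self-organized Operational Neural Network (Self-ONN). Specifically, the YOLOv8 medium model performs ROI extraction with a 0.916 mAP score, and the Self-ONN-based model with a DenseNet121 encoder segments the vertebrae and IVDs with a mean Intersection over Union (IoU) of 83.66%, a sensitivity of 91.44%, and a Dice Similarity Coefficient (DSC) of 91.03%, validated through 10-fold cross-validation.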

Link: https://arxiv.org/abs/2411.15656
Authors: Promit Basak, Rusab Sarmun, Saidul Kabir, Israa Al-Hashimi, Enamul Hoque Bhuiyan, Anwarul Hasan, Muhammad Salman Khan, Muhammad E. H. Chowdhury
Keywords: modern diagnosis systems, Self-organized Operational Neural, Operational Neural Network, Automated lumbar spine, diagnosis systems
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 Pages, 11 Figures, Expert Systems with Applications, 2024


Abstract:Automated lumbar spine segmentation is crucial for modern diagnosis systems. In this study, we introduce a novel machine-agnostic approach for segmenting lumbar vertebrae and intervertebral discs (IVDs) from MRI images, employing a cascaded model that combines ROI detection with a Self-organized Operational Neural Network (Self-ONN)-based encoder-decoder network for segmentation. Addressing the challenge of diverse MRI modalities, our methodology capitalizes on a unique dataset comprising images from 12 scanners and 34 subjects, enhanced through strategic preprocessing and data augmentation techniques. The YOLOv8 medium model excels in ROI extraction, achieving a 0.916 mAP score. Significantly, our Self-ONN-based model, combined with a DenseNet121 encoder, demonstrates excellent performance in lumbar vertebrae and IVD segmentation with a mean Intersection over Union (IoU) of 83.66%, a sensitivity of 91.44%, and a Dice Similarity Coefficient (DSC) of 91.03%, as validated through rigorous 10-fold cross-validation. This study not only showcases an effective approach to MRI segmentation in spine-related disorders but also sets the stage for future advancements in automated diagnostic tools, emphasizing the need for further dataset expansion and model refinement for broader clinical applicability.
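
The three overlap metrics quoted above (IoU, sensitivity, and DSC) all derive from the same true-positive, false-positive, and false-negative pixel counts. The sketch below shows the standard definitions on a toy pair of binary masks; the function name `segmentation_metrics` and the masks are illustrative assumptions, not the paper's evaluation code.

```python
def segmentation_metrics(pred, truth):
    """Compute IoU, Dice, and sensitivity for two binary masks (lists of rows)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # predicted foreground, truly foreground
            elif p:
                fp += 1      # predicted foreground, truly background
            elif t:
                fn += 1      # missed foreground pixel
    if tp + fn == 0:
        raise ValueError("ground-truth mask has no foreground pixels")
    return {
        "iou": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
    }


# Hypothetical toy masks: 3 pixels overlap, 1 is over-segmented, 1 is missed.
toy_truth = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
toy_pred = [[0, 1, 1, 1], [0, 1, 0, 0], [0, 0, 0, 0]]
toy_metrics = segmentation_metrics(toy_pred, toy_truth)
```

For a single pair of masks the two overlap scores are tied together by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Averages reported over many images need not satisfy that identity exactly.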

[CV-246] Comparative Analysis of Resource-Efficient CNN Architectures for Brain Tumor Classification

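[Quick Read]: This paper addresses the high computational complexity of deep learning models such as ResNet-18 and VGG-16, which achieve high accuracy in brain tumor classification. The key to the solution is a simple yet effective Convolutional Neural Network (CNN) architecture that delivers competitive performance while reducing computational complexity and resource requirements. In experiments on two public datasets (Br35H and the Brain Tumor MRI Dataset), the custom CNN performed well on both binary and multi-class classification, and in few-shot evaluations its accuracy improved notably as the number of shots increased. These results suggest that well-designed, low-complexity CNN architectures can serve as efficient alternatives to deeper pre-trained models for medical imaging tasks.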

Link: https://arxiv.org/abs/2411.15596
Authors: Md Ashik Khan, Ankit Kumar Verma
Keywords: Accurate brain tumor, Brain Tumor MRI, Tumor MRI Dataset, brain tumor, brain tumor classification
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures


Abstract:Accurate brain tumor classification in MRI images is critical for timely diagnosis and treatment planning. While deep learning models like ResNet-18 and VGG-16 have shown high accuracy, they often come with increased complexity and computational demands. This study presents a comparative analysis of a simple yet effective Convolutional Neural Network (CNN) architecture and pre-trained ResNet18 and VGG16 models for brain tumor classification using two publicly available datasets: Br35H:: Brain Tumor Detection 2020 and Brain Tumor MRI Dataset. The custom CNN architecture, despite its lower complexity, demonstrates competitive performance with the pre-trained ResNet18 and VGG16 models. In binary classification tasks, the custom CNN achieved an accuracy of 98.67% on the Br35H dataset and 99.62% on the Brain Tumor MRI Dataset. For multi-class classification, the custom CNN, with a slight architectural modification, achieved an accuracy of 98.09% on the Brain Tumor MRI Dataset. Comparatively, ResNet18 and VGG16 maintained high performance levels, but the custom CNNs provided a more computationally efficient alternative. Additionally, the custom CNNs were evaluated using few-shot learning (0, 5, 10, 15, 20, 40, and 80 shots) to assess their robustness, achieving notable accuracy improvements with increased shots. This study highlights the potential of well-designed, less complex CNN architectures as effective and computationally efficient alternatives to deeper, pre-trained models for medical imaging tasks, including brain tumor classification. This study underscores the potential of custom CNNs in medical imaging tasks and encourages further exploration in this direction.

[CV-247] Classifier Enhanced Deep Learning Model for Erythroblast Differentiation with Limited Data ICPR2024

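[Quick Read]: This paper addresses the diagnostic challenge of differentiating erythroblasts from white blood cells (WBCs) in clinical settings. The key to the solution is using the ResNet-50 deep learning model as a backbone combined with several machine learning classifiers (SVM, XG-Boost, KNN, and Random Forest), and evaluating them across training splits of different sizes. The ResNet50-SVM classifier outperforms the other models in both overall test accuracy and erythroblast detection accuracy, and it maintains high performance even with very little training data: trained on only 1% of the dataset, it reaches a test accuracy of 86.75% and an erythroblast precision of 98.9%, compared with 82.03% and 98.6% for the pre-trained ResNet-50 model without added classifiers. The approach offers a way to achieve higher classification accuracy on small and unique datasets in resource-scarce settings.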

Link: https://arxiv.org/abs/2411.15592
Authors: Buddhadev Goswami, Adithya B. Somaraj, Prantar Chakrabarti, Ravindra Gudi, Nirmal Punjabi
Keywords: present significant diagnostic, genetic diseases affecting, Hematological disorders, significant diagnostic challenges, affecting blood formation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, Accepted for the 27th International Conference on Pattern Recognition (ICPR 2024)


Abstract:Hematological disorders, which involve a variety of malignant conditions and genetic diseases affecting blood formation, present significant diagnostic challenges. One such major challenge in clinical settings is differentiating erythroblasts from WBCs. Our approach evaluates the efficacy of various machine learning (ML) classifiers - SVM, XG-Boost, KNN, and Random Forest - using the ResNet-50 deep learning model as a backbone in detecting and differentiating erythroblast blood smear images across training splits of different sizes. Our findings indicate that the ResNet50-SVM classifier consistently surpasses the other models in overall test accuracy and erythroblast detection accuracy, maintaining high performance even with minimal training data. Even when trained on just 1% (168 images per class for eight classes) of the complete dataset, ML classifiers such as SVM achieved a test accuracy of 86.75% and an erythroblast precision of 98.9%, compared to 82.03% and 98.6% for pre-trained ResNet-50 models without any classifiers. When limited data is available, the proposed approach outperforms traditional deep learning models, thereby offering a solution for achieving higher classification accuracy for small and unique datasets, especially in resource-scarce settings.

[CV-248] MulModSeg: Enhancing Unpaired Multi-Modal Medical Image Segmentation with Modality-Conditioned Text Embedding and Alternating Training WACV-2025

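[Quick Read]: This paper addresses automatic segmentation of multi-modal medical images (such as CT and MR), in particular the challenge of maintaining consistent performance across modalities. The key to the solution is a simple yet effective Multi-Modal Segmentation (MulModSeg) strategy with two core designs: first, a modality-conditioned text embedding framework, built on a frozen text encoder, that adds modality awareness to existing segmentation frameworks without significant structural modifications or computational overhead; second, an alternating training procedure that facilitates the integration of essential features from unpaired CT and MR inputs. With both Fully Convolutional Network and Transformer-based backbones, MulModSeg consistently outperforms previous methods in segmenting abdominal multi-organ and cardiac substructures.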

Link: https://arxiv.org/abs/2411.15576
Authors: Chengyin Li, Hui Zhu, Rafi Ibn Sultan, Hassan Bagher Ebadian, Prashant Khanduri, Chetty Indrin, Kundan Thind, Dongxiao Zhu
Keywords: Computed Tomography, Magnetic Resonance, scans and Magnetic, types of Computed, diverse field
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV-2025


Abstract:In the diverse field of medical imaging, automatic segmentation has numerous applications and must handle a wide variety of input domains, such as different types of Computed Tomography (CT) scans and Magnetic Resonance (MR) images. This heterogeneity challenges automatic segmentation algorithms to maintain consistent performance across different modalities due to the requirement for spatially aligned and paired images. Typically, segmentation models are trained using a single modality, which limits their ability to generalize to other types of input data without employing transfer learning techniques. Additionally, leveraging complementary information from different modalities to enhance segmentation precision often necessitates substantial modifications to popular encoder-decoder designs, such as introducing multiple branched encoding or decoding paths for each modality. In this work, we propose a simple Multi-Modal Segmentation (MulModSeg) strategy to enhance medical image segmentation across multiple modalities, specifically CT and MR. It incorporates two key designs: a modality-conditioned text embedding framework via a frozen text encoder that adds modality awareness to existing segmentation frameworks without significant structural modifications or computational overhead, and an alternating training procedure that facilitates the integration of essential features from unpaired CT and MR inputs. Through extensive experiments with both Fully Convolutional Network and Transformer-based backbones, MulModSeg consistently outperforms previous methods in segmenting abdominal multi-organ and cardiac substructures for both CT and MR modalities. The code is available at this https URL.

[CV-249] Multi-scale Cascaded Large-Model for Whole-body ROI Segmentation

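[Quick Read]: This paper addresses the uncertainties and biases in target selection and the insufficient model validation found in organs-at-risk segmentation, which limit the generality and reliability of existing methods in practice. The key to the solution is a cascaded network architecture, the Multi-scale Cascaded Fusing Network (MCFNet). MCFNet combines a Sharp Extraction Backbone and a Flexible Connection Backbone, which enhance feature extraction in the downsampling and skip-connection stages respectively, so that complex multi-scale and multi-resolution features are captured effectively. According to the authors, this design improves segmentation accuracy while preserving computational efficiency, capturing fine detail even in low-resolution images. The paper also introduces an adaptive loss aggregation strategy to further optimize training. Validated on multiple datasets, MCFNet is reported to outperform existing methods and to provide more reliable image-guided support for radiotherapy and surgery.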

Link: https://arxiv.org/abs/2411.15526
Authors: Rui Hao, Dayu Tan, Yansen Su, Chunhou Zheng
Keywords: critical for ensuring, Multi-scale Cascaded Fusing, Cascaded Fusing Network, Flexible Connection Backbone, Sharp Extraction Backbone
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Organs-at-risk segmentation is critical for ensuring the safety and precision of radiotherapy and surgical procedures. However, existing methods for organs-at-risk image segmentation often suffer from uncertainties and biases in target selection, as well as insufficient model validation experiments, limiting their generality and reliability in practical applications. To address these issues, we propose an innovative cascaded network architecture called the Multi-scale Cascaded Fusing Network (MCFNet), which effectively captures complex multi-scale and multi-resolution features. MCFNet includes a Sharp Extraction Backbone and a Flexible Connection Backbone, which respectively enhance feature extraction in the downsampling and skip-connection stages. This design not only improves segmentation accuracy but also ensures computational efficiency, enabling precise detail capture even in low-resolution images. We conduct experiments using the A6000 GPU on diverse datasets from 671 patients, including 36,131 image-mask pairs across 10 different datasets. MCFNet demonstrates strong robustness, performing consistently well across 10 datasets. Additionally, MCFNet exhibits excellent generalizability, maintaining high accuracy in different clinical scenarios. We also introduce an adaptive loss aggregation strategy to further optimize the model training process, improving both segmentation accuracy and efficiency. Through extensive validation, MCFNet demonstrates superior performance compared to existing methods, providing more reliable image-guided support. Our solution aims to significantly improve the precision and safety of radiotherapy and surgical procedures, advancing personalized treatment. The code has been made available on GitHub: this https URL.

[CV-250] SPA: Efficient User-Preference Alignment against Uncertainty in Medical Image Segmentation

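[Quick Read]: This paper addresses the inherent uncertainty in medical image segmentation, which stems from imperfect image quality and from variability in labeling preferences on ambiguous pixels; these preferences depend on annotators' expertise and the clinical context. For example, the same pixel may be labeled differently across clinical applications: as tumor in diagnosis to avoid under-assessing severity, but as normal tissue in radiotherapy to prevent damage to sensitive structures. To serve such diverse downstream needs, the paper proposes SPA, a segmentation framework whose key idea is to present a small number of distinct segmentation candidates that best capture the uncertainty, reducing clinician workload, together with a probabilistic mechanism that uses user feedback to adapt the model's segmentation preference, enabling efficient test-time adaptation. Evaluated on color fundus images, CT, and MRI, the framework shows a significant reduction in clinician time and effort, strong adaptability, and state-of-the-art segmentation performance.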

Link: https://arxiv.org/abs/2411.15513
Authors: Jiayuan Zhu, Junde Wu, Cheng Ouyang, Konstantinos Kamnitsas, Alison Noble
Keywords: imperfect image quality, segmentation data inherently, inherently contain uncertainty, segmentation, data inherently
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Medical image segmentation data inherently contain uncertainty, often stemming from both imperfect image quality and variability in labeling preferences on ambiguous pixels, which depend on annotators’ expertise and the clinical context of the annotations. For instance, a boundary pixel might be labeled as tumor in diagnosis to avoid under-assessment of severity, but as normal tissue in radiotherapy to prevent damage to sensitive structures. As segmentation preferences vary across downstream applications, it is often desirable for an image segmentation model to offer user-adaptable predictions rather than a fixed output. While prior uncertainty-aware and interactive methods offer adaptability, they are inefficient at test time: uncertainty-aware models require users to choose from numerous similar outputs, while interactive models demand significant user input through click or box prompts to refine segmentation. To address these challenges, we propose SPA, a segmentation framework that efficiently adapts to diverse test-time preferences with minimal human interaction. By presenting users with a select few, distinct segmentation candidates that best capture uncertainties, it reduces clinician workload in reaching the preferred segmentation. To accommodate user preference, we introduce a probabilistic mechanism that leverages user feedback to adapt the model’s segmentation preference. The proposed framework is evaluated on a diverse range of medical image segmentation tasks: color fundus images, CT, and MRI. It demonstrates 1) a significant reduction in clinician time and effort compared with existing interactive segmentation approaches, 2) strong adaptability based on human feedback, and 3) state-of-the-art image segmentation performance across diverse modalities and semantic labels.

[CV-251] Improved Background Estimation for Gas Plume Identification in Hyperspectral Images

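[Quick Read]: This paper addresses the difficulty of identifying gases in longwave infrared (LWIR) hyperspectral images when the background under a plume is estimated inaccurately. The key to the solution is proposing two new background estimation methods, one of which is the K-Nearest Segments algorithm, and comparing them, along with three existing methods, against global background estimation to determine which best estimates the true background radiance and which most increases the identification confidence of a neural network classification model. The results show that PCA estimates the background best, with a median mean squared error (MSE) 18,000 times lower than that of global background estimation, while the K-Nearest Segments algorithm improves median neural network identification confidence by 53.2%.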

Link: https://arxiv.org/abs/2411.15378
Authors: Scout Jarman, Zigfried Hampel-Arias, Adra Carr, Kevin R. Moon
Keywords: Longwave infrared, LWIR sensors, identifying effluent gases, LWIR, background
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 10 figures, submitted and under review to IEEE Transactions on Geoscience and Remote Sensing


Abstract:Longwave infrared (LWIR) hyperspectral imaging can be used for many tasks in remote sensing, including detecting and identifying effluent gases by LWIR sensors on airborne platforms. Once a potential plume has been detected, it needs to be identified to determine exactly what gas or gases are present in the plume. During identification, the background underneath the plume needs to be estimated and removed to reveal the spectral characteristics of the gas of interest. Current standard practice is to use “global” background estimation, where the average of all non-plume pixels is used to estimate the background for each pixel in the plume. However, if this global background estimate does not model the true background under the plume well, then the resulting signal can be difficult to identify correctly. The importance of proper background estimation increases when dealing with weak signals, large libraries of gases of interest, and with uncommon or heterogeneous backgrounds. In this paper, we propose two methods of background estimation, in addition to three existing methods, and compare each against global background estimation to determine which perform best at estimating the true background radiance under a plume, and for increasing identification confidence using a neural network classification model. We compare the different methods using 640 simulated plumes. We find that PCA is best at estimating the true background under a plume, with a median of 18,000 times less MSE compared to global background estimation. Our proposed K-Nearest Segments algorithm improves median neural network identification confidence by 53.2%.
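
The global background estimate described above, the per-band average of all non-plume pixels, and the MSE used to score it can be sketched in a few lines. This is a toy illustration only: the function names `global_background` and `mse`, the two-band spectra, and the assumed "true" background are hypothetical, and none of the paper's proposed estimators is reproduced here.

```python
def global_background(spectra, plume_mask):
    """Average the spectra of all non-plume pixels, band by band."""
    off_plume = [s for s, in_plume in zip(spectra, plume_mask) if not in_plume]
    if not off_plume:
        raise ValueError("no non-plume pixels available")
    bands = len(off_plume[0])
    return [sum(s[b] for s in off_plume) / len(off_plume) for b in range(bands)]


def mse(estimate, truth):
    """Mean squared error between two spectra of equal length."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


# Hypothetical scene with two background materials and one plume pixel.
toy_spectra = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0], [9.0, 9.0]]
toy_plume_mask = [False, False, False, False, True]

# Assumed ground truth: the plume pixel sits over the second material.
toy_true_background = [3.0, 4.0]

toy_estimate = global_background(toy_spectra, toy_plume_mask)
toy_error = mse(toy_estimate, toy_true_background)
```

Because the toy scene mixes two materials, the global average lands between them and carries a nonzero error for the plume pixel, which is the kind of heterogeneous-background failure the abstract points to.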

[CV-252] Deep Learning-Based Automatic Delineation of Liver Domes in kV Triggered Images for Online Breath-hold Reproducibility Verification of Liver Stereotactic Body Radiation Therapy

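[Quick read]: This paper addresses the problem of ensuring breath-hold reproducibility when Stereotactic Body Radiation Therapy (SBRT) is used to treat liver cancer and liver metastases. The key to the solution is a deep learning-based automated pipeline that delineates the liver dome from kV-triggered images. The pipeline trains a U-Net to segment the liver dome region and then extracts the liver dome through thresholding, edge detection, and morphological operations. The approach reduces the need for manual verification of liver dome positions: delineation of a liver dome takes less than one second per triggered image.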

Link: https://arxiv.org/abs/2411.15322
Authors: Sugandima Weragoda, Ping Xia, Kevin Stephans, Neil Woody, Michael Martens, Robert Brown, Bingqi Guo
Keywords: Stereotactic Body Radiation, Body Radiation Therapy, Stereotactic Body, Radiation Therapy, Body Radiation
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Stereotactic Body Radiation Therapy (SBRT) can be a precise, minimally invasive treatment method for liver cancer and liver metastases. However, the effectiveness of SBRT relies on the accurate delivery of the dose to the tumor while sparing healthy tissue. Challenges persist in ensuring breath-hold reproducibility, with current methods often requiring manual verification of liver dome positions from kV-triggered images. To address this, we propose a proof-of-principle study of a deep learning-based pipeline to automatically delineate the liver dome from kV-planar images. From 24 patients who received SBRT for liver cancer or metastasis inside the liver, 711 kV-triggered images acquired for online breath-hold verification were included in the current study. We developed a pipeline comprising a trained U-Net for automatic liver dome region segmentation from the triggered images followed by extraction of the liver dome via thresholding, edge detection, and morphological operations. The performance and generalizability of the pipeline were evaluated using 2-fold cross validation. The training of the U-Net model for liver region segmentation took under 30 minutes and the automatic delineation of a liver dome for any triggered image took less than one second. The RMSE and rate of detection for Fold1 with 366 images were (6.4 +/- 1.6) mm and 91.7%, respectively. For Fold2 with 345 images, the RMSE and rate of detection were (7.7 +/- 2.3) mm and 76.3%, respectively.

[CV-253] Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration

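[Quick read]: This paper addresses the loss of restoration quality that diffusion-based image restoration methods suffer on linear inverse problems because of approximation errors. The key to the solution is a time-varying low-pass filter applied in the frequency domain of the measurements, which progressively incorporates higher frequencies during restoration, together with an adaptive curriculum for this frequency schedule based on the underlying data distribution. The method improves performance on challenging restoration tasks such as motion deblurring and image dehazing.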

Link: https://arxiv.org/abs/2411.15295
Authors: Darshan Thaker, Abhishek Goyal, René Vidal
Keywords: recover high-quality images, Image restoration aims, aims to recover, recover high-quality, degraded observations
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we provide the first rigorous analysis of this approximation error for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. Motivated by our theoretical insights, we propose a simple modification to existing diffusion-based restoration methods. Our approach introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We develop an adaptive curriculum for this frequency schedule based on the underlying data distribution. Our method significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
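To make the idea of a time-varying low-pass filter concrete, here is a minimal 1D sketch: a measurement is filtered in the DFT domain, and the cutoff grows with the restoration step so that higher frequencies enter progressively. The linear `cutoff_schedule` is a stand-in chosen for the example; the paper describes an adaptive curriculum based on the data distribution, which is not reproduced here. All function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def low_pass(x, cutoff):
    """Zero every frequency bin whose distance from DC exceeds `cutoff`."""
    n = len(x)
    spectrum = dft(x)
    kept = [spectrum[k] if min(k, n - k) <= cutoff else 0j for k in range(n)]
    return idft(kept)


def cutoff_schedule(step, num_steps, max_freq):
    """Assumed linear schedule: cutoff rises from 0 to max_freq (num_steps >= 2)."""
    return (max_freq * step) // (num_steps - 1)


# Toy measurement: constant level 2 plus the highest-frequency alternation.
measurement = [3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0]
filtered_per_step = [
    low_pass(measurement, cutoff_schedule(step, 5, 4)) for step in range(5)
]
```

At the first step only the mean survives; by the last step the cutoff reaches the highest bin and the measurement passes through unchanged.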

[CV-254] A Plug-and-Play Temporal Normalization Module for Robust Remote Photoplethysmography

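[Quick read]: This paper addresses the fact that most Remote Photoplethysmography (rPPG) methods rely on intensity differences between consecutive frames and therefore miss long-term signal variations affected by motion or lighting artifacts, which reduces heart rate measurement accuracy. The key to the solution is the Temporal Normalization (TN) module, which captures long-term temporally normalized features following detrending and thereby mitigates motion and lighting artifacts. TN is a flexible plug-and-play module compatible with any end-to-end rPPG network architecture. Integrated into four state-of-the-art rPPG methods, it improved heart rate measurement performance by 34.3% to 94.2% across four widely used datasets, with even larger gains in smaller models.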

Link: https://arxiv.org/abs/2411.15283
Authors: Kegang Wang, Jiankai Tang, Yantao Wei, Mingxuan Liu, Xin Liu, Yuntao Wang
Keywords: extracts PPG signals, showing strong potential, Remote photoplethysmography, extracts PPG, PPG signals
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Remote photoplethysmography (rPPG) extracts PPG signals from subtle color changes in facial videos, showing strong potential for health applications. However, most rPPG methods rely on intensity differences between consecutive frames, missing long-term signal variations affected by motion or lighting artifacts, which reduces accuracy. This paper introduces Temporal Normalization (TN), a flexible plug-and-play module compatible with any end-to-end rPPG network architecture. By capturing long-term temporally normalized features following detrending, TN effectively mitigates motion and lighting artifacts, significantly boosting the rPPG prediction performance. When integrated into four state-of-the-art rPPG methods, TN delivered performance improvements ranging from 34.3% to 94.2% in heart rate measurement tasks across four widely-used datasets. Notably, TN showed even greater performance gains in smaller models. We further discuss and provide insights into the mechanisms behind TN’s effectiveness.
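The abstract does not give the TN formula, so the following is only a generic illustration of "detrend, then normalize over time" on a single intensity trace. It assumes a centered moving-average detrend and a zero-mean, unit-variance normalization; the actual TN module may differ in both steps, and the names `detrend` and `temporal_normalize` are hypothetical.

```python
import statistics


def detrend(signal, window):
    """Subtract a centered moving average (one simple detrending choice)."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(signal[i] - sum(signal[lo:hi]) / (hi - lo))
    return out


def temporal_normalize(signal, window=5, eps=1e-8):
    """Detrend, then scale the whole trace to zero mean and unit variance."""
    detrended = detrend(signal, window)
    mean = statistics.fmean(detrended)
    std = statistics.pstdev(detrended)
    return [(value - mean) / (std + eps) for value in detrended]


# Toy trace: slow lighting drift (0.5 per frame) plus a small alternating pulse.
trace = [10 + 0.5 * t + (1 if t % 2 == 0 else -1) for t in range(20)]
normalized = temporal_normalize(trace)
```

In this toy trace the drift is much larger than the pulse over 20 frames, yet after detrending and normalization the output is dominated by the alternating component.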

[CV-255] Feature-interactive Siamese graph encoder-based image analysis to predict STAS from histopathology images in lung cancer

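[Quick read]: This paper addresses the detection of spread through air spaces (STAS) in lung cancer, an invasion pattern that is crucial for prognosis assessment and surgical decisions. Traditional histopathological assessment is subjective, time-consuming, and prone to misdiagnosis, which limits large-scale application. The proposed solution is VERN, a model that uses a feature-interactive Siamese graph encoder to predict STAS from lung cancer histopathology images. VERN captures spatial topological features and uses feature sharing and skip connections to enhance model training. It achieved an AUC of 0.9215 in internal validation and AUCs of 0.8275 and 0.8829 on frozen and paraffin-embedded test sections, respectively. Validation on a single cohort and three external datasets indicates robust predictive performance and generalizability, and an open platform is provided to improve the efficiency and accuracy of STAS diagnosis.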

Link: https://arxiv.org/abs/2411.15274
Authors: Liangrui Pan, Qingchun Liang, Wenwu Zeng, Yijun Peng, Zhenyu Zhao, Yiyi Liang, Jiadi Luo, Xiang Wang, Shaoliang Peng
Keywords: guiding surgical decisions, distinct invasion pattern, Spread through air, air spaces, crucial for prognosis
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in npj Precision Oncology

Click to view abstract

Abstract:Spread through air spaces (STAS) is a distinct invasion pattern in lung cancer, crucial for prognosis assessment and guiding surgical decisions. Histopathology is the gold standard for STAS detection, yet traditional methods are subjective, time-consuming, and prone to misdiagnosis, limiting large-scale applications. We present VERN, an image analysis model utilizing a feature-interactive Siamese graph encoder to predict STAS from lung cancer histopathological images. VERN captures spatial topological features with feature sharing and skip connections to enhance model training. Using 1,546 histopathology slides, we built a large single-cohort STAS lung cancer dataset. VERN achieved an AUC of 0.9215 in internal validation and AUCs of 0.8275 and 0.8829 in frozen and paraffin-embedded test sections, respectively, demonstrating clinical-grade performance. Validated on a single-cohort and three external datasets, VERN showed robust predictive performance and generalizability, providing an open platform (this http URL) to enhance STAS diagnosis efficiency and accuracy.

[CV-256] MambaIRv2: Attentive State Space Restoration

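[Quick read]: This paper addresses the causal modeling limitation of Mamba-based image restoration models: each token depends only on its predecessors in the scanned sequence, which restricts the use of pixels across the whole image and hurts restoration quality. The key to the solution is MambaIRv2, which gives Mamba a non-causal modeling ability similar to ViTs, yielding an attentive state space restoration model. Specifically, the proposed attentive state-space equation lets the model attend beyond the scanned sequence and unfold the image with a single scan. The paper also introduces a semantic-guided neighboring mechanism to encourage interaction between distant but similar pixels. Experiments show that MambaIRv2 outperforms SRFormer by 0.35 dB PSNR on lightweight super-resolution with 9.3% fewer parameters, and surpasses HAT on classic super-resolution by up to 0.29 dB.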

Link: https://arxiv.org/abs/2411.15269
Authors: Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, Tao Dai, Shu-Tao Xia, Yawei Li
Keywords: recently demonstrated significant, demonstrated significant potential, balancing global reception, Mamba-based image restoration, image restoration backbones
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Technical report

Click to view abstract

Abstract:The Mamba-based image restoration backbones have recently demonstrated significant potential in balancing global reception and computational efficiency. However, the inherent causal modeling limitation of Mamba, where each token depends solely on its predecessors in the scanned sequence, restricts the full utilization of pixels across the image and thus presents new challenges in image restoration. In this work, we propose MambaIRv2, which equips Mamba with the non-causal modeling ability similar to ViTs to reach the attentive state space restoration model. Specifically, the proposed attentive state-space equation allows the model to attend beyond the scanned sequence and facilitates image unfolding with just one single scan. Moreover, we further introduce a semantic-guided neighboring mechanism to encourage interaction between distant but similar pixels. Extensive experiments show our MambaIRv2 outperforms SRFormer by even 0.35dB PSNR for lightweight SR even with 9.3% fewer parameters and surpasses HAT on classic SR by up to 0.29dB. Code is available at this https URL.

[CV-257] OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction

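[Quick read]: This paper addresses exposure correction in complex real-world scenes under extreme exposure conditions. The key to the solution is Omnidirectional Spectral Mamba (OSMamba), a new exposure correction network that combines the advantages of state space models and generative diffusion models. Specifically, OSMamba introduces an omnidirectional spectral scanning mechanism that adapts Mamba to the frequency domain to capture comprehensive long-range dependencies in both the amplitude and phase spectra of deep image features, which enhances illumination correction and structure recovery. The paper also develops a dual-domain prior generator that learns from well-exposed images to generate a degradation-free diffusion prior containing correct information about severely under- and over-exposed regions, for better detail restoration.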

Link: https://arxiv.org/abs/2411.15255
Authors: Gehui Li, Bin Chen, Chen Zhao, Lei Zhang, Jian Zhang
Keywords: fundamental problem, problem in computer, computer vision, extreme exposure conditions, Omnidirectional Spectral Mamba
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Exposure correction is a fundamental problem in computer vision and image processing. Recently, frequency domain-based methods have achieved impressive improvement, yet they still struggle with complex real-world scenarios under extreme exposure conditions. This is due to the local convolutional receptive fields failing to model long-range dependencies in the spectrum, and the non-generative learning paradigm being inadequate for retrieving lost details from severely degraded regions. In this paper, we propose Omnidirectional Spectral Mamba (OSMamba), a novel exposure correction network that incorporates the advantages of state space models and generative diffusion models to address these limitations. Specifically, OSMamba introduces an omnidirectional spectral scanning mechanism that adapts Mamba to the frequency domain to capture comprehensive long-range dependencies in both the amplitude and phase spectra of deep image features, hence enhancing illumination correction and structure recovery. Furthermore, we develop a dual-domain prior generator that learns from well-exposed images to generate a degradation-free diffusion prior containing correct information about severely under- and over-exposed regions for better detail restoration. Extensive experiments on multiple-exposure and mixed-exposure datasets demonstrate that the proposed OSMamba achieves state-of-the-art performance both quantitatively and qualitatively.

[CV-258] Unsupervised Machine Learning for Osteoporosis Diagnosis Using Singh Index Clustering on Hip Radiographs

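[Quick read]: This paper addresses the fact that manual calculation of the Singh Index (SI) for osteoporosis diagnosis is time-intensive and requires expertise. The key to the solution is to automate SI identification with machine learning. The study develops a custom convolutional neural network architecture for feature extraction and evaluates it on an unlabelled dataset of 838 hip X-ray images, where it shows better cluster homogeneity and heterogeneity than established models. The study also highlights dataset imbalance and the importance of image quality and clinical data, and suggests augmenting X-ray images with patient clinical data and reference images, alongside image pre-processing, to improve diagnostic accuracy. Semi-supervised and self-supervised learning are suggested as possible ways to mitigate the labelling challenges of large datasets.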

Link: https://arxiv.org/abs/2411.15253
Authors: Vimaladevi Madhivanan, Kalavakonda Vijaya, Abhay Lal, Senthil Rithika, Shamala Karupusamy Subramaniam, Mohamed Sameer
Keywords: aging population worldwide, altered bone structure, diminished bone mass, population worldwide, increasing susceptibility
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Osteoporosis, a prevalent condition among the aging population worldwide, is characterized by diminished bone mass and altered bone structure, increasing susceptibility to fractures. It poses a significant and growing global public health challenge over the next decade. Diagnosis typically involves Dual-energy X-ray absorptiometry to measure bone mineral density, yet its mass screening utility is limited. The Singh Index (SI) provides a straightforward, semi-quantitative means of osteoporosis diagnosis through plain hip radiographs, assessing trabecular patterns in the proximal femur. Although cost-effective and accessible, manual SI calculation is time-intensive and requires expertise. This study aims to automate SI identification from radiographs using machine learning algorithms. An unlabelled dataset of 838 hip X-ray images from Indian adults aged 20-70 was utilized. A custom convolutional neural network architecture was developed for feature extraction, demonstrating superior performance in cluster homogeneity and heterogeneity compared to established models. Various clustering algorithms categorized images into six SI grade clusters, with comparative analysis revealing only two clusters with high Silhouette Scores for promising classification. Further scrutiny highlighted dataset imbalance and emphasized the importance of image quality and additional clinical data availability. The study suggests augmenting X-ray images with patient clinical data and reference images, alongside image pre-processing techniques, to enhance diagnostic accuracy. Additionally, exploring semi-supervised and self-supervised learning methods may mitigate labelling challenges associated with large datasets.
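The abstract judges cluster quality by Silhouette Score. As a reminder of what that number measures, here is a small self-contained computation of the standard mean silhouette coefficient with Euclidean distance. The feature vectors and labels are toy values invented for the example, not data from the study.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy 1D features: two well-separated groups.
features = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(features, [0, 0, 1, 1])   # close to 1: compact, separated
bad = silhouette_score(features, [0, 1, 0, 1])    # negative: clusters interleave
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 or below indicate overlapping or mis-assigned clusters.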

[CV-259] Optimized Vessel Segmentation: A Structure-Agnostic Approach with Small Vessel Enhancement and Morphological Correction

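[Quick read]: This paper addresses generalized vessel segmentation, in particular the sparsity, fine granularity, low contrast, and data distribution variability of vascular imaging and the critical need to preserve topological structure. The key to the solution is an optimized vessel segmentation framework: a structure-agnostic approach that incorporates small vessel enhancement and morphological correction for multi-modality vessel segmentation. Compared with existing methods, the approach achieves superior segmentation accuracy and generalization and a 34.6% improvement in connectivity.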

Link: https://arxiv.org/abs/2411.15251
Authors: Dongning Song, Weijian Huang, Jiarun Liu, Md Jahidul Islam, Hao Yang, Shanshan Wang
Keywords: Accurate segmentation, postoperative analyses, assessments and postoperative, vessel segmentation, segmentation
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures, submitted to TIP

Click to view abstract

Abstract:Accurate segmentation of blood vessels is essential for various clinical assessments and postoperative analyses. However, the inherent challenges of vascular imaging, such as sparsity, fine granularity, low contrast, data distribution variability, and the critical need for preserving topological structure, make generalized vessel segmentation particularly complex. While specialized segmentation methods have been developed for specific anatomical regions, their over-reliance on tailored models hinders broader applicability and generalization. General-purpose segmentation models introduced in medical imaging often fail to address critical vascular characteristics, including the connectivity of segmentation results. To overcome these limitations, we propose an optimized vessel segmentation framework: a structure-agnostic approach incorporating small vessel enhancement and morphological correction for multi-modality vessel segmentation. To train and validate this framework, we compiled a comprehensive multi-modality dataset spanning 17 datasets and benchmarked our model against six SAM-based methods and 17 expert models. The results demonstrate that our approach achieves superior segmentation accuracy, generalization, and a 34.6% improvement in connectivity, underscoring its clinical potential. An ablation study further validates the effectiveness of the proposed improvements. We will release the code and dataset on GitHub following the publication of this work.

[CV-260] J-Invariant Volume Shuffle for Self-Supervised Cryo-Electron Tomogram Denoising on Single Noisy Volume

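[Quick read]: This paper addresses the low signal-to-noise ratio of Cryo-Electron Tomography (Cryo-ET) images, in particular when paired data are unavailable and traditional denoising and supervised learning methods struggle with complex noise patterns. It proposes a self-supervised learning model that denoises Cryo-ET volumetric images from a single noisy volume. The key to the solution is a U-shape J-invariant blind spot network with sparse centrally masked convolutions, dilated channel attention blocks, and a volume unshuffle/shuffle technique. The volume unshuffle/shuffle technique expands receptive fields and uses multi-scale representations, which improves noise reduction and structural preservation. In the reported experiments the approach outperforms existing methods.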

Link: https://arxiv.org/abs/2411.15248
Authors: Xiwei Liu, Mohamad Kassab, Min Xu, Qirong Ho
Keywords: Cryo-Electron Tomography, enables detailed, visualization of cellular, suffers from low, imaging constraints
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, 7 tables

Click to view abstract

Abstract:Cryo-Electron Tomography (Cryo-ET) enables detailed 3D visualization of cellular structures in near-native states but suffers from low signal-to-noise ratio due to imaging constraints. Traditional denoising methods and supervised learning approaches often struggle with complex noise patterns and the lack of paired datasets. Self-supervised methods, which utilize noisy input itself as a target, have been studied; however, existing Cryo-ET self-supervised denoising methods face significant challenges due to losing information during training and learning incomplete noise patterns. In this paper, we propose a novel self-supervised learning model that denoises Cryo-ET volumetric images using a single noisy volume. Our method features a U-shape J-invariant blind spot network with sparse centrally masked convolutions, dilated channel attention blocks, and a volume unshuffle/shuffle technique. The volume unshuffle/shuffle technique expands receptive fields and utilizes multi-scale representations, significantly improving noise reduction and structural preservation. Experimental results demonstrate that our approach achieves superior performance compared to existing methods, advancing Cryo-ET data processing for structural biology research.
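The abstract names a volume unshuffle/shuffle technique without defining it. A plausible reading, assumed here, is the 3D analogue of pixel unshuffle/shuffle: a volume is split into r×r×r interleaved sub-volumes at reduced resolution and later reassembled losslessly. The sketch below implements that assumed operation on nested lists; the paper's exact formulation may differ, and both function names are hypothetical.

```python
def volume_unshuffle(volume, r):
    """Split a D x H x W volume into r**3 interleaved sub-volumes.

    Sub-volume (i, j, k) holds every r-th voxel starting at offset (i, j, k),
    so each sub-volume has shape (D/r, H/r, W/r).
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    if depth % r or height % r or width % r:
        raise ValueError("all dimensions must be divisible by r")
    return {
        (i, j, k): [
            [[volume[d][h][w] for w in range(k, width, r)]
             for h in range(j, height, r)]
            for d in range(i, depth, r)
        ]
        for i in range(r) for j in range(r) for k in range(r)
    }


def volume_shuffle(sub_volumes, r):
    """Inverse of volume_unshuffle: interleave sub-volumes back together."""
    first = sub_volumes[(0, 0, 0)]
    d0, h0, w0 = len(first), len(first[0]), len(first[0][0])
    volume = [[[None] * (w0 * r) for _ in range(h0 * r)] for _ in range(d0 * r)]
    for (i, j, k), sub in sub_volumes.items():
        for d in range(d0):
            for h in range(h0):
                for w in range(w0):
                    volume[d * r + i][h * r + j][w * r + k] = sub[d][h][w]
    return volume


# Toy 2 x 4 x 4 volume whose voxel value encodes its own (d, h, w) index.
volume = [[[d * 100 + h * 10 + w for w in range(4)] for h in range(4)] for d in range(2)]
subs = volume_unshuffle(volume, 2)
restored = volume_shuffle(subs, 2)
```

Under this reading, neighbouring voxels in a sub-volume are r voxels apart in the original volume, which is one way a fixed-size kernel can cover a larger spatial extent.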

[CV-261] Learning Volumetric Neural Deformable Models to Recover 3D Regional Heart Wall Motion from Multi-Planar Tagged MRI

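[Quick read]: This paper addresses the accurate recovery of 3D true heart wall motion from multi-planar tagged MRI. The task is challenging because the true motion is incompletely sampled and the apparent motion cues observed on multiple imaging planes are hard to fuse. The key to the solution is a new class of volumetric neural deformable models (υNDMs), which represent heart wall geometry and motion through a set of low-dimensional global deformation parameter functions and a regularized local deformation field. To learn the mapping from 2D apparent motion to 3D true motion, the paper designs a hybrid point transformer that combines point cross-attention and self-attention. Point cross-attention fuses the 2D apparent motion cues, while self-attention, organized as an encoder-decoder structure, refines these cues and maps them to 3D true motion. Experiments show high accuracy in recovering dense 3D true motion from sparse 2D apparent motion cues.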

Link: https://arxiv.org/abs/2411.15233
Authors: Meng Ye, Bingyu Xin, Bangwei Guo, Leon Axel, Dimitris Metaxas
Keywords: Multi-planar tagged MRI, heart wall motion, Multi-planar tagged, tagged MRI, motion
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multi-planar tagged MRI is the gold standard for regional heart wall motion evaluation. However, accurate recovery of the 3D true heart wall motion from a set of 2D apparent motion cues is challenging, due to incomplete sampling of the true motion and difficulty in information fusion from apparent motion cues observed on multiple imaging planes. To solve these challenges, we introduce a novel class of volumetric neural deformable models (υNDMs). Our υNDMs represent heart wall geometry and motion through a set of low-dimensional global deformation parameter functions and a diffeomorphic point flow regularized local deformation field. To learn such global and local deformation for 2D apparent motion mapping to 3D true motion, we design a hybrid point transformer, which incorporates both point cross-attention and self-attention mechanisms. While use of point cross-attention can learn to fuse 2D apparent motion cues into material point true motion hints, point self-attention hierarchically organised as an encoder-decoder structure can further learn to refine these hints and map them into 3D true motion. We have performed experiments on a large cohort of synthetic 3D regional heart wall motion dataset. The results demonstrated the high accuracy of our method for the recovery of dense 3D true motion from sparse 2D apparent motion cues. Project page is at this https URL.

Artificial Intelligence

[AI-0] OPMOS: Ordered Parallel Multi-Objective Shortest-Path

Link: https://arxiv.org/abs/2411.16667
Authors: Leo Gold, Adam Bienkowski, David Sidoti, Krishna Pattipati, Omer Khan
Keywords: NP-hard MOS problem, multi-attribute graph, finds a set, MOS, Pareto-optimal solutions
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Data Structures and Algorithms (cs.DS); Performance (cs.PF)
Comments: 15 pages

Click to view abstract

Abstract:The Multi-Objective Shortest-Path (MOS) problem finds a set of Pareto-optimal solutions from a start node to a destination node in a multi-attribute graph. To solve the NP-hard MOS problem, the literature explores heuristic multi-objective A*-style algorithmic approaches. A generalized MOS algorithm maintains a “frontier” of partial paths at each node and performs ordered processing to ensure that Pareto-optimal paths are generated to reach the goal node. The algorithm becomes computationally intractable as the number of objectives increases due to a rapid increase in the non-dominated paths, and the concomitantly large increase in Pareto-optimal solutions. While prior works have focused on algorithmic methods to reduce the complexity, we tackle this challenge by exploiting parallelism using an algorithm-architecture approach. The key insight is that MOS algorithms rely on the ordered execution of partial paths to maintain high work efficiency. The OPMOS framework, proposed herein, unlocks ordered parallelism and efficiently exploits the concurrent execution of multiple paths in MOS. Experimental evaluation using the NVIDIA GH200 Superchip shows the performance scaling potential of OPMOS on work efficiency and parallelism using a real-world application to ship routing.
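The per-node "frontier" of non-dominated partial paths mentioned in the abstract rests on a Pareto dominance check between cost vectors. The sketch below shows that check and a simple frontier update; it is a generic illustration of the concept, not the OPMOS algorithm or its parallel scheme, and the function names and toy costs are assumptions.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_frontier(frontier, candidate):
    """Insert a candidate cost vector, keeping only non-dominated entries."""
    if any(dominates(kept, candidate) or kept == candidate for kept in frontier):
        return frontier  # candidate adds nothing new
    survivors = [kept for kept in frontier if not dominates(candidate, kept)]
    return survivors + [candidate]


def pareto_front(costs):
    """All cost vectors that no other vector dominates."""
    return [c for c in costs if not any(dominates(other, c) for other in costs)]


# Toy partial-path costs with two objectives, e.g. (distance, fuel).
costs = [(1, 5), (2, 2), (3, 3), (5, 1)]
frontier = []
for cost in costs:
    frontier = update_frontier(frontier, cost)
```

Each added objective makes it harder for one path to dominate another, so frontiers grow quickly, which matches the intractability the abstract attributes to a rising number of objectives.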

[AI-1] Recommender Systems for Good (RS4Good): Survey of Use Cases and a Call to Action for Research that Matters

Link: https://arxiv.org/abs/2411.16645
Authors: Dietmar Jannach, Alan Said, Marko Tkalčič, Markus Zanker
Keywords: developing increasingly sophisticated, increasingly sophisticated recommendation, sophisticated recommendation models, developing increasingly, increasingly sophisticated
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In the area of recommender systems, the vast majority of research efforts is spent on developing increasingly sophisticated recommendation models, also using increasingly more computational resources. Unfortunately, most of these research efforts target a very small set of application domains, mostly e-commerce and media recommendation. Furthermore, many of these models are never evaluated with users, let alone put into practice. The scientific, economic and societal value of much of these efforts by scholars therefore remains largely unclear. To achieve a stronger positive impact resulting from these efforts, we posit that we as a research community should more often address use cases where recommender systems contribute to societal good (RS4Good). In this opinion piece, we first discuss a number of examples where the use of recommender systems for problems of societal concern has been successfully explored in the literature. We then proceed by outlining a paradigmatic shift that is needed to conduct successful RS4Good research, where the key ingredients are interdisciplinary collaborations and longitudinal evaluation approaches with humans in the loop.

[AI-2] Inference-Time Policy Steering through Human Interactions

Link: https://arxiv.org/abs/2411.16627
Authors: Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sundaralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez-D’Arpino, Dieter Fox, Julie Shah
Keywords: autonomously accomplish multimodal, Generative policies trained, long-horizon tasks, accomplish multimodal, policies trained
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Generative policies trained with human demonstrations can autonomously accomplish multimodal, long-horizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, leading to constraint violations or execution failures. To better align policy output with human intent without inducing out-of-distribution errors, we propose an Inference-Time Policy Steering (ITPS) framework that leverages human interactions to bias the generative sampling process, rather than fine-tuning the policy on interaction data. We evaluate ITPS across three simulated and real-world benchmarks, testing three forms of human interaction and associated alignment distance metrics. Among six sampling strategies, our proposed stochastic sampling with diffusion policy achieves the best trade-off between alignment and distribution shift. Videos are available at this https URL.

[AI-3] F – A Model of Events based on the Foundational Ontology DOLCE+DnS Ultralite

Link: https://arxiv.org/abs/2411.16609
Authors: Ansgar Scherp, Thomas Franz, Carsten Saathoff, Steffen Staab
Keywords: distributed event-based systems, events hinders interoperability, formal model, event-based systems, hinders interoperability
Categories: Artificial Intelligence (cs.AI)
Comments: Reprint of KCAP 2009 paper with republished ontologies

Click to view abstract

Abstract:The lack of a formal model of events hinders interoperability in distributed event-based systems. In this paper, we present a formal model of events, called Event-Model-F. The model is based on the foundational ontology DOLCE+DnS Ultralite (DUL) and provides comprehensive support to represent time and space, objects and persons, as well as mereological, causal, and correlative relationships between events. In addition, the Event-Model-F provides a flexible means for event composition, modeling event causality and event correlation, and representing different interpretations of the same event. The Event-Model-F is developed following the pattern-oriented approach of DUL, is modularized in different ontologies, and can be easily extended by domain specific ontologies.

[AI-4] Representation Collapsing Problems in Vector Quantization

Link: https://arxiv.org/abs/2411.16550
Authors: Wenhao Zhao, Qiran Zou, Rushi Shah, Dianbo Liu
Keywords: discretizes continuous representations, Vector quantization, technique in machine, machine learning, learning that discretizes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, under review

Click to view abstract

Abstract:Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we investigate representation collapse in vector quantization - a critical degradation where codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values. This collapse fundamentally compromises the model’s ability to capture diverse data patterns. By leveraging both synthetic and real datasets, we identify the severity of each type of collapses and triggering conditions. Our analysis reveals that restricted initialization and limited encoder capacity result in tokens collapse and embeddings collapse. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.
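A minimal sketch of the quantization step and of one simple symptom of token collapse: each vector is mapped to its nearest codebook entry, and codebook usage is the fraction of entries that are ever selected. The usage metric is an illustrative diagnostic chosen for this example, not necessarily the measure used in the paper, and the codebook and inputs are toy values.

```python
def quantize(vector, codebook):
    """Index of the codebook entry nearest to `vector` (squared Euclidean)."""

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sqdist(vector, codebook[i]))


def codebook_usage(vectors, codebook):
    """Fraction of codebook entries selected by at least one vector."""
    used = {quantize(v, codebook) for v in vectors}
    return len(used) / len(codebook)


# Toy codebook with four entries; the encoder outputs only land near two of them.
codebook = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (11.0, 11.0)]
encoder_outputs = [(0.1, 0.0), (0.9, 1.2), (0.2, 0.1)]
indices = [quantize(v, codebook) for v in encoder_outputs]
usage = codebook_usage(encoder_outputs, codebook)
```

A usage well below 1.0, as in this toy case, means part of the codebook is never selected, which is the kind of degradation the abstract calls tokens collapse.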

[AI-5] Interpreting Language Reward Models via Contrastive Explanations

Link: https://arxiv.org/abs/2411.16502
Authors: Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso
Keywords: large language models’, Reward models, language models’, crucial component, comparing reward scores
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reward models (RMs) are a crucial component in the alignment of large language models’ (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM’s local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.

[AI-6] Characterized Diffusion Networks for Enhanced Autonomous Driving Trajectory Prediction

Link: https://arxiv.org/abs/2411.16457
Authors: Haoming Li
Keywords: Characterized Diffusion Module, Characterized Diffusion, Diffusion Module, Spatial-Temporal Interaction Network, combining a Characterized
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 0 figures

Click to view abstract

Abstract:In this paper, we present a novel trajectory prediction model for autonomous driving, combining a Characterized Diffusion Module and a Spatial-Temporal Interaction Network to address the challenges posed by dynamic and heterogeneous traffic environments. Our model enhances the accuracy and reliability of trajectory predictions by incorporating uncertainty estimation and complex agent interactions. Through extensive experimentation on public datasets such as NGSIM, HighD, and MoCAD, our model significantly outperforms existing state-of-the-art methods. We demonstrate its ability to capture the underlying spatial-temporal dynamics of traffic scenarios and improve prediction precision, especially in complex environments. The proposed model showcases strong potential for application in real-world autonomous driving systems.

[AI-7] TIFeD: a Tiny Integer-based Federated learning algorithm with Direct feedback alignment

Link: https://arxiv.org/abs/2411.16442
Authors: Luca Colombo,Alessandro Falcetta,Manuel Roveri
Keywords: extremely resource-constrained devices, tiny machine learning, deep learning models, learning models directly, Training machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training machine and deep learning models directly on extremely resource-constrained devices is the next challenge in the field of tiny machine learning. The related literature in this field is very limited, since most solutions focus only on on-device inference or model adaptation through online learning, leaving the training to be carried out on external Cloud services. An interesting technological perspective is to exploit Federated Learning (FL), which allows multiple devices to collaboratively train a shared model in a distributed way. However, the main drawback of state-of-the-art FL algorithms is that they are not suitable for running on tiny devices. For the first time in the literature, in this paper we introduce TIFeD, a Tiny Integer-based Federated learning algorithm with Direct Feedback Alignment (DFA), implemented entirely in integer-only arithmetic and specifically designed to operate on devices with limited resources in terms of memory, computation and energy. Besides the traditional full-network operating modality, in which each device of the FL setting trains the entire neural network on its own local data, we propose an innovative single-layer TIFeD implementation, which enables each device to train only a portion of the neural network model and opens the door to a new way of distributing the learning procedure across multiple devices. The experimental results show the feasibility and effectiveness of the proposed solution. The proposed TIFeD algorithm, with its full-network and single-layer implementations, is made available to the scientific community as a public repository.

[AI-8] Unsupervised Event Outlier Detection in Continuous Time

Link: https://arxiv.org/abs/2411.16427
Authors: Somjit Nath,Yik Chau Lui,Siqi Liu
Keywords: sequence data record, Event sequence, continuous time, Event sequence data, Event sequence forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Event sequence data record the occurrences of events in continuous time. Event sequence forecasting based on temporal point processes (TPPs) has been extensively studied, but outlier or anomaly detection, especially without any supervision from humans, is still underexplored. In this work, we develop, to the best of our knowledge, the first unsupervised outlier detection approach for detecting abnormal events. Our novel unsupervised outlier detection framework is based on ideas from generative adversarial networks (GANs) and reinforcement learning (RL). We train a ‘generator’ that corrects outliers in the data with a ‘discriminator’ that learns to discriminate the corrected data from the real data, which may contain outliers. A key insight is that if the generator makes a mistake in the correction, it generates anomalies that are different from the anomalies in the real data, so it serves as data augmentation for the discriminator learning. Different from typical GAN-based outlier detection approaches, our method employs the generator to detect outliers in an online manner. The experimental results show that our method can detect event outliers more accurately than the state-of-the-art approaches.

[AI-9] Turbofan Engine Remaining Useful Life (RUL) Prediction Based on Bi-Directional Long Short-Term Memory (BLSTM)

Link: https://arxiv.org/abs/2411.16422
Authors: Abedin Sherifi
Keywords: turbofan engine, rapidly evolving, aviation industry, industry is rapidly, Turbofan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:The aviation industry is rapidly evolving, driven by advancements in technology. Turbofan engines used in commercial aerospace are very complex systems. The majority of turbofan engine components are susceptible to degradation over the life of their operation. Turbofan engine degradation has an impact on engine performance, operability, and reliability. Accurately predicting the remaining useful life (RUL) of a commercial turbofan engine from a variety of complex sensor data is of paramount importance for passenger safety, flight safety, and cost-effective operations. That is why it is essential for turbofan engines to be monitored, controlled, and maintained. RUL predictions can come from either model-based or data-based approaches. The model-based approach can be very expensive due to the complexity of the mathematical models and the deep expertise that is required in the domain of physical systems. The data-based approach is more frequently used nowadays thanks to the high computational capability of modern computers, the advancements in Machine Learning (ML) models, and advancements in sensors. This paper focuses on Bi-Directional Long Short-Term Memory (BLSTM) models but also provides a benchmark of several data-based RUL prediction models. The proposed RUL prediction models are evaluated on the engine failure prediction benchmark dataset Commercial Modular Aero-Propulsion System Simulation (CMAPSS). The CMAPSS dataset, from NASA, contains turbofan engine run-to-failure events.

[AI-10] CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning

Link: https://arxiv.org/abs/2411.16313
Authors: Duo Wu,Jinghe Wang,Yuan Meng,Yanning Zhang,Le Sun,Zhi Wang
Keywords: Utilizing large language, automatically schedule external, Utilizing large, schedule external tools, large language models
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: In submission

Abstract:Utilizing large language models (LLMs) for tool planning has emerged as a promising avenue for developing general AI systems, where LLMs automatically schedule external tools (e.g. vision models) to tackle complex tasks based on task descriptions. To push this paradigm toward practical applications, it is crucial for LLMs to consider tool execution costs (e.g. execution time) for tool planning. Unfortunately, prior studies overlook the tool execution costs, leading to the generation of expensive plans whose costs outweigh their task performance. To fill this gap, we propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework, which for the first time provides a coherent design to empower LLMs for cost-aware tool planning. Specifically, CATP-LLM incorporates a tool planning language that enables the LLM to generate non-sequential plans with multiple branches for efficient concurrent tool execution and cost reduction. Moreover, it designs a cost-aware offline reinforcement learning algorithm to fine-tune the LLM to optimize the performance-cost trade-off in tool planning. Given the lack of public cost-related datasets, we further present OpenCATP, the first platform for cost-aware planning evaluation. Experiments on OpenCATP show that CATP-LLM outperforms GPT-4 even when using Llama2-7B as its backbone, with on average 28.2%-30.2% higher plan performance and 24.7%-45.8% lower costs, even on the challenging planning tasks. The codes of CATP-LLM and OpenCATP will be publicly available.

[AI-11] The SVASR System for Text-dependent Speaker Verification (TdSV) AAIC Challenge 2024

Link: https://arxiv.org/abs/2411.16276
Authors: Mohammadreza Molavi,Reza Khodadadi
Keywords: high-performance biometric systems, text-dependent speaker verification, designed to address, paper introduces, introduces an efficient
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This paper introduces an efficient and accurate pipeline for text-dependent speaker verification (TDSV), designed to address the need for high-performance biometric systems. The proposed system incorporates a Fast-Conformer-based ASR module to validate speech content, filtering out Target-Wrong (TW) and Impostor-Wrong (IW) trials. For speaker verification, we propose a feature fusion approach that combines speaker embeddings extracted from wav2vec-BERT and ReDimNet models to create a unified speaker representation. This system achieves competitive results on the TDSV 2024 Challenge test set, with a normalized min-DCF of 0.0452 (rank 2), highlighting its effectiveness in balancing accuracy and robustness.

[AI-12] Probing for Consciousness in Machines

Link: https://arxiv.org/abs/2411.16262
Authors: Mathis Immertreu,Achim Schilling,Andreas Maier,Patrick Krauss
Keywords: Antonio Damasio theory, Antonio Damasio, proposed by Antonio, Damasio theory, develop core consciousness
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:This study explores the potential for artificial agents to develop core consciousness, as proposed by Antonio Damasio’s theory of consciousness. According to Damasio, the emergence of core consciousness relies on the integration of a self model, informed by representations of emotions and feelings, and a world model. We hypothesize that an artificial agent, trained via reinforcement learning (RL) in a virtual environment, can develop preliminary forms of these models as a byproduct of its primary task. The agent’s main objective is to learn to play a video game and explore the environment. To evaluate the emergence of world and self models, we employ probes, i.e. feedforward classifiers that use the activations of the trained agent’s neural networks to predict the spatial positions of the agent itself. Our results demonstrate that the agent can form rudimentary world and self models, suggesting a pathway toward developing machine consciousness. This research provides foundational insights into the capabilities of artificial agents in mirroring aspects of human consciousness, with implications for future advancements in artificial intelligence.

[AI-13] Batch Bayesian Optimization via Expected Subspace Improvement

Link: https://arxiv.org/abs/2411.16206
Authors: Dawei Zhan,Zhaoxi Zeng,Shuoxiao Wei,Ping Wu
Keywords: Extending Bayesian optimization, Bayesian optimization, Extending Bayesian, Bayesian optimization algorithm, parallel computing technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Extending Bayesian optimization to batch evaluation can enable the designer to make the most of parallel computing technology. Most current batch approaches use artificial functions to simulate the sequential Bayesian optimization algorithm’s behavior to select a batch of points for parallel evaluation. However, as the batch size grows, the accumulated error introduced by these artificial functions increases rapidly, which dramatically decreases the optimization efficiency of the algorithm. In this work, we propose a simple and efficient approach to extend Bayesian optimization to batch evaluation. Different from existing batch approaches, the idea of the new approach is to draw a batch of subspaces of the original problem and select one acquisition point from each subspace. To achieve this, we propose the expected subspace improvement criterion to measure the amount of improvement that a candidate point can achieve within a certain subspace. By optimizing these expected subspace improvement functions simultaneously, we can get a batch of query points for expensive evaluation. Numerical experiments show that our proposed approach can achieve near-linear speedup when compared with the sequential Bayesian optimization algorithm, and performs very competitively when compared with eight state-of-the-art batch algorithms. This work provides a simple yet efficient approach for batch Bayesian optimization. A Matlab implementation of our approach is available at this https URL

[AI-14] MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

Link: https://arxiv.org/abs/2411.16158
Authors: Yu Zhang,Mingzi Wang,Lancheng Zou,Wulong Liu,Hui-Ling Zhen,Mingxuan Yuan,Bei Yu
Keywords: Transformer-based large language, large language models, model sizes continue, achieved remarkable success, deployment remains challenging
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:Transformer-based large language models (LLMs) have achieved remarkable success as model sizes continue to grow, yet their deployment remains challenging due to significant computational and memory demands. Quantization has emerged as a promising solution, and state-of-the-art quantization algorithms for LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), where lower-precision weights are multiplied with higher-precision activations. Despite its benefits, current hardware accelerators such as GPUs and TPUs lack native support for efficient mpGEMM, leading to inefficient dequantization operations in the main sequential loop. To address this limitation, we introduce MixPE, a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference. MixPE leverages two key innovations to minimize dequantization overhead and unlock the full potential of low-bit quantization. First, recognizing that scale and zero point are shared within each quantization group, we propose performing dequantization after per-group mpGEMM, significantly reducing dequantization overhead. Second, instead of relying on conventional multipliers, MixPE utilizes efficient shift-and-add operations for multiplication, optimizing both computation and energy efficiency. Our experimental results demonstrate that MixPE surpasses the state-of-the-art quantization accelerators by 2.6× speedup and 1.4× energy reduction.
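
The two ideas in the MixPE abstract can be illustrated in software with a minimal sketch. It is not the authors' hardware design: it assumes the common affine quantization form w = scale * (q - zero), and the function names (`dequant_then_dot`, `dot_then_dequant`, `shift_add_mul`) are hypothetical. Because scale and zero point are shared across a group, they can be factored out of the dot product and applied once after integer accumulation.

```python
def dequant_then_dot(q, a, scale, zero):
    """Baseline: dequantize every weight, then multiply-accumulate in floating point."""
    return sum(scale * (qi - zero) * ai for qi, ai in zip(q, a))


def dot_then_dequant(q, a, scale, zero):
    """Accumulate in integers first; apply scale and zero point once per group."""
    acc = sum(qi * ai for qi, ai in zip(q, a))  # integer multiply-accumulate
    a_sum = sum(a)                              # needed for the zero-point correction
    return scale * (acc - zero * a_sum)


def shift_add_mul(a, w):
    """Multiply integer a by a non-negative integer w using only shifts and adds."""
    result, bit = 0, 0
    while w:
        if w & 1:               # this bit of the weight is set
            result += a << bit  # add the correspondingly shifted activation
        w >>= 1
        bit += 1
    return result


if __name__ == "__main__":
    q = [3, 7, 0, 12]   # 4-bit quantized weights of one group (illustrative values)
    a = [5, -2, 9, 4]   # integer activations (illustrative values)
    print(dequant_then_dot(q, a, 0.25, 8), dot_then_dequant(q, a, 0.25, 8))
    print(shift_add_mul(5, 12))
```

Both dot-product functions return the same value, but the second performs one floating-point multiplication per group instead of one per weight.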

[AI-15] Graph Adapter of EEG Foundation Models for Parameter Efficient Fine Tuning

Link: https://arxiv.org/abs/2411.16155
Authors: Toyotaro Suzumura,Hiroki Kanezashi,Shotaro Akahori
Keywords: diagnosing mental diseases, Graph Neural Networks, neural network models, neural network, complex neural network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Under review

Abstract:In diagnosing mental diseases from electroencephalography (EEG) data, neural network models such as Transformers have been employed to capture temporal dynamics. Additionally, it is crucial to learn the spatial relationships between EEG sensors, for which Graph Neural Networks (GNNs) are commonly used. However, fine-tuning large-scale complex neural network models to capture both temporal and spatial features simultaneously increases computational costs because of the larger number of trainable parameters. Combined with the limited availability of EEG datasets for downstream tasks, this makes it challenging to fine-tune large models effectively. We propose EEG-GraphAdapter (EGA), a parameter-efficient fine-tuning (PEFT) approach to address these challenges. EGA is integrated into pre-trained temporal backbone models as a GNN-based module, and only EGA is fine-tuned while the backbone model parameters are kept frozen. This enables the acquisition of spatial representations of EEG signals for downstream tasks, significantly reducing computational overhead and data requirements. Experimental evaluations on healthcare-related downstream tasks of Major Depressive Disorder and Abnormality Detection demonstrate that our EGA improves performance by up to 16.1% in the F1-score compared with the backbone BENDR model.

[AI-16] SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations

Link: https://arxiv.org/abs/2411.16147
Authors: Youngjun Sim,Jinsung Yoon,Young-Joo Suh
Keywords: target speaker utterance, single target speaker, enables the transformation, single target, speaker utterance
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages

Abstract:One-shot voice conversion (VC) is a method that enables the transformation between any two speakers using only a single target speaker utterance. Existing methods often rely on complex architectures and pre-trained speaker verification (SV) models to improve the fidelity of converted speech. Recent works utilizing K-means quantization (KQ) with self-supervised learning (SSL) features have proven capable of capturing content information from speech. However, they often struggle to preserve speaking variation, such as prosodic detail and phonetic variation, particularly with smaller codebooks. In this work, we propose a simple yet effective one-shot VC model that utilizes the characteristics of SSL features and speech attributes. Our approach addresses the issue of losing speaking variation, enabling high-fidelity voice conversion trained with only reconstruction losses, without requiring external speaker embeddings. We demonstrate the performance of our model across 6 evaluation metrics, with results highlighting the benefits of the speaking variation compensation method.

[AI-17] End-to-End Steering for Autonomous Vehicles via Conditional Imitation Co-Learning

Link: https://arxiv.org/abs/2411.16131
Authors: Mahmoud M. Kishky,Hesham M. Eraqi,Khaled F. Elsayed
Keywords: involves complex tasks, driving involves complex, Autonomous driving involves, behavior prediction, data fusion
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: NCTA 2024 Best Paper Honorable Mention

Abstract:Autonomous driving involves complex tasks such as data fusion, object and lane detection, behavior prediction, and path planning. As opposed to the modular approach, which dedicates individual subsystems to tackle each of those tasks, the end-to-end approach treats the problem as a single learnable task using deep neural networks, reducing system complexity and minimizing dependency on heuristics. Conditional imitation learning (CIL) trains the end-to-end model to mimic a human expert while taking into account the navigational commands that guide the vehicle to its destination. CIL adopts specialist network branches, each dedicated to learning the driving task for one navigational command. Nevertheless, the CIL model lacks generalization when deployed to unseen environments. This work introduces the conditional imitation co-learning (CIC) approach to address this issue by enabling the model to learn the relationships between CIL specialist branches via a co-learning matrix generated by gated hyperbolic tangent units (GTUs). Additionally, we propose posing the steering regression problem as classification. We use a classification-regression hybrid loss to bridge the gap between regression and classification, and we propose using co-existence probability to consider the spatial tendency between the steering classes. Our model is demonstrated to improve the autonomous driving success rate in unseen environments by 62% on average compared to the CIL method.

[AI-18] Why the Agent Made that Decision: Explaining Deep Reinforcement Learning with Vision Masks

Link: https://arxiv.org/abs/2411.16120
Authors: Rui Zuo,Zifan Wang,Simon Khan,Garrett Ethan Katz,Qinru Qiu
Keywords: deep neural networks, deep reinforcement learning, deep neural, deep reinforcement, neural networks
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Due to the inherent lack of transparency in deep neural networks, it is challenging for deep reinforcement learning (DRL) agents to gain trust and acceptance from users, especially in safety-critical applications such as medical diagnosis and military operations. Existing methods for explaining an agent’s decision either require retraining the agent using models that support explanation generation or rely on perturbation-based techniques to reveal the significance of different input features in the decision making process. However, retraining the agent may compromise its integrity and performance, while perturbation-based methods have limited performance and lack knowledge accumulation or learning capabilities. Moreover, since each perturbation is performed independently, the joint state of the perturbed inputs may not be physically meaningful. To address these challenges, we introduce VisionMask, a standalone explanation model trained end-to-end to identify the most critical regions in the agent’s visual input that can explain its actions. VisionMask is trained in a self-supervised manner without relying on human-generated labels. Importantly, its training does not alter the agent model, hence preserving the agent’s performance and integrity. We evaluate VisionMask on Super Mario Bros (SMB) and three Atari games. Compared to existing methods, VisionMask achieves a 14.9% higher insertion accuracy and a 30.08% higher F1-Score in reproducing original actions from the selected visual explanations. We also present examples illustrating how VisionMask can be used for counterfactual analysis.

[AI-19] LLMPirate: LLMs for Black-box Hardware IP Piracy NDSS

Link: https://arxiv.org/abs/2411.16111
Authors: Vasudev Gohil,Matthew DeLorenzo,Veera Vishwa Achuta Sai Venkat Nallam,Joey See,Jeyavijayan Rajendran
Keywords: piracy detection tools, large language models, detection tools, piracy detection, language models
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by NDSS Symposium 2025

Abstract:The rapid advancement of large language models (LLMs) has enabled the ability to effectively analyze and generate code nearly instantaneously, resulting in their widespread adoption in software development. Following this advancement, researchers and companies have begun integrating LLMs across the hardware design and verification process. However, these highly potent LLMs can also induce new attack scenarios upon security vulnerabilities across the hardware development process. One such attack vector that has not been explored is intellectual property (IP) piracy. Given that this attack can manifest as rewriting hardware designs to evade piracy detection, it is essential to thoroughly evaluate LLM capabilities in performing this task and assess the mitigation abilities of current IP piracy detection tools. Therefore, in this work, we propose LLMPirate, the first LLM-based technique able to generate pirated variations of circuit designs that successfully evade detection across multiple state-of-the-art piracy detection tools. We devise three solutions to overcome challenges related to integration of LLMs for hardware circuit designs, scalability to large circuits, and effectiveness, resulting in an end-to-end automated, efficient, and practical formulation. We perform an extensive experimental evaluation of LLMPirate using eight LLMs of varying sizes and capabilities and assess their performance in pirating various circuit designs against four state-of-the-art, widely-used piracy detection tools. Our experiments demonstrate that LLMPirate is able to consistently evade detection on 100% of tested circuits across every detection tool. Additionally, we showcase the ramifications of LLMPirate using case studies on IBEX and MOR1KX processors and a GPS module, that we successfully pirate. We envision that our work motivates and fosters the development of better IP piracy detection tools. 

[AI-20] An Empirical Study of Vulnerability Detection using Federated Learning

Link: https://arxiv.org/abs/2411.16099
Authors: Peiheng Zhou,Ming Hu,Xingrun Quan,Yawen Peng,Xiaofei Xie,Yanxin Yang,Chengwei Liu,Yueming Wu,Mingsong Chen
Keywords: vulnerability detection, Deep Learning, vulnerability, FL-based vulnerability detection, detection
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Although Deep Learning (DL) methods are becoming increasingly popular in vulnerability detection, their performance is seriously limited by insufficient training data. This is mainly because few existing software organizations can maintain a complete set of high-quality samples for DL-based vulnerability detection. Due to concerns about privacy leakage, most of them are reluctant to share data, resulting in the data silo problem. Since it enables collaborative model training without data sharing, Federated Learning (FL) has been investigated as a promising means of addressing the data silo problem in DL-based vulnerability detection. However, since existing FL-based vulnerability detection methods focus on specific applications, it is still far from clear i) how well FL adapts to common vulnerability detection tasks and ii) how to design a high-performance FL solution for a specific vulnerability detection task. To answer these two questions, this paper first proposes VulFL, an effective evaluation framework for FL-based vulnerability detection. Then, based on VulFL, this paper conducts a comprehensive study to reveal the underlying capabilities of FL in dealing with different types of CWEs, especially when facing various data heterogeneity scenarios. Our experimental results show that, compared to independent training, FL can significantly improve the detection performance of common AI models on all investigated CWEs, though the performance of FL-based vulnerability detection is limited by heterogeneous data. To highlight the performance differences between different FL solutions for vulnerability detection, we extensively investigate the impacts of different configuration strategies for each framework component of VulFL. Our study sheds light on the potential of FL in vulnerability detection, which can be used to guide the design of FL-based solutions for vulnerability detection.

[AI-21] HiDP: Hierarchical DNN Partitioning for Distributed Inference on Heterogeneous Edge Platforms DATE

Link: https://arxiv.org/abs/2411.16086
Authors: Zain Taufique,Aman Vyas,Antonio Miele,Pasi Liljeberg,Anil Kanduri
Keywords: Deep Neural Network, distribute Deep Neural, Neural Network, Deep Neural, distribute Deep
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 8 figures, 1 table, and 1 algorithm. The manuscript is accepted to be published in 28th Design, Automation and Test in Europe Conference (IEEE DATE, 2025)

Abstract:Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low latency inference, without considering the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks also do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical DNN partitioning strategy (HiDP) for distributed inference on heterogeneous edge nodes. Our strategy hierarchically partitions DNN workloads at both global and local levels by considering the core-level heterogeneity of edge nodes. We evaluated our proposed HiDP strategy against relevant distributed inference techniques over widely used DNN models on commercial edge devices. On average our strategy achieved 38% lower latency, 46% lower energy, and 56% higher throughput in comparison with other relevant approaches.

[AI-22] Performance Implications of Multi-Chiplet Neural Processing Units on Autonomous Driving Perception DATE’2025

Link: https://arxiv.org/abs/2411.16007
Authors: Mohanad Odema,Luke Chen,Hyoukjun Kwon,Mohammad Abdullah Al Faruque
Keywords: Neural Processing Units, emerging chiplet-based Neural, chiplet-based Neural Processing, constrained automotive settings, chiplet-based Neural
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: DATE’2025

Abstract:We study the application of emerging chiplet-based Neural Processing Units to accelerate vehicular AI perception workloads in constrained automotive settings. The motivation stems from how chiplet technology is becoming integral to emerging vehicular architectures, providing a cost-effective trade-off between performance, modularity, and customization; and from perception models being the most computationally demanding workloads in an autonomous driving system. Using the Tesla Autopilot perception pipeline as a case study, we first break down its constituent models and profile their performance on different chiplet accelerators. From the insights, we propose a novel scheduling strategy to efficiently deploy perception workloads on multi-chip AI accelerators. Our experiments using a standard DNN performance simulator, MAESTRO, show our approach realizes 82% and 2.8x increases in throughput and processing engine utilization, respectively, compared to monolithic accelerator designs.

[AI-23] FedLLM: Efficient LLM Inference Based on Federated Learning

Link: https://arxiv.org/abs/2411.16003
Authors: Shengwen Ding,Chenhui Hu
Keywords: Large Language Models, Large Language, Language Models, herald a transformative, artificial intelligence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) herald a transformative era in artificial intelligence (AI). However, the expansive scale of data and parameters of LLMs demands substantial computational and memory resources, restricting their accessibility to a broader range of users and researchers. This paper introduces an effective approach that enhances the operational efficiency and affordability of LLM inference. By utilizing transformer-based federated learning (FL) with model-parallel distributed training, our model efficiently distributes the computational loads and memory requirements across a network of participants. This strategy permits users, especially those with limited resources, to train state-of-the-art LLMs collaboratively. We also introduce an incentive mechanism within the FL framework, rewarding constructive contributions and filtering out malicious activities, thereby safeguarding the integrity and reliability of the training process. Concurrently, we leverage memory hierarchy strategies and Singular Value Decomposition (SVD) on weight matrices to further boost computational and memory efficiency. Our results, derived from formulaic analyses and numerical calculations, demonstrate significant optimization of resource use and democratize access to cutting-edge LLMs, ensuring that a wide range of users can both contribute to and benefit from these advanced models.
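
The abstract does not say how SVD is applied to the weight matrices. As a rough illustration only, the sketch below assumes the usual truncated (rank-k) factorization, where an m×n matrix is replaced by an m×k factor and a k×n factor with the singular values folded into one of them. The function names are hypothetical, and only the storage arithmetic is shown, not the decomposition itself.

```python
def dense_params(m, n):
    """Number of values stored for a full m x n weight matrix."""
    return m * n


def low_rank_params(m, n, k):
    """Values stored for a rank-k factorization (m x k plus k x n factors)."""
    return k * (m + n)


def max_saving_rank(m, n):
    """Largest rank k for which the factorized form is strictly smaller."""
    return (m * n - 1) // (m + n)


if __name__ == "__main__":
    m = n = 4096  # illustrative layer size, not taken from the paper
    print(dense_params(m, n), low_rank_params(m, n, 256), max_saving_rank(m, n))
```

Under this assumption the factorized form saves memory only while k(m + n) < mn, so the benefit depends on how small a rank the weights tolerate.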

[AI-24] PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making NEURIPS

Link: https://arxiv.org/abs/2411.15998
Authors: Jonathan Light,Sixue Xing,Yuanzhe Liu,Weiqin Chen,Min Cai,Xiusi Chen,Guanzhi Wang,Wei Cheng,Yisong Yue,Ziniu Hu
Keywords: complex decision-making tasks, decision-making tasks remains, Effective extraction, complex decision-making, decision-making tasks
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Published at Language Gamification Workshop 2024 @ NeurIPS

Abstract:Effective extraction of the world knowledge in LLMs for complex decision-making tasks remains a challenge. We propose a framework PIANIST for decomposing the world model into seven intuitive components conducive to zero-shot LLM generation. Given only the natural language description of the game and how input observations are formatted, our method can generate a working world model for fast and efficient MCTS simulation. We show that our method works well on two different games that challenge the planning and decision making skills of the agent for both language and non-language based action taking, without any training on domain-specific training data or explicitly defined world model.

[AI-25] Ensuring Fair LLM Serving Amid Diverse Applications

Link: https://arxiv.org/abs/2411.15997
Authors: Redwan Ibne Seraj Khan,Kunal Jain,Haiying Shen,Ankur Mallick,Anjaly Parayil,Anoop Kulkarni,Steve Kofsky,Pankhuri Choudhary,Renèe St. Amant,Rujia Wang,Yue Cheng,Ali R. Butt,Victor Rühle,Chetan Bansal,Saravan Rajmohan
Keywords: large language model, serving platform hosting, multi-tenant large language, language model, creating unfairness
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)

Abstract:In a multi-tenant large language model (LLM) serving platform hosting diverse applications, some users may submit an excessive number of requests, causing the service to become unavailable to other users and creating unfairness. Existing fairness approaches do not account for variations in token lengths across applications and multiple LLM calls, making them unsuitable for such platforms. To address the fairness challenge, this paper analyzes millions of requests from thousands of users on MS CoPilot, a real-world multi-tenant LLM platform hosted by Microsoft. Our analysis confirms the inadequacy of existing methods and guides the development of FairServe, a system that ensures fair LLM access across diverse applications. FairServe proposes application-characteristic aware request throttling coupled with a weighted service counter based scheduling technique to curb abusive behavior and ensure fairness. Our experimental results on real-world traces demonstrate FairServe’s superior performance compared to the state-of-the-art method in ensuring fairness. We are actively working on deploying our system in production, expecting to benefit millions of customers world-wide.
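
The abstract does not spell out the scheduler, so the following is only a rough sketch of what a weighted service counter could look like: each user accumulates a token-weighted cost for the requests served so far, and the backlogged user with the smallest counter goes next. The class name, the input and output token weights, and the tie-breaking rule are all assumptions made for illustration, not FairServe's actual design.

```python
from collections import deque


class WeightedServiceScheduler:
    """Hypothetical sketch: serve the backlogged user with the least weighted service."""

    def __init__(self, input_weight=1.0, output_weight=2.0):
        # Assumed weights: output tokens are treated as costlier than input tokens.
        self.input_weight = input_weight
        self.output_weight = output_weight
        self.counters = {}  # user -> accumulated weighted service
        self.queues = {}    # user -> deque of pending (input_tokens, output_tokens)

    def submit(self, user, input_tokens, output_tokens):
        self.counters.setdefault(user, 0.0)
        self.queues.setdefault(user, deque()).append((input_tokens, output_tokens))

    def next_request(self):
        backlogged = [user for user, queue in self.queues.items() if queue]
        if not backlogged:
            return None
        # Lowest counter wins; ties are broken by user name for determinism.
        user = min(backlogged, key=lambda u: (self.counters[u], u))
        input_tokens, output_tokens = self.queues[user].popleft()
        self.counters[user] += (self.input_weight * input_tokens
                                + self.output_weight * output_tokens)
        return user


def demo_schedule():
    """A heavy user floods the queue, yet the light user is served second."""
    scheduler = WeightedServiceScheduler()
    for _ in range(4):
        scheduler.submit("heavy", 100, 100)
    scheduler.submit("light", 10, 10)
    return [scheduler.next_request() for _ in range(5)]
```

Counting weighted tokens rather than requests is what lets a scheme like this account for different token lengths across applications, which the abstract names as the gap in existing fairness approaches.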

[AI-26] Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format HPCA2025

Link: https://arxiv.org/abs/2411.15982
Authors: Chao Fang,Man Shi,Robin Geens,Arne Symons,Zhongfeng Wang,Marian Verhelst
Keywords: weight-only quantized large, leverage low-bit integer, quantized large language, large language models, weight-only quantized
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025)

Abstract:The widely-used, weight-only quantized large language models (LLMs), which leverage low-bit integer (INT) weights and retain floating-point (FP) activations, reduce storage requirements while maintaining accuracy. However, this shifts the energy and latency bottlenecks towards the FP activations that are associated with costly memory accesses and computations. Existing LLM accelerators focus primarily on computation optimizations, overlooking the potential of jointly optimizing FP computations and data movement, particularly for the dominant FP-INT GeMM operations in LLM inference. To address these challenges, we investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy. Based on our findings, we first propose the Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation. Secondly, we develop an iterative post-training adaptive precision search algorithm that optimizes the bit-width for different LLM modules to balance model accuracy, energy efficiency, and inference speed. Lastly, a suite of hardware optimization techniques is proposed to maximally exploit the benefits of the Anda format. These include a bit-plane-based data organization scheme, Anda-enhanced processing units with bit-serial computation, and a runtime bit-plane Anda compressor to simultaneously optimize storage, computation, and memory footprints. Our evaluations on FP-INT GeMM operations show that Anda achieves a 2.4x speedup, 4.0x area efficiency, and 3.1x energy efficiency improvement on average for popular LLMs including OPT, LLaMA, and LLaMA-2 series over the GPU-like FP-FP baseline. Anda demonstrates strong adaptability across various application scenarios, accuracy requirements, and system performance, enabling efficient LLM inference across a wide range of deployment scenarios.
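
To give a feel for what "group-shared exponent bits" with a variable mantissa width might mean, here is a minimal block-floating-point sketch: a group of activations shares one exponent (taken from the largest magnitude), and each value keeps only a signed integer mantissa of a chosen width. This is a generic illustration under that assumption, not the Anda bit layout; the function names and the clamping rule are hypothetical.

```python
import math


def quantize_group(values, mantissa_bits):
    """Quantize a group of floats to signed integer mantissas with one shared exponent."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp gives max_abs = m * 2**shared_exp with 0.5 <= m < 1.
    _, shared_exp = math.frexp(max_abs)
    # One mantissa step is worth 2**(shared_exp - mantissa_bits).
    scale = 2.0 ** (shared_exp - mantissa_bits)
    limit = 2 ** mantissa_bits - 1
    # Round to the nearest step and clamp to the representable mantissa range.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exp, mantissas


def dequantize_group(shared_exp, mantissas, mantissa_bits):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]
```

Lowering `mantissa_bits` for a group makes the step size coarser while the shared exponent stays the same, which is the accuracy versus cost trade-off that a per-module precision search would tune.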

[AI-27] Advancing Transformative Education: Generative AI as a Catalyst for Equity and Innovation

Link: https://arxiv.org/abs/2411.15971
Authors: Chiranjeevi Bura,Praveen Kumar Myakala
Keywords: enhancing administrative efficiency, enabling personalized learning, fostering creative engagement, personalized learning, enhancing administrative
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:Generative AI is transforming education by enabling personalized learning, enhancing administrative efficiency, and fostering creative engagement. This paper explores the opportunities and challenges these tools bring to pedagogy, proposing actionable frameworks to address existing equity gaps. Ethical considerations such as algorithmic bias, data privacy, and AI's role in human-centric education are emphasized. The findings underscore the need for responsible AI integration that ensures accessibility, equity, and innovation in educational systems.

[AI-28] Partial Identifiability and Misspecification in Inverse Reinforcement Learning

Link: https://arxiv.org/abs/2411.15951
Authors: Joar Skalse,Alessandro Abate
Keywords: Inverse Reinforcement Learning, Inverse Reinforcement, aim of Inverse, Reinforcement Learning, IRL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. This problem is difficult, for several reasons. First of all, there are typically multiple reward functions which are compatible with a given policy; this means that the reward function is only partially identifiable, and that IRL contains a certain fundamental degree of ambiguity. Secondly, in order to infer $R$ from $\pi$, an IRL algorithm must have a behavioural model of how $\pi$ relates to $R$. However, the true relationship between human preferences and human behaviour is very complex, and practically impossible to fully capture with a simple model. This means that the behavioural model in practice will be misspecified, which raises the worry that it might lead to unsound inferences if applied to real-world data. In this paper, we provide a comprehensive mathematical analysis of partial identifiability and misspecification in IRL. Specifically, we fully characterise and quantify the ambiguity of the reward function for all of the behavioural models that are most common in the current IRL literature. We also provide necessary and sufficient conditions that describe precisely how the observed demonstrator policy may differ from each of the standard behavioural models before that model leads to faulty inferences about the reward function $R$. In addition to this, we introduce a cohesive framework for reasoning about partial identifiability and misspecification in IRL, together with several formal tools that can be used to easily derive the partial identifiability and misspecification robustness of new IRL models, or analyse other kinds of reward learning algorithms.

[AI-29] A Training-Free Approach for Music Style Transfer with Latent Diffusion Models

Link: https://arxiv.org/abs/2411.15913
Authors: Sooyoung Kim,Joonwoo Kwon,Heehwan Wang,Shinjae Yoo,Yuewei Lin,Jiook Cha
Keywords: detailed textual descriptions, offering exciting possibilities, Latent Diffusion Models, requires extensive training, pre-trained Latent Diffusion
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Codes will be released upon acceptance

Abstract:Music style transfer, while offering exciting possibilities for personalized music generation, often requires extensive training or detailed textual descriptions. This paper introduces a novel training-free approach leveraging pre-trained Latent Diffusion Models (LDMs). By manipulating the self-attention features of the LDM, we effectively transfer the style of reference music onto content music without additional training. Our method achieves superior style transfer and melody preservation compared to existing methods. This work opens new creative avenues for personalized music generation.

[AI-30] Navigating the Effect of Parametrization for Dimensionality Reduction NEURIPS2024

Link: https://arxiv.org/abs/2411.15894
Authors: Haiyang Huang,Yingfan Wang,Cynthia Rudin
Keywords: Parametric dimensionality reduction, traditional approaches typically, dimensionality reduction methods, unseen datasets, dimensionality reduction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024

Abstract:Parametric dimensionality reduction methods have gained prominence for their ability to generalize to unseen datasets, an advantage that traditional approaches typically lack. Despite their growing popularity, there remains a prevalent misconception among practitioners about the equivalence in performance between parametric and non-parametric methods. Here, we show that these methods are not equivalent – parametric methods retain global structure but lose significant local details. To explain this, we provide evidence that parameterized approaches lack the ability to repulse negative pairs, and the choice of loss function also has an impact. Addressing these issues, we developed a new parametric method, ParamRepulsor, that incorporates Hard Negative Mining and a loss function that applies a strong repulsive force. This new method achieves state-of-the-art performance on local structure preservation for parametric methods without sacrificing the fidelity of global structural representation. Our code is available at this https URL.

[AI-31] Distribution-aware Online Continual Learning for Urban Spatio-Temporal Forecasting

Link: https://arxiv.org/abs/2411.15893
Authors: Chengxin Wang,Gary Tan,Swagato Barman Roy,Beng Chin Ooi
Keywords: forecasting is crucial, trip planning, intelligent scheduling, scheduling and trip, Urban
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Urban spatio-temporal (ST) forecasting is crucial for various urban applications such as intelligent scheduling and trip planning. Previous studies focus on modeling ST correlations among urban locations in offline settings, which often neglect the non-stationary nature of urban ST data, particularly, distribution shifts over time. This oversight can lead to degraded performance in real-world scenarios. In this paper, we first analyze the distribution shifts in urban ST data, and then introduce DOST, a novel online continual learning framework tailored for ST data characteristics. DOST employs an adaptive ST network equipped with a variable-independent adapter to address the unique distribution shifts at each urban location dynamically. Further, to accommodate the gradual nature of these shifts, we also develop an awake-hibernate learning strategy that intermittently fine-tunes the adapter during the online phase to reduce computational overhead. This strategy integrates a streaming memory update mechanism designed for urban ST sequential data, enabling effective network adaptation to new patterns while preventing catastrophic forgetting. Experimental results confirm DOST’s superiority over state-of-the-art models on four real-world datasets, providing online forecasts within an average of 0.1 seconds and achieving a 12.89% reduction in forecast errors compared to baseline models.

[AI-32] Creating Scalable AGI: the Open General Intelligence Framework

Link: https://arxiv.org/abs/2411.15832
Authors: Daniel A. Dollinger,Michael Singleton
Keywords: solves current scalability, current scalability issues, scalability issues plaguing, Open General Intelligence, plaguing the field
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, IEEE SYSCON 2025 Submission

Abstract:This paper introduces a novel general artificial intelligence systems architecture that provides generalized flexibility and solves current scalability issues plaguing the field. The architecture, OGI (Open General Intelligence), utilizes a dynamic processing system to control and delegate across specialized artificial intelligence modules. It is intended to be used as a reference design for intelligent systems, providing human-like cognitive flexibility for generalized artificial intelligence across various real-world applications.

[AI-33] Efficient and Private: Memorisation under differentially private parameter-efficient fine-tuning in language models

Link: https://arxiv.org/abs/2411.15831
Authors: Olivia Ma,Jonathan Passerat-Palmbach,Dmitrii Usynin
Keywords: leak sensitive training, specific tasks introduces, sensitive training data, tasks introduces privacy, specific tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Fine-tuning large language models (LLMs) for specific tasks introduces privacy risks, as models may inadvertently memorise and leak sensitive training data. While Differential Privacy (DP) offers a solution to mitigate these risks, it introduces significant computational and performance trade-offs, particularly with standard fine-tuning approaches. Previous work has primarily focused on full-parameter updates, which are computationally intensive and may not fully leverage DP's potential in large models. In this work, we address these shortcomings by investigating Parameter-Efficient Fine-Tuning (PEFT) methods under DP constraints. We show that PEFT methods achieve comparable performance to standard fine-tuning while requiring fewer parameters and significantly reducing privacy leakage. Furthermore, we incorporate a data poisoning experiment involving intentional mislabelling to assess model memorisation and directly measure privacy risks. Our findings indicate that PEFT methods not only provide a promising alternative but also serve as a complementary approach for privacy-preserving, resource-efficient fine-tuning of LLMs.
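
The abstract does not say how the DP constraint is enforced. Assuming a DP-SGD-style recipe (clip each example's gradient, sum, add Gaussian noise, average), the sketch below shows one noisy gradient step over a flat list of trainable parameters. Under PEFT that list would cover only the small set of adapter parameters, so fewer coordinates are clipped and noised. The function names and the `rng` argument are illustrative choices.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_average_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(clipped) for n in noisy]


# Hypothetical usage: two examples, two trainable (adapter) parameters.
rng = random.Random(0)
update = dp_average_gradient([[3.0, 4.0], [0.3, 0.4]], 1.0, 1.1, rng)
```

Turning the returned noisy average into an actual privacy guarantee additionally requires a privacy accountant, which this sketch leaves out.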

[AI-34] Broad Critic Deep Actor Reinforcement Learning for Continuous Control

Link: https://arxiv.org/abs/2411.15806
Authors: Shiron Thalagala,Pak Kin Wong,Xiaozheng Wang
Keywords: demonstrates promising results, demonstrates promising, deep reinforcement learning, proposed algorithm, promising results
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:In the domain of continuous control, deep reinforcement learning (DRL) demonstrates promising results. However, the dependence of DRL on deep neural networks (DNNs) results in the demand for extensive data and increased computational complexity. To address this issue, a novel hybrid architecture for actor-critic reinforcement learning (RL) algorithms is introduced. The proposed architecture integrates the broad learning system (BLS) with DNN, aiming to merge the strengths of both distinct architectural paradigms. Specifically, the critic network is implemented using BLS, while the actor network is constructed with a DNN. For the estimations of the critic network parameters, ridge regression is employed, and the parameters of the actor network are optimized through gradient descent. The effectiveness of the proposed algorithm is evaluated by applying it to two classic continuous control tasks, and its performance is compared with the widely recognized deep deterministic policy gradient (DDPG) algorithm. Numerical results show that the proposed algorithm is superior to the DDPG algorithm in terms of computational efficiency, along with an accelerated learning trajectory. Application of the proposed algorithm in other actor-critic RL algorithms is suggested for investigation in future studies.

[AI-35] Benchmarking Active Learning for NILM

Link: https://arxiv.org/abs/2411.15805
Authors: Dhruv Patel,Ankita Kumari Jain,Haikoo Khandor,Xhitij Choudhary,Nipun Batra
Keywords: Non-intrusive load monitoring, Non-intrusive load, disaggregating total household, total household power, household power consumption
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Non-intrusive load monitoring (NILM) focuses on disaggregating total household power consumption into appliance-specific usage. Many advanced NILM methods are based on neural networks that typically require substantial amounts of labeled appliance data, which can be challenging and costly to collect in real-world settings. We hypothesize that appliance data from all households does not uniformly contribute to NILM model improvements. Thus, we propose an active learning approach to selectively install appliance monitors in a limited number of houses. This work is the first to benchmark the use of active learning for strategically selecting appliance-level data to optimize NILM performance. We first develop uncertainty-aware neural networks for NILM and then install sensors in homes where disaggregation uncertainty is highest. Benchmarking our method on the publicly available Pecan Street Dataport dataset, we demonstrate that our approach significantly outperforms a standard random baseline and achieves performance comparable to models trained on the entire dataset. Using this approach, we achieve comparable NILM accuracy with approximately 30% of the data, and for a fixed number of sensors, we observe up to a 2x reduction in disaggregation errors compared to random sampling.
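
The selection rule described above (install sensors where disaggregation uncertainty is highest) can be sketched as a simple acquisition step. The abstract only says the networks are uncertainty-aware; estimating uncertainty as the variance across an ensemble of predictions is an assumption here, and both function names are hypothetical.

```python
from statistics import pvariance


def disaggregation_uncertainty(ensemble_predictions):
    """Average, over time steps, of the variance across ensemble members.

    ensemble_predictions: one list of per-time-step power predictions per model.
    """
    per_step = zip(*ensemble_predictions)
    variances = [pvariance(step) for step in per_step]
    return sum(variances) / len(variances)


def select_houses(uncertainty_by_house, budget):
    """Pick the `budget` houses with the highest uncertainty (ties broken by name)."""
    ranked = sorted(uncertainty_by_house.items(), key=lambda item: (-item[1], item[0]))
    return [house for house, _ in ranked[:budget]]
```

In an active learning loop, the selected houses would receive appliance monitors, their labeled data would be added to the training set, and the uncertainty scores would be recomputed before the next round. A random baseline simply replaces `select_houses` with uniform sampling.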

[AI-36] A review on Machine Learning based User-Centric Multimedia Streaming Techniques

Link: https://arxiv.org/abs/2411.15801
Authors: Monalisa Ghosh,Chetna Singhal
Keywords: information exchange, modern era, increasing demand, streaming, multimedia
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Computer Communications

Abstract:Multimedia content and streaming are a major means of information exchange in the modern era, and there is an increasing demand for such services. This, coupled with the advancement of future wireless networks (B5G/6G) and the proliferation of intelligent handheld mobile devices, has facilitated the availability of multimedia content to heterogeneous mobile users. Apart from conventional video, 360° videos have gained popularity with the emerging virtual reality applications. All formats of videos (conventional and 360°) undergo processing, compression, and transmission across dynamic wireless channels with restricted bandwidth to facilitate the streaming services. This causes video impairments, leading to quality degradation, and poses challenges in delivering good Quality-of-Experience (QoE) to the viewers. The QoE is a prominent subjective quality measure to assess multimedia services. This requires end-to-end QoE evaluation. Efficient multimedia streaming techniques can improve the service quality while dealing with dynamic network and end-user challenges. A paradigm shift in user-centric multimedia services is envisioned with a focus on Machine Learning (ML) based QoE modeling and streaming strategies. This survey paper presents a comprehensive overview of the overall and continuous, time-varying QoE modeling for the purpose of QoE management in multimedia services. It also examines the recent research on intelligent and adaptive multimedia streaming strategies, with a special emphasis on ML-based techniques for video (conventional and 360°) streaming. This paper discusses the overall and continuous QoE modeling to optimize the end-user viewing experience, efficient video streaming with a focus on user-centric strategies, associated datasets for modeling and streaming, along with existing shortcomings and open challenges.

[AI-37] Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning

Link: https://arxiv.org/abs/2411.15796
Authors: Qi Li,Cheng-Long Wang,Yinzhi Cao,Di Wang
Keywords: machine learning systems, learning systems, systematically explore, issues of dataset, machine learning
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:In this work, we systematically explore the data privacy issues of dataset pruning in machine learning systems. Our findings reveal, for the first time, that even if data in the redundant set is solely used before model training, its pruning-phase membership status can still be detected through attacks. Since this is a fully upstream process before model training, traditional model output-based privacy inference methods are completely unsuitable. To address this, we introduce a new task called Data-Centric Membership Inference and propose the first ever data-centric privacy inference paradigm named Data Lineage Inference (DaLI). Under this paradigm, four threshold-based attacks are proposed, named WhoDis, CumDis, ArraDis and SpiDis. We show that even without access to downstream models, adversaries can accurately identify the redundant set with only limited prior knowledge. Furthermore, we find that different pruning methods involve varying levels of privacy leakage, and even the same pruning method can present different privacy risks at different pruning fractions. We conducted an in-depth analysis of these phenomena and introduced a metric called the Brimming score to offer guidance for selecting pruning methods with privacy protection in mind.

[AI-38] Decoding Urban Industrial Complexity: Enhancing Knowledge-Driven Insights via IndustryScopeGPT

Link: https://arxiv.org/abs/2411.15758
Authors: Siqi Wang,Chao Liang,Yunfan Gao,Yang Liu,Jing Li,Haofen Wang
Keywords: urban economic growth, economic growth, industrial park, Industrial Park Planning, Large Language Models
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 9 pages, 6 figures, the 32nd ACM International Conference on Multimedia

Abstract:Industrial parks are critical to urban economic growth. Yet, their development often encounters challenges stemming from imbalances between industrial requirements and urban services, underscoring the need for strategic planning and operations. This paper introduces IndustryScopeKG, a pioneering large-scale multi-modal, multi-level industrial park knowledge graph, which integrates diverse urban data including street views, corporate, socio-economic, and geospatial information, capturing the complex relationships and semantics within industrial parks. Alongside this, we present the IndustryScopeGPT framework, which leverages Large Language Models (LLMs) with Monte Carlo Tree Search to enhance tool-augmented reasoning and decision-making in Industrial Park Planning and Operation (IPPO). Our work significantly improves site recommendation and functional planning, demonstrating the potential of combining LLMs with structured datasets to advance industrial park management. This approach sets a new benchmark for intelligent IPPO research and lays a robust foundation for advancing urban industrial development. The dataset and related code are available at this https URL.

[AI-39] Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting

Link: https://arxiv.org/abs/2411.15743
Authors: Liran Nochumsohn,Michal Moshkovitz,Orly Avner,Dotan Di Castro,Omri Azencot
Keywords: Time series forecasting, numerous real-world applications, requiring accurate predictions, Time series, observed patterns
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Time series forecasting is critical in numerous real-world applications, requiring accurate predictions of future values based on observed patterns. While traditional forecasting techniques work well in in-domain scenarios with ample data, they struggle when data is scarce or not available at all, motivating the emergence of zero-shot and few-shot learning settings. Recent advancements often leverage large-scale foundation models for such tasks, but these methods require extensive data and compute resources, and their performance may be hindered by ineffective learning from the available training set. This raises a fundamental question: What factors influence effective learning from data in time series forecasting? Toward addressing this, we propose using Fourier analysis to investigate how models learn from synthetic and real-world time series data. Our findings reveal that forecasters commonly suffer from poor learning from data with multiple frequencies and poor generalization to unseen frequencies, which impedes their predictive performance. To alleviate these issues, we present a novel synthetic data generation framework, designed to enhance real data or replace it completely by creating task-specific frequency information, requiring only the sampling rate of the target data. Our approach, Freq-Synth, improves the robustness of both foundation as well as nonfoundation forecast models in zero-shot and few-shot settings, facilitating more reliable time series forecasting under limited data scenarios.
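
As a loose illustration of frequency-driven synthetic data, the sketch below builds a series as a sum of sinusoids whose periods are chosen from the sampling rate of the target data (for example, hourly data suggests periods of 24 and 168 samples). How Freq-Synth actually selects frequencies is not described in the abstract; the function name, the amplitude range, and the choice of periods are assumptions.

```python
import math
import random


def synth_series(length, periods, rng, noise_std=0.0):
    """Sum of sinusoids with the given periods (in samples), random amplitude and phase."""
    components = [
        (rng.uniform(0.5, 1.5), rng.uniform(0.0, 2 * math.pi), period)
        for period in periods
    ]
    series = []
    for t in range(length):
        value = sum(
            amplitude * math.sin(2 * math.pi * t / period + phase)
            for amplitude, phase, period in components
        )
        # Optional Gaussian noise on top of the periodic signal.
        series.append(value + rng.gauss(0.0, noise_std))
    return series


# Hypothetical usage for hourly data: daily (24) and weekly (168) cycles.
hourly_like = synth_series(336, [24, 168], random.Random(0), noise_std=0.1)
```

Generating many such series with different amplitudes and phases gives a forecaster exposure to the frequencies it is expected to meet at test time, which is the kind of task-specific frequency information the abstract refers to.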

[AI-40] Fusion Matters: Learning Fusion in Deep Click-through Rate Prediction Models WSDM2025

Link: https://arxiv.org/abs/2411.15731
Authors: Kexin Zhang,Fuyuan Lyu,Xing Tang,Dugang Liu,Chen Ma,Kaize Ding,Xiuqiang He,Xue Liu
Keywords: previous Click-Through Rate, modeling feature interactions, Click-Through Rate, proposing complex components, shallow or deep
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by WSDM 2025

Abstract:The evolution of previous Click-Through Rate (CTR) models has mainly been driven by proposing complex components, whether shallow or deep, that are adept at modeling feature interactions. However, there has been less focus on improving fusion design. Instead, two naive solutions, stacked and parallel fusion, are commonly used. Both solutions rely on pre-determined fusion connections and fixed fusion operations. It has been repetitively observed that changes in fusion design may result in different performances, highlighting the critical role that fusion plays in CTR models. While there have been attempts to refine these basic fusion strategies, these efforts have often been constrained to specific settings or dependent on specific components. Neural architecture search has also been introduced to partially deal with fusion design, but it comes with limitations. The complexity of the search space can lead to inefficient and ineffective results. To bridge this gap, we introduce OptFusion, a method that automates the learning of fusion, encompassing both the connection learning and the operation selection. We have proposed a one-shot learning algorithm tackling these tasks concurrently. Our experiments are conducted over three large-scale datasets. Extensive experiments prove both the effectiveness and efficiency of OptFusion in improving CTR model performance. Our code implementation is available at this https URL.

[AI-41] Understanding Student Acceptance, Trust, and Attitudes Toward AI-Generated Images for Educational Purposes

Link: https://arxiv.org/abs/2411.15710
Authors: Aung Pyae
Keywords: Recent advancements, artificial intelligence, including the creative, advancements in artificial, broadened the applicability
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in artificial intelligence (AI) have broadened the applicability of AI-generated images across various sectors, including the creative industry and design. However, their utilization in educational contexts, particularly among undergraduate students in computer science and software engineering, remains underexplored. This study adopts an exploratory approach, employing questionnaires and interviews, to assess students’ acceptance, trust, and positive attitudes towards AI-generated images for educational tasks such as presentations, reports, and web design. The results reveal high acceptance, trust, and positive attitudes among students who value the ease of use and potential academic benefits. However, concerns regarding the lack of technical precision, where the AI fails to accurately produce images as specified by prompts, moderately impact their practical application in detail-oriented educational tasks. These findings suggest a need for developing comprehensive guidelines that address ethical considerations and intellectual property issues, while also setting quality standards for AI-generated images to enhance their educational use. Enhancing the capabilities of AI tools to meet precise user specifications could foster creativity and improve educational outcomes in technical disciplines.

[AI-42] Nimbus: Secure and Efficient Two-Party Inference for Transformers NIPS2024

Link: https://arxiv.org/abs/2411.15707
Authors: Zhengyi Li,Kang Yang,Jin Tan,Wen-jie Lu,Haoqi Wu,Xiao Wang,Yu Yu,Derun Zhao,Yancheng Zheng,Minyi Guo,Jingwen Leng
Keywords: machine learning tasks, gained significant attention, significant attention due
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by NIPS 2024

Abstract:Transformer models have gained significant attention due to their power in machine learning tasks. Their extensive deployment has raised concerns about the potential leakage of sensitive information during inference. However, when applied to Transformers, existing approaches based on secure two-party computation (2PC) face efficiency limitations on two fronts: (1) resource-intensive matrix multiplications in linear layers, and (2) complex non-linear activation functions like $\mathsf{GELU}$ and $\mathsf{Softmax}$. This work presents a new two-party inference framework $\mathsf{Nimbus}$ for Transformer models. For the linear layer, we propose a new 2PC paradigm along with an encoding approach to securely compute matrix multiplications based on an outer-product insight, which achieves $2.9\times \sim 12.5\times$ performance improvements compared to the state-of-the-art (SOTA) protocol. For the non-linear layer, through a new observation of utilizing the input distribution, we propose an approach of low-degree polynomial approximation for $\mathsf{GELU}$ and $\mathsf{Softmax}$, which improves the performance of the SOTA polynomial approximation by $2.9\times \sim 4.0\times$, where the average accuracy loss of our approach is 0.08% compared to the non-2PC inference without privacy. Compared with the SOTA two-party inference, $\mathsf{Nimbus}$ improves the end-to-end performance of BERT inference by $2.7\times \sim 4.7\times$ across different network settings.

[AI-43] Quantile deep learning models for multi-step ahead time series prediction

Link: https://arxiv.org/abs/2411.15674
Authors: Jimmy Cheung,Smruthi Rangarajan,Amelia Maddocks,Xizhe Chen,Rohitash Chandra
Keywords: deep learning models, deep learning, learning models, learning, time series prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST); Methodology (stat.ME)

Abstract:Uncertainty quantification is crucial in time series prediction, and quantile regression offers a valuable mechanism for uncertainty quantification which is useful for extreme value forecasting. Although deep learning models have been prominent in multi-step ahead prediction, the development and evaluation of quantile deep learning models have been limited. We present a novel quantile regression deep learning framework for multi-step time series prediction. In this way, we elevate the capabilities of deep learning models by incorporating quantile regression, thus providing a more nuanced understanding of predictive values. We provide an implementation of prominent deep learning models for multi-step ahead time series prediction and evaluate their performance under high volatility and extreme conditions. We include multivariate and univariate modelling strategies and provide a comparison with conventional deep learning models from the literature. Our models are tested on two cryptocurrencies: Bitcoin and Ethereum, using daily close-price data and selected benchmark time series datasets. The results show that integrating a quantile loss function with deep learning provides additional predictions for selected quantiles without a loss in the prediction accuracy when compared to the literature. Our quantile model has the ability to handle volatility more effectively and provides additional information for decision-making and uncertainty quantification through the use of quantiles when compared to conventional deep learning models.
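
For readers unfamiliar with the quantile loss mentioned above, the standard pinball loss for a quantile level q penalizes under-prediction with weight q and over-prediction with weight 1 - q. The sketch below shows it for a single quantile and averaged over several quantile levels. How the authors combine quantiles and prediction horizons inside their networks is not stated in the abstract, so the averaging in `multi_quantile_loss` is an assumption.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss for one quantile level in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        error = y - y_hat
        # Under-prediction (error > 0) costs quantile * error;
        # over-prediction (error < 0) costs (1 - quantile) * |error|.
        total += max(quantile * error, (quantile - 1.0) * error)
    return total / len(y_true)


def multi_quantile_loss(y_true, preds_by_quantile):
    """Average pinball loss over several quantile levels (assumed equal weighting)."""
    losses = [
        pinball_loss(y_true, preds, quantile)
        for quantile, preds in preds_by_quantile.items()
    ]
    return sum(losses) / len(losses)
```

At q = 0.5 the loss reduces to half the mean absolute error, so the median prediction behaves like a conventional point forecast while the outer quantiles supply the uncertainty band.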

[AI-44] IRSKG: Unified Intrusion Response System Knowledge Graph Ontology for Cyber Defense

Link: https://arxiv.org/abs/2411.15672
Authors: Damodar Panigrahi,Shaswata Mitra,Subash Neupane,Sudip Mittal,Benjamin A. Blakely
Keywords: Autonomous Intelligent Cyber-defense, Intelligent Cyber-defense Agents, increasingly difficult, difficult to detect, detect and prevent
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures

Abstract:Cyberattacks are becoming increasingly difficult to detect and prevent due to their sophistication. In response, Autonomous Intelligent Cyber-defense Agents (AICAs) are emerging as crucial solutions. One prominent AICA agent is the Intrusion Response System (IRS), which is critical for mitigating threats after detection. IRS uses several Tactics, Techniques, and Procedures (TTPs) to mitigate attacks and restore the infrastructure to normal operations. Continuous monitoring of the enterprise infrastructure is an essential TTP the IRS uses. However, each system serves different purposes to meet operational needs. Integrating these disparate sources for continuous monitoring increases pre-processing complexity and limits automation, eventually prolonging critical response time for attackers to exploit. We propose a unified IRS Knowledge Graph ontology (IRSKG) that streamlines the onboarding of new enterprise systems as a source for the AICAs. Our ontology can capture system monitoring logs and supplemental data, such as a rules repository containing the administrator-defined policies to dictate the IRS responses. Besides, our ontology permits us to incorporate dynamic changes to adapt to the evolving cyber-threat landscape. This robust yet concise design allows machine learning models to train effectively and recover a compromised system to its desired state autonomously with explainability.

[AI-45] Aligning Generalisation Between Humans and Machines

Link: https://arxiv.org/abs/2411.15626
Authors: Filip Ilievski,Barbara Hammer,Frank van Harmelen,Benjamin Paassen,Sascha Saralajew,Ute Schmid,Michael Biehl,Marianna Bolognesi,Xin Luna Dong,Kiril Gashteovski,Pascal Hitzler,Giuseppe Marra,Pasquale Minervini,Martin Mundt,Axel-Cyrille Ngonga Ngomo,Alessandro Oltramari,Gabriella Pasi,Zeynep G. Saribatur,Luciano Serafini,John Shawe-Taylor,Vered Shwartz,Gabriella Skitalinskaya,Clemens Stachl,Gido M. van de Ven,Thomas Villmann
Keywords: including generative approaches, Recent advances, decision support, including generative, generative approaches
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in AI – including generative approaches – have resulted in technology that can support humans in scientific discovery and decision support but may also disrupt democracies and target individuals. The responsible use of AI increasingly shows the need for human-AI teaming, necessitating effective interaction between humans and machines. A crucial yet often overlooked aspect of these interactions is the different ways in which humans and machines generalise. In cognitive science, human generalisation commonly involves abstraction and concept learning. In contrast, AI generalisation encompasses out-of-domain generalisation in machine learning, rule-based reasoning in symbolic AI, and abstraction in neuro-symbolic AI. In this perspective paper, we combine insights from AI and cognitive science to identify key commonalities and differences across three dimensions: notions of generalisation, methods for generalisation, and evaluation of generalisation. We map the different conceptualisations of generalisation in AI and cognitive science along these three dimensions and consider their role in human-AI teaming. This results in interdisciplinary challenges across AI and cognitive science that must be tackled to provide a foundation for effective and cognitively supported alignment in human-AI teaming scenarios.

[AI-46] Class Order Disorder in Wikidata and First Fixes

Link: https://arxiv.org/abs/2411.15550
Authors: Peter F. Patel-Schneider,Ege Atacan Doğan
Keywords: Wikidata, class order, large ontology, Wikidata ontology, ontology with classes
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Wikidata has a large ontology with classes at several orders. The Wikidata ontology has long been known to have violations of class order and information related to class order that appears suspect. SPARQL queries were evaluated against Wikidata to determine the prevalence of several kinds of violations and suspect information, and the results were analyzed. Some changes were manually made to Wikidata to remove some of these results and the queries were rerun, showing the effect of the changes. Suggestions are provided on how the problems uncovered might be addressed, either through better tooling or involvement of the Wikidata community.

[AI-47] Instruct or Interact? Exploring and Eliciting LLMs' Capability in Code Snippet Adaptation Through Prompt Engineering ICSE2025

Link: https://arxiv.org/abs/2411.15501
Authors: Tanghaoran Zhang,Yue Yu,Xinjun Mao,Shangwen Wang,Kang Yang,Yao Lu,Zhang Zhang,Yuxin Zhao
Keywords: software development process, Code snippet adaptation, Code snippet, Code, code generation task
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages, 10 figures, accepted by ICSE 2025

Abstract:Code snippet adaptation is a fundamental activity in the software development process. Unlike code generation, code snippet adaptation is not a “free creation”: it requires developers to tailor a given code snippet to fit specific requirements and the code context. Recently, large language models (LLMs) have confirmed their effectiveness in the code generation task with promising results. However, their performance on adaptation, a reuse-oriented and context-dependent code change prediction task, is still unclear. To bridge this gap, we conduct an empirical study to investigate the performance and issues of LLMs on the adaptation task. We first evaluate the adaptation performance of three popular LLMs and compare it to their performance on the code generation task. Our results indicate that their adaptation ability is weaker than generation, with a nearly 15% decrease in pass@1 and more context-related errors. By manually inspecting 200 cases, we further investigate the causes of LLMs’ sub-optimal performance, which can be classified into three categories, i.e., Unclear Requirement, Requirement Misalignment and Context Misapplication. Based on the above empirical research, we propose an interactive prompting approach to elicit LLMs’ adaptation ability. Experimental results reveal that our approach greatly improves LLMs’ adaptation performance. The best-performing Human-LLM interaction successfully solves 159 out of the 202 identified defects and improves pass@1 and pass@5 by over 40% compared to the initial instruction-based prompt. Considering human effort, we suggest multi-agent interaction as a trade-off, which can achieve comparable performance with excellent generalization ability. We deem that our approach could provide methodological assistance for autonomous code snippet reuse and adaptation with LLMs.
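
For readers unfamiliar with the pass@1 and pass@5 metrics cited above, the sketch below shows the widely used unbiased pass@k estimator: given n sampled solutions of which c pass the tests, it estimates the probability that at least one of k randomly drawn samples is correct. Whether the paper computes the metric in exactly this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset of the samples contains no correct one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a benchmark, the per-problem values are typically averaged over all problems to obtain the reported score.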

[AI-48] A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks

Link: https://arxiv.org/abs/2411.15470
Authors: Rohit Dandamudi,Gema Rodríguez-Pérez
Keywords: Code Language Models, programming language settings, low-resource programming language, poses significant challenges, Language Models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: 5 pages, ASEW 2024

Abstract:Evaluating the performance of Code Language Models (CLMs) for software engineering tasks, especially in multilingual and low-resource programming language settings, poses significant challenges. These challenges are primarily due to the lack of high-quality benchmarks across various programming languages and the imbalanced nature of the CLMs’ training corpus. Although recent advances in one of the common downstream tasks, code generation, have shown promise by introducing translated benchmarks using different methodologies, there is currently a lack of empirical evidence assessing these benchmarks. To address this gap, we conducted a preliminary study to evaluate the performance of Poly-Coder, a pioneering open-source, multilingual CLM built for code generation. We utilized two existing state-of-the-art translations of the popular code generation benchmark, HumanEval, facilitated by the OctoPack and MultiPL-E studies. Our results suggest that the outcomes observed in these translated benchmarks align well with evaluation metrics used during the training phase, such as perplexity, thereby validating their effectiveness in estimating the performance of CLMs. However, we identified several inconsistencies in the CLMs’ performance across the translated benchmarks and encountered challenges in replicating the results. These initial insights highlight the need for more comprehensive empirical studies to fully understand translated benchmarks’ methodological approaches, limitations, and reproducibility. Such studies are essential to ensure their reliability before they are widely adopted.

[AI-49] TANGNN: a Concise, Scalable and Effective Graph Neural Networks with Top-m Attention Mechanism for Graph Representation Learning WWW

Link: https://arxiv.org/abs/2411.15458
Authors: Jiawei E,Yinglong Zhang,Xuewen Xia,Xing Xu
Keywords: processing structured data, flexible architectural designs, Graph Transformer models, Graph Transformer, Graph Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The code and ArXivNet dataset are available at this https URL

Abstract:In the field of deep learning, Graph Neural Networks (GNNs) and Graph Transformer models, with their outstanding performance and flexible architectural designs, have become leading technologies for processing structured data, especially graph data. Traditional GNNs often face challenges in capturing information from distant vertices effectively. In contrast, Graph Transformer models are particularly adept at managing long-distance node relationships. Despite these advantages, Graph Transformer models still encounter issues with computational and storage efficiency when scaled to large graph datasets. To address these challenges, we propose an innovative Graph Neural Network (GNN) architecture that integrates a Top-m attention mechanism aggregation component and a neighborhood aggregation component, effectively enhancing the model’s ability to aggregate relevant information from both local and extended neighborhoods at each layer. This method not only improves computational efficiency but also enriches the node features, facilitating a deeper analysis of complex graph structures. Additionally, to assess the effectiveness of our proposed model, we have applied it to citation sentiment prediction, a novel task previously unexplored in the GNN field. Accordingly, we constructed a dedicated citation network, ArXivNet. In this dataset, we specifically annotated the sentiment polarity of the citations (positive, neutral, negative) to enable in-depth sentiment analysis. Our approach has shown superior performance across a variety of tasks including vertex classification, link prediction, sentiment prediction, graph regression, and visualization. It outperforms existing methods in terms of effectiveness, as demonstrated by experimental results on multiple datasets.
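
The idea of a Top-m attention step can be illustrated as follows: for one query vertex, score all candidate vertices, keep only the m highest-scoring ones, and aggregate their features with softmax weights. This is a generic sketch under stated assumptions (scaled dot-product scores, softmax over the retained candidates); the abstract does not specify the paper's actual scoring or aggregation, and `top_m_attention` is a hypothetical name.

```python
import math


def top_m_attention(query, keys, values, m):
    """Aggregate the values of the m keys most similar to the query.

    query:  list of floats (feature vector of the target vertex)
    keys:   list of feature vectors, one per candidate vertex
    values: list of value vectors, aligned with keys
    Returns (aggregated_vector, indices_of_selected_candidates).
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and every candidate.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Keep only the m best-scoring candidates.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:m]
    # Numerically stable softmax restricted to the selected candidates.
    max_score = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - max_score) for i in top]
    norm = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for d, v in enumerate(values[i]):
            out[d] += (w / norm) * v
    return out, top
```

Restricting the softmax to m candidates keeps the aggregation cost per vertex proportional to m rather than to the number of vertices, which is the efficiency argument the abstract makes against full Graph Transformer attention.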

[AI-50] Hindi audio-video-Deepfake (HAV-DF): A Hindi language-based Audio-video Deepfake Dataset

Link: https://arxiv.org/abs/2411.15457
Authors: Sukhandeep Kaur,Mubashir Buhari,Naman Khandelwal,Priyansh Tyagi,Kiran Sharma
Keywords: offer great potential, pose significant risks, Deepfakes offer great, Hindi, innovation and creativity
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Graphics (cs.GR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Deepfakes offer great potential for innovation and creativity, but they also pose significant risks to privacy, trust, and security. With a vast Hindi-speaking population, India is particularly vulnerable to deepfake-driven misinformation campaigns. Fake videos or speeches in Hindi can have an enormous impact on rural and semi-urban communities, where digital literacy tends to be lower and people are more inclined to trust video content. The development of effective frameworks and detection tools to combat deepfake misuse requires high-quality, diverse, and extensive datasets. Existing popular datasets such as FF-DF (FaceForensics++) and DFDC (DeepFake Detection Challenge) are based on the English language. Hence, this paper aims to create the first Hindi deepfake dataset, named “Hindi audio-video-Deepfake” (HAV-DF). The dataset has been generated using faceswap, lip-sync, and voice cloning methods. This multi-step process allows us to create a rich, varied dataset that captures the nuances of Hindi speech and facial expressions, providing a robust foundation for training and evaluating deepfake detection models in a Hindi language context. It is the only dataset of its kind, as all previous datasets contain either deepfake videos or synthesized audio. This type of deepfake dataset can be used for training a detector for both deepfake video and audio. Notably, the newly introduced HAV-DF dataset yields lower detection accuracy across existing detection methods such as Headpose and Xception-c40 compared to other well-known datasets such as FF-DF and DFDC. This trend suggests that the HAV-DF dataset is more challenging to detect, possibly due to its focus on Hindi language content and diverse manipulation techniques. The HAV-DF dataset fills the gap in Hindi-specific deepfake datasets, aiding multilingual deepfake detection development.

[AI-51] MUFM: A Mamba-Enhanced Feedback Model for Micro Video Popularity Prediction

Link: https://arxiv.org/abs/2411.15455
Authors: Jiacheng Lu,Mingyuan Xiao,Weijian Wang,Yuxin Du,Yi Cui,Jingnan Zhao,Cheng Hua
Keywords: surge in micro-videos, micro-videos is transforming, transforming the concept, vast multi-modal datasets, Mamba Hawkes process
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures

Abstract:The surge in micro-videos is transforming the concept of popularity. As researchers delve into vast multi-modal datasets, there is a growing interest in understanding the origins of this popularity and the forces driving its rapid expansion. Recent studies suggest that the virality of short videos is not only tied to their inherent multi-modal content but is also heavily influenced by the strength of platform recommendations driven by audience feedback. In this paper, we introduce a framework for capturing long-term dependencies in user feedback and dynamic event interactions, based on the Mamba Hawkes process. Our experiments on the large-scale open-source multi-modal dataset show that our model significantly outperforms state-of-the-art approaches across various metrics by 23.2%. We believe our model’s capability to map the relationships within user feedback behavior sequences will not only contribute to the evolution of next-generation recommendation algorithms and platform applications but also enhance our understanding of micro video dissemination and its broader societal impact.
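
As background for the Hawkes process named in the abstract, the sketch below evaluates the conditional intensity of a classical univariate Hawkes process with an exponential kernel, where each past event (for example, a user feedback action) temporarily raises the rate of future events. This is the textbook form, not the paper's Mamba-based model; the parameter names `mu`, `alpha`, and `beta` are the conventional ones and the function name is illustrative.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of an exponential-kernel Hawkes process.

    lambda(t) = mu + sum over past events t_i < t of
                alpha * exp(-beta * (t - t_i))

    mu:    baseline event rate
    alpha: jump in intensity caused by each event
    beta:  decay rate of that excitation
    """
    excitation = sum(alpha * math.exp(-beta * (t - t_i))
                     for t_i in event_times if t_i < t)
    return mu + excitation
```

The self-exciting term is what lets such models capture bursts of feedback, where one interaction makes further interactions more likely for a short time.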

[AI-52] Automatic High-quality Verilog Assertion Generation through Subtask-Focused Fine-Tuned LLMs and Iterative Prompting

Link: https://arxiv.org/abs/2411.15442
Authors: Mohammad Shahidzadeh,Behnam Ghavami,Steve Wilton,Lesley Shannon
Keywords: Formal Property Verification, Formal Property, Property Verification, crucial for ensuring, ensuring the completeness
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Formal Property Verification (FPV), using SystemVerilog Assertions (SVA), is crucial for ensuring the completeness of design with respect to the specification. However, writing SVA is a laborious task and has a steep learning curve. In this work, we present a large language model (LLM) -based flow to automatically generate high-quality SVA from the design specification documents, named \ToolName. We introduce a novel sub-task-focused fine-tuning approach that effectively addresses functionally incorrect assertions produced by baseline LLMs, leading to a remarkable 7.3-fold increase in the number of functionally correct assertions. Recognizing the prevalence of syntax and semantic errors, we also developed an iterative refinement method that enhances the LLM’s initial outputs by systematically re-prompting it to correct identified issues. This process is further strengthened by a custom compiler that generates meaningful error messages, guiding the LLM towards improved accuracy. The experiments demonstrate a 26% increase in the number of assertions free from syntax errors using this approach, showcasing its potential to streamline the FPV process.

[AI-53] GeoAI-Enhanced Community Detection on Spatial Networks with Graph Deep Learning

Link: https://arxiv.org/abs/2411.15428
Authors: Yunlei Liang,Jiawei Zhu,Wen Ye,Song Gao
Keywords: modeling geographic phenomena, spatial interaction plays, important role, Graph Attention Networks, Graph Convolutional Networks
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures

Abstract:Spatial networks are useful for modeling geographic phenomena where spatial interaction plays an important role. To analyze spatial networks and their internal structures, graph-based methods such as community detection have been widely used. Community detection aims to extract strongly connected components from the network and reveal the hidden relationships between nodes, but such methods usually do not take node attribute information into account. To consider edge-based interactions and node attributes together, this study proposes a family of GeoAI-enhanced unsupervised community detection methods called region2vec, based on Graph Attention Networks (GAT) and Graph Convolutional Networks (GCN). The region2vec methods generate node neural embeddings based on attribute similarity, geographic adjacency and spatial interactions, and then extract network communities from the node embeddings using agglomerative clustering. The proposed GeoAI-based methods are compared with multiple baselines and perform best when one wants to maximize node attribute similarity and spatial interaction intensity simultaneously within the spatial network communities. The approach is further applied to the shortage area delineation problem in public health and shows promise for regionalization problems.

[AI-54] Learning a local trading strategy: deep reinforcement learning for grid-scale renewable energy integration

Link: https://arxiv.org/abs/2411.15422
Authors: Caleb Ju,Constance Crozier
Keywords: Variable renewable generation, renewable generation increases, Variable renewable, balancing power supply, Grid-scale batteries co-located
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Accepted to HICSS58

Abstract:Variable renewable generation increases the challenge of balancing power supply and demand. Grid-scale batteries co-located with generation can help mitigate this misalignment. This paper explores the use of reinforcement learning (RL) for operating grid-scale batteries co-located with solar power. Our results show RL achieves an average of 61% (and up to 96%) of the approximate theoretical optimal (non-causal) operation, outperforming advanced control methods on average. Our findings suggest RL may be preferred when future signals are hard to predict. Moreover, RL has two significant advantages compared to simpler rules-based control: (1) that solar energy is more effectively shifted towards high demand periods, and (2) increased diversity of battery dispatch across different locations, reducing potential ramping issues caused by super-position of many similar actions.

[AI-55] The Decoy Dilemma in Online Medical Information Evaluation: A Comparative Study of Credibility Assessments by LLM and Human Judges

Link: https://arxiv.org/abs/2411.15396
Authors: Jiqun Liu,Jiangen He
Keywords: cognitively biased, LLMs, extent LLMs behave, human, LLM judgments compared
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Can AI be cognitively biased in automated information judgment tasks? Despite recent progress in measuring and mitigating social and algorithmic biases in AI and large language models (LLMs), it is not clear to what extent LLMs behave “rationally”, or if they are also vulnerable to human cognitive bias triggers. To address this open problem, our study, consisting of a crowdsourcing user experiment and an LLM-enabled simulation experiment, compared the credibility assessments by LLM and human judges under potential decoy effects in an information retrieval (IR) setting, and empirically examined the extent to which LLMs are cognitively biased in COVID-19 medical (mis)information assessment tasks compared to traditional human assessors as a baseline. The results, collected from a between-subject user experiment and an LLM-enabled replicate experiment, demonstrate that 1) larger and more recent LLMs tend to show a higher level of consistency and accuracy in distinguishing credible information from misinformation. However, they are more likely to give higher ratings to misinformation due to the presence of a more salient, decoy misinformation result; 2) while the decoy effect occurred in both human and LLM assessments, the effect is more prevalent across different conditions and topics in LLM judgments compared to human credibility ratings. In contrast to the generally assumed “rationality” of AI tools, our study empirically confirms the cognitive bias risks embedded in LLM agents, evaluates the decoy impact on LLMs against human credibility assessments, and thereby highlights the complexity and importance of debiasing AI agents and developing psychology-informed AI audit techniques and policies for automated judgment tasks and beyond.

[AI-56] Inducing Human-like Biases in Moral Reasoning Language Models NEURIPS2024

Link: https://arxiv.org/abs/2411.15386
Authors: Artem Karpov,Seong Hah Cho,Austin Meek,Raymond Koopmanschap,Lucy Farnik,Bogdan-Ionut Cirstea
Keywords: moral reasoning, humans performing moral, performing moral reasoning, large language models, humans performing
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted to the 2nd Workshop on Unifying Representations in Neural Models (UniReps) at NeurIPS 2024

Abstract:In this work, we study the alignment (BrainScore) of large language models (LLMs) fine-tuned for moral reasoning on behavioral data and/or brain data of humans performing the same task. We also explore if fine-tuning several LLMs on the fMRI data of humans performing moral reasoning can improve the BrainScore. We fine-tune several LLMs (BERT, RoBERTa, DeBERTa) on moral reasoning behavioral data from the ETHICS benchmark [Hendrycks et al., 2020], on the moral reasoning fMRI data from Koster-Hale et al. [2013], or on both. We study both the accuracy on the ETHICS benchmark and the BrainScores between model activations and fMRI data. While larger models generally performed better on both metrics, BrainScores did not significantly improve after fine-tuning.

[AI-57] Nd-BiMamba2: A Unified Bidirectional Architecture for Multi-Dimensional Data Processing

Link: https://arxiv.org/abs/2411.15380
Authors: Hao Liu
Keywords: Deep learning models, Deep learning, specially designed architectures, require specially designed, time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning models often require specially designed architectures to process data of different dimensions, such as 1D time series, 2D images, and 3D volumetric data. Existing bidirectional models mainly focus on sequential data, making it difficult to scale effectively to higher dimensions. To address this issue, we propose a novel multi-dimensional bidirectional neural network architecture, named Nd-BiMamba2, which efficiently handles 1D, 2D, and 3D data. Nd-BiMamba2 is based on the Mamba2 module and introduces innovative bidirectional processing mechanisms and adaptive padding strategies to capture bidirectional information in multi-dimensional data while maintaining computational efficiency. Unlike existing methods that require designing specific architectures for data of different dimensions, Nd-BiMamba2 adopts a unified architecture with a modular design, reducing development and maintenance costs. To verify the portability and flexibility of Nd-BiMamba2, we successfully exported it to ONNX and TorchScript and tested it on different hardware platforms (e.g., CPU, GPU, and mobile devices). Experimental results show that Nd-BiMamba2 runs efficiently on multiple platforms, demonstrating its potential in practical applications. The code is open-source: this https URL

[AI-58] AdamZ: An Enhanced Optimisation Method for Neural Network Training

Link: https://arxiv.org/abs/2411.15375
Authors: Ilia Zaznov(Department of Computer Science, University of Reading, Reading, UK),Atta Badii(Department of Computer Science, University of Reading, Reading, UK),Alfonso Dufour(ICMA Centre, Henley Business School, University of Reading, Reading, UK),Julian Kunkel(Department of Computer Science/GWDG, University of Göttingen, Goettingen, Germany)
Keywords: enhance convergence efficiency, neural network training, developed to enhance, Adam optimiser, advanced variant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 13 pages, 9 figures, 3 tables

Abstract:AdamZ is an advanced variant of the Adam optimiser, developed to enhance convergence efficiency in neural network training. This optimiser dynamically adjusts the learning rate by incorporating mechanisms to address overshooting and stagnation, which are common challenges in optimisation. Specifically, AdamZ reduces the learning rate when overshooting is detected and increases it during periods of stagnation, utilising hyperparameters such as overshoot and stagnation factors, thresholds, and patience levels to guide these adjustments. While AdamZ may lead to slightly longer training times compared to some other optimisers, it consistently excels in minimising the loss function, making it particularly advantageous for applications where precision is critical. Benchmarking results demonstrate the effectiveness of AdamZ in maintaining optimal learning rates, leading to improved model performance across diverse tasks.
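
The learning-rate rule described in the abstract (shrink on overshoot, grow on stagnation) might look roughly like the sketch below. This is not the AdamZ algorithm itself: the detection criteria (latest loss exceeding every loss in the preceding window for overshoot, a small standard deviation of recent losses for stagnation), the default values, and the name `adjust_learning_rate` are all assumptions made for illustration.

```python
from statistics import pstdev


def adjust_learning_rate(lr, loss_history, *, overshoot_factor=0.5,
                         stagnation_factor=1.2, stagnation_threshold=1e-3,
                         patience=5, min_lr=1e-7, max_lr=1.0):
    """Return an adjusted learning rate based on recent loss values.

    Assumed rules (hypothetical, for illustration only):
    - overshoot:  the latest loss is larger than every loss in the
      preceding `patience` steps -> multiply lr by overshoot_factor.
    - stagnation: the standard deviation of the last `patience` losses
      is below stagnation_threshold -> multiply lr by stagnation_factor.
    """
    if len(loss_history) < patience + 1:
        return lr  # not enough history to judge yet
    current = loss_history[-1]
    earlier = loss_history[-patience - 1:-1]
    recent = loss_history[-patience:]
    if current > max(earlier):
        lr *= overshoot_factor
    elif pstdev(recent) < stagnation_threshold:
        lr *= stagnation_factor
    # Keep the learning rate inside a sane range.
    return min(max(lr, min_lr), max_lr)
```

In a training loop, such a function would be called once per step or epoch with the running list of losses, and its return value written back into the optimiser's learning rate.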

[AI-59] Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers

Link: https://arxiv.org/abs/2411.15370
Authors: Gautham Vasan,Mohamed Elsayed,Alireza Azimi,Jiamin He,Fahim Shariar,Colin Bellinger,Martha White,A. Rupam Mahmood
Keywords: Modern deep policy, require large replay, large replay buffers, simulated robotic tasks, Modern deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Source code at this https URL and companion video at this https URL

Abstract:Modern deep policy gradient methods achieve effective performance on simulated robotic tasks, but they all require large replay buffers or expensive batch updates, or both, making them incompatible with real systems that have resource-limited computers. We show that these methods fail catastrophically when limited to small replay buffers or during incremental learning, where updates only use the most recent sample without batch updates or a replay buffer. We propose a novel incremental deep policy gradient method, Action Value Gradient (AVG), and a set of normalization and scaling techniques to address the challenges of instability in incremental learning. On robotic simulation benchmarks, we show that AVG is the only incremental method that learns effectively, often achieving final performance comparable to batch policy gradient methods. This advancement enabled us to show for the first time effective deep reinforcement learning with real robots using only incremental updates, employing a robotic manipulator and a mobile robot.

[AI-60] Designing Cellular Manufacturing System in Presence of Alternative Process Plans

Link: https://arxiv.org/abs/2411.15361
Authors: Md. Kutub Uddin,Md. Saiful Islam,Md Abrar Jahin,Md. Tanjid Hossen Irfan,Md. Saiful Islam Seam,M. F. Mridha
Keywords: cellular manufacturing systems, CMS involves grouping, manufacturing systems, numerous technological, cellular manufacturing
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In the design of cellular manufacturing systems (CMS), numerous technological and managerial decisions must be made at both the design and operational stages. The first step in designing a CMS involves grouping parts and machines. In this paper, four integer programming formulations are presented for grouping parts and machines in a CMS at both the design and operational levels for a generalized grouping problem, where each part has more than one process plan, and each operation of a process plan can be performed on more than one machine. The minimization of inter-cell and intra-cell movements is achieved by assigning the maximum possible number of consecutive operations of a part type to the same cell and to the same machine, respectively. The suitability of minimizing inter-cell and intra-cell movements as an objective, compared to other objectives such as minimizing investment costs on machines, operating costs, etc., is discussed. Numerical examples are included to illustrate the workings of the formulations.

[AI-61] Regulator-Manufacturer AI Agents Modeling: Mathematical Feedback-Driven Multi-Agent LLM Framework

Link: https://arxiv.org/abs/2411.15356
Authors: Yu Han,Zekun Guo
Keywords: global authorities presents, authorities presents significant, presents significant challenges, necessitating agile strategies, maintain market access
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:The increasing complexity of regulatory updates from global authorities presents significant challenges for medical device manufacturers, necessitating agile strategies to sustain compliance and maintain market access. Concurrently, regulatory bodies must effectively monitor manufacturers’ responses and develop strategic surveillance plans. This study employs a multi-agent modeling approach, enhanced with Large Language Models (LLMs), to simulate regulatory dynamics and examine the adaptive behaviors of key actors, including regulatory bodies, manufacturers, and competitors. These agents operate within a simulated environment governed by regulatory flow theory, capturing the impacts of regulatory changes on compliance decisions, market adaptation, and innovation strategies. Our findings illuminate the influence of regulatory shifts on industry behaviour and identify strategic opportunities for improving regulatory practices, optimizing compliance, and fostering innovation. By leveraging the integration of multi-agent systems and LLMs, this research provides a novel perspective and offers actionable insights for stakeholders navigating the evolving regulatory landscape of the medical device industry.

[AI-62] GeoScatt-GNN: A Geometric Scattering Transform-Based Graph Neural Network Model for Ames Mutagenicity Prediction

Link: https://arxiv.org/abs/2411.15331
Authors: Abdeljalil Zoubir, Badr Missaoui
Keywords: ground-breaking approaches, paper tackles, tackles the pressing, pressing challenge, introducing three ground-breaking
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)

Abstract:This paper tackles the pressing challenge of mutagenicity prediction by introducing three ground-breaking approaches. First, it showcases the superior performance of 2D scattering coefficients extracted from molecular images, compared to traditional molecular descriptors. Second, it presents a hybrid approach that combines geometric graph scattering (GGS), Graph Isomorphism Networks (GIN), and machine learning models, achieving strong results in mutagenicity prediction. Third, it introduces a novel graph neural network architecture, MOLG3-SAGE, which integrates GGS node features into a fully connected graph structure, delivering outstanding predictive accuracy. Experimental results on the ZINC dataset demonstrate significant improvements, emphasizing the effectiveness of blending 2D and geometric scattering techniques with graph neural networks. This study illustrates the potential of GNNs and GGS for mutagenicity prediction, with broad implications for drug discovery and chemical safety assessment.

[AI-63] Influence functions and regularity tangents for efficient active learning

Link: https://arxiv.org/abs/2411.15292
Authors: Frederik Eaton
Keywords: model, data point, paper we describe, describe an efficient, providing a regression
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 33 pages, 4 figures

Abstract:In this paper we describe an efficient method for providing a regression model with a sense of curiosity about its data. In the field of machine learning, our framework for representing curiosity is called Active Learning, which means automatically choosing data points for which to query labels in the semisupervised setting. The methods we propose are based on computing a “regularity tangent” vector that can be calculated (with only a constant slow-down) together with the model’s parameter vector during training. We then take the inner product of this tangent vector with the gradient vector of the model’s loss at a given data point to obtain a measure of the influence of that point on the complexity of the model. There is only a single regularity tangent vector, of the same dimension as the parameter vector. Thus, in the proposed technique, once training is complete, evaluating our “curiosity” about a potential query data point can be done as quickly as calculating the model’s loss gradient at that point. The new vector only doubles the amount of storage required by the model. We show that the quantity computed by our technique is an example of an “influence function”, and that it measures the expected squared change in model complexity incurred by up-weighting a given data point. We propose a number of ways for using this quantity to choose new training data for a model in the framework of active learning.
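
The scoring step described in this abstract reduces to a single inner product. The following is a minimal sketch of that step only, not the author's implementation. The names `curiosity_score` and `rank_candidates` are hypothetical, the tangent vector and the per-point loss gradients are assumed to be available as plain lists of floats, and ranking by absolute score is an assumption suggested by the abstract's "expected squared change" wording.

```python
def curiosity_score(regularity_tangent, loss_gradient):
    """Inner product of the tangent vector with one point's loss gradient."""
    if len(regularity_tangent) != len(loss_gradient):
        raise ValueError("vectors must have the same dimension")
    return sum(t * g for t, g in zip(regularity_tangent, loss_gradient))


def rank_candidates(regularity_tangent, candidate_gradients):
    """Return candidate indices, largest absolute score first (assumed ranking rule)."""
    scores = [curiosity_score(regularity_tangent, g) for g in candidate_gradients]
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
```

Because the tangent vector is computed once during training, scoring a candidate costs one gradient evaluation plus one dot product, which matches the cost claim in the abstract.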

[AI-64] Forecasting Unseen Points of Interest Visits Using Context and Proximity Priors

Link: https://arxiv.org/abs/2411.15285
Authors: Ziyao Li, Shang-Ling Hsu, Cyrus Shahabi
Keywords: Understanding human mobility, including crowd management, human mobility behavior, Understanding human, location-based recommendations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 2024 IEEE International Conference on Big Data workshop BSD 2024

Abstract:Understanding human mobility behavior is crucial for numerous applications, including crowd management, location-based recommendations, and the estimation of pandemic spread. Machine learning models can predict the Points of Interest (POIs) that individuals are likely to visit in the future by analyzing their historical visit patterns. Previous studies address this problem by learning a POI classifier, where each class corresponds to a POI. However, this limits their applicability to predict a new POI that was not in the training data, such as the opening of new restaurants. To address this challenge, we propose a model designed to predict a new POI outside the training data as long as its context is aligned with the user’s interests. Unlike existing approaches that directly predict specific POIs, our model first forecasts the semantic context of potential future POIs, then combines this with a proximity-based prior probability distribution to determine the exact POI. Experimental results on real-world visit data demonstrate that our model outperforms baseline methods that do not account for semantic contexts, achieving a 17% improvement in accuracy. Notably, as new POIs are introduced over time, our model remains robust, exhibiting a lower decline rate in prediction accuracy compared to existing methods.
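
The two-stage idea in this abstract (forecast a semantic context, then pick the exact POI with a proximity prior) can be sketched as below. This is an illustration under assumptions, not the paper's model: the function names are hypothetical, the context probabilities are assumed to come from some upstream forecaster, and the exponential distance decay with a `scale` parameter is a stand-in for whatever proximity prior the authors use.

```python
import math


def proximity_prior(user_xy, poi_xy, scale=1.0):
    """Assumed prior: weight decays exponentially with Euclidean distance."""
    return math.exp(-math.dist(user_xy, poi_xy) / scale)


def rank_pois(context_probs, pois, user_xy, scale=1.0):
    """Combine context probability and proximity prior into normalized POI scores.

    `pois` maps a POI name to (context_label, (x, y)). A POI never seen in
    training can still be scored, as long as its context label is known.
    """
    scores = {
        name: context_probs.get(context, 0.0) * proximity_prior(user_xy, xy, scale)
        for name, (context, xy) in pois.items()
    }
    total = sum(scores.values())
    if total == 0:
        return scores
    return {name: s / total for name, s in scores.items()}
```

For example, a newly opened cafe and an existing gym at the same distance from the user are ranked purely by how likely the user's next visit is to be a cafe versus a gym.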

[AI-65] ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation

Link: https://arxiv.org/abs/2411.15281
Authors: Junzhang Liu, Tingkai Liu, Yueyuan Sui, Stephen Xia
Keywords: variable inference time, adapts pretrained Transformer, inference time compute, pretrained Transformer models, post-training technique
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce ElastiFormer, a post-training technique that adapts pretrained Transformer models into an elastic counterpart with variable inference-time compute. ElastiFormer introduces small routing modules (as low as 0.00006% additional trainable parameters) to dynamically select subsets of network parameters and input tokens to be processed by each layer of the pretrained network in an input-dependent manner. The routing modules are trained using self-distillation losses to minimize the differences between the output of the pretrained model and their elastic counterparts. As ElastiFormer makes no assumption regarding the modality of the pretrained Transformer model, it can be readily applied to all modalities covering causal language modeling, image modeling as well as visual-language modeling tasks. We show that 20% to 50% compute saving could be achieved for different components of the transformer layer, which could be further reduced by adding very low rank LoRA weights (rank 1) trained via the same distillation objective. Finally, by comparing routing trained on different subsets of ImageNet, we show that ElastiFormer is robust against the training domain.
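
Input-dependent token routing of the kind described here can be illustrated with a top-k selection. The sketch below is an assumption-laden toy, not ElastiFormer itself: `select_tokens` and `elastic_layer` are hypothetical names, the router scores are taken as given, and passing unselected tokens through unchanged is one plausible bypass rule.

```python
def select_tokens(router_scores, k):
    """Return the indices of the k highest-scoring tokens, in original order."""
    if not 0 <= k <= len(router_scores):
        raise ValueError("k must be between 0 and the number of tokens")
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


def elastic_layer(tokens, router_scores, k, layer_fn):
    """Apply layer_fn only to the selected tokens; the rest skip the layer."""
    chosen = set(select_tokens(router_scores, k))
    return [layer_fn(t) if i in chosen else t for i, t in enumerate(tokens)]
```

Lowering `k` trades accuracy for compute, which is the knob a self-distillation loss would tune the router against.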

[AI-66] Curriculum-enhanced GroupDRO: Challenging the Norm of Avoiding Curriculum Learning in Subpopulation Shift Setups

Link: https://arxiv.org/abs/2411.15272
Authors: Antonio Barbalau
Keywords: spurious correlations featured, easily learnable spurious, learnable spurious correlations, Curriculum Learning, subpopulation shift scenarios
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In subpopulation shift scenarios, a Curriculum Learning (CL) approach would only serve to imprint the model weights, early on, with the easily learnable spurious correlations featured. To the best of our knowledge, none of the current state-of-the-art subpopulation shift approaches employ any kind of curriculum. To overcome this, we design a CL approach aimed at initializing the model weights in an unbiased vantage point in the hypothesis space which sabotages easy convergence towards biased hypotheses during the final optimization based on the entirety of the available data. We hereby propose a Curriculum-enhanced Group Distributionally Robust Optimization (CeGDRO) approach, which prioritizes the hardest bias-confirming samples and the easiest bias-conflicting samples, leveraging GroupDRO to balance the initial discrepancy in terms of difficulty. We benchmark our proposed method against the most popular subpopulation shift datasets, showing an increase over the state-of-the-art results across all scenarios, up to 6.2% on Waterbirds.
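
CeGDRO builds on GroupDRO, whose core is an exponentiated-gradient update of per-group weights. The sketch below shows only that standard GroupDRO component, not the curriculum proposed in the paper. The function names and the `step_size` default are illustrative choices.

```python
import math


def update_group_weights(weights, group_losses, step_size=0.1):
    """Upweight groups with higher loss, then renormalize to sum to 1."""
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def robust_loss(weights, group_losses):
    """Weighted sum of group losses that the model parameters are trained on."""
    return sum(w * loss for w, loss in zip(weights, group_losses))
```

Groups that the model currently handles worst gain weight over successive steps, so the training signal is pulled toward the worst-performing group.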

[AI-67] The Explabox: Model-Agnostic Machine Learning Transparency Analysis

Link: https://arxiv.org/abs/2411.15257
Authors: Marcel Robeer, Michiel Bron, Elize Herrewijnen, Riwish Hoeseni, Floris Bex
Keywords: responsible machine learning, machine learning, development and usage, transparent and responsible, responsible machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 5 pages, 3 figures

Abstract:We present the Explabox: an open-source toolkit for transparent and responsible machine learning (ML) model development and usage. Explabox aids in achieving explainable, fair and robust models by employing a four-step strategy: explore, examine, explain and expose. These steps offer model-agnostic analyses that transform complex ‘ingestibles’ (models and data) into interpretable ‘digestibles’. The toolkit encompasses digestibles for descriptive statistics, performance metrics, model behavior explanations (local and global), and robustness, security, and fairness assessments. Implemented in Python, Explabox supports multiple interaction modes and builds on open-source packages. It empowers model developers and testers to operationalize explainability, fairness, auditability, and security. The initial release focuses on text data and models, with plans for expansion. Explabox’s code and documentation are available open-source at this https URL.

[AI-68] A Unified Energy Management Framework for Multi-Timescale Forecasting in Smart Grids

Link: https://arxiv.org/abs/2411.15254
Authors: Dafang Zhao, Xihao Piao, Zheng Chen, Zhengmao Li, Ittetsu Taniguchi
Keywords: smart grid strategies, successful power system, power system management, peak shaving, Accurate forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to PES GM 2025

Abstract:Accurate forecasting of the electrical load, such as the magnitude and the timing of peak power, is crucial to successful power system management and implementation of smart grid strategies like demand response and peak shaving. In multi-time-scale optimization scheduling, rolling optimization is a common solution. However, rolling optimization needs to consider the coupling of different optimization objectives across time scales. It is challenging to accurately capture the mid- and long-term dependencies in time series data. This paper proposes Multi-pofo, a multi-scale power load forecasting framework that captures such dependencies via a novel architecture equipped with a temporal positional encoding layer. To validate the effectiveness of the proposed model, we conduct experiments on real-world electricity load data. The experimental results show that our approach outperforms several strong baseline methods.

[AI-69] Is Attention All You Need For Actigraphy? Foundation Models of Wearable Accelerometer Data for Mental Health Research

Link: https://arxiv.org/abs/2411.15240
Authors: Franklin Y. Ruan, Aiwei Zhang, Jenny Y. Oh, SouYoung Jin, Nicholas C Jacobson
Keywords: wearable devices continue, provided valuable data, Wearable accelerometry, wearable devices, provided valuable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Quantitative Methods (q-bio.QM)
Comments: Supplementary material can be found at the end of the document

Abstract:Wearable accelerometry (actigraphy) has provided valuable data for clinical insights since the 1970s and is increasingly important as wearable devices continue to become widespread. The effectiveness of actigraphy in research and clinical contexts is heavily dependent on the modeling architecture utilized. To address this, we developed the Pretrained Actigraphy Transformer (PAT)–the first pretrained and fully attention-based model designed specifically to handle actigraphy. PAT was pretrained on actigraphy from 29,307 participants in NHANES, enabling it to deliver state-of-the-art performance when fine-tuned across various actigraphy prediction tasks in the mental health domain, even in data-limited scenarios. For example, when trained to predict benzodiazepine usage using actigraphy from only 500 labeled participants, PAT achieved an 8.8 percentage-point AUC improvement over the best baseline. With fewer than 2 million parameters and built-in model explainability, PAT is robust yet easy to deploy in health research settings. GitHub: this https URL

[AI-70] IterIS: Iterative Inference-Solving Alignment for LoRA Merging

Link: https://arxiv.org/abs/2411.15231
Authors: Hongxu Chen, Runshi Li, Bowei Zhu, Zhen Wang, Long Chen
Keywords: Low-rank adaptations, specific downstream tasks, domains for specific, specific downstream, LoRA merging
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Low-rank adaptations (LoRA) are widely used to fine-tune large models across various domains for specific downstream tasks. While task-specific LoRAs are often available, concerns about data privacy and intellectual property can restrict access to training data, limiting the acquisition of a multi-task model through gradient-based training. In response, LoRA merging presents an effective solution by combining multiple LoRAs into a unified adapter while maintaining data privacy. Prior works on LoRA merging primarily frame it as an optimization problem, yet these approaches face several limitations, including the rough assumption about input features utilized in optimization, massive sample requirements, and the unbalanced optimization objective. These limitations can significantly degrade performance. To address these, we propose a novel optimization-based method, named IterIS: 1) We formulate LoRA merging as an advanced optimization problem to mitigate the rough assumption. Additionally, we employ an iterative inference-solving framework in our algorithm. It can progressively refine the optimization objective for improved performance. 2) We introduce an efficient regularization term to reduce the need for massive sample requirements (requiring only 1-5% of the unlabeled samples compared to prior methods). 3) We utilize adaptive weights in the optimization objective to mitigate potential unbalances in LoRA merging process. Our method demonstrates significant improvements over multiple baselines and state-of-the-art methods in composing tasks for text-to-image diffusion, vision-language models, and large language models. Furthermore, our layer-wise algorithm can achieve convergence with minimal steps, ensuring efficiency in both memory and computation.

[AI-71] A No Free Lunch Theorem for Human-AI Collaboration

Link: https://arxiv.org/abs/2411.15230
Authors: Kenny Peng, Nikhil Garg, Jon Kleinberg
Keywords: combined performance exceeds, gold standard, combined performance, performance exceeds, human and algorithm
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:The gold standard in human-AI collaboration is complementarity – when combined performance exceeds both the human and algorithm alone. We investigate this challenge in binary classification settings where the goal is to maximize 0-1 accuracy. Given two or more agents who can make calibrated probabilistic predictions, we show a “No Free Lunch”-style result. Any deterministic collaboration strategy (a function mapping calibrated probabilities into binary classifications) that does not essentially always defer to the same agent will sometimes perform worse than the least accurate agent. In other words, complementarity cannot be achieved “for free.” The result does suggest one model of collaboration with guarantees, where one agent identifies “obvious” errors of the other agent. We also use the result to understand the necessary conditions enabling the success of other collaboration techniques, providing guidance to human-AI collaboration.

[AI-72] Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation

Link: https://arxiv.org/abs/2411.15224
Authors: Seokil Ham, Hee-Seon Kim, Sangmin Woo, Changick Kim
Keywords: remain largely unexplored, Mamba remain largely, Mamba architecture, Diagonal-centric Linear Transformation, replacement for Transformer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Despite the growing interest in Mamba architecture as a potential replacement for Transformer architecture, parameter-efficient fine-tuning (PEFT) approaches for Mamba remain largely unexplored. In our study, we introduce two key insights-driven strategies for PEFT in Mamba architecture: (1) While state-space models (SSMs) have been regarded as the cornerstone of Mamba architecture, then expected to play a primary role in transfer learning, our findings reveal that Projectors – not SSMs – are the predominant contributors to transfer learning, and (2) Based on our observation that adapting pretrained Projectors to new tasks can be effectively approximated through a near-diagonal linear transformation, we propose a novel PEFT method specialized to Mamba architecture: Projector-targeted Diagonal-centric Linear Transformation (ProDiaL). ProDiaL focuses on optimizing only diagonal-centric linear transformation matrices, without directly fine-tuning the pretrained Projector weights. This targeted approach allows efficient task adaptation, utilizing less than 1% of the total parameters, and exhibits strong performance across both vision and language Mamba models, highlighting its versatility and effectiveness.

[AI-73] Rethinking the Intermediate Features in Adversarial Attacks: Misleading Robotic Models via Adversarial Distillation

Link: https://arxiv.org/abs/2411.15222
Authors: Ke Zhao (1), Huayang Huang (1), Miao Li (1), Yu Wu (1) ((1) Wuhan University)
Keywords: Language-conditioned robotic learning, enhanced robot adaptability, significantly enhanced robot, execute diverse tasks, verbal commands
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Language-conditioned robotic learning has significantly enhanced robot adaptability by enabling a single model to execute diverse tasks in response to verbal commands. Despite these advancements, security vulnerabilities within this domain remain largely unexplored. This paper addresses this gap by proposing a novel adversarial prompt attack tailored to language-conditioned robotic models. Our approach involves crafting a universal adversarial prefix that induces the model to perform unintended actions when added to any original prompt. We demonstrate that existing adversarial techniques exhibit limited effectiveness when directly transferred to the robotic domain due to the inherent robustness of discretized robotic action spaces. To overcome this challenge, we propose to optimize adversarial prefixes based on continuous action representations, circumventing the discretization process. Additionally, we identify the beneficial impact of intermediate features on adversarial attacks and leverage the negative gradient of intermediate self-attention features to further enhance attack efficacy. Extensive experiments on VIMA models across 13 robot manipulation tasks validate the superiority of our method over existing approaches and demonstrate its transferability across different model variants.

[AI-74] Suspected Undeclared Use of Artificial Intelligence in the Academic Literature: An Analysis of the Academ-AI Dataset

Link: https://arxiv.org/abs/2411.15218
Authors: Alex Glynn
Keywords: generative artificial intelligence, artificial intelligence, writing process, generative artificial, OpenAI ChatGPT
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 24 pages, 8 figures

Abstract:Since generative artificial intelligence (AI) tools such as OpenAI’s ChatGPT became widely available, researchers have used them in the writing process. The consensus of the academic publishing community is that such usage must be declared in the published article. Academ-AI documents examples of suspected undeclared AI usage in the academic literature, discernible primarily due to the appearance in research papers of idiosyncratic verbiage characteristic of large language model (LLM)-based chatbots. This analysis of the first 500 examples collected reveals that the problem is widespread, penetrating the journals and conference proceedings of highly respected publishers. Undeclared AI seems to appear in journals with higher citation metrics and higher article processing charges (APCs), precisely those outlets that should theoretically have the resources and expertise to avoid such oversights. An extremely small minority of cases are corrected post publication, and the corrections are often insufficient to rectify the problem. The 500 examples analyzed here likely represent a small fraction of the undeclared AI present in the academic literature, much of which may be undetectable. Publishers must enforce their policies against undeclared AI usage in cases that are detectable; this is the best defense currently available to the academic publishing community against the proliferation of undisclosed AI.

[AI-75] Dist Loss: Enhancing Regression in Few-Shot Region through Distribution Distance Constraint

Link: https://arxiv.org/abs/2411.15216
Authors: Guangkun Nie, Gongzheng Tang, Shenda Hong
Keywords: posing significant challenges, Dist Loss, deep learning models, real-world scenarios, posing significant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Imbalanced data distributions are prevalent in real-world scenarios, posing significant challenges in both imbalanced classification and imbalanced regression tasks. They often cause deep learning models to overfit in areas of high sample density (many-shot regions) while underperforming in areas of low sample density (few-shot regions). This characteristic restricts the utility of deep learning models in various sectors, notably healthcare, where areas with few-shot data hold greater clinical relevance. While recent studies have shown the benefits of incorporating distribution information in imbalanced classification tasks, such strategies are rarely explored in imbalanced regression. In this paper, we address this issue by introducing a novel loss function, termed Dist Loss, designed to minimize the distribution distance between the model’s predictions and the target labels in a differentiable manner, effectively integrating distribution information into model training. Dist Loss enables deep learning models to regularize their output distribution during training, effectively enhancing their focus on few-shot regions. We have conducted extensive experiments across three datasets spanning computer vision and healthcare: IMDB-WIKI-DIR, AgeDB-DIR, and ECG-Ka-DIR. The results demonstrate that Dist Loss effectively mitigates the negative impact of imbalanced data distribution on model performance, achieving state-of-the-art results in sparse data regions. Furthermore, Dist Loss is easy to integrate, complementing existing methods.
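
One simple way to measure the distance between a prediction distribution and a label distribution in one dimension is to sort both samples and compare them rank by rank. The sketch below illustrates that general idea and is not the paper's Dist Loss: the function name is hypothetical, the squared-difference form is an assumption, and this plain-Python version is not differentiable as written.

```python
def sorted_distribution_distance(predictions, targets):
    """Mean squared difference between the sorted predictions and sorted targets.

    This ignores which prediction belongs to which sample and compares only
    the shapes of the two empirical distributions.
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = zip(sorted(predictions), sorted(targets))
    return sum((p - t) ** 2 for p, t in pairs) / len(predictions)
```

A model that predicts the right set of values but assigns them to the wrong samples scores 0 here, so a term like this would be combined with a per-sample loss rather than used alone.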

[AI-76] S2ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Link: https://arxiv.org/abs/2411.15215
Authors: Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu
Keywords: demonstrating promising therapeutic, promising therapeutic efficacy, demonstrating promising, numerous diseases, safeguard our health
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)

Abstract:Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown great potential to interpret complex biological structures and functions. However, existing antibody-specific models have a notable limitation: they lack explicit consideration of antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes the Sequence-Structure multi-level pre-trained Antibody Language Model (S²ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporating two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S²ALM’s representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S²ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S²ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody-specific understanding and generation tasks. S²ALM’s ability to model comprehensive and generalized representations further positions it to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.

[AI-77] Urban Region Embeddings from Service-Specific Mobile Traffic Data

Link: https://arxiv.org/abs/2411.15214
Authors: Giulio Loddi, Chiara Pugliese, Francesco Lettich, Fabio Pinelli, Chiara Renso
Keywords: high spatio-temporal resolution, phone data collected, advent of advanced, includes detailed, spatio-temporal resolution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

Abstract:With the advent of advanced 4G/5G mobile networks, mobile phone data collected by operators now includes detailed, service-specific traffic information with high spatio-temporal resolution. In this paper, we leverage this type of data to explore its potential for generating high-quality representations of urban regions. To achieve this, we present a methodology for creating urban region embeddings from service-specific mobile traffic data, employing a temporal convolutional network-based autoencoder, transformers, and learnable weighted sum models to capture key urban features. In the extensive experimental evaluation conducted using a real-world dataset, we demonstrate that the embeddings generated by our methodology effectively capture urban characteristics. Specifically, our embeddings are compared against those of a state-of-the-art competitor across two downstream tasks. Additionally, through clustering techniques, we investigate how well the embeddings produced by our methodology capture the temporal dynamics and characteristics of the underlying urban regions. Overall, this work highlights the potential of service-specific mobile traffic data for urban research and emphasizes the importance of making such data accessible to support public innovation.

[AI-78] Effective Analog ICs Floorplanning with Relational Graph Neural Networks and Reinforcement Learning DATE25

Link: https://arxiv.org/abs/2411.15212
Authors: Davide Basso, Luca Bortolussi, Mirjana Videnovic-Misic, Husni Habal
Keywords: devices and modules, placement of components, layout engineer, Analog, Analog integrated circuit
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 7 pages, 7 figures, Accepted at DATE25

Abstract:Analog integrated circuit (IC) floorplanning is typically a manual process with the placement of components (devices and modules) planned by a layout engineer. This process is further complicated by the interdependence of floorplanning and routing steps, numerous electric and layout-dependent constraints, as well as the high level of customization expected in analog design. This paper presents a novel automatic floorplanning algorithm based on reinforcement learning. It is augmented by a relational graph convolutional neural network model for encoding circuit features and positional constraints. The combination of these two machine learning methods enables knowledge transfer across different circuit designs with distinct topologies and constraints, increasing the generalization ability of the solution. Applied to 6 industrial circuits, our approach surpassed established floorplanning techniques in terms of speed, area and half-perimeter wire length. When integrated into a procedural generator for layout completion, overall layout time was reduced by 67.3% with an 8.3% mean area reduction compared to manual layout.
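
Half-perimeter wire length (HPWL), one of the metrics reported in this abstract, is a standard wirelength estimate: for each net, take the bounding box of its pins and add the box width to the box height. A minimal sketch follows. The function names and the representation of a net as a list of `(x, y)` pin coordinates are illustrative choices, not taken from the paper.

```python
def hpwl(net_pins):
    """Half-perimeter of the bounding box enclosing one net's pins."""
    if not net_pins:
        raise ValueError("a net needs at least one pin")
    xs = [x for x, _ in net_pins]
    ys = [y for _, y in net_pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets):
    """Sum of HPWL over all nets in a floorplan."""
    return sum(hpwl(pins) for pins in nets)
```

HPWL is cheap to evaluate, which makes it convenient as part of a reward signal when a placement is scored many times during training.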

[AI-79] M2oE: Multimodal Collaborative Expert Peptide Model

Link: https://arxiv.org/abs/2411.15208
Authors: Zengzhu Guo, Zhiqi Ma
Keywords: biomolecules comprised, comprised of amino, amino acids, acids that play, play an important
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: Accepted by BIBM 2024

Abstract:Peptides are biomolecules composed of amino acids that play an important role in the body. In recent years, peptides have received extensive attention in drug design and synthesis, and peptide prediction tasks help us better search for functional peptides. Typically, the primary sequence and structural information of peptides are used for model encoding. However, recent studies have focused on single-modal information (structure or sequence) for prediction rather than on multi-modal approaches. We found that single-modal models perform poorly on datasets that carry little information in their particular modality. Therefore, this paper proposes M2oE, a multi-modal collaborative expert peptide model. Building on previous work, it integrates sequence and spatial structural information and employs expert models and a cross-attention mechanism, which balances and improves the model’s capabilities. Experimental results indicate that the M2oE model performs excellently in complex task predictions.

[AI-80] Self-Supervised Conditional Distribution Learning on Graphs

Link: https://arxiv.org/abs/2411.15206
Authors: Jie Chen, Hua Mao, Yuanbiao Gou, Zhu Wang, Xi Peng
Keywords: shown promising performance, shown promising, promising performance, negative pairs, GCL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:Graph contrastive learning (GCL) has shown promising performance in semisupervised graph classification. However, existing studies still encounter significant challenges in GCL. First, successive layers in graph neural network (GNN) tend to produce more similar node embeddings, while GCL aims to increase the dissimilarity between negative pairs of node embeddings. This inevitably results in a conflict between the message-passing mechanism of GNNs and the contrastive learning of negative pairs via intraviews. Second, leveraging the diversity and quantity of data provided by graph-structured data augmentations while preserving intrinsic semantic information is challenging. In this paper, we propose a self-supervised conditional distribution learning (SSCDL) method designed to learn graph representations from graph-structured data for semisupervised graph classification. Specifically, we present an end-to-end graph representation learning model to align the conditional distributions of weakly and strongly augmented features over the original features. This alignment effectively reduces the risk of disrupting intrinsic semantic information through graph-structured data augmentation. To avoid conflict between the message-passing mechanism and contrastive learning of negative pairs, positive pairs of node representations are retained for measuring the similarity between the original features and the corresponding weakly augmented features. Extensive experiments with several benchmark graph datasets demonstrate the effectiveness of the proposed SSCDL method.

[AI-81] K-means Derived Unsupervised Feature Selection using Improved ADMM

Link: https://arxiv.org/abs/2411.15197
Authors: Ziheng Sun, Chris Ding, Jicong Fan
Keywords: unsupervised feature selection, Feature selection, K-means UFS, K-means Derived Unsupervised, unsupervised feature
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Feature selection is important for high-dimensional data analysis and is non-trivial in unsupervised learning problems such as dimensionality reduction and clustering. The goal of unsupervised feature selection is finding a subset of features such that the data points from different clusters are well separated. This paper presents a novel method called K-means Derived Unsupervised Feature Selection (K-means UFS). Unlike most existing spectral analysis based unsupervised feature selection methods, we select features using the objective of K-means. We develop an alternating direction method of multipliers (ADMM) to solve the NP-hard optimization problem of our K-means UFS model. Extensive experiments on real datasets show that our K-means UFS is more effective than the baselines in selecting features for clustering.

[AI-82] Tailoring the Hyperparameters of a Wide-Kernel Convolutional Neural Network to Fit Different Bearing Fault Vibration Datasets

Link: https://arxiv.org/abs/2411.15191
Authors: Dan Hudson, Jurgen van den Hoogen, Martin Atzmueller
Keywords: damaged machine bearings, algorithms are reported, perfect at distinguishing, distinguishing the vibrations, vibrations arising
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 71 pages, 14 figures, 7 tables

Abstract:State-of-the-art algorithms are reported to be almost perfect at distinguishing the vibrations arising from healthy and damaged machine bearings, according to benchmark datasets at least. However, what about their application to new data? In this paper, we are able to confirm that neural networks for bearing fault detection can be crippled by incorrect hyperparameterisation, and also that the correct hyperparameter settings can actually change when transitioning to new data. The paper weaves together multiple methods to explain the behaviour of the hyperparameters of a wide-kernel convolutional neural network and how to set them. Since guidance already exists for generic hyperparameters like minibatch size, we focus on how to set architecture-specific hyperparameters such as the width of the convolutional kernels, a topic which might otherwise be obscure. We reflect different data properties by fusing information from seven different benchmark datasets, and our results show that the kernel size in the first layer in particular is sensitive to changes in the data. Looking deeper, we use manipulated copies of one dataset in an attempt to spot why the kernel size sometimes needs to change. The relevance of sampling rate is studied by using different levels of resampling, and spectral content is studied by increasingly filtering out high frequencies. At the end of our paper we conclude by stating clear guidance on how to set the hyperparameters of our neural network architecture.

[AI-83] Order Is All You Need for Categorical Data Clustering

Link: https://arxiv.org/abs/2411.15189
Authors: Yiqun Zhang, Mingjie Zhao, Hong Jia, Yiu-ming Cheung
Keywords: data mining tasks, Categorical data, Categorical data composed, nominal valued attributes, mining tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Categorical data composed of nominal valued attributes are ubiquitous in knowledge discovery and data mining tasks. Due to the lack of well-defined metric space, categorical data distributions are difficult to intuitively understand. Clustering is a popular technique suitable for data analysis. However, the success of clustering often relies on reasonable distance metrics, which happens to be what categorical data naturally lack. Therefore, the cluster analysis of categorical data is considered a critical but challenging problem. This paper introduces the new finding that the order relation among attribute values is the decisive factor in clustering accuracy, and is also the key to understanding the categorical data clusters. To automatically obtain the orders, we propose a new learning paradigm that allows joint learning of clusters and the orders. It turns out that clustering with order learning achieves superior clustering accuracy, and the learned orders provide intuition for understanding the cluster distribution of categorical data. Extensive experiments with statistical evidence and case studies have verified the effectiveness of the new “order is all you need” insight and the proposed method.

[AI-84] Hybrid Gaussian Process Regression with Temporal Feature Extraction for Partially Interpretable Remaining Useful Life Interval Prediction in Aeroengine Prognostics

Link: https://arxiv.org/abs/2411.15185
Authors: Tian Niu, Zijun Xu, Heng Luo, Ziqing Zhou
Keywords: Remaining Useful Life, estimation of Remaining, plays a pivotal, pivotal role, role in intelligent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:The estimation of Remaining Useful Life (RUL) plays a pivotal role in intelligent manufacturing systems and Industry 4.0 technologies. While recent advancements have improved RUL prediction, many models still face interpretability and compelling uncertainty modeling challenges. This paper introduces a modified Gaussian Process Regression (GPR) model for RUL interval prediction, tailored for the complexities of manufacturing process development. The modified GPR predicts confidence intervals by learning from historical data and addresses uncertainty modeling in a more structured way. The approach effectively captures intricate time-series patterns and dynamic behaviors inherent in modern manufacturing systems by coupling GPR with deep adaptive learning-enhanced AI process models. Moreover, the model evaluates feature significance to ensure more transparent decision-making, which is crucial for optimizing manufacturing processes. This comprehensive approach supports more accurate RUL predictions and provides transparent, interpretable insights into uncertainty, contributing to robust process development and management.

[AI-85] Forecasting Application Counts in Talent Acquisition Platforms: Harnessing Multimodal Signals using LMs

Link: https://arxiv.org/abs/2411.15182
Authors: Md Ahsanul Kabir, Kareem Abdelfatah, Shushan He, Mohammed Korayem, Mohammad Al Hasan
Keywords: machine learning, talent acquisition, day activities, optimizing their day, day
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As recruitment and talent acquisition have become more and more competitive, recruitment firms have become more sophisticated in using machine learning (ML) methodologies for optimizing their day-to-day activities. However, most published ML-based methodologies in this area have been limited to tasks such as candidate matching, job-to-skill matching, job classification, and normalization. In this work, we discuss a novel task in the recruitment domain, namely application count forecasting, which is motivated by the design of effective outreach activities to attract qualified applicants. We show that existing autoregressive time series forecasting methods perform poorly on this task. Hence, we propose a multimodal LM-based model which fuses job-posting metadata of various modalities through a simple encoder. Experiments on large real-life datasets from CareerBuilder LLC show the effectiveness of the proposed method over existing state-of-the-art methods.

[AI-86] Multi-layer matrix factorization for cancer subtyping using full and partial multi-omics dataset

Link: https://arxiv.org/abs/2411.15180
Authors: Yingxuan Ren, Fengtao Ren, Bo Yang
Keywords: molecular markers specific, distinct subtypes based, cellular origins, inherent heterogeneity, commonly categorized
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Cancer, with its inherent heterogeneity, is commonly categorized into distinct subtypes based on unique traits, cellular origins, and molecular markers specific to each type. However, current studies primarily rely on complete multi-omics datasets for predicting cancer subtypes, often overlooking predictive performance in cases where some omics data may be missing and neglecting implicit relationships across multiple layers of omics data integration. This paper introduces Multi-Layer Matrix Factorization (MLMF), a novel approach for cancer subtyping that employs multi-omics data clustering. MLMF initially processes multi-omics feature matrices by performing multi-layer linear or nonlinear factorization, decomposing the original data into latent feature representations unique to each omics type. These latent representations are subsequently fused into a consensus form, on which spectral clustering is performed to determine subtypes. Additionally, MLMF incorporates a class indicator matrix to handle missing omics data, creating a unified framework that can manage both complete and incomplete multi-omics data. Extensive experiments conducted on 10 multi-omics cancer datasets, both complete and with missing values, demonstrate that MLMF achieves results that are comparable to or surpass the performance of several state-of-the-art approaches.

[AI-87] Harnessing Scale and Physics: A Multi-Graph Neural Operator Framework for PDEs on Arbitrary Geometries

Link: https://arxiv.org/abs/2411.15178
Authors: Zhihao Li, Haoze Song, Di Xiao, Zhilu Lai, Wei Wang
Keywords: Partial Differential Equations, Partial Differential, Differential Equations, traditional computational approaches
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Partial Differential Equations (PDEs) underpin many scientific phenomena, yet traditional computational approaches often struggle with complex, nonlinear systems and irregular geometries. This paper introduces the AMG method, a Multi-Graph neural operator approach designed for efficiently solving PDEs on Arbitrary geometries. AMG leverages advanced graph-based techniques and dynamic attention mechanisms within a novel GraphFormer architecture, enabling precise management of diverse spatial domains and complex data interdependencies. By constructing multi-scale graphs to handle variable feature frequencies and a physics graph to encapsulate inherent physical properties, AMG significantly outperforms previous methods, which are typically limited to uniform grids. We present a comprehensive evaluation of AMG across six benchmarks, demonstrating its consistent superiority over existing state-of-the-art models. Our findings highlight the transformative potential of tailored graph neural operators in surmounting the challenges faced by conventional PDE solvers. Our code and datasets are available at this https URL.

[AI-88] Decentralizing Test-time Adaptation under Heterogeneous Data Streams

Link: https://arxiv.org/abs/2411.15173
Authors: Zixian Su, Jingwei Guo, Xi Yang, Qiufeng Wang, Kaizhu Huang
Keywords: uniform target estimation, shown promise, promise in addressing, training and testing, effectiveness diminishes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:While Test-Time Adaptation (TTA) has shown promise in addressing distribution shifts between training and testing data, its effectiveness diminishes with heterogeneous data streams due to uniform target estimation. As previous attempts merely stabilize model fine-tuning over time to handle continually changing environments, they fundamentally assume a homogeneous target domain at any moment, leaving the intrinsic real-world data heterogeneity unresolved. This paper delves into TTA under heterogeneous data streams, moving beyond current model-centric limitations. By revisiting TTA from a data-centric perspective, we discover that decomposing samples into Fourier space facilitates an accurate data separation across different frequency levels. Drawing from this insight, we propose a novel Frequency-based Decentralized Adaptation (FreDA) framework, which transitions data from globally heterogeneous to locally homogeneous in Fourier space and employs decentralized adaptation to manage diverse distribution shifts. Furthermore, we devise a novel Fourier-based augmentation strategy to assist in decentralizing adaptation, which individually enhances sample quality for capturing each type of distribution shift. Extensive experiments across various settings (corrupted, natural, and medical environments) demonstrate the superiority of our proposed framework over state-of-the-art methods.

[AI-89] Memory-Driven Metaheuristics: Improving Optimization Performance

Link: https://arxiv.org/abs/2411.15151
Authors: Salar Farahmand-Tabar
Keywords: mimic natural processes, find optimal solutions, stochastic optimization algorithms, Memory mechanisms, metaheuristic algorithms
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures, book chapter, Springer

Abstract:Metaheuristics are stochastic optimization algorithms that mimic natural processes to find optimal solutions to complex problems. The success of metaheuristics largely depends on the ability to effectively explore and exploit the search space. Memory mechanisms have been introduced in several popular metaheuristic algorithms to enhance their performance. This chapter explores the significance of memory in metaheuristic algorithms and provides insights from well-known algorithms. The chapter begins by introducing the concept of memory and its role in metaheuristic algorithms. The key factors influencing the effectiveness of memory mechanisms are discussed, such as the size of the memory, the information stored in memory, and the rate of information decay. A comprehensive analysis of how memory mechanisms are incorporated into popular metaheuristic algorithms is presented, and the chapter concludes by highlighting the importance of memory in metaheuristic performance and providing future research directions for improving memory mechanisms. The key takeaways are that memory mechanisms can significantly enhance the performance of metaheuristics by enabling them to explore and exploit the search space effectively and efficiently, and that the choice of memory mechanism should be tailored to the problem domain and the characteristics of the search space.
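
To make the role of memory size concrete, here is a minimal sketch of one well-known memory-based metaheuristic, tabu search, with a bounded short-term memory. It is not taken from the chapter: the function name `tabu_search`, the integer search space, and the toy objective are all assumptions chosen for illustration.

```python
from collections import deque


def tabu_search(objective, start, neighbors, memory_size=5, iterations=50):
    """Greedy local search that refuses to revisit recently seen solutions.

    `memory_size` bounds the tabu list, so old entries are forgotten
    (a crude stand-in for information decay).
    """
    current = best = start
    tabu = deque([start], maxlen=memory_size)  # short-term memory
    for _ in range(iterations):
        # Only consider neighbours that are not in the memory.
        candidates = [n for n in neighbors(current) if n not in tabu]
        if not candidates:
            break
        # Move to the best admissible neighbour, even if it is worse.
        current = min(candidates, key=objective)
        tabu.append(current)
        if objective(current) < objective(best):
            best = current
    return best


# Hypothetical toy problem: minimise (x - 7)^2 over the integers.
result = tabu_search(lambda x: (x - 7) ** 2, 0, lambda x: [x - 1, x + 1])
```

With a larger `memory_size` the search is pushed further from recently visited points before it may return, which is one simple way the size of the memory trades exploration against exploitation.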

[AI-90] The Fundamental Rights Impact Assessment (FRIA) in the AI Act: Roots, legal obligations and key elements for a model template

Link: https://arxiv.org/abs/2411.15149
Authors: Alessandro Mantelero
Keywords: context which gave, gave rise, obligation to carry, Impact Assessment, Act
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:What is the context which gave rise to the obligation to carry out a Fundamental Rights Impact Assessment (FRIA) in the AI Act? How has assessment of the impact on fundamental rights been framed by the EU legislator in the AI Act? What methodological criteria should be followed in developing the FRIA? These are the three main research questions that this article aims to address, through both legal analysis of the relevant provisions of the AI Act and discussion of various possible models for assessment of the impact of AI on fundamental rights. The overall objective of this article is to fill existing gaps in the theoretical and methodological elaboration of the FRIA, as outlined in the AI Act. In order to facilitate the future work of EU and national bodies and AI operators in placing this key tool for human-centric and trustworthy AI at the heart of the EU approach to AI design and development, this article outlines the main building blocks of a model template for the FRIA. While this proposal is consistent with the rationale and scope of the AI Act, it is also applicable beyond the cases listed in Article 27 and can serve as a blueprint for other national and international regulatory initiatives to ensure that AI is fully consistent with human rights.

[AI-91] Delegating Responsibilities to Intelligent Autonomous Systems: Challenges and Benefits

Link: https://arxiv.org/abs/2411.15147
Authors: Gordana Dodig-Crnkovic, Gianfranco Basti, Tobias Holstein
Keywords: systems increasingly operate, autonomy and adaptability, increasingly operate, operate with autonomy, traditional boundaries
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:As AI systems increasingly operate with autonomy and adaptability, the traditional boundaries of moral responsibility in techno-social systems are being challenged. This paper explores the evolving discourse on the delegation of responsibilities to intelligent autonomous agents and the ethical implications of such practices. Synthesizing recent developments in AI ethics, including concepts of distributed responsibility and ethical AI by design, the paper proposes a functionalist perspective as a framework. This perspective views moral responsibility not as an individual trait but as a role within a socio-technical system, distributed among human and artificial agents. As an example of ‘AI ethical by design,’ we present Basti and Vitiello’s implementation. They suggest that AI can act as artificial moral agents by learning ethical guidelines and using Deontic Higher-Order Logic to assess decisions ethically. Motivated by the possible speed and scale beyond human supervision and ethical implications, the paper argues for ‘AI ethical by design’, while acknowledging the distributed, shared, and dynamic nature of responsibility. This functionalist approach offers a practical framework for navigating the complexities of AI ethics in a rapidly evolving technological landscape.

[AI-92] dafny-annotator: AI-Assisted Verification of Dafny Programs

Link: https://arxiv.org/abs/2411.15143
Authors: Gabriel Poesia, Chloe Loughridge, Nada Amin
Keywords: high additional cost, reduce software bugs, Formal verification, drastically reduce software, software bugs
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Abstract:Formal verification has the potential to drastically reduce software bugs, but its high additional cost has hindered large-scale adoption. While Dafny presents a promise to significantly reduce the effort to write verified programs, users are often required to provide logical annotations to aid the verifier. Here, we explore using a combination of Large Language Models and search to build dafny-annotator: a tool that adds logical annotations to a Dafny method until the verifier can prove it correct. On a test set from the DafnyBench collection of programs, greedy search guided by LLaMa 3.1 8B successfully annotates only 15.7% of the methods. Since this data-driven approach is hindered by the lack of large-scale training data, we propose a method for open-ended synthesis of new Dafny programs in a flexible pipeline where LLMs formulate high-level ideas, implement them, and incrementally propose changes to existing programs, which Dafny validates. This gives us a synthetic dataset, DafnySynth, which we use to augment DafnyBench for training. Fine-tuning on both datasets boosts LLaMa 8B’s success rate to 50.6% – significantly better than the base model, or training on either dataset alone. Our results suggest a path towards capable AI assistants for languages that don’t yet have large-scale human-generated examples. In turn, such assistants might reduce friction for users and ultimately drive adoption.

[AI-93] CatNet: Effective FDR Control in LSTM with Gaussian Mirrors and SHAP Feature Importance

Link: https://arxiv.org/abs/2411.16666
Authors: Jiaan Han, Junxiao Chen, Yanzhe Fu
Keywords: False Discovery Rate, controls False Discovery, Discovery Rate, False Discovery, effectively controls False
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments:

Abstract:We introduce CatNet, an algorithm that effectively controls False Discovery Rate (FDR) and selects significant features in LSTM with the Gaussian Mirror (GM) method. To evaluate the feature importance of LSTM in time series, we introduce a vector of the derivative of the SHapley Additive exPlanations (SHAP) to measure feature importance. We also propose a new kernel-based dependence measure to avoid multicollinearity in the GM algorithm, to make a robust feature selection with controlled FDR. We use simulated data to evaluate CatNet’s performance in both linear models and LSTM models with different link functions. The algorithm effectively controls the FDR while maintaining a high statistical power in all cases. We also evaluate the algorithm’s performance in different low-dimensional and high-dimensional cases, demonstrating its robustness in various input dimensions. To evaluate CatNet’s performance in real world applications, we construct a multi-factor investment portfolio to forecast the prices of S&P 500 index components. The results demonstrate that our model achieves superior predictive accuracy compared to traditional LSTM models without feature selection and FDR control. Additionally, CatNet effectively captures common market-driving features, which helps informed decision-making in financial markets by enhancing the interpretability of predictions. Our study integrates the Gaussian Mirror algorithm with LSTM models for the first time, and introduces SHAP values as a new feature importance metric for FDR control methods, marking a significant advancement in feature selection and error control for neural networks.
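
For readers unfamiliar with mirror-statistic FDR control, the sketch below shows only the final thresholding step as it is commonly stated for Gaussian Mirror style methods: pick the smallest cutoff `t` for which the estimated false discovery proportion, `#{M_j <= -t} / max(#{M_j >= t}, 1)`, is at most the target level `q`. It is not CatNet itself. The function names and the example mirror statistics are hypothetical, and the SHAP-derivative importance and kernel-based dependence measure described in the abstract are not reproduced.

```python
def mirror_threshold(mirror_stats, q):
    """Smallest t > 0 whose estimated false discovery proportion is <= q."""
    # Candidate cutoffs are the distinct nonzero magnitudes of the statistics.
    candidates = sorted({abs(m) for m in mirror_stats if m != 0})
    for t in candidates:
        # Large negative statistics estimate the number of false positives,
        # because null features are assumed symmetric around zero.
        false_est = sum(1 for m in mirror_stats if m <= -t)
        selected = sum(1 for m in mirror_stats if m >= t)
        if false_est / max(selected, 1) <= q:
            return t
    return float("inf")  # no cutoff meets the target; select nothing


def select_features(mirror_stats, q):
    """Indices of features whose mirror statistic clears the threshold."""
    t = mirror_threshold(mirror_stats, q)
    return [j for j, m in enumerate(mirror_stats) if m >= t]


# Made-up mirror statistics for eight features (illustration only).
stats = [5.0, 4.2, -0.3, 0.2, 3.8, -0.5, 0.1, 6.1]
chosen = select_features(stats, q=0.3)
```

In this toy input the four features with large positive statistics are kept, while the small statistics scattered around zero are treated as noise.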

[AI-94] Naive Algorithmic Collusion: When Do Bandit Learners Cooperate and When Do They Compete?

Link: https://arxiv.org/abs/2411.16574
Authors: Connor Douglas, Foster Provost, Arun Sundararajan
Keywords: making pricing decisions, residential home rentals, competitive decision settings, competitive decision, pricing decisions
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: To be published in proceedings of International Conference on Information Systems 2024

Abstract:Algorithmic agents are used in a variety of competitive decision settings, notably in making pricing decisions in contexts that range from online retail to residential home rentals. Business managers, algorithm designers, legal scholars, and regulators alike are all starting to consider the ramifications of “algorithmic collusion.” We study the emergent behavior of multi-armed bandit machine learning algorithms used in situations where agents are competing, but they have no information about the strategic interaction they are engaged in. Using a general-form repeated Prisoner’s Dilemma game, agents engage in online learning with no prior model of game structure and no knowledge of competitors’ states or actions (e.g., no observation of competing prices). We show that these context-free bandits, with no knowledge of opponents’ choices or outcomes, still will consistently learn collusive behavior - what we call “naive collusion.” We primarily study this system through an analytical model and examine perturbations to the model through simulations. Our findings have several notable implications for regulators. First, calls to limit algorithms from conditioning on competitors’ prices are insufficient to prevent algorithmic collusion. This is a direct result of collusion arising even in the naive setting. Second, symmetry in algorithms can increase collusion potential. This highlights a new, simple mechanism for “hub-and-spoke” algorithmic collusion. A central distributor need not imbue its algorithm with supra-competitive tendencies for apparent collusion to arise; it can simply arise by using certain (common) machine learning algorithms. Finally, we highlight that collusive outcomes depend starkly on the specific algorithm being used, and we highlight market and algorithmic conditions under which it will be unknown a priori whether collusion occurs. 
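
The setting described in the abstract can be sketched as two context-free bandits that each see only their own reward in a repeated Prisoner’s Dilemma. The code below is a minimal illustration under assumptions, not the authors’ model: the epsilon-greedy learner, the payoff values (5, 3, 1, 0), and all names are hypothetical, and no particular outcome (collusive or competitive) is claimed for it.

```python
import random

# Payoff to "me" for (my_action, their_action); 0 = cooperate, 1 = defect.
# Values are an assumed standard Prisoner's Dilemma, not taken from the paper.
PAYOFF = {(0, 0): 3.0, (0, 1): 0.0, (1, 0): 5.0, (1, 1): 1.0}


class EpsilonGreedyBandit:
    """Two-armed bandit that never observes the opponent's action."""

    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng
        self.counts = [0, 0]
        self.values = [0.0, 0.0]  # running mean reward per arm

    def choose(self):
        # Explore with probability epsilon, and break ties at random.
        if self.rng.random() < self.epsilon or self.values[0] == self.values[1]:
            return self.rng.randrange(2)
        return 0 if self.values[0] > self.values[1] else 1

    def update(self, action, reward):
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def simulate(rounds=5000, epsilon=0.1, seed=0):
    """Return the fraction of rounds in which both agents cooperated."""
    rng = random.Random(seed)
    a = EpsilonGreedyBandit(epsilon, rng)
    b = EpsilonGreedyBandit(epsilon, rng)
    mutual = 0
    for _ in range(rounds):
        x, y = a.choose(), b.choose()
        a.update(x, PAYOFF[(x, y)])  # each agent sees only its own payoff
        b.update(y, PAYOFF[(y, x)])
        mutual += int(x == 0 and y == 0)
    return mutual / rounds


rate = simulate(rounds=500, seed=1)
```

Swapping `EpsilonGreedyBandit` for another learner is the natural experiment here, since the abstract stresses that outcomes depend strongly on the specific algorithm used.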

[AI-95] Graph Neural Networks-based Parameter Design towards Large-Scale Superconducting Quantum Circuits for Crosstalk Mitigation

Link: https://arxiv.org/abs/2411.16354
Authors: Hao Ai, Yu-xi Liu
Keywords: quantum computing chips, superconducting quantum circuits, superconducting quantum computing, quantum computing, superconducting quantum
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:To demonstrate supremacy of quantum computing, increasingly large-scale superconducting quantum computing chips are being designed and fabricated, sparking the demand for electronic design automation in pursuit of better efficiency and effectiveness. However, the complexity of simulating quantum systems poses a significant challenge to computer-aided design of quantum chips. Harnessing the scalability of graph neural networks (GNNs), we here propose a parameter designing algorithm for large-scale superconducting quantum circuits. The algorithm depends on the so-called ‘three-stair scaling’ mechanism, which comprises two neural-network models: an evaluator supervisedly trained on small-scale circuits for applying to medium-scale circuits, and a designer unsupervisedly trained on medium-scale circuits for applying to large-scale ones. We demonstrate our algorithm in mitigating quantum crosstalk errors, which are commonly present and closely related to the graph structures and parameter assignments of superconducting quantum circuits. Parameters for both single- and two-qubit gates are considered simultaneously. Numerical results indicate that the well-trained designer achieves notable advantages not only in efficiency but also in effectiveness, especially for large-scale circuits. For example, in superconducting quantum circuits consisting of around 870 qubits, the trained designer requires only 27 seconds to complete the frequency designing task which necessitates 90 minutes for the traditional Snake algorithm. More importantly, the crosstalk errors using our algorithm are only 51% of those produced by the Snake algorithm. Overall, this study initially demonstrates the advantages of applying graph neural networks to design parameters in quantum processors, and provides insights for systems where large-scale numerical simulations are challenging in electronic design automation.

[AI-96] Deciphering genomic codes using advanced NLP techniques: a scoping review

Link: https://arxiv.org/abs/2411.16084
Authors: Shuyan Cheng, Yishu Wei, Yiliang Zhou, Zihan Xu, Drew N Wright, Jinze Liu, Yifan Peng
Keywords: genomic sequencing data, Natural Language Processing, Large Language Models, data presents challenges, human genomic sequencing
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments:

Abstract:Objectives: The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of Natural Language Processing (NLP) techniques, particularly Large Language Models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data. Methods: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type. Results: A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility. Discussion: The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability. 
Submitted: Mon, 25 Nov 2024 04:35:56 UTC (v1)

[AI-97] The brain versus AI: World-model-based versatile circuit computation underlying diverse functions in the neocortex and cerebellum

Link: https://arxiv.org/abs/2411.16075
Authors: Shogo Ohmae, Keiko Ohmae
Keywords: general-purpose circuit computations, circuit computations offer, significant recent advances, significant recent, offer a potential
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI’s significant recent advances using general-purpose circuit computations offer a potential window into how the neocortex and cerebellum of the brain are able to achieve a diverse range of functions across sensory, cognitive, and motor domains, despite their uniform circuit structures. However, comparing the brain and AI is challenging unless clear similarities exist, and past reviews have been limited to comparison of brain-inspired vision AI and the visual neocortex. Here, to enable comparisons across diverse functional domains, we subdivide circuit computation into three elements – circuit structure, input/outputs, and the learning algorithm – and evaluate the similarities for each element. With this novel approach, we identify wide-ranging similarities and convergent evolution in the brain and AI, providing new insights into key concepts in neuroscience. Furthermore, inspired by processing mechanisms of AI, we propose a new theory that integrates established neuroscience theories, particularly the theories of internal models and the mirror neuron system. Both the neocortex and cerebellum predict future world events from past information and learn from prediction errors, thereby acquiring models of the world. These models enable three core processes: (1) Prediction – generating future information, (2) Understanding – interpreting the external world via compressed and abstracted sensory information, and (3) Generation – repurposing the future-information generation mechanism to produce other types of outputs. The universal application of these processes underlies the ability of the neocortex and cerebellum to accomplish diverse functions with uniform circuits. Our systematic approach, insights, and theory promise groundbreaking advances in understanding the brain.

[AI-98] State-Space Large Audio Language Models

Link: https://arxiv.org/abs/2411.15685
Authors: Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
Keywords: Large Language Models, Large Audio Language, Large Language, Audio Language Models, Language Models
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)

Abstract: Large Audio Language Models (LALMs) combine audio perception models with Large Language Models (LLMs) and show a remarkable ability to reason about the input audio, infer the meaning, and understand the intent. However, these systems rely on Transformers, which scale quadratically with the input sequence length, posing computational challenges when deploying them in memory- and time-constrained scenarios. Recently, state-space models (SSMs) have emerged as an alternative to transformer networks. While there have been successful attempts to replace transformer-based audio perception models with state-space ones, state-space-based LALMs remain unexplored. We first replace the transformer-based audio perception module, then replace the transformer-based LLM, and propose the first state-space-based LALM. Experimental results demonstrate that the state-space-based LALM, despite having significantly fewer parameters, performs competitively with transformer-based LALMs on close-ended tasks on a variety of datasets.

[AI-99] Deep Learning for THz Channel Estimation and Beamforming Prediction via Sub-6GHz Channel

Link: https://arxiv.org/abs/2411.15589
Authors: Sagnik Bhattacharya, Abhishek K. Gupta
Keywords: THz communication systems, full potential, THz channel, communication systems achieve, vital importance
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Published: 2022 IEEE International Conference on Signal Processing and Communications (SPCOM 2022)

Abstract:An efficient channel estimation is of vital importance to help THz communication systems achieve their full potential. Conventional uplink channel estimation methods, such as least square estimation, are practically inefficient for THz systems because of their large computation overhead. In this paper, we propose an efficient convolutional neural network (CNN) based THz channel estimator that estimates the THz channel factors using uplink sub-6GHz channel. Further, we use the estimated THz channel factors to predict the optimal beamformer from a pre-given codebook, using a dense neural network. We not only get rid of the overhead associated with the conventional methods, but also achieve near-optimal spectral efficiency rates using the proposed beamformer predictor. The proposed method also outperforms deep learning based beamformer predictors accepting THz channel matrices as input, thus proving the validity and efficiency of our sub-6GHz based approach.

[AI-100] An unconditional distribution learning advantage with shallow quantum circuits

Link: https://arxiv.org/abs/2411.15548
Authors: N. Pirnay, S. Jerbi, J.-P. Seifert, J. Eisert
Keywords: near-term quantum circuits, practical applications, quantum, core challenges, challenges of research
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 7 + 5 pages, 2 figures

Abstract:One of the core challenges of research in quantum computing is concerned with the question whether quantum advantages can be found for near-term quantum circuits that have implications for practical applications. Motivated by this mindset, in this work, we prove an unconditional quantum advantage in the probably approximately correct (PAC) distribution learning framework with shallow quantum circuit hypotheses. We identify a meaningful generative distribution learning problem where constant-depth quantum circuits using one and two qubit gates (QNC^0) are superior compared to constant-depth bounded fan-in classical circuits (NC^0) as a choice for hypothesis classes. We hence prove a PAC distribution learning separation for shallow quantum circuits over shallow classical circuits. We do so by building on recent results by Bene Watts and Parham on unconditional quantum advantages for sampling tasks with shallow circuits, which we technically uplift to a hyperplane learning problem, identifying non-local correlations as the origin of the quantum advantage.

[AI-101] Adaptive Intelligence: leveraging insights from adaptive behavior in animals to build flexible AI systems

Link: https://arxiv.org/abs/2411.15234
Authors: Mackenzie Weygandt Mathis
Keywords: animals continually adjust, environmental feedback, continually adjust, adjust their actions, actions based
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:Biological intelligence is inherently adaptive – animals continually adjust their actions based on environmental feedback. However, creating adaptive artificial intelligence (AI) remains a major challenge. The next frontier is to go beyond traditional AI to develop “adaptive intelligence,” defined here as harnessing insights from biological intelligence to build agents that can learn online, generalize, and rapidly adapt to changes in their environment. Recent advances in neuroscience offer inspiration through studies that increasingly focus on how animals naturally learn and adapt their world models. In this Perspective, I will review the behavioral and neural foundations of adaptive biological intelligence, the parallel progress in AI, and explore brain-inspired approaches for building more adaptive algorithms.

[AI-102] Balancing property optimization and constraint satisfaction for constrained multi-property molecular optimization

Link: https://arxiv.org/abs/2411.15183
Authors: Xin Xia, Yajie Zhang, Xiangxiang Zeng, Xingyi Zhang, Chunhou Zheng, Yansen Su
Keywords: discover improved molecules, Molecular optimization, aims to discover, discover improved, critical step
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)

Abstract: Molecular optimization, which aims to discover improved molecules from a vast chemical search space, is a critical step in chemical development. Various artificial intelligence technologies have demonstrated high effectiveness and efficiency on molecular optimization tasks. However, few of these technologies focus on balancing property optimization with constraint satisfaction, making it difficult to obtain high-quality molecules that not only possess desirable properties but also meet various constraints. To address this issue, we propose a constrained multi-property molecular optimization framework (CMOMO), which is a flexible and efficient method to simultaneously optimize multiple molecular properties while satisfying several drug-like constraints. CMOMO improves multiple properties of molecules with constraints based on dynamic cooperative optimization, which dynamically handles the constraints across various scenarios. Besides, CMOMO evaluates multiple properties within discrete chemical spaces cooperatively with the evolution of molecules within an implicit molecular space to guide the evolutionary search. Experimental results show the superior performance of the proposed CMOMO over five state-of-the-art molecular optimization methods on two benchmark tasks of simultaneously optimizing multiple non-biological activity properties while satisfying two structural constraints. Furthermore, the practical applicability of CMOMO is verified on two practical tasks, where it identified a collection of candidate ligands of β2-adrenoceptor GPCR and candidate inhibitors of glycogen synthase kinase-3β with high properties and under drug-like constraints.

[AI-103] Adaptive Sensor Placement Inspired by Bee Foraging: Towards Efficient Environment Monitoring

Link: https://arxiv.org/abs/2411.15159
Authors: Sai Krishna Reddy Sathi
Keywords: precision agriculture efficiently, Artificial Bee Colony, sustainable robotics, agriculture efficiently, combines Artificial Bee
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)

Abstract: This paper aims to make a mark in the future of sustainable robotics, where efficient algorithms are required to carry out tasks like environmental monitoring and precision agriculture. We propose a hybrid algorithm that combines Artificial Bee Colony (ABC) with Levy flight to optimize adaptive sensor placement, alongside an important notion of hotspots drawn from domain experts' knowledge. By enhancing exploration and exploitation, our approach significantly improves the identification of critical hotspots. The algorithm also has use cases in broader search-and-rescue applications, demonstrating its potential in optimization problems across various domains.
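
The abstract does not say how the Levy-flight steps are generated. As a rough illustration only, the sketch below draws heavy-tailed steps with Mantegna's algorithm, a common choice in swarm optimizers, and uses them to perturb a candidate sensor position. The function names, the `scale` parameter, and the additive update are assumptions for this example, not the paper's method.

```python
import math
import random


def mantegna_sigma(beta: float) -> float:
    """Standard deviation of the numerator Gaussian in Mantegna's algorithm."""
    numerator = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    denominator = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (numerator / denominator) ** (1 / beta)


def levy_step(beta: float, rng: random.Random) -> float:
    """Draw one heavy-tailed step: u / |v|^(1/beta) with Gaussian u and v."""
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # avoid division by zero in the (very unlikely) exact-zero case
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def perturb_position(position, beta=1.5, scale=0.01, rng=None):
    """Hypothetical update: add an independent Levy step to each coordinate."""
    rng = rng or random.Random()
    return [x + scale * levy_step(beta, rng) for x in position]


if __name__ == "__main__":
    rng = random.Random(0)
    print(perturb_position([0.5, 0.5], rng=rng))
```

Most steps are small, which supports local exploitation, while the occasional very long step lets a candidate escape a local optimum.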

Machine Learning

[LG-0] Exploring Discrete Flow Matching for 3D De Novo Molecule Generation NEURIPS2024

Link: https://arxiv.org/abs/2411.16644
Authors: Ian Dunn, David R. Koes
Keywords: Deep generative models, Deep generative, facilitate chemical discovery, Flow matching, potential to facilitate
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Presented at the NeurIPS 2024 Machine Learning for Structural Biology Workshop

Abstract: Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Flow matching is a recently proposed generative modeling framework that has achieved impressive performance on a variety of tasks including those on biomolecular structures. The seminal flow matching framework was developed only for continuous data. However, de novo molecular design tasks require generating discrete data such as atomic elements or sequences of amino acid residues. Several discrete flow matching methods have been proposed recently to address this gap. In this work we benchmark the performance of existing discrete flow matching methods for 3D de novo small molecule generation and provide explanations of their differing behavior. As a result we present FlowMol-CTMC, an open-source model that achieves state of the art performance for 3D de novo design with fewer learnable parameters than existing methods. Additionally, we propose the use of metrics that capture molecule quality beyond local chemical valency constraints and towards higher-order structural motifs. These metrics show that even though basic constraints are satisfied, the models tend to produce unusual and potentially problematic functional groups outside of the training data distribution. Code and trained models for reproducing this work are available at this https URL.

[LG-1] Graph Pooling with Local Cluster Selection

Link: https://arxiv.org/abs/2411.16615
Authors: Yizhu Chen
Keywords: produce coarsened graphs, family of operations, inputs and produce, produce coarsened, Graph poolings
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures

Abstract:Graph poolings in GNNs are a family of operations which take graphs as inputs and produce coarsened graphs as output. Modern graph poolings are trainable and closely related to GNNs, which learn to pool graphs under different assumptions. Though there are various assumptions, the procedure of generating pooled graphs is relatively similar and limited. This work formalizes a novel procedure of pooling graphs, along with a graph pooling approach for average situations.

[LG-2] Approximation Algorithms for Combinatorial Optimization with Predictions

Link: https://arxiv.org/abs/2411.16600
Authors: Antonios Antoniadis, Marek Eliáš, Adam Polak, Moritz Venzin
Keywords: study of utilizing, Steiner Tree, initiate a systematic, systematic study, Steiner Tree problem
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:We initiate a systematic study of utilizing predictions to improve over approximation guarantees of classic algorithms, without increasing the running time. We propose a systematic method for a wide class of optimization problems that ask to select a feasible subset of input items of minimal (or maximal) total weight. This gives simple (near-)linear time algorithms for, e.g., Vertex Cover, Steiner Tree, Min-Weight Perfect Matching, Knapsack, and Clique. Our algorithms produce optimal solutions when provided with perfect predictions and their approximation ratios smoothly degrade with increasing prediction error. With small enough prediction error we achieve approximation guarantees that are beyond reach without predictions in the given time bounds, as exemplified by the NP-hardness and APX-hardness of many of the above problems. Although we show our approach to be optimal for this class of problems as a whole, there is a potential for exploiting specific structural properties of individual problems to obtain improved bounds; we demonstrate this on the Steiner Tree problem. We conclude with an empirical evaluation of our approach.

[LG-3] Adversarial Attacks for Drift Detection

Link: https://arxiv.org/abs/2411.16591
Authors: Fabian Hinder, Valerie Vaquet, Barbara Hammer
Keywords: Concept drift refers, distributions over time, Concept drift, Concept, Abstract
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Concept drift refers to the change of data distributions over time. While drift poses a challenge for learning models, requiring their continual adaption, it is also relevant in system monitoring to detect malfunctions, system failures, and unexpected behavior. In the latter case, the robust and reliable detection of drifts is imperative. This work studies the shortcomings of commonly used drift detection schemes. We show how to construct data streams that are drifting without being detected. We refer to those as drift adversarials. In particular, we compute all possible adversarials for common detection schemes and underpin our theoretical findings with empirical evaluations.

[LG-4] Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches

Link: https://arxiv.org/abs/2411.16567
Authors: Yinqiu Feng, Aoran Shen, Jiacheng Hu, Yingbin Liang, Shiru Wang, Junliang Du
Keywords: Generative Adversarial Networks, integrating data augmentation, enhancing few-shot learning, presents an innovative, innovative approach
Categories: Machine Learning (cs.LG)

Abstract: This paper presents an innovative approach to enhancing few-shot learning by integrating data augmentation with model fine-tuning in a framework designed to tackle the challenges posed by small-sample data. Recognizing the critical limitations of traditional machine learning models that require large datasets, especially in fields such as drug discovery, target recognition, and malicious traffic detection, this study proposes a novel strategy that leverages Generative Adversarial Networks (GANs) and advanced optimization techniques to improve model performance with limited data. Specifically, the paper addresses the noise and bias issues introduced by data augmentation methods, contrasting them with model-based approaches, such as fine-tuning and metric learning, which rely heavily on related datasets. By combining Markov Chain Monte Carlo (MCMC) sampling and discriminative model ensemble strategies within a GAN framework, the proposed model adjusts generative and discriminative distributions to simulate a broader range of relevant data. Furthermore, it employs MHLoss and a reparameterized GAN ensemble to enhance stability and accelerate convergence, ultimately leading to improved classification performance on small-sample images and structured datasets. Results confirm that the MhERGAN algorithm developed in this research is highly effective for few-shot learning, offering a practical solution that bridges data scarcity with high-performing model adaptability and generalization.

[LG-5] Transformers are Deep Optimizers: Provable In-Context Learning for Deep Model Training

Link: https://arxiv.org/abs/2411.16549
Authors: Weimin Wu, Maojiang Su, Jerry Yao-Chieh Hu, Zhao Song, Han Liu
Keywords: in-context learning, capability for in-context, ICL gradient descent, ICL, deep models
Categories: Machine Learning (cs.LG)
Comments: 66 pages, 3 figures

Abstract: We investigate the transformer’s capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a (2N+4)L-layer transformer capable of simulating L gradient descent steps of an N-layer ReLU network through ICL. We also give the theoretical guarantees for the approximation within any given error and the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting using Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.

[LG-6] Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation

Link: https://arxiv.org/abs/2411.16532
Authors: Muhammad Burhan Hafez, Kerim Erekmen
Keywords: universal learning systems, data arrives, development of universal, retraining from scratch, solve multiple tasks
Categories: Machine Learning (cs.LG)
Comments: Accepted for publication in Scientific Reports

Abstract:Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Addressing the problem of continual learning necessitates various methods due to the complexity of the problem space. This problem space includes: (1) addressing catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) facilitating learning without requiring task labels, even in the absence of clear task boundaries. In this paper, the Task-Agnostic Policy Distillation (TAPD) framework is introduced. This framework alleviates problems (1)-(4) by incorporating a task-agnostic phase, where an agent explores its environment without any external goal and maximizes only its intrinsic motivation. The knowledge gained during this phase is later distilled for further exploration. Therefore, the agent acts in a self-supervised manner by systematically seeking novel states. By utilizing task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, leading to improved sample efficiency. Our code is available at the repository: this https URL.

[LG-7] Jaya R Package – A Parameter-Free Solution for Advanced Single and Multi-Objective Optimization

Link: https://arxiv.org/abs/2411.16509
Authors: Neeraj Dhanraj Bokde
Keywords: Jaya optimization algorithm, parameter-free Jaya optimization, suitable for solving, offers a robust, robust and versatile
Categories: Mathematical Software (cs.MS); Machine Learning (cs.LG)

Abstract:The Jaya R package offers a robust and versatile implementation of the parameter-free Jaya optimization algorithm, suitable for solving both single-objective and multi-objective optimization problems. By integrating advanced features such as constraint handling, adaptive population management, Pareto front tracking for multi-objective trade-offs, and parallel processing for computational efficiency, the package caters to a wide range of optimization challenges. Its intuitive design and flexibility allow users to solve complex, real-world problems across various domains. To demonstrate its practical utility, a case study on energy modeling explores the optimization of renewable energy shares, showcasing the package’s ability to minimize carbon emissions and costs while enhancing system reliability. The Jaya R package is an invaluable tool for researchers and practitioners seeking efficient and adaptive optimization solutions.

[LG-8] Distributed communication-efficient and differentially private estimation of KL divergence

Link: https://arxiv.org/abs/2411.16478
Authors: Mary Scott, Sayan Biswas, Graham Cormode, Carsten Maple
Keywords: managing distributed, measure the extent, sensitive data, key task, Abstract
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: 28 pages, 5 figures

Abstract:A key task in managing distributed, sensitive data is to measure the extent to which a distribution changes. Understanding this drift can effectively support a variety of federated learning and analytics tasks. However, in many practical settings sharing such information can be undesirable (e.g., for privacy concerns) or infeasible (e.g., for high communication costs). In this work, we describe novel algorithmic approaches for estimating the KL divergence of data across federated models of computation, under differential privacy. We analyze their theoretical properties and present an empirical study of their performance. We explore parameter settings that optimize the accuracy of the algorithm catering to each of the settings; these provide sub-variations that are applicable to real-world tasks, addressing different context- and application-specific trust level requirements. Our experimental results confirm that our private estimators achieve accuracy comparable to a baseline algorithm without differential privacy guarantees.
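
To make the setting concrete, here is a minimal sketch of one naive baseline, not the estimators proposed in the paper: each party perturbs its histogram counts with Laplace noise, normalizes, and the KL divergence is computed on the noisy distributions. It assumes that adding or removing one record changes a single count by 1, so a Laplace scale of 1/ε gives ε-differential privacy per histogram. The function names and the `floor` parameter are illustrative.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_distribution(counts, epsilon, rng, floor=1e-6):
    """Noisy, normalized histogram (assumes sensitivity 1 per record)."""
    noisy = [max(c + laplace_noise(1.0 / epsilon, rng), floor) for c in counts]
    total = sum(noisy)
    # Clipping and normalizing are post-processing, so they do not weaken the guarantee.
    return [v / total for v in noisy]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    rng = random.Random(42)
    p = private_distribution([400, 300, 300], epsilon=1.0, rng=rng)
    q = private_distribution([250, 350, 400], epsilon=1.0, rng=rng)
    print(kl_divergence(p, q))
```

Only the noisy histograms leave each party, so communication is one vector per party. The accuracy of this baseline degrades as ε shrinks or as bins become sparse.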

[LG-9] Distributed Online Optimization with Stochastic Agent Availability

Link: https://arxiv.org/abs/2411.16477
Authors: Juliette Achddou, Nicolò Cesa-Bianchi, Hao Qiu
Keywords: practical federated learning, federated learning settings, distributed online optimization, Motivated by practical, practical federated
Categories: Machine Learning (cs.LG)

Abstract: Motivated by practical federated learning settings where clients may not be always available, we investigate a variant of distributed online optimization where agents are active with a known probability p at each time step, and communication between neighboring agents can only take place if they are both active. We introduce a distributed variant of the FTRL algorithm and analyze its network regret, defined through the average of the instantaneous regret of the active agents. Our analysis shows that, for any connected communication graph G over N agents, the expected network regret of our FTRL variant after T steps is at most of order $(\kappa/p^2)\min\{\sqrt{N}, N^{1/4}/\sqrt{p}\}\sqrt{T}$, where $\kappa$ is the condition number of the Laplacian of G. We then show that similar regret bounds also hold with high probability. Moreover, we show that our notion of regret (average-case over the agents) is essentially equivalent to the standard notion of regret (worst-case over agents), implying that our bounds are not significantly improvable when p=1. Our theoretical results are supported by experiments on synthetic datasets.

[LG-10] NonSysId: A nonlinear system identification package with improved model term selection for NARMAX models

Link: https://arxiv.org/abs/2411.16475
Authors: Rajintha Gunawardena, Zi-Qiang Lang, Fei He
Keywords: involves constructing mathematical, identification involves constructing, constructing mathematical models, frequency domains, System identification involves
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures

Abstract: System identification involves constructing mathematical models of dynamic systems using input-output data, enabling analysis and prediction of system behaviour in both time and frequency domains. This approach can model the entire system or capture specific dynamics within it. For meaningful analysis, it is essential for the model to accurately reflect the underlying system’s behaviour. This paper introduces NonSysId, an open-source MATLAB software package designed for nonlinear system identification, specifically focusing on NARMAX models. The software incorporates an advanced term selection methodology that prioritises simulation (free-run) accuracy while preserving model parsimony. A key feature is the integration of iterative Orthogonal Forward Regression (iOFR) with Predicted Residual Sum of Squares (PRESS) statistic-based term selection, facilitating robust model generalisation without the need for a separate validation dataset. Furthermore, techniques for reducing computational overheads are implemented. These features make NonSysId particularly suitable for real-time applications such as structural health monitoring, fault diagnosis, and biomedical signal processing, where it is a challenge to capture the signals under consistent conditions, resulting in limited or no validation data.
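
The reason a PRESS-based criterion can stand in for a validation set is that, for ordinary least squares, the leave-one-out prediction error of each sample equals its residual divided by (1 − leverage), so it comes from a single fit. The sketch below checks this identity on a plain straight-line fit. It is an illustration of the PRESS statistic only, not of NARMAX models or iOFR, and all function names are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def press_closed_form(xs, ys):
    """PRESS from one fit: sum of (residual / (1 - leverage))^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope, intercept = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (slope * x + intercept)
        leverage = 1.0 / n + (x - mean_x) ** 2 / sxx
        total += (residual / (1.0 - leverage)) ** 2
    return total


def press_leave_one_out(xs, ys):
    """PRESS by brute force: refit without each sample, then predict it."""
    total = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return total


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0.1, 0.9, 2.2, 2.8, 4.1]
    print(press_closed_form(xs, ys), press_leave_one_out(xs, ys))
```

Both functions return the same value up to rounding, but the closed form needs one fit instead of n, which matters when the statistic is evaluated for every candidate term.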

[LG-11] Lion Cub: Minimizing Communication Overhead in Distributed Lion

Link: https://arxiv.org/abs/2411.16462
Authors: Satoki Ishikawa, Tal Ben-Nun, Brian Van Essen, Rio Yokota, Nikoli Dryden
Keywords: slower Ethernet interconnects, current hardware trends, Ethernet interconnects, slower Ethernet, hardware trends
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects, and given current hardware trends, communication is likely to become a major bottleneck. While gradient compression techniques have been explored for SGD and Adam, the Lion optimizer has the distinct advantage that its update vectors are the output of a sign operation, enabling straightforward quantization. However, simply compressing updates for communication and using techniques like majority voting fails to lead to end-to-end speedups due to inefficient communication algorithms and reduced convergence. We analyze three factors critical to distributed learning with Lion: optimizing communication methods, identifying effective quantization methods, and assessing the necessity of momentum synchronization. Our findings show that quantization techniques adapted to Lion and selective momentum synchronization can significantly reduce communication costs while maintaining convergence. We combine these into Lion Cub, which enables up to 5x speedups in end-to-end training compared to Lion. This highlights Lion’s potential as a communication-efficient solution for distributed training.
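
For readers unfamiliar with Lion, the sketch below shows why its updates are easy to quantize: each coordinate of the update is a sign, so a worker can send one value in {-1, 0, +1} per parameter, and the simplest aggregation is a per-coordinate majority vote. This follows the standard Lion update rule as usually stated (sign of an interpolation between momentum and gradient); it is not the Lion Cub method, whose communication and quantization schemes the abstract only summarizes. Function names are illustrative.

```python
def sign(x: float) -> int:
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step on flat lists of floats; returns (new_params, new_momentum)."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        update = sign(beta1 * m + (1 - beta1) * g)        # the only thing that must be communicated
        new_params.append(p - lr * (update + weight_decay * p))
        new_momentum.append(beta2 * m + (1 - beta2) * g)  # momentum is tracked with a second beta
    return new_params, new_momentum


def majority_vote(worker_signs):
    """Aggregate per-worker sign vectors by taking the sign of the coordinate-wise sum."""
    return [sign(sum(column)) for column in zip(*worker_signs)]


if __name__ == "__main__":
    votes = [[1, -1, 1], [1, 1, -1], [-1, -1, -1]]
    print(majority_vote(votes))
```

The abstract's point is that this naive compress-and-vote scheme does not by itself yield end-to-end speedups, which is what motivates the paper's analysis of communication methods and momentum synchronization.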

[LG-12] On the Reconstruction of Training Data from Group Invariant Networks

Link: https://arxiv.org/abs/2411.16458
Authors: Ran Elbaz, Gilad Yehudai, Meirav Galun, Haggai Maron
Keywords: privacy and explainability, Reconstructing training data, active area, significant implications, implications for privacy
Categories: Machine Learning (cs.LG)

Abstract:Reconstructing training data from trained neural networks is an active area of research with significant implications for privacy and explainability. Recent advances have demonstrated the feasibility of this process for several data types. However, reconstructing data from group-invariant neural networks poses distinct challenges that remain largely unexplored. This paper addresses this gap by first formulating the problem and discussing some of its basic properties. We then provide an experimental evaluation demonstrating that conventional reconstruction techniques are inadequate in this scenario. Specifically, we observe that the resulting data reconstructions gravitate toward symmetric inputs on which the group acts trivially, leading to poor-quality results. Finally, we propose two novel methods aiming to improve reconstruction in this setup and present promising preliminary experimental results. Our work sheds light on the complexities of reconstructing data from group invariant neural networks and offers potential avenues for future research in this domain.

[LG-13] Machine learning for cerebral blood vessels malformations

Link: https://arxiv.org/abs/2411.16349
Authors: Irem Topal, Alexander Cherevko, Yuri Bugay, Maxim Shishlenin, Jean Barbier, Deniz Eroglu, Édgar Roldán, Roman Belousov
Keywords: aneurysms and arteriovenous, arteriovenous malformations, malformations are life-threatening, life-threatening hemodynamic pathologies, Cerebral aneurysms
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Quantitative Methods (q-bio.QM)
Comments: 14 pages, 6 main figures, 5 supplementary figures, 2 supplementary tables

Abstract: Cerebral aneurysms and arteriovenous malformations are life-threatening hemodynamic pathologies of the brain. While surgical intervention is often essential to prevent fatal outcomes, it carries significant risks both during the procedure and in the postoperative period, making the management of these conditions highly challenging. Parameters of cerebral blood flow, routinely monitored during medical interventions, could potentially be utilized in machine learning-assisted protocols for risk assessment and therapeutic prognosis. To this end, we developed a linear oscillatory model of blood velocity and pressure for clinical data acquired from neurosurgical operations. Using the method of Sparse Identification of Nonlinear Dynamics (SINDy), the parameters of our model can be reconstructed online within milliseconds from a short time series of the hemodynamic variables. The identified parameter values enable automated classification of the blood-flow pathologies by means of logistic regression, achieving an accuracy of 73%. Our results demonstrate the potential of this model for both diagnostic and prognostic applications, providing a robust and interpretable framework for assessing cerebral blood vessel conditions.

[LG-14] Towards Foundation Models for Critical Care Time Series NEURIPS2024

Link: https://arxiv.org/abs/2411.16346
Authors: Manuel Burger, Fedor Sergeev, Malte Londschien, Daphné Chopard, Hugo Yèche, Eike Gerdes, Polina Leshetkina, Alexander Morgenroth, Zeynep Babür, Jasmina Bogojeska, Martin Faltys, Rita Kuznetsova, Gunnar Rätsch
Keywords: generalist medical large, medical large language, Notable progress, large language models, made in generalist
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted for Oral Presentation at AIM-FM Workshop at NeurIPS 2024

Abstract:Notable progress has been made in generalist medical large language models across various healthcare areas. However, large-scale modeling of in-hospital time series data - such as vital signs, lab results, and treatments in critical care - remains underexplored. Existing datasets are relatively small, but combining them can enhance patient diversity and improve model robustness. To effectively utilize these combined datasets for large-scale modeling, it is essential to address the distribution shifts caused by varying treatment policies, necessitating the harmonization of treatment variables across the different datasets. This work aims to establish a foundation for training large-scale multi-variate time series models on critical care data and to provide a benchmark for machine learning models in transfer learning across hospitals to study and address distribution shift challenges. We introduce a harmonized dataset for sequence modeling and transfer learning research, representing the first large-scale collection to include core treatment variables. Future plans involve expanding this dataset to support further advancements in transfer learning and the development of scalable, generalizable models for critical healthcare applications.

[LG-15] A Data-Driven Approach to Dataflow-Aware Online Scheduling for Graph Neural Network Inference

Link: https://arxiv.org/abs/2411.16342
Authors: Pol Puigdemont, Enrico Russo, Axel Wassington, Abhijit Das, Sergi Abadal, Maurizio Palesi
Keywords: Graph Neural Networks, Neural Networks, shown significant promise, network analysis, Graph Neural
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: Accepted for ASP-DAC 2025

Abstract:Graph Neural Networks (GNNs) have shown significant promise in various domains, such as recommendation systems, bioinformatics, and network analysis. However, the irregularity of graph data poses unique challenges for efficient computation, leading to the development of specialized GNN accelerator architectures that surpass traditional CPU and GPU performance. Despite this, the structural diversity of input graphs results in varying performance across different GNN accelerators, depending on their dataflows. This variability in performance due to differing dataflows and graph properties remains largely unexplored, limiting the adaptability of GNN accelerators. To address this, we propose a data-driven framework for dataflow-aware latency prediction in GNN inference. Our approach involves training regressors to predict the latency of executing specific graphs on particular dataflows, using simulations on synthetic graphs. Experimental results indicate that our regressors can predict the optimal dataflow for a given graph with up to 91.28% accuracy and a Mean Absolute Percentage Error (MAPE) of 3.78%. Additionally, we introduce an online scheduling algorithm that uses these regressors to enhance scheduling decisions. Our experiments demonstrate that this algorithm achieves up to 3.17× speedup in mean completion time and 6.26× speedup in mean execution time compared to the best feasible baseline across all datasets.
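
Editor's sketch, not taken from the paper: the MAPE figure quoted in the abstract is the standard mean absolute percentage error, which can be computed as below. The dataflow names, the latency values, and the `pick_dataflow` helper are hypothetical placeholders; the helper only assumes that a scheduler would prefer the dataflow with the lowest predicted latency.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (actual values must be nonzero)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def pick_dataflow(predicted_latency):
    """Return the dataflow whose predicted latency is smallest."""
    return min(predicted_latency, key=predicted_latency.get)


# Hypothetical latencies (arbitrary units) for one graph on three dataflows.
simulated = {"dataflow_a": 120.0, "dataflow_b": 95.0, "dataflow_c": 140.0}
predicted = {"dataflow_a": 118.0, "dataflow_b": 99.0, "dataflow_c": 133.0}

names = sorted(simulated)
error = mape([simulated[n] for n in names], [predicted[n] for n in names])
choice = pick_dataflow(predicted)
```

Here `error` measures how far the predicted latencies are from the simulated ones, and `choice` is the dataflow a latency-driven scheduler would select for this graph.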

[LG-16] Local Learning for Covariate Selection in Nonparametric Causal Effect Estimation with Latent Variables

Link: https://arxiv.org/abs/2411.16315
Authors: Zheng Li,Feng Xie,Yan Zeng,Zhi Geng
Keywords: fields of science, fundamental problem, covariate selection, nonexperimental data, Estimating causal effects
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:Estimating causal effects from nonexperimental data is a fundamental problem in many fields of science. A key component of this task is selecting an appropriate set of covariates for confounding adjustment to avoid bias. Most existing methods for covariate selection often assume the absence of latent variables and rely on learning the global network structure among variables. However, identifying the global structure can be unnecessary and inefficient, especially when our primary interest lies in estimating the effect of a treatment variable on an outcome variable. To address this limitation, we propose a novel local learning approach for covariate selection in nonparametric causal effect estimation, which accounts for the presence of latent variables. Our approach leverages testable independence and dependence relationships among observed variables to identify a valid adjustment set for a target causal relationship, ensuring both soundness and completeness under standard assumptions. We validate the effectiveness of our algorithm through extensive experiments on both synthetic and real-world data.

[LG-17] Understanding Generalization of Federated Learning: the Trade-off between Model Stability and Optimization

Link: https://arxiv.org/abs/2411.16303
Authors: Dun Zeng,Zheshun Wu,Shiyu Liu,Yu Pan,Xiaoying Tang,Zenglin Xu
Keywords: distributed learning approach, distributed learning, local data private, Learning, Federated Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Federated Learning (FL) is a distributed learning approach that trains neural networks across multiple devices while keeping their local data private. However, FL often faces challenges due to data heterogeneity, leading to inconsistent local optima among clients. These inconsistencies can cause unfavorable convergence behavior and generalization performance degradation. Existing studies mainly describe this issue through convergence analysis, focusing on how well a model fits training data, or through algorithmic stability, which examines the generalization gap. However, neither approach precisely captures the generalization performance of FL algorithms, especially for neural networks. In this paper, we introduce the first generalization dynamics analysis framework in federated optimization, highlighting the trade-offs between model stability and optimization. Through this framework, we show how the generalization of FL algorithms is affected by the interplay of algorithmic stability and optimization. This framework applies to standard federated optimization and its advanced versions, like server momentum. We find that fast convergence from large local steps or accelerated momentum enlarges stability but obtains better generalization performance. Our insights into these trade-offs can guide the practice of future algorithms for better generalization.
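
Editor's sketch, not taken from the paper: a toy version of federated averaging with a server-side momentum buffer, to make "local steps" and "server momentum" concrete. The scalar quadratic client losses, the hyperparameter values, and all function names are assumptions chosen for illustration only.

```python
def local_update(w, target, lr, steps):
    """Run `steps` gradient steps on the client loss 0.5 * (w - target)**2."""
    for _ in range(steps):
        w -= lr * (w - target)
    return w


def fedavg_server_momentum(targets, rounds=100, local_steps=5,
                           client_lr=0.1, beta=0.5):
    """FedAvg on scalar quadratic clients with a server-side momentum buffer."""
    w, momentum = 0.0, 0.0
    for _ in range(rounds):
        # Each client starts from the global model and returns its model delta.
        deltas = [local_update(w, t, client_lr, local_steps) - w for t in targets]
        avg_delta = sum(deltas) / len(deltas)
        # Server momentum: accumulate averaged deltas, then move the global model.
        momentum = beta * momentum + avg_delta
        w += momentum
    return w


client_targets = [1.0, 2.0, 6.0]  # heterogeneous local optima
w_final = fedavg_server_momentum(client_targets)
```

In this toy problem every client has the same curvature, so the global model settles at the mean of the local optima; setting `beta=0.0` recovers plain federated averaging.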

[LG-18] Evaluating Rank-N-Contrast: Continuous and Robust Representations for Regression

Link: https://arxiv.org/abs/2411.16298
Authors: Six Valentin,Chidiac Alexandre,Worlikar Arkin
Keywords: paper published, RNC, Abstract, arXiv, study validates RNC
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:This document is a replication of the original “Rank-N-Contrast” (arXiv:2210.01189v2) paper published in 2023. This evaluation is done for academic purposes. Deep regression models often fail to capture the continuous nature of sample orders, creating fragmented representations and suboptimal performance. To address this, we reproduced the Rank-N-Contrast (RNC) framework, which learns continuous representations by contrasting samples by their rankings in the target space. Our study validates RNC’s theoretical and empirical benefits, including improved performance and robustness. We extended the evaluation to an additional regression dataset and conducted robustness tests using a holdout method, where a specific range of continuous data was excluded from the training set. This approach assessed the model’s ability to generalise to unseen data and achieve state-of-the-art performance. This replication study validates the original findings and broadens the understanding of RNC’s applicability and robustness.

[LG-19] A Graph Neural Architecture Search Approach for Identifying Bots in Social Media

Link: https://arxiv.org/abs/2411.16285
Authors: Georgios Tzoumanekas,Michail Chatzianastasis,Loukas Ilias,George Kiokes,John Psarras,Dimitris Askounis
Keywords: tangible real-world consequences, bots-automated programs disseminating, programs disseminating misinformation, Social media platforms, Neural Architecture Search
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Abstract:Social media platforms, including X, Facebook, and Instagram, host millions of daily users, giving rise to bots, automated programs disseminating misinformation and ideologies with tangible real-world consequences. While bot detection in platform X has been the area of many deep learning models with adequate results, most approaches neglect the graph structure of social media relationships and often rely on hand-engineered architectures. Our work introduces the implementation of a Neural Architecture Search (NAS) technique, namely Deep and Flexible Graph Neural Architecture Search (DFG-NAS), tailored to Relational Graph Convolutional Neural Networks (RGCNs) in the task of bot detection in platform X. Our model constructs a graph that incorporates both the user relationships and their metadata. Then, DFG-NAS is adapted to automatically search for the optimal configuration of Propagation and Transformation functions in the RGCNs. Our experiments are conducted on the TwiBot-20 dataset, constructing a graph with 229,580 nodes and 227,979 edges. We study the five architectures with the highest performance during the search and achieve an accuracy of 85.7%, surpassing state-of-the-art models. Our approach not only addresses the bot detection challenge but also advocates for the broader implementation of NAS models in neural network design automation.

[LG-20] Even Sparser Graph Transformers

Link: https://arxiv.org/abs/2411.16278
Authors: Hamed Shirzad,Honghao Lin,Balaji Venkatachalam,Ameya Velingker,David Woodruff,Danica Sutherland
Keywords: Graph Transformers excel, long-range dependency modeling, generally require quadratic, Transformers excel, quadratic memory complexity
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Graph Transformers excel in long-range dependency modeling, but generally require quadratic memory complexity in the number of nodes in an input graph, and hence have trouble scaling to large graphs. Sparse attention variants such as Exphormer can help, but may require high-degree augmentations to the input graph for good performance, and do not attempt to sparsify an already-dense input graph. As the learned attention mechanisms tend to use few of these edges, such high-degree connections may be unnecessary. We show (empirically and with theoretical backing) that attention scores on graphs are usually quite consistent across network widths, and use this observation to propose a two-stage procedure, which we call Spexphormer: first, train a narrow network on the full augmented graph. Next, use only the active connections to train a wider network on a much sparser graph. We establish theoretical conditions when a narrow network’s attention scores can match those of a wide network, and show that Spexphormer achieves good performance with drastically reduced memory requirements on various graph datasets.

[LG-21] Deep Learning for Motion Classification in Ankle Exoskeletons Using Surface EMG and IMU Signals

Link: https://arxiv.org/abs/2411.16273
Authors: Silas Ruhrberg Estévez,Josée Mallah,Dominika Kazieczko,Chenyu Tang,Luigi G. Occhipinti
Keywords: reduce fall risks, garnered considerable interest, Inertial Measurement Units, fall risks, aging population
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:

Abstract:Ankle exoskeletons have garnered considerable interest for their potential to enhance mobility and reduce fall risks, particularly among the aging population. The efficacy of these devices relies on accurate real-time prediction of the user’s intended movements through sensor-based inputs. This paper presents a novel motion prediction framework that integrates three Inertial Measurement Units (IMUs) and eight surface Electromyography (sEMG) sensors to capture both kinematic and muscular activity data. A comprehensive set of activities, representative of everyday movements in barrier-free environments, was recorded for the purpose. Our findings reveal that Convolutional Neural Networks (CNNs) slightly outperform Long Short-Term Memory (LSTM) networks on a dataset of five motion tasks, achieving classification accuracies of 96.5 ± 0.8% and 87.5 ± 2.9%, respectively. Furthermore, we demonstrate the system’s proficiency in transfer learning, enabling accurate motion classification for new subjects using just ten samples per class for finetuning. The robustness of the model is demonstrated by its resilience to sensor failures resulting in absent signals, maintaining reliable performance in real-world scenarios. These results underscore the potential of deep learning algorithms to enhance the functionality and safety of ankle exoskeletons, ultimately improving their usability in daily life.

[LG-22] Local Bayesian Optimization for Controller Tuning with Crash Constraints

Link: https://arxiv.org/abs/2411.16267
Authors: Alexander von Rohr,David Stenger,Dominik Scheurenberg,Sebastian Trimpe
Keywords: involves manual adjustments, manual adjustments, crucial for closed-loop, involves manual, Bayesian optimization
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Published in at-Automatisierungstechnik

Abstract:Controller tuning is crucial for closed-loop performance but often involves manual adjustments. Although Bayesian optimization (BO) has been established as a data-efficient method for automated tuning, applying it to large and high-dimensional search spaces remains challenging. We extend a recently proposed local variant of BO to include crash constraints, where the controller can only be successfully evaluated in an a-priori unknown feasible region. We demonstrate the efficiency of the proposed method through simulations and hardware experiments. Our findings showcase the potential of local BO to enhance controller performance and reduce the time and resources necessary for tuning.

[LG-23] Neural Network-based High-index Saddle Dynamics Method for Searching Saddle Points and Solution Landscape

Link: https://arxiv.org/abs/2411.16200
Authors: Yuankai Liu,Lei Zhang,Jin Zhao
Keywords: computing saddle points, high-index saddle dynamics, solution landscape, powerful approach, approach for computing
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The high-index saddle dynamics (HiSD) method is a powerful approach for computing saddle points and solution landscape. However, its practical applicability is constrained by the need for the explicit energy function expression. To overcome this challenge, we propose a neural network-based high-index saddle dynamics (NN-HiSD) method. It utilizes a neural network-based surrogate model to approximate the energy function, allowing the use of the HiSD method in cases where the energy function is either unavailable or computationally expensive. We further enhance the efficiency of the NN-HiSD method by incorporating momentum acceleration techniques, specifically Nesterov’s acceleration and the heavy-ball method. We also provide a rigorous convergence analysis of the NN-HiSD method. We conduct numerical experiments on systems with and without explicit energy functions, specifically including the alanine dipeptide model and bacterial ribosomal assembly intermediates for the latter, demonstrating the effectiveness and reliability of the proposed method.

[LG-24] On the Robustness of the Successive Projection Algorithm

Link: https://arxiv.org/abs/2411.16195
Authors: Giovanni Barbarino,Nicolas Gillis
Keywords: successive projection algorithm, SPA, latent simplex, successive projection, convex hull
Subjects: Numerical Analysis (math.NA); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 23 pages

Abstract:The successive projection algorithm (SPA) is a workhorse algorithm to learn the r vertices of the convex hull of a set of (r-1)-dimensional data points, a.k.a. a latent simplex, which has numerous applications in data science. In this paper, we revisit the robustness to noise of SPA and several of its variants. In particular, when r ≥ 3, we prove the tightness of the existing error bounds for SPA and for two more robust preconditioned variants of SPA. We also provide significantly improved error bounds for SPA, by a factor proportional to the conditioning of the r vertices, in two special cases: for the first extracted vertex, and when r ≤ 2. We then provide further improvements for the error bounds of a translated version of SPA proposed by Arora et al. (“A practical algorithm for topic modeling with provable guarantees”, ICML, 2013) in two special cases: for the first two extracted vertices, and when r ≤ 3. Finally, we propose a new more robust variant of SPA that first shifts and lifts the data points in order to minimize the conditioning of the problem. We illustrate our results on synthetic data.
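
Editor's sketch, not taken from the paper: the classical noiseless SPA repeatedly picks the data point with the largest norm and projects all points onto the orthogonal complement of that point. The version below is a plain-Python illustration that assumes the vertices are linearly independent; it does not include the preconditioned, translated, or shift-and-lift variants discussed in the abstract, and the toy data is made up.

```python
def spa(points, r):
    """Return indices of up to r points selected by successive projection."""
    residual = [list(map(float, p)) for p in points]
    selected = []
    for _ in range(r):
        sq_norms = [sum(v * v for v in p) for p in residual]
        j = max(range(len(residual)), key=sq_norms.__getitem__)
        if sq_norms[j] == 0.0:
            break  # nothing left to extract
        selected.append(j)
        scale = sq_norms[j] ** 0.5
        u = [v / scale for v in residual[j]]
        # Project every residual onto the orthogonal complement of u.
        projected = []
        for p in residual:
            coeff = sum(a * b for a, b in zip(p, u))
            projected.append([a - coeff * b for a, b in zip(p, u)])
        residual = projected
    return selected


# Toy data: indices 1, 2, 3 are the vertices; indices 0 and 4 are mixtures.
points = [
    [0.5, 1.0, 0.0],        # midpoint of vertices 1 and 2
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0 / 3, 2.0 / 3, 1.0],  # average of the three vertices
]
vertex_indices = spa(points, 3)
```

On this noiseless example the selected indices are exactly the three vertices; the robustness question studied in the paper is how far the selection can drift once noise is added to the points.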

[LG-25] BadSFL: Backdoor Attack against Scaffold Federated Learning

Link: https://arxiv.org/abs/2411.16167
Authors: Xingshuo Han,Xiang Lan,Haozhao Wang,Shengmin Xu,Shen Ren,Jason Zeng,Ming Wu,Michael Heinrich,Tianwei Zhang
Keywords: preserve data privacy, Federated learning, deep learning models, global model, backdoor attacks
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Federated learning (FL) enables the training of deep learning models on distributed clients to preserve data privacy. However, this learning paradigm is vulnerable to backdoor attacks, where malicious clients can upload poisoned local models to embed backdoors into the global model, leading to attacker-desired predictions. Existing backdoor attacks mainly focus on FL with independently and identically distributed (IID) scenarios, while real-world FL training data are typically non-IID. Current strategies for non-IID backdoor attacks suffer from limitations in maintaining effectiveness and durability. To address these challenges, we propose a novel backdoor attack method, BadSFL, specifically designed for the FL framework using the Scaffold aggregation algorithm in non-IID settings. BadSFL leverages a Generative Adversarial Network (GAN) based on the global model to complement the training set, achieving high accuracy on both backdoor and benign samples. It utilizes a specific feature as the backdoor trigger to ensure stealthiness, and exploits the Scaffold’s control variate to predict the global model’s convergence direction, ensuring the backdoor’s persistence. Extensive experiments on three benchmark datasets demonstrate the high effectiveness, stealthiness, and durability of BadSFL. Notably, our attack remains effective over 60 rounds in the global model and up to 3 times longer than existing baseline attacks after stopping the injection of malicious updates.

[LG-26] DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders

Link: https://arxiv.org/abs/2411.16154
Authors: Sizai Hou,Songze Li,Duanyi Yao
Keywords: Self-supervised learning, training high-quality upstream, pervasively exploited, large amount, amount of unlabeled
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 12 pages

Abstract:Self-supervised learning (SSL) is pervasively exploited in training high-quality upstream encoders with a large amount of unlabeled data. However, it is found to be susceptible to backdoor attacks merely via polluting a small portion of training data. The victim encoders mismatch triggered inputs with target embeddings, e.g., match the triggered cat input to an airplane embedding, such that the downstream tasks misbehave when the trigger is activated. Emerging backdoor attacks have shown great threats in different SSL paradigms such as contrastive learning and CLIP, while little research is devoted to defending against such attacks. Besides, the existing defenses fall short in detecting advanced stealthy backdoors. To address the limitations, we propose a novel detection mechanism, DeDe, which detects the activation of the backdoor mapping with the co-occurrence of victim encoder and trigger inputs. Specifically, DeDe trains a decoder for the SSL encoder on an auxiliary dataset (can be out-of-distribution or even slightly poisoned), such that for any triggered input that misleads to the target embedding, the decoder outputs an image significantly different from the input. We empirically evaluate DeDe on both contrastive learning and CLIP models against various types of backdoor attacks, and demonstrate its superior performance over SOTA detection methods in both upstream detection performance and ability of preventing backdoors in downstream tasks.

[LG-27] Local Intrinsic Dimensionality for Dynamic Graph Embeddings

Link: https://arxiv.org/abs/2411.16145
Authors: Dušica Knežević,Miloš Savić,Miloš Radovanović
Keywords: important theoretical implications, local intrinsic dimensionality, notion of local, important theoretical, theoretical implications
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The notion of local intrinsic dimensionality (LID) has important theoretical implications and practical applications in the fields of data mining and machine learning. Recent research efforts indicate that LID measures defined for graphs can improve graph representational learning methods based on random walks. In this paper, we discuss how NC-LID, a LID measure designed for static graphs, can be adapted for dynamic networks. Focusing on dynnode2vec as the most representative dynamic graph embedding method based on random walks, we examine correlations between NC-LID and the intrinsic quality of 10 real-world dynamic network embeddings. The obtained results show that NC-LID can be used as a good indicator of nodes whose embedding vectors do not tend to preserve temporal graph structure well. Thus, our empirical findings constitute the first step towards LID-aware dynamic graph embedding methods.

[LG-28] Causal Adjacency Learning for Spatiotemporal Prediction Over Graphs

Link: https://arxiv.org/abs/2411.16142
Authors: Zhaobin Mo,Qingyuan Liu,Baohua Yan,Longxiang Zhang,Xuan Di
Keywords: adjacency matrix, transportation systems, existing STPG models, crucial for transportation, adjacency
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Spatiotemporal prediction over graphs (STPG) is crucial for transportation systems. In existing STPG models, an adjacency matrix is an important component that captures the relations among nodes over graphs. However, most studies calculate the adjacency matrix by directly memorizing the data, such as distance- and correlation-based matrices. These adjacency matrices do not consider potential pattern shift for the test data, and may result in suboptimal performance if the test data has a different distribution from the training one. This issue is known as the Out-of-Distribution generalization problem. To address this issue, in this paper we propose a Causal Adjacency Learning (CAL) method to discover causal relations over graphs. The learned causal adjacency matrix is evaluated on a downstream spatiotemporal prediction task using real-world graph data. Results demonstrate that our proposed adjacency matrix can capture the causal relations, and using our learned adjacency matrix can enhance prediction performance on the OOD test data, even though causal learning is not conducted in the downstream task.

[LG-29] Beyond Task Vectors: Selective Task Arithmetic Based on Importance Metrics

Link: https://arxiv.org/abs/2411.16139
Authors: Tian Bowen,Lai Songning,Wu Jiemin,Shuai Zhihao,Ge Shiming,Yue Yutao
Keywords: pre-learned knowledge representations, revolutionized deep learning, Pretrained models, leveraging large-scale
Subjects: Machine Learning (cs.LG)
Comments: Under Review

Abstract:Pretrained models have revolutionized deep learning by enabling significant performance improvements across a wide range of tasks, leveraging large-scale, pre-learned knowledge representations. However, deploying these models in real-world multi-task learning (MTL) scenarios poses substantial challenges, primarily due to high computational costs and inefficiencies in inference. Traditional approaches such as pruning, quantization, and knowledge distillation have been explored to mitigate these issues, but they often fall short in fully addressing the complexities of multi-task environments. This paper introduces Selective Task Arithmetic (STA), a training-free framework designed to enhance multi-task performance through task-specific parameter fusion. STA addresses three key challenges: (i) Parameter importance diversity: Recognizing that different tasks rely on distinct parameters, STA employs a loss-sensitive parameter importance metric derived from a first-order Taylor expansion to accurately measure the importance of parameters for each task. (ii) Over-reliance on hyperparameter tuning: By enhancing the sparsity of task vectors through parameter importance metrics, STA reduces the need for extensive hyperparameter tuning, thereby improving the generalization and robustness of the model. (iii) Neglect of other abilities in task arithmetic: Previous works have largely overlooked the potential for more precise task forgetting. STA leverages its parameter importance metric to achieve more controlled and effective task forgetting, minimizing the impact of noisy elements that can degrade model performance. Experimental results demonstrate that STA achieves superior multi-task performance across benchmarks and excellent performance in task forgetting.
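
Editor's sketch, not taken from the paper: one common way to build a first-order Taylor importance score is to multiply each task-vector entry by the corresponding loss gradient and take the absolute value. The code below uses that score to sparsify task vectors before adding them to the pretrained weights. The exact metric, the gradients, and the `keep` budget used by STA are not given in the abstract, so every name and number here is an assumption.

```python
def taylor_importance(task_vector, gradient):
    """First-order estimate |g_i * tau_i| of the loss change if tau_i is dropped."""
    return [abs(g * t) for g, t in zip(gradient, task_vector)]


def sparsify(task_vector, gradient, keep):
    """Keep the `keep` highest-scoring entries of the task vector, zero the rest."""
    scores = taylor_importance(task_vector, gradient)
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    top = set(ranked[:keep])
    return [t if i in top else 0.0 for i, t in enumerate(task_vector)]


def merge(pretrained, sparse_task_vectors, scale=1.0):
    """Task arithmetic: add scaled task vectors to the pretrained weights."""
    merged = list(pretrained)
    for tau in sparse_task_vectors:
        merged = [m + scale * t for m, t in zip(merged, tau)]
    return merged


# Hypothetical three-parameter model with two fine-tuned tasks.
pretrained = [0.0, 0.0, 0.0]
tau_a = sparsify([0.5, -0.1, 0.2], gradient=[1.0, 1.0, -3.0], keep=2)
tau_b = sparsify([0.0, 0.4, -0.3], gradient=[2.0, 1.0, 0.1], keep=1)
merged = merge(pretrained, [tau_a, tau_b])
```

Passing a negative `scale` to `merge` subtracts a task vector instead, which is the usual task-arithmetic recipe for forgetting a task.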

[LG-30] Context Awareness Gate For Retrieval Augmented Generation

Link: https://arxiv.org/abs/2411.16133
Authors: Mohammad Hassan Heydari,Arshia Hemmat,Erfan Naman,Afsaneh Fatemi
Keywords: widely adopted approach, Retrieval Augmented Generation, large language models, Augmented Generation, answering domain-specific questions
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions. Previous research has predominantly focused on improving the accuracy and quality of retrieved data chunks to enhance the overall performance of the generation pipeline. However, despite ongoing advancements, the critical issue of retrieving irrelevant information – which can impair the ability of the model to utilize its internal knowledge effectively – has received minimal attention. In this work, we investigate the impact of retrieving irrelevant information in open-domain question answering, highlighting its significant detrimental effect on the quality of LLM outputs. To address this challenge, we propose the Context Awareness Gate (CAG) architecture, a novel mechanism that dynamically adjusts the LLMs’ input prompt based on whether the user query necessitates external context retrieval. Additionally, we introduce the Vector Candidates method, a core mathematical component of CAG that is statistical, LLM-independent, and highly scalable. We further examine the distributions of relationships between contexts and questions, presenting a statistical analysis of these distributions. This analysis can be leveraged to enhance the context retrieval process in Retrieval Augmented Generation (RAG) systems.

[LG-31] DF-GNN: Dynamic Fusion Framework for Attention Graph Neural Networks on GPUs

Link: https://arxiv.org/abs/2411.16127
Authors: Jiahui Liu,Zhenkun Cai,Zhiyong Chen,Minjie Wang
Keywords: Attention Graph Neural, Graph Neural Networks, Attention Graph, Graph Transformer, Neural Networks
Subjects: Machine Learning (cs.LG); Performance (cs.PF)
Comments:

Abstract:Attention Graph Neural Networks (AT-GNNs), such as GAT and Graph Transformer, have demonstrated superior performance compared to other GNNs. However, existing GNN systems struggle to efficiently train AT-GNNs on GPUs due to their intricate computation patterns. The execution of AT-GNN operations without kernel fusion results in heavy data movement and significant kernel launch overhead, while fixed thread scheduling in existing GNN kernel fusion strategies leads to sub-optimal performance, redundant computation and unbalanced workload. To address these challenges, we propose a dynamic kernel fusion framework, DF-GNN, for the AT-GNN family. DF-GNN introduces a dynamic bi-level thread scheduling strategy, enabling flexible adjustments to thread scheduling while retaining the benefits of shared memory within the fused kernel. DF-GNN tailors specific thread scheduling for operations in AT-GNNs and considers the performance bottleneck shift caused by the presence of super nodes. Additionally, DF-GNN is integrated with the PyTorch framework for high programmability. Evaluations across diverse GNN models and multiple datasets reveal that DF-GNN surpasses existing GNN kernel optimization works like cuGraph and dgNN, with speedups up to 7.0× over the state-of-the-art non-fusion DGL sparse library. Moreover, it achieves an average speedup of 2.16× in end-to-end training compared to the popular GNN computing framework DGL.

[LG-32] BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Link: https://arxiv.org/abs/2411.16102
Authors: Yilong Zhao,Shuo Yang,Kan Zhu,Lianmin Zheng,Baris Kasikci,Yang Zhou,Jiarong Xing,Ion Stoica
Keywords: achieve higher throughput, Offline batch inference, lower costs, latency-insensitive applications, leverages the flexibility
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a request schedule that maximizes resource overlapping can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing sub-optimal inference throughput. We present BlendServe, a system that maximizes resource utilization of offline batch inference by combining the benefits of resource overlapping and prefix sharing using a resource-aware prefix tree. BlendServe exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing. We evaluate BlendServe on a variety of synthetic multi-modal workloads and show that it provides up to 1.44× throughput boost compared to widely-used industry standards, vLLM and SGLang.

[LG-33] LDACP: Long-Delayed Ad Conversions Prediction Model for Bidding Strategy

Link: https://arxiv.org/abs/2411.16095
Authors: Peng Cui(1),Yiming Yang(2),Fusheng Jin(1),Siyuan Tang(2),Yunli Wang(2),Fukang Yang(2),Yalong Jia(2),Qingpeng Cai(2),Fei Pan(2),Changcheng Li(2),Peng Jiang(2) ((1) Beijing Institute of Technology, (2) Kuaishou Technology)
Keywords: Cost Per Action, system dynamically adjusts, conversion numbers, automated bidding system, bidding system dynamically
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 6 tables

Abstract:In online advertising, once an ad campaign is deployed, the automated bidding system dynamically adjusts the bidding strategy to optimize Cost Per Action (CPA) based on the number of ad conversions. For ads with a long conversion delay, relying solely on the real-time tracked conversion number as a signal for bidding strategy can significantly overestimate the current CPA, leading to conservative bidding strategies. Therefore, it is crucial to predict the number of long-delayed conversions. Nonetheless, it is challenging to predict ad conversion numbers through traditional regression methods due to the wide range of ad conversion numbers. Previous regression works have addressed this challenge by transforming regression problems into bucket classification problems, achieving success in various scenarios. However, specific challenges arise when predicting the number of ad conversions: 1) The integer nature of ad conversion numbers exacerbates the discontinuity issue in one-hot hard labels; 2) The long-tail distribution of ad conversion numbers complicates tail data prediction. In this paper, we propose the Long-Delayed Ad Conversions Prediction model for bidding strategy (LDACP), which consists of two sub-modules. To alleviate the issue of discontinuity in one-hot hard labels, the Bucket Classification Module with label Smoothing method (BCMS) converts one-hot hard labels into non-normalized soft labels, then fits these soft labels by minimizing classification loss and regression loss. To address the challenge of predicting tail data, the Value Regression Module with Proxy labels (VRMP) uses the prediction bias of aggregated pCTCVR as proxy labels. Finally, a Mixture of Experts (MoE) structure integrates the predictions from BCMS and VRMP to obtain the final predicted ad conversion number.

[LG-34] Exploring the Generalization Capabilities of AID-based Bi-level Optimization

Link: https://arxiv.org/abs/2411.16081
Authors: Congliang Chen, Li Shen, Zhiqiang Xu, Wei Liu, Zhi-Quan Luo, Peilin Zhao
Keywords: achieved considerable success, contemporary machine learning, bi-level optimization methods, Bi-level optimization, machine learning applications
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bi-level optimization has achieved considerable success in contemporary machine learning applications, especially when proper hyperparameters are given. Because of the two-level optimization structure, researchers commonly focus on two types of bi-level optimization methods: approximate implicit differentiation (AID)-based and iterative differentiation (ITD)-based approaches. ITD-based methods can be readily transformed into single-level optimization problems, facilitating the study of their generalization capabilities. In contrast, AID-based methods cannot be easily transformed in the same way and must stay in the two-level structure, leaving their generalization properties poorly understood. In this paper, we establish the uniform stability of AID-based methods even when the outer-level function is nonconvex, obtaining results similar to those for a single-level nonconvex problem. We conduct a convergence analysis for a carefully chosen step size to maintain stability. Combining the convergence and stability results, we characterize the generalization ability of AID-based bi-level optimization methods. Furthermore, we carry out an ablation study of the parameters and assess the performance of these methods on real-world tasks. Our experimental results corroborate the theoretical findings, demonstrating the effectiveness and potential applications of these methods.

[LG-35] VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction

Link: https://arxiv.org/abs/2411.16063
Authors: Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher
Keywords: In-Context Operator Networks, Operator Networks, Multiple Physics Pretraining, In-Context Operator, Networks
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)

Abstract:In-Context Operator Networks (ICONs) are models that learn operators across different types of PDEs using a few-shot, in-context approach. Although they show successful generalization to various PDEs, existing methods treat each data point as a single token, and suffer from computational inefficiency when processing dense data, limiting their application in higher spatial dimensions. In this work, we propose Vision In-Context Operator Networks (VICON), incorporating a vision transformer architecture that efficiently processes 2D functions through patch-wise operations. We evaluated our method on three fluid dynamics datasets, demonstrating both superior performance (reducing scaled L^2 error by 40% and 61.6% for two benchmark datasets for compressible flows, respectively) and computational efficiency (requiring only one-third of the inference time per frame) in long-term rollout predictions compared to the current state-of-the-art sequence-to-sequence model with fixed timestep prediction: Multiple Physics Pretraining (MPP). Compared to MPP, our method preserves the benefits of in-context operator learning, enabling flexible context formation when dealing with insufficient frame counts or varying timestep values.

[LG-36] Binary Search with Distributional Predictions

Link: https://arxiv.org/abs/2411.16030
Authors: Michael Dinitz, Sungjin Im, Thomas Lavastida, Benjamin Moseley, Aidin Niaparast, Sergei Vassilvitskii
Keywords: combining traditional worst-case, traditional worst-case algorithms, modern machine learning, machine learning, machine learning system
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)

Abstract:Algorithms with (machine-learned) predictions is a powerful framework for combining traditional worst-case algorithms with modern machine learning. However, the vast majority of work in this space assumes that the prediction itself is non-probabilistic, even if it is generated by some stochastic process (such as a machine learning system). This is a poor fit for modern ML, particularly modern neural networks, which naturally generate a distribution. We initiate the study of algorithms with distributional predictions, where the prediction itself is a distribution. We focus on one of the simplest yet fundamental settings: binary search (or searching a sorted array). This setting has one of the simplest algorithms with a point prediction, but what happens if the prediction is a distribution? We show that this is a richer setting: there are simple distributions where using the classical prediction-based algorithm with any single prediction does poorly. Motivated by this, as our main result, we give an algorithm with query complexity O(H(p) + \log \eta), where H(p) is the entropy of the true distribution p and \eta is the earth mover’s distance between p and the predicted distribution \hat p. This also yields the first distributionally-robust algorithm for the classical problem of computing an optimal binary search tree given a distribution over target keys. We complement this with a lower bound showing that this query complexity is essentially optimal (up to constants), and experiments validating the practical usefulness of our algorithm.
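
The abstract refers to the classical point-prediction algorithm without spelling it out. The following is a minimal sketch of how such an algorithm is commonly described, not the paper's distributional algorithm: start at the predicted index, probe at exponentially growing distances until the target is bracketed, then finish with an ordinary binary search. The function name `search_with_prediction` and its parameters are hypothetical. Under this scheme the number of comparisons grows roughly with the logarithm of the distance between the predicted and the true position.

```python
from bisect import bisect_left


def search_with_prediction(arr, target, predicted_index):
    """Return the leftmost index i with arr[i] >= target (len(arr) if none).

    Probes at exponentially growing distances from predicted_index until the
    answer is bracketed, then runs a standard binary search inside the bracket.
    """
    n = len(arr)
    if n == 0:
        return 0
    start = min(max(predicted_index, 0), n - 1)  # clamp an out-of-range guess
    step = 1
    if arr[start] >= target:
        # The answer lies at or to the left of start: gallop left.
        hi = start
        lo = hi - step
        while lo >= 0 and arr[lo] >= target:
            hi = lo
            step *= 2
            lo = hi - step
        return bisect_left(arr, target, max(lo + 1, 0), hi)
    # The answer lies strictly to the right of start: gallop right.
    lo = start
    hi = lo + step
    while hi < n and arr[hi] < target:
        lo = hi
        step *= 2
        hi = lo + step
    return bisect_left(arr, target, lo + 1, min(hi, n))
```

A good guess keeps the bracket small, while a poor guess degrades gracefully to the cost of a plain binary search. The abstract's point is that no single guess of this kind works well for some simple distributions over targets.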

[LG-37] M3: Mamba-assisted Multi-Circuit Optimization via MBRL with Effective Scheduling

Link: https://arxiv.org/abs/2411.16019
Authors: Youngmin Oh, Jinje Park, Seunggeun Kim, Taejin Paik, David Pan, Bosun Hwang
Keywords: demonstrated significant potential, Recent advancements, diverse circuit topologies, Mamba architecture, reinforcement learning
Categories: Machine Learning (cs.LG)

Abstract:Recent advancements in reinforcement learning (RL) for analog circuit optimization have demonstrated significant potential for improving sample efficiency and generalization across diverse circuit topologies and target specifications. However, challenges remain, such as high computational overhead and the need for bespoke models for each circuit. To address them, we propose M3, a novel Model-based RL (MBRL) method employing the Mamba architecture and effective scheduling. The Mamba architecture, known as a strong alternative to the transformer architecture, enables multi-circuit optimization with distinct parameters and target specifications. The effective scheduling strategy enhances sample efficiency by adjusting crucial MBRL training parameters. To the best of our knowledge, M3 is the first method for multi-circuit optimization that leverages both the Mamba architecture and an MBRL with effective scheduling. As a result, it significantly improves sample efficiency compared to existing RL methods.

[LG-38] Stability properties of gradient flow dynamics for the symmetric low-rank matrix factorization problem

Link: https://arxiv.org/abs/2411.15972
Authors: Hesameddin Mohammadi, Mohammad Tinati, Stephen Tu, Mahdi Soltanolkotabi, Mihailo R. Jovanović
Keywords: including matrix recovery, symmetric low-rank matrix, low-rank matrix factorization, matrix factorization serves, learning tasks
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments: 10 pages, 3 figures

Abstract:The symmetric low-rank matrix factorization serves as a building block in many learning tasks, including matrix recovery and training of neural networks. However, despite a flurry of recent research, the dynamics of its training via non-convex factorized gradient-descent-type methods is not fully understood especially in the over-parameterized regime where the fitted rank is higher than the true rank of the target matrix. To overcome this challenge, we characterize equilibrium points of the gradient flow dynamics and examine their local and global stability properties. To facilitate a precise global analysis, we introduce a nonlinear change of variables that brings the dynamics into a cascade connection of three subsystems whose structure is simpler than the structure of the original system. We demonstrate that the Schur complement to a principal eigenspace of the target matrix is governed by an autonomous system that is decoupled from the rest of the dynamics. In the over-parameterized regime, we show that this Schur complement vanishes at an O(1/t) rate, thereby capturing the slow dynamics that arises from excess parameters. We utilize a Lyapunov-based approach to establish exponential convergence of the other two subsystems. By decoupling the fast and slow parts of the dynamics, we offer new insight into the shape of the trajectories associated with local search algorithms and provide a complete characterization of the equilibrium points and their global stability properties. Such an analysis via nonlinear control techniques may prove useful in several related over-parameterized problems.

[LG-39] Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise

Link: https://arxiv.org/abs/2411.15958
Authors: Enea Monzio Compagnoni, Tianlin Liu, Rustem Islamov, Frank Norbert Proske, Antonio Orvieto, Aurelien Lucchi
Keywords: adaptive optimization methods, vast empirical evidence, empirical evidence supporting, deep learning, vast empirical
Categories: Machine Learning (cs.LG)
Comments: An earlier version, titled ‘SDEs for Adaptive Methods: The Role of Noise’ and dated May 2024, is available on OpenReview

Abstract:Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tail noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence by verifying our insights: this includes numerically integrating our SDEs using Euler-Maruyama discretization on various neural network architectures such as MLPs, CNNs, ResNets, and Transformers. Our SDEs accurately track the behavior of the respective optimizers, especially when compared to previous SDEs derived for Adam and RMSprop. We believe our approach can provide valuable insights into best training practices and novel scaling rules.
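
The abstract mentions integrating the SDEs numerically with Euler-Maruyama discretization. As a generic illustration only (the abstract does not state the paper's SDEs, so none are reproduced here), the sketch below applies the Euler-Maruyama scheme to a scalar SDE dX = a(X) dt + b(X) dW. The function name and arguments are hypothetical.

```python
import math
import random


def euler_maruyama(drift, diffusion, x0, dt, n_steps, seed=0):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW.

    Each step adds the deterministic drift over dt plus a Gaussian
    increment with standard deviation sqrt(dt), scaled by the diffusion.
    """
    rng = random.Random(seed)
    x = x0
    path = [x0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path
```

For example, `euler_maruyama(lambda x: -2.0 * x, lambda x: 0.5, 1.0, 0.01, 1000)` simulates a mean-reverting process. Setting the diffusion to zero recovers the explicit Euler method for the corresponding ODE.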

[LG-40] Understanding Machine Learning Paradigms through the Lens of Statistical Thermodynamics: A tutorial

Link: https://arxiv.org/abs/2411.15945
Authors: Star (Xinxin) Liu
Keywords: elucidating the potential, principles from physics, investigates the convergence, convergence of statistical, statistical mechanics
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Statistics Theory (math.ST); Chemical Physics (physics.chem-ph)
Comments: 19 pages

Abstract:This tutorial investigates the convergence of statistical mechanics and learning theory, elucidating the potential enhancements in machine learning methodologies through the integration of foundational principles from physics. The tutorial delves into concepts and techniques such as entropy, free energy, and variational inference that are used in machine learning, illustrating their significant contributions to model efficiency and robustness. By bridging these scientific disciplines, we aspire to inspire new research methodologies, demonstrating how an in-depth comprehension of physical systems’ behavior can yield more effective and dependable machine learning models, particularly in contexts characterized by uncertainty.

[LG-41] Customer Lifetime Value Prediction with Uncertainty Estimation Using Monte Carlo Dropout

Link: https://arxiv.org/abs/2411.15944
Authors: Xinzhe Cao, Yadong Xu, Xiaofeng Yang
Keywords: Accurately predicting customer, predicting customer Lifetime, Accurately predicting, customer Lifetime, revenue strategies
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract:Accurately predicting customer Lifetime Value (LTV) is crucial for companies to optimize their revenue strategies. Traditional deep learning models for LTV prediction are effective but typically provide only point estimates and fail to capture model uncertainty in modeling user behaviors. To address this limitation, we propose a novel approach that enhances the architecture of purely neural network models by incorporating the Monte Carlo Dropout (MCD) framework. We benchmarked the proposed method using data from one of the most downloaded mobile games in the world, and demonstrated a substantial improvement in predictive Top 5% Mean Absolute Percentage Error compared to existing state-of-the-art methods. Additionally, our approach provides a confidence metric as an extra dimension for performance evaluation across various neural network models, facilitating more informed business decisions.
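
To make the Monte Carlo Dropout idea concrete, here is a minimal sketch under stated assumptions: a tiny one-hidden-layer ReLU network with fixed, made-up weights, where dropout stays active at prediction time and the spread over repeated stochastic passes serves as the uncertainty estimate. The abstract does not describe the paper's LTV architecture, so the network, the function names, and the default values are all hypothetical.

```python
import random
import statistics


def forward_with_dropout(x, w_hidden, w_out, drop_prob, rng):
    """One stochastic forward pass of a one-hidden-layer ReLU network."""
    keep = 1.0 - drop_prob
    hidden = []
    for weights in w_hidden:
        act = max(0.0, sum(w * xi for w, xi in zip(weights, x)))
        # Inverted dropout, deliberately left ON at prediction time.
        hidden.append(act / keep if rng.random() < keep else 0.0)
    return sum(w * h for w, h in zip(w_out, hidden))


def mc_dropout_predict(x, w_hidden, w_out, drop_prob=0.2, passes=200, seed=0):
    """Return (mean, standard deviation) over repeated dropout passes."""
    rng = random.Random(seed)
    samples = [
        forward_with_dropout(x, w_hidden, w_out, drop_prob, rng)
        for _ in range(passes)
    ]
    return statistics.mean(samples), statistics.stdev(samples)
```

The mean plays the role of the point estimate and the standard deviation is one possible confidence signal. With `drop_prob=0.0` every pass is identical, so the spread collapses to zero.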

[LG-42] An AutoML-based approach for Network Intrusion Detection

Link: https://arxiv.org/abs/2411.15920
Authors: Nana Kankam Gyimah, Judith Mwakalonge, Gurcan Comert, Saidi Siuhi, Robert Akinie, Methusela Sulle, Denis Ruganuza, Benibo Izison, Arthur Mukwaya
Keywords: MLJAR AutoML framework, automated machine learning, machine learning, network intrusion detection, MLJAR AutoML
Categories: Machine Learning (cs.LG)

Abstract:In this paper, we present an automated machine learning (AutoML) approach for network intrusion detection, leveraging a stacked ensemble model developed using the MLJAR AutoML framework. Our methodology combines multiple machine learning algorithms, including LightGBM, CatBoost, and XGBoost, to enhance detection accuracy and robustness. By automating model selection, feature engineering, and hyperparameter tuning, our approach reduces the manual overhead typically associated with traditional machine learning methods. Extensive experimentation on the NSL-KDD dataset demonstrates that the stacked ensemble model outperforms individual models, achieving high accuracy and minimizing false positives. Our findings underscore the benefits of using AutoML for network intrusion detection, as the AutoML-driven stacked ensemble achieved the highest performance with 90% accuracy and an 89% F1 score, outperforming individual models like Random Forest (78% accuracy, 78% F1 score), XGBoost and CatBoost (both 80% accuracy, 80% F1 score), and LightGBM (78% accuracy, 78% F1 score), providing a more adaptable and efficient solution for network security applications.

[LG-43] Enhancing Symbolic Regression and Universal Physics-Informed Neural Networks with Dimensional Analysis

Link: https://arxiv.org/abs/2411.15919
Authors: Lena Podina, Diba Darooneh, Joshveer Grewal, Mohammad Kohandel
Keywords: dimensional analysis, Physics-Informed Neural Networks, Universal Physics-Informed Neural, enhancing symbolic regression, symbolic regression
Categories: Machine Learning (cs.LG)

Abstract:We present a new method for enhancing symbolic regression for differential equations via dimensional analysis, specifically Ipsen’s and Buckingham pi methods. Since symbolic regression often suffers from high computational costs and overfitting, non-dimensionalizing datasets reduces the number of input variables, simplifies the search space, and ensures that derived equations are physically meaningful. As our main contribution, we integrate Ipsen’s method of dimensional analysis with Universal Physics-Informed Neural Networks. We also combine dimensional analysis with the AI Feynman symbolic regression algorithm to show that dimensional analysis significantly improves the accuracy of the recovered equation. The results demonstrate that transforming data into a dimensionless form significantly decreases computation time and improves accuracy of the recovered hidden term. For algebraic equations, using the Buckingham pi theorem reduced complexity, allowing the AI Feynman model to converge faster with fewer data points and lower error rates. For differential equations, Ipsen’s method was combined with Universal Physics-Informed Neural Networks (UPINNs) to identify hidden terms more effectively. These findings suggest that integrating dimensional analysis with symbolic regression can significantly lower computational costs, enhance model interpretability, and increase accuracy, providing a robust framework for automated discovery of governing equations in complex systems when data is limited.
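
As an illustration of the non-dimensionalization step, the sketch below finds dimensionless groups in the sense of the Buckingham pi theorem by computing the null space of a dimension matrix with exact rational arithmetic. The pendulum example (period, length, gravity, mass) is an assumption chosen for illustration and does not come from the paper, and the helper `null_space` is hypothetical.

```python
from fractions import Fraction

# Rows: base dimensions (mass, length, time).
# Columns: variables (period, length, gravity, bob mass).
PENDULUM_DIMS = [
    [0, 0, 0, 1],   # mass exponent of each variable
    [0, 1, 1, 0],   # length exponent
    [1, 0, -2, 0],  # time exponent
]


def null_space(matrix):
    """Basis of {v : matrix @ v = 0} via exact Gauss-Jordan elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0])
    pivot_cols = []
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot here, so column c is a free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        rows[r] = [v / rows[r][c] for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(c)
        r += 1
        if r == len(rows):
            break
    basis = []
    for free in (c for c in range(n_cols) if c not in pivot_cols):
        vec = [Fraction(0)] * n_cols
        vec[free] = Fraction(1)
        for row_idx, pc in enumerate(pivot_cols):
            vec[pc] = -rows[row_idx][free]
        basis.append(vec)
    return basis
```

Each basis vector lists the exponents of one dimensionless group. For the pendulum matrix the single vector is (2, -1, 1, 0), which corresponds to period^2 * gravity / length, and the zero exponent shows that the bob mass drops out.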

[LG-44] From Laws to Motivation: Guiding Exploration through Law-Based Reasoning and Rewards

Link: https://arxiv.org/abs/2411.15891
Authors: Ziyu Chen, Zhiqing Xiao, Xinbei Jiang, Junbo Zhao
Keywords: Reinforcement Learning, Large Language Models, Large Language, building autonomous agents, powerful approaches
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) and Reinforcement Learning (RL) are two powerful approaches for building autonomous agents. However, due to limited understanding of the game environment, agents often resort to inefficient exploration and trial-and-error, struggling to develop long-term strategies or make decisions. We propose a method that extracts experiences from interaction records to model the underlying laws of the game environment, using these experiences as internal motivation to guide agents. These experiences, expressed in language, are highly flexible and can either assist agents in reasoning directly or be transformed into rewards for guiding training. Our evaluation results in Crafter demonstrate that both RL and LLM agents benefit from these experiences, leading to improved overall performance.

[LG-45] ExAL: An Exploration Enhanced Adversarial Learning Algorithm

Link: https://arxiv.org/abs/2411.15878
Authors: A Vinil, Aneesh Sreevallabh Chivukula, Pranav Chintareddy
Keywords: jeopardize machine learning, Adversarial learning, Adversarial, aiming to defend, machine learning systems
Categories: Machine Learning (cs.LG)

Abstract:Adversarial learning is critical for enhancing model robustness, aiming to defend against adversarial attacks that jeopardize machine learning systems. Traditional methods often lack efficient mechanisms to explore diverse adversarial perturbations, leading to limited model resilience. Inspired by game-theoretic principles, where adversarial dynamics are analyzed through frameworks like Nash equilibrium, exploration mechanisms in such setups allow for the discovery of diverse strategies, enhancing system robustness. However, existing adversarial learning methods often fail to incorporate structured exploration effectively, reducing their ability to improve model defense comprehensively. To address these challenges, we propose a novel Exploration-enhanced Adversarial Learning Algorithm (ExAL), leveraging the Exponentially Weighted Momentum Particle Swarm Optimizer (EMPSO) to generate optimized adversarial perturbations. ExAL integrates exploration-driven mechanisms to discover perturbations that maximize impact on the model’s decision boundary while preserving structural coherence in the data. We evaluate the performance of ExAL on the MNIST Handwritten Digits and Blended Malware datasets. Experimental results demonstrate that ExAL significantly enhances model resilience to adversarial attacks by improving robustness through adversarial learning.

[LG-46] An Extensive Study on D2C: Overfitting Remediation in Deep Learning Using a Decentralized Approach

Link: https://arxiv.org/abs/2411.15876
Authors: Md. Saiful Bari Siddiqui, Md Mohaiminul Islam, Md. Golam Rabiul Alam
Keywords: limited training data, remains a significant, significant challenge, training data, limited training
Categories: Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Overfitting remains a significant challenge in deep learning, often arising from data outliers, noise, and limited training data. To address this, we propose Divide2Conquer (D2C), a novel technique to mitigate overfitting. D2C partitions the training data into multiple subsets and trains identical models independently on each subset. To balance model generalization and subset-specific learning, the model parameters are periodically aggregated and averaged during training. This process enables the learning of robust patterns while minimizing the influence of outliers and noise. Empirical evaluations on benchmark datasets across diverse deep-learning tasks demonstrate that D2C significantly enhances generalization performance, particularly with larger datasets. Our analysis includes evaluations of decision boundaries, loss curves, and other performance metrics, highlighting D2C’s effectiveness both as a standalone technique and in combination with other overfitting reduction methods. We further provide a rigorous mathematical justification for D2C’s underlying principles and examine its applicability across multiple domains. Finally, we explore the trade-offs associated with D2C and propose strategies to address them, offering a holistic view of its strengths and limitations. This study establishes D2C as a versatile and effective approach to combating overfitting in deep learning. Our codes are publicly available at: this https URL.
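
The partition, train, and average loop described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the model is a one-dimensional linear regressor trained with plain SGD, subsets are weighted equally when averaging, and all names and default values are hypothetical.

```python
import random


def sgd_epoch(params, subset, lr):
    """One pass of plain SGD for the model y = w * x + b on one data subset."""
    w, b = params
    for x, y in subset:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b


def divide_and_average(data, n_subsets=4, rounds=100, local_epochs=2,
                       lr=0.1, seed=0):
    """Train identical models on disjoint subsets, averaging them each round."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    # Disjoint partition: every sample lands in exactly one subset.
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    global_params = (0.0, 0.0)
    for _ in range(rounds):
        local = []
        for subset in subsets:
            params = global_params  # every copy restarts from the average
            for _ in range(local_epochs):
                params = sgd_epoch(params, subset, lr)
            local.append(params)
        # Periodic aggregation: unweighted mean of the subset models.
        global_params = tuple(
            sum(p[i] for p in local) / n_subsets for i in range(2)
        )
    return global_params
```

Because each copy sees only its own subset between aggregation steps, an outlier can pull at most one of the local models before averaging dilutes its effect. That is the intuition the abstract gives for the reduced overfitting.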

[LG-47] Ruppert-Polyak averaging for Stochastic Order Oracle

Link: https://arxiv.org/abs/2411.15866
Authors: V.N. Smirnov, K.M. Kazistova, I.A. Sudakov, V. Leplat, A.V. Gasnikov, A.V. Lobanov
Keywords: Order Oracle Concept, Stochastic Order Oracle, rapidly growing field, faces challenges due, Black-box optimization
Categories: Machine Learning (cs.LG)

Abstract:Black-box optimization, a rapidly growing field, faces challenges due to limited knowledge of the objective function’s internal mechanisms. One promising approach to address this is the Stochastic Order Oracle Concept. This concept, similar to other Order Oracle Concepts, relies solely on relative comparisons of function values without requiring access to the exact values. This paper presents a novel, improved estimation of the covariance matrix for the asymptotic convergence of the Stochastic Order Oracle Concept. Our work surpasses existing research in this domain by offering a more accurate estimation of asymptotic convergence rate. Finally, numerical experiments validate our theoretical findings, providing strong empirical support for our proposed approach.

[LG-48] FedQP: Towards Accurate Federated Learning using Quadratic Programming Guided Mutation

Link: https://arxiv.org/abs/2411.15847
Authors: Jiawen Weng, Zeke Xia, Ran Li, Ming Hu, Mingsong Chen
Keywords: machine learning systems, distributed machine learning, Federated Learning, learning systems, machine learning
Categories: Machine Learning (cs.LG)
Comments: SEKE 2024, 6 pages

Abstract:Owing to its privacy-preserving advantages, Federated Learning (FL) is widely used in distributed machine learning systems. However, existing FL methods suffer from low inference performance caused by data heterogeneity. Specifically, due to heterogeneous data, the optimization directions of different local models vary greatly, making it difficult for the traditional FL method to obtain a generalized global model that performs well on all clients. As one of the state-of-the-art FL methods, the mutation-based FL method adopts a stochastic mutation strategy to guide the model training towards a well-generalized area (i.e., a flat area in the loss landscape). Specifically, mutation allows the model to shift within the solution space, providing an opportunity to escape areas with poor generalization (i.e., sharp areas). However, the stochastic mutation strategy easily results in diverse optimization directions of mutated models, which limits the performance of the existing mutation-based FL method. To achieve higher performance, this paper proposes a novel mutation-based FL approach named FedQP, which uses a quadratic programming strategy to regulate the mutation directions. By biasing the model mutation towards the direction of the gradient update rather than relying on traditional random mutation, FedQP can effectively guide the model to optimize towards a well-generalized area (i.e., a flat area). Experiments on multiple well-known datasets show that our quadratic programming-guided mutation strategy effectively improves the inference accuracy of the global model in various heterogeneous data scenarios.

[LG-49] Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization

Link: https://arxiv.org/abs/2411.15795
Authors: Corrado Coppola, Lorenzo Papa, Irene Amerini, Laura Palagi
Keywords: Adaptive gradient methods, deep learning community, learning community due, Adaptive gradient, sensitivity to hyper-parameters
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Adaptive gradient methods have been increasingly adopted by the deep learning community due to their fast convergence and reduced sensitivity to hyper-parameters. However, these methods come with limitations, such as increased memory requirements for elements like moving averages and a poorly understood convergence theory. To overcome these challenges, we introduce F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method featuring a sufficient decrease condition and a line-search procedure to ensure loss reduction per epoch, along with its deterministic proof of global convergence to a stationary point. To evaluate F-CMA, we integrate it into conventional training protocols for classification tasks involving both convolutional neural networks and vision transformer models, allowing for a direct comparison with popular optimizers. Computational tests show significant improvements, including a decrease in the overall training time by up to 68%, an increase in per-epoch efficiency by up to 20%, and an increase in model accuracy by up to 5%.

[LG-50] LLM Online Spatial-temporal Signal Reconstruction Under Noise

Link: https://arxiv.org/abs/2411.15764
Authors: Yi Yan, Dayu Qin, Ercan Engin Kuruoglu
Keywords: Large Language Models, Online Spatial-temporal Reconstruction, Graph Signal Processing, spatial-temporal signal reconstruction, online spatial-temporal signal
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:This work introduces the LLM Online Spatial-temporal Reconstruction (LLM-OSR) framework, which integrates Graph Signal Processing (GSP) and Large Language Models (LLMs) for online spatial-temporal signal reconstruction. The LLM-OSR utilizes a GSP-based spatial-temporal signal handler to enhance graph signals and employs LLMs to predict missing values based on spatiotemporal patterns. The performance of LLM-OSR is evaluated on traffic and meteorological datasets under varying Gaussian noise levels. Experimental results demonstrate that LLM-OSR with GPT-4o mini is accurate and robust under Gaussian noise conditions. The limitations are discussed along with future research insights, emphasizing the potential of combining GSP techniques with LLMs for solving spatiotemporal prediction tasks.

[LG-51] Research on Effectiveness Evaluation and Optimization of Baseball Teaching Method Based on Machine Learning

Link: https://arxiv.org/abs/2411.15721
Authors: Shaoxuan Sun, Jingao Yuan, Yuelin Yang
Keywords: gradually attracted attention, machine learning model, students’ sports performance, machine learning, students’ comprehensive scores
Categories: Machine Learning (cs.LG)

Abstract:In modern physical education, data-driven evaluation methods have gradually attracted attention, especially the quantitative prediction of students’ sports performance through machine learning models. The purpose of this study is to use a variety of machine learning models to predict, via regression, students’ comprehensive scores in baseball training, so as to evaluate the effectiveness of current baseball teaching methods and put forward targeted suggestions for optimizing training. We build models and evaluate student performance from a range of collected features, such as hitting times, running times and batting. The experimental results show that K-Neighbors Regressor and Gradient Boosting Regressor excel in overall prediction accuracy and stability, and their R score and error metrics are significantly better than those of the other models. In addition, feature importance analysis shows that cumulative hits and cumulative runs are the key factors affecting students’ comprehensive scores. Based on these results, this paper puts forward suggestions for optimizing training strategies to help students perform better in baseball training. The results show that the data-driven teaching evaluation method can effectively support physical education and promote the design of personalized and refined teaching plans.

[LG-52] Learning Algorithm Hyperparameters for Fast Parametric Convex Optimization

Link: https://arxiv.org/abs/2411.15717
Authors: Rajiv Sambharya, Bartolomeo Stellato
Keywords: quickly solve parametric, solve parametric convex, parametric convex optimization, convex optimization problems, gradient descent
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We introduce a machine-learning framework to learn the hyperparameter sequence of first-order methods (e.g., the step sizes in gradient descent) to quickly solve parametric convex optimization problems. Our computational architecture amounts to running fixed-point iterations where the hyperparameters are the same across all parametric instances and consists of two phases. In the first step-varying phase the hyperparameters vary across iterations, while in the second steady-state phase the hyperparameters are constant across iterations. Our learned optimizer is flexible in that it can be evaluated on any number of iterations and is guaranteed to converge to an optimal solution. To train, we minimize the mean square error to a ground truth solution. In the case of gradient descent, the one-step optimal step size is the solution to a least squares problem, and in the case of unconstrained quadratic minimization, we can compute the two and three-step optimal solutions in closed-form. In other cases, we backpropagate through the algorithm steps to minimize the training objective after a given number of steps. We show how to learn hyperparameters for several popular algorithms: gradient descent, proximal gradient descent, and two ADMM-based solvers: OSQP and SCS. We use a sample convergence bound to obtain generalization guarantees for the performance of our learned algorithm for unseen data, providing both lower and upper bounds. We showcase the effectiveness of our method with many examples, including ones from control, signal processing, and machine learning. Remarkably, our approach is highly data-efficient in that we only use 10 problem instances to train the hyperparameters in all of our examples.
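
The abstract states that the one-step optimal step size for gradient descent solves a least squares problem. A minimal sketch of one way to read that statement, assuming the training loss is the summed squared distance to the ground-truth solutions after a single step: minimizing sum_i ||x_i - alpha * g_i - x_i*||^2 over a shared scalar alpha gives alpha = sum_i <g_i, x_i - x_i*> / sum_i ||g_i||^2. The function names and the diagonal quadratic test problem are hypothetical and are not taken from the paper.

```python
def quadratic_gradient(diag, c, x):
    """Gradient of f(x) = 0.5 * sum(d_j * x_j^2) - sum(c_j * x_j)."""
    return [dj * xj - cj for dj, cj, xj in zip(diag, c, x)]


def one_step_optimal_step_size(starts, gradients, solutions):
    """Closed-form least squares fit of one step size shared by all instances.

    Minimizes sum_i ||x_i - alpha * g_i - x_i_star||^2 over the scalar alpha.
    """
    num = 0.0
    den = 0.0
    for x0, g, x_star in zip(starts, gradients, solutions):
        num += sum(gi * (xi - si) for gi, xi, si in zip(g, x0, x_star))
        den += sum(gi * gi for gi in g)
    return num / den
```

For instances that share the curvature `diag = [2.0, 2.0]` and differ only in the linear term `c`, the fitted value is 0.5, the inverse of the curvature, so a single gradient step from the origin lands on each solution. With mixed curvatures the same formula returns a compromise step size across instances.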

[LG-53] Tackling Data Heterogeneity in Federated Time Series Forecasting

Link: https://arxiv.org/abs/2411.15716
Authors: Wei Yuan, Guanhua Ye, Xiangyu Zhao, Quoc Viet Hung Nguyen, Yang Cao, Hongzhi Yin
Keywords: disease transmission monitoring, including energy consumption, Time series forecasting, Time series, series forecasting
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Abstract:Time series forecasting plays a critical role in various real-world applications, including energy consumption prediction, disease transmission monitoring, and weather forecasting. Although substantial progress has been made in time series forecasting, most existing methods rely on a centralized training paradigm, where large amounts of data are collected from distributed devices (e.g., sensors, wearables) to a central cloud server. However, this paradigm has overloaded communication networks and raised privacy concerns. Federated learning, a popular privacy-preserving technique, enables collaborative model training across distributed data sources. However, directly applying federated learning to time series forecasting often yields suboptimal results, as time series data generated by different devices are inherently heterogeneous. In this paper, we propose a novel framework, Fed-TREND, to address data heterogeneity by generating informative synthetic data as auxiliary knowledge carriers. Specifically, Fed-TREND generates two types of synthetic data. The first type of synthetic data captures the representative distribution information from clients’ uploaded model updates and enhances clients’ local training consensus. The second kind of synthetic data extracts long-term influence insights from global model update trajectories and is used to refine the global model after aggregation. Fed-TREND is compatible with most time series forecasting models and can be seamlessly integrated into existing federated learning frameworks to improve prediction performance. Extensive experiments on eight datasets, using several federated learning baselines and four popular time series forecasting models, demonstrate the effectiveness and generalizability of Fed-TREND.

[LG-54] DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration

Link: https://arxiv.org/abs/2411.15692
Authors: Sizhe Liu,Yizhou Lu,Siyu Chen,Xiyang Hu,Jieyu Zhao,Tianfan Fu,Yue Zhao
Keywords: Large Language Models, Large Language, drug discovery processes, Recent advancements, advancements in Large
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advancements in Large Language Models (LLMs) have opened new avenues for accelerating drug discovery processes. Despite their potential, several critical challenges remain unsolved, particularly in translating theoretical ideas into practical applications within the highly specialized field of pharmaceutical research, limiting practitioners from leveraging the latest AI development in drug discovery. To this end, we introduce DrugAgent, a multi-agent framework aimed at automating machine learning (ML) programming in drug discovery. DrugAgent incorporates domain expertise by identifying specific requirements and building domain-specific tools, while systematically exploring different ideas to find effective solutions. A preliminary case study demonstrates DrugAgent’s potential to overcome key limitations LLMs face in drug discovery, moving toward AI-driven innovation. For example, DrugAgent is able to complete the ML programming pipeline end-to-end, from data acquisition to performance evaluation for the ADMET prediction task, and finally select the best model, where the random forest model achieves an F1 score of 0.92 when predicting absorption using the PAMPA dataset.

[LG-55] Can a Large Language Model Learn Matrix Functions In Context?

Link: https://arxiv.org/abs/2411.15675
Authors: Paimon Goulart,Evangelos E. Papalexakis
Keywords: Large Language Models, Large Language, In-Context Learning, Stochastic Gradient Descent, Language Models
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated the ability to solve complex tasks through In-Context Learning (ICL), where models learn from a few input-output pairs without explicit fine-tuning. In this paper, we explore the capacity of LLMs to solve non-linear numerical computations, with specific emphasis on functions of the Singular Value Decomposition. Our experiments show that while LLMs perform comparably to traditional models such as Stochastic Gradient Descent (SGD) based Linear Regression and Neural Networks (NN) for simpler tasks, they outperform these models on more complex tasks, particularly in the case of top-k Singular Values. Furthermore, LLMs demonstrate strong scalability, maintaining high accuracy even as the matrix size increases. Additionally, we found that LLMs can achieve high accuracy with minimal prior examples, converging quickly and avoiding the overfitting seen in classical models. These results suggest that LLMs could provide an efficient alternative to classical methods for solving high-dimensional problems. Future work will focus on extending these findings to larger matrices and more complex matrix operations while exploring the effect of using different numerical representations in ICL.

[LG-56] Best of Both Worlds: Advantages of Hybrid Graph Sequence Models

Link: https://arxiv.org/abs/2411.15671
Authors: Ali Behrouz,Ali Parviz,Mahdi Karami,Clayton Sanford,Bryan Perozzi,Vahab Mirrokni
Keywords: Passing Neural Networks, Message Passing Neural, graph sequence model, sequence models, recent deep learning
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:Modern sequence models (e.g., Transformers, linear RNNs, etc.) emerged as dominant backbones of recent deep learning frameworks, mainly due to their efficiency, representational power, and/or ability to capture long-range dependencies. Adopting these sequence models for graph-structured data has recently gained popularity as the alternative to Message Passing Neural Networks (MPNNs). There is, however, a lack of a common foundation about what constitutes a good graph sequence model, and a mathematical description of the benefits and deficiencies in adopting different sequence models for learning on graphs. To this end, we first present Graph Sequence Model (GSM), a unifying framework for adopting sequence models for graphs, consisting of three main steps: (1) Tokenization, which translates the graph into a set of sequences; (2) Local Encoding, which encodes local neighborhoods around each node; and (3) Global Encoding, which employs a scalable sequence model to capture long-range dependencies within the sequences. This framework allows us to understand, evaluate, and compare the power of different sequence model backbones in graph tasks. Our theoretical evaluations of the representation power of Transformers and modern recurrent models through the lens of global and local graph tasks show that there are both negative and positive sides for both types of models. Building on this observation, we present GSM++, a fast hybrid model that uses the Hierarchical Affinity Clustering (HAC) algorithm to tokenize the graph into hierarchical sequences, and then employs a hybrid architecture of Transformer to encode these sequences. Our theoretical and experimental results support the design of GSM++, showing that GSM++ outperforms baselines in most benchmark evaluations.

[LG-57] Implicit High-Order Moment Tensor Estimation and Learning Latent Variable Models

Link: https://arxiv.org/abs/2411.15669
Authors: Ilias Diakonikolas,Daniel M. Kane
Keywords: Toggle, algorithm, mathrm, poly, Code
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: Abstract shortened due to arxiv requirements

Click to view abstract

Abstract:We study the task of learning latent-variable models. An obstacle towards designing efficient algorithms for such models is the necessity of approximating moment tensors of super-constant degree. Motivated by such applications, we develop a general efficient algorithm for implicit moment tensor computation. Our algorithm computes in $\mathrm{poly}(d, k)$ time a succinct approximate description of tensors of the form $M_m = \sum_{i=1}^{k} w_i v_i^{\otimes m}$, for $w_i \in \mathbb{R}_+$, even for $m = \omega(1)$, assuming there exists a polynomial-size arithmetic circuit whose expected output on an appropriate samplable distribution is equal to $M_m$, and whose covariance on this input is bounded. Our framework broadly generalizes the work of [LL21-opt], which developed an efficient algorithm for the specific moment tensors that arise in clustering mixtures of spherical Gaussians. By leveraging our general algorithm, we obtain the first polynomial-time learners for the following models.
* Mixtures of Linear Regressions. We give a $\mathrm{poly}(d, k, 1/\epsilon)$-time algorithm for this task. The previously best algorithm has super-polynomial complexity in $k$.
* Learning Mixtures of Spherical Gaussians. We give a $\mathrm{poly}(d, k, 1/\epsilon)$-time density estimation algorithm, under the condition that the means lie in a ball of radius $O(\sqrt{\log k})$. Prior algorithms incur super-polynomial complexity in $k$. We also give a $\mathrm{poly}(d, k, 1/\epsilon)$-time parameter estimation algorithm, under the optimal mean separation of $\Omega(\log^{1/2}(k/\epsilon))$.
* PAC Learning Sums of ReLUs. We give a learner with complexity $\mathrm{poly}(d, k)\, 2^{\mathrm{poly}(1/\epsilon)}$. This is the first algorithm for this task that runs in $\mathrm{poly}(d, k)$ time for subconstant values of $\epsilon = o_{k,d}(1)$.
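
Note: one way to see why an implicit description of a moment tensor can be cheap is the identity $\langle M_m, u^{\otimes m} \rangle = \sum_i w_i \langle v_i, u \rangle^m$, which evaluates a contraction of the tensor without writing out its $d^m$ entries. The Python sketch below only illustrates that identity and is not the paper's algorithm, which has no direct access to the $w_i$ and $v_i$. The function names and toy values are hypothetical.

```python
from itertools import product


def contract_implicit(weights, vectors, u, m):
    """Compute <M_m, u^{(x)m}> as sum_i w_i * <v_i, u>^m, in O(k * d) time."""
    total = 0.0
    for w, v in zip(weights, vectors):
        dot = sum(vj * uj for vj, uj in zip(v, u))
        total += w * dot ** m
    return total


def contract_explicit(weights, vectors, u, m):
    """Same quantity by enumerating all d^m tensor entries (tiny d and m only)."""
    d = len(u)
    total = 0.0
    for idx in product(range(d), repeat=m):
        # Entry M_m[idx] = sum_i w_i * prod_{j in idx} v_i[j]
        entry = 0.0
        for w, v in zip(weights, vectors):
            term = w
            for j in idx:
                term *= v[j]
            entry += term
        # Matching entry of u^{(x)m}
        u_prod = 1.0
        for j in idx:
            u_prod *= u[j]
        total += entry * u_prod
    return total


# Toy instance (assumed values): k = 2 components in d = 3 dimensions.
weights = [0.5, 1.5]
vectors = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0]]
u = [1.0, 1.0, 2.0]

implicit_value = contract_implicit(weights, vectors, u, 3)
explicit_value = contract_explicit(weights, vectors, u, 3)
```

Both functions return the same number on the toy instance, but the explicit version already loops over 27 entries for d = 3 and m = 3, and its cost grows as d^m.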

[LG-58] Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

Link: https://arxiv.org/abs/2411.15664
Authors: Himel Ghosh
Keywords: review report discusses, cold start latency, report discusses, cold start, cold start problem
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 12 pages, 7 figures, TUM Cloud Computing Seminar

Click to view abstract

Abstract:This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large language models. Traditional serverless approaches struggle with high latency due to the size of LLM checkpoints and the overhead of initializing GPU resources. ServerlessLLM introduces a multitier checkpoint loading system, leveraging underutilized GPU memory and storage to reduce startup times by 6–8x compared to existing methods. It also proposes live inference migration and a startup-time-optimized model scheduler, ensuring efficient resource allocation and minimizing delays. This system significantly improves performance and scalability in serverless environments for LLM workloads. Besides ServerlessLLM, several other methods from recent research literature, including Rainbowcake, are reviewed in this paper. Further discussions explore how FaaS providers tackle cold starts and the possible future scopes.

[LG-59] MC-NEST – Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree

Link: https://arxiv.org/abs/2411.15645
Authors: Gollam Rabby,Farhana Keya,Parvez Zamil,Sören Auer
Keywords: Monte Carlo Tree, Monte Carlo Nash, large language models, Carlo Nash Equilibrium, Monte Carlo
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Mathematical reasoning has proven to be a critical yet challenging task for large language models (LLMs), as they often struggle with complex multi-step problems. To address these limitations, we introduce the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST) algorithm, an enhancement of the Monte Carlo Tree Self-Refine (MCTSr) approach. By integrating Nash Equilibrium strategies with LLM-based self-refinement and self-evaluation processes, MC-NEST aims to improve decision-making for complex mathematical reasoning tasks. This method ensures balanced exploration and exploitation of potential solutions, leveraging Upper Confidence bound applied to Trees (UCT) scores and various selection policies. Through iterative critique and refinement, MC-NEST enhances the reasoning capabilities of LLMs, particularly for problems requiring strategic decision-making. Comparative analysis reveals that GPT-4o, equipped with MC-NEST using an Importance Sampling Policy, achieved superior accuracy in domains such as Number Theory and Geometry. These results suggest that both LLMs GPT-4o and Phi-3-mini can benefit from MC-NEST, with iterative self-refinement proving especially effective in expanding the reasoning capacity and problem-solving performance of LLMs. We evaluate the effectiveness of MC-NEST on challenging Olympiad-level benchmarks, demonstrating its potential to significantly boost complex mathematical reasoning performance in LLMs.
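
Note: the exploration and exploitation balance mentioned in the abstract is commonly expressed through the textbook UCT score, average reward plus an exploration bonus that shrinks as a node is visited more often. The Python sketch below shows that standard formula with a greedy selection rule. The exploration constant, the tuple layout, and the function names are assumptions, and the paper's exact scoring and selection policies may differ.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCT: mean reward + c * sqrt(ln(parent visits) / visits)."""
    if visits == 0:
        return math.inf  # unvisited nodes are always tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Greedy selection: return the name of the child with the highest UCT score.

    `children` is a list of (name, total_reward, visits) tuples (hypothetical layout).
    """
    best = max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits, c))
    return best[0]


# Toy statistics (assumed): "a" is well explored, "b" barely, "c" not at all.
children = [("a", 9.0, 10), ("b", 1.0, 1), ("c", 0.0, 0)]
first_pick = select_child(children, parent_visits=11)
```

With these toy numbers the unvisited child is picked first. Among the visited ones, the rarely visited child "b" outranks the well-explored child "a" because its exploration bonus is larger.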

[LG-60] On the Boundary Feasibility for PDE Control with Neural Operators

Link: https://arxiv.org/abs/2411.15643
Authors: Hanjiang Hu,Changliu Liu
Keywords: partial differential equations, underlying partial differential, unknown analytical forms, physical world dynamics, unknown PDE dynamics
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Comments: 27 pages, 5 figures, 8 tables

Click to view abstract

Abstract:The physical world dynamics are generally governed by underlying partial differential equations (PDEs) with unknown analytical forms in science and engineering problems. Neural network based data-driven approaches have been heavily studied in simulating and solving PDE problems in recent years, but it is still challenging to move forward from understanding to controlling the unknown PDE dynamics. PDE boundary control instantiates a simplified but important problem by only focusing on PDE boundary conditions as the control input and output. However, current model-free PDE controllers cannot ensure the boundary output satisfies some given user-specified safety constraint. To this end, we propose a safety filtering framework to guarantee the boundary output stays within the safe set for current model-free controllers. Specifically, we first introduce a general neural boundary control barrier function (BCBF) to ensure the feasibility of the trajectorywise constraint satisfaction of boundary output. Based on a neural operator modeling the transfer function from boundary control input to output trajectories, we show that the change in the BCBF depends linearly on the change in input boundary, so quadratic programming-based safety filtering can be done for pre-trained model-free controllers. Extensive experiments under challenging hyperbolic, parabolic and Navier-Stokes PDE dynamics environments validate the effectiveness of the proposed method in achieving better general performance and boundary constraint satisfaction compared to the model-free controller baselines.
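
Note: when a safety condition is linear in the control input, filtering a reference control reduces to a small quadratic program: minimize $\|u - u_{\mathrm{ref}}\|^2$ subject to $a^\top u \ge b$. With a single linear constraint this program has a closed-form solution, namely a projection onto a half-space. The Python sketch below shows that special case only. The vector `a` and scalar `b` are hypothetical placeholders for whatever the learned barrier function and neural operator would supply, and the paper's actual formulation may involve more constraints.

```python
def safety_filter(u_ref, a, b):
    """Closest control to u_ref (in Euclidean norm) satisfying a . u >= b."""
    margin = sum(ai * ui for ai, ui in zip(a, u_ref)) - b
    if margin >= 0:
        return list(u_ref)  # reference control is already safe
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0:
        raise ValueError("constraint a . u >= b is infeasible when a is zero")
    # Move along a by the smallest step that makes the constraint tight.
    scale = -margin / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_ref, a)]


# Toy example (assumed numbers): the reference control violates u1 + u2 >= 2.
filtered = safety_filter([0.0, 0.0], [1.0, 1.0], 2.0)
unchanged = safety_filter([3.0, 0.0], [1.0, 1.0], 2.0)
```

The filter leaves a safe reference control untouched and otherwise makes the smallest correction that restores the constraint, which is the usual motivation for wrapping a pre-trained controller this way.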

[LG-61] Learning state and proposal dynamics in state-space models using differentiable particle filters and neural networks

Link: https://arxiv.org/abs/2411.15638
Authors: Benjamin Cox,Santiago Segarra,Victor Elvira
Keywords: analysing sequential data, popular statistical framework, State-space models, sequential data, popular statistical
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:State-space models are a popular statistical framework for analysing sequential data. Within this framework, particle filters are often used to perform inference on non-linear state-space models. We introduce a new method, StateMixNN, that uses a pair of neural networks to learn the proposal distribution and transition distribution of a particle filter. Both distributions are approximated using multivariate Gaussian mixtures. The component means and covariances of these mixtures are learnt as outputs of learned functions. Our method is trained targeting the log-likelihood, thereby requiring only the observation series, and combines the interpretability of state-space models with the flexibility and approximation power of artificial neural networks. The proposed method significantly improves recovery of the hidden state in comparison with the state-of-the-art, showing greater improvement in highly non-linear scenarios.

[LG-62] A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation

Link: https://arxiv.org/abs/2411.15616
Authors: Vennela Yarabolu,Govind Waghmare,Sonia Gupta,Siddhartha Asthana
Keywords: continuous machine learning, significant performance degradation, data, continuous machine, machine learning
Subjects: Machine Learning (cs.LG)
Comments: Accepted in CODS-COMAD 2024

Click to view abstract

Abstract:In many real-world applications, continuous machine learning (ML) systems are crucial but prone to data drift, a phenomenon where discrepancies between historical training data and future test data lead to significant performance degradation and operational inefficiencies. Traditional drift adaptation methods typically update models using ensemble techniques, often discarding drifted historical data, and focus primarily on either covariate drift or concept drift. These methods face issues such as high resource demands, inability to manage all types of drift effectively, and neglecting the valuable context that historical data can provide. We contend that explicitly incorporating drifted data into the model training process significantly enhances model accuracy and robustness. This paper introduces an advanced framework that integrates the strengths of data-centric approaches with adaptive management of both covariate and concept drift in a scalable and efficient manner. Our framework employs sophisticated data segmentation techniques to identify optimal data batches that accurately reflect test data patterns. These data batches are then utilized for training on test data, ensuring that the models remain relevant and accurate over time. By leveraging the advantages of both data segmentation and scalable drift management, our solution ensures robust model accuracy and operational efficiency in large-scale ML deployments. It also minimizes resource consumption and computational overhead by selecting and utilizing relevant data subsets, leading to significant cost savings. Experimental results on classification tasks with real-world and synthetic datasets show that our approach improves model accuracy while reducing operational costs and latency. This practical solution overcomes inefficiencies in current methods, providing a robust, adaptable, and scalable approach.

[LG-63] From Complexity to Parsimony: Integrating Latent Class Analysis to Uncover Multimodal Learning Patterns in Collaborative Learning

Link: https://arxiv.org/abs/2411.15590
Authors: Lixiang Yan,Dragan Gašević,Linxuan Zhao,Vanessa Echeverria,Yueqiao Jin,Roberto Martinez-Maldonado
Keywords: leverages advanced sensing, advanced sensing technologies, Multimodal Learning Analytics, insights remains challenging, diverse data sources
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Multimodal Learning Analytics (MMLA) leverages advanced sensing technologies and artificial intelligence to capture complex learning processes, but integrating diverse data sources into cohesive insights remains challenging. This study introduces a novel methodology for integrating latent class analysis (LCA) within MMLA to map monomodal behavioural indicators into parsimonious multimodal ones. Using a high-fidelity healthcare simulation context, we collected positional, audio, and physiological data, deriving 17 monomodal indicators. LCA identified four distinct latent classes: Collaborative Communication, Embodied Collaboration, Distant Interaction, and Solitary Engagement, each capturing unique monomodal patterns. Epistemic network analysis compared these multimodal indicators with the original monomodal indicators and found that the multimodal approach was more parsimonious while offering higher explanatory power regarding students’ task and collaboration performances. The findings highlight the potential of LCA in simplifying the analysis of complex multimodal data while capturing nuanced, cross-modality behaviours, offering actionable insights for educators and enhancing the design of collaborative learning interventions. This study proposes a pathway for advancing MMLA, making it more parsimonious and manageable, and aligning with the principles of learner-centred education.

[LG-64] Haar-Laplacian for directed graphs

Link: https://arxiv.org/abs/2411.15527
Authors: Theodor-Adrian Badea,Bogdan Dumitrescu
Keywords: Laplacian matrix aiming, Laplacian matrix, paper introduces, aiming to enable, enable the construction
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: 30 pages, 9 figures, 4 tables

Click to view abstract

Abstract:This paper introduces a novel Laplacian matrix aiming to enable the construction of spectral convolutional networks and to extend the signal processing applications for directed graphs. Our proposal is inspired by a Haar-like transformation and produces a Hermitian matrix which is not only in one-to-one relation with the adjacency matrix, preserving both direction and weight information, but also enjoys desirable additional properties like scaling robustness, sensitivity, continuity, and directionality. We take a theoretical standpoint and support the conformity of our approach with the spectral graph theory. Then, we address two use-cases: graph learning (by introducing HaarNet, a spectral graph convolutional network built with our Haar-Laplacian) and graph signal processing. We show that our approach gives better results in applications like weight prediction and denoising on directed graphs.

[LG-65] Partial Knowledge Distillation for Alleviating the Inherent Inter-Class Discrepancy in Federated Learning

Link: https://arxiv.org/abs/2411.15403
Authors: Xiaoyu Gan,Xizi Chen,Jingyang Zhu,Xiaomeng Wang,Jingbo Jiang,Chi-Ying Tsui
Keywords: Substantial efforts, long-tailed class distribution, weak classes, devoted to alleviating, alleviating the impact
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Substantial efforts have been devoted to alleviating the impact of the long-tailed class distribution in federated learning. In this work, we observe an interesting phenomenon that weak classes consistently exist even for class-balanced learning. These weak classes, different from the minority classes in the previous works, are inherent to data and remain fairly consistent for various network structures and learning paradigms. The inherent inter-class accuracy discrepancy can reach over 36.9% for federated learning on the FashionMNIST and CIFAR-10 datasets, even when the class distribution is balanced both globally and locally. In this study, we empirically analyze the potential reason for this phenomenon. Furthermore, a class-specific partial knowledge distillation method is proposed to improve the model’s classification accuracy for weak classes. In this approach, knowledge transfer is initiated upon the occurrence of specific misclassifications within certain weak classes. Experimental results show that the accuracy of weak classes can be improved by 10.7%, reducing the inherent interclass discrepancy effectively.

[LG-66] Less is More: Optimizing Function Calling for LLM Execution on Edge Devices DATE2025

Link: https://arxiv.org/abs/2411.15399
Authors: Varatheepan Paramanayakam,Andreas Karatzas,Iraklis Anagnostopoulos,Dimitrios Stamoulis
Keywords: foundation models open, complex API tasks, perform complex API, advanced function-calling capabilities, API tasks
Subjects: Performance (cs.PF); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted at DATE 2025

Click to view abstract

Abstract:The advanced function-calling capabilities of foundation models open up new possibilities for deploying agents to perform complex API tasks. However, managing large amounts of data and interacting with numerous APIs makes function calling hardware-intensive and costly, especially on edge devices. Current Large Language Models (LLMs) struggle with function calling at the edge because they cannot handle complex inputs or manage multiple tools effectively. This results in low task-completion accuracy, increased delays, and higher power consumption. In this work, we introduce Less-is-More, a novel fine-tuning-free function-calling scheme for dynamic tool selection. Our approach is based on the key insight that selectively reducing the number of tools available to LLMs significantly improves their function-calling performance, execution time, and power efficiency on edge devices. Experimental results with state-of-the-art LLMs on edge hardware show agentic success rate improvements, with execution time reduced by up to 70% and power consumption by up to 40%.
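
Note: the abstract's key insight is that exposing fewer tools to the model helps. The abstract does not say how tools are chosen, so the Python sketch below uses a purely hypothetical heuristic: rank tools by word overlap between the user query and each tool description, then pass only the top k to the model. The tool names, descriptions, and the `select_tools` function are all made up for illustration.

```python
def select_tools(query, tools, k=2):
    """Return the names of the k tools whose descriptions share most words with the query.

    `tools` maps tool name -> plain-text description (hypothetical catalogue).
    """
    query_words = set(query.lower().split())

    def overlap(item):
        _name, description = item
        return len(query_words & set(description.lower().split()))

    # sorted() is stable, so ties keep the catalogue's original order.
    ranked = sorted(tools.items(), key=overlap, reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical tool catalogue for illustration only.
tools = {
    "get_weather": "return the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "set_alarm": "set an alarm at a given time",
    "play_music": "play a song or music playlist",
}

shortlist = select_tools("what is the weather forecast for paris", tools, k=1)
```

A real system would likely use embeddings or a learned retriever rather than word overlap. The point of the sketch is only that the prompt sent to the model then lists one tool instead of four.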

[LG-67] Gradient dynamics for low-rank fine-tuning beyond kernels

Link: https://arxiv.org/abs/2411.15385
Authors: Arif Kerem Dayi,Sitan Chen
Keywords: low computational cost, fine-tuning foundation models, LoRA has emerged, memory footprint, facto methods
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, from a mathematical perspective it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x, f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, the training dynamics are nonlinear. Nevertheless, in this regime we prove under mild assumptions that a student model which is initialized at the base model and trained with online gradient descent will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation's Hermite expansion. We also prove that in our setting, learning the teacher model "from scratch" can require significantly more iterations.
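
Note: the student-teacher setup in this abstract can be made concrete with a tiny example: a two-layer base model with first-layer weights W, and a teacher whose weights are W plus a rank-1 matrix formed as an outer product of two vectors. The Python sketch below builds both and evaluates them on one input. The ReLU activation, the second-layer weights `a`, and all numbers are assumptions for illustration, and the input is fixed rather than Gaussian so the result is reproducible.

```python
def relu(z):
    return max(0.0, z)


def two_layer(W, a, x):
    """f(x) = sum_j a[j] * relu(<W[j], x>), one hidden layer with k = len(W) neurons."""
    return sum(
        aj * relu(sum(wji * xi for wji, xi in zip(row, x)))
        for aj, row in zip(a, W)
    )


def rank_one_perturb(W, u, v):
    """Return W + u v^T, where u has one entry per neuron and v one per input dimension."""
    return [
        [wji + u[j] * v[i] for i, wji in enumerate(row)]
        for j, row in enumerate(W)
    ]


# Base model (assumed toy weights): k = 2 neurons, d = 2 inputs.
W = [[1.0, 0.0], [0.0, 1.0]]
a = [1.0, -1.0]

# Teacher: base weights plus the rank-1 matrix u v^T.
u = [1.0, 2.0]
v = [0.5, 0.5]
W_star = rank_one_perturb(W, u, v)

x = [1.0, 1.0]
base_output = two_layer(W, a, x)
teacher_output = two_layer(W_star, a, x)
```

Fine-tuning in this setting means starting the student at W and recovering the perturbation from samples of the teacher's outputs. The perturbation has only k + d free parameters here instead of k * d.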

[LG-68] The Power of Types: Exploring the Impact of Type Checking on Neural Bug Detection in Dynamically Typed Languages ICSE’25

Link: https://arxiv.org/abs/2411.15368
Authors: Boqi Chen,José Antonio Hernández López,Gunter Mussbacher,Dániel Varró
Keywords: neural bug detectors, neural bug, bug detectors, Automated bug detection, Python is essential
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Accepted by ICSE’25 Research Track

Click to view abstract

Abstract:Motivation: Automated bug detection in dynamically typed languages such as Python is essential for maintaining code quality. The lack of mandatory type annotations in such languages can lead to errors that are challenging to identify early with traditional static analysis tools. Recent progress in deep neural networks has led to increased use of neural bug detectors. In statically typed languages, a type checker is integrated into the compiler and thus taken into consideration when the neural bug detector is designed for these languages. Problem: However, prior studies overlook this aspect during the training and testing of neural bug detectors for dynamically typed languages. When an optional type checker is used, assessing existing neural bug detectors on bugs easily detectable by type checkers may impact their performance estimation. Moreover, including these bugs in the training set of neural bug detectors can shift their detection focus toward the wrong type of bugs. Contribution: We explore the impact of type checking on various neural bug detectors for variable misuse bugs, a common type targeted by neural bug detectors. Existing synthetic and real-world datasets are type-checked to evaluate the prevalence of type-related bugs. Then, we investigate how type-related bugs influence the training and testing of the neural bug detectors. Findings: Our findings indicate that existing bug detection datasets contain a significant proportion of type-related bugs. Building on this insight, we discover integrating the neural bug detector with a type checker can be beneficial, especially when the code is annotated with types. Further investigation reveals neural bug detectors perform better on type-related bugs than other bugs. Moreover, removing type-related bugs from the training data helps improve neural bug detectors’ ability to identify bugs beyond the scope of type checkers. 

[LG-69] Dynamic Tube MPC: Learning Tube Dynamics with Massively Parallel Simulation for Robust Safety in Practice ICRA2025

Link: https://arxiv.org/abs/2411.15350
Authors: William D. Compton,Noel Csomay-Shanklin,Cole Johnson,Aaron D. Ames
Keywords: challenge in robotics, critical challenge, planning model, dynamic tube, Dynamic Tube MPC
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted to ICRA 2025

Click to view abstract

Abstract:Safe navigation of cluttered environments is a critical challenge in robotics. It is typically approached by separating the planning and tracking problems, with planning executed on a reduced order model to generate reference trajectories, and control techniques used to track these trajectories on the full order dynamics. Inevitable tracking error necessitates robustification of the nominal plan to ensure safety; in many cases, this is accomplished via worst-case bounding, which ignores the fact that some trajectories of the planning model may be easier to track than others. In this work, we present a novel method leveraging massively parallel simulation to learn a dynamic tube representation, which characterizes tracking performance as a function of actions taken by the planning model. Planning model trajectories are then optimized such that the dynamic tube lies in the free space, allowing a balance between performance and safety to be traded off in real time. The resulting Dynamic Tube MPC is applied to the 3D hopping robot ARCHER, enabling agile and performant navigation of cluttered environments, and safe collision-free traversal of narrow corridors.

[LG-70] Dependence Induced Representations

Link: https://arxiv.org/abs/2411.15328
Authors: Xiangxiang Xu,Lizhong Zheng
Keywords: dependence induced representations, random variables, study the problem, pair of random, dependence induced
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We study the problem of learning feature representations from a pair of random variables, where we focus on the representations that are induced by their dependence. We provide sufficient and necessary conditions for such dependence induced representations, and illustrate their connections to Hirschfeld–Gebelein–Rényi (HGR) maximal correlation functions and minimal sufficient statistics. We characterize a large family of loss functions that can learn dependence induced representations, including cross entropy, hinge loss, and their regularized variants. In particular, we show that the features learned from this family can be expressed as the composition of a loss-dependent function and the maximal correlation function, which reveals a key connection between representations learned from different losses. Our development also gives a statistical interpretation of the neural collapse phenomenon observed in deep classifiers. Finally, we present the learning design based on the feature separation, which allows hyperparameter tuning during inference.

[LG-71] GreenMachine: Automatic Design of Zero-Cost Proxies for Energy-Efficient NAS CVPR2025

Link: https://arxiv.org/abs/2411.15290
Authors: Gabriel Cortês,Nuno Lourenço,Penousal Machado
Keywords: Artificial Intelligence, driven innovations, innovations and created, created new opportunities, Deep Neural Networks
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Submitted to CVPR 2025

Abstract:Artificial Intelligence (AI) has driven innovations and created new opportunities across various sectors. However, leveraging domain-specific knowledge often requires automated tools to design and configure models effectively. In the case of Deep Neural Networks (DNNs), researchers and practitioners usually resort to Neural Architecture Search (NAS) approaches, which are resource- and time-intensive, requiring the training and evaluation of numerous candidate architectures. This raises sustainability concerns, particularly due to the high energy demands involved, creating a paradox: the pursuit of the most effective model can undermine sustainability goals. To mitigate this issue, zero-cost proxies have emerged as a promising alternative. These proxies estimate a model’s performance without the need for full training, offering a more efficient approach. This paper addresses the challenges of model evaluation by automatically designing zero-cost proxies to assess DNNs efficiently. Our method begins with a randomly generated set of zero-cost proxies, which are evolved and tested using the NATS-Bench benchmark. We assess the proxies’ effectiveness using both randomly sampled and stratified subsets of the search space, ensuring they can differentiate between low- and high-performing networks and enhance generalizability. Results show our method outperforms existing approaches on the stratified sampling strategy, achieving strong correlations with ground truth performance, including a Kendall correlation of 0.89 on CIFAR-10 and 0.77 on CIFAR-100 with NATS-Bench-SSS and a Kendall correlation of 0.78 on CIFAR-10 and 0.71 on CIFAR-100 with NATS-Bench-TSS.
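The Kendall correlations quoted above measure how closely a proxy's ranking of architectures agrees with the ranking by trained accuracy. A minimal sketch of that metric follows, assuming the tau-a variant (the paper may use a tie-corrected variant instead); the function name and the five score/accuracy pairs are made-up illustrations, not values from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical proxy scores and ground-truth accuracies for five architectures.
proxy_scores = [0.10, 0.35, 0.30, 0.80, 0.95]
accuracies = [61.2, 70.4, 72.9, 88.1, 90.3]
tau = kendall_tau_a(proxy_scores, accuracies)
```

Here only one of the ten pairs is ranked in the wrong order by the proxy, so the correlation is high but not perfect.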

[LG-72] Don't Mesh with Me: Generating Constructive Solid Geometry Instead of Meshes by Fine-Tuning a Code-Generation LLM

Link: https://arxiv.org/abs/2411.15279
Authors: Maximilian Mews,Ansar Aynetdinov,Vivian Schiller,Peter Eisert,Alan Akbik
Keywords: revolutionizing software development, engineers designing mechanical, machine learning, creative industries, manual process
Categories: Machine Learning (cs.LG); Graphics (cs.GR)
Comments:

Abstract:While recent advancements in machine learning, such as LLMs, are revolutionizing software development and creative industries, they have had minimal impact on engineers designing mechanical parts, which remains largely a manual process. Existing approaches to generate 3D geometry most commonly use meshes as a 3D representation. While meshes are suitable for assets in video games or animations, they lack sufficient precision and adaptability for mechanical engineering purposes. This paper introduces a novel approach for the generation of 3D geometry that generates surface-based Constructive Solid Geometry (CSG) by leveraging a code-generation LLM. First, we create a dataset of 3D mechanical parts represented as code scripts by converting Boundary Representation geometry (BREP) into CSG-based Python scripts. Second, we create annotations in natural language using GPT-4. The resulting dataset is used to fine-tune a code-generation LLM. The fine-tuned LLM can complete geometries based on positional input and natural language in a plausible way, demonstrating geometric understanding.

[LG-73] Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Link: https://arxiv.org/abs/2411.15247
Authors: Zhiwei Jia,Yuesong Nan,Huixi Zhao,Gengdai Liu
Keywords: Recent research, enabling flexible model, flexible model alignment, reinforcement learning, enabling flexible
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to timestep-distilled DMs is challenging for ultra-fast ($\le 2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets image generation with $\le 2$ steps for reward optimization, enhancing generalizability and efficiency. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including PPO and DPO. We further show LaSRO’s connection to value-based RL, providing theoretical insights. See our webpage at this https URL.

[LG-74] An accuracy improving method for advertising click through rate prediction based on enhanced xDeepFM model

Link: https://arxiv.org/abs/2411.15223
Authors: Xiaowei Xi,Song Leng,Yuqing Gong,Dalin Li
Keywords: CTR prediction, Advertising click-through rate, advertising CTR prediction, CTR prediction faces, CTR prediction models
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 7 figures, 3 tables

Abstract:Advertising click-through rate (CTR) prediction aims to forecast the probability that a user will click on an advertisement in a given context, thus providing enterprises with decision support for product ranking and ad placement. However, CTR prediction faces challenges such as data sparsity and class imbalance, which adversely affect model training effectiveness. Moreover, most current CTR prediction models fail to fully explore the associations among user history, interests, and target advertisements from multiple perspectives, neglecting important information at different levels. To address these issues, this paper proposes an improved CTR prediction model based on the xDeepFM architecture. By integrating a multi-head attention mechanism, the model can simultaneously focus on different aspects of feature interactions, enhancing its ability to learn intricate patterns without significantly increasing computational complexity. Furthermore, replacing the linear model with a Factorization Machine (FM) model improves the handling of high-dimensional sparse data by flexibly capturing both first-order and second-order feature interactions. Experimental results on the Criteo dataset demonstrate that the proposed model outperforms other state-of-the-art methods, showing significant improvements in both AUC and Logloss metrics. This enhancement facilitates better mining of implicit relationships between features and improves the accuracy of advertising CTR prediction.
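The abstract reports results in terms of AUC and Logloss. As a rough sketch of what those two metrics compute for binary click labels (the function names, labels and predicted probabilities below are illustrative assumptions, not data from the paper):

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def roc_auc(labels, probs):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical click labels (1 = clicked) and predicted click probabilities.
clicks = [1, 0, 1, 0, 0]
predicted = [0.9, 0.3, 0.4, 0.5, 0.1]
auc = roc_auc(clicks, predicted)
loss = log_loss(clicks, predicted)
```

A higher AUC and a lower Logloss both indicate better predictions. The pairwise AUC above is quadratic in the number of examples, which is fine for a sketch but not for a dataset the size of Criteo.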

[LG-75] Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Link: https://arxiv.org/abs/2411.15221
Authors: Yoel Zimmermann,Adib Bazgir,Zartashia Afzal,Fariha Agbere,Qianxiang Ai,Nawaf Alampara,Alexander Al-Feghali,Mehrad Ansari,Dmytro Antypov,Amro Aswad,Jiaru Bai,Viktoriia Baibakova,Devi Dutta Biswajeet,Erik Bitzek,Joshua D. Bocarsly,Anna Borisova,Andres M Bran,L. Catherine Brinson,Marcel Moran Calderon,Alessandro Canalicchio,Victor Chen,Yuan Chiang,Defne Circi,Benjamin Charmes,Vikrant Chaudhary,Zizhang Chen,Min-Hsueh Chiu,Judith Clymo,Kedar Dabhadkar,Nathan Daelman,Archit Datar,Matthew L. Evans,Maryam Ghazizade Fard,Giuseppe Fisicaro,Abhijeet Sadashiv Gangan,Janine George,Jose D. Cojal Gonzalez,Michael Götte,Ankur K. Gupta,Hassan Harb,Pengyu Hong,Abdelrahman Ibrahim,Ahmed Ilyas,Alishba Imran,Kevin Ishimwe,Ramsey Issa,Kevin Maik Jablonka,Colin Jones,Tyler R. Josephson,Greg Juhasz,Sarthak Kapoor,Rongda Kang,Ghazal Khalighinejad,Sartaaj Khan,Sascha Klawohn,Suneel Kuman,Alvin Noe Ladines,Sarom Leang,Magdalena Lederbauer,Sheng-Lun Mark Liao,Hao Liu,Xuefeng Liu,Stanley Lo,Sandeep Madireddy,Piyush Ranjan Maharana,Shagun Maheshwari,Soroush Mahjoubi,José A. Márquez,Rob Mills,Trupti Mohanty,Bernadette Mohr,Seyed Mohamad Moosavi,Alexander Moßhammer,Amirhossein D. Naghdi,Aakash Naik,Oleksandr Narykov,Hampus Näsström,Xuan Vu Nguyen,Xinyi Ni,Dana O’Connor,Teslim Olayiwola,Federico Ottomano,Aleyna Beste Ozhan,Sebastian Pagel,Chiku Parida,Jaehee Park,Vraj Patel,Elena Patyukova,Martin Hoffmann Petersen,Luis Pinto,José M. Pizarro,Dieter Plessers,Tapashree Pradhan,Utkarsh Pratiush,Charishma Puli,Andrew Qin,Mahyar Rajabi,Francesco Ricci,Elliot Risch,Martiño Ríos-García
Keywords: Large Language Model, Large Language, Language Model, global hybrid locations, Materials Science
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
Comments: 98 pages

Abstract:Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year’s hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapid prototyping custom applications in scientific research.

[LG-76] Sampling with Adaptive Variance for Multimodal Distributions

Link: https://arxiv.org/abs/2411.15220
Authors: Björn Engquist,Kui Ren,Yunan Yang
Keywords: overdamped Langevin dynamics, classic overdamped Langevin, adaptive sampling algorithms, target Gibbs distribution, Langevin dynamics
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 26 pages, 6 figures

Abstract:We propose and analyze a class of adaptive sampling algorithms for multimodal distributions on a bounded domain, which share a structural resemblance to the classic overdamped Langevin dynamics. We first demonstrate that this class of linear dynamics with adaptive diffusion coefficients and vector fields can be interpreted and analyzed as weighted Wasserstein gradient flows of the Kullback–Leibler (KL) divergence between the current distribution and the target Gibbs distribution, which directly leads to the exponential convergence of both the KL and $\chi^2$ divergences, with rates depending on the weighted Wasserstein metric and the Gibbs potential. We then show that a derivative-free version of the dynamics can be used for sampling without gradient information of the Gibbs potential and that for Gibbs distributions with nonconvex potentials, this approach could achieve significantly faster convergence than the classical overdamped Langevin dynamics. A comparison of the mean transition times between local minima of a nonconvex potential further highlights the better efficiency of the derivative-free dynamics in sampling.
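For context, the classical overdamped Langevin dynamics that the abstract uses as its baseline can be sketched as below. This is only the standard Euler-Maruyama discretisation of $dX = -\nabla V(X)\,dt + \sqrt{2}\,dW$, not the adaptive-variance or derivative-free schemes proposed in the paper; the double-well potential, step size and function names are illustrative assumptions.

```python
import math
import random


def grad_double_well(x):
    """Gradient of the illustrative nonconvex potential V(x) = (x^2 - 1)^2."""
    return 4.0 * x * (x * x - 1.0)


def langevin_step(x, grad_v, step, noise):
    """One Euler-Maruyama step of dX = -grad V(X) dt + sqrt(2) dW."""
    return x - step * grad_v(x) + math.sqrt(2.0 * step) * noise


def sample_langevin(grad_v, x0, step, n_steps, seed=0):
    """Run the unadjusted Langevin iteration and return the visited states."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x = langevin_step(x, grad_v, step, rng.gauss(0.0, 1.0))
        samples.append(x)
    return samples


# Start in the left well at x = -1; the target density is proportional to exp(-V).
samples = sample_langevin(grad_double_well, x0=-1.0, step=0.01, n_steps=5000)
```

With a fixed unit diffusion coefficient, hops between the two wells at $x = \pm 1$ become rare as the barrier grows. That slow mixing on multimodal targets is the limitation the abstract says its adaptive dynamics address.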

[LG-77] Quantized symbolic time series approximation

Link: https://arxiv.org/abs/2411.15209
Authors: Erin Carson,Xinye Chen,Cheng Kang
Keywords: Time series, time series regression, Time, series, ubiquitous in numerous
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments:

Abstract:Time series are ubiquitous in numerous science and engineering domains, e.g., signal processing, bioinformatics, and astronomy. Previous work has verified the efficacy of symbolic time series representation in a variety of engineering applications due to its storage efficiency and numerosity reduction. The most recent symbolic aggregate approximation technique, ABBA, has been shown to preserve essential shape information of time series and improve downstream applications, e.g., neural network inference regarding prediction and anomaly detection in time series. Motivated by the emergence of high-performance hardware which enables efficient computation for low bit-width representations, we present a new quantization-based ABBA symbolic approximation technique, QABBA, which exhibits improved storage efficiency while retaining the original speed and accuracy of symbolic reconstruction. We prove an upper bound for the error arising from quantization and discuss how the number of bits should be chosen to balance this with other errors. An application of QABBA with large language models (LLMs) for time series regression is also presented, and its utility is investigated. By representing the symbolic chain of patterns on time series, QABBA not only avoids the training of embedding from scratch, but also achieves a new state-of-the-art on Monash regression dataset. The symbolic approximation to the time series offers a more efficient way to fine-tune LLMs on the time series regression task which contains various application domains. We further present a set of extensive experiments performed across various well-established datasets to demonstrate the advantages of the QABBA method for symbolic approximation. 
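The trade-off between bit width and quantization error mentioned above can be illustrated with a generic uniform quantizer. This is a sketch under stated assumptions, not the QABBA scheme or the paper's error bound: the function name and sample values are hypothetical, and the only property shown is the textbook fact that rounding to the nearest of $2^b$ evenly spaced levels gives a per-value error of at most half a step.

```python
def quantize_uniform(values, bits):
    """Quantize values to 2**bits evenly spaced levels over [min, max].

    Returns (integer codes, reconstructed values, step size).
    """
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1  # number of intervals between the 2**bits levels
    if hi == lo:
        return [0] * len(values), list(values), 0.0
    step = (hi - lo) / levels
    codes = [round((v - lo) / step) for v in values]
    recon = [lo + c * step for c in codes]
    return codes, recon, step


# Hypothetical real-valued parameters stored with 3 bits each.
values = [0.0, 0.12, 0.5, 0.77, 1.0]
codes, recon, step = quantize_uniform(values, bits=3)
max_error = max(abs(v - r) for v, r in zip(values, recon))
```

Each extra bit roughly halves `step`, and with it the worst-case error, at the cost of more storage per value.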

[LG-78] Transforming Triple-Entry Accounting with Machine Learning: A Path to Enhanced Transparency Through Analytics

Link: https://arxiv.org/abs/2411.15190
Authors: Abraham Itzhak Weinberg,Alessio Faccia
Keywords: Triple Entry, double-entry bookkeeping system, conventional double-entry bookkeeping, entries’ to record, utilizes three accounts
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Triple Entry (TE) is an accounting method that utilizes three accounts or ‘entries’ to record each transaction, rather than the conventional double-entry bookkeeping system. Existing studies have found that TE accounting, with its additional layer of verification and disclosure of inter-organizational relationships, could help improve transparency in complex financial and supply chain transactions such as blockchain. Machine learning (ML) presents a promising avenue to augment the transparency advantages of TE accounting. By automating some of the data collection and analysis needed for TE bookkeeping, ML techniques have the potential to make this more transparent accounting method scalable for large organizations with complex international supply chains, further enhancing the visibility and trustworthiness of financial reporting. By leveraging ML algorithms, anomalies within distributed ledger data can be swiftly identified, flagging potential instances of fraud or errors. Furthermore, by delving into transaction relationships over time, ML can untangle intricate webs of transactions, shedding light on obscured dealings and adding an investigative dimension. This paper aims to demonstrate the interaction between TE and ML and how they can leverage transparency levels.

[LG-79] Random Forest-Supervised Manifold Alignment

Link: https://arxiv.org/abs/2411.15179
Authors: Jake S. Rhodes,Adam G. Rustad
Keywords: shared low-dimensional representation, enabling cross-domain learning, data fusion technique, Manifold alignment, fusion technique
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 4 pages, 3 figures, Accepted at MMAI 2024 (BigData 2024)

Abstract:Manifold alignment is a type of data fusion technique that creates a shared low-dimensional representation of data collected from multiple domains, enabling cross-domain learning and improved performance in downstream tasks. This paper presents an approach to manifold alignment using random forests as a foundation for semi-supervised alignment algorithms, leveraging the model’s inherent strengths. We focus on enhancing two recently developed graph-based alignment methods by integrating class labels through geometry-preserving proximities derived from random forests. These proximities serve as a supervised initialization for constructing cross-domain relationships that maintain local neighborhood structures, thereby facilitating alignment. Our approach addresses a common limitation in manifold alignment, where existing methods often fail to generate embeddings that capture sufficient information for downstream classification. By contrast, we find that alignment models that use random forest proximities or class-label information achieve improved accuracy on downstream classification tasks, outperforming single-domain baselines. Experiments across multiple datasets show that our method typically enhances cross-domain feature integration and predictive performance, suggesting that random forest proximities offer a practical solution for tasks requiring multimodal data alignment.

[LG-80] Opportunities of Reinforcement Learning in South Africa's Just Transition

Link: https://arxiv.org/abs/2411.15145
Authors: Claude Formanek,Callum Rhys Tilbury,Jonathan P. Shock
Keywords: looming climate crisis, South Africa stands, interwoven socio-economic challenges, crucial juncture, grappling with interwoven
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted at the Southern African Conference for Artificial Intelligence Research 2024

Abstract:South Africa stands at a crucial juncture, grappling with interwoven socio-economic challenges such as poverty, inequality, unemployment, and the looming climate crisis. The government’s Just Transition framework aims to enhance climate resilience, achieve net-zero greenhouse gas emissions by 2050, and promote social inclusion and poverty eradication. According to the Presidential Commission on the Fourth Industrial Revolution, artificial intelligence technologies offer significant promise in addressing these challenges. This paper explores the overlooked potential of Reinforcement Learning (RL) in supporting South Africa’s Just Transition. It examines how RL can enhance agriculture and land-use practices, manage complex, decentralised energy networks, and optimise transportation and logistics, thereby playing a critical role in achieving a just and equitable transition to a low-carbon future for all South Africans. We provide a roadmap as to how other researchers in the field may be able to contribute to these pressing problems.

[LG-81] Gradient Masking All-at-Once: Ensemble Everything Everywhere Is Not Robust

Link: https://arxiv.org/abs/2411.14834
Authors: Jie Zhang,Kristina Nikolić,Nicholas Carlini,Florian Tramèr
Keywords: make image classifiers, image classifiers robust, recently proposed, proposed to make, Ensemble
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:Ensemble everything everywhere is a defense to adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model’s intermediate representations at multiple noisy image resolutions, producing a single robust classification. This defense was shown to be effective against multiple state-of-the-art attacks. Perhaps even more convincingly, it was shown that the model’s gradients are perceptually aligned: attacks against the model produce noise that perceptually resembles the targeted class. In this short note, we show that this defense is not robust to adversarial attack. We first show that the defense’s randomness and ensembling method cause severe gradient masking. We then use standard adaptive attack techniques to reduce the defense’s robust accuracy from 48% to 1% on CIFAR-100 and from 62% to 4% on CIFAR-10, under the $\ell_\infty$-norm threat model with $\varepsilon = 8/255$.
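The $\ell_\infty$ threat model with $\varepsilon = 8/255$ means an attacker may change every pixel by at most 8 intensity levels out of 255. A minimal sketch of the projection that keeps a candidate adversarial image inside that budget follows; it is a generic building block of iterative attacks, not the adaptive attack from the note, and the function name and pixel values are illustrative assumptions.

```python
EPSILON = 8 / 255  # per-pixel budget, with pixel intensities scaled to [0, 1]


def project_linf(x_adv, x_clean, eps=EPSILON):
    """Clamp x_adv to the l-infinity ball of radius eps around x_clean.

    The result is also clipped to the valid pixel range [0, 1].
    """
    projected = []
    for adv, clean in zip(x_adv, x_clean):
        delta = max(-eps, min(eps, adv - clean))
        projected.append(max(0.0, min(1.0, clean + delta)))
    return projected


# Hypothetical flattened three-pixel image and an over-budget perturbation of it.
clean = [0.5, 0.0, 1.0]
candidate = [0.9, -0.2, 1.0]
adversarial = project_linf(candidate, clean)
```

Any attack evaluated under this threat model has to return an image that satisfies this constraint.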

[LG-82] Schemato – An LLM for Netlist-to-Schematic Conversion

Link: https://arxiv.org/abs/2411.13899
Authors: Ryoga Matsuo,Stefan Uhlich,Arun Venkitaraman,Andrea Bonetti,Chia-Yu Hsieh,Ali Momeni,Lukas Mauch,Augusto Capone,Eisaku Ohbuchi,Lorenzo Servadei
Keywords: Machine learning models, Machine learning, advancing circuit design, Machine, advancing circuit
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments:

Abstract:Machine learning models are advancing circuit design, particularly in analog circuits. They typically generate netlists that lack human interpretability. This is a problem as human designers heavily rely on the interpretability of circuit diagrams or schematics to intuitively understand, troubleshoot, and develop designs. Hence, to integrate domain knowledge effectively, it is crucial to translate ML-generated netlists into interpretable schematics quickly and accurately. We propose Schemato, a large language model (LLM) for netlist-to-schematic conversion. In particular, we consider our approach in the two settings of converting netlists to .asc files for LTspice and LaTeX files for CircuiTikZ schematics. Experiments on our circuit dataset show that Schemato achieves up to 93% compilation success rate for the netlist-to-LaTeX conversion task, surpassing the 26% rate scored by the state-of-the-art LLMs. Furthermore, our experiments show that Schemato generates schematics with a mean structural similarity index measure that is 3x higher than the best performing LLMs, therefore closer to the reference human design.

[LG-83] Gaussian Process Priors for Boundary Value Problems of Linear Partial Differential Equations

Link: https://arxiv.org/abs/2411.16663
Authors: Jianlei Huang,Marc Härkönen,Markus Lange-Hegermann,Bogdan Raiţă
Keywords: partial differential equations, Solving systems, differential equations, traditionally addressed, numerical solvers
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Commutative Algebra (math.AC); Numerical Analysis (math.NA)
Comments: 25 pages, 19 figures. Code available at https://github.com/Jimmy000207/Boundary-EPGP. The paper and all ancillary files are released under CC-BY

Abstract:Solving systems of partial differential equations (PDEs) is a fundamental task in computational science, traditionally addressed by numerical solvers. Recent advancements have introduced neural operators and physics-informed neural networks (PINNs) to tackle PDEs, achieving reduced computational costs at the expense of solution quality and accuracy. Gaussian processes (GPs) have also been applied to linear PDEs, with the advantage of always yielding precise solutions. In this work, we propose Boundary Ehrenpreis-Palamodov Gaussian Processes (B-EPGPs), a novel framework for constructing GP priors that satisfy both general systems of linear PDEs with constant coefficients and linear boundary conditions. We explicitly construct GP priors for representative PDE systems with practical boundary conditions. Formal proofs of correctness are provided, and empirical results demonstrate significant accuracy improvements over state-of-the-art neural operator approaches.

[LG-84] Fast training of large kernel models with delayed projections

Link: https://arxiv.org/abs/2411.16658
Authors: Amirhesam Abedsoltan,Siyuan Ma,Parthe Pandit,Mikhail Belkin
Keywords: Classical kernel machines, historically faced significant, faced significant challenges, Classical kernel, neural networks
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2302.02605

Abstract:Classical kernel machines have historically faced significant challenges in scaling to large datasets and model sizes–a key ingredient that has driven the success of neural networks. In this paper, we present a new methodology for building kernel machines that can scale efficiently with both data size and model size. Our algorithm introduces delayed projections to Preconditioned Stochastic Gradient Descent (PSGD) allowing the training of much larger models than was previously feasible, pushing the practical limits of kernel-based learning. We validate our algorithm, EigenPro4, across multiple datasets, demonstrating drastic training speed up over the existing methods while maintaining comparable or better classification accuracy.

[LG-85] Alpha Entropy Search for New Information-based Bayesian Optimization

Link: https://arxiv.org/abs/2411.16586
Authors: Daniel Fernández-Sánchez,Eduardo C. Garrido-Merchán,Daniel Hernández-Lobato
Keywords: Bayesian optimization, Alpha Entropy Search, AES, acquisition functions, information theory
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 31 pages, 12 figures, 3 tables, Journal KBS

Abstract:Bayesian optimization (BO) methods based on information theory have obtained state-of-the-art results in several tasks. These techniques heavily rely on the Kullback-Leibler (KL) divergence to compute the acquisition function. In this work, we introduce a novel information-based class of acquisition functions for BO called Alpha Entropy Search (AES). AES is based on the $\alpha$-divergence, which generalizes the KL divergence. Iteratively, AES selects the next evaluation point as the one whose associated target value has the highest level of dependency with respect to the location and associated value of the global maximum of the optimization problem. Dependency is measured in terms of the $\alpha$-divergence, as an alternative to the KL divergence. Intuitively, this favors the evaluation of the objective function at the most informative points about the global maximum. The $\alpha$-divergence has a free parameter $\alpha$, which determines the behavior of the divergence, trading off evaluating differences between distributions at a single mode and evaluating differences globally. Therefore, different values of $\alpha$ result in different acquisition functions. The AES acquisition lacks a closed-form expression. However, we propose an efficient and accurate approximation using a truncated Gaussian distribution. In practice, the value of $\alpha$ can be chosen by the practitioner, but here we suggest using a combination of acquisition functions obtained by simultaneously considering a range of values of $\alpha$. We provide an implementation of AES in BoTorch and we evaluate its performance in synthetic, benchmark and real-world experiments involving the tuning of the hyper-parameters of a deep neural network. These experiments show that the performance of AES is competitive with respect to other information-based acquisition functions such as JES, MES or PES.
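To make the claim that the $\alpha$-divergence generalizes the KL divergence concrete, here is a small numerical sketch for discrete distributions. It assumes the Amari form $D_\alpha(p\|q) = \frac{1 - \sum_i p_i^{\alpha} q_i^{1-\alpha}}{\alpha(1-\alpha)}$; the paper's exact parameterization may differ, and the two example distributions and function names are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence between normalized, strictly positive distributions."""
    if alpha in (0.0, 1.0):
        raise ValueError("alpha = 0 and alpha = 1 are defined only as limits")
    overlap = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (1.0 - overlap) / (alpha * (1.0 - alpha))


# Hypothetical three-outcome distributions.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

near_one = alpha_divergence(p, q, 1.0 - 1e-6)   # approaches KL(p || q)
near_zero = alpha_divergence(p, q, 1e-6)        # approaches KL(q || p)
```

Under this form, letting $\alpha \to 1$ recovers $\mathrm{KL}(p\|q)$ and $\alpha \to 0$ recovers $\mathrm{KL}(q\|p)$, so intermediate values interpolate between the two behaviours.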

[LG-86] Quantum Circuit Training with Growth-Based Architectures

Link: https://arxiv.org/abs/2411.16560
Authors: Callum Duffy,Smit Chaudhary,Gergana V. Velikova
Keywords: Feature Map Growth, Sequential Feature Map, incrementally increase parameterized, Interleave Feature Map, model complexity dynamically
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures

Abstract:This study introduces growth-based training strategies that incrementally increase parameterized quantum circuit (PQC) depth during training, mitigating overfitting and managing model complexity dynamically. We develop three distinct methods: Block Growth, Sequential Feature Map Growth, and Interleave Feature Map Growth, which add reuploader blocks to PQCs adaptively, expanding the accessible frequency spectrum of the model in response to training needs. This approach enables PQCs to achieve more stable convergence and generalization, even in noisy settings. We evaluate our methods on regression tasks and the 2D Laplace equation, demonstrating that dynamic growth methods outperform traditional, fixed-depth approaches, achieving lower final losses and reduced variance between runs. These findings underscore the potential of growth-based PQCs for quantum scientific machine learning (QSciML) applications, where balancing expressivity and stability is essential.

[LG-87] Anomaly Detection and RFI Classification with Unsupervised Learning in Narrowband Radio Technosignature Searches

Link: https://arxiv.org/abs/2411.16556
Authors: Ben Jacobson-Bell,Steve Croft,Carmen Choza,Alex Andersson,Daniel Bautista,Vishal Gajjar,Matthew Lebofsky,David H. E. MacMahon,Caleb Painter,Andrew P. V. Siemion
Keywords: candidate signals represent, signals represent needles, anomaly detection problem, radio-frequency interference, Grouping Low-frequency Observations
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 20 pages, 14 figures, submitted to AJ

Abstract:The search for radio technosignatures is an anomaly detection problem: candidate signals represent needles of interest in the proverbial haystack of radio-frequency interference (RFI). Current search frameworks find an enormity of false-positive signals, especially in large surveys, requiring manual follow-up to a sometimes prohibitive degree. Unsupervised learning provides an algorithmic way to winnow the most anomalous signals from the chaff, as well as group together RFI signals that bear morphological similarities. We present GLOBULAR (Grouping Low-frequency Observations By Unsupervised Learning After Reduction) clustering, a signal processing method that uses HDBSCAN to reduce the false-positive rate and isolate outlier signals for further analysis. When combined with a standard narrowband signal detection and spatial filtering pipeline, such as turboSETI, GLOBULAR clustering offers significant improvements in the false-positive rate over the standard pipeline alone, suggesting dramatic potential for the amelioration of manual follow-up requirements for future large surveys. By removing RFI signals in regions of high spectral occupancy, GLOBULAR clustering may also enable the detection of signals missed by the standard pipeline. We benchmark our method against the Choza et al. (2024) turboSETI-only search of 97 nearby galaxies at L-band, demonstrating a false-positive hit reduction rate of 93.1% and a false-positive event reduction rate of 99.3%.

[LG-88] Graph Transformer Networks for Accurate Band Structure Prediction: An End-to-End Approach

Link: https://arxiv.org/abs/2411.16483
Authors: Weiyi Gong,Tao Sun,Hexin Bai,Jeng-Yuan Tsai,Haibin Ling,Qimin Yan
Keywords: Predicting electronic band, understanding structure-property correlations, Predicting electronic, materials science, band
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures

Click to view abstract

Abstract:Predicting electronic band structures from crystal structures is crucial for understanding structure-property correlations in materials science. First-principles approaches are accurate but computationally intensive. In recent years, machine learning (ML) has been extensively applied to this field, but existing ML models predominantly focus on band gap predictions or indirect band structure estimation via solving predicted Hamiltonians. An end-to-end model to predict band structure accurately and efficiently is still lacking. Here, we introduce a graph Transformer-based end-to-end approach that directly predicts band structures from crystal structures with high accuracy. Our method leverages the continuity of the k-path and treats continuous bands as a sequence. We demonstrate that our model not only provides accurate band structure predictions but also can derive other properties (such as band gap, band center, and band dispersion) with high accuracy. We verify the model performance on large and diverse datasets.

[LG-89] Statistical inference for quantum singular models

Link: https://arxiv.org/abs/2411.16396
Authors: Hiroshi Yano,Yota Maeda,Naoki Yamamoto
Keywords: quantum singular models, theoretical evidence suggesting, Deep learning, quantum, statistical models
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Algebraic Geometry (math.AG); Machine Learning (stat.ML)
Comments: 57 pages, 8 figures

Click to view abstract

Abstract:Deep learning has seen substantial achievements, with numerical and theoretical evidence suggesting that singularities of statistical models are considered a contributing factor to its performance. From this remarkable success of classical statistical models, it is naturally expected that quantum singular models will play a vital role in many quantum statistical tasks. However, while the theory of quantum statistical models in regular cases has been established, theoretical understanding of quantum singular models is still limited. To investigate the statistical properties of quantum singular models, we focus on two prominent tasks in quantum statistical inference: quantum state estimation and model selection. In particular, we base our study on classical singular learning theory and seek to extend it within the framework of Bayesian quantum state estimation. To this end, we define quantum generalization and training loss functions and give their asymptotic expansions through algebraic geometrical methods. The key idea of the proof is the introduction of a quantum analog of the likelihood function using classical shadows. Consequently, we construct an asymptotically unbiased estimator of the quantum generalization loss, the quantum widely applicable information criterion (QWAIC), as a computable model selection metric from given measurement outcomes.

[LG-90] Solaris: A Foundation Model of the Sun

Link: https://arxiv.org/abs/2411.16339
Authors: Harris Abdul Majid,Pietro Sittoni,Francesco Tudisco
Keywords: demonstrated remarkable success, scientific domains, motivating our exploration, demonstrated remarkable, remarkable success
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Space Physics (physics.space-ph)
Comments:

Click to view abstract

Abstract:Foundation models have demonstrated remarkable success across various scientific domains, motivating our exploration of their potential in solar physics. In this paper, we present Solaris, the first foundation model for forecasting the Sun’s atmosphere. We leverage 13 years of full-disk, multi-wavelength solar imagery from the Solar Dynamics Observatory, spanning a complete solar cycle, to pre-train Solaris for 12-hour interval forecasting. Solaris is built on a large-scale 3D Swin Transformer architecture with 109 million parameters. We demonstrate Solaris’ ability to generalize by fine-tuning in a low-data regime using a single wavelength (1700 Å) that was not included in pre-training, outperforming models trained from scratch on this specific wavelength. Our results indicate that Solaris can effectively capture the complex dynamics of the solar atmosphere and transform solar forecasting.

[LG-91] Efficient pooling of predictions via kernel embeddings

Link: https://arxiv.org/abs/2411.16246
Authors: Sam Allen,David Ginsbourger,Johanna Ziegel
Keywords: Probabilistic predictions, predictions, linear pool, Kernel Hilbert Space, probability distributions
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Probabilistic predictions are probability distributions over the set of possible outcomes. Such predictions quantify the uncertainty in the outcome, making them essential for effective decision making. By combining multiple predictions, the information sources used to generate the predictions are pooled, often resulting in a more informative forecast. Probabilistic predictions are typically combined by linearly pooling the individual predictive distributions; this encompasses several ensemble learning techniques, for example. The weights assigned to each prediction can be estimated based on their past performance, allowing more accurate predictions to receive a higher weight. This can be achieved by finding the weights that optimise a proper scoring rule over some training data. By embedding predictions into a Reproducing Kernel Hilbert Space (RKHS), we illustrate that estimating the linear pool weights that optimise kernel-based scoring rules is a convex quadratic optimisation problem. This permits an efficient implementation of the linear pool when optimally combining predictions on arbitrary outcome domains. This result also holds for other combination strategies, and we additionally study a flexible generalisation of the linear pool that overcomes some of its theoretical limitations, whilst allowing an efficient implementation within the RKHS framework. These approaches are compared in an application to operational wind speed forecasts, where this generalisation is found to offer substantial improvements upon the traditional linear pool.
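The claim that the pool weights solve a convex quadratic problem can be illustrated with a small sketch. It assumes the standard kernel score S(P, y) = ½ E k(X, X') − E k(X, y) + ½ k(y, y), a Gaussian kernel, predictions given as finite samples, and a single observation; the function names are hypothetical and this is not the authors' implementation. For a linear pool with weights w, the score becomes ½ wᵀAw − wᵀb + const, where A holds mean kernel values between members and b holds mean kernel values between each member and the observation.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (positive definite) kernel on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys):
    """Average kernel value over all pairs drawn from two samples."""
    return sum(gaussian_kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def pool_quadratic_terms(samples, obs):
    """Return (A, b) such that the pooled kernel score is 0.5 w'Aw - w'b + const."""
    a = [[mean_kernel(si, sj) for sj in samples] for si in samples]
    b = [mean_kernel(si, [obs]) for si in samples]
    return a, b


def pooled_kernel_score(weights, a, b, obs):
    """Kernel score of the linear pool, evaluated through the quadratic form."""
    quad = sum(wi * wj * a[i][j]
               for i, wi in enumerate(weights)
               for j, wj in enumerate(weights))
    lin = sum(wi * bi for wi, bi in zip(weights, b))
    return 0.5 * quad - lin + 0.5 * gaussian_kernel(obs, obs)


def best_two_member_weight(a, b):
    """Weight of member 0 minimising the score of a two-member pool on [0, 1]."""
    denom = a[0][0] - 2.0 * a[0][1] + a[1][1]
    if denom <= 1e-12:
        return 0.5  # members are numerically indistinguishable under the kernel
    w = (a[1][1] - a[0][1] + b[0] - b[1]) / denom
    # Clipping the unconstrained minimiser is exact for a 1-D convex quadratic.
    return min(1.0, max(0.0, w))
```

With several training observations, one would average the per-observation A and b before optimising; for more than two members the same objective could be handed to any quadratic programming solver with simplex constraints.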

[LG-92] Flow Annealed Importance Sampling Bootstrap meets Differentiable Particle Physics NEURIPS

Link: https://arxiv.org/abs/2411.16234
Authors: Annalena Kofler,Vincent Stimper,Mikhail Mikhasenko,Michael Kagan,Lukas Heinrich
Keywords: High-energy physics requires, called matrix elements, analytically tractable distributions, tractable distributions called, distributions called matrix
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Accepted at the ‘Machine Learning and the Physical Sciences 2024’ workshop at NeurIPS

Click to view abstract

Abstract:High-energy physics requires the generation of large numbers of simulated data samples from complex but analytically tractable distributions called matrix elements. Surrogate models, such as normalizing flows, are gaining popularity for this task due to their computational efficiency. We adopt an approach based on Flow Annealed importance sampling Bootstrap (FAB) that evaluates the differentiable target density during training and helps avoid the costly generation of training data in advance. We show that FAB reaches higher sampling efficiency with fewer target evaluations in high dimensions in comparison to other methods.

[LG-93] Effective Non-Random Extreme Learning Machine

Link: https://arxiv.org/abs/2411.16229
Authors: Daniela De Canditiis,Fabiano Veglianti
Keywords: Extreme Learning Machine, growing statistical technique, statistical technique widely, technique widely applied, hidden layer weights
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The Extreme Learning Machine (ELM) is a growing statistical technique widely applied to regression problems. In essence, ELMs are single-layer neural networks where the hidden layer weights are randomly sampled from a specific distribution, while the output layer weights are learned from the data. Two of the key challenges with this approach are the architecture design, specifically determining the optimal number of neurons in the hidden layer, and the method’s sensitivity to the random initialization of hidden layer weights. This paper introduces a new and enhanced learning algorithm for regression tasks, the Effective Non-Random ELM (ENR-ELM), which simplifies the architecture design and eliminates the need for random hidden layer weight selection. The proposed method incorporates concepts from signal processing, such as basis functions and projections, into the ELM framework. We introduce two versions of the ENR-ELM: the approximated ENR-ELM and the incremental ENR-ELM. Experimental results on both synthetic and real datasets demonstrate that our method overcomes the problems of traditional ELM while maintaining comparable predictive performance.
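For readers unfamiliar with the baseline, the following is a minimal sketch of a traditional (random) ELM for scalar regression, not of the ENR-ELM proposed in the paper. The tanh activation, the standard normal sampling of hidden weights, the small ridge term, and all function names are assumptions made for illustration.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_elm(xs, ys, n_hidden=20, ridge=1e-6, seed=0):
    """Fit a classical ELM: random hidden layer, least-squares output layer."""
    rng = random.Random(seed)
    # Hidden weights and biases are sampled once and never trained.
    w = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]

    def hidden(x):
        return [math.tanh(w[j] * x + b[j]) for j in range(n_hidden)]

    h = [hidden(x) for x in xs]
    # Ridge-regularised normal equations: (H'H + ridge I) beta = H'y.
    gram = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
             for j in range(n_hidden)] for i in range(n_hidden)]
    rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(n_hidden)]
    beta = solve_linear_system(gram, rhs)

    def predict(x):
        return sum(bj * hj for bj, hj in zip(beta, hidden(x)))

    return predict
```

Changing `seed` or `n_hidden` changes the fitted function, which is the sensitivity to random initialization and architecture choice that the abstract describes.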

[LG-94] DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing

Link: https://arxiv.org/abs/2411.16121
Authors: Utsab Saha,Tanvir Muntakim Tonoy,Hafiz Imtiaz
Keywords: created significant opportunities, including healthcare, recent years, informed decision-making, created significant
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Under review in Elsevier Array

Click to view abstract

Abstract:In recent years, the growth of data across various sectors, including healthcare, security, finance, and education, has created significant opportunities for analysis and informed decision-making. However, these datasets often contain sensitive and personal information, which raises serious privacy concerns. Protecting individual privacy is crucial, yet many existing machine learning and data publishing algorithms struggle with high-dimensional data, facing challenges related to computational efficiency and privacy preservation. To address these challenges, we introduce an effective data publishing algorithm, DP-CDA. Our proposed algorithm generates synthetic datasets by randomly mixing data in a class-specific manner and inducing carefully tuned randomness to ensure formal privacy guarantees. Our comprehensive privacy accounting shows that DP-CDA provides a stronger privacy guarantee compared to existing methods, allowing for better utility while maintaining a strict level of privacy. To evaluate the effectiveness of DP-CDA, we examine the accuracy of predictive models trained on the synthetic data, which serves as a measure of dataset utility. Importantly, we identify an optimal order of mixing that balances privacy guarantee with predictive accuracy. Our results indicate that synthetic datasets produced using DP-CDA can achieve superior utility compared to those generated by traditional data publishing algorithms, even when subject to the same privacy requirements.

[LG-95] Machine-learning emergent spacetime from linear response in future tabletop quantum gravity experiments

Link: https://arxiv.org/abs/2411.16052
Authors: Koji Hashimoto,Koshiro Matsuo,Masaki Murata,Gakuto Ogiwara,Daichi Takeda
Keywords: interpretable Neural Network, Neural Network, condensed matter system, perform precision bulk, condensed matter
Categories: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
Comments: 24 pages, 10 figures

Click to view abstract

Abstract:We introduce a novel interpretable Neural Network (NN) model designed to perform precision bulk reconstruction under the AdS/CFT correspondence. According to the correspondence, a specific condensed matter system on a ring is holographically equivalent to a gravitational system on a bulk disk, through which tabletop quantum gravity experiments may be possible as reported in arXiv:2211.13863. The purpose of this paper is to reconstruct a higher-dimensional gravity metric from the condensed matter system data via machine learning using the NN. Our machine reads spatially and temporally inhomogeneous linear response data of the condensed matter system, and incorporates a novel layer that implements the Runge-Kutta method to achieve better numerical control. We confirm that our machine can let a higher-dimensional gravity metric be automatically emergent as its interpretable weights, using a linear response of the condensed matter system as data, through supervised machine learning. The developed method could serve as a foundation for generic bulk reconstruction, i.e., a practical solution to the AdS/CFT correspondence, and would be implemented in future tabletop quantum gravity experiments.
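As background on the Runge-Kutta method mentioned in the abstract, here is a sketch of one classical fourth-order (RK4) step for a scalar ODE y' = f(t, y). The abstract does not say which Runge-Kutta order or formulation the layer uses, so the RK4 choice and the function name are assumptions for illustration only.

```python
def rk4_step(f, t, y, h):
    """Advance y' = f(t, y) by one step of size h with the classical RK4 scheme."""
    k1 = f(t, y)
    k2 = f(t + h / 2.0, y + h / 2.0 * k1)
    k3 = f(t + h / 2.0, y + h / 2.0 * k2)
    k4 = f(t + h, y + h * k3)
    # Weighted average of the four slope estimates.
    return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

Applying such a step repeatedly, with each step treated as one network layer, is one way an ODE integrator can be embedded in a neural network so that the learned weights keep a physical interpretation.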

[LG-96] Downlink MIMO Channel Estimation from Bits: Recoverability and Algorithm

Link: https://arxiv.org/abs/2411.16043
Authors: Rajesh Shrestha,Mingjie Shao,Mingyi Hong,Wing-Kin Ma,Xiao Fu
Keywords: frequency division duplex, massive MIMO systems, major challenge lies, channel state information, downlink channel state
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In frequency division duplex (FDD) massive MIMO systems, a major challenge lies in acquiring the downlink channel state information (CSI) at the base station (BS) from limited feedback sent by the user equipment (UE). To tackle this fundamental task, our contribution is twofold: First, a simple feedback framework is proposed, where a compression and Gaussian dithering-based quantization strategy is adopted at the UE side, and then a maximum likelihood estimator (MLE) is formulated at the BS side. Recoverability of the MIMO channel under the widely used double directional model is established. Specifically, analyses are presented for two compression schemes – showing one being more overhead-economical and the other computationally lighter at the UE side. Second, to realize the MLE, an alternating direction method of multipliers (ADMM) algorithm is proposed. The algorithm is carefully designed to integrate a sophisticated harmonic retrieval (HR) solver as a subroutine, which turns out to be the key to effectively tackling this hard MLE problem. Numerical experiments are conducted to validate the efficacy of our approach.

[LG-97] Generative AI for Brane Configurations, Tropical Coamoeba and 4d N=1 Quiver Gauge Theories

Link: https://arxiv.org/abs/2411.16033
Authors: Rak-Kyeong Seong
Keywords: Type IIB brane, IIB brane configurations, Type IIB, IIB brane, supersymmetric gauge theories
Categories: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Mathematical Physics (math-ph); Algebraic Geometry (math.AG)
Comments: 21 pages, 8 figures, 1 table

Click to view abstract

Abstract:We introduce a generative AI model to obtain Type IIB brane configurations that realize toric phases of a family of 4d N=1 supersymmetric gauge theories. These 4d N=1 quiver gauge theories are worldvolume theories of a D3-brane probing a toric Calabi-Yau 3-fold. The Type IIB brane configurations that realize this family of 4d N=1 theories are known as brane tilings and are given by the tropical coamoeba projection of the mirror curve associated with the toric Calabi-Yau 3-fold. The shape of the mirror curve and its coamoeba projection, as well as the corresponding Type IIB brane configuration and the toric phase of the 4d N=1 theory, all depend on the complex structure moduli parameterizing the mirror curve. We train a generative AI model, a conditional variational autoencoder (CVAE), that takes a choice of complex structure moduli as input and generates the corresponding tropical coamoeba. This enables us not only to obtain a high-resolution representation of the entire phase space for a family of brane tilings corresponding to the same toric Calabi-Yau 3-fold, but also to continuously track the movements of the mirror curve and individual branes in the corresponding Type IIB brane configurations during phase transitions associated with Seiberg duality.

[LG-98] Lattice phi4 field theory as a multi-agent system of financial markets

Link: https://arxiv.org/abs/2411.15813
Authors: Dimitrios Bachtis
Keywords: reproduce stylized facts, lattice field theory, clustered volatility, reproduce stylized, stylized facts
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Multiagent Systems (cs.MA); High Energy Physics - Lattice (hep-lat)
Comments: Code is available from this https URL

Click to view abstract

Abstract:We introduce a $\phi^4$ lattice field theory with frustrated dynamics as a multi-agent system to reproduce stylized facts of financial markets such as fat-tailed distributions of returns and clustered volatility. Each lattice site, represented by a continuous degree of freedom, corresponds to an agent experiencing a set of competing interactions which influence its decision to buy or sell a given stock. These interactions comprise a cooperative term, which signifies that the agent should imitate the behavior of its neighbors, and a fictitious field, which compels the agent instead to conform with the opinion of the majority or the minority. To introduce the competing dynamics we exploit the Markov field structure to pursue a constructive decomposition of the $\phi^4$ probability distribution which we recompose with a Ferrenberg-Swendsen acceptance or rejection sampling step. We then verify numerically that the multi-agent $\phi^4$ field theory produces behavior observed on empirical data from the FTSE 100 London Stock Exchange index. We conclude by discussing how the presence of continuous degrees of freedom within the $\phi^4$ lattice field theory enables a representational capacity beyond that possible with multi-agent systems derived from Ising models.

[LG-99] Gradient Norm Regularization Second-Order Algorithms for Solving Nonconvex-Strongly Concave Minimax Problems

Link: https://arxiv.org/abs/2411.15769
Authors: Jun-Lin Wang,Zi Xu
Keywords: concave minimax problems, nonconvex-strongly concave minimax, solving nonconvex-strongly concave, minimax problems, concave minimax
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:In this paper, we study second-order algorithms for solving nonconvex-strongly concave minimax problems, which have attracted much attention in recent years in many fields, especially in machine learning. We propose a gradient norm regularized trust region (GRTR) algorithm to solve nonconvex-strongly concave minimax problems, where the objective function of the trust region subproblem in each iteration uses a regularized version of the Hessian matrix, and the regularization coefficient and the radius of the ball constraint are proportional to the square root of the gradient norm. The iteration complexity of the proposed GRTR algorithm to obtain an $\mathcal{O}(\epsilon,\sqrt{\epsilon})$-second-order stationary point is proved to be upper bounded by $\tilde{\mathcal{O}}(\rho^{0.5}\kappa^{1.5}\epsilon^{-3/2})$, where $\rho$ and $\kappa$ are the Lipschitz constant of the Jacobian matrix and the condition number of the objective function respectively, which matches the best known iteration complexity of second-order methods for solving nonconvex-strongly concave minimax problems. We further propose a Levenberg-Marquardt algorithm with a gradient norm regularization coefficient and use the negative curvature direction to correct the iteration direction (LMNegCur), which does not need to solve the trust region subproblem at each iteration. We also prove that the LMNegCur algorithm achieves an $\mathcal{O}(\epsilon,\sqrt{\epsilon})$-second-order stationary point within $\tilde{\mathcal{O}}(\rho^{0.5}\kappa^{1.5}\epsilon^{-3/2})$ iterations. Numerical results show the efficiency of both proposed algorithms.

[LG-100] Disentangling the Complex Multiplexed DIA Spectra in De Novo Peptide Sequencing

Link: https://arxiv.org/abs/2411.15684
Authors: Zheng Ma,Zeping Mao,Ruixue Zhang,Jiazhen Chen,Lei Xin,Paul Shan,Ali Ghodsi,Ming Li
Keywords: Data-Independent Acquisition, sampling high-intensity peaks, Acquisition, DIA, mass spectrometry
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Data-Independent Acquisition (DIA) was introduced to improve sensitivity by covering all peptides in a range rather than only sampling high-intensity peaks as in Data-Dependent Acquisition (DDA) mass spectrometry. However, it is not clear how useful DIA data are for de novo peptide sequencing, as DIA data are marred by coeluted peptides, high noise, and varying data quality. We present a new deep learning method, DIANovo, that addresses each of these difficulties and improves on the previously established system DeepNovo-DIA by 25% to 81% (48% on average) in amino acid recall and by 27% to 89% (57% on average) in peptide recall, by equipping the model with a deeper understanding of coeluted DIA spectra. This paper also provides criteria for when DIA data can and cannot be used for de novo peptide sequencing, by comparing DDA and DIA in both de novo and database search modes. We find that while DIA excels with narrow isolation windows on older-generation instruments, it loses its advantage with wider windows. However, with Orbitrap Astral, DIA consistently outperforms DDA because narrow window mode is enabled. We also provide a theoretical explanation of this phenomenon, emphasizing the critical role of the signal-to-noise profile in the successful application of de novo sequencing.

[LG-101] Circuit design in biology and machine learning. II. Anomaly detection

Link: https://arxiv.org/abs/2411.15647
Authors: Steven A. Frank
Keywords: Anomaly detection, identifying observations, typical patterns, machine learning, well-established field
Categories: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Anomaly detection is a well-established field in machine learning, identifying observations that deviate from typical patterns. The principles of anomaly detection could enhance our understanding of how biological systems recognize and respond to atypical environmental inputs. However, this approach has received limited attention in analyses of cellular and physiological circuits. This study builds on machine learning techniques – such as dimensionality reduction, boosted decision trees, and anomaly classification – to develop a conceptual framework for biological circuits. One problem is that machine learning circuits tend to be unrealistically large for use by cellular and physiological systems. I therefore focus on minimal circuits inspired by machine learning concepts, reduced to cellular scale. Through illustrative models, I demonstrate that small circuits can provide useful classification of anomalies. The analysis also shows how principles from machine learning – such as temporal and atemporal anomaly detection, multivariate signal integration, and hierarchical decision-making cascades – can inform hypotheses about the design and evolution of cellular circuits. This interdisciplinary approach enhances our understanding of cellular circuits and highlights the universal nature of computational strategies across biological and artificial systems.

[LG-102] Trans-Glasso: A Transfer Learning Approach to Precision Matrix Estimation

Link: https://arxiv.org/abs/2411.15624
Authors: Boxin Zhao,Cong Ma,Mladen Kolar
Keywords: study are limited, challenging when samples, Precision matrix estimation, Precision matrix, estimation
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 49 pages, 7 figures

Click to view abstract

Abstract:Precision matrix estimation is essential in various fields, yet it is challenging when samples for the target study are limited. Transfer learning can enhance estimation accuracy by leveraging data from related source studies. We propose Trans-Glasso, a two-step transfer learning method for precision matrix estimation. First, we obtain initial estimators using a multi-task learning objective that captures shared and unique features across studies. Then, we refine these estimators through differential network estimation to adjust for structural differences between the target and source precision matrices. Under the assumption that most entries of the target precision matrix are shared with source matrices, we derive non-asymptotic error bounds and show that Trans-Glasso achieves minimax optimality under certain conditions. Extensive simulations demonstrate Trans-Glasso’s superior performance compared to baseline methods, particularly in small-sample settings. We further validate Trans-Glasso in applications to gene networks across brain tissues and protein networks for various cancer subtypes, showcasing its effectiveness in biological contexts. Additionally, we derive the minimax optimal rate for differential network estimation, representing the first such guarantee in this area.

[LG-103] Accelerated Hydration Site Localization and Thermodynamic Profiling

Link: https://arxiv.org/abs/2411.15618
Authors: Florian B. Hinz,Matthew R. Masters,Julia N. Kieu,Amr H. Mahmoud,Markus A. Lill
Keywords: plays a fundamental, fundamental role, Water plays, water molecules surrounding, water molecules
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Water plays a fundamental role in the structure and function of proteins and other biomolecules. The thermodynamic profile of water molecules surrounding a protein is critical for ligand binding and recognition. Therefore, identifying the location and thermodynamic behavior of relevant water molecules is important for generating and optimizing lead compounds for affinity and selectivity to a given target. Computational methods have been developed to identify these hydration sites, but are largely limited to simplified models that fail to capture multi-body interactions, or dynamics-based methods that rely on extensive sampling. Here we present a method for fast and accurate localization and thermodynamic profiling of hydration sites for protein structures. The method is based on a geometric deep neural network trained on a large, novel dataset of explicit water molecular dynamics simulations. We confirm the accuracy and robustness of our model on experimental data and demonstrate its utility on several case studies.

[LG-104] Risk Management with Feature-Enriched Generative Adversarial Networks (FE-GAN)

Link: https://arxiv.org/abs/2411.15519
Authors: Ling Chen
Keywords: Generative Adversarial Network, Expected Shortfall, Feature-Enriched Generative Adversarial, Wasserstein Generative Adversarial, Tail Generative Adversarial
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:This paper investigates the application of Feature-Enriched Generative Adversarial Networks (FE-GAN) in financial risk management, with a focus on improving the estimation of Value at Risk (VaR) and Expected Shortfall (ES). FE-GAN enhances existing GAN architectures by incorporating an additional input sequence derived from preceding data to improve model performance. Two specialized GAN models, the Wasserstein Generative Adversarial Network (WGAN) and the Tail Generative Adversarial Network (Tail-GAN), were evaluated under the FE-GAN framework. The results demonstrate that FE-GAN significantly outperforms traditional architectures in both VaR and ES estimation. Tail-GAN, leveraging its task-specific loss function, consistently outperforms WGAN in ES estimation, while both models exhibit similar performance in VaR estimation. Despite these promising results, the study acknowledges limitations, including reliance on highly correlated temporal data and restricted applicability to other domains. Future research directions include exploring alternative input generation methods, dynamic forecasting models, and advanced neural network architectures to further enhance GAN-based financial risk estimation.
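The two risk measures named in the abstract can be made concrete with a small sketch of their empirical (historical) estimators, which could be applied to real or GAN-generated return scenarios alike. The convention used here (losses as negated returns, VaR as the empirical α-quantile of losses, ES as the mean of losses at or beyond VaR) is one common choice and an assumption; the paper may use a different convention, and the function name is hypothetical.

```python
import math


def var_es(returns, alpha=0.95):
    """Empirical Value at Risk and Expected Shortfall at confidence level alpha."""
    losses = sorted(-r for r in returns)  # losses are negated returns, ascending
    n = len(losses)
    # Position of the empirical alpha-quantile of the loss distribution.
    k = min(n - 1, max(0, math.ceil(alpha * n) - 1))
    var = losses[k]
    tail = losses[k:]  # losses at or beyond the VaR threshold
    es = sum(tail) / len(tail)
    return var, es
```

By construction ES is never smaller than VaR at the same level, since it averages only the losses in the tail that VaR marks off.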

[LG-105] SPRINT Enables Interpretable and Ultra-Fast Virtual Screening against Thousands of Proteomes NEURIPS2024

Link: https://arxiv.org/abs/2411.15418
Authors: Andrew T. McNutt,Abhinav K. Adduri,Caleb N. Ellington,Monica T. Dayao,Eric P. Xing,Hosein Mohimani,David R. Koes
Keywords: predicting drug-target interactions, accelerate drug discovery, small molecules, targets can accelerate, SPRINT
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: Machine Learning for Structural Biology Workshop, NeurIPS 2024

Click to view abstract

Abstract:Virtual screening of small molecules against protein targets can accelerate drug discovery and development by predicting drug-target interactions (DTIs). However, structure-based methods like molecular docking are too slow to allow for broad proteome-scale screens, limiting their application in screening for off-target effects or new molecular mechanisms. Recently, vector-based methods using protein language models (PLMs) have emerged as a complementary approach that bypasses explicit 3D structure modeling. Here, we develop SPRINT, a vector-based approach for screening entire chemical libraries against whole proteomes for DTIs and novel mechanisms of action. SPRINT improves on prior work by using a self-attention based architecture and structure-aware PLMs to learn drug-target co-embeddings for binder prediction, search, and retrieval. SPRINT achieves SOTA enrichment factors in virtual screening on LIT-PCBA and DTI classification benchmarks, while providing interpretability in the form of residue-level attention maps. In addition to being both accurate and interpretable, SPRINT is ultra-fast: querying the whole human proteome against the ENAMINE Real Database (6.7B drugs) for the 100 most likely binders per protein takes 16 minutes. SPRINT promises to enable virtual screening at an unprecedented scale, opening up new opportunities for in silico drug repurposing and development. SPRINT is available on the web as ColabScreen: this https URL

[LG-106] Accelerating CALPHAD-based Phase Diagram Predictions in Complex Alloys Using Universal Machine Learning Potentials: Opportunities and Challenges

Link: https://arxiv.org/abs/2411.15351
Authors: Siya Zhu,Raymundo Arróyave,Doğuhan Sarıtürk
Keywords: advancing materials design, Theoretic Automated Toolkit, materials design, crucial for understanding, advancing materials
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 11 pages, 8 figures

Click to view abstract

Abstract:Accurate phase diagram prediction is crucial for understanding alloy thermodynamics and advancing materials design. While traditional CALPHAD methods are robust, they are resource-intensive and limited by experimentally assessed data. This work explores the use of machine learning interatomic potentials (MLIPs) such as M3GNet, CHGNet, MACE, SevenNet, and ORB to significantly accelerate phase diagram calculations by using the Alloy Theoretic Automated Toolkit (ATAT) to map calculations of the energies and free energies of atomistic systems to CALPHAD-compatible thermodynamic descriptions. Using case studies including Cr-Mo, Cu-Au, and Pt-W, we demonstrate that MLIPs, particularly ORB, achieve computational speedups exceeding three orders of magnitude compared to DFT while maintaining phase stability predictions within acceptable accuracy. Extending this approach to liquid phases and ternary systems like Cr-Mo-V highlights its versatility for high-entropy alloys and complex chemical spaces. This work demonstrates that MLIPs, integrated with tools like ATAT within a CALPHAD framework, provide an efficient and accurate framework for high-throughput thermodynamic modeling, enabling rapid exploration of novel alloy systems. While many challenges remain to be addressed, the accuracy of some of these MLIPs (ORB in particular) is on the verge of paving the way toward high-throughput generation of CALPHAD thermodynamic descriptions of multi-component, multi-phase alloy systems.

[LG-107] Lie-Equivariant Quantum Graph Neural Networks NEURIPS2024

Link: https://arxiv.org/abs/2411.15315
Authors: Jogi Suda Neto, Roy T. Forestano, Sergei Gleyzer, Kyoungchul Kong, Konstantin T. Matchev, Katia Matcheva
Keywords: Large Hadron Collider, Hadron Collider, Large Hadron, Discovering new phenomena, involves the identification
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
Comments: 10 pages, 5 figures, accepted to the Machine Learning with New Compute Paradigms (MLNCP) Workshop at NeurIPS 2024

Abstract:Discovering new phenomena at the Large Hadron Collider (LHC) involves the identification of rare signals over conventional backgrounds. Thus binary classification tasks are ubiquitous in analyses of the vast amounts of LHC data. We develop a Lie-Equivariant Quantum Graph Neural Network (Lie-EQGNN), a quantum model that is not only data efficient, but also has symmetry-preserving properties. Since Lorentz group equivariance has been shown to be beneficial for jet tagging, we build a Lorentz-equivariant quantum GNN for quark-gluon jet discrimination and show that its performance is on par with its classical state-of-the-art counterpart LorentzNet, making it a viable alternative to the conventional computing paradigm.

[LG-108] Heavy-tailed Contamination is Easier than Adversarial Contamination

Link: https://arxiv.org/abs/2411.15306
Authors: Yeshwanth Cherapanamjeri, Daniel Lee
Keywords: computer science communities, science communities dating, communities dating back, computationally efficient outlier-robust, efficient outlier-robust estimators
Categories: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:A large body of work in the statistics and computer science communities dating back to Huber (Huber, 1960) has led to statistically and computationally efficient outlier-robust estimators. Two particular outlier models have received significant attention: the adversarial and heavy-tailed models. While the former models outliers as the result of a malicious adversary manipulating the data, the latter relaxes distributional assumptions on the data, allowing outliers to naturally occur as part of the data generating process. In the first setting, the goal is to develop estimators robust to the largest fraction of outliers, while in the second, one seeks estimators to combat the loss of statistical efficiency, where the dependence on the failure probability is paramount. Despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that any adversarially robust estimator is also resilient to heavy-tailed outliers for any statistical estimation problem with i.i.d. data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we construct heavy-tailed estimators whose application to the adversarial setting requires any black-box reduction to remove almost all the outliers in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation, opening the door to novel algorithmic approaches for the heavy-tailed setting. Additionally, confidence intervals obtained for adversarially robust estimation also hold with high probability.

[LG-109] Proportional infinite-width infinite-depth limit for deep linear neural networks

Link: https://arxiv.org/abs/2411.15267
Authors: Federico Bassetti, Lucia Ladelli, Pietro Rotondo
Keywords: Neural Network Gaussian, Network Gaussian Process, linear neural networks, neural networks converge, neural networks
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Probability (math.PR)

Abstract:We study the distributional properties of linear neural networks with random parameters in the context of large networks, where the number of layers diverges in proportion to the number of neurons per layer. Prior works have shown that in the infinite-width regime, where the number of neurons per layer grows to infinity while the depth remains fixed, neural networks converge to a Gaussian process, known as the Neural Network Gaussian Process. However, this Gaussian limit sacrifices descriptive power, as it lacks the ability to learn dependent features and produce output correlations that reflect observed labels. Motivated by these limitations, we explore the joint proportional limit in which both depth and width diverge but maintain a constant ratio, yielding a non-Gaussian distribution that retains correlations between outputs. Our contribution extends previous works by rigorously characterizing, for linear activation functions, the limiting distribution as a nontrivial mixture of Gaussians.

[LG-110] A Comparison of Machine Learning Algorithms for Predicting Sea Surface Temperature in the Great Barrier Reef Region

Link: https://arxiv.org/abs/2411.15202
Authors: Dennis Quayesam, Jacob Akubire, Oliveira Darkwah
Keywords: Great Barrier Reef, Barrier Reef, Great Barrier, Predicting Sea Surface, Extreme Gradient Boosting
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Predicting Sea Surface Temperature (SST) in the Great Barrier Reef (GBR) region is crucial for the effective management of its fragile ecosystems. This study provides a rigorous comparative analysis of several machine learning techniques to identify the most effective method for SST prediction in this area. We evaluate the performance of ridge regression, Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest, and Extreme Gradient Boosting (XGBoost) algorithms. Our results reveal that while LASSO and ridge regression perform well, Random Forest and XGBoost significantly outperform them in terms of predictive accuracy, as evidenced by lower Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Prediction Error (RMSPE). Additionally, XGBoost demonstrated superior performance in minimizing Kullback-Leibler Divergence (KLD), indicating a closer alignment of predicted probability distributions with actual observations. These findings highlight the efficacy of using ensemble methods, particularly XGBoost, for predicting sea surface temperatures, making them valuable tools for climatological and environmental modeling.
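
The four evaluation metrics named in the abstract above can be sketched in a few lines of plain Python. This is only an illustration: the temperature values are made up, RMSPE is assumed here to be the square root of the MSE, and the KL divergence is assumed to be taken between two discrete (histogram-style) distributions, since the abstract does not say how the authors formed them.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error between observations and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmspe(y_true, y_pred):
    """Root mean squared prediction error (assumed here to be sqrt of MSE)."""
    return math.sqrt(mse(y_true, y_pred))

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins.

    Bins where p is zero contribute nothing; q must be positive wherever p is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    # Hypothetical sea surface temperatures in degrees Celsius.
    observed = [26.0, 27.0, 28.0, 29.0]
    predicted = [26.5, 27.0, 27.5, 29.0]
    print("MSE:", mse(observed, predicted))
    print("MAE:", mae(observed, predicted))
    print("RMSPE:", rmspe(observed, predicted))
    # Hypothetical two-bin histograms of observed vs. predicted values.
    print("KLD:", kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

Lower values are better for all four quantities, which is the sense in which the abstract ranks Random Forest and XGBoost ahead of the linear models.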

[LG-111] Physically Parameterized Differentiable MUSIC for DoA Estimation with Uncalibrated Arrays

Link: https://arxiv.org/abs/2411.15144
Authors: Baptiste Chatelier (INSA Rennes, IETR, MERCE-France), José Miguel Mateos-Ramos, Vincent Corlay (MERCE-France), Christian Häger, Matthieu Crussière (INSA Rennes, IETR), Henk Wymeersch, Luc Le Magoarou (INSA Rennes, IETR)
Keywords: Direction of arrival, common sensing problem, wireless communication systems, problem in radar, wireless communication
Categories: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:Direction of arrival (DoA) estimation is a common sensing problem in radar, sonar, audio, and wireless communication systems. It has gained renewed importance with the advent of the integrated sensing and communication paradigm. To fully exploit the potential of such sensing systems, it is crucial to take into account potential hardware impairments that can negatively impact the obtained performance. This study introduces a joint DoA estimation and hardware impairment learning scheme following a model-based approach. Specifically, a differentiable version of the multiple signal classification (MUSIC) algorithm is derived, allowing efficient learning of the considered impairments. The proposed approach supports both supervised and unsupervised learning strategies, showcasing its practical potential. Simulation results indicate that the proposed method successfully learns significant inaccuracies in both antenna locations and complex gains. Additionally, the proposed method outperforms the classical MUSIC algorithm in the DoA estimation task.

[LG-112] Data-driven Modeling of Granular Chains with Modern Koopman Theory

Link: https://arxiv.org/abs/2411.15142
Authors: Atoosa Parsa, James Bagrow, Corey S. O’Hern, Rebecca Kramer-Bottiglio, Josh Bongard
Keywords: Externally driven dense, driven dense packings, effective medium theory, linearized approximate models, exhibit nonlinear wave
Categories: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:Externally driven dense packings of particles can exhibit nonlinear wave phenomena that are not described by effective medium theory or linearized approximate models. Such nontrivial wave responses can be exploited to design sound-focusing/scrambling devices, acoustic filters, and analog computational units. At high amplitude vibrations or low confinement pressures, the effect of nonlinear particle contacts becomes increasingly noticeable, and the interplay of nonlinearity, disorder, and discreteness in the system gives rise to remarkable properties, particularly useful in designing structures with exotic properties. In this paper, we build upon the data-driven methods in dynamical system analysis and show that the Koopman spectral theory can be applied to granular crystals, enabling their phase space analysis beyond the linearizable regime and without recourse to any approximations considered in the previous works. We show that a deep neural network can map the dynamics to a latent space where the essential nonlinearity of the granular system unfolds into a high-dimensional linear space. As a proof of concept, we use data from numerical simulations of a two-particle system and evaluate the accuracy of the trajectory predictions under various initial conditions. By incorporating data from experimental measurements, our proposed framework can directly capture the underlying dynamics without imposing any assumptions about the physics model. Spectral analysis of the trained surrogate system can help bridge the gap between the simulation results and the physical realization of granular crystals and facilitate the inverse design of materials with desired behaviors.

Information Retrieval

[IR-0] Stop Playing the Guessing Game! Target-free User Simulation for Evaluating Conversational Recommender Systems

Link: https://arxiv.org/abs/2411.16160
Authors: Sunghwan Kim, Tongyoung Kim, Kwangwook Seo, Jinyoung Yeo, Dongha Lee
Keywords: Conversational Recommender Systems, Recommender Systems, Conversational Recommender, Recent approaches, approaches in Conversational
Categories: Information Retrieval (cs.IR)
Comments: Work in progress

Abstract:Recent approaches in Conversational Recommender Systems (CRSs) have tried to simulate real-world users engaging in conversations with CRSs to create more realistic testing environments that reflect the complexity of human-agent dialogue. Despite the significant advancements, reliably evaluating the capability of CRSs to elicit user preferences still faces a significant challenge. Existing evaluation metrics often rely on target-biased user simulators that assume users have predefined preferences, leading to interactions that devolve into a simplistic guessing game. These simulators typically guide the CRS toward specific target items based on fixed attributes, limiting the dynamic exploration of user preferences and struggling to capture the evolving nature of real-user interactions. Additionally, current evaluation metrics are predominantly focused on single-turn recall of target items, neglecting the intermediate processes of preference elicitation. To address this, we introduce PEPPER, a novel CRS evaluation protocol with target-free user simulators constructed from real-user interaction histories and reviews. PEPPER enables realistic user-CRS dialogues without falling into simplistic guessing games, allowing users to gradually discover their preferences through enriched interactions, thereby providing a more accurate and reliable assessment of the CRS’s ability to elicit personal preferences. Furthermore, PEPPER presents detailed measures for comprehensively evaluating the preference elicitation capabilities of CRSs, encompassing both quantitative and qualitative measures that capture four distinct aspects of the preference elicitation process. Through extensive experiments, we demonstrate the validity of PEPPER as a simulation environment and conduct a thorough analysis of how effectively existing CRSs perform in preference elicitation and recommendation.

[IR-1] Ensemble Learning via Knowledge Transfer for CTR Prediction

Link: https://arxiv.org/abs/2411.16122
Authors: Honghao Li, Yiwen Zhang, Yi Zhang, Lei Sang
Keywords: Click-through rate, web searches, plays a critical, critical role, role in recommender
Categories: Information Retrieval (cs.IR)

Abstract:Click-through rate (CTR) prediction plays a critical role in recommender systems and web searches. While many existing methods utilize ensemble learning to improve model performance, they typically limit the ensemble to two or three sub-networks, with little exploration of larger ensembles. In this paper, we investigate larger ensemble networks and find three inherent limitations in commonly used ensemble learning methods: (1) performance degradation with more networks; (2) sharp decline and high variance in sub-network performance; (3) large discrepancies between sub-network and ensemble predictions. To simultaneously address the above limitations, this paper investigates potential solutions from the perspectives of Knowledge Distillation (KD) and Deep Mutual Learning (DML). Based on the empirical performance of these methods, we combine them to propose a novel model-agnostic Ensemble Knowledge Transfer Framework (EKTF). Specifically, we employ the collective decision-making of the students as an abstract teacher to guide each student (sub-network) towards more effective learning. Additionally, we encourage mutual learning among students to enable knowledge acquisition from different views. To address the issue of balancing the loss hyperparameters, we design a novel examination mechanism to ensure tailored teaching from teacher-to-student and selective learning in peer-to-peer. Experimental results on five real-world datasets demonstrate the effectiveness and compatibility of EKTF. The code, running logs, and detailed hyperparameter configurations are available at: this https URL.
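
To make the combination of an ensemble teacher and mutual learning concrete, here is a minimal sketch for a single training example. It is not the authors' EKTF implementation: the abstract teacher is assumed to be the plain average of the student predictions, both transfer terms are assumed to be squared differences, and the fixed weights `kd_weight` and `dml_weight` are hypothetical placeholders for the paper's examination mechanism, which the abstract does not specify.

```python
import math

def bce(label, prob, eps=1e-7):
    """Binary cross-entropy for one click label (0 or 1) and one predicted probability."""
    prob = min(max(prob, eps), 1 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def ensemble_transfer_losses(label, student_probs, kd_weight=1.0, dml_weight=1.0):
    """Return the averaged teacher prediction and one combined loss per student.

    Each student's loss has three parts:
      - the supervised BCE against the click label,
      - a distillation term pulling it toward the ensemble average (the "teacher"),
      - a mutual-learning term pulling it toward each of its peers.
    """
    teacher = sum(student_probs) / len(student_probs)
    losses = []
    for i, p in enumerate(student_probs):
        peers = [q for j, q in enumerate(student_probs) if j != i]
        kd = (p - teacher) ** 2
        dml = sum((p - q) ** 2 for q in peers) / len(peers) if peers else 0.0
        losses.append(bce(label, p) + kd_weight * kd + dml_weight * dml)
    return teacher, losses

if __name__ == "__main__":
    # Three hypothetical sub-networks scoring one clicked impression.
    teacher, losses = ensemble_transfer_losses(1, [0.2, 0.6, 0.7])
    print("teacher prediction:", teacher)
    print("per-student losses:", losses)
```

In this toy setting the student furthest from both the label and its peers receives the largest loss, which is the intuition behind using the collective prediction as guidance.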

[IR-2] ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval

Link: https://arxiv.org/abs/2411.15766
Authors: Suyuan Huang, Chao Zhang, Yuanyuan Wu, Haoxin Zhang, Yuan Wang, Maolin Wang, Shaosheng Cao, Tong Xu, Xiangyu Zhao, Zengchang Qin, Yan Gao, Yunhan Bai, Jun Fan, Yao Hu, Enhong Chen
Keywords: industries employs dual-tower, employs dual-tower architectures, Dense retrieval, Large Language Models, Dense
Categories: Information Retrieval (cs.IR)

Abstract:Dense retrieval in most industries employs dual-tower architectures to retrieve query-relevant documents. Due to online deployment requirements, existing real-world dense retrieval systems mainly enhance performance by designing negative sampling strategies, overlooking the advantages of scaling up. Recently, Large Language Models (LLMs) have exhibited superior performance that can be leveraged for scaling up dense retrieval. However, scaling up retrieval models significantly increases online query latency. To address this challenge, we propose ScalingNote, a two-stage method to exploit the scaling potential of LLMs for retrieval while maintaining online query latency. The first stage is training dual towers, both initialized from the same LLM, to unlock the potential of LLMs for dense retrieval. Then, we distill only the query tower using mean squared error loss and cosine similarity to reduce online costs. Through theoretical analysis and comprehensive offline and online experiments, we show the effectiveness and efficiency of ScalingNote. Our two-stage scaling method outperforms end-to-end models and verifies the scaling law of dense retrieval with LLMs in industrial scenarios, enabling cost-effective scaling of dense retrieval systems. Our online method incorporating ScalingNote significantly enhances the relevance between retrieved documents and queries.
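
The second stage described above, distilling the query tower with mean squared error and cosine similarity, can be illustrated with a small loss function over two embedding vectors. This is a sketch under assumptions, not the ScalingNote code: the way the two terms are combined and the `cosine_weight` parameter are hypothetical, and the embeddings are toy lists rather than model outputs.

```python
import math

def mse_loss(student, teacher):
    """Mean squared error between two embeddings of equal length."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def query_distillation_loss(student_emb, teacher_emb, cosine_weight=1.0):
    """MSE term plus (1 - cosine similarity), so the loss is near zero when the
    small query tower reproduces the large tower's embedding."""
    cosine_term = 1.0 - cosine_similarity(student_emb, teacher_emb)
    return mse_loss(student_emb, teacher_emb) + cosine_weight * cosine_term

if __name__ == "__main__":
    teacher_query = [3.0, 4.0]       # embedding from the large (LLM-based) query tower
    student_query = [2.5, 4.5]       # embedding from the small online query tower
    print(query_distillation_loss(student_query, teacher_query))
```

Because only the query side is distilled, document embeddings produced offline by the large tower can stay unchanged while the online query encoder becomes cheaper.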

[IR-3] Quantitative Analysis of IITs Research Growth and SDG Contributions

Link: https://arxiv.org/abs/2411.15451
Authors: Kiran Sharma, Akshat Nagori, Manya, Mehul Dubey, Parul Khurana
Keywords: India research ecosystem, Indian Institutes, vital to India, Institutes of Technology, India research
Categories: Information Retrieval (cs.IR)

Abstract:The Indian Institutes of Technology (IITs) are vital to India’s research ecosystem, advancing technology and engineering for industrial and societal benefits. This study reviews the research performance of the top IITs (Bombay, Delhi, Madras, Kharagpur, and Kanpur) based on Scopus-indexed publications (1952-2024). Research output has grown exponentially, supported by increased funding and collaborations. IIT-Kanpur excels in research impact, while IIT-Bombay and IIT-Madras are highly productive but show slightly lower per-paper impact. Internationally, IITs collaborate robustly with the USA, Germany, and the UK, alongside Asian nations like Japan and South Korea, with IIT-Madras leading inter-IIT partnerships. Research priorities align with SDG 3 (Health), SDG 7 (Clean Energy), and SDG 11 (Sustainable Cities). Despite strengths in fields like energy, fluid dynamics, and materials science, challenges persist, including limited collaboration with newer IITs and gaps in emerging fields. Strengthening specialization and partnerships is crucial for addressing global challenges and advancing sustainable development.

[IR-4] The Landscape of Data Reuse in Interactive Information Retrieval: Motivations, Sources, and Evaluation of Reusability

Link: https://arxiv.org/abs/2411.15430
Authors: Tianji Jiang, Wenqi Li, Jiqun Liu
Keywords: effectively reduce redundant, reduce redundant efforts, teams conducting human-centered, conducting human-centered system, data reuse
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)

Abstract:Sharing and reusing research data can effectively reduce redundant efforts in data collection and curation, especially for small labs and research teams conducting human-centered system research, and enhance the replicability of evaluation experiments. Building a sustainable data reuse process and culture relies on frameworks that encompass policies, standards, roles, and responsibilities, all of which must address the diverse needs of data providers, curators, and reusers. To advance the knowledge and accumulate empirical understandings on data reuse, this study investigated the data reuse practices of experienced researchers from the area of Interactive Information Retrieval (IIR) studies, where data reuse has been strongly advocated but still remains a challenge. To enhance the knowledge on data reuse behavior and reusability assessment strategies within the IIR community, we conducted 21 semi-structured in-depth interviews with IIR researchers from varying demographic backgrounds, institutions, and stages of careers on their motivations, experiences, and concerns over data reuse. We uncovered the reasons, strategies of reusability assessments, and challenges faced by data reusers within the field of IIR as they attempt to reuse researcher data in their studies. The empirical finding improves our understanding of researchers’ motivations for reusing data, their approaches to discovering reusable research data, as well as their concerns and criteria for assessing data reusability, and also enriches the on-going discussions on evaluating user-generated data and research resources and promoting community-level data reuse culture and standards.

[IR-5] Preliminary Evaluation of the Test-Time Training Layers in Recommendation System (Student Abstract) AAAI-25

Link: https://arxiv.org/abs/2411.15186
Authors: Tianyu Zhan, Zheqi Lv, Shengyu Zhang, Jiwei Li
Keywords: Test-Time Training, recommendation systems, TTT, paper explores, explores the application
Categories: Information Retrieval (cs.IR)
Comments: To be published in AAAI-25 Student Abstract and Poster Program

Abstract:This paper explores the application and effectiveness of Test-Time Training (TTT) layers in improving the performance of recommendation systems. We developed a model, TTT4Rec, utilizing TTT-Linear as the feature extraction layer. Our tests across multiple datasets indicate that TTT4Rec, as a base model, performs comparably to or even surpasses other baseline models in similar environments.

[IR-6] TIMBRE: Efficient Job Recommendation On Heterogeneous Graphs For Professional Recruiters

Link: https://arxiv.org/abs/2411.15146
Authors: Eric Behar, Julien Romero, Amel Bouzeghoub, Katarzyna Wegrzyn-Wolska
Keywords: Job recommendation gathers, gathers many challenges, challenges well-known, Job recommendation, Job
Categories: Information Retrieval (cs.IR)

Abstract:Job recommendation gathers many challenges well-known in recommender systems. First, it suffers from the cold start problem, with the user (the candidate) and the item (the job) having a very limited lifespan. It makes the learning of good user and item representations hard. Second, the temporal aspect is crucial: We cannot recommend an item in the future or too much in the past. Therefore, using solely collaborative filtering barely works. Finally, it is essential to integrate information about the users and the items, as we cannot rely only on previous interactions. This paper proposes a temporal graph-based method for job recommendation: TIMBRE (Temporal Integrated Model for Better REcommendations). TIMBRE integrates user and item information into a heterogeneous graph. This graph is adapted to allow efficient temporal recommendation and evaluation, which is later done using a graph neural network. Finally, we evaluate our approach with recommender system metrics, rarely computed on graph-based recommender systems.

Attachments

Click to download the full list of today's papers
